Using Visualizations and Tree-Based Models to Analyze and Predict Outcomes
Many of the causes of healthcare worker attrition are tied to the stressful nature of the healthcare industry: employees often work long hours and experience high rates of burnout. Attrition is a serious issue in healthcare because it exacerbates the already limited supply of workers. Since the field is significantly understaffed and many employees are overworked, both the quality of care and the speed at which care is delivered often suffer.
In general, efforts towards reducing healthcare attrition involve improving the quality of the work environment. For example, some common efforts toward improving employee workspace involve increasing employee engagement and structuring concrete career paths for employees.
In addition to these common methods, healthcare employers can also use their proprietary data, much of which contains insightful signals on the causes of attrition and burnout. This is where data analytics and predictive modeling can be useful. For example, data analytics can aid employers in identifying employees and departments at high risk of attrition. Further, it can aid employers in determining the factors that contribute to high attrition rates.
The Employee Attrition for Healthcare data is a synthetic data set released by IBM. The data is free to use, modify, and share under a Creative Commons license (CC0: Public Domain). While the data is synthetic, it can aid data analysts and data scientists in use-case formulation, particularly around solving the problem of employee attrition in healthcare. For example, data visualizations such as box plots, histograms, and pie charts can give insight into which roles in healthcare have the highest attrition rates. This provides a quantitative means of comparing disparate groups in the healthcare space. Regarding predictive modeling, state-of-the-art tree-based models, like CatBoost, can be used to predict employee attrition outcomes as well as analyze the factors that contribute most to the risk of attrition.
Here I will perform exploratory data analysis and build a classification model that predicts attrition outcomes. For my analysis and modeling, I will be writing code in DeepNote, a collaborative data science notebook that makes managing development environments straightforward.
Reading in the Data
To start, let’s navigate to DeepNote and create a new project:
Next let’s add our employee attrition data by clicking the ‘+’ symbol next to the file tab on the left:
Now let’s import the Pandas library and import our data into a Pandas data frame:
We can then display the first five rows of data using the ‘.head()’ method:
And we can also display the full list of columns in our data:
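The code for these steps might look like the sketch below. Since the actual CSV is not included here, a few illustrative stand-in rows take its place so the snippet runs on its own; in DeepNote you would instead call `pd.read_csv` on the uploaded file (the row values below are made up, not taken from the real data set):

```python
import io

import pandas as pd

# Stand-in for the uploaded CSV -- in the notebook, replace this with
# pd.read_csv("<your uploaded file>.csv")
csv_data = io.StringIO(
    """EmployeeID,Age,Attrition,BusinessTravel,DailyRate,Gender,JobRole,MonthlyIncome,YearsSinceLastPromotion
1313919,41,No,Travel_Rarely,1102,Female,Nurse,5993,0
1200302,49,No,Travel_Frequently,279,Male,Therapist,5130,1
1060315,37,Yes,Travel_Rarely,1373,Male,Nurse,2090,0
1272912,33,No,Travel_Frequently,1392,Female,Administrative,2909,3
"""
)
df = pd.read_csv(csv_data)

# Display the first five rows
print(df.head())

# Display the full list of columns
print(list(df.columns))
```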
We see columns for EmployeeID, Age, Attrition, BusinessTravel, DailyRate and more. Let's generate a pie chart to see the distribution of positive and negative attrition outcomes. To do this, let's import the Counter class from the collections module and count the number of positive and negative instances:
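A sketch of the counting step, shown on a small stand-in frame (the real notebook reuses the `df` loaded above):

```python
from collections import Counter

import pandas as pd

# Stand-in for the loaded data frame
df = pd.DataFrame({"Attrition": ["No", "No", "No", "Yes", "No", "Yes"]})

# Count negative ("No") and positive ("Yes") attrition outcomes
counts = Counter(df["Attrition"])
print(counts)
```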
We see that there are 1477 negative instances and 199 positive instances. From this, we see that the data is imbalanced. Let’s generate a pie chart from this data:
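With Matplotlib, the pie chart might be generated like this (a sketch; the hard-coded counts come from the step above, and the styling is my own):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt

# Counts from the previous step
counts = {"No": 1477, "Yes": 199}

plt.pie(list(counts.values()), labels=list(counts.keys()), autopct="%.0f%%")
plt.title("Attrition Outcomes")
plt.show()
```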
We see that negative instances make up 88% of the data, while positive instances make up 12% of the data.
We can even define a function that takes a categorical column and a value of that column, and generates this pie chart for the subset of data matching that value:
And let’s look at this pie chart for males and females:
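A sketch of such a function, demonstrated on a handful of stand-in rows (the helper name `attrition_pie` is my own; the original may differ):

```python
import matplotlib
matplotlib.use("Agg")  # omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in rows -- the real notebook reuses the loaded `df`
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "Attrition": ["No", "Yes", "No", "No", "Yes", "No"],
})

def attrition_pie(df, column, value):
    """Plot the attrition breakdown for rows where `column` equals `value`."""
    counts = df[df[column] == value]["Attrition"].value_counts()
    plt.pie(counts.values, labels=counts.index, autopct="%.0f%%")
    plt.title(f"Attrition for {column} = {value}")
    plt.show()
    return counts

male_counts = attrition_pie(df, "Gender", "Male")
female_counts = attrition_pie(df, "Gender", "Female")
```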
We see that in our synthetic data attrition is higher for females (13%) than males (12%). Another useful visualization is the box plot, which summarizes the distribution of numeric values using the minimum, maximum, median, first quartile, and third quartile. This can help us answer whether there are differences in certain numerical fields for those who leave versus those who stay. Let's write a function that generates a box plot of a numerical field for negative and positive attrition instances:
Now let’s call our function with our data frame and the MonthlyIncome column:
We see that negative instances of attrition are more strongly associated with higher pay, which intuitively makes sense. Let’s call our function with our data frame and YearsSinceLastPromotion:
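A sketch of the box-plot helper and both calls, again on stand-in rows (the helper name `attrition_boxplot` is my own):

```python
import matplotlib
matplotlib.use("Agg")  # omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in rows -- the real notebook reuses the loaded `df`
df = pd.DataFrame({
    "Attrition": ["No", "No", "No", "Yes", "Yes", "No"],
    "MonthlyIncome": [5993, 5130, 8865, 2090, 2909, 3468],
    "YearsSinceLastPromotion": [0, 1, 4, 0, 3, 2],
})

def attrition_boxplot(df, column):
    """Side-by-side box plots of a numeric column for each attrition outcome."""
    groups = [df.loc[df["Attrition"] == v, column] for v in ("No", "Yes")]
    plt.boxplot(groups)
    plt.xticks([1, 2], ["No", "Yes"])
    plt.xlabel("Attrition")
    plt.ylabel(column)
    plt.show()

attrition_boxplot(df, "MonthlyIncome")
attrition_boxplot(df, "YearsSinceLastPromotion")
```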
Another type of visualization that can be insightful is the histogram, which helps us gain insight into the distribution of a numerical field. Histograms can also be used to compare categories. Let's generate the MonthlyIncome distributions for negative and positive cases:
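One way to overlay the two distributions (a sketch on stand-in rows; the helper name `attrition_hist` is my own):

```python
import matplotlib
matplotlib.use("Agg")  # omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in rows -- the real notebook reuses the loaded `df`
df = pd.DataFrame({
    "Attrition": ["No", "No", "No", "Yes", "Yes", "No", "No", "Yes"],
    "MonthlyIncome": [5993, 5130, 8865, 2090, 2909, 3468, 6142, 2670],
})

def attrition_hist(df, column, bins=10):
    """Overlaid histograms of a numeric column for each attrition outcome."""
    for value in ("No", "Yes"):
        plt.hist(df.loc[df["Attrition"] == value, column],
                 bins=bins, alpha=0.5, label=value)
    plt.xlabel(column)
    plt.legend(title="Attrition")
    plt.show()

attrition_hist(df, "MonthlyIncome")
```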
We see that the center of the distribution for negative cases sits at a larger value than for positive cases. Further, the tail for the negative cases is much longer, which indicates that there are many highly paid employees who stay in the healthcare space.
Building an Employee Attrition Classifier
Now that we've done some basic analysis of the data, let's build a simple classifier that predicts the outcome of employee attrition. To keep it simple, let's use MonthlyIncome, Gender, YearsSinceLastPromotion, and JobRole to predict the employee attrition outcome:
Next let's split our data for training and testing. We will import the train_test_split method from the model_selection module in scikit-learn and pass our input (X) and output (y) as arguments:
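A sketch of the feature selection and splitting steps, on stand-in rows (the split parameters are my own defaults, not necessarily the original's):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in rows -- the real notebook uses the loaded `df`
df = pd.DataFrame({
    "MonthlyIncome": [5993, 5130, 2090, 2909, 3468, 3068, 2670, 9526],
    "Gender": ["Female", "Male", "Male", "Female", "Male", "Female", "Male", "Male"],
    "YearsSinceLastPromotion": [0, 1, 0, 3, 2, 3, 2, 0],
    "JobRole": ["Nurse", "Other", "Nurse", "Therapist", "Nurse", "Admin", "Nurse", "Other"],
    "Attrition": ["No", "No", "Yes", "No", "No", "No", "Yes", "No"],
})

# Inputs (X) and output (y)
X = df[["MonthlyIncome", "Gender", "YearsSinceLastPromotion", "JobRole"]]
y = df["Attrition"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
```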
We will use a CatBoost classification model. CatBoost is useful since it can handle categorical variables directly, without the need to convert them to numerical values.
Let’s install the CatBoost package:
Next let’s import CatBoost, train our model and generate predictions on our test set:
Next let's calculate performance. Since our data is imbalanced, average precision is a useful metric for measuring performance:
We see that our model has an average precision of 0.148. A good average precision would be above 0.7, or as close to 1.0 as possible. The performance of our model can be improved in a variety of ways. The simplest is to downsample the negative instances so that their number equals the number of positive instances. Let's try this and see if our performance improves:
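The downsampling step can be sketched with pandas alone, shown here on a small frame with an 8:2 imbalance (the real notebook applies this to the full data):

```python
import pandas as pd

# Stand-in frame with an imbalanced Attrition column
df = pd.DataFrame({
    "Attrition": ["No"] * 8 + ["Yes"] * 2,
    "MonthlyIncome": [5993, 5130, 8865, 3468, 6142, 4011, 7230, 5560, 2090, 2909],
})

neg = df[df["Attrition"] == "No"]
pos = df[df["Attrition"] == "Yes"]

# Keep only as many negative rows as there are positive rows, then shuffle
balanced = pd.concat([neg.sample(n=len(pos), random_state=42), pos])
balanced = balanced.sample(frac=1, random_state=42)

print(balanced["Attrition"].value_counts())
```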
Now we can train our model and generate a new set of predictions:
Now let’s evaluate performance:
We see that average precision improved from 0.148 to 0.627. We can improve this even further by increasing the number of iterations and including more features in our model.
The next thing we can do is generate a feature importance plot. This will allow us to see which factors contributed most to the risk of attrition:
From the feature importance plot, we see that MonthlyIncome is the factor, among the inputs we used for modeling, that contributes most to positive attrition outcomes. I encourage you to perform additional feature exploration and analysis to see if any other features contribute even more than MonthlyIncome. Further, experimenting with feature selection can improve the average precision even more.
The code used in this post is available on GitHub.
Employee attrition is a growing issue in healthcare. Long hours, low pay, and a limited supply of workers all contribute to the high burnout rate among healthcare workers. While some employers work to foster better work-life balance and work environments, data analytics can help identify employees at high risk of leaving so that preventative measures can be taken. Having insight into which factors contribute to attrition can aid employers in taking these preventative measures.
Analyzing Employee Attrition in Healthcare Data and Predicting Outcomes, republished from Towards Data Science: https://towardsdatascience.com/analyzing-employee-attrition-in-healthcare-data-and-predicting-outcomes-9afe822dcee