Predicting employee turnover using tree-based ensemble learning algorithms


Authors	Mazarei Mahboobe, Pooramini Jafar
Source	Signal and Data Processing - 1402 - Issue: 3 - Pages: 73-86
Abstract	One of the most important concerns of managers is the turnover of key employees, because an organization that loses its valuable staff also loses knowledge and experience acquired through years of effort. Predicting employee turnover therefore helps human resource managers hire employees who are likely to stay, and retain them. One effective tool for this purpose is the use of data mining methods. The small number of samples, the imbalance of employee turnover data and the tuning of hyperparameters are among the difficulties of using data mining methods to predict employee turnover. The aim of this research is to present suitable feature reduction and data preprocessing methods, together with an approach for tuning hyperparameters properly, for predicting employee turnover with various data mining techniques and ensemble machine learning algorithms. Because the data are imbalanced, random under-sampling and its combination with random over-sampling were used to balance the data at different ratios. Because all the data are positive, non-negative matrix factorization (NMF) was used for dimensionality reduction. Hyperparameter search methods were used to determine the optimal hyperparameter values for the proposed algorithms. Standard datasets of different sizes were used to evaluate the proposed method. The results of the proposed method were compared with those of other well-known methods in this field, such as KNN, AdaBoost, DT and SVC. The results show that the proposed model achieves better prediction accuracy than previous studies carried out on the same data. According to the analysis in this research, which used a combined feature selection method, the features "age", "monthly income", "daily rate", "overtime" and "number of companies the employee has worked for" had the greatest effect on employee turnover.
Keywords	data mining, human resource management, ensemble learning, employee turnover
Address	Payame Noor University, Assaluyeh International Center, Iran; Payame Noor University, Tehran Center, Iran
Email	j_pouramini@pnu.ac.ir

Predicting employee turnover using tree-based ensemble learning algorithms

Authors	Mazarei Mahboobe, Pooramini Jafar
Abstract	The turnover of key employees is one of the most important concerns of human resource managers (HRMs), because an organization that loses valuable staff also loses the skills and experience they gained over the years. Predicting employee turnover therefore helps HRMs hire and retain long-term employees. One effective tool for this is the use of data mining methods, and many researchers have worked in this field. This study reviews recently published articles that apply machine learning models to Kaggle human resource (HR) datasets [1-5] and compares them with the proposed models. In [9], the authors selected 11 of the most important features by collecting the features common to previous articles and filtering them with feature review and selection algorithms. After converting non-numerical variables to numerical ones and normalizing the data to the range [0,1], their attrition prediction approach, based on machine, deep and ensemble learning models, was tested on a large and a medium-sized simulated HR dataset and then on a small real dataset of 450 responses. That approach achieved higher accuracy than previous solutions (0.96, 0.98 and 0.99, respectively, for the three datasets). In 2021, the authors of [3] examined the relationships between features using the Pearson correlation coefficient and selected the 11 features with the highest correlation coefficients. They then used six machine learning algorithms, including random forest (RF), logistic regression (LR), …, to predict employee turnover; the highest accuracy they obtained was 0.85, for RF. In [1], the authors used two IBM datasets and a dataset containing HR information from a regional bank in the USA to predict employee turnover.
After cleaning and preprocessing the data, the performance of 10 different machine learning algorithms, such as decision tree (DT), RF, LR, neural network, …, was evaluated using ROC criteria on 10 small, medium and large randomly selected, unassigned subsets of the primary datasets. The average accuracy of the algorithms was 0.83 on the small datasets, 0.81 on the medium datasets and 0.86 on the large datasets. The authors of [4] ran three main experiments on the simulated IBM Watson dataset to predict employee turnover. The first experiment trained the following machine learning models on the original class-imbalanced dataset: support vector machine with several kernel functions, random forest and k-nearest neighbour (KNN). The second experiment used the adaptive synthetic (ADASYN) approach to overcome class imbalance and then retrained the same models on the new dataset. Training KNN (k = 3) on the ADASYN-balanced dataset achieved the highest performance, with an F1-score of 0.93. The turnover prediction approach proposed in this study is based on tree-based ensemble learning models and was tested on a large standard simulated HR dataset (hr_data) with 15,000 samples and 10 features, and on a medium-sized dataset (IBM) with 1470 samples and 34 features. The employee turnover rate is 16.1% in the IBM dataset and 23.8% in hr_data, so both datasets are imbalanced. To balance the data, random under-sampling and its combination with random over-sampling were used, with a ratio of 0.5965 for the IBM dataset and 0.6558 for hr_data. In the preprocessing stage, features with zero variance and samples containing missing values were also removed. Categorical (non-numeric) values were then converted to binary fields, and all features were scaled to [0,1] by normalization.
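The balancing step described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation. It assumes that the reported ratio means minority/majority class size after resampling; the helper name `rebalance` and the intermediate `under_ratio` of 0.4 are hypothetical. The class counts are derived from the 16.1% turnover rate reported for the 1470-sample IBM dataset.

```python
import random


def rebalance(labels, under_ratio, final_ratio, seed=0):
    """Return row indices of a resampled dataset (label 1 = minority class).

    Step 1, random under-sampling: keep a random subset of majority rows
    so that minority/majority is about under_ratio.
    Step 2, random over-sampling: duplicate random minority rows until
    minority/majority is about final_ratio.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]

    # Under-sample the majority class without replacement.
    n_major = min(len(majority), int(len(minority) / under_ratio))
    kept_major = rng.sample(majority, n_major)

    # Over-sample the minority class with replacement.
    n_minor = max(len(minority), int(final_ratio * n_major))
    extra = [rng.choice(minority) for _ in range(n_minor - len(minority))]

    indices = kept_major + minority + extra
    rng.shuffle(indices)
    return indices


# Toy labels with the IBM proportions from the abstract (16.1% of 1470 rows).
n_total = 1470
n_leavers = round(0.161 * n_total)
labels = [1] * n_leavers + [0] * (n_total - n_leavers)

# under_ratio=0.4 is an assumed intermediate value; 0.5965 is the reported ratio.
idx = rebalance(labels, under_ratio=0.4, final_ratio=0.5965)
n_pos = sum(labels[i] for i in idx)
n_neg = len(idx) - n_pos
print(n_pos, n_neg, round(n_pos / n_neg, 4))
```

Resampling should be applied to the training split only, so that duplicated minority rows never leak into the data used for evaluation.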
To reduce the feature dimensions of the IBM dataset, we used non-negative matrix factorization (NMF) (n_components=17, max_iter=500), initialized with the non-negative double singular value decomposition (NNDSVD) method in which zeros are filled with a value derived from X. After the data had been reviewed and cleaned, six different classification algorithms, including KNN (k=1), RF (number of trees = 1500), DT, ExtraTreesClassifier (number of trees = 1000) and support vector classifier, were trained on 70% of the data in the processing stage. The optimal hyperparameter values for the algorithms were set using the RandomizedSearchCV and GridSearchCV techniques. To investigate the effect of balancing and dimensionality reduction on model performance, experiments were performed in three stages (before balancing; after balancing but before dimensionality reduction; after both balancing and dimensionality reduction) on the remaining 30% of the data. The results in Tables 2-4 indicate that the proposed model, which uses optimized tree-based ensemble learning algorithms with data balancing and NMF dimensionality reduction, increases the F1-score of turnover prediction. On the hr_data dataset, the best F1-score, 99.52%, was obtained by the RandomForest algorithm, and on the IBM HR dataset the best F1-score, 95.82%, was obtained by ExtraTreesClassifier; both are higher than in previous research. Table 5 compares the results of previous research with those of this study. Because predicting employee attrition is not enough without identifying the characteristics that affect it, the most influential features were selected after the models had been built and evaluated. A combined feature selection method was used, which averages the results of a univariate feature selection method, SelectKBest, and a wrapper feature selection method, recursive feature elimination (RFE), applied with four learning algorithms: RF, DT, ExtraTreesClassifier and AdaBoost.
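The dimensionality reduction and hyperparameter search described above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the authors' code: the data are synthetic (`make_classification` with 34 features), `init="nndsvda"` is assumed to be the initialization the abstract refers to, and the small search space is hypothetical and chosen only to keep the example fast. `n_components=17`, `max_iter=500` and the 70/30 split come from the abstract.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import NMF
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the 34-feature IBM dataset (assumption).
X, y = make_classification(n_samples=400, n_features=34, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)

pipe = Pipeline([
    # clip=True keeps unseen rows inside [0, 1]; NMF rejects negative input.
    ("scale", MinMaxScaler(clip=True)),
    ("nmf", NMF(n_components=17, init="nndsvda", max_iter=500,
                random_state=0)),
    ("clf", ExtraTreesClassifier(random_state=0)),
])

# Hypothetical search space, kept small for speed.
param_distributions = {
    "clf__n_estimators": [50, 100, 200],
    "clf__max_depth": [None, 5, 10],
    "clf__min_samples_leaf": [1, 2, 4],
}
search = RandomizedSearchCV(pipe, param_distributions, n_iter=4,
                            scoring="f1", cv=3, random_state=0)
search.fit(X_train, y_train)

# Evaluate on the held-out 30% and inspect the reduced representation.
test_f1 = f1_score(y_test, search.predict(X_test))
reduced = search.best_estimator_[:-1].transform(X_test)
print(search.best_params_, round(test_f1, 3), reduced.shape)
```

Placing the scaler and NMF inside the pipeline means they are refitted on each cross-validation training fold, so the search never sees statistics from its validation folds.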
SelectKBest combines the chi2 univariate statistical test with the selection of the k features that score highest on the test statistic computed between each feature and the target variable. In the RFE method, a machine learning algorithm is trained recursively and the least important features are removed after each round, until the number of features reaches the set number (17 features in this article). The performance of the models based on the selected features is shown in Table 6. The most influential features are age, daily rate, overtime, number of companies worked for (numcompaniesworked) and monthly income.
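The combined feature selection can be sketched with scikit-learn as follows. The abstract does not say how the results are averaged, so this sketch assumes that per-feature ranks (1 = best) are averaged across SelectKBest and the four RFE runs. The data are synthetic and the reduced tree counts are chosen only for speed; only k = 17 and the list of estimators come from the abstract.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              RandomForestClassifier)
from sklearn.feature_selection import RFE, SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); chi2 needs non-negative features.
X, y = make_classification(n_samples=300, n_features=34, n_informative=8,
                           random_state=0)
X = MinMaxScaler().fit_transform(X)

k = 17  # number of features to keep, as stated in the abstract

# Filter method: chi2 scores converted to ranks, 1 = highest score.
skb = SelectKBest(chi2, k=k).fit(X, y)
chi2_rank = (-skb.scores_).argsort().argsort() + 1

# Wrapper method: RFE with each of the four learners.
estimators = [
    RandomForestClassifier(n_estimators=50, random_state=0),
    DecisionTreeClassifier(random_state=0),
    ExtraTreesClassifier(n_estimators=50, random_state=0),
    AdaBoostClassifier(random_state=0),
]
rfe_runs = [RFE(est, n_features_to_select=k).fit(X, y) for est in estimators]

# RFE gives rank 1 to every kept feature and higher ranks to features
# eliminated earlier, so its ranks are coarser than the chi2 ranks.
rankings = [chi2_rank] + [run.ranking_ for run in rfe_runs]
mean_rank = np.mean(rankings, axis=0)

# Column indices of the five features with the best (lowest) average rank.
top5 = np.argsort(mean_rank)[:5]
print(top5)
```

Averaging ranks rather than raw scores keeps the chi2 statistic and the tree-based importances on a comparable scale.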
Keywords	data mining, human resource management, ensemble learning, employee turnover