Designing a hybrid model for classification of imbalanced data in third-party insurance




Authors	Manteqipour Mahnaz, Rahimkhani Parisa
Source	Intelligent Multimedia Processing and Communication Systems - 1401 (Solar Hijri) - Volume: 3 - Issue: 2 - Pages: 1-9
Abstract	Compulsory civil liability insurance of owners of land motor vehicles against third parties makes up the major part of the portfolio of Iran's insurance industry. The ability to predict whether or not a claim will occur, especially a bodily injury claim, matters not only to insurance companies but also to decision makers working on road safety. Classification methods are used to predict the claim or no-claim label, which is in fact an imbalanced classification problem. This severe imbalance stems from the nature of the insurance business, and it creates many challenges in analyzing the data. In this research, we classify imbalanced third-party insurance data from a reputable insurance company. To this end, two hybrid methods are presented for addressing the imbalance, based on 5 base models (Gaussian Bayes, support vector machine, logistic regression, decision tree, and nearest neighbor), in order to classify the data more effectively. The results show that the proposed hybrid models perform better on this data than other data mining algorithms, and that using a decision tree to aggregate the base models into the hybrid model gives better results than simple voting of the models. In addition, the hyperparameter for the number of models required in voting can be tuned to the company's strategy. The number of features recorded from policies in insurance companies is limited; completing these features, in particular by adding driving records and other personal characteristics, could yield a better model.
Keywords	hybrid model, imbalanced data, data mining, third-party insurance
Address	Azarbaijan Shahid Madani University, Iran; Alzahra University, Iran
Email	rahimkhani.parisa@gmail.com
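
The abstract notes that the number of models required in voting is a hyperparameter that can be tuned to the company's strategy. A minimal sketch of one way to read that, assuming each base model outputs a 0/1 label and a policy is flagged as a claim when at least `k` models vote 1. The function name `k_of_n_vote` and the toy predictions are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def k_of_n_vote(predictions: Sequence[Sequence[int]], k: int) -> List[int]:
    """Flag a policy as a claim (1) when at least k base models vote 1.

    predictions[m][i] is the 0/1 label that base model m gives to policy i.
    """
    n_models = len(predictions)
    if not 1 <= k <= n_models:
        raise ValueError("k must be between 1 and the number of models")
    # zip(*predictions) regroups the votes policy by policy.
    return [1 if sum(votes) >= k else 0 for votes in zip(*predictions)]


# Hypothetical outputs of five base models for four policies (assumed data).
base_predictions = [
    [1, 0, 0, 1],
    [1, 0, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
]

lenient = k_of_n_vote(base_predictions, k=2)  # favors recall: flags more policies
strict = k_of_n_vote(base_predictions, k=4)   # favors precision: flags fewer policies
print(lenient, strict)
```

A lower `k` flags more policies as likely claims, while a higher `k` flags only those on which most models agree, which is the kind of trade-off a company strategy could decide.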

Designing a hybrid model for classification of imbalanced data in the field of third-party insurance

Authors	Manteqipour Mahnaz, Rahimkhani Parisa
Abstract	The major part of Iran's insurance industry portfolio is compulsory civil liability insurance of motor vehicle owners against third parties. Understanding the behavior of this line of insurance therefore helps insurers provide better services to their customers. Predicting claim rates for insurance policies from the features recorded for each policy is one of the industry's problems that data mining techniques can address. Insurance is designed around the law of large numbers: a sufficiently large number of policies is issued, only a small fraction of them result in claims, and the claims are paid out of the total premiums collected. As a result, the insurance industry faces imbalanced data, which poses many challenges for classification. In the third-party insurance data set used in this research, each policy has 14 features and the data imbalance ratio is 1 to 0.0092, which is considered severely imbalanced.
Method: In this research, we deal with the classification of severely imbalanced data in the field of third-party insurance. To overcome the imbalance, two hybrid models with different architectures are designed, based on 5 base models: Gaussian Bayes, support vector machine, logistic regression, decision tree, and nearest neighbor. The first proposed hybrid model draws random samples from the whole data set and applies a resampling method before classification; the second selects samples from each label separately and applies a classification model to all of the selected data. The results of these models are compared.
Results: The obtained results show that the proposed hybrid models can predict the occurrence or non-occurrence of traffic accidents better than other data mining algorithms. Popular measures such as precision and recall show that the second hybrid model performs better. In the ensemble phase, the number of models in simple voting is a hyperparameter that can be adjusted based on the company's strategy. Also, using a decision tree to combine the base models into an ensemble gives better results than simple voting of the base models.
Discussion: For further research on imbalanced data classification, more sophisticated resampling algorithms could be applied and the results compared.
Keywords	hybrid model, imbalanced data, data mining, third-party insurance
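
The second hybrid model is described as selecting samples from each label separately. The abstract does not give the sampling details, so the following is only a rough sketch under stated assumptions: every claim (minority) policy is kept and the no-claim (majority) policies are randomly undersampled. The helper names `imbalance_ratio` and `sample_per_label`, the parameter `n_majority`, and the synthetic labels (sized to mirror the reported 1 to 0.0092 ratio) are all hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def imbalance_ratio(labels: Sequence[int]) -> float:
    """Number of claim policies (label 1) per no-claim policy (label 0)."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    return positives / negatives


def sample_per_label(
    rows: Sequence[object],
    labels: Sequence[int],
    n_majority: int,
    seed: int = 0,
) -> Tuple[List[object], List[int]]:
    """Keep every minority row and draw n_majority majority rows at random.

    Assumption: this is one simple form of per-label sampling, not
    necessarily the scheme used in the paper.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    chosen = minority + rng.sample(majority, min(n_majority, len(majority)))
    rng.shuffle(chosen)  # avoid handing the classifier label-sorted data
    return [rows[i] for i in chosen], [labels[i] for i in chosen]


# Synthetic labels: 2500 no-claim and 23 claim policies (23 / 2500 = 0.0092).
labels = [0] * 2500 + [1] * 23
rows = list(range(len(labels)))  # stand-ins for the 14-feature policy records

ratio = imbalance_ratio(labels)
sampled_rows, sampled_labels = sample_per_label(rows, labels, n_majority=115)
print(round(ratio, 4), len(sampled_labels), sum(sampled_labels))
```

The sampled set could then be passed to any of the base classifiers. Keeping all claim records while shrinking the no-claim class is a common way to make the minority class visible to the model, at the cost of discarding some majority data.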