   A New One-Sided Feature Selection Method for Classifying Imbalanced Text Data  
   
Authors Pouramini Jafar, Minaei-Bidgoli Behrouz, Esmaeili Mahdi
Source Signal and Data Processing, 1398 [2019], No. 1, pp. 21-40
Abstract    The imbalanced distribution of data degrades the performance of classifiers. The solutions proposed for this problem fall into several categories, among which sampling-based and algorithm-based methods are the most important. Feature selection has also attracted attention as a way to improve the classification of imbalanced data. This paper presents a new one-sided feature-selection method for imbalanced text classification. The proposed method uses the distribution of a feature to compute how strongly it indicates a class. To compare its performance, several feature-selection methods were implemented, and the C4.5 decision tree and naive Bayes classifiers were used for evaluation. Experimental results on the Reuters21875 and WebKB corpora, in terms of Micro-F, Macro-F, and G-mean, show that the proposed method improves classifier performance considerably compared with the other methods.
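As a concrete reference for the evaluation measures named above, the G-mean of a binary classifier is the geometric mean of its sensitivity on the positive class and its specificity on the negative class. A minimal sketch (the helper name and toy labels below are illustrative, not from the paper):

```python
import math

def gmean(y_true, y_pred, positive):
    """Geometric mean of sensitivity and specificity for one positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    sens = tp / (tp + fn) if tp + fn else 0.0  # recall on the positive class
    spec = tn / (tn + fp) if tn + fp else 0.0  # recall on the negative class
    return math.sqrt(sens * spec)
```

Unlike plain accuracy, this measure drops to zero when either class is ignored entirely, which is why it is a common choice for imbalanced classification.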
Keywords Feature selection, Filter method, Imbalanced data, Text classification
Address Payame Noor University of Tehran, Faculty of Engineering, Department of Information Technology Engineering, Iran; Iran University of Science and Technology, School of Computer Engineering, Iran; Islamic Azad University, Kashan Branch, Faculty of Computer Engineering, Iran
 
   A Novel One-Sided Feature Selection Method for Imbalanced Text Classification  
   
Authors Pouramini Jafar, Minaei-Bidgoli Behrouz, Esmaeili Mahdi
Abstract    Imbalanced data arise in many areas, such as text classification, credit-card fraud detection, risk management, web-page classification, image classification, medical diagnosis and monitoring, and biological data analysis. Classification algorithms tend to favor the majority class and may even treat minority-class instances as outliers. Text data are one domain where imbalance commonly occurs. The amount of textual information is growing rapidly in the form of books, reports, and papers, and processing it quickly and accurately requires efficient automatic methods; text classification is one of the key tools. A further difficulty in text classification is the high dimensionality of the data, which can make learning algorithms impractical, and the problem is compounded when the text data are also imbalanced. An imbalanced data distribution reduces classifier performance. The solutions proposed for this problem fall into several categories, among which sampling-based and algorithm-based methods are the most important. Feature selection is also considered a remedy for the imbalance problem. In this research, a new one-sided feature-selection method is presented for imbalanced data classification. The proposed method uses the distribution of a feature to compute how strongly it indicates a class. Documents are partitioned according to whether they contain the feature and whether they belong to the positive class, and a new feature-selection criterion is built on this partition using the following observations. If a feature occurs in most positive-class documents, it is a good indicator of the positive class and should receive a high score for that class; this can be expressed as the proportion of positive-class documents that contain the feature. Likewise, if most documents containing the feature belong to the positive class, the feature should score highly as a class indicator; this can be expressed as the proportion of documents containing the feature that belong to the positive class. If most documents that do not contain the feature are outside the positive class, the feature should again receive a high score as a representative of the class, and similarly if most documents outside the positive class do not contain the feature. The proposed method combines these quantities into a single score per feature; the features are then sorted in descending order of score, and the required number of features is selected from the top of the list. To evaluate the proposed method, several feature-selection methods, including Gini, DFS, MI, and FAST, were implemented, and the C4.5 decision tree and naive Bayes classifiers were used. Experiments on the Reuters21875 and WebKB corpora, in terms of Micro-F, Macro-F, and G-mean, show that the proposed method considerably improves classifier performance compared with the other methods.
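The four document proportions described above can be sketched as a filter-style feature scorer. This is an illustrative reconstruction from the abstract, not the paper's published formula; in particular, combining the four proportions by a simple product is an assumption made here:

```python
def one_sided_scores(docs, labels, positive):
    """Score each term by how strongly it indicates the positive class.

    docs: list of sets of terms; labels: parallel list of class labels.
    The four proportions follow the abstract's description; the product
    used to combine them is an illustrative choice.
    """
    vocab = set().union(*docs)
    n_pos = sum(1 for y in labels if y == positive)
    n_neg = len(labels) - n_pos
    scores = {}
    for t in vocab:
        tp = sum(1 for d, y in zip(docs, labels) if t in d and y == positive)
        fp = sum(1 for d, y in zip(docs, labels) if t in d and y != positive)
        fn = n_pos - tp  # positive docs missing the term
        tn = n_neg - fp  # negative docs missing the term
        p1 = tp / (tp + fn) if tp + fn else 0.0  # positive docs containing t
        p2 = tp / (tp + fp) if tp + fp else 0.0  # docs with t that are positive
        p3 = tn / (tn + fn) if tn + fn else 0.0  # docs without t that are negative
        p4 = tn / (tn + fp) if tn + fp else 0.0  # negative docs without t
        scores[t] = p1 * p2 * p3 * p4
    return scores

def select_top_k(scores, k):
    """Return the k highest-scoring terms, as in the final selection step."""
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:k]]
```

On a toy corpus where a term appears in every positive document and no negative one, all four proportions equal 1 and the term receives the maximum score, which matches the intuition of a one-sided (class-specific) indicator.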
Keywords Feature selection, Imbalanced class, High dimensionality, Text classification
 
 

Copyright 2023
Islamic World Science Citation Center
All Rights Reserved