A new method for clustering mixed data based on the number of similar features


Authors	Rezaei Hamid, Daneshpour Negin
Source	Signal and Data Processing (پردازش علائم و داده ها) - 1403 - Issue: 1 - Pages: 39-52
Abstract	Clustering is an operation in which a set of data samples is grouped according to their degree of similarity. The data samples to be clustered are either numerical or a mixture of numerical and non-numerical (nominal) values. Finding the degree of similarity and measuring distance are among the challenges of clustering mixed data. In this paper, the number of similar features is taken into account when computing similarity and determining distance. When a sample is assigned to a cluster and the distances are equal or close, the number of features the samples have in common determines the appropriate cluster. To compute distance, the algorithm uses the normalized numerical difference for numerical features and the Hamming distance for non-numerical features. As in many methods, the initial cluster centers are chosen randomly, and in subsequent iterations of the algorithm a more suitable sample is selected as the cluster center. The algorithm is compared with 5 other algorithms on 5 datasets. Three measures are used to evaluate the results: accuracy, RI, and F-measure. According to the experimental results, on three datasets the algorithm performed at least two percent better than two of the algorithms and one percent better than another. On one other dataset, the algorithm's results were equal to, or close to one percent more accurate than, those of the best algorithm. On the last dataset, the algorithm ranked second among five algorithms.
Keywords	clustering, mixed data, distance of values, similarity of values, cluster center
Address	Shahid Rajaee Teacher Training University, Faculty of Computer Engineering, Iran; Shahid Rajaee Teacher Training University, Faculty of Computer Engineering, Iran
Email	ndaneshpour@sru.ac.ir

Presenting a new method for mixed data clustering based on the number of similar features

Authors	Rezaei Hamid, Daneshpour Negin
Abstract	Clustering is an operation in which a set of data samples is grouped according to their degree of similarity. The data to be clustered are either numerical or a mixture of numerical and non-numerical (nominal) values. Finding similarities and measuring distances is one of the challenges of mixed data clustering. In related work, only the distance value was considered when determining the degree of similarity, and the cluster was selected on that value alone. Clustering in this way, especially for mixed data, has not produced very accurate results. In this paper, we take the number of similar features into account when calculating the degree of similarity and determining the distance. When a sample is assigned to a cluster and the distances are equal or close, the number of features the samples have in common determines the appropriate cluster. That is, the cluster is selected using the number of similar features in addition to the distance. The idea is that when several cluster centers are at a similar distance from the data object, it is better to choose the center that has more features similar to the data object. In the proposed algorithm, similarity should be spread over a larger number of features rather than concentrated in a few highly similar ones. The number of similar features has a specific definition and is obtained with a suitable threshold: if the distance between two feature values is less than the threshold, the two features are considered similar. To calculate the distance, the algorithm uses the normalized numerical difference for numerical features and the Hamming distance for non-numerical features.
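The assignment rule described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `ranges` convention (a numeric range per numerical feature, `None` for a nominal one), summing per-feature distances into a total, and the default `threshold` and `tolerance` values are all assumptions made for illustration, since the abstract does not give them.

```python
from typing import List, Optional, Sequence, Union

Value = Union[float, str]


def feature_distances(
    a: Sequence[Value], b: Sequence[Value], ranges: Sequence[Optional[float]]
) -> List[float]:
    """Return one distance in [0, 1] per feature.

    ranges[i] is (max - min) of numerical feature i over the dataset,
    or None when feature i is nominal (assumed convention).
    """
    distances = []
    for x, y, feature_range in zip(a, b, ranges):
        if feature_range is None:
            # Nominal feature: Hamming distance, 0 if equal, 1 otherwise.
            distances.append(0.0 if x == y else 1.0)
        elif feature_range > 0:
            # Numerical feature: normalized absolute difference.
            distances.append(abs(x - y) / feature_range)
        else:
            # Constant feature: every value is identical.
            distances.append(0.0)
    return distances


def assign_cluster(
    sample: Sequence[Value],
    centers: Sequence[Sequence[Value]],
    ranges: Sequence[Optional[float]],
    threshold: float = 0.1,   # assumed: per-feature distance below this counts as "similar"
    tolerance: float = 0.1,   # assumed: totals within this of the minimum count as "close"
) -> int:
    """Return the index of the chosen center for `sample`."""
    scored = []
    for index, center in enumerate(centers):
        d = feature_distances(sample, center, ranges)
        total = sum(d)
        similar = sum(1 for value in d if value < threshold)
        scored.append((total, similar, index))

    best_total = min(total for total, _, _ in scored)
    # Keep only the centers whose total distance is equal or close to the best.
    candidates = [s for s in scored if s[0] - best_total <= tolerance]
    # Among those, prefer more similar features, then the smaller distance.
    _, _, chosen = max(candidates, key=lambda s: (s[1], -s[0]))
    return chosen
```

With `tolerance=0.0` the function reduces to plain nearest-center assignment, which makes it easy to compare the two behaviours on the same data.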
The initial cluster centers are chosen randomly, as in many methods, and in subsequent iterations of the algorithm more appropriate samples are selected as the cluster centers. The algorithm is compared with 5 other algorithms on 5 datasets. Three criteria are used to evaluate the results: accuracy, RI, and F-measure. According to the test results, on the mixed and integer datasets the algorithm performs at least two percent better than two of the algorithms and one percent better than another. On another dataset, the proposed algorithm had results equal to, or close to one percent more accurate than, those of the best competing algorithm. On the last dataset, the proposed algorithm ranked second among 5 algorithms. Overall, the proposed algorithm ranked first in most of the results and second in the remaining cases among the five tested algorithms.
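As a small illustration of one of the evaluation criteria, the sketch below computes a pairwise agreement score, assuming that RI denotes the Rand index. The abstract does not define the measures, so the function name and the exact formulation are assumptions and may differ from what the paper uses.

```python
from itertools import combinations
from typing import Hashable, Sequence


def rand_index(truth: Sequence[Hashable], predicted: Sequence[Hashable]) -> float:
    """Fraction of sample pairs on which two labelings agree.

    A pair agrees when both labelings put the two samples in the same
    group, or both put them in different groups.
    """
    if len(truth) != len(predicted):
        raise ValueError("labelings must have the same length")

    pairs = list(combinations(range(len(truth)), 2))
    if not pairs:
        # Fewer than two samples: nothing to disagree on.
        return 1.0

    agreements = sum(
        (truth[i] == truth[j]) == (predicted[i] == predicted[j])
        for i, j in pairs
    )
    return agreements / len(pairs)
```

The score does not depend on how clusters are numbered, so a clustering that matches the reference up to a relabeling scores 1.0.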
Keywords	clustering, mixed data, distance of values, similarity of values, cluster center