Weighted Combination of Clusterings to Increase the Accuracy of the Final Clustering

Authors	Vahidi Ferdosi Sedigheh, Amirkhani Hossein
Source	Signal and Data Processing - 1399 - Issue: 2 - Pages: 85-100

Abstract	Because clustering is unsupervised and its results depend on several factors, including the number of clusters, the distance measure, and the chosen algorithm, ensemble clustering has been introduced to reduce the effect of these factors and increase the accuracy of the final clustering. This paper presents a method for the weighted combination of base clusterings in which the clusterings are weighted using the AD method. The AD method uses the agreement or disagreement among people's votes to estimate their accuracies in crowdsourcing problems; it proposes a probabilistic model and carries out the accuracy estimation through an optimization process. The main novelty of this paper is estimating the accuracies of the base clusterings with the AD method and using the estimated accuracies to weight the base clusterings in the combination process. How to adapt the clustering problem to the AD accuracy estimation method, and how to use the estimated accuracies in the final combination of clusters, are among the challenges addressed in this research. Four ways of generating base clusterings are examined: different algorithms, different distance measures when running k-means, distributed features, and different numbers of clusters. In the combination process, weighting capability is added to the ensemble clustering algorithms CSPA and HGPA. Results of the proposed method on thirteen different synthetic and real datasets, based on nine different evaluation measures, show that the proposed weighted combination method outperforms unweighted ensemble clustering in most cases, and that the improvement is larger for HGPA than for CSPA.
Keywords	Weighted ensemble clustering, unsupervised learning, HGPA, CSPA, AD
Address	University of Qom, Department of Computer Engineering and Information Technology, Iran; University of Qom, Department of Computer Engineering and Information Technology, Iran

Weighted Ensemble Clustering for Increasing the Accuracy of the Final Clustering

Authors	Amirkhani Hossein, Vahidi Ferdosi Sedigheh
Abstract	Clustering algorithms depend heavily on factors such as the number of clusters, the specific clustering algorithm, and the distance measure used. Inspired by ensemble classification, ensemble clustering is one approach to reducing the effect of these factors on the final clustering. Since weighting the base classifiers has been a successful idea in ensemble classification, in this paper we propose a method for using weighting in the ensemble clustering problem. The accuracies of the base clusterings are estimated using an algorithm from the crowdsourcing literature called the agreement/disagreement (AD) method. This method exploits the agreements or disagreements between different labelers to estimate their accuracies. It assumes that different labelers have labeled a set of samples, so every two labelers have an agreement ratio over the samples they labeled. Under some independence assumptions, there is a closed-form formula for the agreement ratio between two labelers in terms of their accuracies. The AD method estimates the labelers' accuracies by minimizing the difference between the parametric agreement ratio from the closed-form formula and the agreement ratio observed in the labels provided by the labelers. To adapt the AD method to the clustering problem, an agreement between two clusterings is defined as having the same opinion about a pair of samples: either both clusterings place the two samples in the same cluster, or both place them in different clusters. Then, an optimization problem is solved to obtain the base clusterings' accuracies such that the difference between their observed agreement ratios and the agreements expected from their accuracies is minimized.
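The agreement definition and the accuracy fit described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the closed-form formula, so the sketch assumes the binary-label form a_ij = p_i p_j + (1 - p_i)(1 - p_j), and it uses a simple projected gradient descent as a stand-in for whatever optimizer the paper uses. The function names `pair_agreement` and `estimate_accuracies` are hypothetical.

```python
from itertools import combinations


def pair_agreement(labels_a, labels_b):
    """Fraction of sample pairs on which two clusterings agree.

    Two clusterings agree on a pair (i, j) when both put the samples in the
    same cluster or both put them in different clusters.
    """
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b
        total += 1
    return agree / total


def estimate_accuracies(agreement, steps=5000, lr=0.05):
    """Fit accuracies p so the modeled agreement matches the observed one.

    Assumed model (not stated in the abstract):
        a_ij = p_i * p_j + (1 - p_i) * (1 - p_j)
    Minimizes the sum of squared differences over all pairs i < j with
    gradient descent, clipping each p_i to [0.5, 1] after every step.
    """
    m = len(agreement)
    p = [0.75] * m  # assumption: every clustering is better than chance
    for _ in range(steps):
        grad = [0.0] * m
        for i, j in combinations(range(m), 2):
            model = p[i] * p[j] + (1 - p[i]) * (1 - p[j])
            err = model - agreement[i][j]
            grad[i] += 2 * err * (2 * p[j] - 1)
            grad[j] += 2 * err * (2 * p[i] - 1)
        p = [min(1.0, max(0.5, pi - lr * g)) for pi, g in zip(p, grad)]
    return p


# Usage: build the observed agreement matrix from three base clusterings,
# then fit one accuracy per clustering.
base = [[0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 1, 2], [0, 1, 1, 1, 2, 2]]
observed = [[pair_agreement(a, b) for b in base] for a in base]
accuracies = estimate_accuracies(observed)
```

Because the agreement counts pairs rather than labels, it is unaffected by how each clustering names its clusters, which is what makes the crowdsourcing formulation applicable to clusterings at all.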
To generate the base clusterings, we use four different settings: different clustering algorithms, different distance measures, distributed features, and different numbers of clusters. The clustering algorithms used are mean shift, k-means, mini-batch k-means, affinity propagation, DBSCAN, spectral clustering, BIRCH, and agglomerative clustering with average and Ward linkage. For distance measures, we use correlation, city block, cosine, and Euclidean measures. In the distributed features setting, the k-means algorithm is run on 40%, 50%, …, and 100% of randomly selected features. Finally, for different numbers of clusters, we run the k-means algorithm with k equal to 2 and also to 50%, 75%, 100%, 150%, and 200% of the true number of clusters. We add the weights estimated by the AD algorithm to two well-known ensemble clustering methods, the Cluster-based Similarity Partitioning Algorithm (CSPA) and the HyperGraph Partitioning Algorithm (HGPA). In CSPA, the similarity matrix is computed as a weighted average of the opinions of the different clusterings. In HGPA, we propose weighting the hyperedges by different values, such as the estimated clustering accuracies, the sizes of the clusters, and the silhouette of the clusterings. The experiments are performed on 13 real and artificial datasets. The reported evaluation measures are the adjusted Rand index, Fowlkes-Mallows, mutual information, adjusted mutual information, normalized mutual information, homogeneity, completeness, V-measure, and purity. The results show that in the majority of cases the proposed weighted method outperforms unweighted ensemble clustering. In addition, weighting is more effective in improving HGPA than CSPA. Among the weighting methods proposed for HGPA, the best average results are obtained when the hyperedges are weighted by the accuracies estimated by the AD method, and the worst when the normalized silhouette measure is used.
Finally, among the different methods for generating base clusterings, weighted HGPA achieves its best results when different clustering algorithms are used to produce the base clusterings.
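The two weighting steps mentioned in the abstract can be illustrated with a short sketch. It assumes each base clustering is a list of cluster labels and that one non-negative weight per clustering (for example an estimated accuracy) is already available. The function names are hypothetical, and the graph or hypergraph partitioning that CSPA and HGPA apply afterwards is omitted; only the construction of the weighted inputs is shown.

```python
def weighted_coassociation(clusterings, weights):
    """Weighted CSPA-style similarity matrix.

    Entry (i, j) is the weighted average, over all base clusterings, of the
    0/1 indicator "samples i and j are in the same cluster".
    """
    n = len(clusterings[0])
    total = sum(weights)
    sim = [[0.0] * n for _ in range(n)]
    for labels, w in zip(clusterings, weights):
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    sim[i][j] += w / total
    return sim


def weighted_hyperedges(clusterings, weights):
    """One hyperedge per cluster, carrying the weight of its clustering.

    Returns a list of (member_indices, weight) pairs, which is the kind of
    input a weighted hypergraph partitioner would consume.
    """
    edges = []
    for labels, w in zip(clusterings, weights):
        for cluster_id in sorted(set(labels)):
            members = [i for i, lab in enumerate(labels) if lab == cluster_id]
            edges.append((members, w))
    return edges


# Usage with two base clusterings of four samples and illustrative weights.
clusterings = [[0, 0, 1, 1], [0, 1, 1, 1]]
weights = [0.9, 0.6]
similarity = weighted_coassociation(clusterings, weights)
hyperedges = weighted_hyperedges(clusterings, weights)
```

With equal weights the similarity matrix reduces to the usual unweighted co-association matrix, so the unweighted baseline is the special case of this construction. The abstract also mentions cluster size and silhouette as alternative hyperedge weights; those would replace the per-clustering weight `w` with a per-cluster value.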
Keywords	Weighted Ensemble Clustering, Unsupervised Learning, HGPA, CSPA, AD