Semi-Supervised Ensemble p-Laplacian Spectral Clustering for High-Dimensional Data

Authors	Safari Sedighe, Afsari Fatemeh
Source	Signal and Data Processing (پردازش علائم و داده ها) - 1402 (Solar Hijri calendar) - Issue 1 - Pages 39-58
Abstract	With the ever-growing volume of information and the need to analyze it accurately, clustering, which is used to reveal hidden patterns in data, remains highly important. However, clustering high-dimensional data with traditional methods has many limitations. This paper proposes a semi-supervised ensemble clustering method for a set of high-dimensional medical data. In formulating the clustering problem, a small amount of supervisory information is taken into account as prior knowledge, expressed as similarity or dissimilarity information (in the form of a number of pairwise constraints). First, using the transitive property, we extend the pairwise constraints over all the data. Then, we reduce the dimensionality of the data by randomly dividing the feature space into several unequal subspaces. Semi-supervised spectral clustering based on the graph p-Laplacian is performed independently in each subspace. Next, an adjacency matrix is built by aggregating the results of the individual subspaces (based on ensemble learning). Finally, using several search operators over the subspaces, we find the best subspace, that is, the subspace with the best clustering result. The results of numerous experiments on several high-dimensional medical datasets show that the proposed approach achieves better performance and efficiency than previous methods.
Keywords	clustering, subspace learning, ensemble learning, semi-supervised learning, pairwise constraints
Address	Shahid Bahonar University of Kerman, Faculty of Engineering, Department of Computer Engineering, Iran; Shahid Bahonar University of Kerman, Faculty of Engineering, Department of Computer Engineering, Iran
Email	afsari.f@gmail.com

Semi-Supervised Ensemble p-Laplacian Spectral Clustering for High-Dimensional Data

Authors	Safari Sedighe, Afsari Fatemeh
Abstract	Because the volume of information keeps growing and requires detailed analysis, clustering, which detects the hidden patterns in data, remains highly important. However, clustering high-dimensional data with traditional methods has many limitations. In this study, a semi-supervised ensemble clustering method is proposed for a set of high-dimensional medical data. In the proposed method, a small amount of prior knowledge is available in the form of similarity or dissimilarity information (a number of pairwise constraints). First, using the transitive property, we extend the pairwise constraints to all the data. Then, to find the optimal clustering solution, we randomly divide the feature space into several subspaces with unequal numbers of features. Semi-supervised spectral clustering is performed in each subspace independently. To increase its accuracy, the spectral clustering is based on the graph p-Laplacian, a nonlinear generalization of the graph Laplacian. Each clustering solution is compared with the pairwise constraints, and a degree of confidence is assigned to it according to how well it matches them. Based on these degrees of confidence, an ensemble adjacency matrix is formed that aggregates the clustering solutions of all subspaces. This ensemble adjacency matrix is used in a final spectral clustering step to find the clustering solution for the whole set of subspaces. Because the subspaces are generated randomly with unequal numbers of features, the clustering results depend strongly on this random initialization. It is therefore necessary to find the optimal set of subspaces, and a search algorithm is designed for this purpose.
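The constraint propagation and the confidence-weighted aggregation described above can be illustrated with a short Python sketch. It is only a sketch under stated assumptions, not the authors' implementation: the function names (`close_constraints`, `confidence`, `ensemble_adjacency`) are hypothetical, the degree of confidence is assumed to be the fraction of satisfied constraints, and the per-subspace cluster labels are taken as given rather than computed with p-Laplacian spectral clustering.

```python
from itertools import combinations


def close_constraints(n, must_link, cannot_link):
    """Extend pairwise constraints over n points by transitivity.

    Must-link pairs are merged into groups with union-find; every
    cannot-link pair is then extended to all members of the two groups.
    """
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in must_link:
        parent[find(a)] = find(b)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)

    ml = set()
    for members in groups.values():
        # members are in ascending order, so pairs come out as (low, high)
        ml.update(combinations(members, 2))

    cl = set()
    for a, b in cannot_link:
        ra, rb = find(a), find(b)
        if ra == rb:
            raise ValueError(f"contradictory constraints on points {a} and {b}")
        for i in groups[ra]:
            for j in groups[rb]:
                cl.add((min(i, j), max(i, j)))
    return ml, cl


def confidence(labels, ml, cl):
    """Assumed confidence measure: fraction of constraints a labeling satisfies."""
    total = len(ml) + len(cl)
    if total == 0:
        return 1.0
    ok = sum(labels[i] == labels[j] for i, j in ml)
    ok += sum(labels[i] != labels[j] for i, j in cl)
    return ok / total


def ensemble_adjacency(labelings, weights):
    """Weighted co-association matrix: entry (i, j) is the confidence-weighted
    share of labelings that put points i and j in the same cluster."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    n = len(labelings[0])
    adj = [[0.0] * n for _ in range(n)]
    for labels, w in zip(labelings, weights):
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    adj[i][j] += w / total
    return adj
```

In this sketch, each list in `labelings` would hold the cluster labels obtained in one subspace, `weights` would hold the corresponding `confidence` values, and the resulting matrix could serve as a precomputed affinity for a final spectral clustering step.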
The search process is initialized by forming several sets of subspaces (we call each set an environment). An optimal environment is the one that yields the best clustering results. The search algorithm uses three search operators to find the optimal environment. The operators search all the environments and their subspaces both locally and globally. They combine two environments and/or replace an environment with a newly generated one. Each search operator tries to find the best possible environment, either in the entire search space or in a local region of it.
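A search over environments of this kind can be sketched in Python as follows. The abstract does not specify the three operators, so everything here is an assumption made for illustration: the operator names (`combine`, `perturb`, and regeneration via `random_environment`), the greedy acceptance rule, the representation of an environment as a list of feature-index lists, and the caller-supplied `score` function that stands in for clustering quality.

```python
import random


def random_environment(n_features, n_subspaces, rng):
    """Randomly split the feature indices into non-empty subspaces whose
    sizes are drawn at random (and are therefore generally unequal)."""
    if not 1 <= n_subspaces <= n_features:
        raise ValueError("need 1 <= n_subspaces <= n_features")
    feats = list(range(n_features))
    rng.shuffle(feats)
    cuts = sorted(rng.sample(range(1, n_features), n_subspaces - 1))
    bounds = [0] + cuts + [n_features]
    return [sorted(feats[a:b]) for a, b in zip(bounds, bounds[1:])]


def combine(env_a, env_b, rng):
    """Hypothetical operator: take each subspace from one of two environments."""
    return [list(rng.choice((sa, sb))) for sa, sb in zip(env_a, env_b)]


def perturb(env, n_features, rng):
    """Hypothetical local operator: swap one feature of one subspace for a
    feature that the subspace does not contain yet."""
    new = [list(s) for s in env]
    k = rng.randrange(len(new))
    outside = [f for f in range(n_features) if f not in new[k]]
    if outside:
        new[k][rng.randrange(len(new[k]))] = rng.choice(outside)
        new[k].sort()
    return new


def search(score, n_features, n_subspaces, n_envs=6, n_iter=50, seed=0):
    """Greedy search: a candidate replaces an environment only if it scores
    at least as well, so the best score never decreases."""
    rng = random.Random(seed)
    envs = [random_environment(n_features, n_subspaces, rng) for _ in range(n_envs)]
    scores = [score(e) for e in envs]
    for _ in range(n_iter):
        op = rng.choice(("combine", "perturb", "replace"))
        i = rng.randrange(n_envs)
        if op == "combine":
            cand = combine(envs[i], envs[rng.randrange(n_envs)], rng)
        elif op == "perturb":
            cand = perturb(envs[i], n_features, rng)
        else:  # global move: a freshly generated environment
            cand = random_environment(n_features, n_subspaces, rng)
        s = score(cand)
        if s >= scores[i]:
            envs[i], scores[i] = cand, s
    best = max(range(n_envs), key=scores.__getitem__)
    return envs[best], scores[best]
```

In practice, `score` would run the per-subspace clustering and the ensemble step for a candidate environment and return a quality measure, for example its agreement with the pairwise constraints; here it is left to the caller.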
Keywords	clustering, subspace learning, ensemble learning, semi-supervised learning, pairwise constraints