A Novel Method of Gene Expression Data Clustering

Authors	Shahsavani Davood, Farhadi Zohreh
Source	Health and Biomedical Informatics - 1395 (Solar Hijri) - Volume 3 - Issue 3 - Pages 205-213
Abstract	Introduction: One of the major developments in genetics is the emergence of microarray technology and the production of gene expression data, which make it possible to study the behavior of thousands of genes simultaneously. Clustering is one of the data mining methods used to analyze gene expression data. Because the performance of clustering methods depends strongly on the data, a clustering result always carries uncertainty, and no algorithm can be considered effective for all data. In this study, ensemble (consensus) clustering, which combines the results of several clustering algorithms, was used to analyze gene expression data instead of running a single algorithm. Methods: This article evaluates the performance of ensemble clustering on three gene expression data sets, nutt-v3, alizadeh-v2 and su, using the adjusted Rand index. To implement ensemble clustering, twelve different clusterings, obtained by combining four clustering algorithms with three dissimilarity measures, were run on the data simultaneously. After the results were merged, the agreement between the estimated clusters and the true groups was measured with the adjusted Rand index. Results: The adjusted Rand index for the nutt-v3, alizadeh-v2 and su data sets was 1, 0.9 and 0.58, respectively, indicating the high accuracy of the proposed method in discovering latent structures in the data. The designed algorithm was also able to identify the true number of clusters without error. Conclusion: Ensemble clustering is a powerful method for clustering gene expression data. Given its accuracy in discovering real structures, it can confidently replace single clustering algorithms.
Keywords	Data mining, Ensemble clustering, Hierarchical clustering, Partitioning around medoids, Classical multidimensional scaling
Address	Shahrood University of Technology, Faculty of Mathematical Sciences, Department of Statistics, Iran; Shahrood University of Technology, Faculty of Mathematical Sciences, Iran
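
The abstract scores each clustering against the true groups with the adjusted Rand index. As a rough illustration of how that index is computed from two label vectors, here is a minimal standard-library sketch. The function name `adjusted_rand_index` and the toy labels are assumptions made for this example; the article does not publish its implementation.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same samples.

    Illustrative helper (hypothetical name), not the authors' code.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same samples")
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: trivially identical

    # Contingency counts: how many samples fall in (true group, cluster).
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    # Number of sample pairs placed together by both / by each labeling.
    together_both = sum(comb(c, 2) for c in cells.values())
    together_true = sum(comb(c, 2) for c in rows.values())
    together_pred = sum(comb(c, 2) for c in cols.values())

    # Agreement expected by chance, and the maximum attainable agreement.
    expected = together_true * together_pred / total_pairs
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are one cluster
    return (together_both - expected) / (maximum - expected)


# Toy labels (made up for illustration, not taken from the article's data).
perfect = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(perfect, crossed)
```

A value of 1 means the clusters match the true groups exactly, regardless of how the cluster labels are numbered. Values near 0 indicate agreement no better than chance.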

A Novel Method of Gene Expression Data Clustering

Authors	Shahsavani Davood, Farhadi Zohreh
Abstract	Introduction: Microarray technology and the production of gene expression data are among the important developments in genetics, making it possible to study the behavior of thousands of genes simultaneously. Clustering is one of the most important data mining techniques used in gene expression data analysis. Because the performance of clustering methods is strongly affected by the structure of the data, the result of clustering is always uncertain, and there is no algorithm that performs well for all kinds of data. In this study, ensemble clustering (combining the results of multiple clustering algorithms) was used for gene expression data analysis instead of a single algorithm. Methods: The performance of ensemble clustering on three gene expression data sets, Nuttv3, Alizadehv2 and SU, was evaluated by the adjusted Rand index. Twelve different clusterings, resulting from the combination of four clustering algorithms with three dissimilarity matrices, were applied to the data simultaneously. After merging the results and running the final clustering, the estimated clusters were compared with the actual groups by the adjusted Rand index. Results: The adjusted Rand index for the Nuttv3, Alizadehv2 and SU data sets was 1, 0.9 and 0.58, respectively, which shows the remarkable accuracy of the proposed method in detecting patterns in the data sets. Moreover, the designed algorithm detected the actual number of clusters without error. Conclusion: Ensemble clustering is a powerful and reliable method for gene expression data analysis. Given the accuracy and quality of this method in detecting real data structures, it can replace individual clustering algorithms.
Keywords	Data mining, Ensemble clustering, Hierarchical clustering, Partitioning around medoids, Classical multidimensional scaling
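
The abstract says that several base clusterings are merged and a final clustering is run on the result, but it does not state how the merging is done. One common way to do this is a co-association matrix followed by average-linkage agglomeration, sketched below with the standard library only. Everything here is an assumption made for illustration: the function names, the three toy partitions (standing in for the article's twelve), and the choice to pass the number of clusters `k` explicitly. The article reports that its algorithm detects that number itself, by a procedure the abstract does not describe.

```python
from itertools import combinations


def co_association(partitions):
    """Fraction of base partitions in which each pair of samples co-clusters."""
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        if len(labels) != n:
            raise ValueError("all partitions must label the same samples")
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    m = len(partitions)
    return [[c / m for c in row] for row in counts]


def consensus_labels(partitions, k):
    """Average-linkage agglomeration on the distance 1 - co-association.

    Hypothetical merging rule; the article's own rule is not specified.
    """
    co = co_association(partitions)
    clusters = [[i] for i in range(len(co))]

    def avg_distance(a, b):
        return sum(1.0 - co[i][j] for i in a for j in b) / (len(a) * len(b))

    # Repeatedly merge the two closest clusters until k remain.
    while len(clusters) > k:
        p, q = min(
            combinations(range(len(clusters)), 2),
            key=lambda pq: avg_distance(clusters[pq[0]], clusters[pq[1]]),
        )
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]  # safe: combinations guarantees p < q

    labels = [0] * len(co)
    for cluster_id, members in enumerate(clusters):
        for i in members:
            labels[i] = cluster_id
    return labels


# Three toy base partitions of six samples (made up for illustration).
base_partitions = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],  # same grouping, different label numbering
    [0, 0, 1, 1, 1, 1],  # one sample assigned differently
]
print(consensus_labels(base_partitions, k=2))
```

The co-association step makes the merge independent of how each base clustering numbers its clusters, which is why the second toy partition counts as full agreement with the first. The single disagreeing assignment in the third partition is outvoted in the consensus.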