Clustering Gene Expression Data by Random Forest Dissimilarity


Authors	Farhadi Zohreh, Shahsavani Davood
Source	Razi Journal of Medical Sciences - 1394 (Solar Hijri) - Volume: 22 - Issue: 136 - Pages: 109-118
Abstract	Background and aim: Clustering gene expression data is of great importance in the diagnosis and treatment of cancer. The distinguishing feature of these data is the large number of variables (genes) relative to the number of observations (patients). Many clustering methods are built on the dissimilarity between observations, obtained by computing a distance function, and increasing dimensionality degrades the performance of distance functions. In this study, a new measure for computing dissimilarity in high dimensions, based on a classification method called random forest, is introduced, and its performance in the analysis of gene expression data is evaluated. Methods: This article considers clustering the data set of Chowdary et al. by random forest dissimilarity. To this end, the clustering problem is first converted into a classification problem, and the corresponding dissimilarity is computed by performing random forest classification. Finally, the data are clustered with the partitioning around medoids method, and the clustering result is evaluated with the adjusted Rand index. All analyses were performed with the R software. Results: The value of the adjusted Rand index (0.8149) indicates good agreement between the estimated clusters and the true groups. In addition, using the variable importance capability of the random forest method, gene number 31 was identified as the most influential gene in this clustering, and the estimated clusters could be described using this gene alone. Conclusion: Random forest dissimilarity is an efficient measure for assessing dissimilarity between observations when clustering gene expression data. Moreover, using the unique capability of this method, the genes that are influential in the clustering can be identified and the estimated clusters can be described by them.
Keywords	Clustering, gene expression data, random forest dissimilarity, variable importance
Address	University of Shahrood, Iran; University of Shahrood, Department of Statistics, Iran
Email	dshahsavani@shahroodut.ac.ir

Gene Expression Data Clustering with Random Forest Dissimilarity

Authors	Farhadi Zohreh, Shahsavani Davood
Abstract	Background: Clustering gene expression data plays an important role in the diagnosis and treatment of cancer. Such data typically involve a large number of variables (genes) compared with the number of samples (patients). Many clustering methods are built on the dissimilarity among observations, which is calculated by a distance function. Because increasing dimensionality degrades the performance of distance functions, most of these methods yield low accuracy. In this paper, a new dissimilarity measure is introduced, based on a classification method called random forests (RF). The performance of this measure is evaluated on gene expression data. Methods: This article considers clustering the Chowdary data set using the RF dissimilarity measure. First, the clustering problem is converted into a classification problem; the new dissimilarity is then calculated using random forest classification. Finally, the data are clustered with the partitioning around medoids algorithm, and the results are evaluated by the adjusted Rand index. All analyses were carried out in R. Results: The value of the adjusted Rand index (0.8149) indicates acceptable agreement between the estimated clusters and the true groups. The most influential gene in constructing the clusters was gene no. 31, which was detected using RF's distinctive ability to rank the importance of variables. Conclusion: Random forest dissimilarity is an efficient criterion for measuring dissimilarity in gene expression data clustering. Detecting the genes that are influential in the clustering, which RF makes possible, helps researchers in the diagnosis and treatment of cancers.
Keywords	Clustering, Gene expression data, Random forest dissimilarity, Variable importance
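
The two computations named in the abstract, the random forest dissimilarity and the adjusted Rand index, can be illustrated with a minimal sketch. It is written in standard-library Python, not the R code used in the study, and it rests on assumptions the abstract does not state: the dissimilarity is taken to be sqrt(1 - proximity), one common definition in which proximity is the share of trees where two samples land in the same terminal node, and the input `leaves` is a hypothetical table of terminal-node ids that a fitted forest would supply. The function names are illustrative only. Building the forest and the partitioning around medoids step are not shown.

```python
from collections import Counter
from math import comb, sqrt


def rf_dissimilarity(leaves):
    """Dissimilarity matrix from terminal-node memberships.

    leaves[t][i] is the terminal-node id of sample i in tree t
    (hypothetical input; a fitted random forest would provide it).
    Assumes dissimilarity = sqrt(1 - proximity).
    """
    n_trees = len(leaves)
    n = len(leaves[0])
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Proximity: share of trees in which i and j share a terminal node.
            same = sum(1 for tree in leaves if tree[i] == tree[j])
            d[i][j] = d[j][i] = sqrt(1.0 - same / n_trees)
    return d


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same samples."""
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to compare
    # Pair counts inside contingency-table cells, rows, and columns.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)


if __name__ == "__main__":
    toy_leaves = [[1, 1, 2], [3, 4, 4]]  # 2 trees, 3 samples (made-up values)
    print(rf_dissimilarity(toy_leaves))
    print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

The index equals 1 when two partitions agree up to relabeling and is close to 0 for agreement expected by chance, which is the scale on which the reported value of 0.8149 is read.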