A New Approach to Using the Random Support Vector Machine Cluster Method in the Analysis of Prostate Cancer Gene Expression Data
|
Authors
|
Mosavi Nilia, Golalizadeh Mousa
|
|
Source
|
Statistical Sciences (علوم آماری) - 1402 (Solar Hijri) - Volume: 17 - Issue: 2 - Pages: 459-476
|
|
Abstract
|
Cancer progression among patients can be studied by constructing a set of gene markers with statistical data analysis methods. However, one of the fundamental problems in the statistical study of this type of data is the large number of genes relative to the small number of samples. Using dimensionality reduction methods to discard genes and find an optimal number of them for correctly predicting the classes of interest is therefore essential. Moreover, choosing an appropriate dimensionality reduction method can help extract valuable information and improve learning efficiency. In this study, an ensemble learning approach called the random support vector machine cluster is used to find the optimal feature set. The real-data analysis in this paper shows that transforming the high-dimensional data into lower-dimensional subspaces and combining support vector machine models not only yields a set of genes that are influential in the occurrence of prostate cancer but also increases classification accuracy.
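The idea of projecting onto random low-dimensional subspaces and combining support vector machines can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes scikit-learn and NumPy, uses synthetic data from `make_classification` in place of the gene expression data, and the function names (`fit_random_svm_cluster`, `predict_majority`, `rank_features`), the ensemble size, the subspace size, the majority vote, and the feature-ranking heuristic are all hypothetical choices made for illustration. Only the sigmoid kernel and the training/validation/test split are taken from the extended abstract.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def fit_random_svm_cluster(X, y, n_models=25, subspace_size=20, seed=0):
    """Fit one sigmoid-kernel SVM per randomly drawn feature subset."""
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_models):
        # Random subspace: sample feature indices without replacement.
        cols = rng.choice(X.shape[1], size=subspace_size, replace=False)
        clf = SVC(kernel="sigmoid", gamma="scale").fit(X[:, cols], y)
        members.append((cols, clf))
    return members


def predict_majority(members, X):
    """Combine member SVMs by unweighted majority vote (labels 0/1)."""
    votes = np.stack([clf.predict(X[:, cols]) for cols, clf in members])
    return (votes.mean(axis=0) >= 0.5).astype(int)


def rank_features(members, X_val, y_val, n_features):
    """Assumed heuristic: score a feature by the mean validation
    accuracy of the member SVMs whose subspace contained it."""
    total = np.zeros(n_features)
    count = np.zeros(n_features)
    for cols, clf in members:
        total[cols] += clf.score(X_val[:, cols], y_val)
        count[cols] += 1
    score = np.divide(total, count, out=np.zeros(n_features), where=count > 0)
    return np.argsort(score)[::-1]  # best-scoring features first


# Synthetic "many features, few samples" data standing in for gene expression.
X, y = make_classification(n_samples=120, n_features=200, n_informative=10,
                           n_redundant=0, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

# Standardize using training statistics only.
scaler = StandardScaler().fit(X_train)
X_train, X_val, X_test = (scaler.transform(A) for A in (X_train, X_val, X_test))

members = fit_random_svm_cluster(X_train, y_train)
y_pred = predict_majority(members, X_test)
ranking = rank_features(members, X_val, y_val, X.shape[1])

print("test accuracy:", float(np.mean(y_pred == y_test)))
print("ten top-ranked feature indices:", ranking[:10].tolist())
```

An odd number of member classifiers avoids ties in the vote. The validation set is used only for ranking features here, so the test set stays untouched until the final evaluation.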
|
|
Keywords
|
ensemble learning, dimensionality reduction, classification, random support vector machine cluster, optimal feature set
|
|
Address
|
Tarbiat Modares University, Department of Statistics, Iran; Tarbiat Modares University, Department of Statistics, Iran
|
|
Email
|
golalizadeh@modares.ac.ir
|
A New Approach in Using Random Support Vector Machine Cluster in Analyzing Prostate Cancer Gene Expression Data
|
Authors
|
Mosavi Nilia, Golalizadeh Mousa
|
|
Abstract
|
Many statistical data analysis methods can help evaluate cancer progression among patients by creating a set of gene markers. However, one of the main problems in the statistical study of this type of data is the large number of genes relative to the small number of samples, a situation known in the scientific community as “big p and small n”. Consequently, dimensionality reduction techniques are needed for proper statistical analysis. One essential purpose of studying gene data is to find the optimal number of genes for predicting the desired classes accurately. Many machine learning tools are available, so choosing an appropriate method is critical to building an efficient statistical model. The support vector machine (SVM) is a valuable technique for classifying complex data such as gene expression data for prostate cancer. A modified version of this tool, called the random support vector machine cluster, has been introduced in the machine learning community. It is an ensemble learning approach suited to finding the optimal feature set. The primary rationale of this technique is to randomly project the original high-dimensional feature space onto multiple lower-dimensional feature subspaces and to combine the support vector machine classifiers trained on them. This paper highlights the procedure for implementing this technique. The outcome of applying this tool to the gene expression data for prostate cancer is shown to be twofold: it identifies the important genes and also achieves a high level of classification precision.

Material and Methods
We use a random subsample ensemble (RSE) to overcome the problems that arise with high-dimensional data. It is a variable selection method based on ensemble learning. Then, we use the random support vector machine cluster (RSVMC) to classify the data and select the set of optimal variables. Note that the variable selection procedure must be repeated each time an SVM is invoked, which gives the SVM cluster its randomness. To evaluate the model, we divide the data set into the three usual parts: training, validation, and testing samples. Moreover, we use the sigmoid kernel during the fitting step. We consider the accuracy, sensitivity, and specificity measures to showcase the model's superiority.

Results and Discussion
Our results on the prostate cancer data show that the RSVMC correctly identified thirteen patients with prostate cancer. However, it misclassified two people as having the disease when they did not. The accuracy, sensitivity, and specificity of our method were about ninety-three, one hundred, and eighty-eight percent, respectively. There is still room to extend our approach to multi-class classification problems and to compare it with other variable selection techniques, such as regularization strategies.

Conclusion
The new idea presented in this paper, RSVMC, is a powerful tool for selecting an optimal subset of variables and then using it in a classification problem by invoking the support vector machine technique. Such a strategy leads to a highly efficient model and provides a smooth and relevant interpretation of the essential genes. Moreover, it attains a high true positive rate, correctly identifying the patients who have prostate cancer.
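The three reported measures follow directly from a confusion matrix, as the short worked example below shows. The abstract gives only two of the four counts: thirteen true positives and two false positives. The other two values used here (fifteen true negatives and zero false negatives) are assumptions, chosen only because they reproduce rates of roughly 93, 100, and 88 percent. The test-set size of thirty that they imply is likewise not stated in the source, and the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity (true positive rate), and specificity
    (true negative rate) from the four confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# tp and fp come from the abstract; tn and fn are ASSUMED values that
# are merely consistent with the reported percentages.
metrics = classification_metrics(tp=13, fp=2, tn=15, fn=0)

for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
# accuracy: 93.3%
# sensitivity: 100.0%
# specificity: 88.2%
```

Under these assumed counts, the two false positives lower specificity and accuracy but leave sensitivity at one hundred percent, which matches the abstract's emphasis on correctly identifying patients who have the disease.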
|
|
Keywords
|
ensemble learning, dimensionality reduction, classification, random support vector machine cluster, optimal feature set