بهینه‌سازی سازماندهی اسناد متنی فارسی با استفاده از تکنیک خوشه‌بندی


نویسنده	یلوه الهام ,نوروزی یعقوب ,خطیر اشکان
Source	Iranian Journal of Information Processing and Management - 1402 - Volume: 38 - Issue: 3 - Pages: 981-1010
چکیده	پژوهش حاضر با هدف ارائه‌ روشی برای سازماندهی اسناد متنی فارسی با استفاده از تکنیک خوشه‌بندی انجام شد. مجموعه داده‌های مربوط به پایان‌نامه‌ها و رساله‌ها شامل 2943 تحقیق به‌عنوان جامعه‌ آماری در نظر گرفته شد. جمع‌آوری داده‌ها از مجموعه داده‌های‌ مربوط به تحقیقات علمی که شامل 5000 ‌پژوهش در قالب فایل اکسل بود، انجام شد. در این پژوهش پس از تبدیل داده‌ها به قالب ساخت‌یافته، عملیات پردازش با استفاده از اعمال پیش‌پردازش صورت گرفت. در مرحله‌ پردازش از تکنیک خوشه‌بندی برای ارائه‌ الگوریتم پیشنهادی در راستای سازماندهی اسناد متنی فارسی بهره گرفته شد. این الگوریتم با بهبود الگوریتم k-means در جهت خوشه‌بندی اسناد ارائه شد. نتایج حاصل از ارزیابی نشان داد که الگوریتم پیشنهادی بر اساس معیارهای خارجی نسبت به دو الگوریتم k-means و k-means++ در کیفیت خوشه‌بندی اسناد تاثیر مثبتی داشت؛ به‌طوری که تحقیقات هر رده‌ تعیین شده در خوشه‌ موضوعی مرتبط دارای توزیع یکنواختی شد، و به حصول هدف پژوهش حاضر منجر گردید. در جداول رده/ خوشه‌ حاصل از دو الگوریتم k-means‌ و k-means++ توزیع غیریکنواخت تحقیقات در خوشه‌ها مشاهده شد. بنابراین، ارزیابی بر اساس معیار‌های داخلی متاثر از تراکم متفاوت خوشه‌ها و شباهت بین خوشه‌ای بود. حجم دیتاسِت نیز متاثر از راهکارهای‌ پیشنهادی برای انتخاب دیتاسِت نهایی و فرایند پژوهش نبود. بنابراین، الگوریتم پیشنهادی برای ابعاد بالای ویژگی نیز مناسب عمل می‌کند.
کلیدواژه	سازماندهی اسناد متنی، تکنیک خوشه‌بندی، متن‌کاوی، تجزیه و تحلیل هوشمند متن
Address	University of Qom, Iran; University of Qom, Department of Knowledge and Information Science, Iran; Iranian Research Institute for Information Science and Technology (IranDoc), Iran
Email	khatir@students.irandoc.ac.ir

Optimizing the Organization of Persian Text Documents Using Clustering Technique

Authors	Yalveh Elham, Norouzi Yaghoub, Khatir Ashkan
Abstract	This study aimed to design a method for organizing Persian text documents using clustering. A data set of 2,943 theses and dissertations was taken as the statistical population. The data were collected from a set of 5,000 scientific research records in Excel format. After the data were converted into a structured format, they were preprocessed. In the processing stage, clustering was used to develop the proposed algorithm for organizing Persian text documents; the algorithm is an improved version of k-means for document clustering. Evaluation with external criteria showed that the proposed algorithm improved document clustering quality compared with k-means and k-means++: the research items of each designated category were distributed uniformly within the related subject cluster, which achieved the purpose of the study. In the category/cluster tables produced by k-means and k-means++, research items were distributed non-uniformly across clusters, so the evaluation based on internal criteria was affected by differing cluster densities and inter-cluster similarity. The size of the dataset was not affected by the proposed solutions for selecting the final dataset or by the research process, so the proposed algorithm also works well for high-dimensional feature spaces.
Keywords	organizing text documents, clustering techniques, text mining, textual data mining
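
The abstract evaluates clusterings with external criteria read from category/cluster tables, but it does not name the specific measures. As a rough illustration only, the sketch below builds such a table and computes purity, a common external criterion; choosing purity is an assumption, and the function names and toy labels are hypothetical rather than taken from the study.

```python
from collections import Counter, defaultdict


def contingency_table(categories, clusters):
    """Count how many documents of each category fall into each cluster."""
    table = defaultdict(Counter)
    for category, cluster in zip(categories, clusters):
        table[cluster][category] += 1
    return table


def purity(categories, clusters):
    """Fraction of documents that belong to the majority category of their cluster.

    Purity is one possible external criterion (an assumption here); it needs
    the known category of every document in addition to the cluster labels.
    """
    if len(categories) != len(clusters):
        raise ValueError("categories and clusters must have the same length")
    if not categories:
        raise ValueError("at least one document is required")
    table = contingency_table(categories, clusters)
    majority_total = sum(max(counts.values()) for counts in table.values())
    return majority_total / len(categories)


# Hypothetical toy labels: six documents in two known categories.
categories = ["law", "law", "law", "physics", "physics", "physics"]
well_separated = [0, 0, 0, 1, 1, 1]  # each category sits in its own cluster
mixed = [0, 0, 1, 0, 1, 1]           # categories are spread across clusters

print(purity(categories, well_separated))
print(purity(categories, mixed))
```

A clustering in which every category lands in a single cluster scores 1.0, while one that spreads categories across clusters scores lower, which is the kind of contrast the category/cluster tables in the abstract describe.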