A novel method for automatic extraction of facets in faceted search (case study: obstetrics and gynecology)



Authors	Farajpahlou Abdolhossein, Osareh Farideh, Fakhrahmad Mostafa, Dehghani Leila
Source	پژوهشنامه پردازش و مديريت اطلاعات (Iranian Journal of Information Processing and Management) - 1401 - Vol. 37 - No. 3 - pp. 807-838
Abstract	The aim of this research is to devise and introduce a new algorithm for facet extraction that makes it possible to identify facets empirically with the help of literary warrant. The proposed algorithm rests on two ideas. The first is that a facet emerges in context; to detect a facet in a body of text, its context must therefore be examined. The second is that a facet is the focal point of a lexical tree, neither very general nor very specific. Within medicine, the obstetrics and gynecology domain was chosen as the test bed. Three text corpora were drawn from the literary warrant. The contextual corpus was built from the abstracts and titles of the articles in the domain's 20 top journals and comprised 167071 documents. The second, the origin corpus, consisted of 2000 articles selected at random from the contextual corpus. The third is the lexical corpus, which was extracted using a web service and the term-ranking measure lidfvalue. The output contained 514 terms; after duplicates were removed, 480 important terms remained. The terms were then expanded in the contextual corpus with the help of a guiding vocabulary, namely MeSH, and candidate facets were extracted on the basis of two conditions: frequency-based shifting, meaning that more documents are related to the term in the contextual corpus than in the origin corpus, and rank-based shifting, meaning that the term's rank rises in the contextual corpus relative to the origin corpus, which indicates that the term has become more general. Using three rules (specificity, substitution, and generality), the identified facets were then refined and named. In the end, 26 facets were identified as the facets of the obstetrics and gynecology domain. Comparison of the proposed algorithm with other algorithms showed that creating three partitions (the origin partition, the text-body partition, and a partition for identifying important terms), comparing a term's behavior across them, and then building a tree from the candidate facets, that is, combining the statistical approach with tree pruning, can yield better results than a purely statistical approach or tree pruning alone. In addition, comparing the algorithm's output facets with the traditional facets in this area showed that the output facets are finer-grained and more useful for browsing in information retrieval tools. 
The research also showed that the facets of a specialized domain differ from the general facets of medicine and can be identified and defined independently of them; however, the results cannot be generalized to all medical domains, and studies in other domains are needed.
Keywords	information retrieval, facet, faceted search, automatic facet extraction.
Address	Shahid Chamran University of Ahvaz, Iran; Shahid Chamran University of Ahvaz, Iran; Shiraz University, Iran; Bushehr University of Medical Sciences, Iran
Email	leiladehghani@yahoo.com

Introducing a novel method for automatic facet extraction in faceted search (case study: gynecology and obstetrics domain)

Authors	Farajpahlou Abdolhossein, Osareh Farideh, Fakhrahmad Seyed Mostafa, Dehghani Leila
Abstract	This research develops and introduces a new algorithm for facet extraction that makes it possible to identify facets empirically on the basis of literary warrant. A review of prior work on automatic facet extraction yielded two guiding ideas. The first is that a facet appears in context; to identify a facet in a corpus, its context must therefore be examined. The second is that a facet is the focal point of a lexical tree, neither very general nor very specific. Based on these two ideas, corpora were first prepared for the obstetrics and gynaecology domain within medicine. The research team selected three corpora from the literary warrant. The contextual corpus was built from the abstracts and titles of articles in the top 20 journals of the field and contained 167071 documents. The origin corpus consisted of 2000 articles selected at random. The third is the lexical corpus, whose important words were extracted using a web-based service. The output contained 514 words; after duplicates were removed, 480 important words remained. The words were then expanded in the contextual corpus with the help of a guiding vocabulary (MeSH), and candidate facets were extracted based on two conditions, frequency-based shifting and rank-based shifting. Using three rules (specificity, substitution, and generality), the identified facets were then modified and named. In total, 26 facets were identified in the domain of gynaecology and obstetrics. Comparison of the proposed algorithm with other algorithms showed that combining the statistical approach with tree pruning can give better results than a purely statistical approach or tree pruning alone. 
Also, the comparison of the algorithm's output facets with the traditional facets of the obstetrics and gynaecology domain showed that the output facets are finer-grained and more useful for browsing in information retrieval tools. The study also found that the facets of a specialized domain differ from general facets and can be defined independently of them; however, the results cannot be generalized to all medical domains, and further research in other fields is needed.
Keywords	information retrieval, facet, faceted search, automatic facet extraction.
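
The two shifting conditions named in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say how document counts are normalized or how ranks are computed, so the sketch assumes raw document frequencies, whitespace tokenization, and a ranking in which rank 1 is the most frequent term. The function names and the toy corpora are hypothetical, and the MeSH expansion and tree-pruning rules are left out.

```python
from collections import Counter

def document_frequency(corpus, terms):
    """Count, for each term, the documents whose token set contains it."""
    counts = Counter()
    for doc in corpus:
        tokens = set(doc.lower().split())
        for term in terms:
            if term in tokens:
                counts[term] += 1
    return counts

def ranks(counts, terms):
    """Rank terms by document frequency; rank 1 is the most frequent.
    Ties are broken alphabetically (an assumption of this sketch)."""
    ordered = sorted(terms, key=lambda t: (-counts[t], t))
    return {term: i + 1 for i, term in enumerate(ordered)}

def candidate_facets(context_corpus, origin_corpus, terms):
    """Keep terms that satisfy both assumed shifting conditions."""
    ctx_df = document_frequency(context_corpus, terms)
    org_df = document_frequency(origin_corpus, terms)
    ctx_rank = ranks(ctx_df, terms)
    org_rank = ranks(org_df, terms)
    return [
        t for t in terms
        # frequency-based shifting: more related documents in the context
        if ctx_df[t] > org_df[t]
        # rank-based shifting: the term ranks higher in the context
        and ctx_rank[t] < org_rank[t]
    ]

# Toy data (hypothetical); the origin corpus is a sample of the context.
terms = ["pregnancy", "preeclampsia", "ultrasound"]
origin = ["preeclampsia risk", "preeclampsia outcome", "pregnancy care"]
context = origin + [
    "pregnancy ultrasound",
    "pregnancy diet",
    "pregnancy ultrasound screening",
]

print(candidate_facets(context, origin, terms))  # → ['pregnancy']
```

In this toy case only "pregnancy" gains both documents and rank when moving from the origin sample to the contextual corpus, which matches the abstract's reading of a rank rise as a sign that a term has become more general.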