Optimizing the Confusion of Author Names in Persian Articles Using the Random Forest Method


Authors	Mozafari, Niloofar; Vara, Narjes
Source	Scientometrics Research Journal, 1401 (Solar Hijri), Vol. 8, No. 2, pp. 203-220
Abstract	Purpose: To present a framework for resolving the confusion and dispersion of author names in Persian articles, which has led to fragmentation and a lack of comprehensiveness in information retrieval. Methodology: This is an applied scientometric study carried out by the documentary method. The statistical population consists of 913 author-name records from Persian articles, taken from the Islamic World Science Citation database (ISC) and covering the period 1395 to 1397 (Solar Hijri). The proposed framework consists of three stages: search, matching, and grouping. After initial preprocessing and feature extraction, the search operation finds records that are potentially identical; identical records are then identified through further examination in the matching stage, which is based on random forest. Findings: Email address, last name, and first name are among the most important features for resolving confusion in the writing of names. Using random forest as the classifier in the matching stage resolves the confusion in the writing of author names with an accuracy above 99 percent. Conclusion: The results indicate the high efficiency of this method in unifying names, in terms of accuracy, recall, and F-measure, compared with support vector machine, nearest-neighbor, and genetic classifiers.
Keywords	writing confusion, random forest, authors of Persian articles, name authority control, Soundex algorithm
Address	Regional Information Center for Science and Technology, Systems Design and Operations Research Group, Iran; Regional Information Center for Science and Technology, Resource Evaluation and Development Research Group, Iran
Email	narsisvara@gmail.com

Optimizing Confusion of Authors’ Names in Persian Articles Using Random Forest Algorithm

Authors	Mozafari, Niloofar; Vara, Narjes
Abstract	Purpose: A name is a key factor in distinguishing authors. In academic databases that store information on papers, searching by author name is one of the most important elements in increasing visibility and in quantitative scientometric studies, such as citation counts. Variation in how names are written creates challenges in many scientific fields. In addition, the lack of writing standards for the Persian language, the lack of standard keyboards and character codes, and the habit of simplified writing all contribute to author name ambiguity. Spelling mistakes made when writing a name also create different written forms of a single name. Given the importance of this problem, this paper proposes a framework to resolve the confusion and dispersion of authors’ names in Persian articles, which has led to fragmentation and a lack of comprehensiveness in information retrieval. Methodology: This is an applied scientometric study carried out by the documentary method, with data collected from the ISC database. The initial statistical population is 913 records from the period 2015 to 2017. The proposed framework consists of three stages: searching, matching, and grouping. After initial preprocessing and feature extraction, the search operation finds records that are potentially identical. The method extracts two types of features, internal and external. Internal features are extracted from the author’s information, such as first name, last name, affiliation, email, and co-authors. External features draw on the author’s scientific history, such as articles and research interests. Next, in the search stage, the records that are potentially the same are identified.
We propose a new method called Farsi-Soundex, inspired by the well-known Soundex algorithm, to group names that potentially refer to the same author. Identical records are then found through further examination in the matching stage, which is based on random forests. The input to the matching stage is therefore a group of records that the Farsi-Soundex algorithm has flagged as potentially the same. A random forest classifier is applied to these records to decide whether they are in fact the same. Finally, in the grouping stage, all records that the random forest has identified as the same are placed in one group by a hash-based algorithm. Findings: The internal features of email address, last name, and first name are the most significant for resolving name-writing confusion. The results also show that the external features of main subject and sub-subject are the least effective for solving the author name disambiguation problem in the academic database. In addition, using a random forest as the classifier in the matching stage resolves the confusion in writing authors’ names with an accuracy of over 99%. Conclusion: The results show the high efficiency of our framework in unifying names, in terms of accuracy, recall, and F-measure, compared with support vector machine, nearest-neighbor, and genetic classifiers. The proposed method can be applied to scientific databases to standardize authors’ names. In future work, we will investigate the efficiency of the framework in a non-stationary environment in which the data distribution may change over time.
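
The three stages named in the abstract (searching, matching, grouping) can be illustrated with a minimal Python sketch. The abstract does not give the Farsi-Soundex coding rules, the feature encoding, or the details of the hash-based grouping, so everything below is an assumption made for illustration. `phonetic_key` is a simplified Soundex-style key over Latin transliterations (it is neither the paper's Farsi-Soundex nor full American Soundex). `is_same_author` is a hypothetical hand-written rule standing in for the trained random forest. The grouping step uses a dictionary-based union-find. The sample records are invented.

```python
from collections import defaultdict
from itertools import combinations

# Simplified Soundex-style consonant classes (an assumption, not Farsi-Soundex).
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit


def phonetic_key(name, length=4):
    """First letter plus consonant-class digits; adjacent repeats are collapsed."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        if digit and digit != prev:
            key += digit
        prev = digit
    return key[:length].ljust(length, "0")


def search(records):
    """Stage 1 (searching): block records whose last names share a phonetic key."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[phonetic_key(rec["last"])].append(idx)
    return blocks


def is_same_author(a, b):
    """Stage 2 (matching) placeholder: a hypothetical rule, not a random forest."""
    if a["email"] and a["email"] == b["email"]:
        return True
    same_initial = a["first"][:1].lower() == b["first"][:1].lower()
    return same_initial and a["affil"] == b["affil"]


def group(records):
    """Stage 3 (grouping): merge matched pairs with a dictionary-based union-find."""
    parent = {i: i for i in range(len(records))}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for block in search(records).values():
        for i, j in combinations(block, 2):
            if is_same_author(records[i], records[j]):
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i in range(len(records)):
        groups[find(i)].append(i)
    return sorted(groups.values())


# Invented sample records, for illustration only.
records = [
    {"first": "Ali", "last": "Ahmadi", "email": "a.ahmadi@example.org", "affil": "Univ A"},
    {"first": "A.", "last": "Ahmady", "email": "", "affil": "Univ A"},
    {"first": "Sara", "last": "Rezaei", "email": "s.rezaei@example.org", "affil": "Univ B"},
    {"first": "Amir", "last": "Ahmadi", "email": "amir@example.org", "affil": "Univ C"},
]
print(group(records))  # [[0, 1], [2], [3]]
```

Blocking first keeps the number of pairwise comparisons small, because the matcher only sees records that share a key. In a setup closer to the one described, `is_same_author` would be replaced by a classifier trained on pairwise feature comparisons (email, last name, first name, affiliation, co-authors, subjects).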