Identification and Extraction of Persian Collocations Using Computational Methods



Authors	Heshmati Zainabolhoda, Maleki Vika Mina, Bijankhan Mahmood, Veisi Hadi
Source	پژوهشنامه پردازش و مديريت اطلاعات (Journal of Information Processing and Management) - 1403 - Volume: 40 - Issue: 2 - Pages: 577-604
Abstract	This article addresses the recognition of collocations in Persian. Previous research on Persian in this area has been mainly statistical and contrastive. The aim of this study is to recognize collocations using a corpus-based, computational approach. To this end, the Persian Language Database is used as the research corpus. In addition, because no collocation dictionary exists for Persian, a dataset of collocations was built from the Advanced Learner's Persian Dictionary. Using fastText embedding vectors, a language model is trained with a long short-term memory (LSTM) network. ParsBERT was also fine-tuned, and the recall of this language model was computed using lists of one thousand collocations and one thousand non-collocations. Finally, a contrastive analysis of collocation recognition in the Google translation engine was carried out by translating one thousand Persian sentences, each containing one collocation, into English. The results show that ParsBERT recognized collocations and non-collocations with recall of 95.93% and 85.8%, respectively, while the language model trained with the LSTM network recognized collocations and non-collocations with recall of 6.6% and 0%, respectively. The contrastive analysis of Google's translation accuracy on collocations yielded three outcomes: 1) the collocation was correctly recognized and translated; 2) the collocation was not correctly recognized, and the translation is literal and word for word; and 3) the collocation was not recognized, and an incorrect translation was produced.
Keywords	collocation, ParsBERT, long short-term memory, computational linguistics, Persian language
Address	University of Tehran, Faculty of Intelligent Systems, Iran; University of Tehran, Iran; University of Tehran, Faculty of Literature and Humanities, Iran; University of Tehran, Faculty of Intelligent Systems, Iran
Email	h.veisi@ut.ac.ir

Using Computational Methods for Persian Collocations Identification and Extraction

Authors	Heshmati Zainabolhoda, Maleki Vika Mina, Bijankhan Mahmood, Veisi Hadi
Abstract	This article explores the recognition of collocations in Persian. Previous research in this field has been primarily statistical and comparative. The objective of this study is to identify collocations using a corpus-based, computational approach. To this end, the Persian Language Database is used as the research corpus. Additionally, because no comprehensive collocation dictionary exists for Persian, a dataset of collocations was constructed from the Advanced Learners' Persian Dictionary. Using fastText embedding vectors, a language model is trained with a long short-term memory (LSTM) network. Furthermore, ParsBERT is fine-tuned, and the performance of this language model is evaluated using lists of a thousand collocations and non-collocations. Finally, a comparative analysis of collocation recognition is conducted with Google Translate by translating a thousand Persian sentences into English, each containing at least one collocation. The results indicate that the ParsBERT model achieves recall rates of 93.95% and 85.8% for collocation and non-collocation recognition, respectively. In contrast, the LSTM-based language model achieves recall rates of 6.6% and 0% for collocation and non-collocation recognition, respectively. The comparative analysis of Google Translate's accuracy in translating collocations yielded three key findings: 1) the collocation was correctly recognized and translated; 2) the collocation was not correctly recognized, resulting in a literal, word-for-word translation; and 3) the collocation was not recognized, leading to an incorrect translation.
Keywords	collocation, ParsBERT, long short-term memory, computational linguistics, Persian language
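
The abstract reports recall separately for collocations and non-collocations. As a reading aid, the minimal Python sketch below shows how such per-class recall can be computed from gold labels and model predictions. The function name, the label encoding (1 for collocation, 0 for non-collocation), and the toy data are assumptions made for illustration; they are not the authors' evaluation code or data.

```python
from typing import Sequence


def per_class_recall(gold: Sequence[int], predicted: Sequence[int], label: int) -> float:
    """Recall for one class: items of that class predicted correctly,
    divided by all gold items of that class."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    # Keep only the predictions for items whose gold label is the class of interest.
    predictions_for_class = [p for g, p in zip(gold, predicted) if g == label]
    if not predictions_for_class:
        raise ValueError(f"no gold items with label {label}")
    hits = sum(1 for p in predictions_for_class if p == label)
    return hits / len(predictions_for_class)


# Hypothetical toy data (assumed encoding): 1 = collocation, 0 = non-collocation.
gold = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 1, 1]

collocation_recall = per_class_recall(gold, predicted, label=1)
non_collocation_recall = per_class_recall(gold, predicted, label=0)
print(f"collocation recall: {collocation_recall:.2%}")
print(f"non-collocation recall: {non_collocation_recall:.2%}")
```

Reporting the two classes separately, as the abstract does, shows whether a model recognizes collocations only at the cost of misclassifying ordinary word combinations.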