Automatic Identification of Persian Word Senses Using Word Embedding

Author	Masood Ghayoomi
Source	Iranian Journal of Information Processing and Management (پژوهشنامه پردازش و مدیریت اطلاعات), 1398 SH, Vol. 35, No. 1, pp. 25-50
Abstract	A word is the smallest unit of language that has both "form" and "meaning". A word may have more than one meaning, and its exact meaning is determined by how it is used in its linguistic context. Collecting all the meanings of a word by hand is laborious and time-consuming. Moreover, the meanings of a word may change over time: existing meanings may fall out of use, or new meanings may be added. One way to determine the meaning of a word is to apply computational methods that take the linguistic context into account. The present study proposes a computational algorithm that determines the senses of Persian homographs from their linguistic context, automatically and without human supervision. To this end, word embedding in a vector space model is used. Word vectors are built with a neural-network-based approach so that the sentence context is captured well in the word vector. In the next step of the proposed model, two modes, sentence-based and context-based, are introduced for building the text vector and determining the word sense. In the sentence-based mode, all words of the sentence that contains the target word contribute to the vector; in the context-based mode, only a limited number of words surrounding the target word are considered. Two evaluation measures, internal and external, are used to assess the performance of the clustering algorithm. The internal measure, which computes the density of the data in each cluster, is calculated for both the sentence-based and the context-based mode. The external evaluation requires gold standard data; for this purpose, a data set of 20 Persian target words with 100 annotated sentences for each word was prepared. According to the internal evaluation, the cluster density of the sentence-based mode is significantly higher than that of the context-based mode. On the V and F measures of the external evaluation, the context-based model performs significantly better than the sentence-based model and the baseline models.
Keywords	Word embedding, clustering, unsupervised machine learning, vector space, natural language processing, word sense representation, Persian language
Address	Institute for Humanities and Cultural Studies, Iran
Email	m.ghayoomi@ihcs.ac.ir
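
The difference between the sentence-based and context-based modes described in the abstract can be illustrated with a minimal Python sketch. It assumes that a text vector is the plain average of the available word vectors and that the context window excludes the target word itself; the abstract states neither. The function names, the `window` parameter, and the toy two-dimensional embeddings are hypothetical and stand in for vectors learned by a neural model.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def mean_vector(words: Sequence[str], emb: Dict[str, Vector], dim: int) -> Vector:
    """Average the embeddings of the words that have a vector.

    Averaging is an assumed composition method, not one taken from the paper.
    Words without an embedding are skipped; an all-zero vector is returned
    when no word is known.
    """
    known = [emb[w] for w in words if w in emb]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def sentence_based(tokens: Sequence[str], emb: Dict[str, Vector], dim: int) -> Vector:
    """Sentence-based mode: every word of the sentence contributes."""
    return mean_vector(tokens, emb, dim)


def context_based(tokens: Sequence[str], target_index: int,
                  emb: Dict[str, Vector], dim: int, window: int = 2) -> Vector:
    """Context-based mode: only `window` words on each side of the target.

    The target word itself is left out here (an assumption).
    """
    lo = max(0, target_index - window)
    hi = min(len(tokens), target_index + window + 1)
    neighbours = list(tokens[lo:target_index]) + list(tokens[target_index + 1:hi])
    return mean_vector(neighbours, emb, dim)


if __name__ == "__main__":
    # Toy data: "shir" stands for an ambiguous target word; vectors are made up.
    toy_emb = {
        "child": [0.2, 0.9], "drank": [0.1, 0.8], "warm": [0.3, 0.7],
        "shir": [0.5, 0.5], "before": [0.6, 0.1], "sleeping": [0.7, 0.2],
    }
    sentence = ["child", "drank", "warm", "shir", "before", "sleeping"]
    print(sentence_based(sentence, toy_emb, dim=2))
    print(context_based(sentence, sentence.index("shir"), toy_emb, dim=2, window=1))
```

With a small window, the context-based vector depends only on the immediate neighbours of the target word, whereas the sentence-based vector is also influenced by words far from it.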

Identifying Persian Words’ Senses Automatically by Utilizing the Word Embedding Method

Authors	Ghayoomi Masood
Abstract	A word is the smallest unit of a language that has a 'form' and a 'meaning'. A word may have more than one meaning, and its exact meaning is determined by the context in which it appears. Collecting all of a word's senses manually is a tedious and time-consuming task. Moreover, words' meanings may change over time: an existing meaning may fall out of use, or a new meaning may be added to the word. Computational methods are one approach to identifying words' senses with respect to their linguistic contexts. In this paper, we propose an algorithm that identifies the senses of Persian words automatically, without human supervision. To reach this goal, we utilize the word embedding method in a vector space model. To build the word vectors, we use a neural-network-based algorithm that captures the context information of the words in the vectors. In the proposed model, the divisive clustering algorithm, one of the hierarchical clustering algorithms, fits the requirements of our research question. Two modes, namely Sentence-based and Context-based, are introduced to identify words' senses. In the Sentence-based mode, all of the words in the sentence that contains the target word are used to build the sentence vector, while in the Context-based mode, only a limited number of words surrounding the target word are used. Two kinds of evaluation metrics, internal and external, are used to evaluate the performance of the clustering algorithm. The silhouette score of each cluster is computed as the internal evaluation metric for both modes of the proposed model. The external evaluation requires gold standard data, for which a data set containing 20 ambiguous words and 100 sentences for each target word was developed. According to the results of the internal evaluation, the Sentence-based mode has a higher cluster density than the Context-based mode, and the difference between them is statistically significant. According to the V-measure and F-measure metrics in the external evaluation, the Context-based mode outperforms the baselines with a statistically significant difference.
Keywords	Word Embedding, Clustering, Unsupervised Machine Learning, Vector Space, Natural Language Processing, Word Sense Representation, Persian
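
The internal and external evaluation named in the abstract can be sketched in plain Python. This is an illustration under stated assumptions, not the paper's evaluation code: the silhouette uses Euclidean distance and is averaged over all points, and the V-measure is the unweighted harmonic mean of homogeneity and completeness. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def silhouette(points: Sequence[Sequence[float]], labels: Sequence[Hashable]) -> float:
    """Mean silhouette over all points (internal metric, Euclidean distance assumed)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != lab)
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)


def _entropy(labels: Sequence[Hashable]) -> float:
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def _conditional_entropy(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """H(a | b) for two parallel label sequences."""
    n = len(a)
    joint = Counter(zip(a, b))
    b_counts = Counter(b)
    return -sum((c / n) * math.log(c / b_counts[bv]) for (_, bv), c in joint.items())


def v_measure(gold: Sequence[Hashable], pred: Sequence[Hashable]) -> float:
    """Harmonic mean of homogeneity and completeness (external metric)."""
    h_gold, h_pred = _entropy(gold), _entropy(pred)
    homogeneity = 1.0 if h_gold == 0 else 1.0 - _conditional_entropy(gold, pred) / h_gold
    completeness = 1.0 if h_pred == 0 else 1.0 - _conditional_entropy(pred, gold) / h_pred
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


if __name__ == "__main__":
    # Toy sentence vectors and sense labels, invented for illustration only.
    vectors = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
    predicted = [0, 0, 1, 1]
    gold_senses = ["sense_a", "sense_a", "sense_b", "sense_b"]
    print(silhouette(vectors, predicted))
    print(v_measure(gold_senses, predicted))
```

The silhouette needs only the vectors and the predicted clusters, which is why it serves as an internal metric; the V-measure compares predicted clusters with gold sense labels and therefore requires an annotated data set such as the one described in the abstract.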