A novel hybrid approach to finding meaningful basis vectors for explicit representation of word vectors



Authors	Pakzad Atefe, Analoui Morteza
Source	Soft Computing and Information Technology - 1401 SH - Volume: 11 - Issue: 1 - Pages: 1-17
Abstract	The main goal of this research is the explicit representation of semantic word vectors with low dimensionality. To produce explicit semantic word vectors, the problem of finding a limited number of meaningful basis vectors must be solved in such a way that reducing the dimensionality of the word vectors does not cause a large drop in accuracy. In this research we present a hybrid approach to finding meaningful basis vectors. First, we obtain n basis vectors with the proposed methods: (1) the word-similarity-to-word-frequency ratio criterion, (2) feature selection based on comparison of distance matrices, and (3) binary weighting based on the BPSO algorithm. Then, to draw equally on the expertise of methods 1 and 2, we combine half of the basis vectors obtained with the ratio criterion with half of the basis vectors selected by the feature selection method, which yields the first combined basis vectors. In the next step, we add the common context words that received a weight of 1 from the BPSO method to the first combined basis vectors. This yields the second combined basis vectors, which are meaningful: each basis vector corresponds to an informative context word. The explicit word vectors produced with these meaningful basis vectors are therefore interpretable. We train the proposed approach on the ukWaC corpus and evaluate it on the word similarity task. Both the first and the second combined basis vectors improve accuracy, and the improvement is larger for the first combined basis vectors. The evaluation of the explicit word vectors obtained with the first basis vectors shows that, although the word vector dimensionality is reduced from 5000 to 1511, the Spearman correlation coefficient on the MEN, RG-65, and SimLex-999 test sets increases by 2.47%, 7.39%, and 0.52%, respectively.
Keywords	basis vectors, word vector representation, interpretable word vectors, binary weighting, feature selection, word similarity task
Address	Iran University of Science and Technology, School of Computer Engineering, Iran; Iran University of Science and Technology, School of Computer Engineering, Iran
Email	analoui@iust.ac.ir
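
The combination step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `combine_basis_words`, the choice of the top-ranked half of each list, the removal of duplicates, and the toy words are all assumptions made for illustration, since the abstract does not specify these details.

```python
from typing import List, Mapping, Sequence, Tuple


def combine_basis_words(
    ratio_ranked: Sequence[str],
    feature_ranked: Sequence[str],
    bpso_weights: Mapping[str, int],
    n: int,
) -> Tuple[List[str], List[str]]:
    """Return (first_combined, second_combined) context-word lists.

    ratio_ranked   : context words ranked by the similarity-to-frequency ratio
    feature_ranked : context words ranked by the feature selection method
    bpso_weights   : binary weight (0 or 1) per context word from BPSO
    n              : number of basis vectors each method was asked for
    """
    half = n // 2  # assumption: "half" means the top-ranked half of each list
    first: List[str] = []
    seen = set()

    # First combined basis: half from method 1, then half from method 2.
    for word in list(ratio_ranked[:half]) + list(feature_ranked[:half]):
        if word not in seen:  # assumption: duplicates are kept only once
            seen.add(word)
            first.append(word)

    # Second combined basis: add every context word with BPSO weight 1.
    second = list(first)
    for word, weight in bpso_weights.items():
        if weight == 1 and word not in seen:
            seen.add(word)
            second.append(word)

    return first, second


# Toy example with hypothetical context words.
first, second = combine_basis_words(
    ratio_ranked=["bank", "river", "money", "tree"],
    feature_ranked=["money", "loan", "water", "tree"],
    bpso_weights={"river": 1, "interest": 1, "tree": 0},
    n=4,
)
print(first)
print(second)
```

Because each basis vector is a context word, the resulting lists can be read directly as the interpretable dimensions of the explicit word vectors.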

A novel hybrid approach to finding meaningful basis vectors for explicit representation of word vectors

Authors	Pakzad Atefe, Analoui Morteza
Abstract	The main purpose of this study is to represent semantic word vectors explicitly and with low dimensionality. To produce explicit semantic word vectors, the problem of finding a limited number of meaningful basis vectors must be solved in such a way that reducing the dimensionality does not cause a large drop in accuracy. In this study, we present a hybrid approach to finding meaningful basis vectors. First, we obtain n basis vectors using the proposed methods: (1) the word similarity-to-word frequency ratio criterion, (2) a feature selection method based on comparison of distance matrices, and (3) a binary weighting method based on the BPSO algorithm. Then, to take equal advantage of the expertise of methods 1 and 2, we obtain the first combined basis vectors by combining half of the basis vectors obtained with the word similarity-to-word frequency ratio criterion with half of the basis vectors selected by the feature selection method. In the next step, we take the common context words that receive a weight of 1 as the common basis vectors produced by the binary weighting (BPSO) method, and add them to the first combined basis vectors. This yields the second combined basis vectors, which are meaningful: each basis vector is equivalent to an informative context word. The explicit word vectors produced with these meaningful basis vectors are therefore interpretable. We train the proposed approach on the ukWaC corpus and evaluate it on the word similarity task. Both the first and the second combined basis vectors improve accuracy, and the increase is greater for the first combined basis vectors. The evaluation results of the explicit word vectors obtained with the first basis vectors show that, although the word vector dimensionality is reduced from 5000 to 1511, the Spearman correlation coefficient on the MEN, RG-65, and SimLex-999 test sets increases by 2.47%, 7.39%, and 0.52%, respectively.
Keywords	basis vectors, word vector representation, interpretable word vectors, binary weighting, feature selection, word similarity task
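
The word similarity evaluation mentioned in the abstract compares model similarities with human ratings through the Spearman correlation coefficient. The following is a minimal sketch of that kind of evaluation, not the authors' code: the use of cosine similarity, the helper names, and the toy vectors and ratings are assumptions for illustration only and do not come from MEN, RG-65, or SimLex-999.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def evaluate(
    vectors: Dict[str, Sequence[float]],
    pairs: Sequence[Tuple[str, str, float]],
) -> float:
    """Correlate human ratings with cosine similarities, skipping unknown words."""
    gold, predicted = [], []
    for w1, w2, rating in pairs:
        if w1 in vectors and w2 in vectors:
            gold.append(rating)
            predicted.append(cosine(vectors[w1], vectors[w2]))
    return spearman(gold, predicted)


# Toy explicit vectors: each dimension stands for one hypothetical context word.
toy_vectors = {
    "car": [1.0, 0.0, 0.0],
    "auto": [0.9, 0.1, 0.0],
    "tree": [0.0, 1.0, 0.0],
    "forest": [0.1, 0.9, 0.2],
}
toy_pairs = [("car", "auto", 9.0), ("tree", "forest", 8.0), ("car", "tree", 1.0)]
rho = evaluate(toy_vectors, toy_pairs)
print(round(rho, 3))
```

Running the same evaluation on vectors built from the full and the reduced basis gives the kind of before-and-after comparison that the abstract reports.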