A Comparison of Classical Methods and Methods Based on Corpus-Based Statistical Measures in the Automatic Extraction of Core Medical Science Vocabulary Using the Frequency Approach




Authors	Zolfaghar Kondori Zohreh, Mosavi Miangah Tayebeh, Rowshan Belgheis
Source	Journal of Teaching Persian to Speakers of Other Languages - 1398 - Volume: 8 - Issue: 1 - Pages: 227-244
Abstract	Over the past two decades, with advances in science and technology, the use of corpus-based methods in language teaching and in the development of course materials has expanded remarkably. The present study aims to arrive at an automatic method for extracting vocabulary from corpora in Farsi. To this end, frequency-counting methods in two groups, classical methods and methods based on statistical measures, are examined. The capabilities of the classical methods, namely general-corpus frequency counting, specialized-corpus frequency counting, and their improved variants, are compared. The results show that, in the classical methods, applying certain techniques can improve the selection of specialized terms, and that the best performance among them belongs to the improved frequency-counting method on the specialized corpus. The statistical methods used in the study are pointwise mutual information and chi-square. The results obtained for these two methods also confirm that corpus-based frequency-counting methods can be used in Farsi. The chi-square method, extracting 32% specialized terms, and the pointwise mutual information method, extracting 52% specialized terms, perform well in automatically identifying specialized terms. The results of applying these methods to the corpora and comparing them show that methods based on statistical measures can be used for automatic term extraction in the language. This would bring about a new development in the preparation of teaching texts, and teachers would have access to a vocabulary list that is useful, and at times essential, for their learners to know.
Keywords	automatic extraction of medical terms, corpus, hybrid extraction methods, teaching Farsi
Address	Payame Noor University, Tehran Center, Graduate Studies Center, Iran; Payame Noor University, Yazd Center, Department of General Linguistics, Iran; Payame Noor University, Tehran Center, Department of General Linguistics, Iran
Email	bl_rowshan@pnu.ac.ir

Extraction of Core Medical Terms Using Frequency Approach

Authors	Zolfaghar Kondori Zohreh, Mosavi Miangah Tayebeh, Rowshan Belgheis
Abstract	Over the past two decades, the use of corpus-based approaches in language teaching and in the design of teaching materials has increased remarkably. The goal of the present study was to develop an automatic approach to extracting medical terms from corpora in Farsi. To this end, classic and statistical-measure-based methods of frequency counting were applied and their capabilities were compared. The classic frequency approaches were general-corpus frequency, specialized-corpus frequency, and their enhanced variants. The results showed that, in the classic approaches, the extraction of specialized terms can be improved by applying certain techniques; the best performance came from the improved frequency approach on the specialized corpus, which yielded 60% specialized terms among the first 50 extracted terms. Chi-square and PMI confirmed that corpus-based frequency approaches can be used in Farsi. Chi-square, with 32%, and PMI, with 52%, specialized terms extracted, performed adequately in automatic specialized-term extraction. Overall, the results of applying these approaches to the corpora and comparing them showed that statistical-measure approaches are suitable for automatic term extraction. This could change how teaching materials are prepared, giving teachers access to word lists that are useful, and occasionally essential, for language learners.
Extended Abstract	Over the past few decades, with the advancement of technology, the use of corpora in linguistic studies has increased dramatically. By providing large collections of text, linguistic corpora allow linguists to apply a variety of methods of linguistic analysis.
Most studies to date have dealt with English, French, and Japanese; research on Farsi has been limited, and the gap is especially noticeable in specialized fields such as medical sciences, mathematics, science, and tourism. So far, most term or vocabulary extraction in Farsi has been done non-automatically, with researchers reading texts and collecting data by hand. Term extractors built for other languages such as English, French, and Japanese, although quite successful there, have so far not been usable for Farsi, because each extractor is defined in terms of the features of the language it was designed for. To improve teaching materials in Farsi, it was essential to address this problem, so we decided to apply some of these extraction methods and devise one that works properly for Farsi. Since Iran's universities annually admit many international students who are non-native speakers of Farsi and who intend to study in fields such as medicine, engineering, and the humanities, preparing standard, up-to-date teaching materials in Farsi, based on the most modern technologies, is important. The purpose of this study was to improve the resources used in teaching Farsi at university level, especially for non-native speakers, to explore the feasibility of using frequency-based methods for the automatic extraction of core medical terms, and to compare the capabilities of each method. The findings reveal the strengths and weaknesses of these methods for Farsi, indicate whether each can be used in Farsi, and provide technical solutions for improving the results.
Research Methodology: The frequency-counting approaches in this study used a general corpus and a specialized corpus created by the researcher. The general corpus was the Hamshahri Corpus. The researcher-made specialized corpus comprised texts from the science books of grades 1-4 of senior high school and grades 1-3 of junior high school in Iran, science courses at the Imam Khomeini Farsi language center, and general medicine texts from journals and the internet. After the corpus was compiled, prepared, and tokenized, two categories of frequency methods, classical and modern, were applied, and the capabilities of each method were then compared. The classical methods were frequency in the general corpus, frequency in the specialized corpus, and improved versions of both. The modern methods were PMI and chi-square. Pearson correlation analysis and trend analysis were also used to compare the methods.
Research Findings: The results showed that classical methods, in their basic form, have little accuracy in identifying specialized vocabulary. However, applying certain techniques improved the selection of specialized vocabulary; the best performance came from the improved frequency-counting method on the specialized corpus, which extracted 60% specialized vocabulary among the first 50 high-frequency words. This result improved when the scope of the study was widened to the first 100, 150, and 200 extracted words: the percentage of specialized vocabulary identified increased by about 75%. Moreover, the results for the modern methods indicated that these methods can be used in Farsi.
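The abstract names PMI and chi-square but gives neither the formulas nor the events being associated. As a minimal sketch, assuming each word is scored by its association with the specialized corpus relative to the general corpus (a common corpus-comparison setup, not necessarily the authors' exact procedure), the two measures could be computed as follows. The function names `pmi`, `chi_square` and `score_terms`, and the plain token-list inputs, are hypothetical.

```python
import math
from collections import Counter


def pmi(a, b, n_spec, n_gen):
    """PMI between a word and the specialized corpus.

    a, b: counts of the word in the specialized / general corpus.
    n_spec, n_gen: total token counts of the two corpora.
    Assumed definition: log2( P(word, spec) / (P(word) * P(spec)) ).
    """
    if a == 0:
        return float("-inf")  # word never occurs in the specialized corpus
    n = n_spec + n_gen
    return math.log2(a * n / ((a + b) * n_spec))


def chi_square(a, b, n_spec, n_gen):
    """Pearson chi-square for the 2x2 table (word / other tokens) x (spec / general)."""
    c = n_spec - a  # other tokens in the specialized corpus
    d = n_gen - b   # other tokens in the general corpus
    n = n_spec + n_gen
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table, no evidence either way
    return n * (a * d - b * c) ** 2 / denom


def score_terms(spec_tokens, gen_tokens):
    """Return {word: (pmi, chi_square)} for every word of the specialized corpus."""
    spec, gen = Counter(spec_tokens), Counter(gen_tokens)
    n_spec, n_gen = len(spec_tokens), len(gen_tokens)
    return {
        w: (pmi(a, gen[w], n_spec, n_gen), chi_square(a, gen[w], n_spec, n_gen))
        for w, a in spec.items()
    }
```

Sorting the words of the specialized corpus by either score and keeping the head of the list gives candidate terms. Chi-square is symmetric: it is also large for words over-represented in the general corpus, so it is usually paired with a check that the word is relatively more frequent in the specialized corpus.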
The chi-square method, with 32%, and the PMI method, with 52%, specialized vocabulary among the first 50 high-frequency words performed well in automatic term extraction in Farsi. Both detected specialized vocabulary automatically, and these percentages improved when the scope of the study was widened to the first 200 words.
Conclusion: The results showed that frequency-based methods are applicable in Farsi. With classic frequency methods, the improved variants are needed to increase the accuracy of the extracted words. For modern frequency approaches, reliable results require choosing a sufficiently large scope for the extracted vocabulary.
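The percentages reported in the findings are shares of specialized vocabulary among the first 50, 100, 150, and 200 extracted words. The sketch below shows one way such a share could be computed. It assumes a ranked list of extracted words and a reference set of known specialized terms; the abstract does not say how specialized terms were identified, so `reference_terms` and both function names are hypothetical.

```python
def specialized_share(ranked_words, reference_terms, n):
    """Percentage of the first n ranked words that are in reference_terms."""
    top = ranked_words[:n]
    if not top:
        return 0.0
    hits = sum(1 for word in top if word in reference_terms)
    # Divide by len(top), so a list shorter than n is not penalized.
    return 100.0 * hits / len(top)


def share_by_cutoff(ranked_words, reference_terms, cutoffs=(50, 100, 150, 200)):
    """Share of specialized terms at each cutoff (the scopes named in the abstract)."""
    return {n: specialized_share(ranked_words, reference_terms, n) for n in cutoffs}
```

Under this reading, a 60% share at a cutoff of 50 means that 30 of the first 50 extracted words are specialized terms.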
Keywords