A’laam Corpus: A Standard Named Entity Corpus for the Persian Language



Authors	Hosseinnejad Shadi, Shekofteh Yasser, Emami Azadi Tahereh
Source	Signal and Data Processing - 1396 (Solar Hijri) - Volume 14 - Issue 3 - Pages 127-140
Abstract	Named entity recognition is one of the established problems in natural language processing. Its main applications are in text summarization, information extraction, question answering, machine translation, and document classification systems. One way to build a named entity recognition system is to use corpus-based methods. This paper describes how the A’laam corpus, a standard corpus annotated with named entity tags for the Persian language, was built, and the steps involved. With thirteen named entity tags and a size of 250 thousand words, the collection meets the needs of automatic tagging systems in Persian natural language processing. Using this corpus and the conditional random field machine learning method, a system for recognizing named entities in Persian sentences was built, achieving a precision of 92.94 percent and a recall of 78.48 percent.
Keywords	Natural language processing, named entity recognition, named entity corpus, machine learning, conditional random field
Address	Khajeh Nasir al-Din Toosi Research Center for Development of Advanced Technologies, Iran; Shahid Beheshti University, Faculty of Computer Science and Engineering, Iran; Khajeh Nasir al-Din Toosi Research Center for Development of Advanced Technologies, Iran
Email	t.emami@rcdat.ir

A’laam Corpus: A Standard Corpus of Named Entity for Persian Language

Authors	Hosseinnejad Shadi, Shekofteh Yasser, Emami Azadi Tahereh
Abstract	Named entity recognition (NER) is a natural language processing (NLP) problem whose main applications are text summarization, data mining, data retrieval, question answering, machine translation, and document classification systems. A NER system is tasked with determining the boundaries of each named entity and classifying it into one of a set of predefined categories. These categories include the names of persons, organizations, and locations (e.g. city and country), as well as expressions of time, quantities, monetary expressions, and percentages. In general, corpus-based approaches have proved well suited to the NER problem. Given a NER corpus, named entities can be recognized with rule-based or machine-learning methods. Corpus-based NER systems need standard, appropriately annotated corpora. However, such corpora exist mainly for languages such as English; for Persian/Farsi they are rare or limited in volume. This paper therefore describes the procedure used to produce a standard named entity (NE) corpus for the Persian language, the A’laam corpus. The A’laam corpus contains about 250,000 tokens tagged with 13 NE tags. It was developed at the Research Center for Development of Advanced Technologies (RCDAT). Its tokens are drawn from the Farsi Text Corpus, a standard Farsi corpus of more than 100 million words developed by the Research Center of Intelligent Signal Processing (renamed the Research Center for Development of Advanced Technologies in 2013). The words of the Farsi Text Corpus, selected from diverse written and spoken sources, were tokenized and corrected manually. In addition, an 8-million-word part of the Farsi Text Corpus has part-of-speech (POS) tags at the word level. In total, about 8,400 sentences of the Farsi Text Corpus were randomly selected to obtain the roughly 250,000 tokens of the A’laam corpus.
The A’laam corpus includes words, POS tags, and named entity tags. To evaluate it, a Persian NER system was trained on the corpus, which was divided into train and test sections: the train section accounts for 90% of the corpus and the test section for the remaining 10%. Using the Conditional Random Fields (CRF) method, the Persian NER system achieved a precision of 92.94% and a recall of 78.48%.
Keywords	Natural Language Processing, Named Entity Recognition, Named Entity Corpus, Machine Learning, Conditional Random Field
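
The evaluation described in the abstract (a 90%/10% train/test split, scored by precision and recall) can be illustrated with a minimal Python sketch. It rests on several assumptions that the abstract does not confirm: that tags follow a BIO scheme, that scoring is done on whole entity spans rather than single tokens, and that the split is random at the sentence level. The tag names `PER` and `LOC` and the function names `split_corpus`, `extract_spans`, and `precision_recall` are hypothetical, chosen only for illustration. The CRF model itself is not shown.

```python
import random


def split_corpus(sentences, train_ratio=0.9, seed=0):
    """Shuffle sentences and split them into train and test sections.

    Assumption: a random sentence-level split; the abstract only gives
    the 90%/10% proportions.
    """
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def extract_spans(tags):
    """Return the set of (start, end, type) entity spans in a BIO tag sequence.

    `end` is exclusive. Stray I- tags with no preceding B- tag are ignored.
    """
    spans = set()
    start, etype = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def precision_recall(gold_sequences, predicted_sequences):
    """Compute span-level precision and recall over parallel tag sequences."""
    true_pos = n_pred = n_gold = 0
    for gold, pred in zip(gold_sequences, predicted_sequences):
        gold_spans = extract_spans(gold)
        pred_spans = extract_spans(pred)
        true_pos += len(gold_spans & pred_spans)  # exact span and type match
        n_pred += len(pred_spans)
        n_gold += len(gold_spans)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    return precision, recall


# Toy example with hypothetical tags: the system finds one of two entities.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
print(precision_recall(gold, pred))
```

In the toy example the single predicted entity is correct (precision 1.0) but one of the two gold entities is missed (recall 0.5). This is the same direction of imbalance as the reported 92.94% precision and 78.48% recall: most predicted entities are right, while a larger share of true entities goes undetected.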