A Corpus of Light Verb Constructions in Persian




Authors	Eshaghi Mahdieh, Karimi Doostan Gholamhossein
Source	Pazhuhesh-ha-ye Zabanshenasi (Researches in Linguistics) - 1401 - Volume: 14 - Issue: 1 - Pages: 173-198
Abstract	A linguistic corpus is a large collection of linguistic data based on actual usage by speakers, which gives researchers access to real patterns of language use. Beyond their large volume of data, corpora have the advantage over other data sources of making computer-assisted linguistic analysis possible. This article introduces the first corpus of light verb constructions in Persian. Understanding the nature of these constructions and having access to a list of them matters both theoretically and practically. The findings are applicable to artificial intelligence research related to natural language processing, machine translation, Persian language teaching, grammar writing, and lexicography. The corpus presented here is named the "Linguistic Corpus of Persian Light Verb Constructions", or LCP. To build it, the monolingual corpus of the Research Institute for Communication and Information Technology (Bijankhan, 1397), which contains 950,000 text files, was selected as the base corpus. Complex verbal constructions formed with 21 productive Persian light verbs were extracted from it and, after annotation within the framework of Distributed Morphology (Halle & Marantz, 1993; Marantz, 2013), were presented as a corpus comprising more than 6,000 light verb constructions in more than 2,000,000 linguistic contexts. A comparison of the number of lexical verbs in Persian with the number of light verb constructions in this corpus is the most obvious reason why such a corpus is needed among Persian language resources. In addition, the nature of the corpus, namely its presentation of light verb constructions in different linguistic contexts, can help researchers answer open questions about these constructions, confirm or reject hypotheses, and propose new theories.
Keywords	Persian language, data resources, linguistic corpus, light verb constructions, natural language processing
Address	University of Tehran, Faculty of Literature and Humanities, Iran; University of Tehran, Faculty of Literature and Humanities, Iran
Email	gh5karimi@ut.ac.ir

A Corpus of Light Verb Constructions in Persian

Authors	Eshaghi Mahdieh, Karimi Doostan Gholamhossein
Abstract	A linguistic corpus is a collection of linguistic data derived from language texts that presents real patterns of language use to researchers. The advantage of a corpus over other linguistic resources stems from the amount of data it holds and from the possibility of using computers in linguistic studies. The present study introduces an annotated monolingual corpus of light verb constructions (LVCs) in Persian, the LCP, developed by the authors. The corpus contains more than 6,000 LVCs used in more than 2,000,000 linguistic contexts. A comparison of the number of LVCs with the number of simple verbs in Persian is enough to show the importance of this type of language resource. The annotated corpus presents LVCs formed by 21 Persian light verbs (LVs) as they are used in real contexts. It can readily supply a large volume of varied, machine-processable data with which researchers can assess existing hypotheses and put forward new ones.

Introduction
Light verbs are a group of verbs that have lost part of their semantic content in the course of language evolution. In Persian, a light verb combines with a preverbal element such as a noun, an adjective, or a prepositional phrase to form a light verb construction (LVC). The study of LVCs is important both theoretically and practically, and all the more so in Persian, whose verbal system consists largely of LVCs. Nevertheless, many studies have pointed out the challenges that Persian LVCs pose for computational systems. They have emphasized the lack of appropriate computational resources and the need for studies that give researchers the standard patterns of these constructions in Persian (Maerefat, 2004; Hasas Sediqi, 2010; Taslimipoor, 2012; Askariyan, 2012; and Barfi, 2016, among others). Although valuable Persian corpora have already been developed by specialists such as Bijankhan (2004, 2018), Asi (2005), and Al-e-Ahmad et al. (2010), no corpus comprehensively represents the LVCs of all productive Persian light verbs. The only available corpus dealing with Persian LVCs is PersPred (Samvelian & Faghiri, 2013), which covers only the constructions formed with one of the 21 productive Persian LVs, zadan. To address this need, we developed the first corpus of Persian LVCs covering all 21 productive LVs as they are used in real contexts.

Materials and Methods
The corpus was designed as a synchronic monolingual corpus of Persian LVCs. Its development went through the following steps: designing the structure of the corpus; selecting a base corpus; normalizing the texts; defining the search nodes; writing macros in Visual Basic for Applications (VBA) to prepare the search software; extracting all sentences containing the verbs under investigation, whether used as light or lexical verbs; extracting from these the sentences containing LVCs; and finally selecting an annotation model and applying it to the results.
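The normalization, search-node, and sentence-extraction steps described above can be illustrated with a minimal sketch. The authors report using VBA macros; the Python below is not their code, only an illustration under stated assumptions. The function names, the character-mapping table, the naive sentence splitter, and the example stems for zadan are all hypothetical. As in the first extraction pass described in the abstract, the match is deliberately over-inclusive and does not distinguish light from lexical uses.

```python
import re

# Assumed normalization table: map Arabic code points to their Persian
# counterparts (Arabic yeh -> Persian yeh, Arabic kaf -> Persian kaf).
_CHAR_MAP = str.maketrans({"\u064A": "\u06CC", "\u0643": "\u06A9"})


def normalize(text: str) -> str:
    """Unify character variants and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.translate(_CHAR_MAP)).strip()


def split_sentences(text: str) -> list[str]:
    """Naive splitter on '.', '!', '?' and the Arabic-script question mark."""
    return [s.strip() for s in re.split(r"[.!?\u061F]+", text) if s.strip()]


def build_node_pattern(stems: list[str]) -> re.Pattern:
    """Match any whitespace-delimited token that contains one of the stems.

    This over-generates on purpose (a stem may also occur inside unrelated
    words); a later filtering step would have to remove false hits.
    """
    alternation = "|".join(re.escape(s) for s in stems)
    return re.compile(rf"(?<!\S)\S*(?:{alternation})\S*(?!\S)")


def extract_candidates(text: str, stems: list[str]) -> list[str]:
    """Return the normalized sentences that contain at least one search node."""
    pattern = build_node_pattern(stems)
    return [s for s in split_sentences(normalize(text)) if pattern.search(s)]


if __name__ == "__main__":
    # Hypothetical search nodes: past and present stems of zadan.
    zadan_stems = ["زد", "زن"]
    sample = "او حرف زد. هوا خوب است. آنها قدم می‌زنند."
    for sentence in extract_candidates(sample, zadan_stems):
        print(sentence)
```

Matching on stems rather than full word forms keeps the list of search nodes short, at the cost of false hits that must be filtered afterwards, which mirrors the two-stage extraction the abstract describes.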
Keywords	Persian language, language resources, linguistic corpus, light verb constructions, natural language processing