Improving and Correcting Common Processes in Optical Character Recognition of Persian Texts Using the Features of the Persian Script and a Dimension Transference Algorithm



Authors	Zareian Arash, Mosavi Miangah Tayebeh, Rovshan Belghis, Fakhr Ahmad Mostafa
Source	Jostarha-ye Zabani (جستارهاي زباني) - 1402 - Volume: 14 - Issue: 2 - Pages: 363-400
Abstract	Because optical character recognition (OCR) technology was originally built on the features of Latin script, almost all the algorithms and stages used in common Persian character recognition systems have also been developed on the structure and features of Latin scripts. Using the tools and features of Latin scripts to design Persian-oriented systems has not led to correct recognition of Persian characters; it has also confused both the software and the Persian-speaking user. Therefore, after a short introduction on the importance of script and language in information technology, this paper traces the evolution of the Persian script through different periods, describes its features and its differences from other scripts, and classifies its formal elements according to their use and importance in the user's interaction with Persian OCR software. By describing and analyzing the stages of character recognition on the basis of Persian script features, and by explaining how they differ from the existing Latin-oriented variants, the paper presents a different picture of the Persian writing system as used with computers, especially in OCR systems, so that the reader can see in practice the capabilities and capacity of the Persian writing system when set against the simple Latin writing system. Relying on these same features, we have also used a geometric transfer from two-dimensional to one-dimensional space in order to improve and correct the common algorithms for Persian OCR, ease the use of patterns, and reduce the size of the database.
Keywords	optical character recognition, OCR, dimension transference algorithm, Persian writing system, scriptological features of Persian
Address	Payame Noor University, Tehran Center, PhD student, Department of Linguistics, Iran; Payame Noor University, Tehran Center, Department of Linguistics, Iran; Payame Noor University, Tehran Center, Department of Linguistics, Iran; Shiraz University, School of Electrical and Computer Engineering, Department of Computer Engineering, Iran
Email	mfakhrahmad@yahoo.com

Correction and Improvement of the Common Processes in Optical Character Recognition (OCR) of Persian Texts: Using the Features of the Persian Script and a Dimension Transference Algorithm

Authors	Zareian Arash, Mosavi Miangah Tayebeh, Rovshan Belghis, Fakhr Ahmad Mostafa
Abstract	Since optical character recognition (OCR) technology is essentially based on Latin script, almost all the algorithms and processes involved in Persian OCR systems are built on the structure and scriptological features of the Latin alphabet. Using the means and features of Latin script to design Persian-based OCR systems, however, has not produced adequate recognition of Persian characters, and it has also confused both Persian-speaking users and the systems themselves. This paper therefore begins with a short review of the significance of language and linguistics in information technology in connection with OCR systems. It continues with a short history of the Persian/Arabic script, focusing on the scribal features of the Persian writing system and its differences from other scripts. In the next part, so that the formal elements of the Persian script can be used effectively, these elements are categorized according to their application and significance in the user's interaction with Persian OCR systems. Furthermore, a step-by-step discussion and analysis of the processes involved in optical character recognition, based on the scriptological features of the Persian script, pinpoints the deficiencies and faults of the current Latin-based OCR systems. It also presents a different aspect of the Persian writing system in connection with its use in computer software, especially OCR systems, so that the reader can see in practice the potential and capabilities of this complex script in contrast to the simpler Latin writing system. Finally, in order to upgrade and improve the current algorithms employed in Persian OCR systems, a geometrical process of transferring two-dimensional specifications into one-dimensional ones is used.
The proposed algorithm, which is based on the scriptological features of the Persian script, simultaneously makes patterns easier to manipulate, reduces the bulk of the database, and accelerates data processing.

1. Introduction

Since optical character recognition (OCR) technology is essentially based on Latin script, almost all the algorithms and processes involved in Persian OCR systems are built on the structure and scriptological features of the Latin alphabet. Using the means and features of Latin script to design Persian-based OCR systems, however, has not produced adequate recognition of Persian characters, and it has also confused both Persian-speaking users and the systems themselves. Therefore, in order to present a different portrait of the Persian writing system when working with computers, especially in OCR systems, this research describes and analyzes the processes involved in optical character recognition on the basis of the scriptological features of the Persian alphabet and elaborates on their differences from the existing Latin-based systems. In line with this objective, after reviewing the history and evolution of the Persian script through different periods, this research gives a classified illustration of the scriptological features of the Persian writing system and its formal elements, with a special focus on OCR processes. Accordingly, the formal elements of the Persian script are categorized according to their application and significance in the user's interaction with Persian OCR software. Furthermore, the effective use of these scriptological elements is expressed both within the framework of the existing algorithms and in the form of a proposed algorithm.
On the one hand, the proposed algorithm practically eliminates the high sensitivity of the existing algorithms to the cursiveness and elongation of Persian letters, which previously increased the error rate of OCR processes. On the other hand, it largely prevents growth in the size of the database and in the computations related to the stored patterns, which previously degraded software performance.

2. Literature review

Persian/Arabic character recognition has been studied since the 1970s (Bonyani & Jahangard, 2020), and the earliest algorithms for recognizing Arabic script were released in the 1990s (Margner & El-Abed, 2008). Many researchers, including Shafii (2014), gave up holistic segmentation of Persian characters because of difficulties arising from some special features of the Persian alphabet and worked only on sub-word recognition instead. The algorithm proposed by Kiaei (2019), although it worked only on a limited set of printed omni-font texts, did not achieve acceptable results and was inefficient when facing word sequences. Rhmati et al. (2020), the latest research in the field of character segmentation, like many other studies treated the baseline connector as part of the character; their algorithm proposed a procedure to shorten overlong baseline connectors so that existing systems could recognize the characters more easily. Recent studies on optical character recognition set structural features aside during recognition and rely primarily on holistic, neural-network-based algorithms to extract the distinctive features of characters (Bonyani & Jahangard, 2020).

3. Discussion

Because the proposed algorithm is designed around the concept of the baseline connector (BC), all connected characters share an identical BC component. That is, each instance of the BC, regardless of its length, is identified as one and the same component.
In this way, the BC component of each character and its variable extra stretches are removed by means of algorithms and mathematical processes and replaced by a single special code. This differs from the commonly known methods of character segmentation, in which the whole character, including the BC component, goes through an image-processing stage. Here, in the pattern-comparison stage, the system first recognizes the BC component and removes its extra stretches, and only then compares the remaining letter image with the stored patterns. Removing the BC component from the text image and replacing it with a simple code has two effects that set it apart from customary practice: 1) letter segmentation occurs naturally and successfully; 2) instead of comparing a letter image with all existing patterns, the system uses the BC component code to restrict comparison and recognition to the letter image (the raw letter) and the patterns belonging to the same set. This restriction is possible because, based on the position of the BC component, the letters can be divided into four sets: a) letter + BC (initial letters); b) BC + letter (final letters); c) BC + letter + BC (medial letters); d) isolated form without a BC on either side (isolated letters).
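
As a rough illustration of the two effects described above, the following Python sketch treats a word as a sequence of tokens in reading order, collapses every BC run of any length into one code, and assigns each raw letter to one of the four sets according to the BCs beside it. The token format, the `BC` symbol, and the names `LetterSet`, `collapse_bc`, `classify`, and `candidates` are assumptions made for this example. The sketch also simplifies by letting two neighbouring letters share one BC code, and it does not reproduce the paper's image-level procedure.

```python
from enum import Enum

BC = "_"  # hypothetical code standing in for a baseline-connector segment


class LetterSet(Enum):
    INITIAL = "letter + BC"
    FINAL = "BC + letter"
    MEDIAL = "BC + letter + BC"
    ISOLATED = "no BC on either side"


def collapse_bc(tokens):
    """Replace every run of consecutive BC tokens with a single BC code."""
    out = []
    for tok in tokens:
        if tok == BC and out and out[-1] == BC:
            continue  # extra stretch of the same connector: drop it
        out.append(tok)
    return out


def classify(tokens):
    """Pair each raw letter with the set implied by the BCs next to it."""
    tokens = collapse_bc(tokens)
    result = []
    for i, tok in enumerate(tokens):
        if tok == BC:
            continue
        before = i > 0 and tokens[i - 1] == BC
        after = i + 1 < len(tokens) and tokens[i + 1] == BC
        if before and after:
            letter_set = LetterSet.MEDIAL
        elif after:
            letter_set = LetterSet.INITIAL
        elif before:
            letter_set = LetterSet.FINAL
        else:
            letter_set = LetterSet.ISOLATED
        result.append((tok, letter_set))
    return result


def candidates(letter_set, pattern_db):
    """Return only the stored patterns that belong to the given set."""
    return pattern_db.get(letter_set, [])


# Three connected letters; the first connector is stretched to three segments.
word = ["A", BC, BC, BC, "B", BC, "C"]
for letter, letter_set in classify(word):
    print(letter, letter_set.name)
```

Under these assumptions, an elongated connector and a short one yield the same token sequence, so the stored patterns never need a separate entry per stretch length, and `candidates` narrows each comparison to one of the four sets.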
Keywords	optical character recognition, OCR, computational linguistics, scribal features, Persian writing system