A New Method for Duplicate Detection Using Hierarchical Clustering of Records

Authors	Daneshpour Negin, Barzegari Ali
Source	Signal and Data Processing - 1400 - Issue: 4 - Pages: 3-22
Abstract	Because data quality is critical to the performance of software systems, data cleaning, and duplicate record detection in particular, has become one of the most important areas of computer science in recent years. This paper presents a method for detecting duplicate records in which the similarity between records is estimated by hierarchically clustering them on a suitable feature at each level. This yields last-level clusters whose records are very similar to one another. To find duplicates, comparisons are performed only among records within the same last-level cluster. The paper also proposes a relative similarity function, based on the edit distance function, for comparing records with very high precision. A comparison of the evaluation results shows that the proposed method detects 90% of the existing duplicates with 97% accuracy in less time, which is an improvement.
Keywords	Duplicate detection, data cleaning, hierarchical clustering, similarity function, feature selection
Address	Shahid Rajaee Teacher Training University, Faculty of Computer Engineering, Iran; Islamic Azad University, Science and Research Branch, Faculty of Computer Engineering, Department of Computer, Iran
Email	ali.barzegari70@gmail.com

A New Method for Duplicate Detection Using Hierarchical Clustering of Records

Authors	Daneshpour Negin, Barzegari Ali
Abstract	Accurate and valid data are prerequisites for the correct operation of any software system. Errors can always enter data through human and system faults. One such error is the presence of duplicate records in data sources. Duplicate records refer to the same real-world entity. A data source should hold only one record per entity, but for reasons such as the aggregation of data sources and human error in data entry, several copies of an entity can appear in one source. This problem causes errors in a system's operations or output, and it is costly for the organization or business involved. As a result, data cleaning, and duplicate record detection in particular, has become one of the most important areas of computer science in recent years. Many solutions have been presented for detecting duplicates in different situations, but almost all of them are time-consuming. Moreover, the volume of data grows every day, so previous methods no longer perform well enough. Incorrectly detecting two different records as duplicates is another problem that recent work faces. This matters because duplicates are usually deleted, so some correct data is lost. New methods therefore appear to be necessary. In this paper, a method is proposed that reduces the required volume of processing by using hierarchical clustering with appropriate features. In this method, the similarity between records is estimated at several levels. At each level, a different feature is used to estimate the similarity between records. As a result, clusters that contain very similar records are created at the last level. Comparisons for detecting duplicates are performed only on the records within these clusters. The paper also proposes a relative similarity function for comparing records. This function determines similarity with high precision.
The evaluation results show that the proposed method detects 90% of duplicate records with 97% accuracy in less time, and that the results have improved.
Keywords	Duplicate Record Detection, Data Cleaning, Hierarchical Clustering, Similarity Function, Feature Selection
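
The pipeline described in the abstract (group records level by level, then compare only within last-level clusters using an edit-distance-based similarity) can be illustrated with a minimal Python sketch. This is not the authors' implementation. Several parts are assumptions made for illustration: the clustering at each level is replaced by simple grouping on a key function, the relative similarity is taken to be one minus the edit distance divided by the longer string's length, and the names `hierarchical_blocks`, `find_duplicates`, the sample records, and the `0.8` threshold are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def relative_similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def hierarchical_blocks(records, level_keys):
    """Refine clusters level by level, using one key function per level.

    Stand-in for the paper's clustering: records are grouped by exact key.
    """
    clusters = [list(records)]
    for key in level_keys:
        refined = []
        for cluster in clusters:
            groups = defaultdict(list)
            for rec in cluster:
                groups[key(rec)].append(rec)
            refined.extend(groups.values())
        clusters = refined
    return clusters


def find_duplicates(records, level_keys, compare_field, threshold=0.8):
    """Compare records pairwise, but only inside each last-level cluster."""
    pairs = []
    for cluster in hierarchical_blocks(records, level_keys):
        for r1, r2 in combinations(cluster, 2):
            score = relative_similarity(r1[compare_field], r2[compare_field])
            if score >= threshold:
                pairs.append((r1["id"], r2["id"]))
    return pairs


# Hypothetical sample data: level 1 groups by city, level 2 by first letter.
records = [
    {"id": 1, "name": "jonathan smith", "city": "tehran"},
    {"id": 2, "name": "jonathon smith", "city": "tehran"},
    {"id": 3, "name": "jane doe", "city": "tehran"},
    {"id": 4, "name": "jonathan smith", "city": "shiraz"},
]
level_keys = [lambda r: r["city"], lambda r: r["name"][0]]
pairs = find_duplicates(records, level_keys, compare_field="name")
```

In this toy data, records 1 and 2 share a last-level cluster and are flagged, while record 4 is never compared with record 1 because it falls into a different cluster at the first level. That is the trade-off behind this kind of approach: fewer comparisons, at the cost of possibly missing duplicates that the chosen features separate.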