Missing data imputation in multivariate time series data


Authors	Daneshpour Negin, Mirabolghasemi Fatemeh
Source	Signal and Data Processing - 1401 - Issue: 2 - Pages: 39-60
Abstract	Multivariate time series data are found in many fields, such as bioinformatics, biology, genetics, astronomy, geographical sciences, and finance. Many of these datasets contain missing data. Imputing missing data in multivariate time series is a challenging problem that must be handled carefully before any learning or forecasting on the series. Much research has applied various methods to impute missing time series data, typically simple analytical and modeling methods for specific applications or for univariate time series. This paper proposes an improved version of inverse distance weighting (IDW) interpolation for imputing missing data. IDW interpolation has two main limitations: 1) finding the best points closest to the missing data, and 2) choosing the optimal influence power for the neighbors of the missing data. To improve the interpolation, k-means clustering is used to select the neighbors most similar to the data pattern. Because each neighbor influences the missing value to a different degree, the cuckoo search algorithm is used to determine the neighborhood influence power. Five well-known evaluation criteria are used to assess the proposed method. Experiments on four UCI datasets with different percentages of missing data show that, overall, the proposed algorithm outperforms three other comparison methods, with average errors of about 0.05 RMSE, 0.04 MAE, 0.003 MSE, and 5 percent MAPE. The correlation between the actual data and the estimated values under the proposed method is very good, at about 99 percent.
Keywords	missing data imputation, IDW interpolation, cuckoo search algorithm, k-means clustering, multivariate time series
Address	Shahid Rajaee Teacher Training University, Faculty of Computer Engineering, Iran; Shahid Rajaee Teacher Training University, Faculty of Computer Engineering, Iran
Email	fmirabolghasemi@yahoo.com
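
The error figures quoted in the abstract can be made concrete with a small sketch of how such criteria are usually computed on the entries that were hidden and then imputed. This is only an illustration, not the authors' evaluation code. The function name `imputation_errors` and the sample values are hypothetical, and treating the correlation measure as the Pearson coefficient is an assumption, since the abstract reports a correlation of about 99 percent without naming the coefficient.

```python
import math


def imputation_errors(actual, estimated):
    """Return MSE, RMSE, MAE, MAPE (in percent) and Pearson correlation.

    `actual` holds the true values that were hidden, `estimated` the
    imputed values for the same positions (hypothetical inputs).
    """
    if len(actual) != len(estimated) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(actual)
    diffs = [a - e for a, e in zip(actual, estimated)]

    mse = sum(d * d for d in diffs) / n
    mae = sum(abs(d) for d in diffs) / n

    # MAPE is undefined where the true value is zero, so those
    # positions are skipped (a design choice of this sketch).
    nonzero = [(a, d) for a, d in zip(actual, diffs) if a != 0]
    if nonzero:
        mape = 100.0 * sum(abs(d / a) for a, d in nonzero) / len(nonzero)
    else:
        mape = float("nan")

    # Pearson correlation between true and imputed values (assumed
    # to be the correlation measure meant in the abstract).
    mean_a = sum(actual) / n
    mean_e = sum(estimated) / n
    cov = sum((a - mean_a) * (e - mean_e) for a, e in zip(actual, estimated))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_e = sum((e - mean_e) ** 2 for e in estimated)
    if var_a > 0 and var_e > 0:
        corr = cov / math.sqrt(var_a * var_e)
    else:
        corr = float("nan")

    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae,
            "mape": mape, "corr": corr}


# Hypothetical true and imputed values, for illustration only.
print(imputation_errors([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8]))
```

RMSE is the square root of MSE, so the two move together. MAPE is scale-free, which is why the abstract can quote it as a percentage alongside the absolute errors.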

Missing data imputation in multivariate time series data

Authors	Daneshpour Negin, Mirabolghasemi Fatemeh
Abstract	Multivariate time series data are found in a variety of fields, such as bioinformatics, biology, genetics, astronomy, geography, and finance. Many time series datasets contain missing data. Imputing missing data in multivariate time series is a challenging topic that needs careful consideration before learning from or predicting time series. Much research has applied different techniques to time series missing data imputation, usually simple analytic methods and modeling for specific applications or for univariate time series. This paper proposes a hybrid approach to imputing missing data, based on an improved version of inverse distance weighting (IDW) interpolation. The IDW interpolation method has two major limitations: 1) finding the closest points to the missing data, and 2) choosing the optimal effect power for the neighbors of the missing data. Clustering is used to remove the first constraint. It limits and controls the search radius and the number of input points used in the interpolation, which makes it possible to determine which points contribute to the value of a missing entry, so the data most similar to the missing data are found. The k-means clustering method is used to find similar data; this method has been more accurate than other clustering methods in multivariate time series. To remove the second constraint, an evolutionary algorithm is used to find the optimal effect power of each data point. Because each sample within a cluster has a different effect on the estimate of the missing data, cuckoo search is used to find that effect.
The cuckoo search algorithm is applied to the data of each cluster, so that samples more similar to the missing data have more influence on the imputed value and less similar samples have less. Among evolutionary algorithms, cuckoo search is used because of its high convergence speed, its much lower probability of being trapped in local optima, and its ability to quickly solve high-dimensional optimization problems in multivariate time series. To evaluate the performance of the proposed method, the RMSE, MAE, MSE, and MAPE criteria are used. Experiments on four UCI datasets with different percentages of missingness show that, in general, the proposed algorithm performs better than the three comparison methods, with an average RMSE of 0.05, MAE of 0.04, MSE of 0.003, and MAPE of 5%. The correlation between the actual data and the estimated values in the proposed method is about 99%.
Keywords	missing data imputation, IDW interpolation, cuckoo search algorithm, k-means clustering, multivariate time series
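
The pipeline described in the abstract (cluster the data, keep only the neighbors in the cluster of the incomplete record, then weight them by inverse distance raised to a power) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. All function names and the toy data are hypothetical, and distances are taken as Euclidean over the observed columns. The power is chosen here by a simple leave-one-out grid search, swapped in for the cuckoo search used in the paper to keep the example short. The estimate is the standard IDW form: the sum of w_i * x_i divided by the sum of w_i, with w_i = 1 / d_i^p.

```python
import math
import random


def distance(a, b, cols):
    """Euclidean distance between rows a and b over the given columns."""
    return math.sqrt(sum((a[c] - b[c]) ** 2 for c in cols))


def nearest(row, centroids, cols):
    """Index of the centroid closest to `row` over `cols`."""
    return min(range(len(centroids)),
               key=lambda i: distance(row, centroids[i], cols))


def kmeans(rows, k, cols, iters=50, seed=0):
    """Plain Lloyd's k-means; distances use only `cols`. Returns centroids."""
    rng = random.Random(seed)
    centroids = [list(r) for r in rng.sample(rows, k)]
    width = len(rows[0])
    for _ in range(iters):
        labels = [nearest(r, centroids, cols) for r in rows]
        updated = []
        for i in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == i]
            if members:
                updated.append([sum(m[c] for m in members) / len(members)
                                for c in range(width)])
            else:
                updated.append(centroids[i])  # keep an emptied centroid
        if updated == centroids:  # converged
            break
        centroids = updated
    return centroids


def idw_impute(target, col, complete_rows, k=2, power=2.0, seed=0):
    """Estimate target[col] by IDW over rows in the target's cluster."""
    cols = [c for c in range(len(target))
            if c != col and target[c] is not None]
    centroids = kmeans(complete_rows, k, cols, seed=seed)
    cluster = nearest(target, centroids, cols)
    neighbours = [r for r in complete_rows
                  if nearest(r, centroids, cols) == cluster]
    if not neighbours:  # fall back to all rows if the cluster is empty
        neighbours = complete_rows
    num = den = 0.0
    for r in neighbours:
        d = distance(target, r, cols)
        if d == 0.0:
            return r[col]  # identical observed values: copy directly
        w = 1.0 / d ** power
        num += w * r[col]
        den += w
    return num / den


def choose_power(complete_rows, col, candidates=(0.5, 1, 2, 3, 4), k=2):
    """Grid-search stand-in for cuckoo search: lowest leave-one-out error."""
    def loo_error(p):
        err = 0.0
        for i, row in enumerate(complete_rows):
            rest = complete_rows[:i] + complete_rows[i + 1:]
            hidden = list(row)
            hidden[col] = None
            err += (idw_impute(hidden, col, rest, k=k, power=p) - row[col]) ** 2
        return err
    return min(candidates, key=loo_error)


# Toy data (hypothetical): two well-separated groups of records.
rows = [[1.0, 1.0, 10.0], [1.2, 0.9, 10.2], [0.9, 1.1, 9.8], [1.1, 1.2, 10.1],
        [8.0, 8.0, 50.0], [8.2, 7.9, 50.5], [7.9, 8.1, 49.5], [8.1, 8.2, 50.2]]
best_power = choose_power(rows, col=2)
print(best_power, idw_impute([1.05, 1.0, None], 2, rows, power=best_power))
```

Restricting the neighbors to one cluster addresses the first limitation named in the abstract, and the power parameter addresses the second. A larger power concentrates the weight on the closest records, while a power near zero approaches a plain cluster mean.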