Recognition of Visual Events using Spatio-Temporal Information of the Video Signal



Authors	Soltanian Mohammad, Ghaemmaghami Shahrokh
Source	Signal and Data Processing - 1400 - Issue: 1 - Pages: 119-134

Abstract	This paper takes an analytical look at recognizing visual events in video using the temporal information of the signal. Through transfer learning, descriptors trained on images are applied to video so that events can be recognized with limited computational resources. A convolutional neural network is used to extract concept scores from the video frames. The network's parameters are first fine-tuned on a subset of the training data; the outputs of its fully connected layers are then used as frame-level descriptors. The resulting descriptors are encoded, and finally normalized and classified. The main novelty of this paper is incorporating the temporal information of the video into the encoding of its descriptors. The structured inclusion of visual information in the encoding of video descriptors is often overlooked, which reduces accuracy. To address this, a new encoding method is proposed that improves the trade-off between computational complexity and accuracy in recognizing video events. In this encoding, the temporal dimension of the video signal is used to build a spatio-temporal vector of locally aggregated descriptors (VLAD). It is then shown that the proposed encoding is essentially an optimization problem that can be solved readily with existing algorithms. Compared with the best existing methods for visual event recognition based on frame-level descriptors, the proposed method yields a better model of the video. It achieves higher performance on both test datasets under three measures: mean average precision, mean average recall, and F-measure. The results confirm the capability of the proposed method to improve the performance of visual event recognition systems.
Keywords	Convolutional neural network, Average pooling, Max pooling, Support vector machine, Vector of locally aggregated descriptors
Address	Sharif University of Technology, Department of Electrical Engineering and Electronics Research Institute, Iran; Kharazmi University, Faculty of Mathematical Sciences and Computer, Department of Computer Science, Iran

Recognition of Visual Events using Spatio-Temporal Information of the Video Signal

Authors	Soltanian Mohammad, Ghaemmaghami Shahrokh
Abstract	Recognition of visual events, as a video analysis task, has become popular in the machine learning community. While traditional approaches to video event detection have been in use for a long time, recent deep learning based methods have revolutionized the area, enabling event recognition systems to reach detection rates that traditional approaches could not.
Convolutional neural networks (CNNs) are among the most popular deep networks used in both image and video recognition tasks. They typically begin with several convolutional layers, each followed by an activation layer and possibly a pooling layer, and often end with one or more fully connected layers. The property of CNNs exploited in this work is their ability to extract mid-level features from video frames: unlike traditional approaches based on low-level visual features, CNNs make it possible to extract higher-level semantic features from the frames.
The focus of this paper is the recognition of visual events in video using CNNs. Image-trained descriptors are used so that video recognition can be done with low computational complexity. A fine-tuned CNN serves as the frame descriptor, and its fully connected layers are used as concept detectors, so the feature maps of the activation layers that follow the fully connected layers act as feature vectors. These feature vectors (concept vectors) are mid-level features, which represent video better than low-level features and can partially bridge the semantic gap between low-level features and the high-level semantics of video.
The descriptors obtained from the CNN for each video form a variable-length stack of feature vectors. To organize them and prepare them for classification, they must be properly encoded. The encoded descriptors are then normalized and classified.
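As a rough illustration of the encode-then-normalize step, the sketch below turns a variable-length stack of frame descriptors into a fixed-length vector using the standard vector of locally aggregated descriptors (VLAD), followed by a signed power-law and L2 normalization. It is not the spatio-temporal coding proposed in the paper, whose formulation is not given in this abstract. The function names `vlad_encode` and `power_law_normalize`, the exponent `alpha=0.5`, and the assumption that a codebook of centers is already available (for example from k-means) are all illustrative choices.

```python
import numpy as np


def vlad_encode(frames, centers):
    """Standard VLAD: sum the residuals of descriptors to their nearest center.

    frames  : (T, D) array, one descriptor per frame; T may differ per video.
    centers : (K, D) array, a codebook assumed to be learned beforehand.
    Returns a vector of fixed length K * D.
    """
    frames = np.asarray(frames, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Squared Euclidean distance from every frame descriptor to every center.
    d2 = ((frames[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)  # index of the closest center for each frame
    vlad = np.zeros_like(centers)
    for k in range(centers.shape[0]):
        members = frames[nearest == k]
        if len(members):
            vlad[k] = (members - centers[k]).sum(axis=0)
    return vlad.ravel()


def power_law_normalize(v, alpha=0.5):
    """Signed power-law normalization followed by L2 normalization.

    alpha = 0.5 is an assumed, commonly used value, not one taken from the paper.
    """
    v = np.asarray(v, dtype=float)
    v = np.sign(v) * np.abs(v) ** alpha
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v


# Toy usage: three 2-D frame descriptors and a two-word codebook.
centers = np.array([[0.0, 0.0], [10.0, 10.0]])
frames = np.array([[1.0, 0.0], [0.0, 2.0], [9.0, 10.0]])
code = power_law_normalize(vlad_encode(frames, centers))
```

Because the output length depends only on the codebook size and descriptor dimension, videos with different numbers of frames map to vectors of the same size, which is what a downstream classifier such as a support vector machine needs.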
Normalization may consist of conventional normalization or the more advanced power-law normalization. Its main purpose is to make the distribution of descriptor values more uniform, so that very large or very small descriptor values have a more balanced impact on event recognition.
The main novelty of this paper is that spatial and temporal information in the mid-level features is employed to construct a suitable coding procedure. Temporal information is used in the coding of video descriptors; such information is often ignored, resulting in reduced coding efficiency. Hence, a new coding is proposed that improves the trade-off between the computational complexity of the recognition scheme and the accuracy in identifying video events. It is also shown that the proposed coding takes the form of an optimization problem that can be solved with existing algorithms. The problem is initially non-convex and not solvable in polynomial time with existing methods, so it is transformed into a convex form, which makes it a well-defined optimization problem. While there are many methods for handling convex optimization problems of this type, we use a powerful convex optimization library to solve the problem efficiently and obtain the video descriptors.
To confirm the effectiveness of the proposed descriptor coding method, extensive experiments are carried out on two large public datasets: the Columbia Consumer Video (CCV) dataset and the ActivityNet dataset. Both are popular, publicly available video event recognition datasets with standard train/test splits, and are large enough to serve as reasonable benchmarks for video recognition tasks.
Compared with the best practices available in the field of visual event detection, the proposed method provides a better model of video and a considerably higher mean average precision, mean average recall, and F score on the test sets of the CCV and ActivityNet datasets.
The presented method not only improves accuracy but also reduces the computational cost relative to the state of the art. The experiments confirm the potential of the proposed method to improve the performance of visual recognition systems, especially in supervised video event detection.
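For readers unfamiliar with the headline metric, the sketch below computes mean average precision in its common non-interpolated form: for each event class, videos are ranked by classifier score, precision is taken at the rank of every positive video, and those values are averaged; the per-class results are then averaged over classes. The abstract does not state which exact variant the paper uses, so this definition is an assumption, and the function names `average_precision` and `mean_average_precision` are illustrative.

```python
import numpy as np


def average_precision(scores, labels):
    """Non-interpolated AP for one event class.

    scores : (N,) classifier scores, higher means more confident.
    labels : (N,) ground truth, truthy for videos that contain the event.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    if not labels.any():
        return 0.0  # assumed convention for a class with no positives
    order = np.argsort(-scores, kind="stable")  # rank by descending score
    hits = labels[order]
    precision_at_rank = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    # Average the precision values measured at the rank of each positive.
    return float(precision_at_rank[hits].mean())


def mean_average_precision(score_matrix, label_matrix):
    """mAP: mean of the per-class AP, one class per column."""
    score_matrix = np.asarray(score_matrix, dtype=float)
    label_matrix = np.asarray(label_matrix, dtype=bool)
    aps = [
        average_precision(score_matrix[:, c], label_matrix[:, c])
        for c in range(score_matrix.shape[1])
    ]
    return float(np.mean(aps))
```

Averaging per class first, rather than pooling all videos, keeps rare event classes from being dominated by frequent ones.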
Keywords	Convolutional neural network, Average pooling, Max pooling, Support vector machine, Vector of locally aggregated descriptors