Reinforcement Learning Based on the Actor-Critic Architecture in Multi-Agent Systems for Traffic Control


Authors	Aslani Mohammad, Mesgari Mohammad Saadi, Motieyan Hamid
Source	Journal of Geomatics Science and Technology (علوم و فنون نقشه برداري) - 1394 SH - Volume: 5 - Issue: 3 - Pages: 233-245
Abstract	In the second half of the last century, most societies witnessed the onset of a phenomenon known as urban traffic congestion, which occurs when a large number of vehicles pass through the same transportation infrastructure at the same time. Urban traffic congestion has well-known economic and environmental consequences, including air pollution, reduced speed, longer travel times, higher fuel consumption and even more accidents. One economical way to manage growing travel demand and prevent urban congestion is to increase the efficiency of existing infrastructure through intelligent traffic control systems. Because of its distributed and autonomous nature, traffic control can also be modeled well by multi-agent systems. Drivers and traffic signals can be regarded as agents that exhibit intelligent behavior. Producing such behavior requires placing the necessary knowledge of the surrounding environment in the agent's mind, but the high complexity of urban traffic patterns and the non-stationarity of most traffic environments make it very difficult and impractical to give agents prior knowledge of the environment. A method that lets the agent acquire the necessary knowledge while interacting with the environment is therefore essential, and this research uses reinforcement learning to address that challenge. The aim of this paper is to improve traffic control strategies, and specifically the intelligent control of traffic signals, by developing reinforcement learning techniques in multi-agent systems. The actor-critic architecture, a common reinforcement learning architecture with separate memory structures for the policy and for the value function, was used. The results showed that, for a single intersection, intelligent traffic signal control reduces queue length by 23% and travel time by 16% compared with non-intelligent traffic signal control.
Keywords	Multi-agent systems, reinforcement learning, actor-critic architecture, traffic control
Address	K. N. Toosi University of Technology, Faculty of Geodesy and Geomatics Engineering, Iran; K. N. Toosi University of Technology, Faculty of Geodesy and Geomatics Engineering, Department of Geospatial Information Systems, Iran; K. N. Toosi University of Technology, Faculty of Geodesy and Geomatics Engineering, Iran

An Actor-Critic Reinforcement Learning Approach in Multi-Agent Systems for Urban Traffic Control

Authors	Aslani Mohammad, Mesgari Mohammad Saadi, Motieyan Hamid
Abstract	Most urban societies today experience so-called urban traffic congestion, which arises when too many vehicles use the same transportation infrastructure at the same time. Congestion has several consequences, including air pollution, reduced speed, longer travel times, higher fuel consumption and even more traffic incidents. One feasible way to cope with the growth in transportation demand is to improve the existing infrastructure by means of intelligent traffic control systems. From a traffic engineering point of view, a traffic control system consists of the physical network, control devices (traffic signals, variable message signs and so forth), a model of transportation demand and a control strategy. This paper focuses on the last of these, and on traffic signal control in particular.
Traffic signal control can be modeled well by multi-agent systems because of its distributed and autonomous nature. In this context, drivers and traffic signals are treated as distributed, autonomous and intelligent agents. Because urban traffic patterns are highly complex and the traffic environment is non-stationary, building an optimized multi-agent system from preprogrammed agent behavior is largely impractical. The agents must instead acquire their knowledge through a learning mechanism by interacting with the environment.
Reinforcement learning (RL) is a promising approach for training an agent, which optimizes its behavior by interacting with the environment. At each step the agent receives information on the current state of the environment, performs an action that may change that state, and receives a scalar reward that reflects how appropriate its behavior has been. The function that indicates which action to take in a given state is called the policy. The goal of RL is to find a policy that maximizes the long-term reward.
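The long-term reward that a policy maximizes is commonly formalized as a discounted sum of future rewards. The sketch below is illustrative only: the discount factor `gamma` and the function name `discounted_return` are assumptions, since the abstract does not say how the long-term reward is aggregated.

```python
def discounted_return(rewards, gamma=0.9):
    """Return r_1 + gamma * r_2 + gamma^2 * r_3 + ... for a finite reward sequence.

    `gamma` (0 <= gamma <= 1) is an assumed discount factor; the abstract
    does not specify one.
    """
    total = 0.0
    # Work backwards so each step folds in the discounted value of what follows.
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

With `gamma` close to 1 the agent weighs distant rewards almost as heavily as immediate ones; with `gamma` near 0 it behaves greedily.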
Several types of RL algorithms have been introduced, and they can be divided into three groups: actor-only, critic-only and actor-critic methods.
Actor-only methods typically work with a parameterized family of policies over which optimization procedures can be applied directly. Often the gradient of the value of a policy with respect to the policy parameters is estimated and then used to improve the policy. The drawback of actor-only methods is that the gain in performance is harder to estimate when no value function is learned. Critic-only methods first find the optimal value function and then derive an optimal policy from it. This approach limits the ability to use continuous actions and thus to find the true optimum. In this research, actor-critic reinforcement learning is applied as the learning method for truly adaptive traffic signal control. The actor-critic method is a temporal-difference method with a separate memory structure that explicitly represents the policy independently of the value function. The policy structure is known as the actor because it is used to select actions, and the critic is a state-value function.
In this paper, AIMSUN, a microscopic traffic simulator, is used to model the traffic environment. AIMSUN models stochastic vehicle flow using car-following, lane-changing and gap-acceptance models. The AIMSUN API was used to construct the state, execute the action and calculate the reward signal at each traffic light. The state of each agent is represented by a vector of 1 + P components, where the first component is the phase number and P is the number of entrance streets leading to the intersection. The action of the agent is the duration of the current phase. The immediate reward is defined as the reduction in the total number of cars waiting in all entrance streets; that is, the difference between the total numbers of waiting cars at two successive decision points is used as the reward signal.
The reinforcement learning controller is benchmarked against optimized pre-timed control. The results indicate that the actor-critic controller decreases queue length, travel time, fuel consumption and air pollution compared with the optimized pre-timed controller.
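The state, reward and actor-critic update described above can be sketched as follows. This is a minimal tabular illustration under stated assumptions, not the authors' implementation: it assumes the P remaining state components are per-street queue counts, replaces the phase duration with a small discrete set (`DURATIONS`), and uses a softmax actor over action preferences. All names and the step sizes `alpha`, `beta`, `gamma` are hypothetical, and no AIMSUN API calls are shown.

```python
import math
import random
from collections import defaultdict

DURATIONS = (10, 20, 30, 40)  # hypothetical candidate phase durations (seconds)

def make_state(phase, queues):
    """State with 1 + P components: phase number, then one value per entrance street
    (assumed here to be the queue count on that street)."""
    return (phase, *queues)

def reward(prev_queues, curr_queues):
    """Reduction in total waiting cars between two successive decision points."""
    return sum(prev_queues) - sum(curr_queues)

class ActorCritic:
    """Tabular actor-critic: separate memories for the policy and the value function."""

    def __init__(self, n_actions, alpha=0.1, beta=0.1, gamma=0.9, seed=0):
        self.n_actions = n_actions
        self.alpha, self.beta, self.gamma = alpha, beta, gamma
        self.value = defaultdict(float)                     # critic: V(s)
        self.pref = defaultdict(lambda: [0.0] * n_actions)  # actor: preferences p(s, a)
        self.rng = random.Random(seed)

    def policy(self, state):
        """Softmax over the actor's preferences for this state."""
        prefs = self.pref[state]
        top = max(prefs)  # subtract the max for numerical stability
        exps = [math.exp(p - top) for p in prefs]
        total = sum(exps)
        return [e / total for e in exps]

    def act(self, state):
        """Sample an action index according to the current policy."""
        return self.rng.choices(range(self.n_actions), weights=self.policy(state))[0]

    def update(self, state, action, r, next_state):
        """One temporal-difference step: the critic's TD error drives both updates."""
        td_error = r + self.gamma * self.value[next_state] - self.value[state]
        self.value[state] += self.alpha * td_error        # critic update
        self.pref[state][action] += self.beta * td_error  # actor update
        return td_error
```

In a simulation loop, each traffic-signal agent would build a state with `make_state`, pick `DURATIONS[agent.act(state)]` as the phase duration, and call `update` at the next decision point with the observed `reward`. Raw queue counts make the table grow quickly, so a practical controller would likely discretize them or use function approximation.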
Keywords	Multi-agent systems, reinforcement learning, actor-critic architecture, traffic control