Human-Robot Collaborative Learning of a Bag Shaking Trajectory

Kartoun Uri, Stern Helman, and Edan Yael

Department of Industrial Engineering and Management, Ben-Gurion University of the Negev, Beer-Sheva 84105, ISRAEL

{kartoun, helman, yael}@bgu.ac.il


Abstract: This paper presents a collaborative reinforcement learning algorithm, CQ(λ), designed to accelerate learning by integrating a human operator into the learning process. The CQ(λ)-learning algorithm enables the sharing of knowledge between the robot and a human; the human, responsible for remotely monitoring the robot, suggests solutions when intervention is required. Based on its learning performance, the robot switches between fully autonomous operation and the integration of human commands. The CQ(λ)-learning algorithm was tested on a Motoman UP-6 fixed-arm robot required to empty the contents of a suspicious bag. Experimental results support the hypothesis that learning is faster when human collaboration is triggered than when the system functions autonomously.

 

Index Terms: Robot learning, reinforcement learning, human-robot collaboration.

I.     INTRODUCTION

Teleoperation is used when a task has to be performed in a hostile, unsafe, inaccessible or remote environment [1]. Crandall et al. [2] suggest two components of a human-robot system when the robot is remotely located: (i) autonomy mode - an artificial intelligence or computer control of a robot that allows it to act, for a time, without human intervention, and (ii) Human-Robotic Interfaces (HRI) - software installed at the human's location allowing the operator to perceive the world and the robot states, and to send instructions to the robot. One of the main issues in task-oriented HRI is achieving the right mixture of human and robot autonomy [3]. Adams [4] indicates the importance of HRI in meeting operators' requirements: "an understanding of the human decision process should be incorporated into the design of human-robotic interfaces in order to support the process humans employ."

Rosenstein et al. [5] describe an HRI that supports both adjustable autonomy and hierarchical task selection. With adjustable autonomy, a computer switches among several control modes ranging from full supervision to full autonomy. With hierarchical task selection, the interface allows an operator to easily solve a high-level task autonomously or else to guide a robot through a sequence of lower-level subtasks that may or may not involve autonomous control. Yanco et al. [6] define sliding scale autonomy as the ability to create new levels of autonomy between existing, pre-programmed autonomy levels. The suggested sliding scale autonomy system shows the ability to dynamically combine human and robot inputs, using a small set of variables such as user and robot speeds, speed limitations, and obstacle avoidance.

 

Reinforcement learning (RL) is regarded as learning through direct experimentation [7], [8]. It does not assume the existence of a teacher providing training examples. Instead, teaching derives from experience. The learner acts on the process to receive signals (reinforcements) from it, indications about how well it is performing the required task. These signals are usually associated with some dramatic condition - e.g., accomplishing a subtask (reward) or complete failure (punishment). The learning agent learns the associations between observed states and chosen actions that lead to rewards or punishments, i.e., it learns how to assign credit to past actions and states by correctly estimating costs associated with these events [9].

In [10], a collaborative process enabling a robotic learner to acquire concepts and skills from human examples is presented. During the teaching process, the robot must perform tasks based on human instructions. The robot executes its tasks by incorporating feedback until its hypothesis space has converged. Using a Q-learning approach, the robot learns a button pushing task. In [11], a variable autonomy approach is used. User commands serve as training inputs for the robot learning component, which optimizes the autonomous control for its task. This is achieved by employing user commands to modify the robot's reward function. The potential of learning from both reinforcement and human rewards is illustrated in [12], [13], which show how the user reward and Q-value functions change accordingly. The task was to learn to optimally navigate to a specific target in a two-dimensional world with obstacles.

Q-learning requires no human intervention; the agent is placed in an unknown environment and explores it independently with the objective of finding an optimal policy. A disadvantage of this approach is the large amount of interaction with the environment required until an effective policy is determined. One way to alleviate this problem is to guide the agent using rules that suggest trajectories of successful runs through the environment [14], [15]. Mihalkova and Mooney [15] suggest an RL-based framework denoted as "relocation". At any time during training an agent can request to be placed in any state of the environment. The "relocation" approach assumes a cost per relocation, and seeks to limit the number of relocations. The approach requires minimal human involvement and consists of two agent conditions: (i) "in trouble" - the agent takes actions that turn out to be a poor choice; even though it learns from the negative experience, it wastes time in a portion of the state-space that is unlikely to be visited during optimal behavior, and (ii) "bored" - if the Q-values are updated by tiny amounts, the agent is not learning anything new in the current part of the environment. In this condition it is forced to relocate with greatest probability when updating a particular Q-value does not change it.

In [16] a cooperative RL algorithm for multi-agent systems denoted as the "leader-following Q-learning algorithm" is presented. The algorithm is based on a Markov or stochastic game, in which there are multiple stages, and each stage is a static Stackelberg game. The problem of multiple RL agents attempting to learn the value function of a particular task in parallel for the n-armed bandit task is investigated in [17]. A parallel reinforcement learning solution is suggested to overcome the problem of statistical overwhelming by the information of an agent that has a larger accumulated experience than the other agents. Experiments on a group of four foraging mobile robots learning to map robots' conditions to behaviors were conducted by Matarić [18]. The learning algorithm of the robots consists of reward functions that combine individual conditions of a robot (such as "grasped a puck" or "dropped puck away from home") and collaborative conditions, i.e., how close the robots are to each other. Individually, each robot learns to select the behavior with the highest value for each condition, to find and take home the most pucks. Evaluation of groups of three and four robots found that interference was a detriment; in general, the more robots were learning at the same time, the longer it took for each individual to converge. Additionally, [18] found that while measuring the "percent of the correct policy the robots learned in 15 minutes, averaged over twenty trials," the use of heterogeneous reward functions results in better performance but also suffers from the credit assignment problem.

As motivation for our approach, consider that the usual method for bomb squad personnel is to blow up a suspicious bag and any explosives contained therein. However, if the bag contains chemical, biological or radiological canisters, this method can lead to disastrous results. Furthermore, the "blow-up" method also destroys important clues such as fingerprints, type of explosive, detonators and other signatures of use in subsequent forensic analysis. Learning to extract a bag's contents by a robot that acquires knowledge from human advice is the subject addressed here. The learning task described in this paper is to observe the position of a plastic bag located on a platform, grasp it with a robot manipulator and shake out its contents into a collection container in minimum time.

Although Q-learning and its variation Q(λ) have been used in many robotic fields (e.g., [19]-[24]), accelerating the learning process remains important. This paper presents a new algorithm, referred to as CQ(λ), that accelerates learning. The CQ(λ) algorithm is a collaborative algorithm that integrates the experience of several agents. In this paper, we describe experiments applying the CQ(λ) algorithm to a system that integrates two agents - a robot and a human working cooperatively to achieve a common goal. The system has no a priori knowledge regarding efficient lifting and shaking policies for the bag; it learns this knowledge from experience and from human guidance. Although other alternatives are available for solving the proposed security problem, such as cutting open the bag or sliding out the inspected objects, the application was selected to serve as a test-bed for the CQ(λ) algorithm. Section II presents the new CQ(λ) algorithm. The test-bed learning application is described in section III, followed by the experimental setup in section IV. Experimental results and their analysis are given in sections V and VI, respectively. Concluding remarks follow in section VII.

II.     CQ(λ)-Learning

A.     Q and Q(λ) Learning

The basic assumption in reinforcement learning studies is that any state $s_{t+1}$ reached by the agent must be a function only of its last state and action: $s_{t+1} = f(s_t, a_t)$, where $s_t \in S$ and $a_t \in A$ are the state and action at step $t$, respectively [9]. In Q-learning, the system estimates the optimal action-value function directly and then uses it to derive a control policy using the local greedy strategy [25]. It is stated in [23] that "Q-learning can learn a policy without any prior knowledge of the reward structure or a transition model. Q-learning is thus referred to as a model-free approach." It does not require mapping from actions to states and it can calculate the $Q$ values directly from the elementary rewards observed. $Q$ is the system's estimate of the optimal action-value function [26]. It is based on the action value measurement $Q^*(s_t, a_t)$, defined in (1):

$$Q^*(s_t, a_t) = E\left[ r(s_t, a_t) + \gamma V^*(s_{t+1}) \right] \tag{1}$$

which represents the expected discounted cost (with discount factor $\gamma$) for taking action $a_t$ when visiting state $s_t$ and following an optimal policy thereafter. From this definition and as a consequence of Bellman's optimality principle [27], (2) is derived:

$$Q^*(s_t, a_t) = E\left[ r(s_t, a_t) + \gamma \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1}) \right] \tag{2}$$

The essence of Q-learning is that these characteristics (maximum operator inside the expectation term and policy independence) allow an iterative process for calculating an optimal action. The first step of the algorithm is to initialize the system's action-value function, $Q$. Since no prior knowledge is available, the initial values can be arbitrary (e.g., uniformly zero). Next, the system's initial control policy, $\pi$, is established. This is achieved by assigning to $\pi(s)$ the action that locally maximizes the action-value. At time-step $t$, the agent visits state $s_t$ and selects an action $a_t$, receives from the process the reinforcement $r(s_t, a_t)$ and observes the next state $s_{t+1}$. Then it updates the action value $Q(s_t, a_t)$ according to (3), which describes one step of Q-learning:

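$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r(s_t, a_t) + \gamma V(s_{t+1}) - Q(s_t, a_t) \right] \tag{3}$$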

where $V(s_{t+1}) = \max_{a} Q(s_{t+1}, a)$ is the current estimate of the optimal expected cost $V^*(s_{t+1})$ and $\alpha$ is the learning rate, which controls how much weight is given to the reward just experienced, as opposed to the old $Q$ estimate. The process repeats until a stopping criterion is met (e.g., the robot has emptied the bag of its contents). The greedy action $\arg\max_{a} Q(s_t, a)$ is the best action the agent can perform when in state $s_t$. For the initial stages of the learning process, however, it uses randomized actions that encourage exploration of the state-space. Under some reasonable conditions [28] this is guaranteed to converge to the optimal $Q$-function [26].
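The one-step update described above can be illustrated with a minimal tabular sketch in Python. The function name `q_learning_step`, the nested-list representation of the table, and the values of `alpha` and `gamma` are assumptions made for illustration only; they are not taken from the paper.

```python
def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update and return the temporal difference error.

    Q is a list of rows, one row of action values per state.
    alpha (learning rate) and gamma (discount factor) are illustrative values.
    """
    v_next = max(Q[s_next])                  # V(s_{t+1}) = max_a Q(s_{t+1}, a)
    td_error = r + gamma * v_next - Q[s][a]  # bracketed term of the update
    Q[s][a] += alpha * td_error              # move the old estimate toward the target
    return td_error


# Three states, two actions, initialized uniformly to zero.
Q = [[0.0, 0.0] for _ in range(3)]
q_learning_step(Q, s=0, a=1, r=1.0, s_next=2)
print(Q[0][1])  # 0.1
```

With a zero-initialized table, a reward of 1 moves the visited entry by exactly `alpha`, which is why a small learning rate makes purely autonomous learning slow.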

A generalization of Q-learning, represented by Q(λ) [25], [29], uses eligibility traces, $e(s_t, a_t)$: one-step Q-learning is a particular case with $\lambda = 0$ [30]. The Q-learning algorithm learns quite slowly because only one time-step is traced for each action [10]. To boost learning convergence, a multi-step tracing mechanism, the eligibility trace, is used in which the $Q$ values of a sequence of actions can be updated simultaneously according to the respective lengths of the eligibility traces [19].

"The convergence of Q(λ) is not assured anymore for $\lambda > 0$, but experience shows that learning is faster" [30]. Several action selection policies are described in the literature. The greedy policy (e.g., [31]) always chooses the best action. Other policies (e.g., "softmax" or "ε-greedy" [32]) are stochastic and occasionally choose a suboptimal action in order to explore the state-action space.

B.     CQ(λ) - Cooperative Q(λ) Learning

The proposed CQ(λ) learning algorithm (Fig. 1) has the objective of accelerating learning in a system composed of several similar learning processes. Unlike [18], where the learning algorithms of multiple robots consist of reward functions that combine individual conditions of a robot, the CQ(λ) learning algorithm is based on a state-action value of an agent or learning process being updated according to the maximal value among all other state-action values existing in the learning system (4); collaboration consists of taking the maximum of the action values, i.e., the Q-value, across all learners at each update step [33]. Similar to the "leader-following Q-learning algorithm" in the joint policy approach described in [16], the CQ(λ) learning algorithm enables the sharing of knowledge between several agents (the human and the robot). Unlike [12] and [13], the CQ(λ) algorithm does not allow human rewards to be inserted directly into its Q-value functions. CQ(λ) rewards are achieved by interaction of learning agents with the environment.

Fig. 1. CQ(λ)-learning algorithm

 (4)

In (4), $\delta$ is the temporal difference error that specifies how different the new value is from the old prediction, and $e$ is the eligibility trace that specifies how much a state-action pair should be updated at each time-step. When a state-action pair is first visited, its eligibility is set to one. Then at each subsequent time-step it is reduced by a factor $\gamma\lambda$. When it is subsequently visited, its eligibility trace is increased by one [34].
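The trace mechanics and the maximum-based sharing of action values can be sketched as follows. This is one possible reading of the description in this section, not the authors' implementation: the function names `q_lambda_step` and `share_maximum`, the parameter values, and the number 0.7 are hypothetical, and the sketch omits the trace reset after exploratory actions used in some Q(λ) variants.

```python
def zeros(n_states, n_actions):
    """Return an n_states x n_actions table filled with zeros."""
    return [[0.0] * n_actions for _ in range(n_states)]


def q_lambda_step(Q, e, s, a, r, s_next, alpha=0.1, gamma=0.9, lam=0.5):
    """One Q(lambda) step for a single learner with accumulating traces."""
    delta = r + gamma * max(Q[s_next]) - Q[s][a]  # temporal difference error
    e[s][a] += 1.0                                # visited pair: trace grows by one
    for i in range(len(Q)):
        for j in range(len(Q[i])):
            Q[i][j] += alpha * delta * e[i][j]    # update in proportion to the trace
            e[i][j] *= gamma * lam                # every trace decays by gamma * lambda
    return delta


def share_maximum(q_tables):
    """Collaboration step: every learner adopts the entry-wise maximum Q value."""
    for i in range(len(q_tables[0])):
        for j in range(len(q_tables[0][0])):
            best = max(Q[i][j] for Q in q_tables)
            for Q in q_tables:
                Q[i][j] = best


robot_Q, robot_e, human_Q = zeros(2, 2), zeros(2, 2), zeros(2, 2)
human_Q[1][0] = 0.7  # value obtained from a human-suggested policy (made-up number)
q_lambda_step(robot_Q, robot_e, s=0, a=1, r=1.0, s_next=1)
share_maximum([robot_Q, human_Q])
print(robot_Q)
```

After the sharing step both tables hold the robot's own experience and the value contributed through the human-suggested policy, which is the sense in which knowledge is shared between the learners.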

C.     CQ(λ)-Learning for Human-Robot Systems

When only two learning agents are involved, such as a robot and a human (Fig. 2), the robot learning function acquires state-action values achieved from policies suggested by a human operator (HO). In this case, a robot learning performance threshold is defined as $\Lambda$, a minimum acceptable performance below which the human is called. The threshold $\Lambda$ is compared with the average number of rewarded policies, $L_{ave}$, over the last $N$ learning trials considered (5):

$$L_{ave} = \frac{1}{N} \sum_{i=n-N}^{n-1} S_i \tag{5}$$

where $n$ is the current learning trial, $i = n-N, \ldots, n-1$, and $S_i \in \{0, 1\}$ indicates whether a policy was successful ($S_i = 1$: at least one item fell from the bag) or failed ($S_i = 0$: no items fell from the bag) for the $i$th trial. Based on this learning performance threshold, the robot switches between fully autonomous operation and the integration of human commands.

Two levels of collaboration are defined: (i) autonomous - the robot decides which actions to take, acting autonomously according to its Q(λ) learning function, and (ii) semi-autonomous - the HO suggests actions remotely and the robot combines this knowledge, i.e., CQ(λ)-learning is performed. Human-robot collaboration is unnecessary as long as the robot learns policies autonomously and adapts to new states. The HO is required to intervene and suggest alternative parameters for shaking policies if the robot reports that its learning performance is low (6).

 (6)

The robot learning performance threshold, $\Lambda$, a pre-defined minimum acceptable performance measure below which the human is called, is compared with $L_{ave}$, the average number of rewarded policies over the last $N$ learning policies performed, i.e., the robot switches its learning level from autonomous (self-performed shaking) to semi-autonomous (acquiring human knowledge) based on its learning performance (see example in Fig. 3).
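The switching rule between the two collaboration levels can be sketched as a moving average of recent trial outcomes compared against a threshold. The class name `CollaborationSwitch`, the threshold of 0.5 and the window of 5 trials are hypothetical choices for illustration. The sketch also assumes that, before a full window of trials is available, the average is taken over the trials recorded so far, and that an empty history counts as low performance.

```python
from collections import deque


class CollaborationSwitch:
    """Track recent trial outcomes and choose the collaboration level."""

    def __init__(self, threshold, window):
        self.threshold = threshold            # minimum acceptable performance
        self.outcomes = deque(maxlen=window)  # keeps only the most recent trials

    def record(self, success):
        """Store 1 if at least one item fell from the bag, otherwise 0."""
        self.outcomes.append(1 if success else 0)

    def average(self):
        """Average number of rewarded policies over the recorded window."""
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def mode(self):
        """Ask for human intervention when the average is below the threshold."""
        if self.average() < self.threshold:
            return "semi-autonomous"
        return "autonomous"


switch = CollaborationSwitch(threshold=0.5, window=5)
for outcome in (True, False, False, False, False):
    switch.record(outcome)
print(switch.average(), switch.mode())  # 0.2 semi-autonomous
```

Using a bounded `deque` means old outcomes drop out automatically, so a run of successful policies returns the robot to autonomous operation without any explicit reset.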

III.     Description of the Collaborative Learning Scheme

The system is defined by  where  is a robot,  a bag object, and  an environment containing a platform on which the inspected bag is manipulated.  is a task performed on , using , within the environment . For the task of emptying the contents of a suspicious plastic bag, learning task, , is to observe the position of the bag, located on a platform, grasp it with a robot manipulator and shake out its contents in minimum time. It is assumed that the number of items in the bag is known in advance.

Robot states denoted as  (Table I) are its gripper location in a three-dimensional grid (Fig. 4). The performance of the task  is a function of a set of actions, , for each physical state of .

Fig. 2. Robot and a human operator CQ(λ)-learning


Fig. 3. Example of learning performances

TABLE I: State Description of the Three-Dimensional Grid

State(s)

Description

Number of states

state center

1

, ,,

,,

states where the robot can move over its axis

6

, ,,

,,

states where the robot can move over its axis

6

, ,,

,,

states where the robot can move over its axis

6



Fig. 4. Robot state-space

A typical learning episode (trial) starts when the robot arm is located above an inspection surface at a "center shaking position". The robot is programmed in advance to grasp a plastic bag located in a fixed position. Grasping includes moving down to the bag, sliding under the bag, grasping the bag, then moving its gripper to a "center shaking position". The "center shaking position" was chosen at the origin of the X and Y axes and high enough above the inspection surface to allow the bag to be shaken freely without touching the ground when it is being shaken vertically over the Z axis. Further, this position was chosen to prevent the robot from hitting itself while performing the fast shaking movements.

An action consists of a robot movement in the direction of one of the X, Y or Z axes. The robot starts a shaking policy from the center state. From the center state it can move to any of the other 18 states. The distance (denoted as "the amplitude") between any two adjacent states is set before a shaking policy is performed (e.g., distances between  and  or between  and  are 30 mm). From a robot state other than the center state, an action performed by the robot is limited to (a) its mirror position along its present axis, or (b) back to the center state (e.g., from state  the robot can move only to  or to ). In addition to the dependency of an action on the predefined amplitude, two speed values are defined. This doubles the number of possible actions the robot can take, resulting in 108 possible actions.
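The count of 108 actions can be checked by enumerating the moves allowed from each state. In the sketch below the axis labels, the signed offsets used to name the six states on each axis, and the speed labels are hypothetical stand-ins; only the structure (one center state, six states per axis, mirror-or-center moves, two speeds) follows the description above.

```python
AXES = ("X", "Y", "Z")
OFFSETS = (-3, -2, -1, 1, 2, 3)  # hypothetical labels for the six states on each axis
CENTER = ("center", 0)
SPEEDS = ("slow", "fast")        # the two speed values, labels are illustrative

states = [CENTER] + [(axis, off) for axis in AXES for off in OFFSETS]


def moves(state):
    """Return the states reachable from `state` in one action."""
    if state == CENTER:
        return [s for s in states if s != CENTER]  # any of the other 18 states
    axis, off = state
    return [(axis, -off), CENTER]                  # mirror position or back to center


actions = [(s, target, speed)
           for s in states
           for target in moves(s)
           for speed in SPEEDS]

print(len(states), len(actions))  # 19 108
```

The center state contributes 18 moves and each of the 18 other states contributes 2, giving 54 moves; two speeds per move yields the 108 actions stated in the text.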

Let a policy  be a set of 100 state-action pairs, . Three performance measures,  where , were defined for a policy :

(i)  - average time to complete (or partially complete) emptying the contents of a bag. This value is a measure of the time at which each object  falls from the bag during a shaking policy. Shaking time differs between learning episodes because different actions are performed. In addition, for example, the time of moving from  to  is not equal to the time of moving from  to . Other time differences occur when the human intervenes and changes motion speeds and amplitudes.

(ii)  - human intervention rate. This measure represents the percentage of human interventions out of the total number of learning trials. Human operator (HO) collaboration is triggered when robot learning is slow.  represents the degree of HO collaboration (the lower it is, the more autonomous the robot is).

(iii)  - average reward. Robot learning experience is gained through direct interaction with the environment according to rewards based on the number of items falling from a bag during a shaking trial and the times at which they fall. Its value depends on the times at which the objects fall, according to (7).

 (7)

where  is a constant,  is the time horizon for a learning episode and  is the time past from the beginning of the shaking episode till an object falls out of the bag. If an item is not extracted then .

IV.     Experimental Setup

The experimental setup contains a Motoman UP-6 fixed-arm robot positioned over an inspection surface (Fig. 5a). A dedicated gripper designed for smooth slippage under a bag was used. At the beginning of each learning trial, the robot grasps and lifts a bag containing five wooden cubes (Fig. 5b) to a center position. It then initiates a shaking policy.

(a) Robot and inspection surface
(b) Plastic bag and cube
Fig. 5. Experimental setup

The UP-6 robot has no a priori knowledge in its initial autonomous learning stage. In the subsequent cooperative stage, in addition to the robot interacting with the environment and receiving rewards / punishments, policy adjustments are provided by the HO through an interface (Fig. 6).

Fig. 6. Human interface

The interface views and controls consist of: (i) "Visual Feedback" - real-time visual feedback captured from a web-camera located over the robotic scene; (ii) "System Performance" - system learning performance, reported at the end of each episode; (iii) "Collaboration Level" - autonomous or semi-autonomous, and (iv) "Human Decision Making" - when asked to intervene, the HO can determine robot shaking amplitude and speeds. The robot learning function acquires the suggested parameters and performs a new shaking policy.

An experiment using CQ(λ)-learning was performed. Fifty episodes, each containing 100 state-action learning steps, were separated into two stages: (i) autonomous learning - during the first 10 episodes the robot performs shaking policies autonomously and no human intervention is allowed. The shaking parameters for the first episode were set to an amplitude of 30 mm and to speeds of 1000 and 1500 mm / s, and (ii) collaborative learning - in this stage, which consists of 40 episodes, triggering of human intervention was allowed, based on the system learning performance. In this stage the human can adjust the parameters of shaking policies in the range of 10 to 50 mm for the amplitude and in the range of 100 to 1500 mm / s for the speeds.

We arbitrarily chose ten learning episodes for the autonomous stage. Since the average learning performance is assumed to be low in the early stages of learning, the robot would otherwise have been triggered immediately to ask for human intervention; autonomous exploration is therefore necessary at the beginning of learning. (This value of 10 will be parameterized in subsequent studies.)

The system parameters were set to: , , , and . To balance between exploration and exploitation (e.g., [35]), ε-greedy action selection with  was used. Although the interface shows that the speeds and amplitudes can be changed independently on each of the three coordinate axes (Fig. 6), in the experiment performed the same speeds were chosen for all axes when human intervention was triggered. This restriction will be relaxed in subsequent studies.
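For readers unfamiliar with ε-greedy action selection, a minimal sketch follows. The function name `epsilon_greedy` and the example action values are hypothetical, and no particular value of ε is implied; the value used in the experiment is not reproduced here.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index: explore with probability epsilon, otherwise exploit.

    q_values is the list of action values for the current state.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # exploration: uniformly random action
    # exploitation: index of the largest action value (first one on ties)
    return max(range(len(q_values)), key=q_values.__getitem__)


values = [0.1, 0.9, 0.3]
print(epsilon_greedy(values, epsilon=0.0))  # 1
```

With `epsilon=0.0` the rule reduces to the greedy policy; larger values trade reward for wider coverage of the state-action space.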

V.     Experimental Results

Summary of results is shown in Table II. Results include measurements of ,  and  (See section III).

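TABLE II: Experimental results

| Measure | Quantity | Autonomous learning - stage I | Collaborative learning - stage II |
|---|---|---|---|
| Average time * | Time (s) | 11.67 | 12.98 |
| | Standard deviation | 1.22 | 1.33 |
| Human intervention rate ** | [%] | - | 20 |
| Average reward *** | Score | 26.5 | 70.57 |
| | Standard deviation | 41.6 | 43.02 |
| | Number of trials | 10 | 40 |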

*  - average time to complete (or partially complete) emptying the contents of a bag.

**  - human intervention rate.

***  - average reward.

Averaging over learning episodes, the average reward () is plotted as it improves (Fig. 7).

Fig. 7. Average reward performance () over 50 learning episodes

Results indicate a slight deterioration (7.2%) in the average time to empty the bag in the collaboration stage compared with the autonomous stage, with the human collaborating in 20% of the learning episodes. This increase in time can be explained by the fact that shaking policies with higher amplitudes achieved better reward performance. This results in a tradeoff, because policies with higher reward performance lasted longer.

VI.     Analysis of Results

From a reward perspective, the average reward in the collaboration stage was 266.3% of that in the autonomous stage (70.57 versus 26.5 in Table II), a significant improvement. This strengthens the hypothesis that learning is faster in the collaboration stage than in the autonomous stage.
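The 266.3% figure and the 20% intervention rate can be reproduced from numbers reported elsewhere in the paper (the average rewards in Table II, the eight intervention requests and the 40 collaborative episodes). The variable names below are chosen for illustration.

```python
reward_autonomous = 26.5      # average reward, stage I (Table II)
reward_collaborative = 70.57  # average reward, stage II (Table II)

ratio = reward_collaborative / reward_autonomous
print(f"{ratio:.1%}")  # 266.3%

interventions = 8             # human intervention requests observed
collaborative_episodes = 40   # episodes in which intervention was allowed
intervention_rate = interventions / collaborative_episodes
print(f"{intervention_rate:.0%}")  # 20%
```

Read this way, 266.3% is the ratio of the two average rewards, which corresponds to an increase of about 166% over the autonomous stage.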

From a state-action (Q-value) perspective, we noticed that the highest Q values over all 50 learning trials performed are related to five pairs: , , , , and , i.e., the preferred shaking policy found occurs when most of the robot actions are horizontal over the  axis (Fig. 4). Because every shaking policy performed by the robot starts from the center state, we analyzed the Q values for these three states, i.e., , , and , for different trials:

·        The Q table for the first episode indicates similar Q values for all state-action pairs (so actions are effectively selected at random).

·        For the 10th learning trial, high Q values were observed for various state-action transitions: , , ,  , and .

·        From the 11th learning trial, as described, the robot is allowed to ask for human intervention. Since we observed eight requests for intervention (all occurring from the 11th to the 22nd episode), the high Q values at the 22nd learning trial are of interest: , , , , and .

·        For the 50th learning trial, high , , and  values were observed.

For the bag shaking learning problem, it was found that the total reward achieved by the system increases with human intervention. Intuitively, vertical shaking seems best. It was found, however, that the policy of shaking most of the time over the horizontal  axis with a small number of actions over the  axis was the most effective. This counterintuitive result has an easy explanation: the plastic bag was loosely tied closed by an overhand knot, and pulling it vertically tends to tighten the knot, while a sideways swing loosens it. Further, from the Q values observed, it seems that the system starts by exploring the state-action space randomly. Then, when provided with human collaboration, it converges to a shaking policy mostly in the horizontal plane, , passing back and forth through the center using the highest possible amplitude. A better policy is found later in the learning process, when robot shaking skips the center state and follows an oscillatory, horizontal, high-amplitude motion. This policy caused the bag to occasionally become entangled around the robot gripper, because most of the plastic bag's weight is concentrated at its bottom and the swinging momentum carried it over the gripper. Hypothetically, if a human were free to hold and shake the bag, he or she could see the optimal time to reverse the horizontal movement to avoid entanglement, an ability the robot does not have. Such an ability could be developed with an active vision system.

VII.     Conclusions

An RL-based human-robot collaborative learning and decision-making method was developed for the complex task of bag shaking. The robot decides whether to learn the task autonomously or to ask for human intervention. The approach aims to integrate user instructions into an adaptive and flexible control framework and to adjust control policies on-line. To achieve this, user commands at different levels of abstraction are integrated into an autonomous learning system. Based on its learning performance, the robot switches between fully autonomous operation and the integration of human commands.

The CQ(λ) learning algorithm accelerates learning in systems composed of several learning agents or systems designed for human-robot interaction, overcoming the main criticism of the reinforcement learning approach, i.e., long training periods. From a reward perspective, the average reward in the collaboration stage was 266.3% of that in the autonomous stage, a significant improvement. This strengthens the hypothesis that learning is faster in the collaboration stage than in the autonomous stage.

Future work will include the design and implementation of extended collaboration strategies that include human intervention using refined or rough intervention policies, robot autonomy, and pure human control. Additionally, system automation will be expanded; one approach might be to use a digital scale for counting the extracted objects in real-time. Further, a sensitivity analysis will be provided, especially for the trial at which the collaboration stage is initiated (the trial in which collaborative control is first allowed).

Acknowledgements

This work was partially supported by the Paul Ivanier Center for Robotics Research and Production Management, Ben-Gurion University of the Negev and by the Rabbi W. Gunther Plaut Chair in Manufacturing Engineering.

References

  1. J. Bukchin, R. Luquer, and A. Shtub, "Learning in tele-operations", IIE Trans., 2002, vol. 34, no. 3, pp. 245-252.
  2. J. W. Crandall, M. A. Goodrich, D. R. Olsen, and C. W. Nielsen, "Validating Human-Robot Interaction Schemes in Multi-Tasking Environments," IEEE Trans. on Systems, Man, and Cybernetics Part A: Systems and Humans, Special Issue on Human-Robot Interaction, July 2005, vol. 35, no. 4, pp.438-449.
  3. A. M. Steinfeld, T. W. Fong, D. Kaber, M. Lewis, J. Scholtz, A. Schultz, and M. Goodrich, "Common Metrics for Human-Robot Interaction," Human-Robot Interaction Conf., ACM, March, 2006.
  4. J. A. Adams, "Critical Considerations for Human-Robot Interface Development," AAAI Fall Sym.: Human Robot Interaction Technical Report FS-02-03, Nov. 2002, pp.1-8.
  5. M. T. Rosenstein, A. H. Fagg, S. Ou, and R. A. Grupen, "User intentions funneled through a human-robot interface," Proc. of the 10th Int. Conf. on Intelligent User Interfaces, 2005, pp. 257-259.
  6. H. A. Yanco, M. Baker, R. Casey, A. Chanler, M. Desai, D. Hestand, B. Keyes, and P. Thoren, "Improving human-robot interaction for remote robot operation," Robot Competition and Exhibition Abstract, National Conf. on Artificial Intelligence (AAAI-05), July 2005.
  7. L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement learning: a survey," Journal of Artificial Intelligence Research, 1996, vol. 4, pp. 237-285.
  8. W. D. Smart, Making Reinforcement Learning Work on Real Robots, Ph.D. Dissertation, Brown University, 2002.
  9. C. Ribeiro, "Reinforcement learning agents," Artificial Intelligence Review, 2002, vol. 17, no. 3, pp. 223-250.
  10. A. Lockerd and C. Breazeal, "Tutelage and socially guided robot learning," Proc. of IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, 2004, Sendai, Japan.
  11. Y. Wang, M. Huber, V. N. Papudesi, and D. J. Cook, "User-guided reinforcement learning of robot assistive tasks for an intelligent environment," Proc. of the IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, 2003.
  12. V. N. Papudesi and M. Huber, "Learning from reinforcement and advice using composite reward functions," Proc. of the 16th Int. FLAIRS Conf., 2003, pp. 361-365, St. Augustine, FL.
  13. V. N. Papudesi, Y. Wang, M. Huber, and D. J. Cook, "Integrating user commands and autonomous task performance in a reinforcement learning framework," AAAI Spring Sym. on Human Interaction with Autonomous Systems in Complex Environments, 2003, Stanford University, CA.
  14. K. Driessens and S. Džeroski, "Integrating guidance into relational reinforcement learning," Machine Learning, 2004, vol. 57, pp. 271-304.
  15. L. Mihalkova and R. Mooney, "Using active relocation to aid reinforcement learning," Proc. of the 19th Int. FLAIRS Conf., May 2006, Melbourne Beach, Florida, to be published.
  16. D. Gu and H. Hu, "Fuzzy multi-agent cooperative Q-learning," Proc. of IEEE Int. Conf. on Information Acquisition, 2005, Hong Kong, China, pp. 193-197.
  17. R. M. Kretchmar, "Parallel reinforcement learning," The 6th World Conf. on Systemics, Cybernetics, and Informatics, 2002.
  18. M. J. Matarić, "Reinforcement learning in the multi-robot domain," Autonomous Robots, Kluwer Academic Publishers, 1997, vol. 4, pp. 73-83.
  19. W. Zhu and S. Levinson, "Vision-based reinforcement learning for robot navigation," Proc. of the Int. Joint Conf. on Neural Networks, 2001, vol. 2, pp. 1025-1030, Washington DC.
  20. P. Kui-Hong, J. Jun, and K. Jong-Hwan, "Stabilization of biped robot based on two mode Q-learning," Proc. of the 2nd Int. Conf. on Autonomous Robots and Agents, 2004, New Zealand.
  21. A. F. Massoud and L. Caro, "Fuzzy neural network implementation of Q(λ) for mobile robots," WSEAS Trans. on Systems, 2004, vol. 3, no. 1.
  22. Y. Dahmani and A. Benyettou, "Seek of an optimal way by Q-learning," Journal of Computer Science, 2005, vol. 1, no. 1, pp. 28-30.
  23. R. Broadbent and T. Peterson, "Robot learning in partially observable, noisy, continuous worlds," Proc. of the 2005 IEEE Int. Conf. on Robotics and Automation, 2005, Barcelona, Spain.
  24. T. Martínez-Marín and T. Duckett, "Fast reinforcement learning for vision-guided mobile robots," Proc. of the 2005 IEEE Int. Conf. on Robotics and Automation, 2005, Barcelona, Spain.
  25. C. J. C. H. Watkins, Learning from Delayed Rewards, Ph.D. Dissertation, Cambridge University, 1989.
  26. W. D. Smart and L. Kaelbling, "Practical reinforcement learning in continuous spaces," Proc. of the 17th Int. Conf. on Machine Learning, 2002.
  27. R. Bellman and R. Kalaba, Dynamic Programming and Modern Control Theory, NY: Academic Press Inc., 1965.
  28. C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, 1992, vol. 8, pp. 279-292.
  29. J. Peng and R. Williams, "Incremental multi-step Q-learning," Machine Learning, 1996, vol. 22, no. 1-3, pp. 283-290.
  30. P. Y. Glorennec, "Reinforcement Learning: an overview," European Sym. on Intelligent Techniques, Aachen, Germany, 2000.
  31. S. Natarajan and P. Tadepalli, "Dynamic preferences in multi-criteria reinforcement learning," Proc. of the 22nd Int. Conf. on Machine Learning (ICML 2005), Bonn, Germany, 2005.
  32. R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, Cambridge, MA: MIT Press, 1998.
  33. U. Kartoun, H. Stern, Y. Edan, C. Feied, J. Handler, M. Smith, and M. Gillam, "Collaborative Q(λ) Reinforcement Learning Algorithm - A Promising Robot Learning Framework," IASTED Int. Conf. on Robotics and Applications, 2005, U.S.A.
  34. A. K. Mackworth, D. Poole, and R. G. Goebel, Computational Intelligence: A Logical Approach, Oxford University Press, 1998.
  35. M. Guo, Y. Liu, and J. Malec, "A new Q-learning algorithm based on the metropolis criterion," IEEE Trans. on Systems, Man, and Cybernetics - Part B: Cybernetics, 2004, vol. 34, no. 5, pp. 2140-2143.