Cooperative Human-Robot Learning System using a Virtual Reality Telerobotic Interface

Yael Edan, Uri Kartoun, Helman Stern

Abstract — A cooperative human-robot learning system for remote robotic operations using a virtual reality (VR) interface is presented. The paper describes the overall system architecture, and the VR telerobotic system interface. Initial tests using on-line control through the VR interface for the task of shaking out the contents of a plastic bag are presented. The system employs several state-action policies. System states are defined by: type of bag, status of robot, and environment. Actions are defined by initial grasping point, lift and shake trajectory. A policy is a set of state-action pairs to perform a robotic task. The system starts with knowledge of the individual operators of the robot arm, such as opening and closing the gripper, but it has no policy for deciding when these operators are not appropriate, nor does it have knowledge about the special properties of the bags. An optimal policy is defined as the best action for a given state that is learned from experience and human guidance. A policy is found to be beneficial if a bag was grabbed successfully and all its contents extracted.

 

Index Terms — telerobotics, human-robot collaboration, robot learning

All authors are with the Dept. of Industrial Engineering and Management, Ben-Gurion University of the Negev, Beer Sheva, Israel 84105. E-mails: {yael, kartoun, helman}@bgu.ac.il

1. Introduction


A telerobot is defined as a robot controlled at a distance by a human operator (HO). Telerobotic devices are typically developed for situations or environments that are too dangerous, uncomfortable, limiting, repetitive, or costly for humans to perform [Durlach and Mavor, 1995]. Applications include: underwater [e.g., Hsu et al., 1999], space [e.g., Hirzinger et al., 1993], and medical [e.g., Kwon et al., 1999; Sorid and Moore, 2000].

One of the difficulties associated with teleoperation is that the HO is remote from the robot; hence, feedback may be insufficient to correct control decisions. Therefore, the HO usually acts as a supervisor, defining the goals and current plans and receiving information about accomplishments and problems, while the robot executes the task based on information received from the HO plus its own artificial sensing and intelligence [Earnshaw et al., 1994].

Autonomous robots are designed to build physical systems that can accomplish useful tasks without human intervention [Manesis et al., 2002]. To accomplish a given task, the robot must collect sensory information concerning its external dynamic environment, make decisions, and take actions. Full autonomy is usually complicated to achieve in unstructured environments [Hagras and Sobh, 2002].

For telerobotic control, decision-making can be performed by a combination of knowledge-based autonomous procedures, sensor-based autonomous procedures, learning procedures and HO procedures. Learning improves autonomous operation by allowing the system to acquire and adapt its behavior to new, unknown conditions. Learning robots have resulted in improved performance and are especially suitable for changing and dynamic environments [Aycard and Washington, 2000; Asoh et al., 2001; Bhanu et al., 2001; Carreras et al., 2002]. However, their design is complicated and costly.

Humans can easily adapt to unpredictable task environments due to their superior intelligence and perceptual abilities. Introducing a HO into the system can help improve its performance and simplify the robotic system, making the telerobotic system a viable option [Rastogi et al., 1995]. Robot performance can be improved by taking advantage of human skills (e.g., perception, cognition) and benefiting from human advice and expertise. To do this, robots must function as active partners instead of as passive tools. They should have more freedom of action and be able to initiate interaction with humans instead of merely waiting for human commands [Fong et al., 2001]. Several systems exist in which a human and a robot work as partners, collaborating to perform tasks and to achieve common goals [Scárdua et al., 2000; Asoh et al., 2001]. Instead of a supervisor dictating to a subordinate, the human and the robot engage in dialogue to exchange ideas, to ask questions, and to resolve differences [Fong et al., 2001]. In particular, if a robot is treated not as a tool but as a partner that incorporates learning, the system can continuously improve.

 

Interface design has a significant effect on the way people operate a robot [Preece et al., 1994]. Examples of graphical models that allow users to control robots off-line through the Web [Hirukawa and Hara, 2000] can be found in the RobotToy research [Sorid and Moore, 2000], the KhepOnTheWeb project [Michel et al., 1997], the WITS (Web Interface for Telescience) project [Backes et al., 1999], the Tele-Garden project [Goldberg et al., 1995], the University of Western Australia's Telerobot experiment [Taylor and Trevelyan, 1995] and the PumaPaint project [Stein, 2000]. Virtual reality (VR) is a high-end human-computer interface allowing user interaction with simulated environments in real-time and through multiple sensorial channels [Burdea, 1999]. The increased interaction enables operators to perform tasks on remote real worlds, computer-generated worlds or any combination of both, resulting in improved performance [Hine et al., 1995; Burdea, 1999; Lee et al., 2000; Heguy et al., 2001].

The objective of this paper is to present the concept of a human-robot learning system for remote robotic operations using a VR interface. Problem definition and notation are presented in the next section. The overall system architecture and the VR telerobotic system interface are described in sections 3 and 4, respectively. To test the system, an advanced VR telerobotic bag shaking system is proposed. The task is to observe the position of an unknown suspicious bag located on a platform, classify it, grasp it with a robot manipulator and shake out its contents on a table or into a nearby container. Initial tests using on-line control through the VR interface for shaking out the contents of a plastic bag are presented in section 5. Section 6 concludes the paper.

2. Problem definition and notation

The system is defined by Σ = [R, O, E] where R is a robot, O an object, and E an environment. Let F be a set of features or patterns representing the physical state of Σ. Let ψ be a mapping from Σ to F. T is a task performed on O, using R, within the environment E to meet a goal G. The performance of the task T is a function of a set of actions, A, for each physical state of Σ. The state of the system, represented by an element of F, is denoted as S. Let a policy P be a set of state-action pairs, {S, A}. Let the performance measure be Z(F, P).

2.1 Goal

The goal, G, is to classify a bag correctly, grab it successfully and empty its contents onto a table or into a collection container in minimum time.

2.2 Task

The task, T, is to observe the position of an unknown bag (e.g., plastic bag, briefcase, backpack, or suitcase) located on a platform, grasp it with a robot manipulator and shake out its contents on a table or into a nearby container. It is assumed that all clasps, zippers and locks have already been opened by another robotic operation. The system is first trained to identify several bag classes, but it has no a-priori knowledge of efficient grasping and shaking policies. The system learns this knowledge from experience and from human guidance.

2.3 System

   Robot

Several states will be defined for the robot, R.

Robot states, SR, include:

{home, idle, performing a task}.

   Object

The system will contain different types of bags (e.g., plastic bag, briefcase, backpack, suitcase). Different geometrical attributes will be defined for each bag, O, to identify the bag. The state of the object So is defined by:

Sbag-class = {plastic bag, briefcase, backpack, suitcase, not recognized}.

Sbag-condition = {open, closed, orientation}.

Sbag-contents = {how many}.

   Environment

The robotic environment, E, contains a platform on which the inspected bag is manipulated, light sources, and extraneous objects such as an unwanted human hand.

Environmental states, SE, include:

Sobstructions = {how many, positions}

Sillumination = {light, day, night}
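The state definitions above can be pictured as a small record type. The following is a minimal sketch only: the class and field names (SystemState, RobotState, BagClass, n_contents and so on) are illustrative assumptions rather than identifiers from the system, and bag orientation is reduced to a single planar angle for simplicity.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class RobotState(Enum):
    HOME = "home"
    IDLE = "idle"
    PERFORMING_TASK = "performing a task"


class BagClass(Enum):
    PLASTIC_BAG = "plastic bag"
    BRIEFCASE = "briefcase"
    BACKPACK = "backpack"
    SUITCASE = "suitcase"
    NOT_RECOGNIZED = "not recognized"


class Illumination(Enum):
    LIGHT = "light"
    DAY = "day"
    NIGHT = "night"


@dataclass(frozen=True)
class SystemState:
    """Hypothetical encoding of S = {SR, SO, SE}; frozen, so it is hashable."""
    robot: RobotState
    bag_class: BagClass
    bag_open: bool
    bag_orientation_deg: float                     # assumption: one planar angle
    n_contents: Optional[int]                      # None when the count is unknown
    obstructions: Tuple[Tuple[float, float], ...]  # (x, y) of each obstruction
    illumination: Illumination


# Example state: an open plastic bag with ten items and a clear workspace.
state = SystemState(
    robot=RobotState.HOME,
    bag_class=BagClass.PLASTIC_BAG,
    bag_open=True,
    bag_orientation_deg=0.0,
    n_contents=10,
    obstructions=(),
    illumination=Illumination.LIGHT,
)
```

Making the record immutable and hashable is a design choice of this sketch; it lets a state serve directly as a key in a table of state-action pairs.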

2.4 Features

Let F be a set of features or patterns representing the state of Σ. F may include bag classifications, robot position and orientation and environmental conditions.

2.5 Mapping

Let ψ be a visual mapping function which obtains a digital image I of the system Σ and extracts a set of representative features denoted as F.

2.6 Actions

Actions, A, are command instructions such as grasping, shaking, etc.

2.7 Policy

Carrying out the task involves a policy P. Given F of Σ  =  [R, O, E], a policy P = {(F, A)} is a set of state-action pairs.

2.8 Performance measures

The performance measures, Z(F, P), include:

·          Classification accuracy.

·          Whether a bag was grabbed successfully or not.

·          Quality - how empty the bag is.

·          Time to completely empty the contents of a bag.

·          Number of learning iterations to complete the task.

·          Human intervention rate.

·          Abort task rate.

3. Methodology

3.1 System architecture

The system architecture (Fig. 1) consists of state-action classifiers that receive inputs from a vision system, a robotic system and a VR interface.

Fig. 1. System architecture


 

Fig. 2. System operation description

When the state-action classifiers have little or no knowledge about how to proceed on a task, they can try to obtain that knowledge through advice from the human operator (HO). This is achieved through human-robot collaboration. In this manner the HO can affect and change parameters of the learning algorithms on-line through the VR interface (e.g., by suggesting new actions such as shaking plans, intervening in case of misclassification).

In the proposed system, the search space in which to discover a successful solution may be quite large. To make this search tractable, the system should accept advice from the HO through its search. This requires the ability to identify when knowledge is needed, as well as, to provide the necessary problem-solving context for the HO so that supplying the knowledge is easy. It must also be possible for the system to proceed without advice when the HO is unavailable. If guidance is available, the system will utilize its own strategies together with advice to quickly construct a solution.
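The advice mechanism described above can be sketched as a lookup that consults the HO only when the policy has no entry for the current state and that still returns an action when no advice is available. This is an illustration under assumptions: the function name select_action, the action labels and the fixed default action (standing in for the system's own strategy) are all hypothetical.

```python
from typing import Callable, Dict, Hashable, Optional

# An advisor models the HO: it may return an action or decline with None.
Advisor = Callable[[Hashable], Optional[str]]


def select_action(
    state: Hashable,
    policy: Dict[Hashable, str],
    advisor: Optional[Advisor] = None,
    default_action: str = "abort",
) -> str:
    """Return an action for `state`, asking the HO only when knowledge is missing."""
    if state in policy:
        return policy[state]          # knowledge already available
    if advisor is not None:
        advice = advisor(state)       # request advice from the HO
        if advice is not None:
            policy[state] = advice    # remember the advice for next time
            return advice
    return default_action             # proceed without advice


# Hypothetical usage with made-up state and action labels.
policy = {"plastic bag": "grasp_center"}
action = select_action("briefcase", policy, advisor=lambda s: "grasp_handle")
```

In this sketch the advice is stored in the policy table, so the HO is asked at most once per state.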

3.2 System stages

The overall system operation is described in Fig. 2. The system operates through the following stages. Each stage is evaluated using the defined performance measures and results in success or failure. Human-intervention is enabled based on the performance measures.

A. Classification:

Description: Determine the state of the system (S) using visual and possibly tactile sensors. This may involve positioning a robot on-board camera ("hand-in-eye") to view the object position and the surrounding environment. It may also involve touching the object with a tactile sensor to assess its composition (soft cloth, plastic, hard plastic, etc.). Classification is performed by image processing [Kartoun, 2003].

Success: A bag was classified correctly.

Failure: Required for avoiding the object repositioning stage.

Human Intervention: Required for setup, to manually place various bags on the robot workspace. If failure occurs, the HO gives the correct classification.

Performance measure: Classification accuracy.

B. Grasping:

Description: The robot grasps the object.

Success: A bag was grasped successfully.

Failure: A bag was not grasped optimally or was not grasped at all.

Human Intervention: HO gives a correct set of possible grasping points.

Performance measure: Whether a bag was grasped successfully or not.

C. Repositioning:

Description: Re-arranging the position of the object to prepare it for easier grasping.

Success: A bag was grasped successfully.

Failure: A bag was not grasped at all.

Human Intervention: HO repeats this stage until the object is grasped successfully.

Performance measure: Whether a bag was grasped successfully or not.

D. Lift and shake:

Description: The robot lifts the object above the table or above a collection container and shakes out its contents.

Success: Always successful.

Human Intervention: Not required.

Performance measure: Time to completely empty the contents of a bag.

E. Verification:

Description: The system tries to verify if all the contents have been extracted.

Success: 1) The number of items that fell from the bag is higher than a predetermined threshold for a shaking policy and 2) the time to empty the contents of the bag is lower than a predetermined threshold.

Failure: 1) Not all items fell out; 2) time to empty the contents is too slow and 3) abort task rate is too high.

Human Intervention: 1) Suggest new grasping points through VR interface and 2) suggest new lifting and shaking policies through VR interface.

Performance measures: 1) Quality - how empty the bag is; 2) time to completely empty the contents of a bag and 3) abort task rate.
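The two success conditions of the verification stage can be written as a simple threshold test. The sketch below is a hypothetical illustration: the names ShakeOutcome and verify, and the threshold values in the example, are assumptions, and the abort task rate criterion is left out because it is measured over many trials rather than a single one.

```python
from dataclasses import dataclass


@dataclass
class ShakeOutcome:
    items_fallen: int   # items counted outside the bag after shaking
    elapsed_s: float    # time spent emptying the bag, in seconds


def verify(outcome: ShakeOutcome, min_items: int, max_time_s: float) -> bool:
    """Success when more than `min_items` fell out in less than `max_time_s`."""
    enough_items = outcome.items_fallen > min_items
    fast_enough = outcome.elapsed_s < max_time_s
    return enough_items and fast_enough


# Assumed thresholds for illustration: more than 9 items within 330 seconds.
ok = verify(ShakeOutcome(items_fallen=10, elapsed_s=300.0), min_items=9, max_time_s=330.0)
```

A False result corresponds to the failure case, where the HO may suggest new grasping points or new lifting and shaking policies.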

3.3 Learning

Q-learning [Watkins and Dayan, 1992], a reinforcement learning algorithm [Kaelbling et al., 1996], will be employed. In Q-learning the system estimates the optimal action-value function directly and then uses it to derive a control policy using the local greedy strategy. The advantage of Q-learning is that the update rule is policy free, as it is a rule that just relates Q values to other Q values. It does not require a mapping from actions to states and it can calculate the Q values directly from the elementary rewards observed.

Q is the system's estimate of the optimal action-value function. The first step of the algorithm is to initialize the system's action-value function, Q. Since no prior knowledge is available, the initial values can be arbitrary (e.g., uniformly zero). Next, the system's initial control policy, P, is established. This is achieved by assigning to P(S) the action that locally maximizes the action-value. That is, P(S) ← A, such that Q(S, A) = max_A' Q(S, A'), where ties are broken arbitrarily. The robot then enters a cycle of acting and policy updating.

First, the robot senses the current state, S = {SR, SO, SE}, where SR is the state of the robot, SO is the state of the object and SE is the state of the environment. It then selects an action A to perform next. An action consists of a set of coordinates of possible grasping points, lift trajectories (e.g., move the robotic arm 60 cm above the collection container and 30 degrees to the left) and shaking trajectories (e.g., move the robotic arm 10 cm horizontally and 15 cm vertically at a speed of 3 m/s for 5 cycles). Most of the time, this action will be the action specified by the system's policy P(S), but occasionally the system will choose a random action (choosing an action at random is a particularly simple mechanism for exploring the environment). Exploration is necessary to guarantee that the system will eventually learn an optimal policy. The system performs the selected action and notes the immediate reward r and the resulting state S'. The reward function is probabilistic and depends on the number of items falling out of the bag during the lifting and shaking operations. The action-value estimate for the state-action pair {S, A} is then updated.
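The acting and updating cycle can be illustrated with the standard tabular Q-learning rule, Q(S, A) ← Q(S, A) + α [r + γ max_A' Q(S', A') − Q(S, A)]. The following is a minimal sketch, not the system's implementation: the learning rate alpha, discount factor gamma and exploration probability epsilon are not specified in the text and are assumptions here, as are the state and action labels.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

QTable = Dict[Tuple[Hashable, Hashable], float]


def greedy_action(q: QTable, state: Hashable, actions: List[Hashable]) -> Hashable:
    """P(S): an action maximizing Q(S, A); ties are broken at random."""
    best = max(q[(state, a)] for a in actions)
    return random.choice([a for a in actions if q[(state, a)] == best])


def choose_action(q: QTable, state: Hashable, actions: List[Hashable],
                  epsilon: float) -> Hashable:
    """Follow the greedy policy, but pick a random action with probability epsilon."""
    if random.random() < epsilon:
        return random.choice(actions)
    return greedy_action(q, state, actions)


def q_update(q: QTable, state: Hashable, action: Hashable, reward: float,
             next_state: Hashable, actions: List[Hashable],
             alpha: float, gamma: float) -> None:
    """Move Q(S, A) toward r + gamma * max over A' of Q(S', A')."""
    target = reward + gamma * max(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (target - q[(state, action)])


q: QTable = defaultdict(float)            # initial values uniformly zero
actions = ["shake_slow", "shake_fast"]    # hypothetical action labels
# One observed transition: four items fell out, so the reward is taken as 4.0.
q_update(q, "bag_full", "shake_fast", reward=4.0, next_state="bag_half",
         actions=actions, alpha=0.5, gamma=0.9)
```

After this single update the greedy policy for the state "bag_full" prefers "shake_fast", since its estimate is now positive while the alternative is still zero.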

4. Virtual reality telerobotic interface

4.1 Physical experimental setup

In this work, a five degrees of freedom (DOF) articulated robot is controlled through a VR interface from a remote site for performing various tasks. The articulated robot is a "CRS-A255" robot system that consists of a robot arm and a controller (Fig. 3). The proposed VR telerobotic system contains a human operator (HO), a VR web-based control interface, an Internet access method, a remote server, a robot and its controller, and visual sensory feedback.

 
Fig. 3. Experimental setup

Fig. 4. Client-server communication architecture

The system is a client-server application (Fig. 4). The server contains a frame grabber connected to a camera mounted over the workspace. Robot control software is also located in the server. Two additional web-cameras are mounted over the workspace for visual feedback of the scene to the HO (the client). From the client site, the HO can take control over the robot through a VR interface.

4.2 User and control interface

A control interface was developed to enable HO interaction with the robot (Fig. 5). The interface includes a graphical model of the "CRS-A255" robot, two camera views of the real robot (overall and close views), a checkerboard on a table, and a world coordinate diagram that shows the x, y and z directions in the 3D scene.

The system has six different operational stages controlled through predefined control panels [Kartoun et al., 2004]. These are: changing the real robot's speed, showing a 3D grid that contains spatial locations which the robot gripper moves to when selected, selecting the viewing aspect of the VR model, planning shaking policies, planning off-line paths for transferring to the real robot, and on-line simultaneous control (in real-time) of the VR and real robots.

 

Fig. 5. Web-based interface (camera views, and VR model)

4.3 VR model

The VR model was developed using the "3D-Studio-Max" and "Alice" software packages [Kartoun, 2003]. "Alice", rapid prototyping software for creating interactive computer graphics applications, was chosen as the VR software [Conway et al., 1993]. It is designed to enable rapid development and prototyping of interactive graphics applications and uses "Python" as the language for writing its scripts.

4.4 Communication

The HO communicates through a web browser with the server, which is connected to the robotic arm. Commands sent from the VR client are transmitted through TCP/IP to the server, which extracts them and updates the real robot.
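The wire format of the commands is not described here, so the following is only a sketch of how a gripper target might be packaged on the client and extracted on the server. The newline-terminated JSON layout, the field names and the function names are all assumptions made for illustration, and the host and port passed to send_command are placeholders.

```python
import json
import socket
from typing import Dict


def encode_command(x: float, y: float, z: float, speed: float) -> bytes:
    """Client side: serialize one gripper target as a newline-terminated JSON line."""
    msg = {"cmd": "move", "x": x, "y": y, "z": z, "speed": speed}
    return (json.dumps(msg) + "\n").encode("utf-8")


def decode_command(raw: bytes) -> Dict[str, object]:
    """Server side: recover the command dictionary from one received line."""
    return json.loads(raw.decode("utf-8").strip())


def send_command(host: str, port: int, payload: bytes) -> None:
    """Open a TCP connection to the server and send one encoded command."""
    with socket.create_connection((host, port), timeout=5.0) as conn:
        conn.sendall(payload)


# Round trip without a network: what the server would extract from the payload.
payload = encode_command(0.1, 0.2, 0.3, speed=50.0)
command = decode_command(payload)
```

A text-based format is used in this sketch because it is easy to inspect while debugging; a binary format would serve equally well.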

4.5 System calibration

The inverse kinematics (IK) equations were solved using a closed-form analytical solution [Lander, 1998; Craig, 1989; McKerrow, 1991], which has the benefit of being exact and very fast to calculate. For the VR robot, an IK algorithm was implemented to determine the joint angles required to reach an end point location by supplying the (x, y, z) coordinates. A transformation matrix providing a one-to-one mapping between the VR and real robots is estimated from corresponding pairs of 32 intersection points on the checkerboard appearing in the VR and the real environments. The estimate is based on the method of aligning a pair of shapes, which uses a least-squares approximation [Cootes et al., 1992]. Given the calibrated transformation matrix, an experiment to determine the transformation error was performed. The experiment starts with controlling the VR robot arm through the VR interface. The coordinates of 32 points (x_vr, y_vr), selected from checkerboard intersections, were set as test points. These points were inserted into the inverse kinematic equations to obtain (x_r, y_r), which were sent to the real robot. A pen inserted into the real robot's gripper marked its controlled position. The coordinates of the robot's pen marks at the checkerboard intersection points, (x_r,m, y_r,m), were measured manually with a ruler in the real environment. The average transformation error between the robot's pen positions and the desired points on the real checkerboard was found to be 3 mm [Kartoun et al., 2004].
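The least-squares estimation and the error measurement can be sketched as follows, assuming the mapping is a 2D similarity transform (rotation, uniform scale and translation), as is common when aligning a pair of shapes. The function names and the synthetic 32-point grid are illustrative assumptions; no measured data from the experiment is reproduced.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Params = Tuple[float, float, float, float]


def fit_similarity(src: Sequence[Point], dst: Sequence[Point]) -> Params:
    """Least-squares (a, b, tx, ty) for x' = a*x - b*y + tx, y' = b*x + a*y + ty."""
    n = len(src)
    mx, my = sum(p[0] for p in src) / n, sum(p[1] for p in src) / n
    ux, uy = sum(p[0] for p in dst) / n, sum(p[1] for p in dst) / n
    num_a = num_b = den = 0.0
    for (x, y), (u, v) in zip(src, dst):
        xc, yc, uc, vc = x - mx, y - my, u - ux, v - uy  # centered coordinates
        num_a += xc * uc + yc * vc
        num_b += xc * vc - yc * uc
        den += xc * xc + yc * yc
    a, b = num_a / den, num_b / den
    # The translation maps the source centroid onto the destination centroid.
    return a, b, ux - (a * mx - b * my), uy - (b * mx + a * my)


def transform_point(params: Params, p: Point) -> Point:
    a, b, tx, ty = params
    return (a * p[0] - b * p[1] + tx, b * p[0] + a * p[1] + ty)


def mean_error(params: Params, src: Sequence[Point], dst: Sequence[Point]) -> float:
    """Average Euclidean distance between mapped source points and their targets."""
    total = 0.0
    for s, d in zip(src, dst):
        mapped = transform_point(params, s)
        total += math.hypot(mapped[0] - d[0], mapped[1] - d[1])
    return total / len(src)


# Synthetic example: a 4 x 8 grid (32 points) in VR coordinates, mapped by a
# known transform to stand in for the measured points in the real workspace.
vr_points: List[Point] = [(float(i), float(j)) for i in range(4) for j in range(8)]
true_params: Params = (0.0, 2.0, 1.0, -1.0)  # 90 degree rotation, scale 2, shift
real_points = [transform_point(true_params, p) for p in vr_points]
estimated = fit_similarity(vr_points, real_points)
error = mean_error(estimated, vr_points, real_points)
```

With noise-free synthetic points the fit recovers the known transform and the mean error is essentially zero; with measured points the same quantity would play the role of the average transformation error.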

5. System test using VR on-line control

Initial system testing was conducted using on-line control through the VR interface for the task of shaking out the contents of a plastic bag. The HO views the images of the robotic workspace in the client browser, and commands the robot by selecting points in the 3D VR scene. Using the calibrated mapping of points in the VR scene to the real workspace, the robot is controlled to perform the task. A view of the experiment to empty the contents of a plastic bag onto the platform is shown in Fig. 6. It is assumed that the bag contains ten identical electronic components known in advance. An inexperienced operator performed ten identical experiments, and performance times (the amount of time it takes to extract all the objects from the bag) were recorded. Task completion time decreased quickly, reaching the standard time (330 seconds) after 5 to 6 trials (Fig. 7).

6. Conclusion

Human-robot collaboration is unnecessary as long as a telerobotic system can adapt to new states and unexpected events. However, when an autonomous system fails, incorporating learning using human operator (HO) supervision and interventions can achieve improved performance.

In this work we have described the design, implementation and testing of a real-time VR-telerobotic web-based system. Visual feedback is inserted into the human interface via two independent web-cameras. A transformation matrix is used to map the VR scene into the real one. The system allows a HO to: (a) perform off-line path planning by manipulating an object in a VR robotic scene, and (b) perform on-line control by indirectly controlling the real robot through manipulation of its VR representation in real-time. Initial testing using on-line control indicated rapid learning, reaching the standard time (330 seconds) within 5 to 6 trials. Future research will include integration of learning, as presented above, to improve performance. A cooperative human-robot learning system for remote robotic operations using this VR interface is underway.

(a) Overall view

(b) Close view

Fig. 6. "CRS-A255" robot, plastic bag, platform and components

Fig. 7. Learning curve experimental results

References

[1]        H. Asoh, N. Vlassis, Y. Motomura, F. Asano, I. Hara, S. Hayamizu, K. Ito, T. Kurita, T. Matsui, R. Bunschoten and B. J. A. Kröse. 2001. Jijo-2: An Office Robot that Communicates and Learns. IEEE Intelligent Systems. Vol. 16. Num. 5. pp. 46-55.

[2]        O. Aycard and R. Washington. 2000. State Identification for Planetary Rovers: Learning and Recognition. Proceedings of the IEEE International Conference on Robotics and Automation. pp. 1163-1168.

[3]        P. G. Backes, K. S. Tso, and G. K. Tharp. 1999. The Web Interface for Telescience, Presence, MIT Press. Vol. 8. pp. 531-539.

[4]        B. Bhanu, P. Leang, C. Cowden, Y. Lin, and M. Patterson. 2001. Real-Time Robot Learning. International Conference on Robotics and Automation. Seoul. Korea.

[5]        G. Burdea. 1999. Invited Review: The Synergy between Virtual Reality and Robotics. IEEE Transactions on Robotics and Automation. Vol. 15. Num. 3. pp. 400-410.

[6]        M. Carreras, P. Ridao, J. Batlle and T. Nicosevici. 2002. Efficient Learning of Reactive Robot Behaviors with a Neural-Q Learning Approach. IEEE International Conference on Automation, Quality and Testing. Romania.

[7]        M. J. Conway, R. Pausch, R. Gossweiler and T. Burnette. 1993. Alice: A Rapid Prototyping System for Building Virtual Environments. University of Virginia, School of Engineering.

[8]        T. F. Cootes, C. J. Taylor, D. H. Cooper and J. Graham. 1992. Training Models of Shape from Sets of Examples. Proceedings of the British Machine Vision Conference. pp 9-18.

[9]        J. J. Craig. 1989. Introduction to Robotics: Mechanics and Control. Addison-Wesley.

[10]      I. N. Durlach and S.N. Mavor. 1995. Virtual Reality: Scientific and Technological Challenges. National Academy Press. Washington DC.

[11]      R. A. Earnshaw, M. A. Gigante and H. Jones. 1994. Virtual Reality Systems. Academic Press Limited.

[12]      T. W. Fong, C. Thorpe and C. Baur. 2001. Collaboration, Dialogue, and Human-Robot Interaction. Proceedings of the 10th International Symposium of Robotics Research, Lorne, Victoria, Australia, Springer-Verlag.

[13]      K. Goldberg, J. Santarromana, G. Bekey, S. Gentner, R. Morris, J. Wiegley and E. Berger. 1995. The Telegarden. Proceedings of ACM SIGGRAPH.

[14]      H. Hagras and T. Sobh. 2002. Intelligent Learning and Control of Autonomous Robotic Agents Operating in Unstructured Environments. Information Science Journal. Elsevier Science Inc. Vol. 145. Issue 1-2. pp. 1-12.

[15]      O. Heguy, N. Rodriguez, H. Luga, J. P. Jessel and Y. Duthen. 2001. Virtual Environment for Cooperative Assistance in Teleoperation. The 9th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision.

[16]      B. Hine, P. Hontalas, T.W. Fong, L. Piguet, E. Nygren and A. Kline. 1995. VEVI: A Virtual Environment Teleoperations Interface for Planetary Exploration. SAE 25th International Conference on Environmental Systems.

[17]      H. Hirukawa and I. Hara. 2000. The Web's Top Robotics. IEEE Robotics and Automation Magazine. Vol. 7. Num. 2.

[18]      G. Hirzinger, B. Brunner, J. Dietrich and J. Heindl. 1993. Sensor-Based Space Robotics-ROTEX and its Telerobotic Features. IEEE Transactions on Robotics and Automation. Vol. 9. Num. 5. pp. 649-663.

[19]      L. Hsu, R. Costa, F. Lizarralde and J. Soares. 1999. Passive Arm Based Dynamic Positioning System for Remotely Operated Underwater Vehicles. IEEE International Conference on Robotics and Automation. Vol. 1. pp. 407-412.

[20]      L. P. Kaelbling, M. L. Littman and A. W. Moore. 1996. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research. Vol. 4. pp. 237-285.

[21]      U. Kartoun. 2003. A Human-Robot Collaborative Learning System Using a Virtual Reality Telerobotic Interface. Ph.D. proposal, Dept. of Industrial Engineering and Management, Ben-Gurion University of the Negev, Beer Sheva 84105, ISRAEL.

[22]      U. Kartoun, Y. Edan and H. Stern. 2004. Virtual Reality Telerobotic System. e-ENGDET 4th International Conference on e-Engineering and Digital Enterprise Technology, U.K.

[23]      D. Kwon, K. Y. Woo and H. S. Cho. 1999. Haptic Control of the Master Hand Controller for a Microsurgical Telerobot System. IEEE International Conference on Robotics and Automation. Vol. 3. pp. 1722-1727.

[24]      J. Lander. 1998. Oh My God, I Inverted Kine. Game Developer Magazine. Vol. 9. pp. 9-14.

[25]      K. H. Lee, H. S. Tan and T. J. Lie. 2000. Virtual Operator Interface for Telerobot Control. Proceedings of the Sixth International Conference on Modern Industrial Training. Beijing. pp. 250-258.

[26]      S. A. Manesis, G. N. Davrazos and N. T. Koussoulas. 2002. Controller Design for Off-tracking Elimination in Multi-articulated vehicles. 15th IFAC World Congress, Spain.

[27]      P. J. McKerrow. 1991. Introduction to Robotics, Addison-Wesley.

[28]      O. Michel, P. Saucy and F. Mondada. 1997. KhepOnTheWeb: an Experimental Demonstrator in Telerobotics and Virtual Reality, Proceedings of the International Conference on Virtual Systems and Multimedia, IEEE VSMM'97. pp. 90-98.

[29]      J. Preece, Y. Rogers, H. Sharp, D. Benyon, S. Holland and T. Carey. 1994. Human-Computer Interaction. Addison-Wesley Publishing Company.

[30]      A. Rastogi, P. Milgram and J. J. Grodski. 1995. Augmented Telerobotic Control: a Visual Interface for Unstructured Environments. University of Toronto and Defense and Civil Institute of Environmental Medicine.

[31]      L. Scárdua, A. H. Reali-Costa and J. J. da Cruz. 2000. Learning to Behave by Environment Reinforcement. RoboCup-99: Robot Soccer World Cup III. Lecture Notes in Artificial Intelligence. Berlin. Vol. 1856. pp. 439-449.

[32]      D. Sorid and S. K. Moore. 2000. The Virtual Surgeon. IEEE SPECTRUM. pp. 26-39.

[33]      M. Stein. 2000. Interactive Internet Artistry, Painting on the World Wide Web with the PumaPaint Project. IEEE Robotics and Automation Magazine. Vol. 7. Num. 1. pp. 28-32.

[34]      K. Taylor and J. Trevelyan. 1995. A Telerobot on the World Wide Web. Proceedings of the National Conference of the Australian Robot Association.

[35]      C. J. C. H. Watkins and P. Dayan. 1992. Q-learning. Machine Learning. Vol. 8. pp. 279-292.

