Human-Robot Collaborative Learning of a Bag Shaking Trajectory
Kartoun Uri, Stern Helman, and Edan Yael
Department of Industrial Engineering and Management, Ben-Gurion University of the Negev, Beer-Sheva 84105, ISRAEL
{kartoun, helman, yael}@bgu.ac.il
Abstract—This paper presents a collaborative reinforcement learning algorithm, CQ(λ), designed to accelerate learning by integrating a human operator into the learning process. The CQ(λ)-learning algorithm enables collaboration of knowledge between the robot and a human; the human, responsible for remotely monitoring the robot, suggests solutions when intervention is required. Based on its learning performance, the robot switches between fully autonomous operation and the integration of human commands. The CQ(λ)-learning algorithm was tested on a Motoman UP-6 fixed-arm robot required to empty the contents of a suspicious bag. Experimental results support the hypothesis that learning is faster when human collaboration is triggered than when the system functions autonomously.
Index Terms: Robot learning, reinforcement learning, human-robot collaboration.
Teleoperation is used when a task has to be performed in a hostile, unsafe, inaccessible or remote environment [1]. [2] suggest two components of a human-robot system when the robot is remotely located: (i) autonomy mode - an artificial intelligence or computer control of a robot that allows it to act, for a time, without human intervention, and (ii) Human-Robotic Interfaces (HRI) - software installed at the human's location allowing the operator to perceive the world and the robot states, and to send instructions to the robot. One of the main issues in task-oriented HRI is achieving the right mixture of human and robot autonomy [3]. [4] indicates the importance of HRI in meeting operators' requirements: "an understanding of the human decision process should be incorporated into the design of human-robotic interfaces in order to support the process humans' employ."
[5] describe an HRI that supports both adjustable autonomy and hierarchical task selection. With adjustable autonomy, a computer switches among several control modes ranging from full supervision to full autonomy. With hierarchical task selection, the interface allows an operator to easily solve a high-level task autonomously or else to guide a robot through a sequence of lower-level subtasks that may or may not involve autonomous control. [6] define sliding scale autonomy as the ability to create new levels of autonomy between existing, pre-programmed autonomy levels. The suggested sliding scale autonomy system shows the ability to dynamically combine human and robot inputs, using a small set of variables such as user and robot speeds, speed limitations, and obstacle avoidance.
Reinforcement learning (RL) is regarded as learning through direct experimentation [7], [8]. It does not assume the existence of a teacher providing training examples. Instead, teaching derives from experience. The learner acts on the process to receive signals (reinforcements) from it, indications about how well it is performing the required task. These signals are usually associated with some dramatic condition - e.g., accomplishing a subtask (reward) or complete failure (punishment). The learning agent learns the associations between observed states and chosen actions that lead to rewards or punishments, i.e., it learns how to assign credit to past actions and states by correctly estimating costs associated with these events [9].
In [10], a collaborative process enabling a robotic learner to acquire concepts and skills from human examples is presented. During the teaching process, the robot must perform tasks based on human instructions. The robot executes its tasks by incorporating feedback until its hypothesis space has converged. Using a Q-learning approach, the robot learns a button pushing task. In [11], a variable autonomy approach is used. User commands serve as training inputs for the robot learning component, which optimizes the autonomous control for its task. This is achieved by employing user commands for modifying the robot's reward function. The potential of learning from both reinforcement and human rewards is illustrated in [12], [13], which show the corresponding changes in the user reward and Q-value functions. The task was to learn to optimally navigate to a specific target in a two-dimensional world with obstacles.
Q-learning requires no human intervention; the agent is placed in an unknown environment and explores it independently with the objective of finding an optimal policy. A disadvantage of this approach is the large amount of interaction with the environment required until an effective policy is determined. One way of alleviating this problem is to guide the agent using rules that suggest trajectories of successful runs through the environment [14], [15]. [15] suggest an RL-based framework denoted as "relocation". At any time during training an agent can request to be placed in any state of the environment. The "relocation" approach assumes a cost per relocation, and seeks to limit the number of relocations. The approach requires minimal human involvement and consists of two agent conditions: (i) "in trouble" - the agent has taken actions that turned out to be a poor choice; even though it learns from the negative experience, continuing would waste time in a portion of the state-space that is unlikely to be visited during optimal behavior, and (ii) "bored" - if the Q-values are updated by tiny amounts, the agent is not learning anything new in the current part of the environment. In this condition it is forced to relocate, with the greatest probability when updating a particular Q-value does not change it.
In [16] a cooperative RL algorithm for multi-agent systems denoted as the "leader-following Q-learning algorithm" is presented. The algorithm is based on a Markov or stochastic game, in which there are multiple stages, and each stage is a static Stackelberg game. The problem of multiple RL agents attempting to learn the value function of a particular task in parallel is investigated in [17] for the n-armed bandit task. A parallel reinforcement learning solution is suggested to overcome the problem of the shared statistics being overwhelmed by the information of an agent that has a correspondingly larger accumulated experience than the other agents. Experiments on a group of four foraging mobile robots learning to map robots' conditions to behaviors were conducted by Matarić [18]. The learning algorithm of the robots consists of reward functions that combine individual conditions of a robot (such as "grasped a puck" or "dropped puck away from home") and collaborative conditions, i.e., how close the robots are to each other. Individually, each robot learns to select the behavior with the highest value for each condition, to find and take home the most pucks. Evaluation of groups of three and four robots found that interference was a detriment; in general, the more robots were learning at the same time, the longer it took for each individual to converge. Additionally, [18] found that while measuring the "percent of the correct policy the robots learned in 15 minutes, averaged over twenty trials," the use of heterogeneous reward functions results in better performance but also suffers from the credit assignment problem.
As motivation for our approach, consider that the usual method for bomb squad personnel is to blow up suspicious bags and any explosives contained therein. However, if the bag contains chemical, biological or radiological canisters, this method can lead to disastrous results. Furthermore, the "blow-up" method also destroys important clues such as fingerprints, type of explosive, detonators and other signatures of use in subsequent forensic analysis. Learning the extraction of a bag's contents by a robot acquiring knowledge from human advice is the subject addressed here. The learning task described in this paper is to observe the position of a plastic bag located on a platform, grasp it with a robot manipulator and shake out its contents into a collection container in minimum time.
Although Q-learning and its variation Q(λ) have been used in many robotic fields (e.g., [19]-[24]), accelerating the learning process is important. This paper presents a new algorithm, referred to as CQ(λ), that accelerates learning. The CQ(λ) algorithm is a collaborative algorithm that integrates the experience of several agents. In this paper, we describe experiments applying the CQ(λ) algorithm to a system that integrates two agents - a robot and a human working cooperatively to achieve a common goal. The system has no a priori knowledge regarding efficient lifting and shaking policies of the bag, and it learns this knowledge from experience and from human guidance. Although other alternatives are available for solving the proposed security problem, such as cutting open the bag or sliding out the inspected objects, the application was selected to serve as a test-bed for the CQ(λ) algorithm.
Section II presents the new CQ(λ) algorithm. The test-bed learning application is described in section III followed by experimental results in section IV. Experiments and analysis of the results are given in sections V and VI, respectively. Concluding remarks follow in section VII.
The basic assumption in reinforcement learning studies is that any state $s_{t+1}$ reached by the agent must be a function only of its last state and action: $s_{t+1} = f(s_t, a_t)$, where $s_t$ and $a_t$ are the state and action at time step $t$, respectively [9]. In Q-learning, the system estimates the optimal action-value function directly and then uses it to derive a control policy using the local greedy strategy [25]. It is stated in [23] that "Q-learning can learn a policy without any prior knowledge of the reward structure or a transition model. Q-learning is thus referred to as a model-free approach." It does not require mapping from actions to states and it can calculate the Q values directly from the elementary rewards observed. Q is the system's estimate of the optimal action-value function [26]. It is based on the action value measurement $Q^*(s_t, a_t)$, defined in (1):
$$Q^*(s_t, a_t) = E\left[ r(s_t, a_t) + \gamma V^*(s_{t+1}) \right] \qquad (1)$$
which represents the expected discounted cost, with discount factor $\gamma$, for taking action $a_t$ when visiting state $s_t$ and following an optimal policy thereafter; $V^*(s_{t+1}) = \max_{a} Q^*(s_{t+1}, a)$ is the optimal expected cost of the next state. From this definition and as a consequence of Bellman's optimality principle [27], (2) is derived:
$$Q^*(s_t, a_t) = E\left[ r(s_t, a_t) + \gamma \max_{a} Q^*(s_{t+1}, a) \right] \qquad (2)$$
The essence of Q-learning is that these characteristics (maximum operator inside the expectation term and policy independence) allow an iterative process for calculating an optimal action. The first step of the algorithm is to initialize the system's action-value function, Q. Since no prior knowledge is available, the initial values can be arbitrary (e.g., uniformly zero). Next, the system's initial control policy, $\pi$, is established. This is achieved by assigning to $\pi(s_t)$ the action that locally maximizes the action-value. At time-step $t$, the agent visits state $s_t$ and selects an action $a_t$, receives from the process the reinforcement $r(s_t, a_t)$ and observes the next state $s_{t+1}$. Then it updates the action value $Q(s_t, a_t)$ according to (3), which describes one Q-learning step:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r(s_t, a_t) + \gamma \hat{V}(s_{t+1}) - Q(s_t, a_t) \right] \qquad (3)$$
where $\hat{V}(s_{t+1}) = \max_{a} Q(s_{t+1}, a)$ is the current estimate of the optimal expected cost $V^*(s_{t+1})$ and $\alpha$ is the learning rate, which controls how much weight is given to the reward just experienced as opposed to the old Q estimate. The process repeats until a stopping criterion is met (e.g., the robot has emptied the bag of its contents). The greedy action $\arg\max_{a} Q(s_t, a)$ is the best action the agent can perform when in state $s_t$. For the initial stages of the learning process, however, it uses randomized actions that encourage exploration of the state-space. Under some reasonable conditions [28] this is guaranteed to converge to the optimal Q-function [26].
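As an illustration, the one-step update described above can be sketched in a few lines of Python. This is a minimal tabular sketch, not the authors' implementation: the function name `q_learning_step`, the toy state and action labels, and the values chosen for the learning rate and discount factor are all assumptions made for the example.

```python
from collections import defaultdict

def q_learning_step(Q, s, a, r, s_next, actions, alpha, gamma):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Greedy estimate of the value of the next state.
    v_hat = max(Q[(s_next, b)] for b in actions)
    td_error = r + gamma * v_hat - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error

# Hypothetical toy problem: all action values start at zero.
Q = defaultdict(float)
actions = ["left", "right"]
q_learning_step(Q, "center", "left", r=1.0, s_next="side", actions=actions,
                alpha=0.1, gamma=0.9)
```

With all values initialized to zero, the first rewarded step moves the visited entry a fraction `alpha` of the way toward the observed reward, which is the weighting between new experience and the old estimate that the learning rate controls.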
A generalization of Q-learning, represented by Q(λ) [25], [29], uses eligibility traces, $e(s_t, a_t)$: one-step Q-learning is a particular case with $\lambda = 0$ [30]. The Q-learning algorithm learns quite slowly because only one time-step is traced for each action [10]. To boost learning convergence, a multi-step tracing mechanism, the eligibility trace, is used in which the Q values of a sequence of actions can be updated simultaneously according to the respective lengths of the eligibility traces [19]. "The convergence of Q(λ) is not assured anymore for λ > 0, but experience shows that learning is faster" [30]. Several action selection policies are described in the literature; the greedy policy (e.g., [31]) is to choose the best action. Other policies (e.g., "softmax" or "ε-greedy" [32]) are stochastic, and are based on occasionally choosing a suboptimal action in order to explore the state-action space.
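A stochastic selection rule of the ε-greedy kind mentioned above can be sketched as follows. The sketch assumes a dictionary-based value table; the function name `epsilon_greedy`, the action labels and the stored values are hypothetical and serve only to make the example runnable.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon, rng=random):
    """Return a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    # Unseen state-action pairs default to a value of zero.
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

# Hypothetical value table for a single state.
Q = {("center", "x_plus"): 0.2, ("center", "z_plus"): 0.7}
actions = ["x_plus", "z_plus"]
greedy_choice = epsilon_greedy(Q, "center", actions, epsilon=0.0)
```

Setting `epsilon` to zero recovers the purely greedy policy, while larger values trade exploitation for exploration.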
The proposed CQ(λ) learning algorithm (Fig. 1) has the objective of accelerating learning in a system composed of several similar learning processes. Unlike [18], where the learning algorithms of multiple robots consist of reward functions that combine individual conditions of a robot, the CQ(λ) learning algorithm is based on a state-action value of an agent or learning process being updated according to the maximal value among all other state-action values existing in the learning system (4); collaboration consists of taking the maximum of action values, i.e., the Q-value, across all learners at each update step [33]. Similar to the "leader-following Q-learning algorithm" in the joint policy approach described in [16], the CQ(λ) learning algorithm enables collaboration of knowledge between several agents (the human and the robot). Unlike [12] and [13], the CQ(λ) algorithm does not allow human rewards to be inserted directly into its Q-value functions. CQ(λ) rewards are achieved by interaction of learning agents with the environment.
Fig. 1. CQ(λ) learning algorithm
(4)
In (4), $\delta$ is the temporal difference error that specifies how different the new value is from the old prediction, and $e(s_t, a_t)$ is the eligibility trace that specifies how much a state-action pair should be updated at each time-step. When a state-action pair is first visited, its eligibility is set to one. Then at each subsequent time-step it is reduced by a factor $\gamma\lambda$. When it is subsequently visited, its eligibility trace is increased by one [34].
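The trace mechanics and the maximum-taking collaboration described above can be illustrated with a short Python sketch. It is one plausible reading of the text, not the algorithm of Fig. 1: traces accumulate and decay by the product of the discount factor and λ, traces are not reset after exploratory actions (a simplification), and the helper `share_maximum`, the table names and all numeric values are assumptions introduced for the example.

```python
from collections import defaultdict

def q_lambda_step(Q, e, s, a, r, s_next, actions, alpha, gamma, lam):
    """One Q(lambda)-style update with accumulating eligibility traces."""
    delta = r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)]
    e[(s, a)] += 1.0                      # first visit: 0 -> 1; revisit: +1
    for pair in list(e):
        Q[pair] += alpha * delta * e[pair]
        e[pair] *= gamma * lam            # decay every trace after the update
    return delta

def share_maximum(tables):
    """Assumed collaboration step: each pair takes the largest value held by any learner."""
    pairs = set().union(*(t.keys() for t in tables))
    for pair in pairs:
        best = max(t[pair] for t in tables)
        for t in tables:
            t[pair] = best

actions = ["x_plus", "x_minus"]
robot_Q, robot_e = defaultdict(float), defaultdict(float)
# Hypothetical value obtained from a policy suggested by the human operator.
human_Q = defaultdict(float, {("center", "x_plus"): 2.0})
delta = q_lambda_step(robot_Q, robot_e, "center", "x_plus", 1.0, "x_plus_state",
                      actions, alpha=0.5, gamma=0.9, lam=0.5)
share_maximum([robot_Q, human_Q])
```

In this toy run the robot's own update yields a smaller value than the one held in the second table, so after sharing both tables carry the larger value for that state-action pair.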
When only two learning agents are involved, such as a robot and a human (Fig. 2), the robot learning function acquires state-action values achieved from policies suggested by a human operator (HO). In this case, a robot learning performance threshold is defined as $\Lambda$, a minimum acceptable performance below which the human is called. The threshold $\Lambda$ is compared with the average number of rewarded policies, $L_{ave}$, over the last $N$ learning trials considered (5):
$$L_{ave} = \frac{1}{N} \sum_{i=n-N}^{n-1} N_i \qquad (5)$$
where $n$ is the current learning trial, $i = n-N, \ldots, n-1$, and $N_i \in \{0, 1\}$ notifies whether a policy was successful ($N_i = 1$ - at least one item fell from the bag) or failed ($N_i = 0$ - no items fell from the bag) for the $i$-th trial. Based on this learning performance threshold, the robot switches between fully autonomous operation and the integration of human commands.
Two levels of collaboration are defined: (i) autonomous - the robot decides which actions to take, acting autonomously according to its Q(λ) learning function, and (ii) semi-autonomous - the HO suggests actions remotely and the robot combines this knowledge, i.e., CQ(λ)-learning is being performed. Human-robot collaboration is unnecessary as long as the robot learns policies autonomously and adapts to new states. The HO is required to intervene and suggest alternative shaking parameters for shaking policies if the robot reports that its learning performance is low (6).
$$L_{ave} < \Lambda \qquad (6)$$
The robot learning performance threshold, $\Lambda$, a pre-defined minimum acceptable performance measure below which the human is called, is compared with $L_{ave}$, the average number of rewarded policies over the last $N$ learning policies performed; i.e., the robot switches its learning level from autonomous (self-performed shaking) to semi-autonomous (acquiring human knowledge) based on its learning performance (see example in Fig. 3).
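The switching rule can be sketched as a moving average of success indicators compared against a threshold. The class name `CollaborationSwitch`, the threshold of 0.5 and the window of five trials are assumptions chosen for illustration; the text does not fix these values here.

```python
from collections import deque

class CollaborationSwitch:
    """Track recent trial outcomes and pick a collaboration level."""

    def __init__(self, threshold, window):
        self.threshold = threshold            # minimum acceptable performance
        self.outcomes = deque(maxlen=window)  # keeps only the most recent trials

    def record(self, items_fell):
        # A trial counts as rewarded if at least one item fell from the bag.
        self.outcomes.append(1 if items_fell > 0 else 0)

    def average(self):
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 0.0

    def level(self):
        return "semi-autonomous" if self.average() < self.threshold else "autonomous"

switch = CollaborationSwitch(threshold=0.5, window=5)
for items in [0, 0, 2, 0, 0]:
    switch.record(items)
level_after_poor_run = switch.level()
for items in [1, 3, 1, 5]:
    switch.record(items)
level_after_good_run = switch.level()
```

Using a bounded window means that a run of successful trials eventually returns the robot to autonomous operation, even after an early sequence of failures.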
The system consists of a robot, a bag object, and an environment containing a platform on which the inspected bag is manipulated. A task is performed on the bag, using the robot, within the environment. For the task of emptying the contents of a suspicious plastic bag, the learning task is to observe the position of the bag, located on a platform, grasp it with a robot manipulator and shake out its contents in minimum time. It is assumed that the number of items in the bag is known in advance.
Robot states (Table I) are its gripper locations in a three-dimensional grid (Fig. 4). The performance of the task is a function of a set of actions for each physical state of the system.
Fig. 2. Robot and a human operator
Fig. 3. Example of learning performances
TABLE I: State Description of the Three-Dimensional Grid
State(s) | Description | Number of states
Center | state center | 1
X axis | states where the robot can move over its X axis | 6
Y axis | states where the robot can move over its Y axis | 6
Z axis | states where the robot can move over its Z axis | 6
Fig. 4. Robot state-space
A typical learning episode (trial) starts when the robot arm is located above an inspection surface at a "center shaking position". The robot is programmed in advance to grasp a plastic bag located in a fixed position. Grasping includes moving down toward the bag, sliding under it, grasping it, and then moving the gripper to the "center shaking position". The "center shaking position" was chosen at the origin of the X and Y axes and high enough above the inspection surface to allow the bag to be shaken freely without touching the ground when it is shaken vertically along the Z axis. Further, this position was chosen to prevent the robot from hitting itself while performing the fast shaking movements.
An action consists of a robot movement in the direction of one of the X, Y or Z axes. The robot starts a shaking policy from the center state. From the center state it can move to any of the other 18 states. The distance (denoted as "the amplitude") between any two adjacent states is set a priori, before a shaking policy is performed (e.g., 30 mm). From a robot state other than the center state, an action performed by the robot is limited to (a) a move to its mirror position along its present axis, or (b) a move back to the center state (e.g., from a state on the positive side of the X axis, the robot can move only to the corresponding state on the negative side of the X axis or to the center state). In addition to the dependency of an action on the predefined amplitude, two speed values are defined. This doubles the number of possible actions the robot can take, resulting in 108 possible actions.
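The count of 108 actions can be checked by enumerating the moves: 18 moves leave the center state, each of the 18 other states allows two moves (mirror position or center), and each of the resulting 54 moves can be executed at two speeds. The sketch below assumes that the six states on each axis are three positions on either side of the center; the labels for levels and speeds are placeholders, not the paper's notation.

```python
from itertools import product

AXES = ("X", "Y", "Z")
LEVELS = (-3, -2, -1, 1, 2, 3)   # assumed: three positions on each side of the center
SPEEDS = ("slow", "fast")         # placeholders for the two speed values
CENTER = "center"

states = [CENTER] + [(axis, level) for axis, level in product(AXES, LEVELS)]

def moves_from(state):
    """Return the states reachable in one move from the given state."""
    if state == CENTER:
        return [s for s in states if s != CENTER]
    axis, level = state
    return [(axis, -level), CENTER]   # mirror position or back to the center

actions = [(s, target, speed)
           for s in states
           for target in moves_from(s)
           for speed in SPEEDS]
```

Under these assumptions the enumeration gives 19 states and (18 + 18 × 2) × 2 = 108 actions, matching the totals stated in the text and in Table I.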
Let a policy be a set of 100 state-action pairs. Three performance measures were defined for a policy:
(i) Average time to complete (or partially complete) emptying the contents of a bag. This value is a measure of the time at which each object falls from the bag during a shaking policy. Shaking time differs between learning episodes because different actions are performed. In addition, the time of moving between one pair of states is not necessarily equal to the time of moving between another pair. Other time differences occur when the human intervenes and changes motion speeds and amplitudes.
(ii) Human intervention rate. This measure represents the percentage of human interventions out of the total number of learning trials. Human operator (HO) collaboration is triggered when robot learning is slow. The rate represents the degree of HO collaboration (the lower it is, the more autonomous the robot is).
(iii) Average reward. Robot learning is achieved through direct experience with the environment, according to rewards based on the number of items falling from the bag during a shaking trial and the times at which they fall. Its value depends on the time of occurrence of the falling objects according to (7).
|
(7) |
where
is a constant,
is the time horizon for a learning
episode and
is the time elapsed from the beginning of the shaking episode until an object falls
out of the bag. If an item is not extracted then
.
The experimental setup contains a Motoman UP-6 fixed-arm robot positioned over an inspection surface (Fig. 5a). A dedicated gripper designed for smooth slippage under a bag was used. At the beginning of each learning trial, the robot grasps and lifts a bag containing five wooden cubes (Fig. 5b) to a center position. It then initiates a shaking policy.
Fig. 5. Experimental setup: (a) robot and inspection surface; (b) plastic bag and cube
The UP-6 robot has no a priori knowledge in its initial autonomous learning stage. In a subsequent cooperative stage, in addition to the robot interacting with the environment and receiving rewards and punishments, policy adjustments are provided by the HO through an interface (Fig. 6).
Fig. 6. Human interface
The interface views and controls consist of: (i) "Visual Feedback" - real-time visual feedback captured from a web-camera located over the robotic scene; (ii) "System Performance" - system learning performance, reported at the end of each episode; (iii) "Collaboration Level" - autonomous or semi-autonomous; and (iv) "Human Decision Making" - when asked to intervene, the HO can determine the robot shaking amplitude and speeds. The robot learning function acquires the suggested parameters and performs a new shaking policy.
An experiment using CQ(λ)-learning was conducted. Fifty episodes, each containing 100 state-action learning steps, were separated into two stages: (i) autonomous - during the first 10 episodes the robot performs shaking policies autonomously and no human intervention is allowed. The shaking parameters for the first episode were set to an amplitude of 30 mm and to speeds of 1000 and 1500 mm/s, and (ii) collaborative learning - in this stage, consisting of 40 episodes, human intervention triggering was allowed, based on the system learning performance. In this stage the human can adjust the shaking policy parameters in the range of 10 to 50 mm for the amplitude and in the range of 100 to 1500 mm/s for the speeds.
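The two-stage protocol and the parameter ranges can be captured in a small configuration sketch. The stage lengths and ranges are taken from the description above; clamping out-of-range operator suggestions is an assumption made for the example (the interface may instead restrict the inputs), and all function names are hypothetical.

```python
AUTONOMOUS_EPISODES = 10          # stage (i): no human intervention allowed
TOTAL_EPISODES = 50
AMPLITUDE_RANGE_MM = (10, 50)
SPEED_RANGE_MM_S = (100, 1500)

def clamp(value, low, high):
    """Limit value to the closed interval [low, high]."""
    return max(low, min(high, value))

def intervention_allowed(episode):
    """Episodes are numbered from 1; intervention is possible only in stage (ii)."""
    return AUTONOMOUS_EPISODES < episode <= TOTAL_EPISODES

def accept_human_parameters(amplitude_mm, speed_mm_s):
    """Clamp operator suggestions to the allowed ranges (clamping is an assumption)."""
    return (clamp(amplitude_mm, *AMPLITUDE_RANGE_MM),
            clamp(speed_mm_s, *SPEED_RANGE_MM_S))

collaborative_episodes = sum(intervention_allowed(n)
                             for n in range(1, TOTAL_EPISODES + 1))
```

Keeping the stage boundary as a single constant makes it straightforward to vary the length of the autonomous stage in later experiments.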
We arbitrarily chose ten learning episodes for the autonomous stage. Since the average learning performance is assumed to be low in the early stages of learning, the robot would otherwise have been triggered immediately to ask for human intervention; autonomous exploration is therefore necessary at the beginning of learning. (This value of 10 will be parameterized in subsequent studies.)
The system parameters were set to:
,
,
, and
.
To balance between exploration and exploitation (e.g., [35]),
ε-greedy
action selection with
was used.
Although the interface shows that the speeds and amplitudes can be changed
independently on each of the three coordinate axes (Fig. 6), in the experiment
performed, the same speeds were chosen for all axes when human intervention was
triggered. This will be relaxed in subsequent studies.
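A summary of the results is shown in Table II. Results include measurements of the average time to complete (or partially complete) emptying the bag, the human intervention rate, and the average reward (see section III).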
TABLE II: Experimental results
Measure | Autonomous learning - stage I | Collaborative learning - stage II
Average time to complete (or partially complete) emptying the contents of a bag: time (s) | 11.67 | 12.98
Standard deviation of time (s) | 1.22 | 1.33
Human intervention rate [%] | - | 20
Average reward: score | 26.5 | 70.57
Standard deviation of score | 41.6 | 43.02
Number of trials | 10 | 40
The average reward, averaged over learning episodes, improves as learning progresses (Fig. 7).
Fig. 7. Average reward performance
Results indicate a slight deterioration (7.2%) in the average time to empty the bag when comparing the collaboration stage with the autonomous stage, while the human collaborated in 20% of the learning episodes. This time increase can be explained by the fact that shaking policies with higher amplitudes achieved better reward performance. This results in a tradeoff, because policies with higher reward performance lasted longer.
From a reward perspective, a significant improvement was measured when comparing the collaboration stage with the autonomous stage: the average reward rose from 26.5 to 70.57, i.e., to 266.3% of its autonomous-stage value. This strengthens the hypothesis that learning is faster in the collaboration stage than in the autonomous stage.
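The reward comparison can be recomputed from the stage means in Table II. The short check below uses only the two mean scores and the eight intervention requests observed over the 40 collaborative episodes; the variable names are illustrative. It shows that 266.3% is the ratio of the two means, which corresponds to a relative increase of about 166.3%.

```python
mean_reward_autonomous = 26.5       # Table II, stage I
mean_reward_collaborative = 70.57   # Table II, stage II

# Collaborative mean expressed as a percentage of the autonomous mean.
reward_ratio_percent = 100 * mean_reward_collaborative / mean_reward_autonomous
# Relative increase over the autonomous mean.
relative_increase_percent = reward_ratio_percent - 100

# Eight intervention requests over the 40 episodes of the collaborative stage.
intervention_rate_percent = 100 * 8 / 40
```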
From a state-action (Q-value) perspective, we noticed that the highest values over all 50 learning trials performed are related to five state-action pairs, i.e., the preferred shaking policy found occurs when most of the robot actions are horizontal (Fig. 4). Because every shaking policy performed by the robot starts from the center state, we analyzed the Q values of three states for different trials:
· The Q table for the first episode indicates similar Q values (i.e., random behavior) for all state-action pairs.
· For the 10th learning trial, high Q values were observed for various state-action transitions.
· From the 11th learning trial, as described, the robot is allowed to ask for human intervention. Since we observed eight requests for intervention (all between the 11th and the 22nd episode), the high Q values of the 22nd learning trial are of interest.
· For the 50th learning trial, high Q values were observed for a small number of state-action pairs.
For the bag shaking learning problem, it was found that human intervention increases the total reward achieved by the system. Intuitively, vertical shaking seems best. It was found, however, that the policy of shaking most of the time along a horizontal axis, with a small number of actions along another axis, was the most effective. Further, from the Q values observed, it seems that the system starts by exploring the state-action space randomly. Then, when provided with human collaboration, it converges to a shaking policy mostly in the horizontal plane, passing back and forth through the center using the highest possible amplitude. A better policy is found later in the learning process, when the robot skips the center state and the shaking follows an oscillatory, horizontal, high-amplitude motion. This policy caused the bag to occasionally become entangled around the robot gripper, because most of the plastic bag's weight is concentrated at its bottom and the swinging momentum carried it over the gripper. The counterintuitive result has an easy explanation: the plastic bag was loosely tied closed by an overhand knot, and pulling it vertically tends to tighten the knot, while a sideways swing loosens it. Further, hypothetically, a human free to hold and shake the bag could have seen the optimal time to reverse the horizontal movement to avoid entanglement, an ability the robot does not have. Such an ability could be developed with an active vision system.
An RL-based human-robot collaborative learning and decision-making method was developed for the complex task of bag shaking. The robot makes a decision whether to learn the task autonomously or ask for human intervention. The approach aims to integrate user instructions into an adaptive and flexible control framework and to adjust control policies on-line. To achieve this, user commands at different levels of abstraction are integrated into an autonomous learning system. Based on its learning performance, the robot switches between fully autonomous operation and the integration of human commands.
The CQ(λ) learning algorithm accelerates learning in systems composed of several learning agents or systems designed for human-robot interaction, overcoming the main criticism of the reinforcement learning approach, i.e., long training periods. From a reward perspective, the average reward in the collaboration stage was 266.3% of that in the autonomous stage, a significant improvement. This strengthens the hypothesis that learning is faster in the collaboration stage than in the autonomous stage.
Future work will include the design and implementation of extended collaboration strategies that include human intervention using refined or rough intervention policies, robot autonomy, and pure human control. Additionally, system automation will be expanded; one approach might be to use a digital scale for counting the extracted objects in real-time. Further, sensitivity analysis will be provided, especially for the trial at which the collaboration stage is initiated (the trial in which collaborative control is allowed).
Acknowledgements
This work was partially supported by the Paul Ivanier Center for Robotics Research and Production Management, Ben-Gurion University of the Negev and by the Rabbi W. Gunther Plaut Chair in Manufacturing Engineering.
References