Real-Time Hand Gesture Telerobotic System Using the Fuzzy C-Means Clustering Algorithm

JUAN WACHS*, URI KARTOUN, HELMAN STERN, YAEL EDAN

Department of Industrial Engineering and Management

Ben-Gurion University of the Negev

P.O.Box 653, Be’er-Sheeva 84105, ISRAEL

Fax: +972-8-6472958; Tel: +972-8-6461434

E-mail: (juan, kartoun, helman, yael)@bgu.ac.il

* Corresponding Author: Juan Wachs (juan@bgumail.bgu.ac.il)

Abstract

This paper describes a teleoperation system in which an articulated robot performs a block-pushing task based on hand gesture commands sent through the Internet. A Fuzzy C-Means clustering method is used to classify hand postures as “gestures”. The fuzzy recognition system was tested using 20 trials of each gesture in a 12-gesture vocabulary. Results revealed a recognition rate (the ratio of classified gestures to all gestures presented) of 0.96, and a recognition accuracy (the percentage of classified gestures recognized correctly) of 100%; that is, no gesture was recognized incorrectly. Performance times for the pushing task showed rapid learning: an inexperienced operator reached standard times within four to six trials.
 

KEYWORDS: gesture recognition, telerobotics, fuzzy c-means, hand gesture.

1. Introduction

Hand gestures are one of several methods used in telerobotic control [1]. This type of communication provides an expressive, natural and intuitive way for humans to control robotic systems to perform specific tasks. One benefit [2] of such a system is that it is a natural way to send geometrical information to the robot, such as left, right, up and down. Gestures may represent a single command, a sequence of commands, a single word, or a phrase. Such systems should be accurate enough to classify hand gestures correctly in a reasonable time [3]. Human-robot interaction using hand gestures poses a formidable challenge because the environment contains a complex background and dynamic lighting conditions, the human hand is deformable, and execution must be in real time. In addition, the system is expected to be user and device independent [4].

Hand gesture recognition systems can be classified as static or dynamic. In this paper we define a static hand gesture vocabulary for telerobotic control. A Fuzzy C-Means (FCM) recognition system is used because it recognizes gestures quickly and with sufficient accuracy for real-time operation. This speed is particularly useful when the teleoperation system is implemented in a client-server Internet environment, where transmission delays are already an obstacle to real-time operation. The paper describes the gesture recognition algorithms and the experiment designed to test the system.

2. System Architecture

To command a robot movement, the user selects a gesture from the gesture vocabulary. The user places his/her hand over a video imager, and a raw image is acquired. An interface screen allows the user to view the captured gesture. The gesture is classified using a recognition module based on the FCM algorithm [5] and is sent to the robot for execution. The system consists of a five-degree-of-freedom articulated A255 robot manufactured by CRS Robotics, a dual 200 MHz Intel® Pentium™ processor PC system with a Matrox Meteor frame grabber, two 3Com “HomeConnect” USB (Universal Serial Bus) digital PC cameras, and a Panasonic video imager.
 

Recognized gestures are sent over TCP/IP to a distant robot PC server (Fig. 1). The server is connected to the robot controller and to two USB cameras (web cameras) that continually capture the robot scene. Both side and front views of the robot scene are sent to the client using FTP and then presented in the user interface. To enable the user to control the robot over the Internet, TCP/IP and FTP client-server applications were developed and installed on each computer.
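
As an illustration of the client-to-server step, the sketch below sends one recognized gesture label as a line of text and reads it back on the other end. The wire format (one upper-case, newline-terminated ASCII label per command) and the helper names `encode_gesture`, `send_gesture` and `recv_gesture` are assumptions made for this example; the paper does not specify the message format. A local `socket.socketpair()` stands in for the real TCP connection between the client and the robot PC server.

```python
import socket


def encode_gesture(name: str) -> bytes:
    """Encode a gesture label as one newline-terminated ASCII line.

    This wire format is hypothetical; it is not taken from the paper.
    """
    return (name.strip().upper().replace(" ", "_") + "\n").encode("ascii")


def send_gesture(sock: socket.socket, name: str) -> None:
    """Send a single gesture command over an already connected socket."""
    sock.sendall(encode_gesture(name))


def recv_gesture(sock: socket.socket) -> str:
    """Read bytes until a newline arrives and return the decoded label."""
    buf = bytearray()
    while not buf.endswith(b"\n"):
        chunk = sock.recv(1)
        if not chunk:
            raise ConnectionError("peer closed before a full command arrived")
        buf += chunk
    return buf.decode("ascii").strip()


def demo() -> str:
    """Round-trip one command through a local socket pair."""
    client, server = socket.socketpair()
    try:
        send_gesture(client, "Open Grip")
        return recv_gesture(server)
    finally:
        client.close()
        server.close()
```

In a deployed system the pair would be replaced by a client socket connected to the server's address, and the server would pass each decoded label on to the robot controller.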

 

Fig 1. User-Robot Communication Architecture

The hand gesture recognition system flow diagram is shown in Fig. 2. Upon presentation of the robotic scene in the user’s interface, a gesture G is selected from the gesture vocabulary {G1, G2, ..., G12}. A vision system converts the captured image of the gesture into a feature string, which is subsequently recognized and sent to the robot PC server. After the robot executes the command, camera views of the robot environment are transmitted back to the interface.

Fig 2. System Flow Diagram

3. Gesture Classification

3.1 Hand Gesture Language

A vocabulary of 12 static gesture poses was designed for robot control tasks (Fig. 3). The first six gestures of the vocabulary control the robot arm using world coordinates. The forward and back hand gestures control the X-axis, the right and left hand gestures control the Y-axis, and the up and down hand gestures control the Z-axis of the articulated robot arm. The Roll Right and Roll Left hand gestures rotate the wrist joint, and the Open Grip and Close Grip gestures control the robot gripper. The Stop hand gesture stops any action the robot performs. The Home hand gesture resets and calibrates all robot joints in the home position.
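
The vocabulary can be viewed as a lookup from gesture label to robot command. The sketch below is one possible encoding, not the authors' implementation: the step size `STEP_MM`, the roll increment `ROLL_DEG`, the sign conventions on each axis and the command dictionaries are all hypothetical values chosen for illustration.

```python
STEP_MM = 10.0   # hypothetical translation per gesture, in millimetres
ROLL_DEG = 5.0   # hypothetical wrist rotation per gesture, in degrees

# Unit directions in world coordinates (x, y, z); the signs are assumptions.
TRANSLATIONS = {
    "Forward": (1, 0, 0),
    "Back": (-1, 0, 0),
    "Right": (0, 1, 0),
    "Left": (0, -1, 0),
    "Up": (0, 0, 1),
    "Down": (0, 0, -1),
}

# The remaining six gestures act on the wrist, the gripper, or the whole arm.
OTHER_COMMANDS = {
    "Roll Right": {"type": "roll", "degrees": ROLL_DEG},
    "Roll Left": {"type": "roll", "degrees": -ROLL_DEG},
    "Open Grip": {"type": "grip", "state": "open"},
    "Close Grip": {"type": "grip", "state": "closed"},
    "Stop": {"type": "stop"},
    "Home": {"type": "home"},
}


def gesture_to_command(name):
    """Translate a recognized gesture label into a command dictionary."""
    if name in TRANSLATIONS:
        dx, dy, dz = TRANSLATIONS[name]
        return {"type": "move",
                "delta_mm": (dx * STEP_MM, dy * STEP_MM, dz * STEP_MM)}
    if name in OTHER_COMMANDS:
        return dict(OTHER_COMMANDS[name])  # copy so callers cannot mutate the table
    raise KeyError("unknown gesture: " + repr(name))
```

Keeping the table separate from the recognition code means the vocabulary can be redesigned without touching the classifier.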

 

Fig 3. Visual Gesture Recognition Language

3.2 Fuzzy C-Means Clustering

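The Fuzzy C-Means (FCM) clustering algorithm is an easily understood and computationally fast algorithm described mathematically in [5]. Given a set of $n$ data patterns, $X = \{x_1, \ldots, x_n\}$, the algorithm minimizes the weighted within-group sum of squared error objective function, $J_m(U, V)$:

$$J_m(U, V) = \sum_{k=1}^{n} \sum_{i=1}^{c} (u_{ik})^m \, d^2(x_k, v_i) \qquad (1)$$

where $x_k$ is the $k$-th $p$-dimensional data vector, $v_i$ is the prototype of the center of cluster $i$, $u_{ik}$ is the degree of membership of $x_k$ in the $i$-th cluster, $m$ is a weighting exponent on each fuzzy membership, $d(x_k, v_i)$ is a distance measure between data pattern $x_k$ and cluster center $v_i$, $n$ is the number of data patterns and $c$ is the number of clusters. The objective function $J_m(U, V)$ is minimized via an iterative process in which the degrees of membership $u_{ik}$ and the cluster centers $v_i$ are updated:

$$u_{ik} = \frac{1}{\sum_{j=1}^{c} \left( \dfrac{d(x_k, v_i)}{d(x_k, v_j)} \right)^{2/(m-1)}} \qquad (2)$$

and

$$v_i = \frac{\sum_{k=1}^{n} (u_{ik})^m \, x_k}{\sum_{k=1}^{n} (u_{ik})^m} \qquad (3)$$

where the $u_{ik}$ satisfy:

$$u_{ik} \in [0, 1], \qquad \sum_{i=1}^{c} u_{ik} = 1 \ \text{ for all } k \qquad (4).$$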
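
The iteration can be sketched in a few lines of NumPy. This is a minimal illustration of the standard FCM updates (membership update (2), center update (3), normalization (4)) with a Euclidean distance, not the authors' code; the function names `fcm` and `objective`, the random initialization, the stopping tolerance and the default $m = 2$ are assumptions.

```python
import numpy as np


def fcm(X, c, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Cluster the rows of X (n x p) into c fuzzy clusters.

    Returns the centers V (c x p) and the memberships U (c x n).
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    U = rng.random((c, X.shape[0]))
    U /= U.sum(axis=0, keepdims=True)          # enforce constraint (4)
    for _ in range(n_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)               # update (3)
        d = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2)  # c x n distances
        d = np.fmax(d, 1e-12)                  # guard against division by zero
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=0, keepdims=True)               # update (2)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return V, U


def objective(X, V, U, m=2.0):
    """Evaluate the objective function (1) for given centers and memberships."""
    X = np.asarray(X, dtype=float)
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
    return float(((U ** m) * d2).sum())
```

For the gesture problem, `X` would hold one 13-element feature vector per training image and `c` would be 12.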

In the proposed methodology, the FCM algorithm is provided with a training set of candidate gestures. Each gesture is represented by a feature vector extracted from the gesture image. The set of feature vectors is clustered for subsequent use in a recognition system. Once the clusters have been created, they are labeled (i.e., each is assigned a linguistic description: the class or gesture associated with that cluster). Because the clustering itself is unsupervised, this labeling must be done manually. In this setting, the cluster center $v_i$ is a prototype feature vector for cluster $i$, $x_k$ is the feature vector of the $k$-th exemplar gesture in the training set, $u_{ik}$ is the "degree of belonging" (membership value) of the $k$-th feature vector to cluster $i$, $c$ is the number of gestures in the gesture set as well as the number of clusters, and $n$ is the number of images in the training set.

3.3 Preprocessing and Feature Extraction

Preprocessing of the captured gesture image consists of image processing operations such as thresholding followed by morphological erosion and dilation filters, applied until a black and white hand object is segmented from the background. A bounding box is then constructed around the segmented hand, and a feature vector with 13 parameters is created. The first feature is the aspect ratio of the bounding box. The remaining 12 features are block gray mean values calculated from a 3 by 4 block partition of the image; each represents the average brightness of one block. Fig. 4 illustrates a typical user gesture (a), its block mean grayscale values (b), and the resultant feature vector.

(a) User gesture image; (b) block mean grayscale values

Feature Vector = [57 176 52 2 2 68 249 171 16 3 13 253 188]

Fig 4. Illustration of a Feature Vector
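
The feature extraction step can be sketched as follows. This is an illustrative sketch under several assumptions that the paper does not settle: the partition is applied to the bounding-box region of the grayscale image, "3 by 4" is read as 3 rows by 4 columns, the aspect ratio is taken as width divided by height, and no scaling is applied to it (the example vector in Fig. 4 suggests the authors scale it in some way). The function name `extract_features` is hypothetical, and the segmentation mask is assumed to come from the thresholding and morphological steps described above.

```python
import numpy as np


def extract_features(gray, mask, rows=3, cols=4):
    """Build a (1 + rows*cols)-element feature vector for one gesture image.

    gray: 2-D array of gray levels; mask: boolean array, True on hand pixels.
    """
    gray = np.asarray(gray, dtype=float)
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("mask contains no hand pixels")
    top, bottom = ys.min(), ys.max() + 1
    left, right = xs.min(), xs.max() + 1
    box = gray[top:bottom, left:right]
    if box.shape[0] < rows or box.shape[1] < cols:
        raise ValueError("bounding box is too small to partition")
    aspect_ratio = (right - left) / (bottom - top)   # width / height (assumed)
    # Block boundaries that split the box into rows x cols nearly equal blocks.
    r_edges = np.linspace(0, box.shape[0], rows + 1).astype(int)
    c_edges = np.linspace(0, box.shape[1], cols + 1).astype(int)
    block_means = [
        box[r_edges[r]:r_edges[r + 1], c_edges[c]:c_edges[c + 1]].mean()
        for r in range(rows)
        for c in range(cols)
    ]
    return np.array([aspect_ratio] + block_means)
```

With the default 3 by 4 partition the result has 13 entries, matching the length of the feature vector described in this section.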

3.4 Training Stage

The training stage involves running the Fuzzy C-Means algorithm on a set of exemplar hand gestures. During the training process, several hand gesture images are inserted into a database that covers all possible gestures. Variations are incorporated by slightly varying the hand configuration for each gesture. Variations in lighting are handled by preprocessing the image to reduce their effect. At least 25 samples of each of the 12 hand gestures in the language are taken to construct the training set. Every image receives an identification number and a feature vector, as described in Section 3.3. This information is also inserted into the database.

During the FCM clustering process, a membership vector is built for each training image and stored in the database. The membership vector is computed from (2) using the feature vector of the actual training image. A cost function is used to estimate how well the system performs: after all the images in the training set had been classified, equation (1) was used to find the best value of the parameter $m$ empirically.

3.5 Classification

Gestures performed by a user are classified using the highest membership value. If $x_{k'}$ is the feature vector of the current hand gesture image, its distance to each of the cluster centers $v_i$ is determined and used in (2) to calculate the membership values $\{u_{ik'};\ i = 1, \ldots, c\}$. The gesture is classified by finding $u_{i'k'} = \max\{u_{ik'};\ i = 1, \ldots, c\}$. A further test is made before recognition is established. This test depends on a recognition threshold $t$: if $u_{i'k'} \ge t$, the gesture is recognized as belonging to class $i'$; otherwise it is said to be unclassified, because all the membership values are too low. A recognition threshold of $t = 0.75$ was empirically determined to provide the best performance. Fig. 5 shows the “development interface” used for this test. Note the visualization of the membership values as a bar chart, whose peak value in this case corresponds to a successfully recognized "stop" gesture.
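
The classification rule can be sketched as below, assuming a Euclidean distance and $m = 2$ (the paper reports only that $m$ was chosen empirically). The function names `memberships` and `classify` and the two-cluster toy centers in the usage note are hypothetical.

```python
import numpy as np


def memberships(x, V, m=2.0):
    """Membership of feature vector x in each cluster, following update (2)."""
    d = np.linalg.norm(np.asarray(V, dtype=float) - np.asarray(x, dtype=float), axis=1)
    d = np.fmax(d, 1e-12)                 # guard against a zero distance
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum()


def classify(x, V, labels, t=0.75, m=2.0):
    """Return (label, membership); label is None when the gesture is unclassified."""
    u = memberships(x, V, m)
    best = int(np.argmax(u))
    if u[best] >= t:
        return labels[best], float(u[best])
    return None, float(u[best])
```

For example, with centers at (0, 0) and (10, 0) labeled "Stop" and "Home", a vector at (1, 0) has memberships 81/82 and 1/82 and is recognized as "Stop", while a vector at (5, 0) has memberships of 0.5 each and is left unclassified.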

 

Fig 5. Successful Recognition

4. Testing and Task Validation

4.1 Testing the Recognition System

An experiment was designed to test the recognition system. Twenty trials of each gesture were presented to the gesture recognition system. Two measures of performance, based on a distinction between classification and recognition, were used in the evaluation. A gesture $k'$ whose entire set of membership values, $\{u_{ik'};\ i = 1, \ldots, c\}$, is below the threshold value $t$ is said to be unclassified. A gesture $k'$ with at least one membership value at or above the threshold $t$ is said to be recognized as belonging to class $i'$ if $u_{i'k'} = \max\{u_{ik'};\ i = 1, \ldots, c\}$. We can now define the two performance measures as:

(a) Recognition Rate - The ratio of classified gestures to all gestures presented.

(b) Recognition Accuracy - The percent of the classified gestures recognized correctly.

Results indicate a recognition rate of 0.96 and a recognition accuracy of 100%. Unclassified gestures could be attributed either to the recognition system or to insufficient training of the user in forming the correct hand configuration for the intended gesture. All classified gestures were recognized correctly. Gestures 5 (Right) and 6 (Left) were the most commonly unclassified gestures, while gestures 4 (Back), 9 (Open Grip) and 11 (Home) were rarely unclassified. This information is useful for redesigning the gesture vocabulary.
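
The two measures can be computed from a list of trial outcomes as sketched below. The sketch assumes that the recognition rate is the fraction of presented gestures that were classified, which is the reading consistent with a reported value of 0.96, and that the accuracy is computed over classified gestures only. The function name `recognition_metrics` and the trial data in the usage note are hypothetical and are not the experimental results.

```python
def recognition_metrics(results):
    """Compute (recognition_rate, recognition_accuracy).

    results: list of (true_label, predicted_label) pairs, where
    predicted_label is None for an unclassified gesture.
    """
    if not results:
        raise ValueError("no trials given")
    classified = [(true, pred) for true, pred in results if pred is not None]
    rate = len(classified) / len(results)
    if not classified:
        return rate, 0.0
    correct = sum(1 for true, pred in classified if true == pred)
    return rate, correct / len(classified)
```

For instance, ten made-up trials of which eight are classified, all correctly, and two are unclassified give a rate of 0.8 and an accuracy of 1.0.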

4.2 Case Study

A case study was performed in which an operator used hand gestures to control the remote robot in real time. The task was to push a yellow wooden cube, located on top of a pile, into a container adjacent to it (Fig. 6).

 

Fig 6. A255 Robot, Plastic Cup Structure, and Yellow Wooden Box

An inexperienced operator performed ten identical experiments, and the performance time for each was recorded. The system is fast and accurate, and because the hand gesture language is simple, task completion time dropped quickly. As can be seen in Fig. 7, standard times were reached after four to six trials.

 

Fig 7. Learning Curve of the Hand Gesture System

5. Conclusions

This project described the design, implementation and testing of a telerobotic gesture-based user interface system using visual recognition. Two aspects of the problem were examined: the technical aspects of visual recognition of hand gestures in a laboratory environment, and the usability of an interface implemented at a remote client location. Experimental results showed that the system satisfies the requirements for a robust and user-friendly input device. The Fuzzy C-Means algorithm provided enough speed and reliability to perform the desired tasks. The case study demonstrated the importance of latency for telerobotic systems. Although gestures were recognized quickly and sent in packet form, successful execution of the commands could not be verified until the image of the robot environment was received at the user interface. This resulted in an overlapping effect: new gestures were sent before complete information on the present robot position had been received. Future research will be directed at solving this problem.

Acknowledgements

This project was supported by the Ministry of Defense MAFAT Grant No. 2647 and partially supported by the Paul Ivanier Center for Robotics Research & Production Management, Ben-Gurion University of the Negev.

References

[1] Katkere A., Hunter E., Kuramura D., Schlenzig J., Moezzi S. and Jain R. "ROBOGEST: Telepresence Using Hand Gestures". Technical report VCL-94-104, Visual Computing Laboratory, University of California, San Diego. 1994.

[2] Kortenkamp, D., Huber, E., and Bonasso, R. P. "Recognizing and Interpreting Gestures on a Mobile Robot". In AAAI96, 1996.

[3] Pavlovic V., Sharma R., and Huang T. "Visual Interpretation of Hand Gestures for Human Computer Interaction: A Review". IEEE PAMI, Vol. 19. pp. 677-695. 1997.

[4] Triesch J. and Malsburg C. V. D. "A Gesture Interface for Human-Robot Interaction". Proc. of 3rd IEEE Intl. Conf. on Automatic Face and Gesture Recognition, pp. 546-551. 1998.

[5] Bezdek J.C. "Cluster Validity with Fuzzy Sets". Cybernetics. Vol. 3, No. 3, 58-73. 1973.

