Electronic Letters on Computer Vision and Image Analysis 0(0):1-7, 2000
Abstract
The proposed work is part of a project that aims for the control of a videogame
based on hand gesture recognition. This goal imposes the requirements of real-time response and operation in unconstrained environments. In this paper we present a real-time
algorithm to track and recognise hand gestures for interacting with the
videogame. This algorithm is based on three main steps: hand segmentation, hand
tracking and gesture recognition from hand features. For the hand segmentation
step we use the colour cue due to the characteristic colour values of human
skin, its invariant properties and its computational simplicity. To prevent
errors from hand segmentation we add a second step, hand tracking. Tracking is
performed assuming a constant velocity model and using a pixel labeling
approach. From the tracking process we extract several hand features that are
fed to a finite state classifier which identifies the hand configuration. The
hand can be classified into one of the four gesture classes or one of the four
different movement directions. Finally, using the system's performance evaluation results we show the usability of the algorithm in a videogame environment.

Key Words: Hand Tracking, Gesture Recognition, Human-Computer Interaction, Perceptual User Interfaces.
Introduction
Nowadays, the majority of human-computer interaction (HCI) is based on mechanical devices such as keyboards, mice, joysticks or gamepads. In recent years there has been a growing interest in a class of methods based on computational vision due to their ability to recognise human gestures in a
natural way [1]. These methods use as input the images acquired from a camera
or from a stereo pair of cameras. The main goal of these algorithms is to
measure the hand configuration in each time instant. To facilitate this process
many gesture recognition applications resort to the use of uniquely coloured
gloves or markers on hands or fingers [2]. In addition, using a controlled
background makes it possible to localize the hand efficiently and even in
real-time [3]. These two conditions impose restrictions on the user and on the
interface setup. We have specifically avoided solutions that require coloured
gloves or markers
and a controlled background because of the initial requirements of our application.

Figure 1: Interactive game application workspace diagram.
It must work for different people, without requiring them to wear any accessories, and against unpredictable backgrounds. Our application uses images from a low-cost web camera placed in
front of the work area, see Fig. 1, where the recognised gestures act as the
input for a 3D computer videogame. Thus, rather than pressing buttons, the players must use different gestures that our application should recognise. This adds the constraint that the response time must be very short: users should not perceive a significant delay between the instant they perform a gesture or motion and the instant the computer responds. Therefore, the algorithm must provide real-time performance on a conventional processor. Most
of the known hand tracking and recognition algorithms do not meet this
requirement and are inappropriate for a visual interface. For instance, particle filtering-based algorithms can maintain multiple hypotheses at the same time to robustly track the hands, but they have high computational demands [4]. Recently, several works have been presented for reducing the complexity of particle filters, for example, using a deterministic process to help the random search [5]. However, these algorithms only work in real time when the hand occupies a small region of the image, whereas in our application the hand covers most of the image. In this paper we propose a real-time non-invasive hand tracking and
gesture recognition system. In the next sections we explain our method, divided into three main steps. The first step is hand segmentation, where the image region that contains the hand has to be located. Hand shape could be used for this purpose, but it varies greatly during natural hand motion [6]. Therefore, we choose skin-colour as the hand feature: skin-colour is a distinctive cue of hands and it is invariant to scale and rotation. The next step is to track the position and orientation of the hand to prevent errors in the segmentation phase. We use a pixel-based tracking approach for the temporal update of the hand state. In the last step we use the estimated hand state to extract several hand features that define a deterministic process of gesture recognition.
Finally, we present the system’s performance evaluation results that
prove that our method works well in unconstrained environments and for several
users.
Hand Segmentation
The hand must be localized in the image and segmented from the background
before recognition. Colour is the selected cue because of its computational simplicity, its invariance with respect to hand shape configurations and the characteristic values of human skin-colour. Also, the assumption that colour can be used as a cue to detect faces and hands has been proved in several publications [7, 8]. For our application, the hand segmentation has been carried out using a low computational cost method that performs well in real time. The method is based on a probabilistic model of the skin-colour pixel distribution. First, it is necessary to model the skin-colour of the user's hand. The user places part of his hand in a learning square as shown in Fig. 2. The pixels inside this area will be used for learning the model. Next, the selected pixels are transformed from the RGB-space to the HSL-space to extract the chroma information: hue and
saturation. We have encountered two problems in this step that have been solved in a pre-processing phase. The first one is that human skin hue values are very near to the red colour, that is, their value is very close to 0 or 2π radians, so it is difficult to learn the distribution because the angular nature of hue can produce samples at both limits. To solve this inconvenience, the hue values are rotated by π radians. The second problem in using the HSL-space arises when the saturation values are close to 0, because then the hue is unstable and can cause false detections. This can be avoided by discarding saturation values near 0.

Figure 2: Application interface and skin-colour learning square.
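As an illustration of this pre-processing step, a possible OpenCV implementation (a sketch under our assumptions, not the authors' original code) converts the learning patch to HLS, rotates the hue by π and discards low-saturation samples:

```cpp
#include <opencv2/imgproc.hpp>
#include <vector>

// Collect (hue, saturation) samples from the learning square.
// Hue is rotated by pi (90 in OpenCV's 0..179 hue range) to move skin tones
// away from the angular wrap-around; low-saturation pixels are discarded.
std::vector<cv::Vec2f> collectSkinSamples(const cv::Mat& bgrPatch, int minSaturation = 20)
{
    cv::Mat hls;
    cv::cvtColor(bgrPatch, hls, cv::COLOR_BGR2HLS);     // channels: H (0..179), L, S

    std::vector<cv::Vec2f> samples;
    for (int y = 0; y < hls.rows; ++y)
        for (int x = 0; x < hls.cols; ++x) {
            cv::Vec3b p = hls.at<cv::Vec3b>(y, x);
            int h = p[0], s = p[2];
            if (s < minSaturation) continue;             // unstable hue, skip
            int hRot = (h + 90) % 180;                   // rotate hue by pi radians
            samples.emplace_back(static_cast<float>(hRot), static_cast<float>(s));
        }
    return samples;
}
```

cv::Mat(samples).reshape(1) turns this vector into the N×2 matrix used by the model-fitting sketch below.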
Once the pre-processing phase has finished, the hue and saturation values of each selected pixel are used to infer the model, that is, $\mathbf{x} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)$, where $n$ is the number of samples and each sample is $\mathbf{x}_i = (h_i, s_i)$. After testing and comparing several statistical models, such as mixtures of Gaussians or discrete histograms, the best results have been obtained using a Gaussian model. The values of the parameters of the Gaussian model (mean $\bar{\mathbf{x}}$ and covariance matrix $\Sigma$) are computed from the sample set using standard maximum likelihood methods [9]. Once they are found, the probability that a new pixel, $\mathbf{x} = (h, s)$, is skin can be calculated as

$$P(\mathbf{x}\ \mathrm{is\ skin}) = \frac{1}{\sqrt{(2\pi)^{2}\,|\Sigma|}}\; e^{-\frac{1}{2}(\mathbf{x}-\bar{\mathbf{x}})\,\Sigma^{-1}\,(\mathbf{x}-\bar{\mathbf{x}})^{T}}. \qquad (1)$$
Finally, we obtain the blob representation of the hand by applying a connected components algorithm to the probability image, which groups neighbouring skin pixels into the same blob.
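Putting the pieces together, a hedged sketch of the segmentation stage follows: it fits the Gaussian model of Eq. (1) to the learning samples, classifies every pixel of a new frame by its skin probability, and labels the resulting connected components. The function name, the probability threshold and the minimum saturation value are assumptions for illustration, not values from the paper:

```cpp
#include <opencv2/imgproc.hpp>
#include <cmath>

// Fit the Gaussian skin model (mean, covariance) from the learning samples and
// build a skin mask following Eq. (1); connected components then yield the blobs.
cv::Mat skinBlobLabels(const cv::Mat& bgrFrame,
                       const cv::Mat& samples,          // Nx2 CV_32F rows of (h, s)
                       double probThreshold = 1e-4)
{
    cv::Mat covar, mean;
    cv::calcCovarMatrix(samples, covar, mean,
                        cv::COVAR_NORMAL | cv::COVAR_ROWS | cv::COVAR_SCALE, CV_64F);
    cv::Mat covarInv = covar.inv();
    double norm = 1.0 / (2.0 * CV_PI * std::sqrt(cv::determinant(covar)));

    cv::Mat hls;
    cv::cvtColor(bgrFrame, hls, cv::COLOR_BGR2HLS);

    cv::Mat skinMask(bgrFrame.size(), CV_8U, cv::Scalar(0));
    for (int y = 0; y < hls.rows; ++y)
        for (int x = 0; x < hls.cols; ++x) {
            cv::Vec3b p = hls.at<cv::Vec3b>(y, x);
            if (p[2] < 20) continue;                     // unstable hue, treat as non-skin
            double h = (p[0] + 90) % 180;                // same hue rotation as learning
            double s = p[2];
            cv::Mat d = (cv::Mat_<double>(1, 2) << h - mean.at<double>(0, 0),
                                                   s - mean.at<double>(0, 1));
            double m2 = cv::Mat(d * covarInv * d.t()).at<double>(0, 0);
            if (norm * std::exp(-0.5 * m2) > probThreshold)
                skinMask.at<uchar>(y, x) = 255;
        }

    cv::Mat labels;
    cv::connectedComponents(skinMask, labels);           // group skin pixels into blobs
    return labels;
}
```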
Tracking
USB cameras are known for the low quality images they produce. This fact can
cause errors in the hand segmentation process. In order to make the application
robust to these segmentation errors we add a tracking algorithm. This algorithm
tries to maintain and propagate the hand state over time.
The hand state is defined as $\mathbf{s} = (\mathbf{p}, \mathbf{w}, \alpha)$, where $\mathbf{p} = (p_x, p_y)$ is the hand position in the 2D image, $\mathbf{w} = (w, h)$ is the size of the hand in pixels, and $\alpha$ is the hand's angle in the 2D image plane. First, from the hand state at time $t$ we build a hypothesis of the hand state, $\mathbf{h} = (\mathbf{p}(t+1), \mathbf{w}(t), \alpha(t))$, for time $t+1$ by applying a simple second-order autoregressive process to the position component.
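Under the constant velocity assumption stated in the abstract, a second-order autoregressive predictor of this kind can be illustrated (as a sketch, not necessarily the exact formulation used here) by

$$\mathbf{p}(t+1) = \mathbf{p}(t) + \big(\mathbf{p}(t) - \mathbf{p}(t-1)\big) = 2\,\mathbf{p}(t) - \mathbf{p}(t-1),$$

that is, the hypothesis extrapolates the last observed displacement of the hand centre. The hypothesis is then compared with the segmented blobs through a pixel labeling step based on a distance measure, Eq. (5).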
This distance can be seen as an approximation of the distance from a point in the 2D space to a normalized ellipse (normalized means centred at the origin and not rotated). From the distance definition of (5) it follows that its value is less than or equal to 0 if $\mathbf{x}$ is inside the hypothesis $\mathbf{h}$, and greater than 0 if it is outside. Therefore, considering the hand hypothesis $\mathbf{h}$ and a point $\mathbf{x}$ belonging to a blob $b$, if the distance is less than or equal to 0 we conclude that the blob $b$ supports the existence of the hypothesis $\mathbf{h}$, and it is selected to represent the new hand state. This tracking process can also detect the presence or absence of the hand in the image [10].
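As a hedged illustration of this support test (the helper names and structure are assumptions, not the authors' code), the normalized-ellipse distance and the blob selection could be sketched as follows:

```cpp
#include <opencv2/core.hpp>
#include <cmath>
#include <vector>

// Hypothetical hand-state hypothesis: predicted centre, size at time t, angle at time t.
struct HandHypothesis {
    cv::Point2f centre;   // p(t+1)
    cv::Size2f  size;     // (w, h)
    float       angle;    // alpha, in radians
};

// Approximate distance from a point to the normalized ellipse of the hypothesis:
// <= 0 when the point falls inside, > 0 when it falls outside (cf. Eq. (5)).
float normalizedEllipseDistance(const HandHypothesis& h, const cv::Point2f& x)
{
    // Translate to the ellipse centre and undo the rotation.
    float dx = x.x - h.centre.x;
    float dy = x.y - h.centre.y;
    float c = std::cos(-h.angle), s = std::sin(-h.angle);
    float u = c * dx - s * dy;
    float v = s * dx + c * dy;
    float a = 0.5f * h.size.width;    // semi-axes
    float b = 0.5f * h.size.height;
    return (u * u) / (a * a) + (v * v) / (b * b) - 1.0f;
}

// A blob supports the hypothesis if at least one of its pixels lies inside.
bool blobSupportsHypothesis(const HandHypothesis& h, const std::vector<cv::Point2f>& blob)
{
    for (const cv::Point2f& p : blob)
        if (normalizedEllipseDistance(h, p) <= 0.0f) return true;
    return false;
}
```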
Gesture Recognition
Our gesture alphabet consists of four hand gestures and four hand directions in order to fulfil the application's requirements. The hand gestures correspond to a fully opened hand (with separated fingers), an opened hand with fingers together, a fist, and a last gesture that appears when the hand is not visible, partially or completely, in the camera's field of view. These gestures are defined as Start, Move, Stop and No-Hand, respectively. Also, when the user is in the Move gesture, he can carry out Left, Right, Front and Back movements. For the Left and Right movements, the user rotates his wrist to the left or right. For the Front and Back movements, the hand gets closer to or further from the camera. Finally, the
valid hand gesture transitions that the user can carry out are defined in Fig.
3. The process of gesture recognition starts when the user's hand is placed in the camera's field of view with the hand in the Start gesture, that is, fully opened with separated fingers. To avoid unintended fast hand gesture changes, every change must remain stable for 5 frames; otherwise, the hand gesture does not change from the previously recognised gesture. To achieve this gesture recognition, we use the hand state estimated in the tracking process, that is, $\mathbf{s} = (\mathbf{p}, \mathbf{w}, \alpha)$. This state can be viewed as an ellipse approximation of the hand, where $\mathbf{p} = (p_x, p_y)$ is the ellipse centre and $\mathbf{w} = (w, h)$ is the size of the ellipse in pixels. To facilitate the process, we define the major axis length as $M$ and the minor axis length as $m$.
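One possible way to obtain these ellipse parameters from the segmented hand blob is OpenCV's ellipse fitting; the following sketch is an assumption about the implementation, not the authors' original code:

```cpp
#include <opencv2/imgproc.hpp>
#include <algorithm>
#include <vector>

// Fit an ellipse to the hand blob contour and derive the major/minor axis
// lengths M and m and the orientation alpha used by the gesture classifier.
void handEllipseFeatures(const std::vector<cv::Point>& handContour,
                         float& M, float& m, float& alpha)
{
    cv::RotatedRect e = cv::fitEllipse(handContour);   // needs >= 5 contour points
    M = std::max(e.size.width, e.size.height);
    m = std::min(e.size.width, e.size.height);
    alpha = e.angle;                                    // OpenCV reports degrees
}
```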
Figure 3: Gesture alphabet and valid gesture transitions.

In addition, from the hand's contour and the hand's convex hull we can calculate a sequence of contour points between two consecutive convex hull vertices. This sequence forms the so-called convexity defect, and it is possible to compute the depth of the i-th convexity defect, $d_i$. From these depths, some useful characteristics of the hand shape can be derived, like the depth average $\bar{d}$,

$$\bar{d} = \frac{1}{n} \sum_{i=0..n} d_i, \qquad (6)$$

where $n$ is the total number of convexity defects in the hand's contour, see Fig. 4.
Figure 4: Extracted features for the hand gesture recognition. In the right image, $u_i$ and $v_i$ indicate the start and end points of the i-th convexity defect; the depth, $d_i$, is the distance from the farthest point of the convexity defect to the convex hull segment.
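The depth average of Eq. (6) can be computed directly from OpenCV's convex hull and convexity defect routines; the sketch below is illustrative (function name and structure are assumed), not the authors' original implementation:

```cpp
#include <opencv2/imgproc.hpp>
#include <vector>

// Compute the convex hull of the hand contour, extract the convexity defects and
// average their depths d_i. OpenCV stores each depth in fixed-point format,
// hence the division by 256.
double averageDefectDepth(const std::vector<cv::Point>& handContour)
{
    std::vector<int> hullIdx;
    cv::convexHull(handContour, hullIdx, false, false);    // hull as contour indices

    std::vector<cv::Vec4i> defects;                         // [start, end, farthest, depth*256]
    if (hullIdx.size() > 3)
        cv::convexityDefects(handContour, hullIdx, defects);
    if (defects.empty()) return 0.0;

    double sum = 0.0;
    for (const cv::Vec4i& d : defects)
        sum += d[3] / 256.0;                                // depth of the i-th defect
    return sum / defects.size();                            // d-bar in Eq. (6)
}
```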
The first step of the gesture recognition process is to model the Start gesture. The average depth of the convexity defects of an opened hand with separated fingers is larger than that of an open hand with no separated fingers or of a fist. This characteristic is used to differentiate the following hand gesture transitions: from Stop to Start, from Start to Move, and from No-Hand to Start. However, first it is necessary to compute the Start gesture characteristic, $T_{start}$. Once the user is correctly placed in the camera field of view with the hand widely opened for learning his skin-colour, the system also computes the Start gesture characteristic over the first $n$ frames:
$$T_{start} = \frac{1}{n} \sum_{t=0..n} \frac{\bar{d}(t)}{2}. \qquad (7)$$
After the recognition of the Start gesture, the most likely valid gesture change is the Move gesture. Then, if the current average defect depth is less than $T_{start}$, the system switches to the Move hand gesture. If the current hand gesture is Move, the hand directions are enabled: Front, Back, Left and Right. If the user does not want to move in any direction, he should keep his hand in the Move state. The first time that the Move gesture appears, the system computes the Move gesture characteristic, $T_{move}$, which is an average of the approximated area of the hand over $n$ consecutive frames:
$$T_{move} = \frac{1}{n} \sum_{t=0..n} M(t)\, m(t). \qquad (8)$$
For recognising the Left and Right directions, the calculated angle of the fitted ellipse is used. To prevent undesired jitter effects in the orientation, we define a constant $T_{jitter}$. Then, if the angle of the ellipse that circumscribes the hand, $\alpha$, satisfies $\alpha > T_{jitter}$, the Left orientation is set; if it satisfies $\alpha < -T_{jitter}$, the Right orientation is set. For controlling the Front and Back orientations, and for returning to the Move gesture, the hand must not be rotated, and the Move gesture characteristic is
used to differentiate these movements. If $T_{move}\, C_{front} < M\, m$ holds, the hand orientation is set to Front. The Back orientation is set if $M\, m < T_{move}\, C_{back}$.
The Stop gesture is recognised using the ellipse's axes: when the hand is in a fist, the fitted ellipse is almost a circle and $m$ and $M$ are practically the same, that is, $M - m < C_{stop}$. $C_{front}$, $C_{back}$ and $C_{stop}$ are predefined constants established during the algorithm performance evaluation. Finally, the No-Hand state appears when the system does not detect the hand, when the size of the detected hand is not large enough, or when the hand is at the limits of the camera field of view. The next possible hand state is the Start gesture, and it is detected using the transition procedure from Stop to Start explained above. Some examples of gesture transitions and the recognised gesture results can be seen in Fig. 5. A correct learning of the skin-colour is very important; otherwise, problems with the detection and the gesture recognition can be encountered. One of the main problems in using the application is keeping the hand within the camera's field of view without touching the limits of the capture area. This problem has been shown to disappear with user training.
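To summarise the classifier described above, the following hypothetical sketch condenses the transitions into a finite-state function; the feature and threshold names mirror the text, while the structure, helper types and the way features are passed in are assumptions:

```cpp
enum class Gesture { NoHand, Start, Move, Stop, Left, Right, Front, Back };

// Hypothetical per-frame features extracted from the tracked hand state.
struct HandFeatures {
    bool   handVisible;      // false -> No-Hand
    double defectDepthAvg;   // d-bar, Eq. (6)
    double M, m;             // ellipse axis lengths
    double alpha;            // ellipse angle
};

// Thresholds learned or tuned as described in the text (values are placeholders).
struct Thresholds {
    double Tstart, Tmove, Tjitter, Cfront, Cback, Cstop;
};

Gesture classify(Gesture previous, const HandFeatures& f, const Thresholds& t)
{
    if (!f.handVisible) return Gesture::NoHand;
    double area = f.M * f.m;

    switch (previous) {
    case Gesture::NoHand:
    case Gesture::Stop:
        // An opened hand with separated fingers has deep convexity defects.
        return (f.defectDepthAvg >= t.Tstart) ? Gesture::Start : previous;
    case Gesture::Start:
        return (f.defectDepthAvg < t.Tstart) ? Gesture::Move : Gesture::Start;
    default:  // Move and the four directions
        if (f.M - f.m < t.Cstop)        return Gesture::Stop;   // fist: nearly circular
        if (f.alpha >  t.Tjitter)       return Gesture::Left;
        if (f.alpha < -t.Tjitter)       return Gesture::Right;
        if (area > t.Tmove * t.Cfront)  return Gesture::Front;  // hand closer: larger area
        if (area < t.Tmove * t.Cback)   return Gesture::Back;   // hand further: smaller area
        return Gesture::Move;
    }
}
```

The 5-frame stability requirement described above would be applied on top of this function, accepting a new gesture only after it has been returned for five consecutive frames.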
Figure 5: Hand tracking and gesture recognition examples.
System's performance evaluation
In this section we describe the accuracy of our hand
tracking and gesture recognition algorithm. The application has been
implemented in Visual C++ using the OpenCV libraries [11]. The application has been tested on a Pentium IV running at 1.8 GHz. The images have been captured
using a Logitech Messenger WebCam with USB connection. The camera provides
320x240 images at a capture and processing rate of 30 frames per second.
Figure 6: System's performance evaluation results: for each hand gesture (S: Start, M: Move, L: Left, R: Right, F: Front, B: Back, P: Stop, N: No-Hand), the number of correctly recognised gestures is shown next to the total number of gestures performed.
For the performance evaluation of the hand tracking
and gesture recognition, the system has been tested on a set of 24 users. Each
user has performed a predefined set of 40 movements and therefore we have 960
gestures to evaluate the application results. It is natural to measure the system's accuracy by checking how well the user movements needed to manage the videogame are recognised. This sequence includes all the possible application states and transitions. Figure 6 shows the performance evaluation results. These results are represented using a two-dimensional matrix with the application states as columns and the number of appearances of each gesture as rows. The columns are paired for each gesture: the first column is the number of tests in which the gesture has been correctly identified; the second column is the total number of times that the gesture has been carried out. As can be seen in Fig. 6, the hand gesture recognition works correctly in 99% of the cases.
Conclusions
In this paper we have presented a real-time algorithm to track and recognise hand gestures for human-computer interaction within the context of videogames.
We have proposed an algorithm based on hand segmentation, hand tracking and
gesture recognition from extracted hand features. The system’s performance
evaluation results have shown that this low-cost interface can be used to replace traditional interaction metaphors. The experiments have confirmed that continuous training of the users results in higher skill and, thus, better performance.
Acknowledgements
The projects TIC2003-0931 and TIC2002-10743-E of the MCYT Spanish Government and the European Project HUMODAN 2001-32202 from the EU V Program-IST have subsidized this work. J. Varona acknowledges the support of a Ramon y Cajal fellowship from the Spanish MEC.
References
[1] V.I. Pavlovic, R. Sharma, T.S. Huang, "Visual interpretation of hand gestures for human-computer interaction: a review", IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7): 677-695, 1997.
[2] R. Bowden, D. Windridge, T. Kadir, A. Zisserman, M. Brady, "A Linguistic Feature Vector for the Visual Interpretation of Sign Language", in Tomas Pajdla, Jiri Matas (Eds.), Proc. European Conference on Computer Vision, ECCV04, v. 1: 391-401, LNCS 3022, Springer-Verlag, 2004.
[3] J. Segen, S. Kumar, "Shadow gestures: 3D hand pose estimation using a single camera", Proc. of the Computer Vision and Pattern Recognition Conference, CVPR99, v. 1: 485, 1999.
[4] M. Isard, A. Blake, "ICONDENSATION: Unifying low-level and high-level tracking in a stochastic framework", Proc. European Conference on Computer Vision, ECCV98, pp. 893-908, 1998.
[5] C. Shan, Y. Wei, T. Tan, F. Ojardias, "Real time hand tracking by combining particle filtering and mean shift", Proc. Sixth IEEE Automatic Face and Gesture Recognition, FG04, pp. 229-674, 2004.
[6] T. Heap, D. Hogg, "Wormholes in shape space: tracking through discontinuous changes in shape", Proc. Sixth International Conference on Computer Vision, ICCV98, pp. 344-349, 1998.
[7] G.R. Bradski, "Computer video face tracking for use in a perceptual user interface", Intel Technology Journal, Q2'98, 1998.
[8] D. Comaniciu, V. Ramesh, "Robust detection and tracking of human faces with an active camera", Proc. of the Third IEEE International Workshop on Visual Surveillance, pp. 11-18, 2000.
[9] C.M. Bishop, Neural Networks for Pattern Recognition, Clarendon Press, 1995.
[10] J. Varona, J.M. Buades, F.J. Perales, "Hands and face tracking for VR applications", Computers & Graphics, 29(2), 2005.
[11] G.R. Bradski, V. Pisarevsky, "Intel's Computer Vision Library", Proc. of IEEE Conference on Computer Vision and Pattern Recognition, CVPR00, v. 2: 796-797, 2000.