Technical Report 96-1
Research on human movement behaviour reviewed in the context of hand centred input.
Prepared by: Axel Mulder, School of Kinesiology, Simon Fraser University, February 1996
Acknowledgement: This work was supported in part by a strategic grant from the Natural Sciences and Engineering Research Council of Canada.
© Copyright 1996 Simon Fraser University. All rights reserved.
This paper focusses on the design issues involved in implementing human
computer communication by means of full hand movements, i.e. based on hand
position and shape.
A variety of terms for describing and defining hand movements are examined, to
gain a better insight into possible ways for classification. Hand movements can
be grouped according to function as semiotic, ergotic or epistemic
(Cadoz, 1994). Semiotic hand movements can be classified as iconic, metaphoric,
deictic, beat-like (McNeill, 1992) and, according to their
linguisticity, as gesticulation, language-like, pantomime, emblematic or
as sign language (Kendon, 1988). Human communication comes in many
modalities, including speech, gestures, and facial and bodily expressions,
which appear to cooperate closely in conveying some or all aspects of an
expression, such as its temporo-spatial, visual, structural and emotional
aspects. Thus, human communication is not only symbolic: emotional aspects of
an expression modulate its other aspects.
Research efforts in the design of gestural interfaces and other types of input
devices which capture hand shape and position are reviewed. The incorporation
of research results from human movement behaviour as listed above is gradually
taking place, although there are still tendencies to ignore the importance of
these findings. Difficulties in the design and development of gestural
interfaces are discussed taking some of these findings into account. Issues
discussed include hand movement tracking needs, context detection, gesture
segmentation, feature description and gesture identification. The
identification of a method to define a set of standard gestures is
addressed.
This paper is a reflection of an ongoing effort to examine results of research
into human communication through movement to benefit the design and development
of computer interfaces that more adequately capture such forms of human
communication. Human communication comes in many modalities, including
speech, gestures, facial and bodily expressions. A variety of forms of
expression, such as poetry, sign language, mimicry, music and dance, exploit
specific capacities of one or more of these modalities. This paper focusses on
the design issues involved in implementing human computer communication by
means of hand movements.
To refresh the mind let us look at a random list of examples of hand
movements:
A brief discussion of the word gesture and its possible meanings is
appropriate. Gesture has been used in place of posture and vice versa. The
tendency however, is to see gesture as dynamic and posture as static. In
prosaic and poetic literature, gesture is often used to mean an initiation or
conclusion of some human interaction, where no human movement may be involved.
The notion of a musical gesture without actual human movement is quite common.
Obviously, musical expression is intimately connected with human movement,
hence the existence of such idioms.
In this paper, a hand gesture and hand movement are both defined as the motions
of fingers, hands and arms. Hand posture is defined as the position of the hand
and fingers at one instant in time. However, hand posture and gesture describe
situations where hands are used as a means to communicate to either machine or
human. Empty-handed gestures and free-hand gestures are generally used to
indicate use of the hands for communication purposes without physical
manipulation of any object.
The motivation for these definitions will become apparent in the course of this
paper, while slight nuances will also be added.
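These definitions map naturally onto simple data structures. The following
sketch (in Python; all names are hypothetical and merely restate the
definitions above, assuming some tracking device supplies hand position,
orientation and finger joint angles) represents a hand posture as a snapshot
at one instant and a hand movement as a timed sequence of such snapshots:

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class HandPosture:
        """The position of the hand and fingers at one instant in time."""
        timestamp: float                          # seconds
        position: Tuple[float, float, float]      # hand position (x, y, z)
        orientation: Tuple[float, float, float]   # roll, pitch, yaw
        joint_angles: List[float]                 # finger joint angles

    @dataclass
    class HandMovement:
        """The motions of fingers, hands and arms: a posture sequence."""
        samples: List[HandPosture]

        def duration(self) -> float:
            return self.samples[-1].timestamp - self.samples[0].timestamp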
The spoken English language has over the ages incorporated a number of
expressions and words that signify hand actions and gestures. The mere fact
that these words and expressions exist indicates that the goals they identify,
if not necessarily the corresponding hand actions and gestures, are common in
daily life. McNeill (1992) pointed out that gestures are not equivalent to
speech, but that gestures and speech complement each other (this will be
further discussed below). In other words, the speech modality may have
developed such that certain messages can only be expressed using gestures.
Consequently, the available verbal language may not represent a number of
common and/or important gestures. In Appendix A a list of words describing
hand movements is given. The list can be divided into a number of groups:
It is almost immediately clear from the above discussions that hand movements
can be divided into two major groups, one involving communication (such as
empty handed gestures), the other involving manipulation and prehension.
Somewhat in between lie the hand movements identified as haptic exploration
actions. Similar considerations must have led Cadoz (1994) to classify hand
movements according to their function:
1. semiotic: hand movements that communicate meaningful information
2. ergotic: hand movements that manipulate the physical world
3. epistemic: hand movements that explore the environment through touch
Similarly, Thieffry, in Malek et al (1981), classifies hand movements as
transitive or intransitive:
Intransitive hand movements or gestures have a universal language value,
especially for the expression of affective and aesthetic ideas. Such gestures
can be indicative, exhortative, imperative or rejective, among others. The
gesture alone expresses fully the intention and motivation of its author. These
gestures could be equally classified as Cadoz's semiotic hand movements.
Cadoz's semiotic hand movements (or gestures) and his ergotic hand movements
can each be classified further.
Many researchers consider gestures, or semiotic hand movements, as intimately
connected with speech and some conclude that speech is complementary to
gesture.
McNeill (1992) compares his classification scheme with those of a number of
other researchers (Efron, Freedman, Hoffman, Ekman and Friesen) and concludes
that all use very similar categories. He classifies gestures as follows:
1. iconic gestures, which depict concrete objects or events
2. metaphoric gestures, which depict abstract content
3. deictic gestures, which point at objects, locations or concepts
4. beat gestures, which mark the rhythm of the speech
Kendon (1988) classifies gestures along a continuum, discussed in more depth
below:
It seems likely that the oldest purpose of our hands is to manipulate the
physical world so that it better suits our needs. For objects of a size on the
order of our hands, we can change the object's position, orientation and shape.
Objects can be solid, fluid or gaseous. Ergotic hand movements can therefore be
classified according to physical characteristics:
It is more common to classify ergotic hand movements according to their
function, i.e. as either prehensile or non-prehensile. Non-prehensile movements
include pushing, lifting, tapping and punching. MacKenzie (1994) defines
prehension as "the application of functionally effective forces by the hand
to an object for a task, given numerous constraints". While various
taxonomies exist, one readily recognizable classification scheme (Napier, 1993)
identifies a prehensile movement as either a power grip or a precision grip.
Pressing (1991) lists some more ways to classify ergotic hand movements:
Kimura (1993) and others have pointed out that there is evidence that
(hand)gestures preceded speech in the evolution of communication systems
amongst hominids. This finding supports the modeling of gesture and speech as
forms of expression generated by a system where formalized linguistic
representation is not the main form from which gestures are derived. Instead,
it is conjectured by McNeill (1992) that gestures and speech are an integrated
form of expression of utterances where speech and gestures are
complementary.
Many have investigated the relation between human gestures and speech. Kendon
(1988) ordered gestures of varying nature along a continuum of
"linguisticity":
Gesticulation - Language-like gestures - Pantomimes - Emblems - Sign
languages
Observe that while going from gesticulation to sign languages:
In an effort to further define the underlying structure of gestures, McNeill,
Levy and Pedelty (1990) propose a diagram (based upon Kendon's work) that
clarifies the relations between the units at each level of the speaker's
gestural discourse. Each unit consists of one or more of the units of a (higher
numbered) level:
McNeill (1992) concluded that there is no body "language", but that instead
gestures complement spoken language. In Kendon's (1980) words: "the phrases
of gesticulation that co-occur with speech are not to be thought of either as
mere embellishments of expression or as by-products of the speech process. They
are rather, an alternate manifestation of the process by which ideas are
encoded into patterns of behaviour which can be apprehended by others as
reportive of those ideas." Such hand movements voluntarily but also
involuntarily convey extra information, besides speech, about the internal
mental processes of the speaker. Obviously, McNeill is concerned with gestures
similar to gesticulation as defined in Kendon's continuum. McNeill supports his
conclusion above by finding that gesticulation-type gestures have the following
non-linguistic properties:
Examples of sign languages are American Sign Language (ASL) and the Deaf and
Dumb Language. ASL is an amalgam that incorporates elements of French Sign
Language. Other systems of formally coded hand and arm signals include pidgin
or creole sign languages, and a gesture language used by the women of the
Warlpiri, an Aboriginal people living in the north central Australian desert.
In ASL, the prevalent form of signing consists of unilateral or bilateral
series of movements usually involving the whole arm. Typically, a particular
hand shape is moved through a pattern in a location specified with respect to
the body. Each sign roughly corresponds to a concept such as a thing or an
event, but there is not necessarily an exact equivalence with English words or
with words of any spoken language. A native sign language like ASL is therefore
quite different from a manual depiction of spoken language, such as Signed
English, or from co-verbal gesticulation.
A manual sign is claimed to be distinguished from other signs by four features
(Stokoe, 1980):
1. the shape (configuration) of the hand
2. the location of the hand relative to the body
3. the movement of the hand
4. the orientation of the hand
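As an illustration of how these four features could serve machine recognition,
the sketch below (hypothetical names and lexicon entries; it assumes earlier
processing stages have already discretized each feature into symbolic
categories) represents a manual sign as a four-feature record and identifies
an observed sign by counting feature matches against a small lexicon:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ManualSign:
        handshape: str    # e.g. "flat", "fist", "index"
        location: str     # body-centred, e.g. "chin", "chest"
        movement: str     # e.g. "forward", "circular"
        orientation: str  # palm orientation, e.g. "palm-up"

    # Hypothetical two-entry lexicon, for illustration only.
    LEXICON = {
        ManualSign("flat", "chin", "forward", "palm-up"): "thank-you",
        ManualSign("fist", "chest", "circular", "palm-in"): "sorry",
    }

    def identify(sign: ManualSign) -> str:
        """Return the gloss whose four features best match the input."""
        def score(ref: ManualSign) -> int:
            return sum(getattr(ref, f) == getattr(sign, f)
                       for f in ("handshape", "location",
                                 "movement", "orientation"))
        return LEXICON[max(LEXICON, key=score)]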
The relation between the semiotic and ergotic functions of hand movements, how
they differ, and their possible (mutual) dependencies in terms of neuro-motor
systems have barely been researched. Kimura (1993) suggests that manual praxis
is essential for signing; however, manual praxis and signing are not identical.
This suggests a model where gestural communication is a higher level in the
hierarchy of systems involved in the creation of hand movements. As discussed
in the section on ergotic hand movements, it appears that the semiotic function
is always present, whether consciously intended or not. This can be explained
by remembering that for communication a sender and receiver are needed. The
receiver can always decide to interpret signals that the sender unintentionally
has sent. Emotion functions in human communication as a means for
modulating the semiotic content, such as emphasis. The expression of
emotions during ergotic movements adds some semiotic content to the actions so
that the hand movements are differently interpreted by an observer. This
procedure may be used, either by the sender (by amplifying the emotional
content, such as the dropping of items to draw attention to an issue or a
problem) or the receiver (by amplifying the focus on emotional content, such as
the initial remarks in conversation, as in "you seem rather tense today ...
anything wrong?") to initiate a communication with more semiotic content,
usually of verbal nature. Perhaps emotions could therefore be seen as a bridge
between ergotic and semiotic movements.
As far as cortico-spinal systems are concerned, arm, hand and finger movements
are controlled contralaterally, while arm and shoulder movements may also be
controlled ipsilaterally, i.e. proximal movements can be controlled contra- as
well as ipsilaterally, while distal movements are only controlled
contralaterally. The left hemisphere is specialized for complex movement
programming, i.e. manual praxis. Consequently, movements involving identical
commands to the two limbs, i.e. motor commands resulting in mirror image
movements, whether they are temporally coinciding or not, are more frequent in
natural gestural communication. In contrast, different, simultaneously
expressed hand postures occur frequently due to the more disparate control of
the distal musculature. The fact that the manual praxis system is left
hemisphere based is also thought to be the origin of the prevalent
right-handedness and is supported by findings such as the preference of
signers to use the right hand if only one hand can be used and the correlation
between sign language aphasia, manual apraxia and left hemisphere damage.
Although the left hemisphere is essential for selecting movements, the left
hand is thought to be better in executing independent finger movements than the
right hand (Kimura, 1993).
Since the human brain exhibits different capabilities in the right and left
hemispheres, differences can be expected between the capabilities of the left
and right hands. This is not as noticeable at the physical level, i.e. in
ergotic hand movements, but is more apparent in communicating with the hands.
Our left
hand movements, but more apparent in communicating with the hands. Our left
hand, it can be speculated, may perform better in expressing holistic concepts and
dealing with spatial features of the environment, while our right hand may perform
better in communicating symbolic information.
Napier (1993) discusses handedness and suggests that the dominance of right
handedness (on the order of 90% of people are right handed) gradually evolved
from a slight left handedness in nonhuman primates, perhaps under the influence
of social and cultural pressures rather than because the capabilities of the
right hand supersede those of the left. Although exceptional, some humans are
known to be almost perfectly ambidextrous.
From the above, a clear relation can be seen between gestures and speech as
well as body posture. Napier (1993) explicitly includes facial expressions and
bodily movements when examining the use of gestures as:
Human communication consists of a number of perceptually distinguishable
channels which operate through different modalities. It appears that a single
system underlies this communication which can direct aspects of the expression
through one modality while using other modalities for other aspects of the
expression. Each modality has intrinsic limitations due, for example, to
constraints of a musculo-skeletal and neuro-motor nature. In co-verbal
gesticulation, for instance, one such aspect is the structured content of the
expression as present in linguistic forms. Both hand gestures and speech can be
used to
express such linguistic forms. Further research is needed to establish more
detail in these aspects of human communication and how they are distributed
amongst the various modalities and under which conditions. It can be easily
observed that, in humans with speaking ability, speech is the most efficient
channel for structural aspects (relating concepts and identifying how) of
expression, while temporo-spatial and visual representation aspects
(identification of where, when, which and what) are more easily conveyed
through hand gestures. Perhaps posture, facial expression as well as gestures
are best applied for communication of emotional aspects (modulation of
concepts).
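The suggested division of labour can be summarized in a small lookup structure.
The sketch below merely restates the observations of this paragraph; the
assignments are speculative, not empirically validated:

    # Which modality most efficiently conveys which aspect of an expression.
    # A speculative restatement of the observations above, not validated data.
    ASPECT_TO_MODALITY = {
        "structural (relating concepts, how)":  "speech",
        "temporo-spatial (where, when)":        "hand gestures",
        "visual representation (which, what)":  "hand gestures",
        "emotional (modulation of concepts)":   "posture, facial expression,"
                                                " gestures",
    }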
In the HCI literature the word gesture has been used to identify many types of
hand movements for control of computer processes. Perhaps to avoid confusion
Sturman (1993) defines whole hand input as "the full and direct use of
the hand's capabilities for the control of computer-mediated tasks", thereby
making a more precise indication of which type of human movement is involved
(i.e. not just positioning but also hand shape) as well as for which purpose
they are applied. By using the word input the association is made with
information theory, conceiving of hands as devices which output information to
be received and interpreted by another device. Although the link is made with
the semiotic function of hand movements, the hand is really a communication
device, i.e. it can both receive and send information. The limitation to
capabilities of the hand only, however, excludes the suggestion that semiotic
hand movements should be considered part of an integrated system for human
expression. In fact, "capabilities of the hand" strongly suggests a focus
on musculo-skeletal capabilities, since those are literally the capabilities of
the hand. As such, the definition overlooks the neuro-motor systems and
cognitive abilities involved in hand movements. Somewhat more appropriate, hand
centred input, as a shorthand for human computer interaction involving hand
movements, emphasizes the context as created by other modalities. However, it
is not specific as to the type of hand movements; gesturing with a mouse,
empty-handed gestures or movements with a joystick are all included.
None of these terms address the distinction between semiotic and ergotic hand
movements, let alone refer to the existence of epistemic hand movements. In the
following, the focus will be on the use of hand movements to their full extent
in computer-mediated tasks, i.e. imposing as few limitations on the hand
movements as the current state of the art in movement tracking technology
permits. Such applications of hand movements will be called whole hand centred
input. To include a reference to epistemic hand movements, input
could be replaced by communication. A hand gesture interface will
mean an HCI system employing whole hand centred input which specifically
exploits the semiotic function of hand movements. Certain applications may
however not exploit the full capabilities of the hand. Mouse gesturing and
joystick controlling amongst others will not be examined.
The following applications that mainly exploit the ergotic function of hand
movements have been found in the literature:
The following applications that mainly exploit the semiotic function of hand
movements have been found in the literature:
Often, applications only implement the use of hand signs or postures. In the
case of Pook (1995), the use of hand signs which indicate to the robot that it
should execute a movement pattern to fulfill a task is questionable, since such
a mapping could be more easily implemented using simple function keys on a
keyboard. The value of the approach is that it allows the user to act more
naturally, since no cognitive effort is required in mapping function keys to
robotic hand actions. Applications involving navigation essentially implement
deictic gestures. The applications involving sign language are technically
impressive but, since they basically implement a nicely formalized system of
human communication of use only to a relatively small (but important) group of
people, they do not provide us with many clues as to the interpretation,
definition or modeling of gestures in computer-mediated tasks.
The research into the use of hand gestures with speech (or speech with
gestures) has gained special attention. Cavazza (1995), Wexelblatt (1995),
Hauptmann & McAvinney (1993), Sparrel (1993), Cassell et al (1994) all
implement gestural communication theory as developed by McNeill, Kendon,
Birdwhistell and others. Sparrel (1993) implemented a system for the
interpretation of co-verbal iconic gestures as defined by McNeill (see above).
Multimodal interaction, involving not only hand gestures and speech, but also
facial expressions and body posture is another distinguishable subject
researched by Gao (1995), Maggioni (1995), Hataoka et al (1995) and Bohm (1995)
amongst others. From a human behaviour researcher's point of view these
research efforts are generally rather poor: they simply build a system which is
able to detect each modality and then check the information from the modalities
against each other, so that each modality is basically interpreted as
communicating the same information. Little or no effort is made to integrate
the system's abilities using a somewhat sophisticated model of human
expression, e.g. one where, as discussed above, each modality is deemed to
express specific aspects of the expression, which cannot simply be used to
confirm the correct interpretation of another modality.
Using the above classification of and investigation into hand movements, we can
now proceed with evaluating and analysing how hand movements have been used in
processes mediated through computers. Such an evaluation is of interest since
the following problems in the design and implementation of whole hand centred
input applications have remained unsolved:
1. hand movement tracking needs
2. context detection
3. gesture segmentation
4. feature description
5. gesture identification
6. the definition of a set of standard gestures
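These problems correspond to successive stages of a whole hand centred input
system. The stub pipeline below (hypothetical names; every stage is left
unimplemented) only fixes the order in which the problems arise: tracking feeds
segmentation, segments are reduced to features, and identification consults
detected context:

    from typing import Any, List, Optional

    def track(sensor_frames: List[Any]) -> List[Any]:
        """Hand movement tracking: estimate hand position/shape per frame."""
        raise NotImplementedError

    def segment(postures: List[Any]) -> List[List[Any]]:
        """Gesture segmentation: cut the posture stream into candidates."""
        raise NotImplementedError

    def describe(gesture: List[Any]) -> List[float]:
        """Feature description: reduce one candidate to a feature vector."""
        raise NotImplementedError

    def identify(features: List[float], context: Optional[dict]) -> str:
        """Gesture identification, informed by context (speech, task state)."""
        raise NotImplementedError

    def interpret(sensor_frames: List[Any],
                  context: Optional[dict] = None) -> List[str]:
        return [identify(describe(g), context)
                for g in segment(track(sensor_frames))]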
From the many aspects discussed in the foregoing, some basic comments can be
made towards better design of hand gesture interfaces:
Many applications can be criticized for their idiosyncratic choice of hand
gestures or postures to control or direct the computer-mediated task (Baudel,
Harrison). However, the choice was probably perfectly natural for the developer
of the application! This shows the dependence of gestures on their cultural
and social environment. For specialized, frequent tasks, where the learning of
a particular set of gestures and postures is worth the investment, such
applications may have a value. In everyday life, however, it is quite unlikely
that users will be interested in a device for which they have to learn some
specific set of gestures and postures, unless there is an obvious increase in
efficiency or ease of use over existing methods of hand centred input in the
adoption of such a gestural protocol. On the other hand, the economics of the
marketplace may dictate such a set independent of its compatibility with
existing cultural and/or social standards, just like the keyboard and mouse
have set a standard. Especially when users are allowed to expand or create
their own sets such a protocol may gain some acceptance.
The problem in defining a possible standard for a gesture set is that it is
very easy to let the graphical user interface dictate which type of hand
movements are required to complete certain computer-mediated tasks. However,
the computer can be programmed to present tasks in a variety of graphical and
auditory ways. The aim is to make the computer and its peripherals transparent,
meaning that the tasks are presented in a way that execution of these tasks is
most natural. Unfortunately, while the concept of naturalness may apply to
ergotic hand movements, semiotic hand movements prohibit such naturalness
across all of humankind, unless the system knows which culture the user belongs
to.
Zimmerman, in a personal communication, suggested that only seven parameters
are needed to describe hand movements for the bulk of the applications: hand
position and orientation and hand grip aperture, where the last parameter could
be just binary, i.e. open/close. Such a standard seems to put a lot of emphasis
on only the ergotic function of hand movements and will severely restrict
access to the semiotic function. Augustine Su (1994) suggests touching,
pointing and gripping as a minimal set of gestures that need to be
distinguished. At least these include a reference to the tasks involved, but
they make no reference to the semiotic function of hand movements. An analysis
of sign language is needed to extract the very basic gestures and postures that
minimize the amount of learning required of the user. An initial
parametrization is proposed by Stokoe (1980), as discussed above.
Interestingly, his four features are mostly body-centred, which suggests that
for the semiotic function a body-centred coordinate system is more appropriate
when analysing gestures. In contrast, for ergotic functions, a world-based
coordinate system seems the more obvious choice.
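For comparison, Zimmerman's suggestion amounts to the following seven-parameter
record (a sketch with hypothetical names); writing it out makes plain how much
of the hand's shape, and hence of the semiotic function, is discarded:

    from dataclasses import dataclass

    @dataclass
    class MinimalHandState:
        """Zimmerman's suggested seven parameters."""
        x: float      # hand position
        y: float
        z: float
        roll: float   # hand orientation
        pitch: float
        yaw: float
        grip: float   # grip aperture; could be reduced to binary open/close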
These suggestions for standards may be combined if some way is found to
recognize the relevance of either the ergotic or semiotic function or both
during a specific hand movement. Perhaps the analysis of emotional behaviour
can provide clues. For a gesture set to gain major acceptance in the market
place, it is advisable to examine the tasks and semiotic functions most
frequently executed and then choose a hand gesture set that seems to appear
natural, at least to a number of different people within a social group or even
a culture, when executing those tasks and functions. Simple market economics
will then do the rest.
Summary
Classifying Hand Movement
Examples
Definitions
Verbal Synonyms
Classifications
All three functions may be augmented using an instrument, e.g. a handkerchief
for a semiotic good-bye movement, a pair of scissors for the ergotic cutting
movement, a stick for an epistemic poking movement. For the purposes of this
paper we can, by definition, substitute the term gestures for semiotic hand
movements.
Hand actions commonly identified as prehension are a subset of ergotic hand
movements.
Transitive hand movements are part of an uninterrupted sequence of
interconnected structured hand movements that are adapted in time and space,
with the aim of completing a program, such as prehension. These hand movements
could be equally classified as Cadoz's ergotic hand movements.

Semiotic Hand Movements
Any of these gestures can also be cohesive gestures: gestures that tie together
thematically related but temporally separated parts of the discourse.
The stages of Kendon's continuum (see above) are:
1. gesticulation: idiosyncratic spontaneous movements of the hands and arms
during speech
2. language-like gestures: like gesticulation, but grammatically integrated in
the utterance
3. pantomime: gestures without speech, used in theatre to communicate a story
4. emblems: "italianate" gestures (e.g. insults and praises)
5. sign language: a set of gestures and postures for a full-fledged linguistic
communication system
Nespoulos & Lecours (1986) take a more detailed approach and suggest a
three-level scheme for classification:
This approach seems of limited use due to its lack of clear distinctions. While
it summarizes a number of gestural behaviours, it does not suggest some form of
underlying structure or model for the processes involving gestural expression.
Nevertheless, the concept of the arbitrariness of gestures is duly noted. It
refers to the fact that gestures may be somewhat formalized and generally
recognizable by others, but such formalization exists only within a culture.
There is no such thing as a universal gesture. Kendon's continuum brings
forward an apparent principle, that of different levels of linguisticity of
gestures. McNeill's classification is clearly recognizable and, just as
Kendon's, emphasizes the strong connection between speech and gesture.
Ergotic Hand Movements
The value of such a classification is limited due to the fact that no reference
is made to the task at hand, although the indirection level bears some relation
to the notion of a task. Basically, the model of ergotic hand movements
suggested by this classification omits the importance of cognitive processes in
such hand movements.
The type of grip used in any given activity is a function of the activity
itself and does not depend on the shape or size of the object to be gripped,
although in extreme cases this does not always hold. While this classification
relates to the musculo-skeletal properties of the hand, notably opposition, it
incorporates the notion of a task, such as actions requiring precision or
power. However, neither the scissor grip nor the hook grip can be related in a
similar way to the notion of a task: these grips merely refer to a frequently
used hand movement. The classification is therefore somewhat ambiguous.
1. use of control effect: modulation (parametric change), selection (discrete
change), or excitation (input energy)
2. use of kinetic images: scrape, slide, ruffle, crunch, glide, caress, etc.
3. use of spatial trajectory: up, down, left, right, in, out, circular,
sinusoidal, spiral, etc.
Classification 1 is based on a control task taxonomy and is, for specific
purposes, useful. Classification 2 puts the emphasis on the observer's point of
view and extracts the semiotic function of the hand movement, although the
movement may be purely ergotic from the executer's point of view. It
demonstrates that the semiotic function of hand movements is always present.
Classification 3 omits hand shape as a parameter and has limitations similar to
the first classification discussed above.
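Classification 1 maps readily onto the event categories of an input-handling
layer. The sketch below (hypothetical names and thresholds, for illustration
only) shows how the three control effects might be distinguished:

    from enum import Enum

    class ControlEffect(Enum):
        MODULATION = "parametric change"   # e.g. dragging a slider
        SELECTION = "discrete change"      # e.g. picking a menu item
        EXCITATION = "input energy"        # e.g. striking a virtual drum

    def classify_effect(is_discrete: bool, energy: float) -> ControlEffect:
        """Toy classifier: discrete events are selections, energetic
        impulses are excitations, everything else modulates a parameter.
        The energy threshold is an arbitrary illustration."""
        if is_discrete:
            return ControlEffect.SELECTION
        if energy > 1.0:
            return ControlEffect.EXCITATION
        return ControlEffect.MODULATION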
Gestural Communication
Evolution of Human Communication
Linguistic Aspects of Gesture
In other words, the formalized, linguistic component of the expression present
in speech is replaced by signs going from gesticulation to sign languages. This
supports the idea that gesture and speech are generated by one integral system
as suggested above.
1. consistent arm use and body posture
2. consistent head movement
3. gesture unit
4. gesture phrase
5. preparation, optional hold, stroke, optional hold, retraction
From this structure it can be seen that quite often the word gesture is used to
identify the stroke. Perhaps this simplification is due to the fact that most
of the linguistic aspects of the expression are communicated in the stroke. In
a similar vein, it should be noted that a hand posture often involves a
preparation and retraction phase (cf. the OK sign). Therefore the definition of
a hand posture as solely the position of hand and fingers at one particular
point in time is somewhat misleading: sequences of postures occur with other
hand movements in between.
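The phase structure of item 5 suggests a simple segmentation heuristic,
sketched below: runs of low hand speed are holds, runs of high speed are
movement phases, and the movement phase containing the speed peak is taken as
the stroke. This velocity-threshold heuristic is an illustration, not a method
taken from the literature reviewed here:

    def segment_phases(speeds, hold_threshold=0.05):
        """Split one gesture, given as a list of hand speed samples (m/s),
        into hold and movement phases. The movement phase containing the
        global speed peak is labelled the stroke; movement phases before
        it are labelled preparation, those after it retraction. The
        threshold is an arbitrary illustration, not an empirical value."""
        if not speeds:
            return []
        runs, start = [], 0  # runs of samples on one side of the threshold
        for i in range(1, len(speeds) + 1):
            if i == len(speeds) or \
               (speeds[i] > hold_threshold) != (speeds[start] > hold_threshold):
                label = "movement" if speeds[start] > hold_threshold else "hold"
                runs.append((label, start, i))  # end index exclusive
                start = i
        peak = max(range(len(speeds)), key=lambda j: speeds[j])
        phases = []
        for label, s, e in runs:
            if label == "movement":
                label = ("stroke" if s <= peak < e else
                         "preparation" if e <= peak else "retraction")
            phases.append((label, s, e))
        return phases

For example, segment_phases([0.0, 0.2, 0.9, 0.3, 0.0]) yields a hold, a stroke
spanning the three moving samples, and a final hold.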
Gesticulation
Due to the lack of linguistic features of gesticulation, such gestures cannot
be analysed with the tools developed for studying spoken language, and caution
must be taken when summarizing the pragmatics, semantics and syntax of
such gestures. McNeill's arguments supporting the hypothesis that gesture and
spoken language are a single system are based on the following findings:
Sign Language
Sign languages exhibit language-like properties since they are used by people
who have to communicate symbolically encoded messages without the use of the
speech channel. They exhibit the following language-like properties (McNeill
1992):
In sign languages meaning can be modulated, e.g. through the addition of
emotional expression, by varying the following parameters (personal
communication with sign language instructors and Klima & Bellugi, 1979):
Ergotic versus Semiotic and Emotions
Handedness
Other Modalities
It is well known that sign language can be almost entirely replaced by facial
expressions. In musical conducting, the posture and dynamics of the whole body
are considered by many to be an integral part of the conductor's expression. A
more formalized conducting method (Saito, 199?) prescribes that only gestures
of the upper body are allowed.
Hand Gestures for HCI
Definitions
Prior Art
Other possible applications not found in the literature include sound design,
stage sound mixing and game playing.
Other possible applications not found in the literature include airplane and
other traffic control, game scoring and playing as well as stock trading and
legal transactions (Hibbits, 1995).

Hand Gesture Interface Design
In general, many applications have been technology driven, and not based on
knowledge of human behaviour and/or a proper task analysis. In many cases there
is no consideration given to the different functions of hand movements, in
particular the semiotic and ergotic functions.
Standard Hand Gestures