Gesture recognition is a topic in computer science and language technology with the goal of interpreting human gestures via mathematical algorithms. It is a subdiscipline of computer vision. Gestures can originate from any bodily motion or state, but commonly originate from the face or hand. Areas of focus in the field include emotion recognition from the face and hand gesture recognition, since both are forms of expression. Users can make simple gestures to control or interact with devices without physically touching them. Many approaches use cameras and computer vision algorithms to interpret sign language; however, the identification and recognition of posture, gait, proxemics, and human behaviors are also the subject of gesture recognition techniques.
Gesture recognition can be seen as a way for computers to begin to understand human body language, thus building a better bridge between machines and humans than older text user interfaces or even GUIs (graphical user interfaces), which still limit the majority of input to keyboard and mouse. It allows users to interact naturally without any mechanical devices.
Overview
Gesture recognition features:
*Higher accuracy
*Higher stability
*Quicker time to unlock a device
Major application areas of gesture recognition include:
*Automotive sector
*Consumer electronics sector
*Transit sector
*Gaming sector
*Smartphone unlocking
*Defense
*Home automation
*Automated sign language translation
Gesture recognition can be conducted with techniques from computer vision and image processing.
The literature includes ongoing work in the computer vision field on capturing gestures or more general human pose and movements by cameras connected to a computer.
''Gesture recognition and pen computing:'' Pen computing reduces the hardware impact of a system and also increases the range of physical world objects usable for control beyond traditional digital objects like keyboards and mice. The term "gesture recognition" has also been used to refer more narrowly to non-text-input handwriting symbols, such as inking on a graphics tablet, multi-touch gestures, and mouse gesture recognition. This is computer interaction through the drawing of symbols with a pointing device cursor.
Gesture types
In computer interfaces, two types of gestures are distinguished: online gestures, which can be regarded as direct manipulations such as scaling and rotating, and offline gestures, which are usually processed after the interaction is finished; for example, a circle is drawn to activate a context menu.
* Offline gestures: Those gestures that are processed after the user interaction with the object. An example is a gesture to activate a menu.
* Online gestures: Direct manipulation gestures. They are used to scale or rotate a tangible object.
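The distinction can be illustrated with a minimal Python sketch. The function names, the pinch-scale computation, and the crude circle test are illustrative assumptions rather than the method of any particular toolkit: the online gesture yields a new result for every input sample, while the offline gesture is classified only once the complete stroke is available.

```python
import math

def pinch_scale(start_a, start_b, cur_a, cur_b):
    """Online gesture: scale factor from two touch points.

    Called on every input sample while the fingers are still moving.
    """
    d0 = math.dist(start_a, start_b)
    d1 = math.dist(cur_a, cur_b)
    return d1 / d0 if d0 > 0 else 1.0

def is_circle(stroke, tolerance=0.2):
    """Offline gesture: classify a finished stroke as a circle.

    Hypothetical test: every point must lie within `tolerance` (relative)
    of the mean distance to the centroid, and the stroke must end close
    to where it started.
    """
    if len(stroke) < 8:
        return False
    cx = sum(p[0] for p in stroke) / len(stroke)
    cy = sum(p[1] for p in stroke) / len(stroke)
    radii = [math.dist(p, (cx, cy)) for p in stroke]
    mean_r = sum(radii) / len(radii)
    if mean_r == 0:
        return False
    round_enough = all(abs(r - mean_r) / mean_r <= tolerance for r in radii)
    closed = math.dist(stroke[0], stroke[-1]) <= 0.5 * mean_r
    return round_enough and closed
```

In an application, `pinch_scale` would drive the displayed object continuously, whereas `is_circle` would run once after the pointer is released and, if it succeeds, open the menu.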
Touchless interface
A touchless user interface (TUI) is an emerging type of technology related to gesture control. It is the process of commanding a computer via body motion and gestures without touching a keyboard, mouse, or screen. Touchless interfaces, in addition to gesture controls, are becoming widely popular as they provide the ability to interact with devices without physically touching them.
Types of touchless technology
A number of devices use this type of interface, such as smartphones, laptops, games, TVs, and music equipment.
One type of touchless interface uses the Bluetooth connectivity of a smartphone to activate a company's visitor management system. This avoids having to touch an interface, as was desirable at the time of the COVID-19 pandemic.
Input devices
The ability to track a person's movements and determine what gestures they may be performing can be achieved through various tools. Kinetic user interfaces (KUIs) are an emerging type of user interface that allows users to interact with computing devices through the motion of objects and bodies. Examples of KUIs include tangible user interfaces and motion-aware games such as Wii and Microsoft's Kinect, and other interactive projects.
Although there is a large amount of research done in image/video-based gesture recognition, there is some variation in the tools and environments used between implementations.
* Wired gloves. These can provide input to the computer about the position and rotation of the hands using magnetic or inertial tracking devices. Furthermore, some gloves can detect finger bending with a high degree of accuracy (5-10 degrees), or even provide haptic feedback to the user, which is a simulation of the sense of touch. The first commercially available hand-tracking glove-type device was the DataGlove, which could detect hand position, movement and finger bending. It uses fiber optic cables running down the back of the hand. Light pulses are created and, when the fingers are bent, light leaks through small cracks and the loss is registered, giving an approximation of the hand pose.
* Depth-aware cameras. Using specialized cameras such as structured light or time-of-flight cameras, one can generate a depth map of what is being seen through the camera at short range, and use this data to approximate a 3D representation of what is being seen. These can be effective for the detection of hand gestures due to their short-range capabilities.
* Stereo cameras. Using two cameras whose relations to one another are known, a 3D representation can be approximated from the output of the cameras. To get the cameras' relations, one can use a positioning reference such as a lexian-stripe or infrared emitters. In combination with direct motion measurement (6D-Vision), gestures can be detected directly.
* Gesture-based controllers. These controllers act as an extension of the body so that when gestures are performed, some of their motion can be conveniently captured by the software. An example of emerging gesture-based motion capture is skeletal hand tracking, which is being developed for virtual reality and augmented reality applications. An example of this technology is shown by the tracking companies uSens and Gestigon, which allow users to interact with their surroundings without controllers.
* Wi-Fi sensing
Another example of gesture-based control is mouse gesture tracking, where the motion of the mouse is correlated to a symbol being drawn by a person's hand, and changes in acceleration over time can be studied to represent gestures. The software also compensates for human tremor and inadvertent movement.[William Wong. "Natural User Interface Employs Sensor Integration". ''Electronic Design'', September 8, 2011.][Stephen Cousins. "A view to a thrill". ''Cable & Satellite International'', September/October 2011.]["Hillcrest Labs rings up $25M D round". ''TechJournal South'', January 7, 2008.] The sensors of these smart light emitting cubes can be used to sense hands and fingers as well as other objects nearby, and can be used to process data. Most applications are in music and sound synthesis, but the technology can be applied to other fields.
* Single camera. A standard 2D camera can be used for gesture recognition where the resources or environment would not be convenient for other forms of image-based recognition. Earlier it was thought that a single camera may not be as effective as stereo or depth-aware cameras, but some companies are challenging this view. Software-based gesture recognition technology using a standard 2D camera can detect hand gestures robustly.
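For the depth-aware and stereo cameras listed above, the step from image measurements to a 3D representation can be sketched as follows. This is a minimal example under stated assumptions, not a description of any specific device: it assumes an ideal pinhole camera, a rectified stereo pair, and hypothetical parameter names (`focal_px`, `baseline_m`, `fx`, `fy`, `cx`, `cy`) for the calibration values.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth of a point seen by a rectified stereo pair: Z = f * B / d.

    disparity_px: horizontal pixel offset of the point between the two images
    focal_px:     focal length in pixels (assumed equal for both cameras)
    baseline_m:   distance between the two camera centres in metres
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Turn one depth-map pixel (u, v) into a 3D point in camera coordinates.

    (fx, fy) are focal lengths in pixels and (cx, cy) is the principal point.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)
```

Applying `backproject` to every pixel of a depth map gives a point cloud, which is one common form of the approximate 3D representation mentioned in the list.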
Algorithms
Depending on the type of input data, a gesture can be interpreted in different ways. However, most techniques rely on key points represented in a 3D coordinate system. Based on the relative motion of these points, the gesture can be detected with high accuracy, depending on the quality of the input and the algorithm's approach.
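As a minimal sketch of detection from the motion of a single tracked point, the following Python function classifies a swipe from the first and last positions of a trajectory. The function name, the 0.15 m threshold, and the axis convention (x to the right, y upwards) are assumptions made for illustration only.

```python
def classify_swipe(track, min_distance=0.15):
    """Classify a swipe from a list of (x, y, z) positions in metres.

    Only the net displacement between the first and last sample is used;
    movements shorter than `min_distance` are treated as no gesture.
    """
    if len(track) < 2:
        return "none"
    dx = track[-1][0] - track[0][0]
    dy = track[-1][1] - track[0][1]
    if max(abs(dx), abs(dy)) < min_distance:
        return "none"
    if abs(dx) >= abs(dy):
        return "swipe_right" if dx > 0 else "swipe_left"
    return "swipe_up" if dy > 0 else "swipe_down"
```

A practical recognizer would also consider timing and the path between the endpoints; this sketch only shows the basic idea of deriving a gesture label from point motion.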
In order to interpret movements of the body, one has to classify them according to common properties and the message the movements may express. For example, in sign language, each gesture represents a word or phrase.
Some literature differentiates two approaches to gesture recognition: 3D model-based and appearance-based. The former makes use of 3D information about key elements of the body parts in order to obtain several important parameters, like palm position or joint angles. Appearance-based systems, on the other hand, use images or videos for direct interpretation.
3D model-based algorithms
The 3D model approach can use volumetric or skeletal models or even a combination of the two. Volumetric approaches have been heavily used in the computer animation industry and for computer vision purposes. The models are generally created from complicated 3D surfaces, like NURBS or polygon meshes.
The drawback of this method is that it is very computationally intensive, and systems for real-time analysis are still to be developed. For the moment, a more interesting approach would be to map simple primitive objects to the person's most important body parts (for example, cylinders for the arms and neck, a sphere for the head) and analyze the way these interact with each other. Furthermore, some abstract structures like super-quadrics and generalized cylinders may be even more suitable for approximating the body parts.
Skeletal-based algorithms
Instead of using intensive processing of the 3D models and dealing with a lot of parameters, one can just use a simplified version of joint angle parameters along with segment lengths. This is known as a skeletal representation of the body, where a virtual skeleton of the person is computed and parts of the body are mapped to certain segments. The analysis here is done using the position and orientation of these segments and the relations between them (for example, the angle between the joints and the relative position or orientation).
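One of the skeletal parameters mentioned above, the angle at a joint, can be computed from three 3D joint positions. The sketch below is a minimal example; the function name and the shoulder/elbow/wrist labels in the usage note are illustrative assumptions.

```python
import math

def joint_angle(a, b, c):
    """Angle in degrees at joint b between the segments b->a and b->c.

    a, b, c are (x, y, z) joint positions, e.g. shoulder, elbow, wrist.
    """
    ba = [a[i] - b[i] for i in range(3)]
    bc = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(ba, bc))
    norm = math.hypot(*ba) * math.hypot(*bc)
    if norm == 0:
        raise ValueError("segments must have non-zero length")
    # Clamp to [-1, 1] to guard against rounding error before acos.
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))
```

For example, with a shoulder at `(0, 0, 0)`, an elbow at `(1, 0, 0)` and a wrist at `(1, 1, 0)`, the elbow angle is a right angle; a fully extended arm gives an angle close to 180 degrees.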
Advantages of using skeletal models:
* Algorithms are faster because only key parameters are analyzed.
* Pattern matching against a template database is possible.
* Using key points allows the detection program to focus on the significant parts of the body
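Pattern matching against a template database can be sketched as a nearest-neighbour lookup over a few key parameters. The template names, the choice of two joint angles as the feature vector, and the distance threshold are hypothetical values chosen for illustration.

```python
import math

# Hypothetical template database: pose name -> (elbow angle, wrist angle) in degrees.
TEMPLATES = {
    "arm_straight": (180.0, 180.0),
    "arm_bent": (90.0, 180.0),
    "wrist_flexed": (180.0, 120.0),
}

def match_pose(angles, templates=TEMPLATES, max_distance=25.0):
    """Return the name of the closest template, or None if nothing is close enough."""
    best_name, best_dist = None, float("inf")
    for name, reference in templates.items():
        dist = math.dist(angles, reference)  # Euclidean distance in angle space
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= max_distance else None
```

Because only a handful of numbers per pose are compared, the lookup is cheap, which reflects the speed advantage listed above.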
Appearance-based models
These models do not use a spatial representation of the body; instead, they derive the parameters directly from the images or videos using a template database. Some are based on deformable 2D templates of parts of the human body, particularly hands. Deformable templates are sets of points on the outline of an object, used as interpolation nodes for approximating the object's outline. One of the simplest interpolation functions is linear, which computes an average shape from point sets, point variability parameters, and external deformation. These template-based models are mostly used for hand tracking, but can also be used for simple gesture classification.
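Two ingredients of this description, an average shape computed from several point sets and linear interpolation between outline nodes, can be sketched as follows. This is a simplified illustration that assumes the example shapes have corresponding points in the same order; it omits the variability and deformation parameters, and the function names are hypothetical.

```python
def mean_shape(shapes):
    """Average corresponding outline points across several example shapes.

    shapes: list of point lists; every shape must have the same number of
    (x, y) points, listed in the same order.
    """
    n = len(shapes[0])
    if any(len(shape) != n for shape in shapes):
        raise ValueError("all shapes need the same number of points")
    k = len(shapes)
    return [
        (sum(s[i][0] for s in shapes) / k, sum(s[i][1] for s in shapes) / k)
        for i in range(n)
    ]

def interpolate_outline(nodes, samples_per_edge):
    """Approximate a closed outline by linear interpolation between nodes."""
    outline = []
    for i, (x0, y0) in enumerate(nodes):
        x1, y1 = nodes[(i + 1) % len(nodes)]  # wrap around to close the outline
        for step in range(samples_per_edge):
            t = step / samples_per_edge
            outline.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return outline
```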
The second approach to gesture detection using appearance-based models uses image sequences as gesture templates. Parameters for this method are either the images themselves or certain features derived from them. Most of the time, only one (monoscopic) or two (stereoscopic) views are used.
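One way to compare an observed sequence with a stored sequence template is dynamic time warping (DTW), which tolerates differences in gesture speed. DTW is not named in the text above; it is used here as an assumed, commonly applied technique, and each frame is assumed to have already been reduced to a small feature tuple.

```python
import math

def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance between two sequences of feature tuples.

    Each element is a tuple of per-frame features; both sequences must use
    the same feature length. Lower values mean more similar sequences.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best alignment cost of the first i and first j frames
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

A recognizer built on this would compute the distance to every stored template and pick the smallest, in the same spirit as the template lookup used for static poses.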
Electromyography-based models
Electromyography (EMG) concerns the study of electrical signals produced by muscles in the body. Through classification of data received from the arm muscles, it is possible to classify the action and thus input the gesture to external software. Consumer EMG devices allow for non-invasive approaches such as an arm or leg band and connect via Bluetooth. Due to this, EMG has an advantage over visual methods since the user does not need to face a camera to give input, enabling more freedom of movement.
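A minimal sketch of such a classification pipeline is shown below. It assumes a window of samples per EMG channel, uses the root mean square (RMS) of each channel as a feature, and assigns the label of the nearest class centroid. The feature choice, the classifier, and the centroid values in the usage note are illustrative assumptions, not the method of any particular device.

```python
import math

def rms(window):
    """Root mean square amplitude of one channel's sample window."""
    return math.sqrt(sum(x * x for x in window) / len(window))

def emg_features(channels):
    """One RMS value per channel, e.g. per electrode of an arm band."""
    return tuple(rms(channel) for channel in channels)

def classify(features, centroids):
    """Return the label whose centroid is closest to the feature vector.

    centroids: dict mapping gesture label -> feature tuple, assumed to have
    been estimated beforehand from labelled recordings.
    """
    return min(centroids, key=lambda label: math.dist(features, centroids[label]))
```

With hypothetical centroids such as `{"fist": (0.8, 0.7), "rest": (0.05, 0.05)}`, a window with strong activity on both channels would be labelled `"fist"`, and that label could then be passed to external software as the input gesture.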
Challenges
There are many challenges associated with the accuracy and usefulness of gesture recognition software. For image-based gesture recognition, there are limitations on the equipment used and image noise. Images or video may not be under consistent lighting, or in the same location. Items in the background or distinct features of the users may make recognition more difficult.
The variety of implementations for image-based gesture recognition may also cause issues for the viability of the technology for general usage. For example, an algorithm calibrated for one camera may not work for a different camera. The amount of background noise also causes tracking and recognition difficulties, especially when occlusions (partial and full) occur. Furthermore, the distance from the camera, and the camera's resolution and quality, also cause variations in recognition accuracy.
In order to capture human gestures by visual sensors, robust computer vision methods are also required, for example for hand tracking and hand posture recognition or for capturing movements of the head, facial expressions or gaze direction.
Social acceptability
One significant challenge to the adoption of gesture interfaces on consumer mobile devices such as smartphones and smartwatches stems from the social acceptability implications of gestural input. While gestures can facilitate fast and accurate input on many novel form-factor computers, their adoption and usefulness is often limited by social factors rather than technical ones. To this end, designers of gesture input methods may seek to balance both technical considerations and user willingness to perform gestures in different social contexts. In addition, different device hardware and sensing mechanisms support different kinds of recognizable gestures.
Mobile device
Gesture interfaces on mobile and small form-factor devices are often supported by the presence of motion sensors such as inertial measurement units (IMUs). On these devices, gesture sensing relies on users performing movement-based gestures capable of being recognized by these motion sensors. This can make capturing signals from subtle or low-motion gestures challenging, as they may become difficult to distinguish from natural movements or noise. Through a survey and study of gesture usability, researchers found that gestures that incorporate subtle movement, appear similar to existing technology, look or feel similar to everyday actions, and are enjoyable were more likely to be accepted by users, while gestures that look strange, are uncomfortable to perform, interfere with communication, or involve uncommon movement were more likely to be rejected. The social acceptability of mobile device gestures relies heavily on the naturalness of the gesture and the social context.
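The difficulty of separating subtle gestures from noise can be seen in a simple threshold-based detector. The sketch below is a minimal example under stated assumptions: accelerometer samples are in m/s², gravity is approximated as 9.81 m/s², and the threshold and peak count are hypothetical tuning values.

```python
import math

GRAVITY = 9.81  # m/s^2, approximate magnitude measured by a resting accelerometer

def detect_shake(samples, threshold=6.0, min_peaks=3):
    """Detect a shake gesture in a list of (ax, ay, az) accelerometer samples.

    A peak is counted each time the acceleration magnitude starts to deviate
    from gravity by more than `threshold`. Movements that stay below the
    threshold are indistinguishable from noise for this detector.
    """
    peaks = 0
    above = False
    for ax, ay, az in samples:
        deviation = abs(math.sqrt(ax * ax + ay * ay + az * az) - GRAVITY)
        if deviation > threshold and not above:
            peaks += 1
        above = deviation > threshold
    return peaks >= min_peaks
```

Lowering `threshold` makes subtle gestures detectable but also lets everyday movement trigger the detector, which is the trade-off described above.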
On-body and wearable computers
Wearable computers typically differ from traditional mobile devices in that their usage and interaction location takes place on the user's body. In these contexts, gesture interfaces may become preferred over traditional input methods, as the small size of wearables renders touch-screens or keyboards less appealing. Nevertheless, they share many of the same social acceptability obstacles as mobile devices when it comes to gestural interaction. However, the possibility of wearable computers being hidden from sight or integrated into other everyday objects, such as clothing, allows gesture input to mimic common clothing interactions, such as adjusting a shirt collar or rubbing one's front pant pocket. A major consideration for wearable computer interaction is the location for device placement and interaction. A study exploring third-party attitudes towards wearable device interaction conducted across the United States and South Korea found differences in the perception of wearable computing use between males and females, in part due to different areas of the body being considered socially sensitive. Another study investigating the social acceptability of on-body projected interfaces found similar results, with both studies labelling areas around the waist, groin, and upper body (for women) as least acceptable and areas around the forearm and wrist as most acceptable.
Public installations
Public installations, such as interactive public displays, allow access to information and display interactive media in public settings such as museums, galleries, and theaters. While touch screens are a frequent form of input for public displays, gesture interfaces provide additional benefits such as improved hygiene, interaction from a distance, and improved discoverability, and may favor performative interaction. An important consideration for gestural interaction with public displays is the high probability or expectation of a spectator audience.
"Gorilla arm"
"Gorilla arm" was a side-effect of vertically oriented touch-screen or light-pen use. In periods of prolonged use, users' arms began to feel fatigue and/or discomfort. This effect contributed to the decline of touch-screen input despite its initial popularity in the 1980s.
In order to measure arm fatigue and the gorilla arm side effect, researchers developed a technique called Consumed Endurance.[Hincapié-Ramos, J.D., Guo, X., and Irani, P. 2014. "The Consumed Endurance Workbench: A Tool to Assess Arm Fatigue During Mid-Air Interactions". In Proceedings of the 2014 companion publication on Designing interactive systems (DIS Companion '14). ACM, New York, NY, USA, 109-112. DOI=10.1145/2598784.2602795]
See also
* Activity recognition
* Articulated body pose estimation
* Automotive head unit
* Computer processing of body language
* 3D pose estimation
* Pointing device gesture