Articulated body pose estimation in computer vision is the study of algorithms and systems that recover the pose of an articulated body, which consists of joints and rigid parts, using image-based observations. It is one of the longest-lasting problems in computer vision because of the complexity of the models that relate observation with pose, and because of the variety of situations in which it would be useful.

Description

Perception of human beings in their neighboring environment is an important capability that robots must possess. If a person uses gestures to point to a particular object, then the interacting machine should be able to understand the situation in a real-world context. Thus pose estimation is an important and challenging problem in computer vision, and many algorithms have been developed to solve it over the last two decades. Many solutions involve training complex models with large data sets.

Pose estimation is a difficult problem and an active subject of research because the human body has 244 degrees of freedom with 230 joints. Although not all movements between joints are evident, the human body can be modeled as 10 large parts with 20 degrees of freedom. Algorithms must account for large variability introduced by differences in appearance due to clothing, body shape, size, and hairstyles. Additionally, the results may be ambiguous due to partial occlusions from self-articulation, such as a person's hand covering their face, or occlusions from external objects. Finally, most algorithms estimate pose from monocular (two-dimensional) images taken with a normal camera. These images lack the three-dimensional information of an actual body pose, leading to further ambiguities. There is recent work in this area that uses images from RGBD cameras, which provide information about both color and depth. Other issues include varying lighting and camera configurations. The difficulties are compounded if there are additional performance requirements.

Sensors

The typical articulated body pose estimation system involves a model-based approach, in which the pose estimation is achieved by maximizing a similarity, or minimizing a dissimilarity, between an observation (input) and a template model. Different kinds of sensors have been explored for use in making the observation, including the following:
* Visible wavelength imagery,
* Long-wave thermal infrared imagery,
* Time-of-flight imagery, and
* Laser range scanner imagery.

These sensors produce intermediate representations that are directly used by the model. The representations include the following:
* Image appearance,
* Voxel (volume element) reconstruction,
* 3D point clouds and sums of Gaussian kernels, and
* 3D surface meshes.
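The model-based idea described above can be illustrated with a toy example. The following is a minimal sketch, assuming a hypothetical two-link planar arm as the template model, observed 2D joint positions as the observation, and a sum of squared distances as the dissimilarity. The function names, link lengths, and the brute-force grid search are all illustrative assumptions; real systems use far richer models and optimizers.

```python
import math

def template_joints(theta1, theta2, l1=1.0, l2=1.0):
    """Elbow and wrist positions of a two-link planar arm rooted at the origin."""
    elbow = (l1 * math.cos(theta1), l1 * math.sin(theta1))
    wrist = (elbow[0] + l2 * math.cos(theta1 + theta2),
             elbow[1] + l2 * math.sin(theta1 + theta2))
    return [elbow, wrist]

def dissimilarity(observed, predicted):
    """Sum of squared distances between corresponding points."""
    return sum((ox - px) ** 2 + (oy - py) ** 2
               for (ox, oy), (px, py) in zip(observed, predicted))

def fit_pose(observed, steps=72):
    """Grid search over both joint angles for the minimum-dissimilarity pose."""
    angles = [2 * math.pi * k / steps for k in range(steps)]
    return min(((t1, t2) for t1 in angles for t2 in angles),
               key=lambda pose: dissimilarity(observed, template_joints(*pose)))

# Hypothetical observation: elbow at (0, 1), wrist at (-1, 1).
observed = [(0.0, 1.0), (-1.0, 1.0)]
pose = fit_pose(observed)  # both angles come out close to pi/2
```

The grid search is only practical here because the toy model has two parameters; with the many degrees of freedom of a full body, gradient-based or sampling-based optimization is the usual choice.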

Classical models

Part models

The basic idea of the part-based model can be attributed to the human skeleton. Any object having the property of articulation can be broken down into smaller parts, wherein each part can take different orientations, resulting in different articulations of the same object. Different scales and orientations of the main object correspond to scales and orientations of its parts. To formulate the model in mathematical terms, the parts are connected to each other using springs. As such, the model is also known as a spring model. The degree of closeness between each pair of connected parts is accounted for by the compression and expansion of the springs. There are geometric constraints on the orientation of the springs. For example, the limbs of the legs cannot rotate through 360 degrees, so parts cannot take such extreme orientations. This reduces the number of possible configurations.

The spring model forms a graph <math>G(V, E)</math>, where <math>V</math> (nodes) corresponds to the parts and <math>E</math> (edges) represents the springs connecting two neighboring parts. Each location in the image can be reached by the <math>x</math> and <math>y</math> coordinates of the pixel location. Let <math>\mathbf{p}_{i}(x, \, y)</math> be the point at the <math>\mathbf{i}^{th}</math> location. Then the cost associated with joining the spring between the <math>\mathbf{i}^{th}</math> and the <math>\mathbf{j}^{th}</math> point can be given by <math>S(\mathbf{p}_{i},\,\mathbf{p}_{j}) = S(\mathbf{p}_{i} - \mathbf{p}_{j})</math>. Hence the total cost associated with placing <math>l</math> components at locations <math>\mathbf{P}_{l}</math> is given by

:<math>S(\mathbf{P}_{l}) = \displaystyle\sum_{i=1}^{l} \; \displaystyle\sum_{j=1}^{i} \; \mathbf{s}_{ij}(\mathbf{p}_{i},\,\mathbf{p}_{j})</math>

The above equation simply represents the spring model used to describe body pose. To estimate pose from images, a cost or energy function must be minimized. This energy function consists of two terms. The first is related to how well each component matches the image data, and the second deals with how well the oriented (deformed) parts match, thus accounting for articulation along with object detection. The part models, also known as pictorial structures, are among the basic models on which other efficient models are built by slight modification. One such example is the flexible mixture model, which reduces the database of hundreds or thousands of deformed parts by exploiting the notion of local rigidity.
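The two-term energy described above can be sketched in a few lines. This is a minimal illustration, assuming a quadratic spring cost that depends only on the displacement between two connected parts, a hypothetical table of per-location appearance costs, and an exhaustive search over a handful of candidate locations. All names and numbers are assumptions made for the example; practical pictorial-structure implementations use far more efficient inference than brute force.

```python
from itertools import product

def spring_cost(p_i, p_j, rest_offset, stiffness=1.0):
    """Quadratic deformation cost; depends only on the displacement p_j - p_i."""
    dx = (p_j[0] - p_i[0]) - rest_offset[0]
    dy = (p_j[1] - p_i[1]) - rest_offset[1]
    return stiffness * (dx * dx + dy * dy)

def total_energy(locations, edges, appearance_cost):
    """Appearance term per part plus spring term per edge of the graph."""
    unary = sum(appearance_cost[part][loc] for part, loc in locations.items())
    pairwise = sum(spring_cost(locations[i], locations[j], rest)
                   for (i, j), rest in edges.items())
    return unary + pairwise

def best_configuration(candidates, edges, appearance_cost):
    """Exhaustive search over candidate locations (toy sizes only)."""
    parts = list(candidates)
    best = None
    for combo in product(*(candidates[p] for p in parts)):
        locations = dict(zip(parts, combo))
        energy = total_energy(locations, edges, appearance_cost)
        if best is None or energy < best[0]:
            best = (energy, locations)
    return best

# Hypothetical two-part model: the head is expected 3 pixels above the torso.
candidates = {"torso": [(5, 5), (6, 5)], "head": [(5, 2), (9, 9)]}
edges = {("torso", "head"): (0, -3)}
appearance_cost = {"torso": {(5, 5): 0.2, (6, 5): 0.1},
                   "head": {(5, 2): 0.3, (9, 9): 0.0}}
energy, placement = best_configuration(candidates, edges, appearance_cost)
```

In this example the head candidate at (9, 9) has the best appearance score on its own, but the spring term makes that placement expensive, so the lowest-energy configuration keeps the head directly above the torso.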

Articulated model with quaternion

The kinematic skeleton is constructed as a tree-structured chain. Each rigid body segment has its local coordinate system, which can be transformed to the world coordinate system via a 4×4 transformation matrix <math>T_l</math>,

:<math>T_{l} = T_{\operatorname{par}(l)}R_{l},</math>

where <math>R_l</math> denotes the local transformation from body segment <math>S_l</math> to its parent <math>\operatorname{par}(S_l)</math>. Each joint in the body has 3 degrees of freedom (DoF) of rotation. Given a transformation matrix <math>T_l</math>, the joint position at the T-pose can be transferred to its corresponding position in world coordinates. In many works, the 3D joint rotation is expressed as a normalized quaternion <math>[x,y,z,w]</math> due to its continuity, which can facilitate gradient-based optimization in the parameter estimation.
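The chain of transformations above can be sketched as follows. This is a minimal illustration, assuming quaternions are stored in [x, y, z, w] order, that each local transformation combines a joint rotation with a fixed offset from the parent, and that segments are listed so that every parent appears before its children. The helper names and the two-segment example are hypothetical.

```python
import math

def quat_to_matrix(q):
    """Convert a quaternion [x, y, z, w] to a 3x3 rotation matrix."""
    x, y, z, w = q
    n = math.sqrt(x * x + y * y + z * z + w * w)  # normalize defensively
    x, y, z, w = x / n, y / n, z / n, w / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]

def local_transform(q, offset):
    """4x4 local matrix: rotation from q, translation by offset from the parent."""
    r = quat_to_matrix(q)
    return [r[0] + [offset[0]], r[1] + [offset[1]], r[2] + [offset[2]],
            [0.0, 0.0, 0.0, 1.0]]

def matmul4(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def world_transforms(parents, local_mats):
    """World matrix of each segment; parents[l] is -1 for the root segment."""
    world = []
    for l, parent in enumerate(parents):
        world.append(local_mats[l] if parent < 0
                     else matmul4(world[parent], local_mats[l]))
    return world

# Root rotated 90 degrees about z; child offset one unit along the parent's x axis.
half = math.sqrt(0.5)
parents = [-1, 0]
local_mats = [local_transform([0.0, 0.0, half, half], (0.0, 0.0, 0.0)),
              local_transform([0.0, 0.0, 0.0, 1.0], (1.0, 0.0, 0.0))]
T = world_transforms(parents, local_mats)
child_origin = [T[1][i][3] for i in range(3)]  # approximately [0, 1, 0]
```

Rotating the root moves the child's origin from (1, 0, 0) to roughly (0, 1, 0), which shows how a rotation at one joint propagates to every segment below it in the tree.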

Deep learning based models

Since about 2016, deep learning has emerged as the dominant method for performing accurate articulated body pose estimation. Rather than building an explicit model for the parts as above, the appearances of the joints and relationships between the joints of the body are learned from large training sets. Models generally focus on extracting the 2D positions of joints (keypoints), the 3D positions of joints, or the 3D shape of the body from either a single or multiple images.

Supervised

2D joint positions

The first deep learning models that emerged focused on extracting the 2D positions of human joints in an image. Such models take in an image and pass it through a convolutional neural network to obtain a series of heatmaps (one for each joint), which take on high values where joints are detected. When there are multiple people per image, two main techniques have emerged for grouping joints within each person. In the first, "bottom-up" approach, the neural network is trained to also generate "part affinity fields", which indicate the location of limbs. Using these fields, joints can be grouped limb by limb by solving a series of assignment problems. In the second, "top-down" approach, an additional network is used to first detect people in the image, and then the pose estimation network is applied to each detected person.
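The step from heatmaps to joint positions can be illustrated for the single-person case. The sketch below assumes each heatmap is a small 2D grid of scores stored as nested lists, takes the highest-scoring cell as the joint location, and discards joints whose peak falls below a hypothetical confidence threshold. The joint names, scores, and threshold are made up for the example; grouping joints across several people is not covered here.

```python
def heatmap_peak(heatmap):
    """Return (row, col, score) of the maximum value in a 2D heatmap."""
    best = (0, 0, heatmap[0][0])
    for r, row in enumerate(heatmap):
        for c, value in enumerate(row):
            if value > best[2]:
                best = (r, c, value)
    return best

def decode_keypoints(heatmaps, threshold=0.5):
    """Map each joint name to its peak location, or None if below threshold."""
    keypoints = {}
    for joint, heatmap in heatmaps.items():
        r, c, score = heatmap_peak(heatmap)
        keypoints[joint] = (r, c) if score >= threshold else None
    return keypoints

# Hypothetical network output: one small heatmap per joint.
heatmaps = {
    "nose": [[0.0, 0.1, 0.0],
             [0.1, 0.9, 0.2],
             [0.0, 0.1, 0.0]],
    "left_wrist": [[0.1, 0.2],
                   [0.3, 0.2]],
}
keypoints = decode_keypoints(heatmaps)
```

Here the nose is localized at row 1, column 1, while the left wrist is reported as undetected because its strongest response (0.3) is below the threshold.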

3D joint positions

With the advent of multiple datasets with human pose annotated in multiple views, models which detect 3D joint positions became more popular. These again fell into two categories. In the first, a neural network is used to detect 2D joint positions from each view, and these detections are then triangulated to obtain 3D joint positions. The 2D network may be refined to produce better detections based on the 3D data. Furthermore, such approaches often have filters in both 2D and 3D to refine the detected points. In the second, a neural network is trained end-to-end to predict 3D joint positions directly from a set of images, without intermediate 2D joint position detections. Such approaches often project image features into a cube and then use a 3D convolutional neural network to predict a 3D heatmap for each joint.
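The triangulation step in the first category can be sketched for two views. This minimal example assumes that each 2D detection has already been converted, using camera calibration, into a ray with a known origin (the camera center) and direction. It uses the midpoint method, a simple stand-in for the least-squares triangulation that multi-view systems typically rely on: the 3D joint is placed halfway between the closest points of the two rays. The function names and camera setup are hypothetical.

```python
def dot(a, b):
    """Dot product of two 3D vectors."""
    return sum(x * y for x, y in zip(a, b))

def triangulate_midpoint(o1, d1, o2, d2, eps=1e-12):
    """Midpoint of the shortest segment between rays o1 + s*d1 and o2 + t*d2."""
    w = [p - q for p, q in zip(o1, o2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < eps:
        raise ValueError("rays are (nearly) parallel; cannot triangulate")
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1 = [o + s * v for o, v in zip(o1, d1)]  # closest point on ray 1
    p2 = [o + t * v for o, v in zip(o2, d2)]  # closest point on ray 2
    return [(u + v) / 2 for u, v in zip(p1, p2)]

# Two hypothetical cameras at x = -1 and x = +1, both seeing a joint at (0, 0, 5).
joint = triangulate_midpoint((-1.0, 0.0, 0.0), (1.0, 0.0, 5.0),
                             (1.0, 0.0, 0.0), (-1.0, 0.0, 5.0))
```

With noisy detections the two rays no longer intersect exactly, and the distance between the two closest points gives a rough indication of how consistent the views are.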

3D shape

Concurrently with the work above, scientists have been working on estimating the full 3D shape of a human or animal from a set of images. Most of the work is based on estimating the appropriate pose of the skinned multi-person linear (SMPL) model within an image. Variants of the SMPL model for other animals have also been developed. Generally, some keypoints and a silhouette are detected for each animal within the image, and then the parameters of the 3D shape model are fit to match the position of the keypoints and silhouette.

Unsupervised

The above algorithms all rely on annotated images, which can be time-consuming to produce. To address this issue, computer vision researchers have developed new algorithms which can learn 3D keypoints given only annotated 2D images from a single view or identify keypoints given videos without any annotations.

Applications

Assisted living

Personal care robots may be deployed in future assisted living homes. For these robots, high-accuracy human detection and pose estimation is necessary to perform a variety of tasks, such as fall detection. Additionally, this application has a number of performance constraints.

Character animation

Traditionally, character animation has been a manual process. However, poses can be synced directly to a real-life actor through specialized pose estimation systems. Older systems relied on markers or specialized suits. Recent advances in pose estimation and motion capture have enabled markerless applications, sometimes in real time.

Intelligent driver assisting system

Car accidents account for about two percent of deaths globally each year. As such, an intelligent system tracking driver pose may be useful for emergency alerts. Along the same lines, pedestrian detection algorithms have been used successfully in autonomous cars, enabling the car to make smarter decisions.

Video games

Commercially, pose estimation has been used in the context of video games, popularized with the Microsoft Kinect sensor (a depth camera). These systems track the user to render their avatar in-game, in addition to performing tasks like gesture recognition to enable the user to interact with the game. As such, this application has a strict real-time requirement.

Medical applications

Pose estimation has been used to detect postural issues such as scoliosis by analyzing abnormalities in a patient's posture, to support physical therapy, and to study the cognitive brain development of young children by monitoring motor functionality.

Other applications

Other applications include video surveillance, animal tracking and behavior understanding, sign language detection, advanced human–computer interaction, and markerless motion capturing.

Related technology

A commercially successful but specialized computer vision-based articulated body pose estimation technique is optical motion capture. This approach involves placing markers on the individual at strategic locations to capture the 6 degrees of freedom of each body part.

Research groups

A number of groups and companies are researching pose estimation, including groups at Brown University, Carnegie Mellon University, MPI Saarbruecken, Stanford University, the University of California, San Diego, the University of Toronto, the École Centrale Paris, ETH Zurich, National University of Sciences and Technology (NUST), the University of California, Irvine, and Polytechnic University of Catalonia.

Companies

At present, several companies are working on articulated body pose estimation.
* Bodylabs: Bodylabs is a Manhattan-based software provider of human-aware artificial intelligence.


External links

* Michael J. Black, Professor at Brown University
* https://web.archive.org/web/20070612082024/http://www.mpi-inf.mpg.de/~rosenhahn/ Homepage of Dr.-Ing at MPI Saarbruecken
* Markerless Motion Capture Project at Stanford
* Computer Vision and Robotics Research Laboratory at the University of California, San Diego
* http://hmi.ewi.utwente.nl/person/Ronald%20Poppe Ronald Poppe at the University of Twente
* Professor Nikos Paragios at the Ecole Centrale de Paris
* Articulated Pose Estimation with Flexible Mixtures of Parts Project at UC Irvine
* http://screenrant.com/crazy3dtechnologyjamescameronavatarkofi3367/
* 2D articulated human pose estimation software
