US20230305644A1 - Methods and systems for multimodal hand state prediction - Google Patents
Methods and systems for multimodal hand state prediction
- Publication number: US20230305644A1 (application Ser. No. 17/704,970)
- Authority
- US
- United States
- Prior art keywords
- hand
- data
- contact
- motion
- contact data
- Prior art date
- Legal status: Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/03—Arrangements for converting the position or the displacement of a member into a coded form
- G06F3/033—Pointing devices displaced or positioned by the user, e.g. mice, trackballs, pens or joysticks; Accessories therefor
- G06F3/0354—Pointing devices displaced or positioned by the user, e.g. mice, trackballs, pens or joysticks; Accessories therefor with detection of 2D relative movements between the device, or an operating part thereof, and a plane or surface, e.g. 2D mice, trackballs, pens or pucks
- G06F3/03545—Pens or stylus
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/017—Gesture based interaction, e.g. based on a set of recognized hand gestures
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/03—Arrangements for converting the position or the displacement of a member into a coded form
- G06F3/033—Pointing devices displaced or positioned by the user, e.g. mice, trackballs, pens or joysticks; Accessories therefor
- G06F3/0346—Pointing devices displaced or positioned by the user, e.g. mice, trackballs, pens or joysticks; Accessories therefor with detection of the device orientation or free movement in a 3D space, e.g. 3D mice, 6-DOF [six degrees of freedom] pointers using gyroscopes, accelerometers or tilt-sensors
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/03—Arrangements for converting the position or the displacement of a member into a coded form
- G06F3/041—Digitisers, e.g. for touch screens or touch pads, characterised by the transducing means
- G06F3/044—Digitisers, e.g. for touch screens or touch pads, characterised by the transducing means by capacitive means
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/048—Interaction techniques based on graphical user interfaces [GUI]
- G06F3/0487—Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on pressure sensed by a digitiser
- G06F3/0488—Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on pressure sensed by a digitiser using a touch-screen or digitiser, e.g. input of commands through traced gestures
- G06F3/04883—Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on pressure sensed by a digitiser using a touch-screen or digitiser, e.g. input of commands through traced gestures for inputting data by handwriting, e.g. gesture or text
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2203/00—Indexing scheme relating to G06F3/00 - G06F3/048
- G06F2203/038—Indexing scheme relating to G06F3/038
- G06F2203/0381—Multimodal input, i.e. interface arrangements enabling the user to issue commands by simultaneous use of input devices of different nature, e.g. voice plus gesture on digitizer
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2203/00—Indexing scheme relating to G06F3/00 - G06F3/048
- G06F2203/041—Indexing scheme relating to G06F3/041 - G06F3/045
- G06F2203/04106—Multi-sensing digitiser, i.e. digitiser using at least two different sensing technologies simultaneously or alternatively, e.g. for detecting pen and finger, for saving power or for improving position detection
Definitions
- the present disclosure is related to methods and systems for 3D hand state prediction, in particular, for classifying and modeling 3D hand motion or 3D hand posture using inputs from multiple modalities.
- Digital pens have emerged as a popular tool for interacting with digital devices such as tablets, smartphones or laptops with touchscreens. As digital pens can mimic interaction with traditional pen and paper, digital pens are often employed for tasks such as writing and drawing, or for digital interactions that require higher levels of intricacy such as navigation or playing games, among others.
- the present disclosure describes a hand state prediction system which processes input motion sensor and contact surface signals and generates hand state predictions.
- the hand state prediction system includes a machine learning-based model, such as a neural network model that is trained to convert inertial motion measurements and surface contact data into predictions of a corresponding hand state in response to a hand action, for example, a hand gesture or gripping posture. For each window of data sampled, motion data and contact data are obtained, processed and fused to generate a fused output.
- the hand state prediction system can operate in continuous mode to automatically detect a start and an end of a hand action, or a user can designate a start and an end of a hand action.
- a multimodal hand state is generated by a multimodal classifier by processing the fused output. Instructions represented by a hand action (e.g. a gesture or a posture) can be acted upon through a command action performed by a computing device or a computer application.
- representations of the state of a user's hand in 3D space captured by IMU data and surface contact data are fused into a fused output that may be learned by a neural network of the hand state prediction system.
- the hand state prediction system combines information from multiple modalities (e.g. inertial motion data generated by a device held in a user's hand and surface contact data generated by the user's hand), for example, by fusing a prediction of a hand action based on data from a motion sensing device (i.e. 3D motion captured by a motion sensing device in response to the user's hand action) and a prediction of a hand action based on contact data (i.e. contact area and optionally force measurements generated in response to the user's hand action, while the user's hand is resting on a contact surface) into a fused output, which results in a better prediction of a user's hand state.
- a neural network included in the hand state prediction system is optimized to learn better representations from each modality (e.g. hand motion and hand contact area or contact force), contributing to improved overall performance of the hand state prediction system.
- a motion classifier of the neural network configured to process translational and rotational motion from IMU data is optimized to classify hand motion using IMU data while a contact classifier of the neural network configured to process surface contact data is optimized to classify hand motion using surface contact data.
- Improved performance of the hand state prediction system may therefore be demonstrated by more accurately predicting a hand gesture or gripping posture.
- Hand motion data can be acquired from low cost and low power devices to simplify implementation.
- For example, a low cost, low power and low profile IMU motion sensor (e.g. a 3-degree-of-freedom (3-DoF) IMU, a 6-degree-of-freedom (6-DoF) IMU, or a 9-degree-of-freedom (9-DoF) IMU) may be coupled to a device used to capture hand motion, for example, coupled to a digital pen body or coupled to another device.
- a capacitive touch screen can be used as the contact sensor instead of a 3D pressure pad.
- Flexible hardware and software configuration enable discrete or continuous sampling.
- the present disclosure describes a method for generating a multimodal hand state prediction.
- the method includes: obtaining motion data from a motion sensing device that is configured to sense motion of a user's hand; obtaining contact data from a contact surface that is configured to sense contact of the user's hand; and generating a multimodal hand state based on a fusing of the motion data and the contact data.
- generating the multimodal hand state comprises: pre-processing the motion data to generate pre-processed motion data; and classifying the pre-processed motion data using a trained motion classifier to generate a first output, the first output including a probability corresponding to one or more classes.
- generating the multimodal hand state further comprises: pre-processing the contact data to generate pre-processed contact data; and classifying the pre-processed contact data using a trained contact classifier to generate a second output, the second output including a probability corresponding to one or more classes.
- generating the multimodal hand state further comprises: concatenating the first output and the second output to generate a fused output.
- generating the multimodal hand state further comprises: classifying the fused output using a trained multimodal classifier to generate the multimodal hand state, the multimodal hand state including a probability corresponding to one or more classes.
- in some example aspects, the method includes, prior to obtaining the motion data and contact data: receiving an instruction to begin sampling the motion data, and when the instruction to begin sampling the motion data is received, sampling the motion data; receiving an instruction to begin sampling the contact data, and when the instruction to begin sampling the contact data is received, sampling the contact data; receiving an instruction to end sampling the motion data; receiving an instruction to end sampling the contact data; storing the sampled motion data as the motion data; and storing the sampled contact data as the contact data.
- in some example aspects, the method includes, prior to obtaining the motion data and contact data: continuously sampling the motion data and the contact data; determining, based on a threshold corresponding to the continuously sampled motion data and a threshold corresponding to the continuously sampled contact data, when a start of a hand action occurs; determining when an end of the hand action occurs based on the start of the hand action occurring; extracting the sampled motion data from the continuously sampled motion data based on the start of the hand action and the end of the hand action; extracting the sampled contact data from the continuously sampled contact data based on the start of the hand action and the end of the hand action; storing the sampled motion data as the motion data; and storing the sampled contact data as the contact data.
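- As a rough illustration of such threshold-based segmentation of continuously sampled streams (the activity measures, threshold values and array shapes below are assumptions for the sketch, not values from the disclosure):

```python
import numpy as np

def segment_hand_action(motion_stream, contact_stream,
                        motion_threshold=0.2, contact_threshold=0.05):
    """Extract one hand action from continuously sampled streams.

    motion_stream: (T, C_m) array of IMU samples (e.g. accelerometer + gyroscope channels).
    contact_stream: (T, C_c) array of flattened contact frames.
    Thresholds are illustrative; real values would be tuned per device.
    """
    # Per-sample activity measures: motion energy and total contact.
    motion_energy = np.linalg.norm(motion_stream, axis=1)
    contact_energy = contact_stream.sum(axis=1)

    active = (motion_energy > motion_threshold) | (contact_energy > contact_threshold)
    if not active.any():
        return None  # no hand action detected in this window

    start = int(np.argmax(active))                    # first active sample (start of hand action)
    end = len(active) - int(np.argmax(active[::-1]))  # one past last active sample (end of hand action)
    return motion_stream[start:end], contact_stream[start:end]
```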
- the motion sensing device includes an inertial measurement unit (IMU).
- in some example aspects, the contact surface is a capacitive touch pad, the capacitive touch pad capturing the contact data in 2D.
- in some example aspects, the contact surface is a pressure sensor pad, the pressure sensor pad capturing the contact data in 3D.
- the method further comprising: obtaining peripheral contact data from a peripheral contact surface operatively coupled to the motion sensing device, that is configured to sense peripheral contact of the user's hand on the motion sensing device; and generating the multimodal hand state based on a fusing of the motion data, the contact data and the peripheral contact data.
- the multimodal hand state is a classification prediction corresponding to one or more classes of hand actions.
- the multimodal hand state is a real-time 3D skeletal representation of a user's hand in a 3D space.
- the present disclosure describes a system.
- the system comprises: a motion sensing device that is configured to sense motion of a user's hand and output corresponding motion data; a contact surface that is configured to sense contact of the user's hand and output corresponding contact data; one or more memories storing executable instructions; and one or more processors coupled to the motion sensing device, contact surface and one or more memories, the executable instructions configuring the one or more processors to: generate a multi-modal hand state based on a fusing of the motion data and the contact data.
- the present disclosure describes a non-transitory computer-readable medium having machine-executable instructions stored thereon which, when executed by a processor of a computing system, cause the computing system to perform any of the preceding example aspects of the method.
- FIG. 1 is a block diagram of a computing system that may be used for implementing a hand state prediction system, in accordance with example embodiments of the present disclosure
- FIG. 2 A is a block diagram illustrating a hand state prediction system, in accordance with examples of the present disclosure
- FIG. 2 B illustrates an example embodiment of the hand state prediction system of FIG. 2 A , with inputs obtained from a motion sensing device and a contact surface, in accordance with examples of the present disclosure
- FIG. 3 is a block diagram illustrating a multimodal hand state network of the hand state prediction system of FIG. 2 A , in accordance with examples of the present disclosure
- FIG. 4 is a flowchart illustrating an example method for determining a multimodal hand state, in accordance with examples of the present disclosure
- FIG. 5 is a perspective view of an example embodiment of a motion sensing device, configured for delimited sampling, in accordance with examples of the present disclosure.
- FIG. 6 is a perspective view of an example embodiment of a motion sensing device, configured for continuous sampling, in accordance with examples of the present disclosure.
- statements that a second item is “based on” a first item can mean that characteristics of the second item are affected or determined at least in part by characteristics of the first item.
- the first item can be considered an input to an operation or calculation, or a series of operations or calculations that produces the second item as an output that is not independent from the first item.
- the present disclosure describes a hand state prediction system which processes input motion sensor and contact surface signals and generates hand state predictions.
- the hand state prediction system includes a machine learning-based model, such as a neural network model that is trained to convert inertial motion measurements and surface contact data into predictions of a corresponding hand state in response to a hand action, for example, a hand gesture or gripping posture. For each window of data sampled, motion data and contact data are obtained, processed and fused to generate a fused output.
- the hand state prediction system can operate in continuous mode to automatically detect a start and an end of a hand action, or a user can designate a start and an end of a hand action.
- a multimodal hand state is generated by a multimodal classifier by processing the fused output. Instructions represented by a hand action (e.g. a gesture or a posture) can be acted upon through a command action performed by a computing device or a computer application.
- Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.
- a neural network consists of neurons.
- a neuron is a computational unit that uses x_s and an intercept of 1 as inputs. An output from the computational unit may be h = f(W_1·x_1 + W_2·x_2 + . . . + W_n·x_n + b), where s = 1, 2, . . . , n, n is a natural number greater than 1, W_s is a weight of x_s, b is an offset (i.e. bias) of the neuron, and f is an activation function of the neuron, used to introduce a nonlinear feature to the neural network and to convert an input of the neuron to an output denoted as h.
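- A minimal numerical sketch of this neuron computation (NumPy is assumed, and ReLU is used only as one example of the activation function f):

```python
import numpy as np

def neuron_output(x, W, b, f=lambda z: np.maximum(z, 0.0)):
    """Single neuron: h = f(sum_s W_s * x_s + b), here with a ReLU activation f."""
    return f(np.dot(W, x) + b)

# Example: three inputs x_1..x_3 with weights W_1..W_3 and bias b.
h = neuron_output(x=np.array([0.5, -1.0, 2.0]),
                  W=np.array([0.2, 0.4, -0.1]),
                  b=0.05)
```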
- a neural network may be constructed in layers, including an input layer that accepts inputs, an output layer that generates a prediction as output, and in the case of deep neural networks (DNN), a plurality of hidden layers which are situated between the input layer and output layer.
- the output of the activation function in one layer may be used as an input to a neuron of a subsequent layer in the neural network. In other words, an output from one neuron may be an input to another neuron.
- Different activation functions may be used for different purposes in a neural network, with hidden layers commonly using different activation functions than output layers.
- a layer is considered to be a fully connected layer when there is a full connection between two adjacent layers of the neural network, that is, for two adjacent layers (e.g., the i-th layer and the (i+1)-th layer), each and every neuron in the i-th layer is connected to each and every neuron in the (i+1)-th layer.
- within a fully connected layer, an operation is performed on an input vector x to obtain an output vector y (e.g. y = f(W·x + b), consistent with the neuron output above).
- the weights W and offset vectors b may be referred to as parameters of the neural network, the optimal values of which may be learned by training the neural network.
- a greater number of hidden layers may enable the DNN to better model a complex situation (e.g., a real-world situation).
- a DNN with more parameters is more complex, has a larger capacity (which may refer to the ability of a learned model to fit a variety of possible scenarios), and indicates that the DNN can complete a more complex learning task.
- Training of the DNN is a process of learning the weight matrix.
- a purpose of the training is to obtain a trained weight matrix, which consists of the learned weights W of all layers of the DNN.
- the initial weights need to be set. For example, an initialization function such as random or Gaussian distributions may define initial weights.
- In the process of training a DNN, two approaches are commonly used: supervised learning and unsupervised learning.
- In unsupervised learning, the neural network is not provided with any information on desired outputs, and the neural network is trained to arrive at a set of learned weights on its own.
- In supervised learning, a predicted value outputted by the DNN may be compared to a desired target value (e.g., a ground truth value).
- a weight vector (which is a vector containing the weights W for a given layer) of each layer of the DNN is updated based on a difference between the predicted value and the desired target value. For example, if the predicted value outputted by the DNN is excessively high, the weight vector for each layer may be adjusted to lower the predicted value.
- This comparison and adjustment may be carried out iteratively until a convergence condition is met (e.g., a predefined maximum number of iterations has been performed, or the weight vector converges).
- a loss function or an objective function is defined, as a way to quantitatively represent how close the predicted value is to the target value.
- An objective function represents a quantity to be optimized (e.g., minimized or maximized) in order to bring the predicted value as close to the target value as possible.
- a loss function more specifically represents the difference between the predicted value and the target value, and the goal of training the DNN is to minimize the loss function.
- Backpropagation is an algorithm for training a DNN.
- Backpropagation is used to adjust (also referred to as update) a value of a parameter (e.g., a weight) in the DNN, so that the error (or loss) in the output becomes smaller.
- a defined loss function is calculated, from forward propagation of an input to an output of the DNN.
- Backpropagation calculates a gradient of the loss function with respect to the parameters of the DNN, and a gradient algorithm (e.g., gradient descent) is used to update the parameters to reduce the loss function. Backpropagation is performed iteratively, so that the loss function is converged or minimized.
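- A generic illustration of one such loop of forward propagation, loss computation, backpropagation and gradient descent, assuming PyTorch and placeholder data (this is not the specific training procedure or architecture of the disclosed classifiers):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))    # toy DNN
loss_fn = nn.CrossEntropyLoss()                                          # loss function to minimize
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)                 # gradient descent

x = torch.randn(8, 16)                 # a batch of inputs (placeholder data)
y = torch.randint(0, 4, (8,))          # ground-truth class labels

for _ in range(100):                   # iterate until convergence (fixed count here)
    logits = model(x)                  # forward propagation of the input to the output
    loss = loss_fn(logits, y)          # compare predicted values to target values
    optimizer.zero_grad()
    loss.backward()                    # backpropagation: gradients of the loss w.r.t. the weights
    optimizer.step()                   # update the parameters to reduce the loss
```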
- a convolutional neural network (CNN) is a neural network that is designed to find spatial relationships in data.
- CNNs are commonly used in applications related to computer vision or image processing for purposes of classification, regression, segmentation and/or object detection.
- a CNN is a DNN with a convolutional structure.
- the CNN includes a feature extractor consisting of a convolutional layer and a sub-sampling layer.
- the convolutional layer consists of kernels or filters that are convolved with a two-dimensional (2D) input image to generate feature maps or feature representations using a trainable filter.
- a recurrent neural network (RNN) is a neural network that is designed to process sequential data and make predictions based on the processed sequential data.
- RNNs have an internal memory that remembers inputs (e.g. the sequential data), thereby allowing previous outputs (e.g. predictions) to be fed back into the RNN and information to be passed from one time step to the next time step.
- RNNs are commonly used in applications with temporal components, for example real-time applications or interactions.
- a “hand action” can mean an action intentionally performed by a user's hand, for example, while engaging a motion sensing device and a contact surface.
- a hand action may be a gesture or a hand movement.
- a hand action may be a hand posture.
- a “hand state” can mean the state of a user's hand in 3D space in response to, or while performing, a hand action.
- the hand state may include positional information about the position of the user's hand in 3D space.
- a user's hand state can be described while the user's hand is in motion, for example, while performing a gesture (e.g. a swipe action or a mid-air gesture) or a hand movement, such as writing or drawing.
- a user's hand state can be described while the user's hand is still or motionless, for example, while engaged in a specific hand posture.
- a user's hand state can be described with reference to a 3D skeletal model of the user's hand.
- modality refers to a particular mode in which something exists or is experienced or expressed.
- a modality can mean a mode of data collection (e.g. inertial motion or contact force).
- a modality can mean a way of operating an application (e.g. drawing mode or erasing mode).
- a “multimodal input” can mean an input that encompasses two or more input modalities, for example, a combination of two or more modes of input data.
- a multimodal input may be a single input that comprises a combination of individual inputs that were obtained from two or more different data sources, for example, an inertial motion sensor, a force sensor, a contact sensor etc.
- fusion can mean the consolidation of multiple elements into a single representation.
- for example, fusion may refer to a merging of information from different sensors (e.g. motion sensors, capacitive touchscreens or force sensors).
- Fusing information from different sources may help to enhance correlated features and reduce uncertainty in a system, leading to improved recognition accuracy.
- position can mean a physical configuration of the human body or a part of the human body.
- a hand position or a wrist position may describe the configuration of a user's hand or wrist in 3D space.
- a “gripping posture” may describe the configuration of a user's fingers around the shaft of a pen while holding a digital pen for the purpose of executing a task (e.g. writing or drawing) with the pen, or to execute a gesture with the pen.
- Gripping postures may be described in a number of ways, for example, common gripping postures include a correct grip, a close grip, a fold grip, a tuck grip, a squeeze grip, a hook grip, a wrap grip, a mount grip or a tripod grip.
- gesture can mean a particular movement of a part of the human body, or sequence of movements that may be used for non-verbal communication, for example, a controlled movement that contains meaning to a person who observes the movement, or to a device that receives an input representing the movement.
- gestures may be performed by a part of the body, for example, a finger executing a “swipe” gesture in contact with a touchscreen or in mid-air, or a gesture may be performed by a device being operated by a user, for example, a right-to-left movement executed by a user while holding a device (e.g. digital pen), among others.
- mode-switching can mean an act of switching from one mode of operation to another mode of operation. For example, switching between performing a writing operation and an erasing operation.
- command action is an action performed by a computing device or computer application in response to an instruction by a user.
- a command action associated with a “circle” gesture made by a user within a drawing application may be interpreted by the device as a “mode-switching” command and may have the effect of changing the user's mode of operation from “drawing” to “selecting an object” within the drawing canvas.
- Some examples of existing technologies applied to digital pens include the incorporation of pressure sensors at the tip of the tool for measuring an input force applied by the digital pen on the surface of a device, to assist with writing and drawing.
- digital pens are equipped with external buttons that when pressed, enable users to perform various functions, such as mode-switching (e.g. switching between performing a writing operation and an erasing operation).
- buttons may introduce additional complexity in operation and hardware cost, and may not be aesthetically pleasing.
- on compact devices such as digital pens, the shortcut keys and menu buttons typically employed on larger devices for tasks such as mode-switching cannot be accommodated.
- the present disclosure describes examples that may help to address some or all of the above drawbacks of existing technologies.
- FIG. 1 shows a block diagram of an example hardware structure of a computing system 100 that is suitable for implementing embodiments of the system and methods of the present disclosure, described herein. Examples of embodiments of system and methods of the present disclosure may be implemented in other computing systems, which may include components different from those discussed below.
- the computing system 100 may be used to execute instructions to carry out examples of the methods described in the present disclosure.
- the computing system 100 may also be used to train the machine learning models of the hand state prediction system 200, or the hand state prediction system 200 may be trained by another computing system.
- although FIG. 1 shows a single instance of each component, there may be multiple instances of each component in the computing system 100.
- the computing system 100 includes at least one processor 102 , such as a central processing unit, a microprocessor, a digital signal processor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a dedicated logic circuitry, a dedicated artificial intelligence processor unit, a graphics processing unit (GPU), a tensor processing unit (TPU), a neural processing unit (NPU), a hardware accelerator, or combinations thereof.
- the computing system 100 may include an input/output (I/O) interface 104, which may enable interfacing with an input device 106 (e.g., a keyboard, a mouse, a camera, a touchscreen, a stylus and/or a keypad) and/or an optional output device 114 (e.g., a display, a speaker and/or a printer).
- the I/O interface 104 may not be needed.
- the I/O interface 104 may buffer the data generated by the input device 106 and provide the data to the processor 102 to be processed in real-time or near real-time (e.g., within 10 ms, or within 100 ms).
- the I/O interface 104 may perform preprocessing operations on the input data, for example normalization, filtering, denoising, etc., prior to providing the data to the processor 102 .
- the I/O interface 104 may also translate control signals from the processor 102 into output signals suitable to each respective output device 114 .
- a display 116 may receive signals to provide a visual output to a user.
- the display 116 may be a touch-sensitive display (also referred to as a touchscreen) in which the touch sensor 110 is integrated.
- a touch-sensitive display may both provide visual output and receive touch input.
- the computing system 100 may include an optional communications interface 120 for wired or wireless communication with other computing systems (e.g., other computing systems in a network) or devices.
- the communications interface 120 may include wired links (e.g., Ethernet cable) and/or wireless links (e.g., one or more antennas) for intra-network and/or inter-network communications.
- the computing system 100 may include one or more memories 122 (collectively referred to as “memory 122 ”), which may include a volatile or non-volatile memory (e.g., a flash memory, a random access memory (RAM), and/or a read-only memory (ROM)).
- the non-transitory memory 122 may store instructions for execution by the processor 102 , such as to carry out example embodiments of methods described in the present disclosure.
- the memory 122 may store instructions for implementing any of the systems and methods disclosed herein.
- the memory 122 may include other software instructions, such as for implementing an operating system (OS) and other applications/functions.
- the memory 122 may also store other data 124, information, rules, policies, and machine-executable instructions described herein, including motion data 230 captured by the motion sensor 108, contact data 250 captured by the touch sensor 110 or the force sensor 112, or data representative of a user's hand motion captured by an input device on another computing system and communicated to the computing system 100.
- the computing system 100 may also include one or more electronic storage units (not shown), such as a solid state drive, a hard disk drive, a magnetic disk drive and/or an optical disk drive.
- data and/or instructions may be provided by an external memory (e.g., an external drive in wired or wireless communication with the computing system 100 ) or may be provided by a transitory or non-transitory computer-readable medium. Examples of non-transitory computer readable media include a RAM, a ROM, an erasable programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), a flash memory, a CD-ROM, or other portable memory storage.
- the storage units and/or external memory may be used in conjunction with the memory 122 to implement data storage, retrieval, and caching functions of the computing system 100.
- the components of the computing system 100 may communicate with each other via a bus, for example.
- the computing system 100 may be a single physical machine or device (e.g., implemented as a single computing device, such as a single workstation, single end user device, single server, etc.).
- the computing system may be a mobile communications device (e.g. a smartphone), a laptop computer, a tablet, a desktop computer, a wearable device, a vehicle driver assistance system, an assistive technology device, among others.
- the computing system 100 may comprise a plurality of physical machines or devices (e.g., implemented as a cluster of machines, server, or devices).
- the computing system 100 may be a virtualized computing system (e.g., a virtual machine, a virtual server) emulated on a cluster of physical machines or by a cloud computing system.
- FIG. 2 A shows a block diagram of an example hand state prediction system 200 of the present disclosure.
- the hand state prediction system 200 may be software implemented in the computing system 100 of FIG. 1, in which the processor 102 is configured to execute instructions 200-I of the hand state prediction system 200 stored in the memory 122.
- the hand state prediction system 200 includes a motion sensing device 220 , a contact surface 240 and a multimodal hand state network 260 .
- the hand state prediction system 200 receives an input of a hand action 210 and outputs a multimodal hand state 270 that may be transformed into a command action 290 .
- the hand action 210 may be representative of a gesture or a gesture sequence.
- gestures may include a left-to-right gesture, a right-to-left gesture, an up-to-down gesture, a down-to-up gesture and a circle or curved “rotation” gesture, among others.
- the hand action 210 may be representative of a gripping posture.
- gripping postures may include postures for holding a pen in a user's hand, for example postures with a correct grip, a close grip, a fold grip, a tuck grip, a squeeze grip, a hook grip, a wrap grip, a mount grip or a tripod grip, among others.
- Example gripping postures are described in: Bi, Hongliang, Jian Zhang, and Yanjiao Chen, “SmartGe: identifying pen-holding gesture with smartwatch,” IEEE Access 8 (2020): 28820-28830, the entirety of which is hereby incorporated by reference.
- the hand action 210 may be captured by a motion sensing device 220 that is configured to sense motion of a user's hand 202 , for example, a digital pen or a stylus equipped with a motion sensor 108 , to generate motion data 230 .
- the motion data 230 may be sampled over a predetermined period of time or the motion data 230 may be continuously sampled.
- the hand action 210 may also be captured by a contact surface 240 , for example, a 2D touch sensitive surface or a 3D pressure sensor pad, to generate contact data 250 .
- the contact data 250 may be sampled over a predetermined period of time or the contact data 250 may be continuously sampled.
- the hand state prediction system 200 may generate a multimodal hand state 270 .
- the multimodal hand state 270 may be a classification prediction corresponding to one or more classes of hand actions, for example, a gesture or a gripping posture classified from a set of gesture classes or a set of gripping posture classes.
- a multimodal hand state 270 may be generated based on decision criteria for classification, for example, using one hot encoding or comparing a maximum confidence probability to a pre-determined threshold.
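- A simple sketch of such decision criteria (the gesture class names and the 0.6 confidence threshold are illustrative placeholders, not values from the disclosure):

```python
import numpy as np

GESTURE_CLASSES = ["left_to_right", "right_to_left", "up_to_down", "down_to_up", "rotation"]

def decide_hand_state(class_probabilities, threshold=0.6):
    """Pick the class with maximum confidence, or return None if below the threshold."""
    probs = np.asarray(class_probabilities)
    best = int(np.argmax(probs))                      # index of the one-hot winner
    return GESTURE_CLASSES[best] if probs[best] >= threshold else None

state = decide_hand_state([0.05, 0.78, 0.07, 0.04, 0.06])  # -> "right_to_left"
```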
- the multimodal hand state 270 may indicate whether a gripping posture is correct or incorrect, based on a set of gripping posture classes or based on a 3D skeletal model of a user's hand posture.
- the multimodal hand state 270 may be a real-time 3D skeletal representation of a user's hand in 3D space, for example, a 3D skeleton model may map, in real-time, coordinates corresponding to one or more modeled skeletal features to a shape, position or posture of a user's hand.
- the multimodal hand state 270 may be transformed by an interpreter 280 into a command action 290 based on a predefined set of commands.
- a computing system or computer application running on a computing system that is capable of executing the predefined command action 290 may then be able to execute the command action 290 .
- a user may perform a hand action 210 such as a right-to-left motion gesture while interacting with an application on the computing system 100 such as an e-reader, which may then be received as motion data 230 and contact data 250 by the computing system 100 implementing the hand state prediction system 200 .
- the hand state prediction system 200 may process the motion data 230 and the contact data 250 to output a multimodal hand state 270 that captures the user's intent to “turn the page”.
- the computing system 100 may then be able to map the multimodal hand state 270 to a command action 290 from a predefined set of command actions that the user wishes to turn the page, and may execute the command action 290 .
- a user may perform a hand action 210 such as a circle motion gesture while interacting with a drawing application on the computing system 100 , which may then be received as motion data 230 and contact data 250 by the computing system 100 implementing the hand state prediction system 200 .
- the hand state prediction system 200 may process the motion data 230 and the contact data 250 to output a multimodal hand state 270 that captures the user's intent to switch modes from drawing mode to select mode, and “select an object” in the drawing canvas.
- the computing system 100 may then be able to map the multimodal hand state 270 to a command action 290 from a predefined set of command actions that the user wishes to switch modes of operation in the drawing application and select the object, and may execute the command action 290 .
- FIG. 2 B illustrates an example embodiment of the hand state prediction system 200 , where inputs are obtained from a motion sensing device 220 and a contact surface 240 that are configured to sense motion of a user's hand 202 , in accordance with examples of the present disclosure.
- the motion sensing device 220 may be an object operatively coupled to the user, for example, the motion sensing device 220 may be held in a user's hand, or the motion sensing device 220 may be coupled to an arm or wrist of a user, among others.
- the motion sensing device 220 may be a digital pen, for example, having a rigid body, the rigid body being an elongated shaft with a first end and a second end.
- the motion sensing device 220 is configured to sense motion of a user's hand 202, wherein the first end of the elongated shaft is proximal to the user's fingers and the second end of the elongated shaft is distal to the user's fingers, and wherein, in response to a hand action 210, the second end may experience a greater degree of translational and rotational motion than the first end.
- the motion sensing device 220 may be another object, for example, a wearable device.
- the user's hand 202 is also coupled to the contact surface 240 , for example, with a portion of the user's palm or wrist resting on the contact surface 240 to generate a contact area 242 .
- the contact area 242 may represent a pivot point for a user's palm or wrist while the user performs the hand action 210 .
- the contact area 242 may represent a drag motion of the palm of a user's hand 202 while the user performs the hand action 210 .
- more than one contact area 242 may be generated, for example, if more than one portion of a user's hand 202 (e.g. a finger) interacts with the contact surface 240 during the hand action 210 .
- the motion sensing device 220 includes a motion sensor 108, for example an inertial measurement unit (IMU), to detect the movement of the motion sensing device 220 in response to a user's hand action 210.
- the motion sensor 108 may be a 3 degree-of-freedom (3DoF) IMU, a 6 degree-of-freedom (6DoF) IMU or a 9 degree-of-freedom (9DoF) IMU, where the IMU may comprise an accelerometer that measures translational acceleration in 3-axes, a gyroscope that measures rotational velocity or acceleration in another 3-axes or optionally a magnetometer that measures a magnetic field strength in 3-axes.
- the motion data 230 generated by the motion sensing device 220 during a hand action 210 may be represented by 3 channels of time-series translational acceleration measurements (e.g. force or acceleration), 3 channels of time-series rotational velocity measurements (e.g. angular rate) and optionally 3 channels of time-series magnetic field measurements (e.g. orientation), corresponding to movement of the motion sensing device 220 in response to the hand action 210 .
- the motion sensing device 220 may sample the motion data 230 based on a start and an end of the hand action 210, and in some examples, the sampled motion data may be stored as motion data 230.
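- As an illustration, one window of 6-DoF motion data 230 could be held as a simple time-by-channel array (the 100 Hz sampling rate and 3-second window below are assumptions for the sketch):

```python
import numpy as np

SAMPLE_RATE_HZ = 100          # assumed IMU sampling rate
WINDOW_SECONDS = 3            # assumed length of one sampled hand action
CHANNELS = ["acc_x", "acc_y", "acc_z",      # translational acceleration (3 axes)
            "gyro_x", "gyro_y", "gyro_z"]   # rotational velocity (3 axes)

# One window of motion data: shape (timesteps, channels).
motion_data = np.zeros((SAMPLE_RATE_HZ * WINDOW_SECONDS, len(CHANNELS)), dtype=np.float32)
# A 9-DoF IMU would append three magnetometer channels ("mag_x", "mag_y", "mag_z").
```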
- the contact surface 240 may include a touch sensor 110 , for example a capacitive touch sensitive surface, to capture 2D positional information corresponding to the contact area 242 of a user's hand 202 on the contact surface 240 in response to the hand action 210 .
- a capacitive touch screen draws small electrical charges to a point of contact by a user, and functions as a capacitor in the region of contact.
- a change in the capacitance and electrostatic field in the capacitive panel of the touch sensitive surface provides location information corresponding to the contact area 242 .
- the contact data 250 generated by the contact surface 240 during a hand action 210 may be represented by a sequence of 2D contours defining the contact area 242 .
- the contact surface 240 may sample the contact data 250 based on a start and an end of the hand action 210 , and in some examples, the sampled contact data may be stored as contact data 250 .
- the contact surface 240 may include one or more force sensors 112 , where the one or more force sensors 112 may be arranged in a 2D array, for example, as a pressure pad, to measure a force distribution corresponding to the contact area 242 of a user's hand 202 on the contact surface 240 in response to the hand action 210 .
- the contact data 250 generated by the contact surface 240 during a hand action 210 may be represented by a sequence of force measurements distributed across the contact surface 240 and defined by the contact area 242 .
- the value of the force measurements may be proportional to the magnitude of the applied force by the user's hand 202 at each point in the pressure array of the contact surface 240 .
- the contact data 250 generated by one or more force sensors 112 may be considered to be three-dimensional (3D), including both 2D positional information and force measurements defining the contact area 242 .
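- One possible in-memory layout for the two kinds of contact data 250 (the sensor grid size and window length are arbitrary placeholders):

```python
import numpy as np

ROWS, COLS, TIMESTEPS = 32, 48, 300   # placeholder sensor grid and window length

# 2D capacitive contact data: a binary (or thresholded) touch map per timestep,
# giving only positional information for the contact area 242.
contact_2d = np.zeros((TIMESTEPS, ROWS, COLS), dtype=bool)

# 3D pressure-pad contact data: a force value per cell per timestep, combining
# 2D position with the magnitude of the applied force.
contact_3d = np.zeros((TIMESTEPS, ROWS, COLS), dtype=np.float32)
```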
- a benefit of generating contact data 250 with a capacitive touch sensitive surface is that touch sensitive surfaces require lower power and are readily embedded into many surfaces on commercial devices, for example, tablets, laptops, smartphones or dual-screen devices, compared to a pressure pad that may have greater power requirements.
- a benefit of generating contact data 250 with a pressure pad includes the collection of richer data, including information corresponding to the applied force of a user's hand 202 along with positional information, compared to a 2D touch sensitive surface that captures only 2D positional information.
- the motion data 230 and the contact data 250 are input to a multimodal hand state network 260 to generate a multimodal hand state 270 .
- FIG. 3 may be referenced.
- FIG. 3 is a block diagram illustrating a multimodal hand state network 260 , in accordance with examples of the present disclosure.
- a motion data pre-processor 310 receives the motion data 230 and generates pre-processed motion data 312 .
- the motion data pre-processor 310 may filter the motion data by mean zeroing the columns and forcing a unit variance or applying dynamic time warping (DTW) to time synchronize the data, or other pre-processing operations may be performed, depending on the system input requirements, or depending on the application.
- the motion data 230 may be pre-processed at the motion sensing device 220, or the motion data 230 may be transmitted to the computing system 100 to be pre-processed by the processor 102 of the computing system 100.
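- A minimal sketch of the mean-zeroing and unit-variance filtering step described above (the DTW time-synchronization step is omitted here):

```python
import numpy as np

def preprocess_motion(motion_data, eps=1e-8):
    """Mean-zero each channel (column) of the motion data and force unit variance."""
    mean = motion_data.mean(axis=0, keepdims=True)
    std = motion_data.std(axis=0, keepdims=True)
    return (motion_data - mean) / (std + eps)   # eps guards against constant channels
```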
- a contact data pre-processor 330 receives the contact data 250 and generates pre-processed contact data 332 .
- the contact data pre-processor 330 may convert 2D or 3D contact data 250 into motion history images (MHI). In some examples, other pre-processing operations may be performed, depending on the system input requirements, or depending on the application.
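- A simplified sketch of converting a sequence of contact frames into a single motion history image, where more recent contact appears brighter (the linear decay rule and its rate are assumptions; other MHI formulations exist):

```python
import numpy as np

def motion_history_image(contact_frames, decay=0.05):
    """Collapse (T, H, W) contact frames into one (H, W) motion history image."""
    mhi = np.zeros(contact_frames.shape[1:], dtype=np.float32)
    for frame in contact_frames:
        mhi = np.maximum(mhi - decay, 0.0)   # fade older contact
        mhi[frame > 0] = 1.0                 # stamp current contact at full intensity
    return mhi
```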
- the pre-processed motion data 312 is input to a trained motion classifier 320 to generate a first output 322 representing a motion state of the user's hand.
- the motion classifier 320 may be a neural network, for example, a RNN, or a DNN or the motion classifier 320 may be another machine learning model.
- the first output 322 may be a first classification probability (e.g. a prediction confidence) corresponding to each class in a set of one or more classes, where classes may represent various hand gestures or gripping postures.
- the pre-processed contact data 332 is input to a trained contact classifier 340 to generate a second output 342 representing a contact state of the user's hand.
- the contact classifier 340 may be a neural network, for example, a CNN, a RNN or a DNN or the contact classifier 340 may be another machine learning model.
- the second output 342 may be a second classification probability (e.g. a prediction confidence) corresponding to each class in a set of one or more classes, where classes may represent various hand gestures or gripping postures, among others.
- the motion classifier 320 and the contact classifier 340 may be trained, for example, using supervised learning using labeled training datasets including pre-processed motion data and pre-processed contact data obtained from the motion sensing device 220 and the contact surface 240 in response to hand motion executed by various users, using backpropagation to minimize a respective classification loss function, for example a motion classification loss function or a contact classification loss function.
- the first output 322 and the second output 342 may be fused to generate a fused output 350 .
- fusing the first output 322 and the second output 342 may comprise concatenating the first output 322 and the second output 342 , or other methods to fuse the first output 322 and the second output 342 may be used.
- the fused output 350 is input to a multimodal classifier 360 to generate a multimodal hand state 270 .
- the multimodal hand state 270 may be a multimodal classification probability (e.g. a prediction confidence) corresponding to each class in a set of one or more classes representing various hand actions 210 , for example, gestures or gripping postures.
- the multimodal classifier 360 may be a neural network, for example, a DNN, or the multimodal classifier 360 may be another machine learning model. In some examples, the multimodal classifier 360 may be trained, for example, using backpropagation to minimize a multimodal classification loss function.
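- As a rough sketch of how the classifiers and fusion described above could fit together (the layer sizes, the use of PyTorch, the flattened inputs and the five-class output are assumptions; the disclosure only requires per-modality classifiers whose outputs are fused, e.g. by concatenation, and then classified):

```python
import torch
import torch.nn as nn

class MultimodalHandStateNetwork(nn.Module):
    """Simplified stand-in for the multimodal hand state network 260."""

    def __init__(self, motion_dim=6 * 300, contact_dim=32 * 48, num_classes=5):
        super().__init__()
        # Motion classifier 320: here a feed-forward net over a flattened window of
        # pre-processed IMU samples (the disclosure allows an RNN or DNN instead).
        self.motion_classifier = nn.Sequential(
            nn.Linear(motion_dim, 64), nn.ReLU(), nn.Linear(64, num_classes))
        # Contact classifier 340: here a feed-forward net over a flattened motion
        # history image (the disclosure allows a CNN, RNN or DNN instead).
        self.contact_classifier = nn.Sequential(
            nn.Linear(contact_dim, 64), nn.ReLU(), nn.Linear(64, num_classes))
        # Multimodal classifier 360: operates on the concatenated (fused) outputs.
        self.multimodal_classifier = nn.Linear(2 * num_classes, num_classes)

    def forward(self, motion_features, contact_features):
        first_output = self.motion_classifier(motion_features)            # first output 322
        second_output = self.contact_classifier(contact_features)         # second output 342
        fused_output = torch.cat([first_output, second_output], dim=-1)   # fused output 350
        return self.multimodal_classifier(fused_output).softmax(dim=-1)   # multimodal hand state 270
```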
- the multimodal hand state 270 output by the multimodal hand state network 260 may optionally be input to an interpreter 280 which is configured to transform the multimodal hand state 270 into a command action 290 based on a predefined set of commands.
- the predefined set of commands may be stored as data 124 in the memory 122 of the computing system 100 .
- a command action 290 may be an action taken by a computing device or computer application, such as an e-reader or a drawing application, in response to hand motion predictions representing a user intent.
- a command action 290 associated with a “right-to-left” gesture would cause a computer device or computer application such as an e-reader to turn the page, or in another example, perform a mode switching operation.
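- The interpreter 280 could be as simple as a lookup from the predicted class to a predefined command (the class and command names below are illustrative placeholders):

```python
COMMAND_MAP = {                      # predefined set of commands (illustrative only)
    "right_to_left": "turn_page_forward",
    "left_to_right": "turn_page_backward",
    "rotation": "switch_mode_and_select_object",
}

def interpret(multimodal_hand_state):
    """Transform a predicted hand state class into a command action, if one is defined."""
    return COMMAND_MAP.get(multimodal_hand_state)
```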
- FIG. 4 is a flowchart illustrating an example method 400 for determining a multimodal hand state 270 , in accordance with examples of the present disclosure.
- the method 400 may be performed by the computing system 100 .
- the method 400 represents operations performed by the multimodal hand state network 260 depicted in FIG. 3 .
- the processor 102 may execute computer readable instructions 200 -I (which may be stored in the memory 122 ) to cause the computing system 100 to perform the method 400 .
- Method 400 begins at step 402 in which motion data 230 is obtained from a motion sensing device 220 configured to sense the motion of a user's hand in response to performing a hand action 210 .
- the motion data 230 may be representative of movement of a user's hand captured by a motion sensor 108 of the computing system 100 , and corresponding to a motion sensing device 220 .
- contact data 250 is obtained from a contact surface 240 configured to sense the contact of a user's hand in response to performing a hand action 210.
- the contact data 250 may be representative of movement of a user's hand captured by a touch sensor 110 or a force sensor 112 of the computing system 100 , and corresponding to a contact surface 240 .
- a multimodal hand state 270 is generated based on a fusing of the motion data 230 and the contact data 250 .
- the multimodal hand state 270 may be a multimodal classification probability (e.g. a prediction confidence) corresponding to each class in a set of one or more classes, for example, a gesture or a gripping posture classified from a set of gesture classes or a set of gripping posture classes.
- a multimodal hand state 270 may be generated based on decision criteria for classification, for example, using one hot encoding or comparing a maximum confidence probability to a pre-determined threshold.
- the multimodal hand state 270 may indicate whether a gripping posture is correct or incorrect, based on a set of gripping posture classes or based on a 3D skeletal model of a user's hand posture.
- the multimodal hand state 270 may be a real-time 3D skeletal representation of a user's hand in 3D space, for example, a 3D skeleton model may map, in real-time, coordinates corresponding to one or more modeled skeletal features to a shape, position or posture of a user's hand.
- the motion data 230 may be processed to generate a first output 322 representing a motion state of a user's hand.
- motion data 230 may be pre-processed, for example, the motion data 230 may be filtered by mean zeroing the columns and forcing a unit variance or by applying dynamic time warping (DTW) to time synchronize the data.
- other pre-processing operations may be performed, depending on the system input requirements, or depending on the application.
- the pre-processed motion data 312 may be classified to generate the first output 322 , where the first output 322 may be a first classification probability (e.g. a prediction confidence) corresponding to each class in a set of one or more classes, where classes may represent various hand gestures or gripping postures.
- the contact data 250 may be processed to generate a second output 342 representing a contact state of a user's hand.
- the contact data 250 may be pre-processed, for example, the 2D or 3D contact data 250 may be converted into motion history images. In some examples, other pre-processing operations may be performed, depending on the system input requirements, or depending on the application.
- the pre-processed contact data 332 may be classified to generate the second output 342 , where the second output 342 may be a second classification probability (e.g. a prediction confidence) corresponding to each class in a set of one or more classes, where classes may represent various hand gestures or gripping postures.
- fusing the motion data 230 and the contact data 250 may comprise fusing the first output 322 and the second output 342 to generate a fused output 350 .
- fusing the first output 322 and the second output 342 may comprise concatenating the first output 322 and the second output 342 , or other methods to fuse the first output 322 and the second output 342 may be used.
- the fused output 350 may be a joint representation of both the inertial motion and contact force modality.
- step 406 may be described as performing a fusion of multimodal features.
- Feature fusion may be described as a method to integrate features extracted from different data sources, enhancing the features distinguished by the feature extractors, for example, by fusing representations from different modalities such as inertial motion and contact force.
- a benefit of using a joint representation of the modalities may be that additional information may be extracted from the contact force modality (e.g. contact data 250 ) to help capture important aspects of a hand action 210 that are not present in the motion data 230 alone.
- the multimodal hand state 270 may be transformed, for example by an interpreter 280 into a command action 290 , based on a predefined set of commands.
- the predefined set of commands may be stored as data 124 in the memory 122 of the computing system 100 .
- a command action 290 may be an action taken by a computing device or computer application, such as an e-reader or a drawing application, in response to hand action predictions representing a user intent.
- a command action 290 associated with a “right-to-left” swipe gesture would cause a computer device or computer application such as an e-reader to turn the page, or in another example, perform a mode switching operation.
- a motion sensing device 220 for capturing a hand action 210 is provided.
- the motion sensing device 220 includes a motion sensor 108 and is configured to interact with a computing system 100 and a user to instruct command actions 290.
- the motion sensing device 220 may be held in a user's hand 202 while the user is simultaneously interacting with a contact surface 240 to instruct command actions 290 .
- FIGS. 5 and 6 illustrate an example motion sensing device 220 according to example embodiments.
- the motion sensing device 220 may take the form of a digital pen or a stylus, having a rigid body 510 that extends as a shaft along an elongate axis from a first axial end 520 to a second axial end 530 or the motion sensing device 220 may be another device.
- a motion sensing device 220 may have a body 510 that is configured to allow a user to grip the digital pen and the body 510 may be cylindrical along its length.
- the motion sensing device 220 may have a tapered tip 540 provided at the first axial end 520 of the body 510 .
- the tip 540 may be used to actuate user-interface elements on a touchscreen display.
- the motion sensing device 220 may incorporate a writing pen.
- the motion sensing device 220 may have an ink-dispensing writing tip at the tip 540 .
- a motion sensor 108 may be coupled to the motion sensing device 220 , for example, the motion sensor 108 may be coupled to the second axial end 530 of the motion sensing device 220 or the motion sensor 108 may be coupled to another location on the motion sensing device 220 .
- positioning the motion sensor 108 at the second axial end 530 may have the advantage of capturing greater or more exaggerated translational or rotational movement while a user interacts with the motion sensing device 220 .
- motion data 230 captured by the motion sensing device 220 may be transmitted to the computing system 100 to be pre-processed or motion data 230 may be pre-processed at the motion sensing device 220 and pre-processed motion data 312 may be transmitted to the computing system 100 .
- FIG. 5 is a perspective view of an example embodiment of a motion sensing device 220 in the form of a digital pen, configured for delimited sampling, in accordance with examples of the present disclosure.
- a delimiter button 550 may be located on motion sensing device 220 , for example the delimiter button 550 may be located on the body 510 of the motion sensing device 220 .
- the delimiter button 550 may be located near the first axial end 520 of the body 510 or it may be located elsewhere on the motion sensing device 220 .
- the delimiter button 550 is configured to interact with the computing system 100 to determine that a “button interaction” has occurred, in order to initiate or end data sampling.
- a user may instruct the motion sensing device 220 to begin sampling motion data 230 by depressing the delimiter button 550 in a single “button click”, and in response to receiving the instruction to begin sampling motion data 230, the motion sensing device 220 may sample the motion data 230 for a pre-determined period of time, for example, 3 seconds.
- the user may instruct the motion sensing device 220 to begin sampling motion data 230 by depressing the delimiter button 550 in a first single “button click”, and in response to receiving the instruction to begin sampling motion data 230, the motion sensing device 220 may sample the motion data 230 until an instruction is received to end the sampling period, for example, by the user depressing the delimiter button 550 in a second “button click”.
- the user may instruct the motion sensing device 220 to begin sampling motion data 230 by depressing and holding the delimiter button 550 in a depressed condition, the motion sensing device 220 continuing to sample the motion data 230 until the delimiter button 550 is released by the user, signaling an end to the data sampling period.
- the multimodal hand state network 260 may be trained to recognize the end of a hand gesture, and the user may instruct the motion sensing device 220 to begin sampling motion data 230 by depressing the delimiter button 550 in a single “button click”, and in response to receiving the instruction to begin sampling motion data 230, the motion sensing device 220 may sample the motion data 230 until the multimodal hand state network 260 recognizes that the hand gesture is complete and instructs an end to the data sampling period.
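- As an illustrative sketch, the click-delimited sampling modes described above might be implemented with a simple polling loop; the press-and-hold mode is omitted, and `read_button_pressed` and `read_imu_sample` are hypothetical callables standing in for the device driver:

```python
import time

def wait_for_click(read_button_pressed, poll_hz=100):
    """Block until the delimiter button is pressed and then released."""
    while not read_button_pressed():
        time.sleep(1.0 / poll_hz)
    while read_button_pressed():
        time.sleep(1.0 / poll_hz)

def sample_delimited(read_button_pressed, read_imu_sample, fixed_window_s=None, poll_hz=100):
    """Click-delimited sampling: a first click starts sampling; sampling ends
    after fixed_window_s seconds if given (e.g. 3.0), otherwise on a second click."""
    samples = []
    wait_for_click(read_button_pressed)              # first "button click" starts sampling
    start = time.time()
    while True:
        samples.append(read_imu_sample())
        time.sleep(1.0 / poll_hz)
        if fixed_window_s is not None:
            if time.time() - start >= fixed_window_s:
                break                                # pre-determined sampling window elapsed
        elif read_button_pressed():
            break                                    # second "button click" ends sampling
    return samples
```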
- the motion sensing device 220 is synchronized with the contact surface 240 and instructions received at the motion sensing device 220 to initiate or end data sampling are also received at the contact surface 240 to initiate or end data sampling at the contact surface 240 .
- the sampled motion data may be stored as motion data 230 .
- FIG. 6 is a perspective view of an example embodiment of a motion sensing device 220 in the form of a digital pen, configured for continuous sampling, in accordance with examples of the present disclosure.
- the delimiter button 550 described with respect to FIG. 5 may be absent in a motion sensing device 220 configured for continuous sampling.
- no explicit instruction may be provided to the motion sensing device 220 or the contact surface 240 by the user, signaling the start or the end of sampling, and the motion sensing device 220 and the contact surface 240 may be continuously sampling motion data at all times and generating a stream of sampled motion data and sampled contact data.
- a control operator within the computing system 100 may monitor the sampled motion data 230 and the sampled contact data 250 to detect the start of a hand action 210 .
- a control operator within the computing system 100 may compare the motion data and contact data values to a threshold level to signal the start or the end of sampling, for example, a threshold level of translational or rotational motion measured by the motion sensor 108 that indicates that the motion sensing device 220 is in motion, or a non-zero measure on the touch sensor 110 or the force sensor 112 that indicates that a user's hand 202 is in contact with the contact surface 240.
- a control operator may initiate data sampling for the motion sensing device 220 and the contact surface 240 for a pre-determined period of time, for example, 3 seconds.
- the control operator may repeat the initiation of data sampling in rolling windows of pre-determined length (e.g. 3 seconds) until an indication has been received that the hand action 210 has ended.
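- As an illustrative sketch, such a control operator might be written as a generator that watches the continuously sampled streams and emits rolling windows once a threshold is crossed; the thresholds, window length and stream interfaces are assumptions, and detection of the end of the hand action 210 is not shown:

```python
from collections import deque
import numpy as np

def rolling_action_windows(motion_stream, contact_stream, sample_hz=100,
                           motion_threshold=0.5, window_s=3.0):
    """Yield (motion_window, contact_window) arrays of pre-determined length
    once either stream indicates activity.  motion_stream yields per-sample
    motion magnitudes and contact_stream yields per-sample contact measures;
    both are hypothetical iterables standing in for the sensor drivers."""
    window_len = int(window_s * sample_hz)
    motion_buf = deque(maxlen=window_len)
    contact_buf = deque(maxlen=window_len)
    triggered, count = False, 0
    for motion_sample, contact_sample in zip(motion_stream, contact_stream):
        motion_buf.append(motion_sample)
        contact_buf.append(contact_sample)
        if not triggered and (abs(motion_sample) > motion_threshold or contact_sample > 0):
            triggered, count = True, 0               # start of a hand action detected
        if triggered:
            count += 1
            if count >= window_len:
                yield np.array(motion_buf), np.array(contact_buf)
                count = 0                            # repeat in rolling windows until the action ends
```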
- the multimodal hand state network 260 may be trained to recognize the beginning or the end of a hand action 210, for example, a gesture, and may segment discrete portions of the continuous motion data 230 and the continuous contact data 250 for processing and classification.
- the continuously sampled motion data may be stored as motion data 230 .
- In addition to a handheld electronic device (e.g., a tablet or a smartphone), examples of the present disclosure may be implemented using other electronic devices, such as electronic wearable devices including smart watches or smart gloves, among others.
- motion sensors 108 can be mounted on the surface of a wearable device, such as a smart watch to capture wrist movement.
- a touch sensor 110 or a force sensor 112 may be integrated into a vehicle, for example, on the steering wheel or console screen, for human-computer interaction during driving.
- the methods, systems and devices described herein may be used to predict a multimodal hand state 270 by modeling a user's hand posture in 3D space, rather than classifying a gesture or a gripping posture.
- a modeling method to model a user's hand posture in 3D space may include a 3D skeletal model, or another modeling method may be used.
- the multimodal hand state 270 may be a real-time 3D skeletal representation of a user's hand in 3D space, for example, a 3D skeleton model may map, in real-time, coordinates corresponding to one or more skeletal features to a shape, position or posture of a user's hand.
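- As an illustrative sketch, a real-time 3D skeletal hand representation of this kind might be carried as a simple data structure; the 21-keypoint layout assumed below is a common convention and is not specified by the disclosure:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SkeletalHandState:
    """A multimodal hand state expressed as 3D joint coordinates rather than a
    class label; a 21-keypoint layout (wrist plus four joints per finger) is
    assumed here purely for illustration."""
    timestamp: float
    joints_xyz: np.ndarray  # shape (21, 3): x, y, z per keypoint

    def fingertips(self) -> np.ndarray:
        # In the assumed layout, fingertips are keypoints 4, 8, 12, 16 and 20.
        return self.joints_xyz[[4, 8, 12, 16, 20]]

state = SkeletalHandState(timestamp=0.0, joints_xyz=np.zeros((21, 3)))
```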
- motion data 230 and contact data 250 may be optionally augmented by additional inputs, for example, a peripheral contact surface that is operatively coupled to the exterior of the motion sensing device 220 that is configured to sense peripheral contact of the user's hand on the motion sensing device 220 while the user is holding the motion sensing device 220 .
- the peripheral contact surface may be a touch sensitive surface or a pressure array coupled to the exterior of the motion sensing device 220 .
- the peripheral contact surface may capture peripheral contact data corresponding to surface area or applied force caused by the user's fingers or hand contacting the exterior of the motion sensing device 220 while executing a gripping posture or a gesture.
- the peripheral contact data may be processed to generate a peripheral contact output representing a peripheral contact state based on the peripheral contact data.
- the peripheral contact output may be fused with the first output 322 and the second output 342 to generate a second fused output, and where the second fused output may be processed to generate the multimodal hand state 270 .
- an additional input may be a camera for capturing images or point data related to the position of the user's hand in 3D space.
- modeling the user's hand motion in 3D space takes place in real-time, for example, with motion data 230 , contact data 250 and optionally, peripheral contact data or camera data being continuously sampled, and the hand state prediction system 200 continuously re-processing and updating the generated multimodal hand state 270 as new input data is received.
- a modeled hand position may be output to an application on an electronic device (e.g., a software application executed by the computing system 100 ) to estimate a deviation in a modeled gripping posture from a target gripping posture. For example, if the application on the electronic device is an assistive tool to support children during early age development, obtaining accurate estimates of a modeled gripping posture may prompt or assist children in learning or modifying their grip to more correct postures.
- Although the present disclosure is described, at least in part, in terms of methods, a person of ordinary skill in the art will understand that the present disclosure is also directed to the various components for performing at least some of the aspects and features of the described methods, be it by way of hardware components, software or any combination of the two. Accordingly, the technical solution of the present disclosure may be embodied in the form of a software product.
- a suitable software product may be stored in a pre-recorded storage device or other similar non-volatile or non-transitory computer readable medium, including DVDs, CD-ROMs, USB flash disk, a removable hard disk, or other storage media, for example.
- the software product includes instructions tangibly stored thereon that enable an electronic device (e.g., a personal computer, a server, or a network device) to execute examples of the methods disclosed herein.
Abstract
In various examples, the present disclosure describes methods and systems for generating hand state predictions. A hand state prediction system includes a machine learning-based model, such as a neural network model, that is trained to convert inertial motion measurements and surface contact data into predictions of a corresponding hand gesture or gripping posture. For each window of data sampled, motion data and contact data are obtained, processed and fused to generate a fused prediction. The hand state prediction system can operate in continuous mode to automatically detect a start and an end of a hand action, or a user can designate a start and an end of a hand action. A hand state prediction is generated by a multimodal classifier by processing the fused prediction. Instructions represented by a hand action can be acted upon through a command action performed by a computing device or a computer application.
Description
- The present disclosure is related to methods and systems for 3D hand state prediction, in particular, for classifying and modeling 3D hand motion or 3D hand posture using inputs from multiple modalities.
- Digital pens have emerged as a popular tool for interacting with digital devices such as tablets, smartphones or laptops with touchscreens. As digital pens can mimic interaction with traditional pen and paper, digital pens are often employed for tasks such as writing and drawing, or for digital interactions that require higher levels of intricacy such as navigation or playing games, among others. Early versions of stylus tools for use with personal digital assistant (PDA) devices were limited to point and click operations, however recent generations of digital pens are often equipped with sensors that can provide additional inputs for improving human-computer interaction.
- In many software applications, user interfaces require the ability to switch between modes of operation or to instruct commands that adjust mode attributes. Traditional approaches to such interactions relied on menus, widgets or shortcut keys; however, these can be cumbersome to manipulate on small screens. Therefore, a need exists to improve the ease and efficiency of digital pen interaction with digital devices.
- Accordingly, it would be useful to provide a method and system for improving user experience when interacting with digital devices using a digital pen.
- In various examples, the present disclosure describes a hand state prediction system which processes input motion sensor and contact surface signals and generates hand state predictions. The hand state prediction system includes a machine learning-based model, such as a neural network model that is trained to convert inertial motion measurements and surface contact data into predictions of a corresponding hand state in response to a hand action, for example, a hand gesture or gripping posture. For each window of data sampled, motion data and contact data are obtained, processed and fused to generate a fused output. The hand state prediction system can operate in continuous mode to automatically detect a start and an end of a hand action, or a user can designate a start and an end of a hand action. A multimodal hand state is generated by a multimodal classifier by processing the fused output. Instructions represented by a hand action (e.g. a gesture or a posture) can be acted upon through a command action performed by a computing device or a computer application.
- In some examples, representations of the state of a user's hand in 3D space captured by IMU data and surface contact data are fused into a fused output that may be learned by a neural network of the hand state prediction system.
- The hand state prediction system combines information from multiple modalities (e.g. inertial motion data generated by a device held in a user's hand and surface contact data generated by the user's hand), for example, by fusing a prediction of a hand action based on data from a motion sensing device (i.e. 3D motion captured by a motion sensing device in response to the user's hand action) and a prediction of a hand action based on contact data (i.e. contact area and optionally force measurements generated in response to the user's hand action, while the user's hand is resting on a contact surface) into a fused output, which results in a better prediction of a user's hand state. Combining information from multiple modalities into a fused output may enable additional information to be extracted from the contact surface data to help to capture important aspects of hand action, such as balance and motor control that may not be present in 3D motion data captured from a motion sensing device alone.
- A neural network included in the hand state prediction system is optimized to learn better representations from each modality (e.g. hand motion and hand contact area or contact force), contributing to improved overall performance of the hand state prediction system. For example, a motion classifier of the neural network configured to process translational and rotational motion from IMU data is optimized to classify hand motion using IMU data while a contact classifier of the neural network configured to process surface contact data is optimized to classify hand motion using surface contact data. Improved performance of the hand state prediction system may therefore be demonstrated by more accurately predicting a hand gesture or gripping position.
- Hand motion data can be acquired from low cost and low power devices to simplify implementation. A low cost, low power and low profile IMU motion sensor (e.g. 3-degree of freedom (3-DoF) IMU, a 6-degree of freedom (6-DoF) IMU, or a 9-degree of freedom (9-DoF) IMU) may be coupled to a device used to capture hand motion, for example, coupled to a digital pen body or coupled to another device. Similarly, for applications requiring lower power consumption, a capacitive touch screen can be used as the contact sensor instead of a 3D pressure pad. Flexible hardware and software configuration enable discrete or continuous sampling.
- In some aspects, the present disclosure describes a method for generating a multimodal hand state prediction. The method includes: obtaining motion data from a motion sensing device that is configured to sense motion of a user's hand; obtaining contact data from a contact surface that is configured to sense contact of the user's hand; and generating a multimodal hand state based on a fusing of the motion data and the contact data.
- In some aspects of the method, generating the multimodal hand state comprises: pre-processing the motion data to generate pre-processed motion data; and classifying the pre-processed motion data using a trained motion classifier to generate a first output, the first output including a probability corresponding to one or more classes.
- In some aspects of the method, generating the multimodal hand state further comprises: pre-processing the contact data to generate pre-processed contact data; and classifying the pre-processed contact data using a trained contact classifier to generate a second output, the second output including a probability corresponding to one or more classes.
- In some aspects of the method, generating the multimodal hand state further comprises: concatenating the first output and the second output to generate a fused output.
- In some aspects of the method, generating the multimodal hand state further comprises: classifying the fused output using a trained multimodal classifier to generate the multimodal hand state, the multimodal hand state including a probability corresponding to one or more classes.
- In some aspects of the method, prior to obtaining the motion data and contact data: receiving an instruction to begin sampling the motion data, and when the instruction to begin sampling the motion data is received, sampling the motion data; receiving an instruction to begin sampling the contact data, and when the instruction to begin sampling the contact data is received, sampling the contact data; receiving an instruction to end sampling the motion data; receiving an instruction to end sampling the contact data; storing the sampled motion data as the motion data; and storing the sampled contact data as the contact data.
- In some aspects of the method, prior to obtaining the motion data and contact data: continuously sampling the motion data and the contact data; determining, based on a threshold corresponding to the continuously sampled motion data and a threshold corresponding to the continuously sampled contact data, when a start of a hand action occurs; determining when an end of the hand action occurs based on the start of the hand action occurring; extracting the sampled motion data from the continuously sampled motion data based on the start of the hand action and the end of the hand action; extracting the sampled contact data from the continuously sampled contact data based on the start of the hand action and the end of the hand action; storing the sampled motion data as the motion data; and storing the sampled contact data as the contact data.
- In some aspects of the method, further comprising: transforming the multimodal hand state into a command action based on a predefined set of commands.
- In some aspects of the method, wherein the motion sensing device includes an inertial measurement unit (IMU).
- In some aspects of the method, wherein the contact surface is a capacitive touch pad, the capacitive touch pad capturing the contact data in 2D.
- In some aspects of the method, wherein the contact surface is a pressure sensor pad, the pressure sensor pad capturing the contact data in 3D.
- In some aspects of the method, further comprising: obtaining peripheral contact data from a peripheral contact surface that is operatively coupled to the motion sensing device and is configured to sense peripheral contact of the user's hand on the motion sensing device; and generating the multimodal hand state based on a fusing of the motion data, the contact data and the peripheral contact data.
- In some aspects of the method, wherein the multimodal hand state is a classification prediction corresponding to one or more classes of hand actions.
- In some aspects of the method, wherein the multimodal hand state is a real-time 3D skeletal representation of a user's hand in a 3D space.
- In some aspects, the present disclosure describes a system. The system comprises: a motion sensing device that is configured to sense motion of a user's hand and output corresponding motion data; a contact surface that is configured to sense contact of the user's hand and output corresponding contact data; one or more memories storing executable instructions; and one or more processors coupled to the motion sensing device, contact surface and one or more memories, the executable instructions configuring the one or more processors to: generate a multi-modal hand state based on a fusing of the motion data and the contact data.
- In some aspects, the present disclosure describes a non-transitory computer-readable medium having machine-executable instructions stored thereon which, when executed by a processor of a computing system, cause the computing system to perform any of the preceding example aspects of the method.
- Reference will now be made, by way of example, to the accompanying drawings which show example embodiments of the present application, and in which:
- FIG. 1 is a block diagram of a computing system that may be used for implementing a hand state prediction system, in accordance with example embodiments of the present disclosure;
- FIG. 2A is a block diagram illustrating a hand state prediction system, in accordance with examples of the present disclosure;
- FIG. 2B illustrates an example embodiment of the hand state prediction system of FIG. 2A, with inputs obtained from a motion sensing device and a contact surface, in accordance with examples of the present disclosure;
- FIG. 3 is a block diagram illustrating a multimodal hand state network of the hand state prediction system of FIG. 2A, in accordance with examples of the present disclosure;
- FIG. 4 is a flowchart illustrating an example method for determining a multimodal hand state, in accordance with examples of the present disclosure;
- FIG. 5 is a perspective view of an example embodiment of a motion sensing device, configured for delimited sampling, in accordance with examples of the present disclosure; and
- FIG. 6 is a perspective view of an example embodiment of a motion sensing device, configured for continuous sampling, in accordance with examples of the present disclosure.
- Similar reference numerals may have been used in different figures to denote similar components.
- As used herein, statements that a second item (e.g., a signal, value, scalar, vector, matrix, calculation, or bit sequence) is “based on” a first item can mean that characteristics of the second item are affected or determined at least in part by characteristics of the first item. The first item can be considered an input to an operation or calculation, or a series of operations or calculations that produces the second item as an output that is not independent from the first item.
- The following describes example technical solutions of this disclosure with reference to accompanying figures. Similar reference numerals may have been used in different figures to denote similar components.
- In various examples, the present disclosure describes a hand state prediction system which processes input motion sensor and contact surface signals and generates hand state predictions. The hand state prediction system includes a machine learning-based model, such as a neural network model that is trained to convert inertial motion measurements and surface contact data into predictions of a corresponding hand state in response to a hand action, for example, a hand gesture or gripping posture. For each window of data sampled, motion data and contact data are obtained, processed and fused to generate a fused output. The hand state prediction system can operate in continuous mode to automatically detect a start and an end of a hand action, or a user can designate a start and an end of a hand action. A multimodal hand state is generated by a multimodal classifier by processing the fused output. Instructions represented by a hand action (e.g. a gesture or a posture) can be acted upon through a command action performed by a computing device or a computer application.
- To assist in understanding the present disclosure, the following describes some concepts relevant to hand motion classification, along with some relevant terminology that may be related to examples disclosed herein.
- Machine learning (ML) is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. A neural network consists of neurons. A neuron is a computational unit that uses xs and an intercept of 1 as inputs. An output from the computational unit may be:
- $h_{W,b}(x) = f(W^{T}x) = f\left(\sum_{s=1}^{n} W_{s}x_{s} + b\right)$
- where s=1, 2, . . . n, n is a natural number greater than 1, Ws is a weight of xs, b is an offset (i.e. bias) of the neuron and f is an activation function of the neuron and used to introduce a nonlinear feature to the neural network, to convert an input of the neuron to an output denoted as h.
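- As an illustrative sketch, the neuron computation above can be written directly; the sigmoid activation is chosen only as an example of f:

```python
import numpy as np

def neuron_output(x: np.ndarray, w: np.ndarray, b: float) -> float:
    """h = f(sum_s Ws * xs + b), with a sigmoid chosen as the activation f."""
    z = float(np.dot(w, x) + b)
    return 1.0 / (1.0 + np.exp(-z))

h = neuron_output(x=np.array([0.2, -1.0, 0.5]),
                  w=np.array([0.4, 0.1, -0.3]),
                  b=0.05)
```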
- A neural network may be constructed in layers, including an input layer that accepts inputs, an output layer that generates a prediction as output, and in the case of deep neural networks (DNN), a plurality of hidden layers which are situated between the input layer and output layer. The output of the activation function in one layer may be used as an input to a neuron of a subsequent layer in the neural network. In other words, an output from one neuron may be an input to another neuron. Different activation functions may be used for different purposes in a neural network, with hidden layers commonly using different activation functions than output layers.
- A layer is considered to be a fully connected layer when there is a full connection between two adjacent layers of the neural network. To be specific, for two adjacent layers (e.g., the i-th layer and the (i+1)-th layer) to be fully connected, each and every neuron in the i-th layer must be connected to each and every neuron in the (i+1)-th layer.
- Processing at each layer of the DNN may follow a linear relational expression: {right arrow over (y)}=α(W{right arrow over (x)}+{right arrow over (b)}), where {right arrow over (x)} is an input vector, {right arrow over (y)} is an output vector, {right arrow over (b)} is an offset vector, W is a weight (also referred to as a coefficient), and α(.) is an activation function. At each layer, the operation is performed on an input vector {right arrow over (x)}, to obtain an output vector {right arrow over (y)}. Because there is a large quantity of layers in the DNN, there is also a large quantity of weights W and offset vectors {right arrow over (b)}. The weights may be referred to as parameters of the neural network, the optimal values of which may be learned by training the neural network.
- In a DNN, a greater number of hidden layers may enable the DNN to better model a complex situation (e.g., a real-world situation). In theory, a DNN with more parameters is more complex, has a larger capacity (which may refer to the ability of a learned model to fit a variety of possible scenarios), and indicates that the DNN can complete a more complex learning task. Training of the DNN is a process of learning the weight matrix. A purpose of the training is to obtain a trained weight matrix, which consists of the learned weights W of all layers of the DNN. Before a DNN can be trained, the initial weights need to be set. For example, an initialization function such as random or Gaussian distributions may define initial weights.
- In the process of training a DNN, two approaches are commonly used: supervised learning and unsupervised learning. In unsupervised learning, the neural network is not provided with any information on desired outputs, and the neural network is trained to arrive at a set of learned weights on its own. In supervised learning, a predicted value outputted by the DNN may be compared to a desired target value (e.g., a ground truth value). A weight vector (which is a vector containing the weights W for a given layer) of each layer of the DNN is updated based on a difference between the predicted value and the desired target value. For example, if the predicted value outputted by the DNN is excessively high, the weight vector for each layer may be adjusted to lower the predicted value. This comparison and adjustment may be carried out iteratively until a convergence condition is met (e.g., a predefined maximum number of iterations has been performed, or the weight vector converges). A loss function or an objective function is defined, as a way to quantitatively represent how close the predicted value is to the target value. An objective function represents a quantity to be optimized (e.g., minimized or maximized) in order to bring the predicted value as close to the target value as possible. A loss function more specifically represents the difference between the predicted value and the target value, and the goal of training the DNN is to minimize the loss function.
- Backpropagation is an algorithm for training a DNN. Backpropagation is used to adjust (also referred to as update) a value of a parameter (e.g., a weight) in the DNN, so that the error (or loss) in the output becomes smaller. For example, a defined loss function is calculated, from forward propagation of an input to an output of the DNN. Backpropagation calculates a gradient of the loss function with respect to the parameters of the DNN, and a gradient algorithm (e.g., gradient descent) is used to update the parameters to reduce the loss function. Backpropagation is performed iteratively, so that the loss function is converged or minimized.
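- As an illustrative sketch, one supervised training step combining forward propagation, a loss function, backpropagation and a gradient descent update might look as follows; the toy network, layer sizes and use of PyTorch are illustrative choices, not part of the disclosure:

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))  # toy DNN
loss_fn = nn.CrossEntropyLoss()                     # loss function to be minimized
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(16, 4)                              # a batch of inputs
target = torch.randint(0, 3, (16,))                 # desired target classes (ground truth)

prediction = model(x)                               # forward propagation
loss = loss_fn(prediction, target)                  # difference between prediction and target
optimizer.zero_grad()
loss.backward()                                     # backpropagation: gradient of the loss w.r.t. the weights
optimizer.step()                                    # gradient descent update of the weights
```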
- A convolutional neural network (CNN) is a neural network that is designed to find the spatial relationship in data. CNNs are commonly used in applications related to computer vision or image processing for purposes of classification, regression, segmentation and/or object detection. A CNN is a DNN with a convolutional structure. The CNN includes a feature extractor consisting of a convolutional layer and a sub-sampling layer. The convolutional layer consists of kernels or filters that are convolved with a two-dimensional (2D) input image to generate feature maps or feature representations using a trainable filter.
- A recurrent neural network (RNN) is a neural network that is designed to process sequential data and make predictions based on the processed sequential data. RNNs have an internal memory that remembers inputs (e.g. the sequential data), thereby allowing previous outputs (e.g. predictions) to be fed back into the RNN and information to be passed from one time step to the next time step. RNNs are commonly used in applications with temporal components, for example real-time applications or interactions.
- In the present disclosure, a “hand action” can mean an action intentionally performed by a user's hand, for example, while engaging a motion sensing device and a contact surface. In some examples, a hand action may be a gesture or a hand movement. In other examples, a hand action may be a hand posture.
- In the present disclosure, a “hand state” can mean the state of a user's hand in 3D space in response to, or while performing, a hand action. In some examples, the hand state may include positional information about the position of the user's hand in 3D space. In some examples, a user's hand state can be described while the user's hand is in motion, for example, while performing a gesture (e.g. a swipe action or a mid-air gesture) or a hand movement, such as writing or drawing. In other examples, a user's hand state can be described while the user's hand is still or motionless, for example, while engaged in a specific hand posture. In some examples, a user's hand state can be described with reference to a 3D skeletal model of the user's hand.
- In the present disclosure, the term “modality” refers to a particular mode in which something exists or is experienced or expressed. For example, a modality can mean a mode of data collection (e.g. inertial motion or contact force). In another example, a modality can mean a way of operating an application (e.g. drawing mode or erasing mode).
- In the present disclosure, a “multimodal input” can mean an input that encompasses two or more input modalities, for example, a combination of two or more modes of input data. In this regard, a multimodal input may be a single input that comprises a combination of individual inputs that were obtained from two or more different data sources, for example, an inertial motion sensor, a force sensor, a contact sensor etc.
- In the present disclosure, “fusion” can mean the consolidation of multiple elements into a single representation. For example, a merging of information from different sensors (e.g. motion sensors, capacitive touchscreens or force sensors) can be an example of “sensor fusion.” Fusing information from different sources may help to enhance correlated features and reduce uncertainty in a system, leading to improved recognition accuracy.
- In the present disclosure, “position” can mean a physical configuration of the human body or a part of the human body. For example, a hand position or a wrist position may describe the configuration of a user's hand or wrist in 3D space.
- In the present disclosure, “posture” can mean an intentional or habitually assumed position for a specific purpose. For example, a “gripping posture” may describe the configuration of a user's fingers around the shaft of a pen while holding a digital pen for the purpose of executing a task (e.g. writing or drawing) with the pen, or to execute a gesture with the pen. Gripping postures may be described in a number of ways, for example, common gripping postures include a correct grip, a close grip, a fold grip, a tuck grip, a squeeze grip, a hook grip, a wrap grip, a mount grip or a tripod grip.
- In the present disclosure, “gesture” can mean a particular movement of a part of the human body, or sequence of movements that may be used for non-verbal communication, for example, a controlled movement that contains meaning to a person who observes the movement, or to a device that receives an input representing the movement. In some examples, gestures may be performed by a part of the body, for example, a finger executing a “swipe” gesture in contact with a touchscreen or in mid-air, or a gesture may be performed by a device being operated by a user, for example, a right-to-left movement executed by a user while holding a device (e.g. digital pen), among others.
- In the present disclosure, “mode-switching” can mean an act of switching from one mode of operation to another mode of operation. For example, switching between performing a writing operation and an erasing operation.
- In the present disclosure, “command action” is an action performed by a computing device or computer application in response to an instruction by a user. For example, a command action associated with a “circle” gesture made by a user within a drawing application may be interpreted by the device as a “mode-switching” command and may have the effect of changing the user's mode of operation from “drawing” to “selecting an object” within the drawing canvas.
- To assist in understanding the present disclosure, some existing technologies are first discussed.
- Some examples of existing technologies applied to digital pens include the incorporation of pressure sensors at the tip of the tool for measuring an input force applied by the digital pen on the surface of a device, to assist with writing and drawing. In other examples, digital pens are equipped with external buttons that when pressed, enable users to perform various functions, such as mode-switching (e.g. switching between performing a writing operation and an erasing operation).
- Some existing technologies have drawbacks in that the physical buttons may introduce additional complexity in operation and hardware cost, and may not be aesthetically pleasing. In addition, due to the small screen size associated with many personal electronic devices, user interfaces are limited in size and available space, and shortcut keys and menu buttons for tasks such as mode-switching typically employed on larger devices cannot be accommodated.
- The present disclosure describes examples that may help to address some or all of the above drawbacks of existing technologies.
-
FIG. 1 shows a block diagram of an example hardware structure of a computing system 100 that is suitable for implementing embodiments of the systems and methods of the present disclosure, described herein. Examples of embodiments of the systems and methods of the present disclosure may be implemented in other computing systems, which may include components different from those discussed below. The computing system 100 may be used to execute instructions to carry out examples of the methods described in the present disclosure. The computing system 100 may also be used to train the machine learning models of the hand state prediction system 200, or the hand state prediction system 200 may be trained by another computing system. - Although FIG. 1 shows a single instance of each component, there may be multiple instances of each component in the computing system 100. - The
computing system 100 includes at least oneprocessor 102, such as a central processing unit, a microprocessor, a digital signal processor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a dedicated logic circuitry, a dedicated artificial intelligence processor unit, a graphics processing unit (GPU), a tensor processing unit (TPU), a neural processing unit (NPU), a hardware accelerator, or combinations thereof. - The
computing system 100 may include an input/output (I/O)interface 104, which may enable interfacing with aninput device 106 and/or anoptional output device 114. In the example shown, the input device 106 (e.g., a keyboard, a mouse, a camera, a touchscreen, a stylus and/or a keypad) may also include amotion sensor 108, atouch sensor 110, and anoptional force sensor 112. In the example shown, the optional output device 114 (e.g., a display, a speaker and/or a printer) is shown as optional and external to thecomputing system 100. In other example embodiments, there may not be anyinput device 106 andoutput device 114, in which case the I/O interface 104 may not be needed. - The I/
O interface 104 may buffer the data generated by theinput units 120 and provide the data to theprocessor 102 to be processed in real-time or near real-time (e.g., within 10 ms, or within 100 ms). The I/O interface 104 may perform preprocessing operations on the input data, for example normalization, filtering, denoising, etc., prior to providing the data to theprocessor 102. - The I/
O interface 104 may also translate control signals from theprocessor 102 into output signals suitable to eachrespective output device 114. Adisplay 116 may receive signals to provide a visual output to a user. In some examples, thedisplay 116 may be a touch-sensitive display (also referred to as a touchscreen) in which thetouch sensor 110 is integrated. A touch-sensitive display may both provide visual output and receive touch input. - The
computing system 100 may include anoptional communications interface 120 for wired or wireless communication with other computing systems (e.g., other computing systems in a network) or devices. Thecommunications interface 120 may include wired links (e.g., Ethernet cable) and/or wireless links (e.g., one or more antennas) for intra-network and/or inter-network communications. - The
computing system 100 may include one or more memories 122 (collectively referred to as “memory 122”), which may include a volatile or non-volatile memory (e.g., a flash memory, a random access memory (RAM), and/or a read-only memory (ROM)). Thenon-transitory memory 122 may store instructions for execution by theprocessor 102, such as to carry out example embodiments of methods described in the present disclosure. For example, thememory 122 may store instructions for implementing any of the systems and methods disclosed herein. Thememory 122 may include other software instructions, such as for implementing an operating system (OS) and other applications/functions. - The
memory 122 may also storeother data 124, information, rules, policies, and machine-executable instructions described herein, including amotion data 230 captured by themotion sensor 108,contact data 250 captured by thetouch sensor 110 or theforce sensor 112 or data representative of a user's hand motion captured by an input device on another computing system and communicated to thecomputing system 100. - In some examples, the
computing system 100 may also include one or more electronic storage units (not shown), such as a solid state drive, a hard disk drive, a magnetic disk drive and/or an optical disk drive. In some examples, data and/or instructions may be provided by an external memory (e.g., an external drive in wired or wireless communication with the computing system 100) or may be provided by a transitory or non-transitory computer-readable medium. Examples of non-transitory computer readable media include a RAM, a ROM, an erasable programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), a flash memory, a CD-ROM, or other portable memory storage. The storage units and/or external memory may be used in conjunction withmemory 116 to implement data storage, retrieval, and caching functions of thecomputing system 100. The components of thecomputing system 100 may communicate with each other via a bus, for example. - Although the
computing system 100 is illustrated as a single block, thecomputing system 100 may be a single physical machine or device (e.g., implemented as a single computing device, such as a single workstation, single end user device, single server, etc.). The computing system may be a mobile communications device (e.g. a smartphone), a laptop computer, a tablet, a desktop computer, a wearable device, a vehicle driver assistance system, an assistive technology device, among others. In some embodiments, thecomputing system 100 may comprise a plurality of physical machines or devices (e.g., implemented as a cluster of machines, server, or devices). In some embodiments, thecomputing system 100 may be a virtualized computing system (e.g., a virtual machine, a virtual server) emulated on a cluster of physical machines or by a cloud computing system. -
FIG. 2A shows a block diagram of an example handstate prediction system 200 of the present disclosure. The handstate prediction system 200 may be a software that is implemented in thecomputing system 100 ofFIG. 1 , in which theprocessor 102 is configured to execute instructions 200-I of the handstate prediction system 200 stored in thememory 122. The handstate prediction system 200 includes amotion sensing device 220, acontact surface 240 and a multimodalhand state network 260. - The hand
state prediction system 200 receives an input of ahand action 210 and outputs amultimodal hand state 270 that may be transformed into acommand action 290. In some embodiments, for example, thehand action 210 may be representative of a gesture or a gesture sequence. For example, gestures may include a left-to-right gesture, a right-to-left gesture, an up-to-down gesture, a down-to-up gesture and a circle or curved “rotation” gesture, among others. In other embodiments, for example, thehand action 210 may be representative of a gripping posture. For example, gripping postures may include postures for holding a pen in a user's hand, for example postures with a correct grip, a close grip, a fold grip, a tuck grip, a squeeze grip, a hook grip, a wrap grip, a mount grip or a tripod grip, among others. Example gripping postures are described in: Bi, Hongliang, Jian Zhang, and Yanjiao Chen, “SmartGe: identifying pen-holding gesture with smartwatch,” IEEE Access 8 (2020): 28820-28830, the entirety of which is hereby incorporated by reference. In some examples, thehand action 210 may be captured by amotion sensing device 220 that is configured to sense motion of a user'shand 202, for example, a digital pen or a stylus equipped with amotion sensor 108, to generatemotion data 230. In some examples, themotion data 230 may be sampled over a predetermined period of time or themotion data 230 may be continuously sampled. In some examples, thehand action 210 may also be captured by acontact surface 240, for example, a 2D touch sensitive surface or a 3D pressure sensor pad, to generatecontact data 250. In some examples, thecontact data 250 may be sampled over a predetermined period of time or thecontact data 250 may be continuously sampled. - In some examples the hand
state prediction system 200 may generate amultimodal hand state 270. In some examples, themultimodal hand state 270 may be a classification prediction corresponding to one or more classes of hand actions, for example, a gesture or a gripping posture classified from a set of gesture classes or a set of gripping posture classes. In some examples, amultimodal hand state 270 may be generated based on decision criteria for classification, for example, using one hot encoding or comparing a maximum confidence probability to a pre-determined threshold. In some examples, themultimodal hand state 270 may indicate whether a gripping posture is correct or incorrect, based on a set of gripping posture classes or based on a 3D skeletal model of a user's hand posture. In other examples, themultimodal hand state 270 may be a real-time 3D skeletal representation of a user's hand in 3D space, for example, a 3D skeleton model may map, in real-time, coordinates corresponding to one or more modeled skeletal features to a shape, position or posture of a user's hand. Optionally, themultimodal hand state 270 may be transformed by aninterpreter 280 into acommand action 290 based on a predefined set of commands. A computing system or computer application running on a computing system that is capable of executing thepredefined command action 290 may then be able to execute thecommand action 290. In an example embodiment, a user may perform ahand action 210 such as a right-to-left motion gesture while interacting with an application on thecomputing system 100 such as an e-reader, which may then be received asmotion data 230 andcontact data 250 by thecomputing system 100 implementing the handstate prediction system 200. The handstate prediction system 200 may process themotion data 230 and thecontact data 250 to output amultimodal hand state 270 that captures the user's intent to “turn the page”. Thecomputing system 100 may then be able to map themultimodal hand state 270 to acommand action 290 from a predefined set of command actions that the user wishes to turn the page, and may execute thecommand action 290. In another example embodiment, a user may perform ahand action 210 such as a circle motion gesture while interacting with a drawing application on thecomputing system 100, which may then be received asmotion data 230 andcontact data 250 by thecomputing system 100 implementing the handstate prediction system 200. The handstate prediction system 200 may process themotion data 230 and thecontact data 250 to output amultimodal hand state 270 that captures the user's intent to switch modes from drawing mode to select mode, and “select an object” in the drawing canvas. Thecomputing system 100 may then be able to map themultimodal hand state 270 to acommand action 290 from a predefined set of command actions that the user wishes to switch modes of operation in the drawing application and select the object, and may execute thecommand action 290. -
FIG. 2B illustrates an example embodiment of the handstate prediction system 200, where inputs are obtained from amotion sensing device 220 and acontact surface 240 that are configured to sense motion of a user'shand 202, in accordance with examples of the present disclosure. In some embodiments, for example, themotion sensing device 220 may be an object operatively coupled to the user, for example, themotion sensing device 220 may be held in a user's hand, or themotion sensing device 220 may be coupled to an arm or wrist of a user, among others. In some embodiments, for example, themotion sensing device 220 may be a digital pen, for example, having a rigid body, the rigid body being an elongated shaft with a first end and a second end. In some examples, themotion sensing device 220 that is configured to sense motion of a user'shand 202 wherein the first end of the elongated shaft is proximal to the user's fingers and the second end of the elongated shaft is distal to the user's fingers and wherein in response to ahand action 210, the second end may experience a greater degree of translational and rotational motion than the first end. In other embodiments, for example, themotion sensing device 220 may be another object, for example, a wearable device. In some examples, the user'shand 202 is also coupled to thecontact surface 240, for example, with a portion of the user's palm or wrist resting on thecontact surface 240 to generate acontact area 242. In some examples, thecontact area 242 may represent a pivot point for a user's palm or wrist while the user performs thehand action 210. In other examples, thecontact area 242 may represent a drag motion of the palm of a user'shand 202 while the user performs thehand action 210. In some examples, more than onecontact area 242 may be generated, for example, if more than one portion of a user's hand 202 (e.g. a finger) interacts with thecontact surface 240 during thehand action 210. - In some examples, the
motion sensing device 220 includes amotion sensor 108, for example an inertial measurement unit (IMU) to detect the movement of themotion sensing device 220 in response to a user'shand motion 210. In some examples, themotion sensor 108 may be a 3 degree-of-freedom (3DoF) IMU, a 6 degree-of-freedom (6DoF) IMU or a 9 degree-of-freedom (9DoF) IMU, where the IMU may comprise an accelerometer that measures translational acceleration in 3-axes, a gyroscope that measures rotational velocity or acceleration in another 3-axes or optionally a magnetometer that measures a magnetic field strength in 3-axes. In some examples, themotion data 230 generated by themotion sensing device 220 during ahand action 210 may be represented by 3 channels of time-series translational acceleration measurements (e.g. force or acceleration), 3 channels of time-series rotational velocity measurements (e.g. angular rate) and optionally 3 channels of time-series magnetic field measurements (e.g. orientation), corresponding to movement of themotion sensing device 220 in response to thehand action 210. In some embodiments, for example, themotion device 220 may sample themotion data 230 based on a start and an end of thehand action 210, and in some examples, the sampled motion data may be stored asmotion data 230. - In some embodiments, for example, the
contact surface 240 may include atouch sensor 110, for example a capacitive touch sensitive surface, to capture 2D positional information corresponding to thecontact area 242 of a user'shand 202 on thecontact surface 240 in response to thehand action 210. A capacitive touch screen draws small electrical charges to a point of contact by a user, and functions as a capacitor in the region of contact. In some examples, in response to a user's hand placed in contact with the capacitive touch sensitive surface, a change in the capacitance and electrostatic field in the capacitive panel of the touch sensitive surface provides location information corresponding to thecontact area 242. In some examples, thecontact data 250 generated by thecontact surface 240 during ahand action 210 may be represented by a sequence of 2D contours defining thecontact area 242. In some embodiments, for example, thecontact surface 240 may sample thecontact data 250 based on a start and an end of thehand action 210, and in some examples, the sampled contact data may be stored ascontact data 250. - In other embodiments, for example, the
contact surface 240 may include one ormore force sensors 112, where the one ormore force sensors 112 may be arranged in a 2D array, for example, as a pressure pad, to measure a force distribution corresponding to thecontact area 242 of a user'shand 202 on thecontact surface 240 in response to thehand action 210. In some examples, thecontact data 250 generated by thecontact surface 240 during ahand action 210 may be represented by a sequence of force measurements distributed across thecontact surface 240 and defined by thecontact area 242. In some examples, the value of the force measurements may be proportional to the magnitude of the applied force by the user'shand 202 at each point in the pressure array of thecontact surface 240. In this regard, thecontact data 250 generated by one ormore force sensors 112 may be considered to be three-dimensional (3D), including both 2D positional information and force measurements defining thecontact area 242. - In some examples, a benefit of generating
contact data 250 with a capacitive touch sensitive surface is that touch sensitive surfaces require lower power and are readily embedded into many surfaces on commercial devices, for example, tablets, laptops, smartphones or dual-screen devices, compared to a pressure pad that may require greater power requirements. In some examples, a benefit of generatingcontact data 250 with a pressure pad includes the collection of richer data, including information corresponding to the applied force of a user'shand 202 along with positional information, compared to a 2D touch sensitive surface that captures only 2D positional information. - Returning to
FIG. 2A , themotion data 230 and thecontact data 250 are input to a multimodalhand state network 260 to generate amultimodal hand state 270. To further describe the multimodalhand state network 260,FIG. 3 may be referenced. -
FIG. 3 is a block diagram illustrating a multimodalhand state network 260, in accordance with examples of the present disclosure. In some examples, amotion data pre-processor 310 receives themotion data 230 and generatespre-processed motion data 312. In some examples, themotion data pre-processor 310 may filter the motion data by mean zeroing the columns and forcing a unit variance or applying dynamic time warping (DTW) to time synchronize the data, or other pre-processing operations may be performed, depending on the system input requirements, or depending on the application. In some examples, themotion data 230 may be pre-processed at themotion sensing device 220 or themotion data 230 may be transmitted by thecomputing system 100 to be pre-processed by theprocessor 102 of thecomputing system 100. In some examples, acontact data pre-processor 330 receives thecontact data 250 and generatespre-processed contact data 332. In some examples, thecontact data pre-processor 330 may convert 2D or3D contact data 250 into motion history images (MHI). In some examples, other pre-processing operations may be performed, depending on the system input requirements, or depending on the application. - In some examples, the
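- As an illustrative sketch, the column-wise mean-zeroing and unit-variance filtering of the motion data 230 might be implemented as follows; the dynamic time warping step is omitted, and the window shape is an assumption:

```python
import numpy as np

def normalize_imu_window(motion_window: np.ndarray) -> np.ndarray:
    """Mean-zero each IMU channel (column) of a (T, C) window and force unit
    variance, e.g. C = 6 for a 6-DoF IMU (illustrative pre-processing only)."""
    mean = motion_window.mean(axis=0, keepdims=True)
    std = motion_window.std(axis=0, keepdims=True)
    return (motion_window - mean) / np.where(std > 0.0, std, 1.0)

window = np.random.randn(300, 6)   # e.g. 3 s of 6-DoF IMU samples at 100 Hz
pre_processed = normalize_imu_window(window)
```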
pre-processed motion data 312 is input to a trainedmotion classifier 320 to generate afirst output 322 representing a motion state of the user's hand. In some examples, themotion classifier 320 may be a neural network, for example, a RNN, or a DNN or themotion classifier 320 may be another machine learning model. In some examples, thefirst output 322 may be a first classification probability (e.g. a prediction confidence) corresponding to each class in a set of one or more classes, where classes may represent various hand gestures or gripping postures. In some examples, thepre-processed contact data 332 is input to a trainedcontact classifier 340 to generate asecond output 342 representing a contact state of the user's hand. In some examples, thecontact classifier 340 may be a neural network, for example, a CNN, a RNN or a DNN or thecontact classifier 340 may be another machine learning model. In some examples, thesecond output 342 may be a second classification probability (e.g. a prediction confidence) corresponding to each class in a set of one or more classes, where classes may represent various hand gestures or gripping postures, among others. In some examples, themotion classifier 320 and thecontact classifier 340 may be trained, for example, using supervised learning using labeled training datasets including pre-processed motion data and pre-processed contact data obtained from themotion sensing device 220 and thecontact surface 240 in response to hand motion executed by various users, using backpropagation to minimize a respective classification loss function, for example a motion classification loss function or a contact classification loss function. - In some examples, the
- In some examples, the first output 322 and the second output 342 may be fused to generate a fused output 350. In some examples, fusing the first output 322 and the second output 342 may comprise concatenating the first output 322 and the second output 342, or other methods to fuse the first output 322 and the second output 342 may be used. In some examples, the fused output 350 is input to a multimodal classifier 360 to generate a multimodal hand state 270. In some examples, the multimodal hand state 270 may be a multimodal classification probability (e.g. a prediction confidence) corresponding to each class in a set of one or more classes representing various hand actions 210, for example, gestures or gripping postures. In some examples, the multimodal classifier 360 may be a neural network, for example a DNN, or the multimodal classifier 360 may be another machine learning model. In some examples, the multimodal classifier 360 may be trained, for example, using backpropagation to minimize a multimodal classification loss function.
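The fusion step can be illustrated, under the same assumptions as the previous sketch, by concatenating the two per-class probability vectors and feeding the result to a third classifier standing in for the multimodal classifier 360; the synthetic probability vectors and model choice below are placeholders only.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
n, n_classes = 200, 4

# Placeholder class-probability vectors standing in for the first output 322
# (motion modality) and the second output 342 (contact modality).
first_output = rng.dirichlet(np.ones(n_classes), size=n)
second_output = rng.dirichlet(np.ones(n_classes), size=n)
y = rng.integers(0, n_classes, size=n)  # hand-action labels

# Fused output 350: per-sample concatenation of the two modality outputs.
fused_output = np.concatenate([first_output, second_output], axis=1)

# Stand-in for the trained multimodal classifier 360.
multimodal_classifier = MLPClassifier(hidden_layer_sizes=(16,), max_iter=300).fit(fused_output, y)

# Multimodal hand state 270: a probability per hand-action class.
multimodal_hand_state = multimodal_classifier.predict_proba(fused_output)
```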
- Returning to FIG. 2, the multimodal hand state 270 output by the multimodal hand state network 260 may optionally be input to an interpreter 280, which is configured to transform the multimodal hand state 270 into a command action 290 based on a predefined set of commands. The predefined set of commands may be stored as data 124 in the memory 122 of the computing system 100. A command action 290 may be an action taken by a computer or computer application, such as an e-reader or a drawing application, in response to hand motion predictions representing a user intent. For example, a command action 290 associated with a “right-to-left” gesture would cause a computer device or computer application such as an e-reader to turn the page or, in another example, perform a mode switching operation.
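A minimal sketch of such an interpreter, assuming a hypothetical class list and command set (the gesture names, commands and confidence threshold below are illustrative, not part of the disclosure):

```python
import numpy as np

# Hypothetical predefined command set mapping hand-action classes to command actions 290.
COMMANDS = {
    "swipe_right_to_left": "turn_page_forward",
    "swipe_left_to_right": "turn_page_backward",
    "pinch": "mode_switch",
}
CLASS_NAMES = list(COMMANDS.keys())

def interpret(multimodal_hand_state: np.ndarray, threshold: float = 0.6):
    """Interpreter 280 sketch: map the highest-confidence class to a command
    action, or return None when no class is confident enough."""
    best = int(np.argmax(multimodal_hand_state))
    if multimodal_hand_state[best] < threshold:
        return None
    return COMMANDS[CLASS_NAMES[best]]

print(interpret(np.array([0.7, 0.2, 0.1])))  # -> "turn_page_forward"
```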
- FIG. 4 is a flowchart illustrating an example method 400 for determining a multimodal hand state 270, in accordance with examples of the present disclosure. The method 400 may be performed by the computing system 100. The method 400 represents operations performed by the multimodal hand state network 260 depicted in FIG. 3. For example, the processor 102 may execute computer readable instructions 200-I (which may be stored in the memory 122) to cause the computing system 100 to perform the method 400. -
Method 400 begins at step 402, in which motion data 230 is obtained from a motion sensing device 220 configured to sense the motion of a user's hand in response to performing a hand action 210. The motion data 230 may be representative of movement of a user's hand captured by a motion sensor 108 of the computing system 100, corresponding to a motion sensing device 220. - The
method 400 then proceeds to step 404. At step 404, contact data 250 is obtained from a contact surface 240 configured to sense the contact of a user's hand in response to performing a hand action 210. The contact data 250 may be representative of movement of a user's hand captured by a touch sensor 110 or a force sensor 112 of the computing system 100, corresponding to a contact surface 240. - The
method 400 then proceeds to step 406. At step 406, a multimodal hand state 270 is generated based on a fusing of the motion data 230 and the contact data 250. In some examples, the multimodal hand state 270 may be a multimodal classification probability (e.g. a prediction confidence) corresponding to each class in a set of one or more classes, for example, a gesture or a gripping posture classified from a set of gesture classes or a set of gripping posture classes. In some examples, a multimodal hand state 270 may be generated based on decision criteria for classification, for example, using one-hot encoding or comparing a maximum confidence probability to a pre-determined threshold. In some examples, the multimodal hand state 270 may indicate whether a gripping posture is correct or incorrect, based on a set of gripping posture classes or based on a 3D skeletal model of a user's hand posture. In other examples, the multimodal hand state 270 may be a real-time 3D skeletal representation of a user's hand in 3D space; for example, a 3D skeleton model may map, in real time, coordinates corresponding to one or more modeled skeletal features to a shape, position or posture of a user's hand.
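A brief sketch of the decision criteria mentioned above (one-hot encoding combined with a maximum-confidence threshold); the threshold value and function name are assumptions for illustration only.

```python
import numpy as np

def decide(class_probs: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Turn a multimodal classification probability vector into a decision:
    a one-hot vector when the maximum confidence clears the threshold,
    otherwise an all-zero vector meaning no hand action is recognized."""
    one_hot = np.zeros_like(class_probs)
    best = int(np.argmax(class_probs))
    if class_probs[best] >= threshold:
        one_hot[best] = 1.0
    return one_hot

print(decide(np.array([0.05, 0.10, 0.80, 0.05])))  # -> [0. 0. 1. 0.]
```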
- In some examples, prior to fusing the motion data 230 and the contact data 250, the motion data 230 may be processed to generate a first output 322 representing a motion state of a user's hand. In some examples, motion data 230 may be pre-processed; for example, the motion data 230 may be filtered by mean zeroing the columns and forcing a unit variance, or by applying dynamic time warping (DTW) to time synchronize the data. In some examples, other pre-processing operations may be performed, depending on the system input requirements, or depending on the application. In some examples, the pre-processed motion data 312 may be classified to generate the first output 322, where the first output 322 may be a first classification probability (e.g. a prediction confidence) corresponding to each class in a set of one or more classes, where classes may represent various hand gestures or gripping postures. - In some examples, prior to fusing the
motion data 230 and the contact data 250, the contact data 250 may be processed to generate a second output 342 representing a contact state of a user's hand. In some examples, the contact data 250 may be pre-processed; for example, the 2D or 3D contact data 250 may be converted into motion history images. In some examples, other pre-processing operations may be performed, depending on the system input requirements, or depending on the application. In some examples, the pre-processed contact data 332 may be classified to generate the second output 342, where the second output 342 may be a second classification probability (e.g. a prediction confidence) corresponding to each class in a set of one or more classes, where classes may represent various hand gestures or gripping postures. - In some examples, fusing the
motion data 230 and the contact data 250 may comprise fusing the first output 322 and the second output 342 to generate a fused output 350. In some examples, fusing the first output 322 and the second output 342 may comprise concatenating the first output 322 and the second output 342, or other methods to fuse the first output 322 and the second output 342 may be used. Using inputs from both the inertial motion modality and the contact force modality, the fused output 350 may be a joint representation of both modalities. - In some examples, step 406 may be described as performing a fusion of multimodal features. Feature fusion may be described as a method of integrating features extracted from different data sources to enhance the discriminative features produced by the feature extractors. In the case of multimodal feature fusion, fusing representations from different modalities (for example, inertial motion and contact force) into a single representation enables the machine learning model to learn a joint representation of the modalities. In some examples, a benefit of using a joint representation of the modalities may be that additional information may be extracted from the contact force modality (e.g. contact data 250) to help capture important aspects of a
hand action 210 that are not present in the motion data 230 alone. - Optionally, at
step 408, the multimodal hand state 270 may be transformed, for example by an interpreter 280, into a command action 290, based on a predefined set of commands. The predefined set of commands may be stored as data 124 in the memory 122 of the computing system 100. A command action 290 may be an action taken by a computer or computer application, such as an e-reader or a drawing application, in response to hand action predictions representing a user intent. For example, a command action 290 associated with a “right-to-left” swipe gesture would cause a computer device or computer application such as an e-reader to turn the page or, in another example, perform a mode switching operation. - According to embodiments of the present disclosure, a
motion sensing device 220 for capturing hand motion 210 is provided. In some examples, the motion sensing device 220 includes a motion sensor 108 and is configured to interact with a computing system 100 and a user to instruct command actions 290. In some examples, the motion sensing device 220 may be held in a user's hand 202 while the user is simultaneously interacting with a contact surface 240 to instruct command actions 290. FIGS. 5 and 6 illustrate an example motion sensing device 220 according to example embodiments. In some examples, the motion sensing device 220 may take the form of a digital pen or a stylus, having a rigid body 510 that extends as a shaft along an elongate axis from a first axial end 520 to a second axial end 530, or the motion sensing device 220 may be another device. In example embodiments, a motion sensing device 220 may have a body 510 that is configured to allow a user to grip the digital pen, and the body 510 may be cylindrical along its length. In some examples, the motion sensing device 220 may have a tapered tip 540 provided at the first axial end 520 of the body 510. In some examples, the tip 540 may be used to actuate user-interface elements on a touchscreen display. In other examples, the motion sensing device 220 may incorporate a writing pen. For example, the motion sensing device 220 may have an ink-dispensing writing tip at the tip 540. In some examples, a motion sensor 108 may be coupled to the motion sensing device 220; for example, the motion sensor 108 may be coupled to the second axial end 530 of the motion sensing device 220, or the motion sensor 108 may be coupled to another location on the motion sensing device 220. In some examples, positioning the motion sensor 108 at the second axial end 530 may have the advantage of capturing greater or more exaggerated translational or rotational movement while a user interacts with the motion sensing device 220. In some examples, motion data 230 captured by the motion sensing device 220 may be transmitted to the computing system 100 to be pre-processed, or motion data 230 may be pre-processed at the motion sensing device 220 and pre-processed motion data 312 may be transmitted to the computing system 100. -
FIG. 5 is a perspective view of an example embodiment of a motion sensing device 220 in the form of a digital pen, configured for delimited sampling, in accordance with examples of the present disclosure. In some examples, a delimiter button 550 may be located on the motion sensing device 220; for example, the delimiter button 550 may be located on the body 510 of the motion sensing device 220. In some examples, the delimiter button 550 may be located near the first axial end 520 of the body 510, or it may be located elsewhere on the motion sensing device 220. In some examples, the delimiter button 550 is configured to interact with the computing system 100 to determine that a “button interaction” has occurred, in order to initiate or end data sampling. In some embodiments, for example, a user may instruct the motion sensing device 220 to begin sampling motion data 230 by depressing the delimiter button 550 in a single “button click”, and in response to receiving the instruction to begin sampling motion data 230, the motion sensing device 220 may sample the motion data 230 for a pre-determined period of time, for example, 3 seconds. In other embodiments, for example, the user may instruct the motion sensing device 220 to begin sampling motion data 230 by depressing the delimiter button 550 in a first single “button click”, and in response to receiving the instruction to begin sampling motion data 230, the motion sensing device 220 may sample the motion data 230 until an instruction is received to end the sampling period, for example, by the user depressing the delimiter button 550 in a second “button click”. In other embodiments, for example, the user may instruct the motion sensing device 220 to begin sampling motion data 230 by depressing and holding the delimiter button 550 in a depressed condition, the motion sensing device 220 continuing to sample the motion data 230 until the delimiter button 550 is released by the user, signaling an end to the data sampling period. In other embodiments, for example, the multimodal hand state network 260 may be trained to recognize the end of a hand gesture, and the user may instruct the motion sensing device 220 to begin sampling motion data 230 by depressing the delimiter button 550 in a single “button click”, and in response to receiving the instruction to begin sampling motion data 230, the motion sensing device 220 may sample the motion data 230 until the multimodal hand state network 260 recognizes that the hand gesture is complete and instructs an end to the data sampling period. In some examples, the motion sensing device 220 is synchronized with the contact surface 240, and instructions received at the motion sensing device 220 to initiate or end data sampling are also received at the contact surface 240 to initiate or end data sampling at the contact surface 240. In some embodiments, for example, the sampled motion data may be stored as motion data 230.
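A minimal sketch of delimited sampling under assumed names: the class below buffers motion samples after a first button click and stops either after a fixed window or on a second click, mirroring two of the modes described above. The class, method names and 3-second default are illustrative assumptions.

```python
import time
from collections import deque

class DelimitedSampler:
    """Sketch of delimiter-button sampling for a digital pen."""

    def __init__(self, window_s: float = 3.0):
        self.window_s = window_s     # fixed-window duration, e.g. 3 seconds
        self.buffer = deque()
        self.sampling = False
        self.start_time = 0.0

    def on_button_click(self):
        """First click starts sampling; a second click ends it and returns
        the buffered samples (click-to-start / click-to-stop mode)."""
        if not self.sampling:
            self.sampling = True
            self.start_time = time.monotonic()
            self.buffer.clear()
            return None
        self.sampling = False
        return list(self.buffer)     # sampled motion data -> motion data 230

    def on_motion_sample(self, sample):
        """Fixed-window mode: stop automatically once window_s has elapsed."""
        if not self.sampling:
            return None
        if time.monotonic() - self.start_time >= self.window_s:
            self.sampling = False
            return list(self.buffer)
        self.buffer.append(sample)
        return None
```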
- FIG. 6 is a perspective view of an example embodiment of a motion sensing device 220 in the form of a digital pen, configured for continuous sampling, in accordance with examples of the present disclosure. In some examples, the delimiter button 550 described with respect to FIG. 5 may be absent in a motion sensing device 220 configured for continuous sampling. In some examples, when the hand motion prediction system 200 is configured for continuous sampling, no explicit instruction may be provided to the motion sensing device 220 or the contact surface 240 by the user signaling the start or the end of sampling, and the motion sensing device 220 and the contact surface 240 may continuously sample motion data at all times, generating a stream of sampled motion data and sampled contact data. In some embodiments, for example, a control operator within the computing system 100 may monitor the sampled motion data 230 and the sampled contact data 250 to detect the start of a hand action 210. For example, a control operator within the computing system 100 may compare the motion data and contact data values to a threshold level to signal the start or the end of sampling, for example, a threshold level of translational or rotational motion by the motion sensor 108 that indicates that the motion sensing device 220 is in motion, or a non-zero measure on the touch sensor 110 or the force sensor 112 that indicates that a user's hand 202 is in contact with the contact surface 240. In some embodiments, for example, if a motion sensing device 220 is deemed to be in motion, a control operator may initiate data sampling for the motion sensing device 220 and the contact surface 240 for a pre-determined period of time, for example, 3 seconds. In some examples, the control operator may repeat the initiation of data sampling in rolling windows of pre-determined length (e.g. 3 seconds) until an indication has been received that the hand action 210 has ended. In other embodiments, for example, the multimodal hand state network 260 may be trained to recognize the beginning or the end of a hand action 210, for example, a gesture, and may segment discrete portions of the continuous motion data 230 and the continuous contact data 250 for processing and classification. In some embodiments, for example, the continuously sampled motion data may be stored as motion data 230.
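For illustration only, a sketch of such a control operator under assumed thresholds and sample rates: it scans continuously sampled motion magnitudes and contact measures, opens a segment when either signal crosses its threshold, and closes it after a rolling window of fixed length or when both signals return to rest. The function name, threshold values and window size are assumptions.

```python
def segment_hand_actions(motion_mag, contact_mag, motion_thresh=0.2, window=300):
    """Return (start, end) index pairs of candidate hand actions found in
    continuously sampled motion and contact streams (sequences of numbers)."""
    segments, start, last = [], None, -1
    for i, (m, c) in enumerate(zip(motion_mag, contact_mag)):
        last = i
        active = m > motion_thresh or c > 0            # device moving or hand touching
        if start is None and active:
            start = i                                   # start of a hand action 210
        elif start is not None and not active:
            segments.append((start, i))                 # end of the hand action
            start = None
        elif start is not None and i - start + 1 >= window:
            segments.append((start, i))                 # rolling window of fixed length
            start = i + 1                               # continue with the next window
    if start is not None and start <= last:
        segments.append((start, last))                  # flush a trailing segment
    return segments

# Example: 3-second windows at 100 Hz would correspond to window=300 samples.
```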
- Although some examples have been described in the context of a handheld electronic device (e.g., a tablet or a smartphone), it should be understood that examples of the present disclosure may be implemented using other electronic devices, such as electronic wearable devices including smart watches or smart gloves, among others. For example, motion sensors 108 can be mounted on the surface of a wearable device, such as a smart watch, to capture wrist movement. In other examples, a touch sensor 110 or a force sensor 112 may be integrated into a vehicle, for example, on the steering wheel or console screen, for human-computer interaction during driving. - In another example embodiment, the methods, systems and devices described herein may be used to predict a
multimodal hand state 270 by modeling a user's hand posture in 3D space, rather than classifying a gesture or a gripping posture. In some examples, a modeling method to model a user's hand posture in 3D space may include a 3D skeletal model, or another modeling method may be used. In some examples, themultimodal hand state 270 may be a real-time 3D skeletal representation of a user's hand in 3D space, for example, a 3D skeleton model may map, in real-time, coordinates corresponding to one or more skeletal features to a shape, position or posture of a user's hand. In some examples, to model a user's hand posture in 3D space,motion data 230 andcontact data 250 may be optionally augmented by additional inputs, for example, a peripheral contact surface that is operatively coupled to the exterior of themotion sensing device 220 that is configured to sense peripheral contact of the user's hand on themotion sensing device 220 while the user is holding themotion sensing device 220. In some examples, the peripheral contact surface may be a touch sensitive surface or a pressure array coupled to the exterior of themotion sensing device 220. In some examples, the peripheral contact surface may capture peripheral contact data corresponding to surface area or applied force caused by the user's fingers or hand contacting the exterior of themotion sensing device 220 while executing a gripping posture or a gesture. In some examples, the peripheral contact data may be processed to generate a peripheral contact output representing a peripheral contact state based on the peripheral contact data. In some examples, the peripheral contact output may be fused with thefirst output 322 and thesecond output 342 to generate a second fused output, and where the second fused output may be processed to generate themultimodal hand state 270. In some examples, an additional input may be a camera for capturing images or point data related to the position of the user's hand in 3D space. In some examples, modeling the user's hand motion in 3D space takes place in real-time, for example, withmotion data 230,contact data 250 and optionally, peripheral contact data or camera data being continuously sampled, and the handstate prediction system 200 continuously re-processing and updating the generatedmultimodal hand state 270 as new input data is received. In some examples, a modeled hand position may be output to an application on an electronic device (e.g., a software application executed by the computing system 100) to estimate a deviation in a modeled gripping posture from a target gripping posture. For example, if the application on the electronic device is an assistive tool to support children during early age development, obtaining accurate estimates of a modeled gripping posture may prompt or assist children in learning or modifying their grip to more correct postures. - Although the present disclosure describes methods and processes with steps in a certain order, one or more steps of the methods and processes may be omitted or altered as appropriate. One or more steps may take place in an order other than that in which they are described, as appropriate.
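A hedged sketch of the three-way fusion for 3D hand modeling: the peripheral contact output is concatenated with the first output 322 and the second output 342 to form a second fused output, and a regression model stands in for the 3D skeletal mapping. The synthetic data, joint count and regressor choice are assumptions for illustration only.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)
n, n_classes, n_joints = 200, 4, 21

# Placeholders for the first output 322, the second output 342 and a
# peripheral contact output derived from the peripheral contact surface.
first_output = rng.dirichlet(np.ones(n_classes), size=n)
second_output = rng.dirichlet(np.ones(n_classes), size=n)
peripheral_output = rng.normal(size=(n, 8))

# Second fused output: all three modality representations concatenated.
second_fused = np.concatenate([first_output, second_output, peripheral_output], axis=1)

# Hypothetical regressor mapping the fused representation to a 3D skeletal
# hand model: x, y, z coordinates for each modeled joint.
target_joints = rng.normal(size=(n, n_joints * 3))
skeleton_model = MLPRegressor(hidden_layer_sizes=(64,), max_iter=300).fit(second_fused, target_joints)

hand_pose = skeleton_model.predict(second_fused).reshape(n, n_joints, 3)
```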
- Although the present disclosure describes methods and processes with steps in a certain order, one or more steps of the methods and processes may be omitted or altered as appropriate. One or more steps may take place in an order other than that in which they are described, as appropriate. - Although the present disclosure is described, at least in part, in terms of methods, a person of ordinary skill in the art will understand that the present disclosure is also directed to the various components for performing at least some of the aspects and features of the described methods, be it by way of hardware components, software or any combination of the two. Accordingly, the technical solution of the present disclosure may be embodied in the form of a software product. A suitable software product may be stored in a pre-recorded storage device or other similar non-volatile or non-transitory computer readable medium, including DVDs, CD-ROMs, USB flash disk, a removable hard disk, or other storage media, for example. The software product includes instructions tangibly stored thereon that enable an electronic device (e.g., a personal computer, a server, or a network device) to execute examples of the methods disclosed herein.
- The present disclosure may be embodied in other specific forms without departing from the subject matter of the claims. The described example embodiments are to be considered in all respects as being only illustrative and not restrictive. Selected features from one or more of the above-described embodiments may be combined to create alternative embodiments not explicitly described, features suitable for such combinations being understood within the scope of this disclosure.
- All values and sub-ranges within disclosed ranges are also disclosed. Also, although the systems, devices and processes disclosed and shown herein may comprise a specific number of elements/components, the systems, devices and assemblies could be modified to include additional or fewer of such elements/components. For example, although any of the elements/components disclosed may be referenced as being singular, the embodiments disclosed herein could be modified to include a plurality of such elements/components. The subject matter described herein intends to cover and embrace all suitable changes in technology.
Claims (20)
1. A method comprising:
obtaining motion data from a digital pen device that is configured to sense motion of a user's hand;
obtaining contact data from a contact surface that is configured to sense contact of the user's hand, the contact data being representative of the motion of the user's hand and the contact surface including a capacitive touch surface;
obtaining peripheral contact data from a peripheral contact surface operatively coupled to the digital pen device, that is configured to sense peripheral contact of the user's hand on the digital pen device; and
generating a multimodal hand state based on a fusing of the motion data, the contact data and the peripheral contact data.
2. The method of claim 1 , wherein generating the multimodal hand state comprises:
pre-processing the motion data to generate pre-processed motion data; and
classifying the pre-processed motion data using a trained motion classifier to generate a first output, the first output including a probability corresponding to one or more classes.
3. The method of claim 2 , wherein generating the multimodal hand state further comprises:
pre-processing the contact data to generate pre-processed contact data; and
classifying the pre-processed contact data using a trained contact classifier to generate a second output, the second output including a probability corresponding to one or more classes.
4. The method of claim 3 , wherein generating the multimodal hand state further comprises:
concatenating the first output and the second output to generate a fused output.
5. The method of claim 4 , wherein generating the multimodal hand state further comprises:
classifying the fused output using a trained multimodal classifier to generate the multimodal hand state, the multimodal hand state including a probability corresponding to one or more classes.
6. The method of claim 1 , comprising:
prior to obtaining the motion data and contact data:
receiving an instruction to begin sampling the motion data, and when the instruction to begin sampling the motion data is received, sampling the motion data;
receiving an instruction to begin sampling the contact data, and when the instruction to begin sampling the contact data is received, sampling the contact data;
receiving an instruction to end sampling the motion data;
receiving an instruction to end sampling the contact data;
storing the sampled motion data as the motion data; and
storing the sampled contact data as the contact data.
7. The method of claim 1 , comprising:
prior to obtaining the motion data and contact data:
continuously sampling the motion data and the contact data;
determining, based on a threshold corresponding to the continuously sampled motion data and a threshold corresponding to the continuously sampled contact data, when a start of a hand action occurs;
determining when an end of the hand action occurs based on the start of the hand action occurring;
extracting the sampled motion data from the continuously sampled motion data based on the start of the hand action and the end of the hand action;
extracting the sampled contact data from the continuously sampled contact data based on the start of the hand action and the end of the hand action;
storing the sampled motion data as the motion data; and
storing the sampled contact data as the contact data.
8. The method of claim 1 , further comprising:
transforming the multimodal hand state into a command action based on a predefined set of commands.
9. The method of claim 1 , wherein the digital pen device includes an inertial measurement unit (IMU).
10. The method of claim 1 , wherein the contact data captured by the contact surface is 2D contact data.
11. The method of claim 1 , wherein the contact surface includes a pressure sensor pad, the contact data captured by the contact surface being 3D contact data.
12. (canceled)
13. The method of claim 1 , wherein the multimodal hand state is a classification prediction corresponding to one or more classes of hand actions.
14. The method of claim 1 , wherein the multimodal hand state is a real-time 3D skeletal representation of a user's hand in a 3D space.
15. A system comprising:
a digital pen device that is configured to sense motion of a user's hand and output corresponding motion data;
a contact surface that is configured to sense contact of the user's hand and output corresponding contact data, the contact data being representative of the motion of the user's hand and the contact surface including a capacitive touch surface;
a peripheral contact surface coupled to the digital pen device, that is configured to sense peripheral contact of the user's hand on the digital pen device;
one or more memories storing executable instructions; and
one or more processors coupled to the digital pen device, contact surface and one or more memories, the executable instructions configuring the one or more processors to:
generate a multimodal hand state based on a fusing of the motion data, the contact data, and the peripheral contact data.
16. The system of claim 15 , wherein the executable instructions, when executed by the one or more processors, further cause the system to:
pre-process the motion data to generate pre-processed motion data; and
classify the pre-processed motion data using a trained motion classifier to generate a first output, the first output including a probability corresponding to one or more classes.
17. The system of claim 16 , wherein the executable instructions, when executed by the one or more processors, further cause the system to:
pre-process the contact data to generate pre-processed contact data; and
classify the pre-processed contact data using a trained contact classifier to generate a second output, the second output including a probability corresponding to one or more classes.
18. The system of claim 17 , wherein the executable instructions, when executed by the one or more processors, further cause the system to:
concatenate the first output and the second output to generate a fused output; and
classify the fused output using a trained multimodal classifier to generate the multimodal hand state, the multimodal hand state including a probability corresponding to one or more classes.
19. The system of claim 15 , wherein the executable instructions, when executed by the one or more processors, further cause the system to:
prior to obtaining the motion data and contact data:
continuously sample the motion data and the contact data;
determine, based on a threshold corresponding to the continuously sampled motion data and a threshold corresponding to the continuously sampled contact data, when a start of a hand action occurs;
determine when an end of the hand action occurs based on the start of the hand action occurring;
extract the sampled motion data from the continuously sampled motion data based on the start of the hand action and the end of the hand action;
extract the sampled contact data from the continuously sampled contact data based on the start of the hand action and the end of the hand action;
store the sampled motion data as the motion data; and
store the sampled contact data as the contact data.
20. A non-transitory computer-readable medium having machine-executable instructions stored thereon which, when executed by one or more processors of a computing system, cause the computing system to:
obtain motion data from a digital pen device that is configured to sense motion of a user's hand;
obtain contact data from a contact surface that is configured to sense contact of the user's hand, the contact data being representative of the motion of the user's hand and the contact surface including a capacitive touch surface;
obtain peripheral contact data from a peripheral contact surface operatively coupled to the digital pen device, that is configured to sense peripheral contact of the user's hand on the digital pen device; and
generate a multimodal hand state based on a fusing of the motion data, the contact data and the peripheral contact data.
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US17/704,970 US11782522B1 (en) | 2022-03-25 | 2022-03-25 | Methods and systems for multimodal hand state prediction |
| PCT/CN2022/123548 WO2023178984A1 (en) | 2022-03-25 | 2022-09-30 | Methods and systems for multimodal hand state prediction |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US17/704,970 US11782522B1 (en) | 2022-03-25 | 2022-03-25 | Methods and systems for multimodal hand state prediction |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| US20230305644A1 true US20230305644A1 (en) | 2023-09-28 |
| US11782522B1 US11782522B1 (en) | 2023-10-10 |
Family
ID=88095700
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US17/704,970 Active 2042-03-25 US11782522B1 (en) | 2022-03-25 | 2022-03-25 | Methods and systems for multimodal hand state prediction |
Country Status (2)
| Country | Link |
|---|---|
| US (1) | US11782522B1 (en) |
| WO (1) | WO2023178984A1 (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US11205065B2 (en) * | 2019-10-18 | 2021-12-21 | Alpine Electronics of Silicon Valley, Inc. | Gesture detection in embedded applications |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20030156756A1 (en) * | 2002-02-15 | 2003-08-21 | Gokturk Salih Burak | Gesture recognition system using depth perceptive sensors |
| US20130300659A1 (en) * | 2012-05-14 | 2013-11-14 | Jinman Kang | Recognizing Commands with a Depth Sensor |
| US20160011718A1 (en) * | 2013-03-15 | 2016-01-14 | Qualcomm Incorporated | Combined touch input and offset non-touch gesture |
| US20180364813A1 (en) * | 2017-06-16 | 2018-12-20 | Anousheh Sayah | Smart Wand Device |
| US20200184204A1 (en) * | 2015-12-31 | 2020-06-11 | Microsoft Technology Licensing, Llc | Detection of hand gestures using gesture language discrete values |
| CN112286440A (en) * | 2020-11-20 | 2021-01-29 | 北京小米移动软件有限公司 | Touch operation classification, model training method and device, terminal and storage medium |
Family Cites Families (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10437459B2 (en) | 2007-01-07 | 2019-10-08 | Apple Inc. | Multitouch data fusion |
| WO2015181161A1 (en) | 2014-05-28 | 2015-12-03 | Thomson Licensing | Methods and systems for touch input |
| CN106598243B (en) | 2016-12-08 | 2019-07-19 | 广东工业大学 | A multi-modal adaptive cursor control method and system |
| KR102184243B1 (en) | 2018-07-06 | 2020-11-30 | 한국과학기술연구원 | System for controlling interface based on finger gestures using imu sensor |
| WO2020068876A1 (en) | 2018-09-24 | 2020-04-02 | Interlink Electronics, Inc. | Multi-modal touchpad |
| CN112114665B (en) | 2020-08-23 | 2023-04-11 | 西北工业大学 | Hand tracking method based on multi-mode fusion |
| CN112492090A (en) | 2020-11-27 | 2021-03-12 | 南京航空航天大学 | Continuous identity authentication method fusing sliding track and dynamic characteristics on smart phone |
| CN113205074B (en) | 2021-05-29 | 2022-04-26 | 浙江大学 | A gesture recognition method based on multimodal signals of EMG and micro-inertial measurement unit |
| CN113849068B (en) | 2021-09-28 | 2024-03-29 | 中国科学技术大学 | A gesture multimodal information fusion understanding and interaction method and system |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20240160297A1 (en) * | 2022-11-04 | 2024-05-16 | Tata Consultancy Services Limited | Systems and methods for real-time tracking of trajectories using motion sensors |
| US12111982B2 (en) * | 2022-11-04 | 2024-10-08 | Tata Consultancy Services Limited | Systems and methods for real-time tracking of trajectories using motion sensors |
Also Published As
| Publication number | Publication date |
|---|---|
| US11782522B1 (en) | 2023-10-10 |
| WO2023178984A1 (en) | 2023-09-28 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN111766948B (en) | Pose Prediction Using Recurrent Neural Networks | |
| Wang et al. | Controlling object hand-over in human–robot collaboration via natural wearable sensing | |
| CN112789577B (en) | Neuromuscular text input, writing and drawing in augmented reality systems | |
| CN110312471B (en) | Adaptive system for deriving control signals from neuromuscular activity measurements | |
| Sharma et al. | Human computer interaction using hand gesture | |
| EP3625644B1 (en) | Sensor based component activation | |
| US20190102044A1 (en) | Depth-Based Touch Detection | |
| US20130335318A1 (en) | Method and apparatus for doing hand and face gesture recognition using 3d sensors and hardware non-linear classifiers | |
| KR102046706B1 (en) | Techniques of performing neural network-based gesture recognition using wearable device | |
| KR100630806B1 (en) | Command input method using gesture recognition device | |
| WO2018125347A1 (en) | Multi-task machine learning for predicted touch interpretations | |
| KR102046707B1 (en) | Techniques of performing convolutional neural network-based gesture recognition using inertial measurement unit | |
| US20140208274A1 (en) | Controlling a computing-based device using hand gestures | |
| CN108475113B (en) | Method, system, and medium for detecting a user's hand gesture | |
| KR102179999B1 (en) | Hand gesture recognition method using artificial neural network and device thereof | |
| WO2023178984A1 (en) | Methods and systems for multimodal hand state prediction | |
| EP2857951B1 (en) | Apparatus and method of using events for user interface | |
| Fahim et al. | A visual analytic in deep learning approach to eye movement for human-machine interaction based on inertia measurement | |
| CN107390867A (en) | A kind of man-machine interactive system based on Android wrist-watch | |
| JP6623366B1 (en) | Route recognition method, route recognition device, route recognition program, and route recognition program recording medium | |
| Wang et al. | Continuous Hand Gestures Detection and Recognition in Emergency Human-Robot Interaction Based on the Inertial Measurement Unit | |
| Haratiannejadi et al. | Smart glove and hand gesture-based control interface for multi-rotor aerial vehicles | |
| Agarwal et al. | Gestglove: A wearable device with gesture based touchless interaction | |
| Yeom et al. | [POSTER] Haptic Ring Interface Enabling Air-Writing in Virtual Reality Environment | |
| Dhamanskar et al. | Human computer interaction using hand gestures and voice |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| FEPP | Fee payment procedure |
Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
|
| AS | Assignment |
Owner name: HUAWEI TECHNOLOGIES CO., LTD., CHINA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:CODY, ROYA;HUANG, DA-YUAN;CAO, XIANG;AND OTHERS;SIGNING DATES FROM 20220324 TO 20220730;REEL/FRAME:060969/0210 |
|
| STCF | Information on status: patent grant |
Free format text: PATENTED CASE |