WO2009068641A1

WO2009068641A1 - Method of stereoscopic tracking of a texture object

Info

Publication number: WO2009068641A1
Application number: PCT/EP2008/066406
Authority: WO
Inventors: Anthony Remazeilles
Original assignee: Commissariat A L'energie Atomique
Priority date: 2007-11-30
Filing date: 2008-11-28
Publication date: 2009-06-04
Also published as: FR2924560A1

Abstract

The invention relates to a method for stereoscopic tracking of a textured object (16), comprising the following steps: - during an initialization phase: a) bringing the object into the field of vision of two cameras, b) capturing two images and displaying the first of these two images, c) defining this object by entering two opposite corners of a box encompassing the object in this first image, d) searching for this box in the second image, e) extracting points of interest in the first image, f) searching for the corresponding points of interest in the second image, g) eliminating the poor pairings, h) estimating the coordinates of the paired points, i) estimating the three-dimensional position of the box, - during a processing phase: a') acquiring a pair of new images, b') tracking the characteristic points, c') deducing the sought-after position.

Description

METHOD FOR STEREOSCOPIC TRACKING OF A TEXTURED OBJECT

DESCRIPTION

TECHNICAL FIELD

The invention relates to a method for stereoscopic tracking of a textured object, that is to say an object on which characteristic points can be detected.

The field of the invention is in particular that of robotic vision, for example in the context of assistance to persons with disabilities.

STATE OF THE PRIOR ART

There are many approaches to tracking an object. The vast majority assume that the 3D (three-dimensional) model of this object is known, which limits the scope of these approaches. Other approaches, which make it possible to track characteristic points of an object, do not make it possible to recover the 3D pose, that is to say the 3D position and orientation of a camera relative to a reference frame attached to this object.

The majority of publications dealing with the use of a stereoscopic vision system focus on a dense 3D reconstruction of an unknown scene, or on the location and identification of an object in a scene. The problem of tracking an object in a sequence of images is not explicitly addressed. The publications dealing with the tracking of an object in stereoscopic images assume that its model is known, as in document [3].

A prior-art document, referenced [1] at the end of the description, proposes a solution whose statistical approach rests, from a theoretical point of view, on several constraining hypotheses. The object must be flat and must remain parallel to the image plane of the cameras throughout the video sequence. The classes of objects that this approach can handle are therefore restricted.

The object of the invention is to overcome these drawbacks and to allow the tracking of a textured object without prior knowledge of it. The 3D model of the object is not information to be supplied but information that the system estimates automatically, and the algorithm used can work for any object, provided it is sufficiently textured.

STATEMENT OF THE INVENTION

The invention relates to a method for stereoscopic tracking of a textured object using an object grasping system comprising: a computer associated with a display screen and an input device, and a stereoscopic vision set comprising two cameras, wherein this object and/or this stereoscopic set are in motion, characterized in that it comprises the following steps:

- during an initialization phase: a) bringing the object into the field of vision of the two cameras, b) capturing two images and displaying the first of these two images, c) defining this object by entering two opposite corners of a box enclosing it in this first image, d) searching for this box in the second image, e) extracting characteristic points in the first image, f) searching for the corresponding characteristic points in the second image, g) eliminating the mismatches between the corresponding characteristic points belonging to the two images, h) estimating the coordinates of the paired points, i) estimating the 3D position of the box,

- during a processing phase: a') acquiring a pair of new stereoscopic images, b') tracking the characteristic points in the new images, c') deducing the position of the stereoscopic set with respect to the object, d') optionally updating the 3D position of the characteristic points, e') optionally rejecting the bad points, f') optionally enriching the model of the characteristic points by adding new characteristic points.

The strengths of the invention are the genericity of the approach and its ease of use, since the user only has to surround the object with a box to start the tracking.

The robustness of this approach makes it possible to use low-cost cameras, such as commercially available webcams. This characteristic is very advantageous, since the system implementing the method of the invention therefore has a very low cost.

The method of the invention can have many applications and in particular:

- Robotic assistance to people with disabilities, for example in the context of the automatic grasping of objects by a robotic arm equipped with a stereoscopic sensor on its gripper.

- Any application involving the grasping of objects, which may be a potential industrial application: it suffices that the object is sufficiently textured and that it presents the same face to the cameras during the approach phase of the arm. The tracking of an object by a remote camera system is also possible, as are rigid-object tracking applications (surveillance, tracking of license plates, tracking of cars, ...).

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 illustrates an object grasping system implementing the method of the invention.

Figures 2A and 2B illustrate a step of the method of the invention.

FIG. 3 illustrates all the steps of the initialization phase of the method of the invention.

Figures 4 and 5 schematically illustrate two steps of the method of the invention.

Figures 6A and 6B illustrate steps of the initialization phase and the final results thereof.

FIG. 7 illustrates all the steps of the processing phase of the method of the invention.

DETAILED PRESENTATION OF PARTICULAR EMBODIMENTS

A system for grasping an object 16, making it possible to implement the method of the invention, as illustrated in FIG. 1, comprises: a computer 10 associated with a display screen 11 and an input device, which can comprise a keyboard 12 and a mouse 13, arranged on a table 18, and a stereoscopic vision set 15 comprising a stereoscopic pair of calibrated cameras 19, the intrinsic parameters (geometric model) and extrinsic parameters (relative position) of the two cameras being known.

The cameras can, for example, be mounted on the gripper 20 of a robotic arm 17. This arm can be mounted on the wheelchair of a disabled person, or on an independent mobile platform, stereoscopic tracking then being used to guide the arm 17 to an object 16 to be grasped. According to the method of the invention, the user designates the object to grasp on an image returned by one of the two cameras. The arm 17 then grasps it automatically and brings the object back to the user. The method of the invention relates mainly to the tracking of the object while the arm is moving. This tracking is essential, since it provides the information from which the movements of the arm are deduced.

The method of the invention thus makes it possible to carry out the video tracking of an object 16 observed by the stereoscopic vision set 15 while this object and/or this vision set is in motion. The main originality of such an approach is that the 3D model of the object is not necessary. The only initial knowledge of the object is given by the user, who defines a box 27 encompassing this object 16 by two clicks on a video feed from one of the two cameras 19. By image processing, and after calibration of the stereoscopic set, characteristic points of the object are located, matched and triangulated in the two images 25 and 26 shown in Figures 2A and 2B. The 3D position of these points 30 is used to define the position of each camera 19 relative to the object 16 (pose). Once this initialization is done, each acquired pair of stereoscopic images is processed. The characteristic points are located in the two images. Then, a technique known as virtual visual servoing makes it possible to deduce the pose of the camera. We will now analyze each of the steps of the method of the invention more precisely.

The object tracking method of the invention is divided into two phases:

- an initialization phase, and

- a tracking phase, referred to below as the processing phase.

1 Initialization phase

The stereoscopic set captures two images 25 and 26, called the left and right images and illustrated respectively in Figures 2A and 2B. The user only needs to view one of the two images (the right image 26 in the example of FIG. 2). The user designates the object 16 in this image by defining a bounding box 27. Two clicks in the image are necessary: one for the lower left corner and one for the upper right corner, the box being aligned with the axes of the image. The object 16 is assumed to be contained in this box 27.

The robustness of the process makes it possible to be satisfied with an approximate box. It does not need to exactly match the object.

The object 16 should be visible in the right and left images 25 and 26.

FIG. 3 shows the succession of steps constituting the initialization phase, namely:

a) A step (E1) of entry of a box 27 by the user

This is the only operation in which the user intervenes. The user views the image 26 (right view) and makes two clicks to define the upper left corner and the lower right corner of a box 27 encompassing the projection of the object in the image.

b) A step (E2) of finding the corresponding box in the second image

This processing can be performed using a conventional exhaustive correlation method. It provides the position, in the second image, of the zone most similar to the one delimited in the image 26.
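The exhaustive search of step E2 can be sketched as follows. This is a minimal illustration, not the implementation of the invention: it assumes grayscale images stored as lists of rows, and it assumes a normalized cross-correlation score of the form sum(T*I) / sqrt(sum(T^2) * sum(I^2)), which is one common definition. The function names `ncc_score` and `find_box` are hypothetical.

```python
import math

def ncc_score(image, template, x, y):
    """Normalized cross-correlation between `template` and the window of
    `image` whose top-left corner is at column x, row y (assumed score form)."""
    num = sum_tt = sum_ii = 0.0
    for dy, row in enumerate(template):
        for dx, t in enumerate(row):
            i = image[y + dy][x + dx]
            num += t * i
            sum_tt += t * t
            sum_ii += i * i
    denom = math.sqrt(sum_tt * sum_ii)
    return num / denom if denom > 0 else 0.0

def find_box(image, template):
    """Slide the template over every position of the image and return
    (x, y, score) for the position with the highest correlation."""
    th, tw = len(template), len(template[0])
    best = (0, 0, -1.0)
    for y in range(len(image) - th + 1):
        for x in range(len(image[0]) - tw + 1):
            score = ncc_score(image, template, x, y)
            if score > best[2]:
                best = (x, y, score)
    return best
```

Here `template` would be the content of the box 27 cut out of the image 26, and `image` the other image of the pair. In practice the search can be restricted to rows close to those of the box, since the two images are vertically aligned.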

An example of correlation is normalized cross-correlation. We search for the image area I(x, y) that maximizes the following correlation:

where T(x', y') designates the zone that we want to match in the second image.

c) A step (E3) of extracting points in the first image 26

This processing is very conventional in the field of image processing. So-called Harris points (corner-like features) are searched for in the area 27 delimited by the user. These characteristic points 30 are attached to the object 16.
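A minimal sketch of a Harris-type detector restricted to the user box is given below. It is an illustration under stated assumptions rather than the detector of document [5] as published: gradients are central differences, the structure tensor is summed over an unweighted square window, and the constant `k = 0.04`, the window radius and the threshold are illustrative choices. The function names are hypothetical.

```python
def harris_response(img, k=0.04, radius=1):
    """Harris response det(M) - k * trace(M)^2 at every pixel, where M is the
    gradient structure tensor summed over a (2*radius+1)^2 window."""
    h, w = len(img), len(img[0])
    # Central-difference gradients (left at zero on the image border).
    ix = [[0.0] * w for _ in range(h)]
    iy = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            ix[y][x] = (img[y][x + 1] - img[y][x - 1]) / 2.0
            iy[y][x] = (img[y + 1][x] - img[y - 1][x]) / 2.0
    resp = [[0.0] * w for _ in range(h)]
    for y in range(radius, h - radius):
        for x in range(radius, w - radius):
            sxx = syy = sxy = 0.0
            for v in range(y - radius, y + radius + 1):
                for u in range(x - radius, x + radius + 1):
                    sxx += ix[v][u] * ix[v][u]
                    syy += iy[v][u] * iy[v][u]
                    sxy += ix[v][u] * iy[v][u]
            resp[y][x] = sxx * syy - sxy * sxy - k * (sxx + syy) ** 2
    return resp

def harris_points(img, box, threshold, k=0.04):
    """Return the (x, y) pixels inside box = (x0, y0, x1, y1) whose
    response exceeds the threshold."""
    resp = harris_response(img, k)
    x0, y0, x1, y1 = box
    return [(x, y) for y in range(y0, y1 + 1) for x in range(x0, x1 + 1)
            if resp[y][x] > threshold]
```

The response is positive at corners, negative along straight edges and zero in uniform areas, which is why a sufficiently textured object is required.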

All image processing libraries offer tools for extracting characteristic points, as described in the referenced document [5].

d) A step (E4) of finding the corresponding points in the second image

The search for the corresponding points in the second image can be carried out by conventional image matching or point tracking techniques.

The use of the estimated position of the box 27 (step E2) limits the search area. We obtain a list of so-called paired points.

The technique used is a differential correlation, as proposed in the referenced document [6].

e) A step (E5) of match filtering

The objective of this step E5 is to detect mismatches, that is to say the associations of image points that do not correspond to the same 3D point of the scene.

Simple criteria make it possible to detect them. Using a stereoscopic set 15 makes it possible to define rules that all matches must comply with. Thus, if we write (xl, yl) and (xr, yr) for the image coordinates of paired points in the left and right images 25 and 26, we then have:

(a) xl > xr

(b) the difference |yl - yr| must be very small.

Relationship (b) stems from the fact that, by construction of the stereoscopic set, the images are aligned vertically, as shown in Figure 4.
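Rules (a) and (b), together with the consistency check on left/right distances that this step also relies on, can be sketched as follows. This is a simplified stand-in, not the robust estimator of document [2]: the consistency check is replaced here by a median and median-absolute-deviation (MAD) test, and the tolerances `max_dy` and `k` are assumed values. The function name `filter_matches` is hypothetical.

```python
from statistics import median

def filter_matches(pairs, max_dy=1.5, k=3.0):
    """pairs: list of ((xl, yl), (xr, yr)) image coordinates.
    Returns the pairs that pass rules (a), (b) and the disparity check."""
    # Rule (a): xl > xr.  Rule (b): |yl - yr| small (max_dy pixels, assumed).
    kept = [(pl, pr) for pl, pr in pairs
            if pl[0] > pr[0] and abs(pl[1] - pr[1]) <= max_dy]
    if len(kept) < 3:
        return kept
    # All points lie on one object, so disparities should be of the same
    # order of magnitude: reject those far from the median (MAD-based test).
    disparities = [pl[0] - pr[0] for pl, pr in kept]
    med = median(disparities)
    mad = median(abs(d - med) for d in disparities)
    scale = max(1.4826 * mad, 1e-6)
    return [pair for pair, d in zip(kept, disparities)
            if abs(d - med) <= k * scale]
```

The factor 1.4826 converts the MAD into an estimate of the standard deviation for normally distributed data; other scale estimates would work equally well.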

Moreover, since all the points belong to the same object, the distances between the left and right points must all be of the same order of magnitude. Robust estimation techniques make it possible to identify the points whose distances are too large, as described in the document referenced [2]. Figure 4 illustrates these classic pairing rules for a stereoscopic system (P: 3D point of the scene; Cl and Cr: optical centers of the cameras; pl (xl, yl) and pr (xr, yr): projections of the 3D point in the left and right images). These rules are well known to those skilled in the art.

f) A step (E6) of triangulation of the paired points

This is a classic step in computer vision. Knowing the left and right image coordinates of the same point of the scene, and given the calibration parameters of the stereoscopic set, it is possible, by triangulation, to estimate the 3D coordinates Pl of the points, expressed in the reference frame of the first camera. Figure 5 illustrates the fact that it is easy to recover these 3D coordinates knowing the displacement (R, t) between the two cameras and the image coordinates of a point of the scene. The depths in the two camera frames (Zl and Zr) are deduced from the following relation:

where the + sign designates the pseudo-inverse operator.

g) A step (E7) of estimating the 3D position of the box

The box 27 defined by the user does not necessarily correspond to physical points of the scene. The objective is to estimate a 3D position of this box from the 3D coordinates estimated in the previous step. A pragmatic solution consists in assigning to the corners of the box the median depth of the estimated 3D points.
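The triangulation of step E6 can be sketched as below. The relation itself is not reproduced in this text, so the sketch assumes one standard formulation: with normalized homogeneous image coordinates ml and mr, and the convention Pr = R Pl + t, the depths satisfy Zr mr - Zl R ml = t, that is A [Zl, Zr]^T = t with A = [-R ml | mr], solved with the pseudo-inverse of A. The convention for (R, t) and the function name `triangulate` are assumptions.

```python
def triangulate(ml, mr, R, t):
    """Return (Zl, Zr, Pl) for one paired point.
    ml, mr: normalized image coordinates (x, y) in the left and right images.
    R (3x3 nested lists), t (3 values): assumed convention Pr = R * Pl + t."""
    hl = (ml[0], ml[1], 1.0)
    hr = (mr[0], mr[1], 1.0)
    # Columns of the 3x2 matrix A = [-R*hl | hr].
    a = [-sum(R[i][j] * hl[j] for j in range(3)) for i in range(3)]
    b = list(hr)
    # Pseudo-inverse of a full-column-rank matrix: (A^T A)^-1 A^T,
    # written out for the 2x2 normal equations.
    aa = sum(x * x for x in a)
    ab = sum(x * y for x, y in zip(a, b))
    bb = sum(y * y for y in b)
    at = sum(x * y for x, y in zip(a, t))
    bt = sum(x * y for x, y in zip(b, t))
    det = aa * bb - ab * ab
    if abs(det) < 1e-12:
        raise ValueError("rays are nearly parallel; depth is not observable")
    zl = (bb * at - ab * bt) / det
    zr = (aa * bt - ab * at) / det
    return zl, zr, tuple(zl * c for c in hl)
```

The returned point Pl = Zl ml is expressed in the frame of the first (left) camera, as in the description. The same routine could be reused for the optional model update of the processing phase.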

Figures 6A and 6B illustrate several of these steps namely E1, E2, E3 and E4 in the left and right images 31 and 32 and the final results of the initialization phase.
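The pragmatic solution of step E7 (assigning the median depth to the corners of the box) can be sketched as follows. The pinhole parameters `fx`, `fy`, `cx`, `cy` and the function name `box_corners_3d` are assumptions introduced for the illustration; the description only states that the intrinsic parameters are known.

```python
from statistics import median

def box_corners_3d(box, points_3d, fx, fy, cx, cy):
    """box = (u0, v0, u1, v1) in pixels; points_3d = [(X, Y, Z), ...] in the
    camera frame. Each corner is back-projected at the median depth."""
    z = median(p[2] for p in points_3d)
    u0, v0, u1, v1 = box
    corners_px = [(u0, v0), (u1, v0), (u1, v1), (u0, v1)]
    # Pinhole back-projection: X = (u - cx) / fx * Z, Y = (v - cy) / fy * Z.
    return [((u - cx) / fx * z, (v - cy) / fy * z, z) for u, v in corners_px]
```

Using the median rather than the mean keeps the box depth stable when a few triangulated points are wrong.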

2 Processing phase

Figure 7 illustrates the succession of steps of this processing phase. The objective is to deduce the attitude (position and orientation) of each camera 19 with respect to the object 16, which is known through the 3D points and the 3D box estimated during the initialization phase.

The processing phase is carried out continuously, for each acquired pair of stereoscopic images (step E'1). At the end of the loop (step E'3, the steps E'4 to E'6 being optional), the position of the object 16 with respect to each camera 19 is known. This information is the result of the method of the invention. This result can simply be reported to the user (for example by drawing the current position of the box thus obtained on each pair of acquired images), or used as input data for another process. The current position of the object 16 thus obtained can be used to guide a robotic arm 17 towards this object 16, with a view to grasping it.

We thus have the following steps:

a') A step (E'1) of acquiring a pair of stereoscopic images

b') A step (E'2) of tracking the characteristic points

This step consists in finding the position of the characteristic points in the new images, knowing their positions in the two previous images. Conventional correlation techniques make it possible to perform this operation.
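As an illustration of differential tracking of the kind used for this step (KLT, document [6]), the sketch below performs a single Lucas-Kanade step for one point. It is deliberately reduced: no image pyramid, no iteration and no sub-pixel interpolation, so it only holds for small displacements. The function name `lk_step` and the window radius are assumptions.

```python
def lk_step(img0, img1, cx, cy, radius=2):
    """One Lucas-Kanade step: estimate the displacement (dx, dy) of the window
    centred on pixel (cx, cy) between img0 and img1 (lists of rows)."""
    gxx = gxy = gyy = bx = by = 0.0
    for y in range(cy - radius, cy + radius + 1):
        for x in range(cx - radius, cx + radius + 1):
            # Spatial gradients (central differences) and temporal difference.
            ix = (img0[y][x + 1] - img0[y][x - 1]) / 2.0
            iy = (img0[y + 1][x] - img0[y - 1][x]) / 2.0
            it = img1[y][x] - img0[y][x]
            gxx += ix * ix
            gxy += ix * iy
            gyy += iy * iy
            bx -= ix * it
            by -= iy * it
    # Solve the 2x2 system G d = b, where G is the gradient structure tensor.
    det = gxx * gyy - gxy * gxy
    if abs(det) < 1e-12:
        raise ValueError("window is not textured enough to be tracked")
    return ((gyy * bx - gxy * by) / det, (gxx * by - gxy * bx) / det)
```

The matrix G is the same structure tensor that the Harris detector evaluates, which is why points selected at initialization are well suited to this kind of tracking.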

The technique used is KLT differential tracking, as described in the referenced document [6].

c') A step (E'3) of pose calculation

This step consists in deducing the pose of the camera with respect to the object. The information used is: the estimated pose of the camera for the previous pair of stereoscopic images, the 3D coordinates of the characteristic points of the object and the 2D coordinates estimated in the previous step.

This step can be performed using virtual visual servoing techniques as described in document [4]. Visual servoing is an approach that deduces the displacement of a robotic system from an error measured between a visual feature observed in the image and the desired value of this feature. Virtual visual servoing takes up this principle, but instead of moving a robot, the estimated pose of the camera with respect to the scene of interest is updated. This operation updates the pose of each camera with respect to the object. This principle adapts very easily to a stereoscopic configuration, as described in the document referenced [3].

To do this, we use the formalism below. We assume the following are known:

- the pose of the camera with respect to the object at time t: ${}^{c}M_{o}$,

- the 3D positions of the characteristic points, obtained by triangulation at initialization: ${}^{o}P_{i}$ (3D).

Following the acquisition of two new stereoscopic images, the tracking step E'2 provides the new 2D positions of the points in the images: ${}^{c}p_{i}$ (2D).

- From the pose of the camera with respect to the object, we estimate the expected position of the points by projection: $\hat{p}_{i} = A\,{}^{c}M_{o}\,{}^{o}P_{i}$, where $A$ is a projection operator from 3D space to image space (using the intrinsic parameters of the cameras).

- This gives an error between the estimated position and the measured position: $e_{p} = \hat{p}_{i} - {}^{c}p_{i}$.

- Visual servoing makes it possible to obtain a camera velocity that reduces this error: $v = -\lambda L^{+} e_{p}$, where $\lambda$ is a positive scalar regulating the speed of convergence, and $L^{+}$ is the pseudo-inverse of the interaction matrix, which relates the variation of the image coordinates of the points to a movement of the camera.

- By applying this velocity to the estimated pose, we obtain an update of the estimated pose of the camera: ${}^{c}M_{o}^{\,n+1} = \exp(v)^{-1}\,{}^{c}M_{o}^{\,n}$, where $\exp$ denotes the exponential map from velocities to displacements.
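The loop formed by the steps above can be sketched with NumPy as follows. This is a minimal single-camera illustration under stated assumptions, not the stereoscopic implementation of the invention: image points are in normalized coordinates (so the operator A reduces to a perspective division), the interaction matrix is the classical one for image points, the pose update is taken as cMo <- exp(v)^-1 cMo, and the gain `lam`, the stopping tolerance and all function names are illustrative.

```python
import numpy as np

def skew(w):
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def exp_se3(v):
    """Displacement (4x4) obtained by applying the velocity
    v = (vx, vy, vz, wx, wy, wz) for one unit of time."""
    u, w = v[:3], v[3:]
    theta = np.linalg.norm(w)
    W = skew(w)
    if theta < 1e-9:
        R, V = np.eye(3) + W, np.eye(3) + 0.5 * W
    else:
        a = np.sin(theta) / theta
        b = (1.0 - np.cos(theta)) / theta ** 2
        c = (theta - np.sin(theta)) / theta ** 3
        R = np.eye(3) + a * W + b * (W @ W)
        V = np.eye(3) + b * W + c * (W @ W)
    M = np.eye(4)
    M[:3, :3], M[:3, 3] = R, V @ u
    return M

def project(cMo, oP):
    """Normalized image coordinates (N, 2) and depths (N,) of points oP (N, 3)."""
    cP = oP @ cMo[:3, :3].T + cMo[:3, 3]
    return cP[:, :2] / cP[:, 2:3], cP[:, 2]

def interaction_matrix(xy, Z):
    """Stack the classical 2x6 interaction matrix of each image point."""
    rows = []
    for (x, y), z in zip(xy, Z):
        rows.append([-1 / z, 0, x / z, x * y, -(1 + x * x), y])
        rows.append([0, -1 / z, y / z, 1 + y * y, -x * y, -x])
    return np.array(rows)

def virtual_visual_servoing(cMo, oP, measured, lam=0.5, iters=200, tol=1e-10):
    """Refine the pose cMo until the projections of oP match `measured`."""
    for _ in range(iters):
        s, Z = project(cMo, oP)
        e = (s - measured).ravel()          # estimated minus measured
        if np.linalg.norm(e) < tol:
            break
        v = -lam * np.linalg.pinv(interaction_matrix(s, Z)) @ e
        cMo = np.linalg.inv(exp_se3(v)) @ cMo
    return cMo
```

In use, `cMo` would be the pose estimated for the previous pair of images, `oP` the triangulated 3D points and `measured` the tracked 2D points. For the stereoscopic case, the rows contributed by the second camera would be stacked into the same system after being expressed in a common frame, as discussed in document [3].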

This process is repeated until the error $e_{p}$ is considered sufficiently small. This gives the current pose of the camera with respect to the object.

d') A step (E'4) of rejecting bad points (optional)

During the tracking phase, some characteristic points may be lost, that is to say the tracking has not succeeded in locating them in the current image. The tracking algorithm may also return a wrong result. These errors may affect the proper functioning of the method. It can therefore be useful to detect these errors and to exclude the corresponding points from subsequent processing.

The detection and rejection of these bad points is done by a statistical approach. During the pose calculation phase, since the object is rigid, one can indeed suppose that the error between the estimated position of the characteristic points and their tracked position in the image is of the same order of magnitude for all the points considered. M-estimation techniques, as described in the document referenced [2], make it possible to detect the points that do not respect this rule.

e') A step (E'5) of updating the model (optional)

The goal is to update the 3D position of the characteristic points, since the object/camera displacement can provide more precise knowledge of the image coordinates of the points (the closer one gets to an object, the finer this knowledge becomes). It is possible to update the 3D coordinates of the tracked points by triangulating from their image coordinates in the two images.

f') A step (E'6) of enrichment of the model (optional)

During the movement, tracked characteristic points may disappear, leave the field of view of one of the cameras or be hidden by another object of the scene. To prevent the knowledge of the object from diminishing, it is possible to add new points to the model. This operation is performed by repeating the processing of the initialization phase, using the current 3D position of the box deduced from step c').

Example of implementation

The method of the invention can be used in the context of assistance to persons with disabilities. Thanks to a robotic arm, a disabled person can grasp and manipulate objects in his or her environment. A system currently marketed by Exact Dynamics, for example, is controlled by means of a joystick adapted to the handicap of the user. This mode of operation requires close control of the system by the user and monopolizes all of the user's attention throughout the grasping phase (approach of the arm, closing of the gripper, return of the arm towards the operator, ...).

The method of the invention makes it possible to envisage automating these grasping tasks. In this context, it is often proposed to use one or more cameras to control and guide the movements of the arm towards the object to be grasped. Naturally, the success of the grasp rests mainly on the tracking of the object in the stream of images acquired by the camera or cameras observing the scene. The method of the invention makes it possible to control the movements of the arm towards the object automatically. This new approach is more generic and robust than that proposed in the document referenced [1]. The original characteristics of this new method include:

- a very simple initialization of the algorithm,

- no prior information needed about the object to be tracked.

The first characteristic is essential, since the method is developed for use by non-specialists. It is therefore very important that its use is very simple and does not require any particular scientific competence. The second characteristic is equally essential: not requiring prior knowledge of the object to be tracked makes it possible to consider a very wide range of objects. In addition, this second characteristic opens up many fields of application outside the field of assistance to disabled persons, since any rigid (non-deformable) and textured object can be handled by the method of the invention. Examples include vehicle tracking and the tracking of license plates.

REFERENCES

[1] EP 1614509

[2] "Statistically robust 2D visual servoing" by A. Comport, E. Marchand, F. Chaumette (IEEE Trans. on Robotics, 22(2), pages 415-421, April 2006, http://www.irisa.fr/lagadic/pdf/2006_ieee_tro_comport.pdf)

[3] "Robust model-based tracking with multiple cameras for spatial applications" by F. Dionnet, E. Marchand (9th ESA Workshop on Advanced Space Technologies for Robotics and Automation, ASTRA 2006, pages 287-294, Noordwijk, The Netherlands, November 2006, http://www.irisa.fr/lagadic/pdf/2006_astra_dionnet.pdf)

[4] "Virtual Visual Servoing: a framework for real-time augmented reality" by E. Marchand, F. Chaumette (Computer Graphics Forum, 21(3), pages 289-298, September 2002, http://www.irisa.fr/lagadic/pdf/2002_eurographics_marchand.pdf)

[5] "A combined corner and edge detector" by C. Harris and M. J. Stephens (Alvey Vision Conference, pages 147-152, 1988, http://www.ewe.uwa.edu/pk/research/matlabfns/Spatial/Docs/Harris/)

[6] "Good Features to Track" by J. Shi and C. Tomasi (Conference on Computer Vision and Pattern Recognition (CVPR '94), http://www.cmu.edu/pub_files/pub2/shi_jianbo_1994_1/shi_jianbo_1994_1.pdf)

Claims

1. A method for stereoscopic tracking of a textured object (16) using an object grasping system (14) comprising:

- a computer (10) associated with a display screen (11) and an input device (12, 13), and a stereoscopic vision set (13) comprising two cameras (19), wherein this object (16) and/or this stereoscopic set (13) are in motion, characterized in that it comprises the following steps:

- during an initialization phase: a) bringing the object into the field of vision of the two cameras, b) capturing two images and displaying the first of these two images, c) defining this object by entering two opposite corners of a box (27) enclosing it in this first image, d) searching for this box in the second image, e) extracting characteristic points (30) in the first image, f) searching for the corresponding characteristic points in the second image, g) eliminating the mismatches between the corresponding characteristic points belonging to the two images, h) estimating the coordinates of the paired points, i) estimating the 3D position of the box (27),

- during a processing phase: a') acquiring a pair of new stereoscopic images, b') tracking the characteristic points in the new images, c') deducing the position of the stereoscopic set with respect to the object.

2. The method according to claim 1, comprising the following additional step during the processing phase: d') updating the 3D position of the characteristic points.

3. The method according to claim 1, comprising the following additional step during the processing phase: e') rejecting the bad points.

4. The method according to claim 1, comprising the following additional step during the processing phase: f') enriching the model of the characteristic points by adding new characteristic points.

5. An object grasping method using a robotic arm and using the method according to any one of the preceding claims.