WO2009047561A1

WO2009047561A1 - Value determination

Info

Publication number: WO2009047561A1
Application number: PCT/GB2008/050923
Authority: WO
Inventors: John Allen Robinson
Original assignee: University Of York
Priority date: 2007-10-08
Filing date: 2008-10-08
Publication date: 2009-04-16
Also published as: GB0719527D0

Abstract

A method and apparatus are disclosed for determining at least one unknown attribute value associated with a target object visible in an image. The method includes the steps of providing a sample point in multi-dimensional space associated with the target object, determining at least one conditional probability distribution for at least one attribute associated with the target object and determining at least one conditional expectation value, indicating a respective attribute value associated with the target object, from the conditional probability distribution.

Description

VALUE DETERMINATION

The present invention relates to a method and apparatus for determining values associated with unknown attributes of objects visible in images. In particular, but not exclusively, the present invention relates to the real time, simultaneous determination of attributes, such as age or gender, associated with faces visible in still or moving images.

It is known that computers and other processing units are able to process graphic images in digital format. For example, such images may be digital or digitized photographs or still frames from video footage or the like. From time to time it is desirable or indeed required to identify objects of interest in such images and subsequently determine one or more attributes associated with the identified objects. One such particular field of interest is the detection of one or more human faces in an image and the determination of one or more attributes associated with the face. Such use has a broad range of possible applications such as surveillance and/or demographic analysis or the like.

Whilst there are a number of existing methods which have been proposed for detecting objects in images, very few methods have been suggested which can accurately indicate a particular attribute associated with the detected object. For example, there have been few techniques suggested able to determine a gender and/or race and/or demeanour and/or age associated with a person whose face is visible in an image. Also any techniques which have been suggested are prone to error and/or may take considerable time to operate and/or may require substantial processing power in the form of mainframe or expensive computers.

It is thus understood that from time to time it would be useful to be able to accurately and efficiently detect and describe a particular type of target object. Such an object may be a face but alternatively a broad range of target objects such as vehicles or buildings or planets or animals or the like could be targeted. Currently there are only limited ways in which such target objects may be detected and then attributes, associated with those target objects, determined.

It is an aim of the present invention to at least partly mitigate the above-mentioned problems. It is an aim of embodiments of the present invention to provide a method and apparatus for determining one or more values associated with unknown attributes of target objects visible in an image.

It is an aim of embodiments of the present invention to provide a method and apparatus able to determine values for multiple unknown attributes associated with one or more target objects visible in an image simultaneously.

It is an aim of embodiments of the present invention to provide a method and apparatus for determining unknown attribute values in real time to enable processing of a data stream corresponding to frame-by-frame images to be carried out.

It is an aim of embodiments of the present invention to provide a method and apparatus for training probability models for a class or sub-class of target object whereby the models can subsequently be utilised to determine unknown attributes associated with objects visible in an image.

According to a first aspect of the present invention there is provided a method of determining at least one unknown attribute value associated with a target object visible in an image, comprising the steps of: providing a sample point in multi-dimensional space associated with the target object; determining at least one conditional probability distribution for at least one attribute associated with the target object; and determining at least one conditional expectation value, indicating a respective attribute value associated with the target object, from the conditional probability distribution.

According to a second aspect of the present invention there is provided apparatus for determining at least one unknown attribute value associated with a target object visible in an image, comprising: a vector generator that generates an N-dimensional vector associated with a target object; and an attribute determiner that determines at least one conditional probability distribution for at least one attribute associated with the target object, and from the distribution determines at least one conditional expectation value indicating a respective attribute value associated with the target object.

According to a third aspect of the present invention there is provided a method of providing an N-dimensional regularised Gaussian probability distribution, comprising the steps of: providing a plurality of images of a selected class or sub-class of target object; hand labelling attributes associated with the target object; and generating a regularised model of the target object.

Embodiments of the present invention provide a method and apparatus for determining values for unknown attributes and/or confirming known attribute values associated with target objects visible in an image.

Embodiments of the present invention can determine attributes such as gender and/or race and/or glasses wearer and/or age and/or demeanour or the like for a person or persons whose face is visible in an image.

Embodiments of the present invention can carry out the determination of attribute values on more than one face in an image simultaneously and in real time so that frame-by- frame image processing is possible.

Embodiments of the present invention provide values for unknown attributes. When the attribute is a binary attribute the values can be used to determine the incidence or lack of incidence of the attribute. Alternatively, the values can be used to indicate a number within a possible range of values for a particular attribute. The number gives an indication of the attribute.

Embodiments of the present invention may also provide a further value in addition to the unknown attribute value indicating a reliability associated with each determined attribute value.

Embodiments of the present invention can be broadly applied to the determination of attributes associated with a broad range of target objects such as, but not limited to, faces, vehicles, buildings, animals, and others as detailed hereinafter. Embodiments of the present invention are also broadly applicable to a wide variety of applications including, but not limited to, surveillance, biometrics, demographic analysis, movie logging, key frame selection, video conferencing, control of further systems portrait beautification or safety monitoring.

Embodiments of the present invention are useful in "description" of objects when an object is visible in an image.

Embodiments of the present invention may be used in combination with known forms of object "detection", "recognition", and/or "verification".

Embodiments of the present invention will now be described hereinafter, by way of example, only with reference to the accompanying drawings in which:

Figure 1 is a block diagram illustrating detection and description of a target object;

Figure 2 illustrates a detection and description system;

Figure 3 illustrates images and the detection and description of facial attributes;

Figure 4 illustrates an N-dimension sampling point;

Figure 5 illustrates model training;

Figure 6 illustrates an image including multiple instances of target objects;

Figure 7 illustrates tagging of an image with determined attributes;

Figure 8 illustrates face/non face classification performance with training set size;

Figure 9 illustrates an optimisation surface corresponding to the CCMI result; and

Figure 10 illustrates examples of face detection.

In the drawings like reference numerals refer to like parts. Embodiments of the present invention provide a method and apparatus for determining values associated with unknown attributes of target objects. Throughout the further description reference will be made to the term "target object/s". It is to be understood that this term is to be interpreted broadly to cover any object whatsoever having one or more attributes associated with it which may be the target of analysis. Further discussion is made, by way of example only, with reference to a class of target object being a face. Different classes of object may be target objects such as, but not limited to, vehicles, buildings, animals, planets, stars, hands or other body parts or whole people or objects within medical images such as micro calcifications in mammograms or the like.

Reference is also made hereinafter to the establishment of one or more attributes associated with a target object. Again the term "attribute" is to be broadly construed as meaning any feature by which the class of object may be partitioned into a sub-class. For example, in terms of attributes associated with a class of object being a face, at least the attributes illustrated in Table 1 may be identified. Table 1 also indicates possible examples for applications of the attributes.

Table 1

It will be appreciated that further attributes associated with the face class may be envisaged and are applicable to embodiments of the invention and thus the attributes described above are not an exhaustive list. In terms of the possible further classes of object referred to above, such as vehicles, other respective types of applicable attributes may be envisaged. For example, for a vehicle, whether the vehicle has wheels, number of wheels, is the vehicle a car or train, does the vehicle have a windscreen?

Certain attributes are referred to herein as binary attributes. Such attributes may in real life have only one of two values. For example, with respect to gender of a human face, the gender must be male or female. Equally, some attributes will be non-binary attributes. A value within a range of possible values for such attributes, such as age, etc, can thus be determined.

It will be appreciated that for each object visible in an image, some attributes may be certain or may be provided by a user. Such attributes will hereinafter be referred to as known attributes. Other attributes which are not initially determined will be referred to as unknown attributes. Embodiments of the present invention can be used to describe these unknown attributes as well as, if desirable, confirming known attributes.

Figure 1 illustrates schematically the flow of steps followed according to embodiments of the present invention in order to define one or more unknown attributes associated with a target object or to confirm known attribute values. The process starts at step S100 with an input image which might be a still image or stream of moving images which may be separated into frame-by-frame still images.

Figure 2 illustrates a block diagram of an object detection and description system 200 which includes a digital processor 201 and data storage 202. The processor includes an image digitiser 203 which receives an input image 204. The image may or may not include an object 205 being targeted displayed in a sub-window 206. The image digitiser 203 is a routine, module or firmware or hardware entity or software program or the like that creates a digital image representation of the image 204.

An object detector 207 detects the object, such as a face, at step S101. If no target object is detected the system flags this and moves on to the next image to be processed or stops if all images are exhausted. The object detector 207 is a routine, module, firmware or hardware entity or software program or the like that detects instances of objects in an image 204. For example embodiments of the present invention can be utilised with a broad range of face detectors which provide framing coordinates for a sub-window including a face in the image. Aptly possible face detectors are cascade classifiers based on the work of Viola and Jones P. Viola and M. Jones, "Rapid Object Detection using a Boosted Cascade of Simple Features", Proc. IEEE CVPR, Hawaii, December (2001 ). The instance of the face is extracted at step S102 and then the size and/or luminance of the extracted image is adjusted at step S103. In this way, for each face to be described, the relevant area from the input picture 204 is extracted and resized to the nominal size of face models utilised during the face description process as described below. Each face may also be normalised for luminance e.g. to a standard mean and variance.

Figure 3 illustrates two examples of still frames from the film Kill Bill by Miramax Films 2003. Each image 204_{1 ;} 204₂, shown includes at least one face as a target object. A respective sub-window 20G₁ , 206₂, is detected in each image. For each detected sub- window including a detected target object, i.e. face, an N-dimensional vector is generated in the attribute determiner 208.

Figure 4 illustrates the generation of an N-dimensional vector for a detected target object in a sub-window. The image is divided into multiple pixels (16 shown in Figure 4). Each pixel 40O₁ to 40O₁₆ has a grey scale value determined for it. The value determined equivalent to the grey scale value for that pixel is stored in a corresponding entry in a column vector 401 . For example, as shown in Figure 4, the grey scale value for pixel 40O₁ is loaded into entry 40I ₁ whilst the grey scale value for pixel 40O₂ is stored in entry 401₂ etc. The values corresponding to the grey scale values from the target object image thus provide known values in respective entries in the vector 401 . The vector 401 has dimensions N. N is equal to the number of known entries plus a number of unknown entries 402! to 402₄. It will be understood that N can be any integral number and is determined by the pixellation scheme used as well as the number of attributes which are known/are to be determined. The number of unknown attributes may be selected as appropriate to the application.

In the example described, entry 402_! corresponds to the gender attribute associated with the face whilst entry 402₂ corresponds to race associated with the face, entry 402₃ corresponds to the age of the face and entry 402₄ corresponds to the degree to which the face is smiling. For a face illustrated in a new image these unknown attributes 402 will include blank entries in the N-dimensional vector 401 generated for the object. After generation of the N-dimensional vector at step S104, the vector is applied to a relevant model S105 supplied from a training module 209. It will be appreciated that whilst the module 209 is illustrated as a separate unit in Figure 2, the functionality of the module could be held in any part of the processor 201 . Embodiments of the present invention use conditional density estimation as described hereinbelow in more detail to derive the expectation for the unknown attribute values from a regularised model of faces probed by the input. In order to do this models of target objects, such as faces, are generated during a training sequence and can then be utilised in real time by probing the models with the generated sample vector for a specific target object illustrated in an image. The models used can consist of a single multi-variate normal characterised by a single mean vector and covariance matrix or optionally of a multiple (mixture) model consisting of several multi-variate normals. In either case the models are generated from training samples. When the target object is a face these are face images which have been framed and scaled to a consistent size then hand-labelled with attributes. The attributes are converted to numerical values which, for any particular image, are appended to the values derived from the pixels of that image to form a feature vector. The lengths of the vectors provide the dimensionality of the space in which all the image and attribute values are processed.

The normal models are regularised so that in deriving the parameters of each model from training samples the statistics are combined in ways that optimise performance.

Figure 5 illustrates a step in the training process which is carried out prior to the face description. Many training samples 500, each being an image 501 , including an object instance 502 displayed in a respective sub-window region 503. Each window 503 is extracted and sized and adjusted for luminance properties to provide an example image of an object 504_! to 504_n with the object, such as a face, in each image having certain characteristics. Grey scale values for pixels in each sized sub window in the image are determined according to a number of divisions of the image which will ultimately be used during processing of unknown images. For example, as per Figure 4, each image 504 is split into 16 pixels and grey scale values corresponding to each respective pixel are stored into the first 16 entries 505i to 505i₆ of the column vector for that object. Next values for the 4 remaining entries, which each correspond to a respective attribute, are entered by a human supervisor. For example, a human supervisor will look at each image in the training sample and determine a gender, race, age and level of smile and a respective value will be stored in respective entries 506i to 506₄. It will be appreciated that other numbers of attributes may be utilised according to embodiments of the present invention. The training samples are stored in a data store of the training module 209 or data store 202 accessible to the training module.

N-dimensional vectors are generated for each image in the training sample set. For a single-normal-model approach a regularised normal model is generated from the full set of training samples. For a multiple-model approach the training set is first partitioned according to the scheme described in section "MULTIPLE-MODEL APPROACH", then a regularised normal model is generated for each subset.

MODEL TRAINING

Normal models used in the invention are derived from training samples according to the following scheme for regularising covariance estimates.

Suppose that each of K classes or sub-classes of object is characterized by a n- dimensional normal distribution:

/_!(x) =

exp[-l/2(x- μ_!)^r ∑;¹(x- μ_!)] (1 ) where μ_; is the class's mean vector, Σ_; is its covariance matrix, and T denotes transpose. While not all that follows depends on the normality assumption, the classification and therefore the optimization criterion use the Maximum Likelihood (ML) classifier which assumes normality, namely, classify a sample x as belonging to class k if

with d_I(x) = (x - μ_I)^r∑r¹(x - μ_I) + ln|∑_I| (3) which, if the prior probabilities P₁ of the classes are known, is related to the Bayes classifier

4(x) = (x - _μi)^r ∑;¹(x - μ_!) + ln|∑J - 21n^ (4) by a scalar offset.

Using a unimodal multivariate normal model for appearance-based image processing assumes a compact, convex distribution with ellipsoidal symmetry. The particular shape of the normal distribution relative to other compact, convex, symmetric distributions, has little effect for classification.

With the normal model, the characterization of class / amounts to estimating the mean vector μ_; and covariance matrix Σ_; . The estimated values are denoted μ_; and Σ_; . For μ_; the sample-mean of the class / training samples m, is the maximum likelihood estimate. While there is sometimes a justification for moving of the sample mean towards a prior there is usually no reason to modify it on the basis of any out-of-class samples. Hereinafter m, will be adopted as μ_; . The class /training samples also provide a sample- covariance matrix

^S, = £∑K, ^"A₁ ⁾K, ^"A/ ⁽⁵⁾

J=I where N₁ is the number of training samples for class /, of which x_y is the /th. The average sample-covariance matrix is

K

^average ⁼ ~K Z-_I J ^

J=I which weights all classes equally, whereas the pooled sample-covariance matrix is

S_pooled = j_f∑N_JS_J (7)

J=I which weights all training samples equally. Finally, in some contexts, there may be a total sample-covariance matrix, constructed from all the training samples including some which are unlabelled but belong to a superclass of which all K classes are part:

N

with Λ/the total number of samples and μ the mean vector of all samples.

Embodiments of the present invention make us of Cross-Validated Covariance Mixing (CCM) which estimates the covariance matrix for class / as: ±_ι{{a_] }γ_ι) = γ_ι{a_ιS_ι + aβ₂ + ... + {l-a_ι -a₂ -... -a_p__ι)S_p) (9)

Here p is not fixed; any number of contributing covariances might be mixed. This recognizes that although accurate estimation makes use of available information, no particular partial estimator has special status - not even the sample-covariance for the class' training samples (the plug-in). In any particular case, the available evidence and the application will determine what S, are available. They will typically include ∑_pnor, some assumed prior distribution, S, the class's sample-covariance, S_pooted or S_tofa/, and I. There might be covariances available at different levels of a class hierarchy, for example one might estimate the covariance of "smiling faces" by mixing a structured prior (e.g. encoding that adjacent pixels have a constant correlation coefficient), with a sample- covariance matrix generated from a set of real images, a sample-covariance matrix from a set of faces, and a sample-covariance matrix from a set of smiling faces. Equation 9 indicates that mixing for different classes will use the same weightings of these components (the {a}), but that overall class volumes will then be adjusted by class- specific weights (the ( ^J). The reason for this is set out below in detail but briefly, the class-specific scalars J₁ are meant to undo the differential effects on the volume of the different class-specific contributions to the estimate (like S, ) by the same global contributions (like I). γ_λ is set to 1 ; the remainder are set relative to this.

In more detail suppose that the class sample-covariance matrices are well-conditioned and therefore invertible with non-zero eigenvalues. In practice they will often be singular because there are insufficient training samples, but one can add the smallest amount necessary for numerically-stable inversion to the eigenvalues. As a result each class has a highly ellipsoidal distribution.

An n-dimensional hyperellipsoid has three fundamental properties - its volume (a scalar), its shape (the lengths of its n axes; n-1 free if the volume is known) and its orientation (specified by n-1 angles). In principle these can be manipulated independently. Changing the volume of the sample-covariance distribution corresponds to changing the matrix determinant or scaling all the eigenvalues equally. Changing the shape corresponds to a transformation of the eigenvalues that is not a pure multiplication. Changing the orientation corresponds to multiplication by an orthogonal matrix. The manipulations actually done to the sample-covariance matrix hyperellipsoid according to prior art techniques such as Regularised Discriminant Analysis (RDA) and Mixed-Leave-One-Out Covariance Matrix (Mixed-LOOC) estimators alter the three properties in combinations as follows:

1. Adding a multiple of the identity matrix (as done in both RDA and Mixed-LOOC) causes a shape change - short axes get elongated proportionally more than long ones. It also causes a volume change, despite the weights-sum-to-1 constraint in both methods. a. The shape change is favourable for regularization - it diminishes the importance of distances in the low-variance subspace whose eigenvalues are most dramatically affected by the scarcity of training data. The weighted sum with the identity matrix has two other valuable properties: it is the simplest mechanism for adjusting shape and it has a direct interpretation in the measurement (pixel) domain as the addition of uncorrelated noise. b. However, the volume change is less well motivated and the use of the average of the diagonal of a sample-covariance matrix in both RDA and Mixed-LOOC to "normalize" the identity matrix term is questionable, first because it takes the volume of the non-regularized sample-covariance as a reliable estimator of covariance volume, and second because it corresponds in the measurement (pixel) domain to adding different amounts of uncorrelated noise to each class.

2. Forming a weighted sum of the class sample-covariance and a pooled or average sample-covariance changes all three of volume, shape and orientation. However, it is a justifiable way of effecting an orientation change. In principle, the pooled matrix could be decomposed into an re-orientation and a scaling and these applied independently, but the simple addition of matrices has the virtue of simplicity and two obvious endpoints - the class sample-covariance and the common sample- covariance.

3. Forming a weighted sum with the diagonal either of the class sample-covariance or the pooled sample-covariance (as done in Mixed-LOOC but not RDA) lacks theoretical justification. If the scales of the different dimensions are different, then there is perhaps a rationale, but it is notable that the only reported experiments in which the original LOOC significantly outperformed RDA were those where the covariance eigenvectors were aligned with the measurement axes and where, therefore, the diagonal entries (the variances) would accidentally estimate the full covariance.

In the light of points 1 a and 2, the estimator according to embodiments of the present invention uses the identity matrix and the common sample-covariance as components for weighted corrections to the class sample-covariance. Following point 3, matrix diagonals are not included. The argument of 1 b relates to the weighting of the components. The estimator according to embodiments of the present invention adopts global parameters for combining a class's sample-covariance matrix with the common sample- covariance and the identity matrix, but recognizing that this produces different volume changes for each class, compensates by rescaling every class's volume independently.

The {a} and { γ} are thus tested on a grid of possible values. At each combination, a subset of size (m-1 )Λ//m of the training samples is used to develop sample-covariance matrices while the remaining N/m samples are processed according to the covariances estimated from equation 9 and evaluated according to the application's objective function. The objective function is always that of the underlying task. For classification, the ordinary Maximum Likelihood (ML) decision rules (2), (3) classify the validation set to produce error rates. The process is repeated m times (m-fold cross-validation) with different validation sets and the results summed. The best-performing parameter combination is selected.

Cross-validation equates to repeatedly dividing the training set into two parts, deriving statistics from the first part, processing the second with regularization parameter values to obtain objective function values, then choosing the values that give best overall performance. Doing this so that the second part of the training set contains only one sample, i.e. leave-one-out cross-validation, is the most exhaustive and accurate approach, but because the whole process must be run N times, it is time-consuming. As specified above, in CCM according to embodiments of the present invention the validation set can have more members, say 1/m of the total training set, and the process repeats m times, always with different samples in the validation set. This m-fold cross- validation takes just over ml N of the time of full leave-one-out validation.

Relative to the prior art, CCM has positive distinguishing characteristics — (i) it combines partial estimators similarly for all classes, then compensates for volume distortions via scale parameters, (ii) it evaluates only by an application-specific objective function, (iv) it permits cross-validation with fewer steps than leave-one-out. In appearance-based image processing, it is reasonable not to include diag(Si). The dimensions are all pixels and uncorrelated variation in individual component variances can be assumed due to equal noise, and therefore captured by the component I. Reasoning in this way allows the specialization of CCM for appearance-based image processing to: ±_ι(a,β, y_ι) = y_ι(aS_ι + (\ -a)S_p +β) (10)

This specializes equation (9) to use S,, S_p, and I, introduces /? to avoid / subscripts, and algebraically reorganizes the weighted sum (taking a constant factor into the class- specific weight) to allow testing over a grid. For dimensionality reduction or data recovery there is only one class, so only a, β need to be estimated, and for two class problems, there is additionally only one γ_t . Thus the complexity of the general method of equation

(9) reduces to the estimation of a maximum of three parameters in the experiments given by way of example only discussed below. The specialization of CCM in equation 10 is termed CCM1 in the remainder of this description. It will be appreciated that embodiments of the present invention are not restricted to only one or two class problems.

In this way, the single normal model is trained from all available training samples or a subset selected to match the training set more closely to the expected types of faces in the application. In this way the most appropriate part of the available training set can be used without extraneous training cases.

The single normal model estimates attributes by transforming the face covariance into a set of matrices for conditional density estimation. For each face to be described, the relevant area from the input picture is extracted and sized to the nominal size of the face model. This is then used as a probe with a single linear transformation being applied to the probe in line with conditional density estimation to derive an expectation value for the missing dimensions which are the descriptive attributes. Optionally, in addition, variances for the descriptive attributes, which are unknown, can be obtained to provide a measure of confidence of the estimation. The numerical values for the attributes are converted as necessary to descriptive tags (e.g. for male/female) or via a simple transformation to coordinates (e.g. of landmarks).

CONDITIONAL DISTRIBUTION ESTIMATION

Consider the random vector X, distributed as N_n(μ,∑). Suppose there is a particular sample from the distribution P, referred to as the "probe", in which some of the measurements are known and some are not. A binary vector M (or "mask") may be used to indicate which dimensions in P are known values, and a permutation matrix R can then be defined which will reorder M into a column of q zeroes followed by p (= n - q) ones.

When R is applied to P, it moves all the unknown values to the top:

)

where P₁ and P₂are column vectors of dimensionality q and p respectively ( P₁ ¹S values are undefined and may correspond to unknown attibutes). Similarly,

(12)

It is also possible to define μ = Rμ =

(13)

Here μi represents the mean of the values in X₁, and μ₂ represents the mean of the values in X₂. ∑n is the covariance matrix associated with the Xi component values and ∑₂₂ is the covariance matrix associated with the X₂ component values. ∑_i2 is a sub matrix of overall covariance which represents covariance between components in the top X, and bottom X₂ (the probe) values. ∑₂₁ is the transpose of ∑₁₂.

Thus μ_perm is the mean vector reordered to match RP, the reordered probe vector. Similarly ∑_penn is the covariance matrix with rows and columns appropriately reordered.

A matrix A can then be defined with submatrices and dimensions as shown:

Now the random vector

is a linear transformation of the normally-distributed random vector X_perm and so itself is normally distributed with mean E[A(X_penn -μ_penn)] = AE[(X_penn -μ_perm)] = 0 and 5 covariance matrix AΣ_permA^τ where the T denotes transpose.

Using the identities E^₂ = E₂₂₅Ef₂ = E₂₁ (because of the symmetry of covariance matrices), it is possible to calculate equation (16) below:

10 22 Σ 21 0

A∑_permA^J .1

(16)

Equation (16) above is the motivation for choosing A: when the covariance matrix is calculated, the top right and bottom left corners turn out to be 0 submatrices, meaning

15 that X₁ -μ₁ - Σ_nΣ^~ ₂₂(X₂ -μ₂) and X₂ -μ₂ have zero covariance and are therefore independent. The quantity X₁ -μ - Σ₁₂Σ^~ ₂₂(X₂ -μ₂) can then be considered as a distinct q x q multivariate normal distribution. When X₂ takes the value P₂, the random variable becomes X₁ -μ - Σ_nΣ^~ ₂₂(P₂ -μ₂) . As shown above,

E[A(X_perm

AElX_perm -μ_perm)\ = Q, so E[X₁ -μ - ∑₁₂∑^" ₂(P₂ -μ₂)] = 0. But

20 μ - Σ₁₂Σ^~ ₂₂(P₂ -μ₂) is a constant, so the mean of X₁, i.e. the expected value of the missing data which is to be filled in, is given by:

E[X₁] =μ + Σ₁₂∑-₂₂(P₂ -μ₂) (17)

25 This provides a direct method for estimating P₁ from P₂. All that then remains is to apply R¹ to Ppe_rm to recover the full image corresponding to the probe.

The above derivation is indirect. It will be appreciated that it is also possible to construct a proof that uses the densities directly. Also it will be noted that, as well as the mean,

30 the covariance matrix of the estimated data has been derived as E₁₁ -∑₁₂∑^~ ₂₂∑₂₁. This can be used to explore the principal components of variation, i.e. the modes in which the real values are likely to differ from the estimate. This provides an indication on the reliability in the determined values.

c [χ j= ∑_π -∑₁₂ ∑ : ∑ _M (^{1 8})

Once a value for the missing attributes has been derived at step S106, together with any confidence values derived from equation 18, an image generator 210 uses these values, together with the input image, to display an output image 21 1 including the target object 205 together with one or more tags 212 associated with the object in the input image. After displaying the image attributes at step S107 for a predetermined time, the face description stops at step S108. Alternatively if the image is a frame of a series of frames the next frame in the flow can be input for processing.

MULTIPLE-MODEL VERSION

The multiple normal model is created by partitioning the training set, either manually (i.e. in a supervised way) or automatically (unsupervised).

Supervised partitioning of the training set may be by selection of an attribute and a threshold for that attribute, thereby defining a partition. For example, the sex of a face is encoded as an attribute with value between 0 (male) and 1 (female). Choosing that attribute and the threshold 0.5 would define a partition of the training set where the male samples belong to one subset and the female samples belong to another.

Alternatively supervised partitioning could be by a criterion that involves a set of attributes. For example, a formula involving landmark location could be used to partition training images into those looking left, those looking forward and those looking upwards.

Further, supervised partitioning could specify a criterion on the pixel values rather than the descriptive attributes. For example a criterion based on the distribution of luminance in the picture could be used to partition based on illumination.

Unsupervised methods of partitioning the training set include standard clustering techniques whereby some or all dimensions within the feature space can be used to group the training samples into clusters which then function as partitions for later stages. Following partition of the training set, each subset is used independently to derive normal models. For each subset, more than one model is formed. The reason is that the requirements of classification and conditional density estimation lead to different regularization criteria for those two tasks. Since the description stage involves both classification and conditional density estimation, the two models are used separately in the two stages.

The multiple-model version may involve partitioning the training set in multiple ways, e.g. according to sex and according to illumination. For each of these different ways, each partition will have its own normal model for classification and its own normal model for conditional density estimation.

The multiple-model version estimates attributes similarly to the single model version in taking as input a resized and normalized face area. However, this area is then classified to one of the partitions defined during training. Classification is done by standard statistical means using the distributions regularized for this purpose. The face is then used as a probe in the conditional density model for that participation and the attributes read out appropriately. Alternatively, several models may be probed by the face, and the output description based on a combination of these depending on its closeness to boundaries in the classifier stage.

Embodiments of the present invention can be applied to images containing a single instance of a target object. Alternatively, as shown in Figure 6 images containing more than one target object may be described. Figure 7 illustrates use of the input image shown in Figure 6 with output images displayed in which certain landmarks associated with each face are described, together with numerical values for certain further attributes. These can be shown on a user interface such as a monitor in real time acting as tags visible to a user.

TRIALS

The following trials demonstrate the performance of the regularization scheme CCM1. CCM1 can be used for estimating covariances both for conditional distribution estimation and for classification between the component models of a multiple-model realisation. The trials here are concerned with classification, and evaluate CCM1 on diverse problems. Four sets of experiments are described below referring to CC/W7-estimated full- dimensional normal models. Section A presents two-class discrimination - face vs. non- face and smiling vs. neutral face -where CCMI is compared against other estimators for a range of training set sizes. This affords direct comparison between CCMI, regularised discriminant analysis RDA and Mixed-LOOC1 (leave one out covariance) matrix estimators. The same data sets also allow comparison of leave-one-out cross-validation, and simpler, faster training with fewer iterations. Section B shows results for the face/non-face discriminator used to find faces. These results are indicators of performance which may be compared against other face finding approaches. Section C illustrates face classification experiments with 40 and 200 classes representing the identities of 40 and 200 individuals. Distance measurement in full dimensional space is compared with a well-known subspace method. Finally Section D uses the same face datasets as C to investigate dimensionality reduction.

A. Face/non-face and smiling/neutral classification

Face classification experiments used 19x19 greyscale pictures for a total dimensionality of 361. All training images were normalized to the same luminance mean and variance and the face training images were centred just above the nose tip. The applications were discrimination of faces from non-faces and discrimination between smiling and neutral faces. In the first case the number of training images per class was many times the dimensionality, in the second it was of about the same size. In both cases an extended superset (pool) of images that encompassed both classes was available. For the smiling/neutral comparison, this larger pool was the set of training faces from the face/non-face case.

Tables 2 to 5 and figure 8 summarize the experiment. The tables show details for runs which used all available training images to compare CCM1 , RDA, Mixed-LOOC1 , unregularized quadratic ML classification and pooled-sample-covariance linear ML classification. Tables 2 and 3 show face/non-face classification; tables 4 and 5 smiling/neutral classification. Tables 2 and 4 use just the training sets for the classes being discriminated, tables 3 and 5 incorporate the larger pools. Two cases are shown for CCM1 : first the leave-one-out training result, where every one of the training images was used in turn for cross validation, then the result for three-fold cross validation. The graph in figure 8 compares the same methods for face/non-face discrimination, but with a range of training set sizes. The far left points on each curve in figure 8 correspond to the results of table 2.

Because the examples shown are two-class experiments, CCM1 has only one γ parameter ^ . The grid values tested during training were: { a , β , γ₂ } =

{{0,0.2,0.5,0.6,0.7,0.8,0.9,1.0},

{0.1 ,0.2,0.5,1 ,2,5,10,20,50,100,200,500,1000,2000,5000,10000,20000,50000}, {0.5,0.55,0.6,0.65,0.7,0.75,0.8,0.85,0.9,0.95,1 ,1.05,1 .1 ,1 .15,1.2,1.25,1 .3,1 .35,1.4,1.45,1 .5}} for a total of 8 x 18 x 21 = 3024 combinations. In practice, all optima were clearly within a subset of 210 of these, and by subdivision search on γ₂ this could be reduced to 63 cases. For RDA and Mixed-LOOC1 the grids were those of the original publications - J H Friedman, "Regularized Discriminant Analysis", Journal of the American Statistical Association, VoI 84 No 405, March 1989, pp 165-175 - B-C Kuo, D A Landgrebe, "A Covariance Estimator For Small Sample Size Classification Problems and its Application to Feature Extraction", IEEE Transactions on Geoscience and Remote Sensing, VoI 40, No 4, April 2002, pp 814-819 which are incorporated herein by reference.

Table 2

Table 2 illustrates face/non-face discrimination with no extra pool of training images. All images 19x19 greyscale. Training images: 3608 faces, 4479 non-faces. Test images: 1370 faces, 1276 non-faces.

Table 3

Table 3 illustrates face/non-face discrimination with extra pool of training images. All images 19x19 greyscale. Training images: 3608 faces, 4479 non-faces, 13200 superset of both. Test images: 1370 faces, 1276 non-faces.

Table 4

Table 4 illustrates smiling/neutral face discrimination with no extra pool of training images. All images 19x19 greyscale. Training images: 164 smiling faces, 300 neutral faces. Test images: 69 smiling faces, 136 neutral faces.

Table 5

Table 5 illustrates smiling/neutral face discrimination with extra pool of training images. All images 19x19 greyscale. Training images: 164 smiling faces, 300 neutral faces, 2249 superset of both. Test images: 69 smiling faces, 136 neutral faces.

Tables 2 to 5 and the graph in figure 8 show:

• All three regularization schemes (CCM1 , RDA, Mixed-LOOC1 ) provide much better performance than classification without regularization, even where the per-class sample size is over five times the dimensionality (as in Table 2).

• All three regularization schemes also outperform classification with unregularized pooled covariance. Using a larger superset for estimating a pooled covariance improves unregularized performance when the number of class-labelled samples is small (table 5), but CCM1 is significantly better.

• CCM1 outperforms both Mixed-LOOC and RDA, particularly when the volumes of the class covariances are dissimilar (e.g. in table 2 and figure 8 where a low-volume face class is compared with a high-volume non-face class). A reason for this is CCMI 's class-specific correction for changes to covariance matrix volume - i.e. the γ₂ parameter. Evidence for this comes from a supplementary experiment to that of

Table 2. The RDA tests were re-run with a modified Maximum Likelihood Classifier: a linear shift parameter was added to equation 3, with this being set to give optimal performance over a validation set. The modified RDA produced only 8 errors over the whole test set, more than halving the performance difference between CCM1 and RDA. The linear shift parameter provides RDA with a volume-correction similar to that incorporated in CCM1 . Probably a similar change to the classifier could also improve Mixed-LOOC1 although this would be contrary to the LOOC design rule that each class's parameters are estimated separately. Another reason for CCMI 's superior performance to RDA is that it is better at adapting to cases where a large pooled set of training samples is available (tables 3 and 5), perhaps because RDA mixes with the pooled covariance before regularizing the result with the identity matrix. • The difference between doing full leave-one-out training and three-fold cross- validation in CCM1 is small. Therefore, it is not necessary to do leave-one-out cross- validation. Three-fold cross-validation takes only a little more than 3/Λ/ of the time of leave-one-out training. Figure 9 shows a typical a,β optimization surface where at each point the optimal T₂ has been used to calculate the log error rate. In common with other optimization surfaces measured during the experiments, this has multiple minima, but a large area over which the results are close to the optimum. This partially explains the result in the last bullet point: CCM1 achieves near-optimum performance over a range of the parameters, so a suitable set is likely to be found through limited cross-validation training. Moreover the tested grid of parameter values is certainly fine enough.

B. Face detection

The experiments noted above yield regularized face and non-face covariance matrices, which can be used for face detection. Examples of appearance-based face detectors

(operating in the raw pixel domain) are Rowley et al and Sung and Poggio. The former uses multiple neural networks that scan a 20x20 window over scaled versions of the image while the latter uses elliptic k-means training of a multimodal structure of 6 face clusters and 6 non-face clusters with a similar scanning mechanism on 19x19 windows. More recent detectors have used the Haar features and cascade structure of Viola and

Jones P. Viola and M. Jones, "Rapid Object Detection using a Boosted Cascade of

Simple Features", Proc. IEEE CVPR, Hawaii, December (2001 ) which are incorporated herein by reference. Figure 10 shows examples of a face detection scanner using maximum likelihood classification of each 19x19 window as face or non-face, according to regularized covariance matrices. Errors are illustrated on the bottom row of figure 333

- a false positive on the left and a false negative on the right.

Faces detected as in figure 10 can be fed to the smiling/neutral classifier according to embodiments of the present invention. For the avoidance of doubt it is to be noted that embodiments of the present invention can be used in conjunction with any type of object detector/methodology.

Embodiments of the present invention provide a method and apparatus for providing an image processing scheme which can describe a picture of an object, such as a face, with a list of textual attributes. In the case of a face, the attributes can include the location of facial landmarks, interpretation of facial expression, identification of the existence of glasses, beard, moustache or the like, and intrinsic properties of the subject, such as race. By combination of these attributes it will be understood that embodiments of the present invention can also determine other properties, such as head orientation, gaze direction, arousal and valence. By monitoring attributes over time, still further properties, such as detection of speaking when the object is a face, and in principle lip-reading and temporal tracking of expression, can be carried out.

Embodiments of the present invention which are directed to the determining of attributes of a face have a broad range of applications which include but are not limited to video logging, surveillance, demography analysis, selection of key frames or portraits with particular attributes (e.g. eye visibility) from a photoset or video, preprocessing for recognition, resynthesis, beautification, animation, video telephony, and the use of the face as a user interface.

Embodiments of the present invention provide a detailed, rich, holistic, description of face images which can be utilised for a broad range of purposes. For example, developers of well known beautification applications may appreciate that such applications benefit from automatic location of landmarks. However, embodiments of the present invention provide the possibility of exploiting knowledge of age and sex and other attributes when carrying out the beautification application.

Throughout the description and claims of this specification, the words "comprise" and "contain" and variations of the words, for example "comprising" and "comprises", means "including but not limited to", and is not intended to (and does not) exclude other moieties, additives, components, integers or steps.

Throughout the description and claims of this specification, the singular encompasses the plural unless the context otherwise requires. In particular, where the indefinite article is used, the specification is to be understood as contemplating plurality as well as singularity, unless the context requires otherwise.

Features, integers, characteristics, compounds, chemical moieties or groups described in conjunction with a particular aspect, embodiment or example of the invention are to be understood to be applicable to any other aspect, embodiment or example described herein unless incompatible therewith.

Claims

1 . A method of determining at least one unknown attribute value associated with a target object visible in an image, comprising the steps of: providing a sample point in multi-dimensional space associated with the target object; determining at least one conditional probability distribution for at least one attribute associated with the target object; and determining at least one conditional expectation value, indicating a respective attribute value associated with the target object, from the conditional probability distribution.

2. The method as claimed in claim 1 , wherein said step of providing a sample point comprises the steps of: from image data associated with an image in which the target object is visible, determining a multi-dimensional column vector corresponding to the target object, said column vector comprising a plurality of known entries each having a known respective value and at least one unknown entry having an unknown value.

3. The method as claimed in claim 2, further comprising the steps of: providing known entries for said vector by determining grey scale values of respective pixels associated with the image of the target object, known entries in the vector having a value corresponding to the grey scale value of a respective pixel.

4. The method as claimed in claim 3, further comprising the steps of: said step of providing known entries further comprises providing a value for each known attribute of the target object at a respective entry in the column vector.

5. The method as claimed in claim 3, further comprising the steps of: subsequent to detection of the target object in the image, identifying a subset of data associated with the target object from a total data subset associated with the image; and scaling said a subset of data to a predetermined size.

6. The method as claimed in any preceding claim, further comprising the steps of: determining a conditional variance value indicating a reliability associated with the attribute value indicated from the conditional expectation value.

7. The method as claimed in any preceding claim wherein said step of determining a conditional probability distribution, comprises the steps of: providing at least one N-dimensional trained probability model of a target object type associated with said target object.

8. The method as claimed in claim 7, further comprising the steps of: providing a single N-dimensional trained probability model for a target object type corresponding to a class of said target object.

9. The method as claimed in claim 7, further comprising the steps of: providing a plurality of N-dimensional trained probability models each associated with a respective sub-class of said target object.

10. The method as claimed in claim 9, further comprising the steps of: determining a plurality of sub-classes of target object to which said a target object belongs; and providing a respective trained probability model for each determined sub-class.

1 1. The method as claimed in claim 10 wherein said step of determining the subclasses further comprises the steps of: via a standard probability classifier, determining which of the respective trained probability models is the most likely model for the target object.

12. The method as claimed in any one of claims 7 to 1 1 wherein the step of providing an N-dimensional trained probability model comprises the steps of: for each model, providing a pre-generated N-dimensional regularised Gaussian probability distribution with covariance equal to:

Σ_ι(a,β, γ_ι) = γ_ι(aS_ι +(l-a)S_p +βl)

where α, β, and y are predetermined weights, S₁ is a sample covariance matrix derived from a set of training samples for the target object, S_p is a pooled covariance matrix derived from target object and non target object training samples and I is the identity matrix.

13. The method as claimed in claim 12, wherein values for α, β, and y are provided by the steps of: providing a training set comprising a plurality of test images comprising training samples showing sample target objects; repeatedly dividing the training set into a main set and a check set; for each combination of main set and check set, generating an N-dimensional regularised Gaussian model of a target object from values in the main set; setting combinations of values for α, β, and y and cross-validating an accuracy of the set values by comparing results of the model when applied to the check set; and selecting values for α, β, and y by selecting the combination of values which provides a lowest error rate responsive to the comparison step.

14. The method as claimed in claim 8 or any claim dependent therefrom wherein said step of determining a conditional expectation value comprises the steps of: determining a mean vector E[X₁] comprising a plurality of entries, each entry comprising a mean value associated with a respective attribute of the target object.

15. The method as claimed in claim 14, further comprising the steps of: generating the mean vector E[X₁] equal to:

E[X₁ ] = μ_{1 +} Σ_n∑-₂l {P₂ - μ₂ )

where P₂ is a column vector comprising known values of the target object, μ₂ is the mean vector of the known values derived from a training set, μi is the mean vector of estimates values and Z₁₂ and Z₂₂ are submatrices of a regularised covariance matrix.

16. The method as claimed in claim 14 or 15 further comprising the steps of: determining a covariance matrix C[X₁] comprising a plurality of entries, each entry comprising a respective value indicating a variance associated with a respective attribute of the target object; and responsive to the covariance matrix, determining a conditional variance value indicating a reliability associated with the attribute indicated from the conditional expectation value.

17. The method as claimed in claim 16, further comprising the steps of: generating the covariance vector C[X₁] equal to:

c ^L \ LAγ l 1J-- y Zj -T 12 Y 22¹ y Zj 21

where I₁₁ is a covariance matrix associated with unknown attributes, Σ₂₂ is a covariance matrix associated with known attributes, ∑₁₂ is a submatrix representing covariance between known and unknown attributes and ∑₂₁ is the transpose of ∑₁₂.

18. The method as claimed in claim 9 or any claim dependent therefrom wherein said step of determining a conditional expectation value comprises the steps of: for each of the trained probability models, determining a mean vector E[X₁] comprising a plurality of entries, each entry comprising a mean value associated with a respective attribute of the target object.

19. The method as claimed in claim 18, further comprising the steps of: generating the mean vector E[X₁] equal to:

E[X₁ ] = μ_λ + Σ₁₂Σ^" ₂ (P₂ - μ₂ )

where P₂ is a column vector comprising known values of the target object, μ₂ is the mean vector of the known values derived from a training set, ^ is the mean vector of estimates values and Z₁₂ and Z₂₂ are submatrices of a regularised covariance matrix.

20. The method as claimed in claim 18 or 19 further comprising the steps of: determining a covariance matrix C[X₁] comprising a plurality of entries, each entry comprising a respective value indicating a variance associated with a respective attribute of the target object; and responsive to the covariance matrix, determining a conditional variance value indicating a reliability associated with the attribute indicated from the conditional expectation value.

21. The method as claimed in claim 16, further comprising the steps of: generating the covariance vector C[X₁] equal to:

21

where ∑n is a covariance matrix associated with unknown attributes, Σ₂₂ is a covariance matrix associated with known attributes, ∑i₂ is a submatrix representing covariance between known and unknown attributes and ∑₂i is the transpose of ∑_i2.

22. The method as claimed in any preceding claim, further comprising the steps of: determining an unknown binary attribute associated with the target object by comparing the determined conditional expectation value with at least one predetermined threshold value associated with an attribute and subsequent to the comparison selecting an incidence or lack of incidence of the binary attribute.

23. The method as claimed in any preceding claim, further comprising the steps of: simultaneously determining a conditional expectation value for a plurality of unknown attributes associated with the target object to thereby simultaneously determine a plurality of attributes values associated with the target object.

24. The method as claimed in any preceding claim, further comprising the steps of: prior to the step of determining a sample point, detecting each target object shown in the image; and determining at least one unknown attribute value for each detected target object.

25. The method as claimed in any preceding claim wherein said target object class comprises a face.

26. The method as claimed in claim 25 wherein a sub-class of the target object is selected from one of gender and/or race and/or glasses wearer and/or age and/or demeanor.

27. The method as claimed in any preceding claim wherein said target object class comprises a vehicle.

28. A computer program comprising program instructions for causing a computer to perform the method of any one of claims 1 to 27.

29. A computer program product having thereon computer program code means, when said program is loaded, to make the computer execute procedure to display, on a user display, an image in which a target object is visible together with at least one unknown attribute value associated with the target object, attribute values for generating said image being determined by the computer responsive to the method as claimed in any one of claims 1 to 27.

30. Apparatus for determining at least one unknown attribute value associated with a target object visible in an image, comprising: a vector generator that generates an N-dimensional vector associated with a target object; and an attribute determiner that determines at least one conditional probability distribution for at least one attribute associated with the target object, and from the distribution determines at least one conditional expectation value indicating a respective attribute value associated with the target object.

31 . The apparatus as claimed in claim 30, further comprising: an image digitiser that receives input images and provides digitised images in response thereto.

32. The apparatus as claimed in claim 30 or claim 31 , further comprising: an object detector that detects target objects in digitised images.

33. The apparatus as claimed in claim 30, further comprising: an image generator that generates images including one or more tags for each target object in an image.

34. The apparatus as claimed in any one of claims 30 to 33, further comprising: a user display that displays images generated by the image generator.

35. The apparatus as claimed in any one of claims 30 to 34, further comprising: a training module storing at least one regularised model of a target object.

36. A method of providing an N-dimensional regularised Gaussian probability distribution, comprising the steps of: providing a plurality of images of a selected class or sub-class of target object; hand labelling attributes associated with the target object; and generating a regularised model of the target object.

37. The method as claimed in claim 36 wherein each probability distribution has a covariance equal to:

Σ_ι(a,β, γ_ι) = γ_ι(aS_ι +(l-a)S_p +βl)

38. A method substantially as hereinbefore described with reference to the accompanying drawings.

39. Apparatus constructed and arranged substantially as hereinbefore described with reference to the accompanying drawings.