Image Denoising with Information-Theoretic, Adaptive Filtering
This invention was made with government support under grant numbers CCR0092065 and EIA0313268 awarded by the National Science Foundation and grant number P20HL68566 awarded by the National Institutes of Health. The Government has certain rights to this invention.
1 Background
The problem of denoising images is one of the most important and widely studied problems in image processing and computer vision. Research has led to a plethora of algorithms based on diverse strategies such as linear systems, statistics, information theory, and variational calculus. However, most image filtering strategies make strong assumptions about the properties of the signal and/or noise. Therefore, they lack the generality to be easily applied to diverse image collections, and they break down when images exhibit properties that do not adhere to the underlying assumptions. Hence, there is still a need for general image filtering strategies that are effective for a wide spectrum of denoising tasks and are easily adaptable to new applications.
Classical approaches to image filtering, also applied to the problem of image restoration, rely mostly on the application of the theory of linear systems to images. This work includes
Fourier transform methods, least-squares methods, wavelet-based methods, and scale-space theory. Although these are computationally efficient, because of their linearity, the effects of these algorithms are not local in space and therefore they have difficulties dealing with local features, such as edges. Nonlinear filtering approaches are typically based on either variational methods, which result in algorithms based on partial differential equations (PDEs), or statistical methods, which result in nonlinear estimation problems. Nonlinear filters can overcome some of the limitations of linear filters, but they introduce some problems such as higher computational costs and the additional tuning of extra free parameters. Furthermore, most linear and nonlinear approaches enforce specific geometric or statistical assumptions on the image.
PDE-based image processing methods became widespread after Perona and Malik's work on anisotropic diffusion, which smoothed textures while sharpening edges. The anisotropic diffusion equation is also the first variation of an image energy that penalizes image gradients with an allowance for outliers, and therefore seeks piecewise constant solutions. Because such variational approaches prefer certain image geometries, these local geometric configurations can be referred to as models. A multitude of nonlinear PDE models have been developed for a wide variety of images and applications, along with a variety of algorithms based on level sets.
Several groups have attempted to extend these PDE-based methods to more complicated image models. Vese has modeled textured images using functional minimization and partial differential equations by decomposing images as a sum of two functions — a cartoon-like image (bounded variation) and a texture image. Weickert has proposed a coherence enhancing flow (not derived as a variation of an image energy), which preserves and enhances textures that exhibit a homogeneous structure tensor. Several authors have proposed higher-order flows that correspond to piecewise-linear image models. These nonlinear PDE models have proven to be very effective, but only for particular applications where the input data is well suited to the model's underlying geometric assumptions. Moreover, the parameter tuning
is a challenge because it entails fuzzy thresholds that determine which image features are preserved (or enhanced) and which are smoothed away.
The statistical approaches to nonlinear filtering fall into several classes. One class comprises methods that use robust statistics, the most prevalent being the median filter. The median filter enforces a constant image model with an allowance for outliers; iterative applications of the median filter to 1D signals result in piecewise-flat solutions. Miller has proposed robust statistics for fitting higher-order models. Bilateral filtering is a robust, nonlinear filtering algorithm that replaces each pixel by the weighted average over a neighborhood with a fuzzy mechanism for excluding outliers. Like many of the variational approaches, these statistical methods are essentially mechanisms for fitting simple geometric models to local image neighborhoods in a robust way.
Another class of statistical methods for image processing relies on stochastic image models described by Markov random fields (MRFs), as proposed by Geman and Geman. The Markov property for images is based on the assumption of spatial dependency or predictability of the image intensities: it implies that the probability of a pixel having a particular intensity depends only on the intensities of its spatial neighbors. Geman and Geman describe an algorithm that relies on Gibbs sampling to modify pixel intensities. Gibbs sampling, assuming knowledge of the conditional distributions of the pixel intensities given the intensities of their neighbors, generates a Markov chain of pixel intensities which converges pointwise to the desired denoised image. These conditional probabilities for image neighborhood configurations (called cliques) play a similar role to the image energy in the variational approaches. For instance, MRF image models often include extra parameters (hidden parameters) that explicitly model intensity edges, allowing these models to achieve piecewise-constant solutions. Thus, these conditional probabilities encode a set of probabilistic assumptions about the geometric properties of the signal (noiseless image).
Figure 1 shows the results of filtering on the Lena image using some of the prevalent nonlinear techniques, demonstrating their typical characteristics. Perona and Malik's diffusion (Figure 1(c)) eliminates the noise on the cheeks but introduces spurious edges near the nose and the lips. Bilateral filtering (Figure 1(d)), which is essentially an integral form of anisotropic diffusion, tends to smooth away fine textures, e.g. on the lips. Both of these algorithms entail two free parameters (scale and contrast), and require significant tuning. The coherence enhancing diffusion (Figure 1(e)) forces specific elongated shapes in images, as seen in the enlarged nostril and the lips' curves. On the other hand, curvature flow, which is very similar to the total variation strategy of Rudin, tends to shrink features by rounding them off (Figure 1(f)). The Lena image, which appears to be a very typical grayscale photograph, does not adhere very well to the basic geometric models underlying these algorithms.
Recently, researchers have begun to analyze the statistics of natural images in terms of local neighborhoods, and are drawing conclusions that are consistent with MRF models of images. For instance, Huang analyzes the statistical properties of the intensity and range values of natural images. These include single pixel statistics, two-point statistics and derivative statistics. They found that the mutual information between the intensities of two adjacent pixels in natural images is rather large and attributed this to the presence of spatial correlation in the images. Lee and Silva analyze the statistics of 3 × 3 high-contrast patches in optical images, in the corresponding high-dimensional spaces, and find the data to be concentrated in clusters and low-dimensional manifolds exhibiting a nontrivial topology. This prior work relies on the hypothesis that natural images exhibit some regularity in neighborhood structure, captured in a parametric manner.
The literature shows several statistically-based image processing algorithms that rely on information theory, such as the mean-shift method, which moves the samples uphill on a PDF associated with the data. This process produces a steady state in which all of the data have values corresponding to the nearest local maximum of the PDF (assuming appropriate windowing strategies). The mean-shift procedure, thus, can be said to be a mode-seeking process. However, the mean-shift algorithm operates only on image intensities (be they scalar or vector valued) and does not account for neighborhood structure in images. Thus, mean shift resembles a kind of image-driven thresholding process (particularly in the algorithm proposed by Comaniciu, in which the density estimate is static as the algorithm iterates).
Another related body of research is by Weissman, who proposes the DUDE algorithm. DUDE addresses the problem of denoising data sequences generated by a discrete source and received over a discrete, memoryless channel. DUDE assumes that the source and the received data sequences take values from a finite population of symbols and that the transition probabilities over the channel are known. However, DUDE assumes no knowledge of the statistics of the source and yet performs (asymptotically) as well as any denoiser (e.g., one that knows the source distribution), thereby making DUDE more widely applicable. DUDE assigns image values based on the similarity of neighborhoods gathered from image statistics, which resembles the construction of conditional probabilities in the proposed method. However, the DUDE approach is limited to discrete-valued signals whereas the proposed method addresses continuous-valued signals, such as those associated with grayscale images. While the DUDE algorithm is demonstrably effective for removing replacement noise, it is less effective in the case of additive noise.
2 Summary
The present invention provides an unsupervised, information-theoretic, adaptive filter that improves the predictability of pixel intensities from their neighborhoods by decreasing their joint entropy. Accordingly, the present invention can programmatically discover the statistical properties of the signal and can thereby reduce noise in a wide spectrum of images and applications.
The system and method of the present invention uses probability density functions (PDFs) defined on the multidimensional sample space of image neighborhoods, in order to modify pixel intensities and filter or denoise images. In one embodiment of the present invention, a probability density function (PDF) can be used.
A sample space is the set of all possible outcomes of a random experiment (a random experiment being an experiment whose outcome is not certain). A random variable (RV) is a function that maps samples in the sample space to values. In this embodiment of the invention, the samples comprise pixel/neighborhood locations, and the values comprise the intensities. A probability density function (PDF) is a function on the output of a RV that gives the relative likelihood of the RV taking a particular value when the random experiment is performed.
One embodiment of the invention treats the collection of pixels in a neighborhood as a random variable which has an associated sample space and PDF. The shape/structure of the PDF is used to modify pixel intensities and thereby filter or process the image. The processing of images in a manner that uses multi-dimensional PDFs has not been done in the past.
The structure of the PDF is used to modify the intensities of center pixels based on the gradients of the PDF. The PDF, generated from samples in the entire image, provides the statistical distribution or likelihoods of intensities in neighborhoods. In one embodiment of the invention, only the center pixel for a neighborhood is modified by examining the gradient along one dimension (corresponding to that of the center pixel) to infer an intensity update for that pixel. Of course, if desired, more dimensions may be used or changes can be made to two or more pixels in the neighborhood.
In another embodiment of the invention, the pixel intensities are modified to optimize an information-theoretic measure, such as entropy, associated with the PDFs of image neighborhoods, thereby filtering or denoising the image. Entropy is a metric related to the shape of a PDF, and pixel intensities can be modified so as to reduce entropy (resulting in a corresponding change in the PDF) using gradient descent methods.
The present invention also provides automatic scale selection for the nonparametric multivariate density estimation of PDFs associated with image neighborhoods, by optimizing the particular information-theoretic measure (such as entropy) of the PDF with respect to the scale or width of the Parzen window. Specifically, a nonparametric density estimation technique (such as Parzen windowing) can be used in the present system and method to construct a PDF and in turn compute gradients of the entropy measure. The Parzen window technique constructs an estimate by superimposing kernels placed at the locations of each data sample in the multi-dimensional space. Parzen windowing entails the setting of the scale parameter for the kernels used in the estimation process. Determining the optimal value of this parameter is crucial for the success of the filtering method.
Many methods/applications with low dimensional feature spaces (e.g. 2 or 3 dimensions) operate by manually tuning the scale parameter. However, because the present invention relies on a sparsely populated high dimensional space, it is difficult to manually find appropriate parameter values. The present method is iterative and dynamic, and the derived scale parameter changes every iteration. Hence, this parameter is found via a data-driven approach. The present invention finds the optimal scale parameter value by viewing the entropy of the PDF as a function of the scale parameter and finding that value of the scale parameter that minimizes the entropy.
The present invention also uses an anisotropic distance metric in the sample space in order to construct rotationally invariant neighborhood masks. Square neighborhoods generate results with artifacts exhibiting preferences for grid-aligned features. A solution is to weight the intensities, making neighborhoods more isotropic. Such fuzzy weights can be incorporated by using an anisotropic distance metric in the multi-dimensional sample space. The weights correspond to a relatively isotropic kernel in the image space.
Another embodiment of the invention can use subspaces of image-neighborhood PDFs in order to process pixels at image boundaries. Images have finite extent, and some neighborhoods near the image boundaries lie partly outside the image, where there is no data. Typical image boundary conditions, e.g. replicating pixels or toroidal topologies, can produce neighborhood intensities that distort the feature-space statistics. The boundary neighborhoods can be handled by collapsing the feature space along the dimensions corresponding to the neighbors falling outside the image. Accordingly, the square regions crossing image boundaries can be cropped and processed by analyzing the associated PDFs in the lower-dimensional subspace.
3 Detailed Description
The present invention provides an unsupervised, information-theoretic, adaptive filter for image denoising, and denoises pixels by comparing pixel values with other pixels in the image that have similar neighborhoods. The underlying formulation relies on an information-theoretic measure of goodness combined with a nonparametric model of image statistics. The information-theoretic optimization measure relies on the entropy of the patterns of intensities in image regions. Entropy is a nonquadratic function of the image intensities, and therefore the filtering operation is nonlinear. The present system and method operates without a priori knowledge of the geometric or statistical structure of the signal or noise, but relies instead on some very general observations about the entropy of natural images. It does not rely on labeled examples to shape its output, and is therefore unsupervised. Because the method automatically derives a statistical representation of the image from the input data and constructs a filtering strategy based on that model, it is adaptive. Moreover, the free parameters are automatically adjusted using a data-driven approach and information-theoretic metrics. Because the present invention is nonlinear, nonparametric, adaptive, and unsupervised, it
can automatically reduce image noise in a wide spectrum of images and applications.
The present invention can be applied to images, which tend to be two-dimensional in nature. The term image is generally defined as a digital signal, consisting of one or more floating point or integer values, organized on a regular grid over any number of dimensions. The values are sometimes called intensities. Individual grid points are sometimes called pixels. In addition, the invention can be applied to one-dimensional data, such as audio data, volumetric three-dimensional data, or data obtained from multiple modalities (where the appropriate computing power is available). The dimensions of the data should not be considered to limit the applications of the present invention, because the filtering of the present invention may be used in any number of dimensions if desired.
This Section continues with an overview of the notation and a review of information-theoretic measures. It shows how these measures are applied to image neighborhoods, concluding with a high-level description of the present invention.
3.1 Information-Theoretic Measures
The notation in this discussion is as follows. An uppercase letter, e.g. X, denotes a random variable (RV), which may be scalar- or vector-valued, as necessary. A lowercase letter, e.g. x, denotes the value of a particular sample from the sample space of X. p(x) denotes the probability density function (PDF) for X. Applying a function f(·) to X yields a new RV, f(X). When the function is the PDF p(x) itself, we refer to the new RV as p(X).
The Shannon entropy of a RV measures the information, or uncertainty, associated with the RV. For a continuous RV X, the differential entropy h(X) is
    h(X) = −∫_{−∞}^{∞} p(x) log p(x) dx = −E_p[log p(X)]    (1)

where E_p denotes the expectation of the RV with X drawn from the PDF p(x).
The conditional entropy between two continuous RVs X and Y measures the uncertainty remaining in X after observing Y. It is defined as the weighted average of the entropies associated with the conditional PDFs.
    h(X|Y) = ∫ p(y) h(X|Y = y) dy = h(X, Y) − h(Y),    (2)

where h(X, Y) is the joint entropy of the RVs X and Y.
Besides entropy, several other measures exist that quantify the relative information content between variables. Mutual information, I(X, Y) = h(X) + h(Y) − h(X, Y), quantifies the uncertainty in X that is resolved after observing Y. For a set of RVs Z_1, . . . , Z_n, the multiinformation, M(Z_1, . . . , Z_n) = ∑_i h(Z_i) − h(Z_1, . . . , Z_n), generalizes mutual information and quantifies all combinations of functional dependencies.
All these information-theoretic quantities are potentially interesting for processing images based on neighborhood statistics. However, each measure has a distinct effect on the marginal and joint PDFs of the image neighborhoods, as discussed in the Section on Neighborhood Entropy. A neighborhood or image neighborhood, as used in the present discussion, can be generally defined as the set of points on the grid in proximity to a specified point; in other words, the set of grid points within a certain, usually small, distance of the specified point.
3.2 Neighborhood Entropy
The following discussion assumes a 2D image, but the formulation extends directly to higher dimensional images. A 2D grayscale image is a function I : S → ℝ, which assigns a real value to each element of a domain S ⊂ ℤ × ℤ, where typically S = {1, . . . , n} × {1, . . . , m}. We call each point in this domain a pixel. A region r(s) ⊂ S centered at a pixel location s is the ordered set (i.e. vector) of pixels {t : ||t − s||_∞ ≤ d}, where the set ordering is based on the values of the two spatial coordinates of the pixels t, and d denotes the region size. Thus r(s) is an n = (2d + 1)^2 dimensional vector comprising the locations of the pixels in the region centered at s. We write the vector r(s) = (s + o_1, . . . , s + o_n), where o_1, . . . , o_n correspond to the offsets from the center pixel to its neighbors, and o_c = 0 is the offset of the center pixel.
The present invention filters images by increasing the predictability of pixel intensities from the neighborhood intensities. This uses an entropy measure for intensities in image regions. We first define a continuous RV X : S → ℝ that maps each pixel in the image to its intensity. The statistics of X are the grayscale statistics of the image. Thus x(s) = I(s), and every intensity in the image is seen as a realization of the RV X. Next we define the random vector Y = (X(s + o_1), . . . , X(s + o_{c−1}), X(s + o_{c+1}), . . . , X(s + o_n)), which captures the statistics of intensities in pixel neighborhoods. Thus y(s) = (I(s + o_1), . . . , I(s + o_{c−1}), I(s + o_{c+1}), . . . , I(s + o_n)). Let Z = (Z_1, . . . , Z_n) ≡ (X, Y) be a random vector taking values z(s) = (I(s + o_1), . . . , I(s + o_n)), representing the pixel intensities in the region r(s) centered at s. We call the space in which the vectors z(s) lie the feature space. The present method relies on the statistical relationship between the intensity of each pixel and the intensities in a set of nearby pixels defined by its neighborhood. The strategy, for each pixel-neighborhood pair (X = x(s), Y = y(s)) from the image, is to reduce the entropy h(X|Y = y(s)) of the conditional PDF by manipulating the value of each center pixel x(s).
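As an illustration of this construction, the following sketch builds the feature vectors z(s) for every pixel of a grayscale image. The function name and the reflective boundary padding are illustrative assumptions made here for brevity; the present invention handles boundary neighborhoods differently (see Appendix D).

    import numpy as np

    def extract_feature_vectors(image, d=1):
        """Return an array whose row i is z(s_i), of length n = (2d+1)**2."""
        padded = np.pad(image, d, mode="reflect")  # placeholder boundary rule
        h, w = image.shape
        n = (2 * d + 1) ** 2
        z = np.empty((h * w, n))
        for i in range(h):
            for j in range(w):
                patch = padded[i:i + 2 * d + 1, j:j + 2 * d + 1]
                z[i * w + j] = patch.ravel()  # center pixel is entry n // 2
        return z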
The present system and method employs a gradient descent to minimize entropies of the conditional PDFs. In principle, the gradients of h(X|Y) have components corresponding to both the center pixel, x(s), and the neighborhood, y(s), and thus the entire neighborhood, (x(s), y(s)), would be updated for a gradient descent scheme. In practice we update only the center pixel of each neighborhood. That is, the gradient is projected onto the direction associated with the center pixel. Given this projection, the present method is a reweighted gradient descent on either the joint entropy, h(X, Y), or the conditional entropy, h(X|Y); the two are equivalent for this particular descent strategy.
This choice of entropy as a measure of goodness follows from several observations. First, the addition of any sort of noise to the image increases the joint entropy, because the sum of two random variables corresponds to a convolution of their PDFs in the probability space, which necessarily increases entropy. Thus, any kind of denoising method (for additive noise) generally decreases entropy. Of course, continuing entropy reduction by filtering forever might not be optimal and might also eliminate some of the normal variability in the signal (noiseless image). However, the present invention is motivated by the observation that noiseless images tend to have very low entropy relative to their noisy counterparts. Thus, the present entropy-reducing filter first affects the noise (in the noisy image) substantially more than the signal. Second, among the various measures of information content, the proposed entropy measure, h(X|Y = y(s)), makes sense for several reasons. For an image, h(X|Y = y(s)) is low when the center pixels, x(s), are predictable from their neighborhoods, y(s). However, h(X|Y = y(s)) will also be low when the image by itself is predictable, e.g. an image with a constant intensity. Although maximizing mutual information (I(X, Y)) and multiinformation (M(Z_1, . . . , Z_n)) penalizes joint entropy, it rewards higher entropy among the individual RVs. This tendency towards higher entropy stems from the terms h(X) (in I(X, Y)) and ∑_i h(Z_i) (in M(Z_1, . . . , Z_n)). Thus, these information measures tend to perform denoising and contrast enhancement simultaneously. For many images, enhancement and denoising are not compatible goals. Furthermore, additive noise can, in some cases, increase the multiinformation measure.
3.3 High-Level Structure of the Method
The high-level structure of the method of the present invention is as follows.
1. The noisy input image, namely I, consists of a set of intensities x(s). These values form the initial values of a sequence of images I^0, I^1, I^2, . . ..

2. Using the image I^m, construct another image (with the same dimensions), V^m, composed of the intensity vectors, z^m(s), of length n = (2d + 1)^2.

3. For each pixel z^m(s) ≡ (x^m(s), y^m(s)) in V^m, estimate the PDF p(x|Y = y^m(s)) and compute the derivative of h(X|Y = y^m(s)) with respect to x^m(s).

4. Construct a new image I^{m+1} consisting of pixel values x^{m+1}(s) using finite forward differences on the gradient descent: x^{m+1}(s) = x^m(s) − λ ∂h/∂x^m(s).
5. Based on a suitable stopping criterion, terminate, or go to Step 2. (Appendix B discusses more about stopping criteria.)
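To make the flow of these five steps concrete, the following sketch arranges them in a loop, reusing extract_feature_vectors from the sketch in Section 3.2. The helpers entropy_gradient and stop are placeholders for the machinery developed in Sections 4 and 5 and Appendix B, not an actual API.

    def filter_image(noisy, d=2, lam=0.1, max_iters=100):
        image = noisy.astype(float).copy()         # Step 1: I^0
        for m in range(max_iters):
            V = extract_feature_vectors(image, d)  # Step 2: vectors z^m(s)
            grad = entropy_gradient(V)             # Step 3: dh/dx^m(s) per pixel
            image = image - lam * grad.reshape(image.shape)  # Step 4
            if stop(image, m):                     # Step 5: stopping criterion
                break
        return image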
This method includes two important parameters. The first is the size of the image neighborhoods (the parameter d in the previous discussion). Typically, values of 1 or 2 suffice (giving regions of sizes 3 × 3 or 5 × 5 and feature spaces of dimensions 9 or 25, respectively). However, as we shall see in later sections, more complex or noisy images may require larger neighborhoods for reliable denoising. The second free parameter is in the stopping criterion. For most natural images we would not expect the steady states of this filter to be an acceptable result for a denoising task; that is, we expect some degree of variation in neighborhoods. However, the method is consistent with several conventional techniques for enforcing fidelity to the input data; such mechanisms inevitably introduce an additional parameter.
Although each step of the method (Step 4 above) is merely a gradient descent on the center pixel of each neighborhood, the interactions from one iteration to the next are quite complicated. The updates on the center-pixel intensities in Step 4 affect, in the next iteration, not only the center pixels but also the neighborhoods. This is because the image regions, r(s) ∀s, overlap, and the set of pixels that form the centers of regions is the same as that which forms the neighborhoods. Thus, the present filtering consists of two kinds of processes. One is the first-order optimization process, which computes updates for pixels based on their
neighborhoods. The other second-order process causes updates of the neighborhoods based on the role of those pixels as centers in the previous iteration. The result of the filtering can be seen as the quest for the steady state of the combination of these two processes.
It may also be helpful to note that the present invention exploits the Markov property of images, but in a different context. Rather than imposing a particular model on the image, the present invention estimates the relevant conditional probability density functions (PDFs) from the input data and updates pixel intensities to decrease the randomness of these conditional PDFs. In addition, some past filtering work has relied on the hypothesis that natural images exhibit some regularity in neighborhood structure, but the present method discovers this regularity for each image individually in a nonparametric manner.
The present invention also determines the mathematical relationship between a mean-shift procedure and entropy reduction, and is thereby a generalization of the mean-shift method: one that incorporates image neighborhoods to reduce the entropy of the associated conditional PDFs.
4 Nonparametric Multivariate Density Estimation
Entropy optimization entails the estimation of the PDFs of the RVs involved. For a (2d + 1) × (2d + 1) pixel neighborhood, this entails density estimation in a (2d + 1)^2-dimensional space. This introduces the challenge of high-dimensional, scattered-data interpolation, even for modest-sized image neighborhoods (d = 2 yields a 25D space). High-dimensional spaces are notoriously challenging for data analysis (a difficulty regarded as the curse of dimensionality) because they are so sparsely populated. Despite theoretical arguments suggesting that density estimation beyond a few dimensions is impractical, the empirical evidence from the statistics literature is more optimistic. The results in this paper confirm that observation.
One of the advantages of the present invention is that the random vector Z ≡ (X, Y) is sphered, by definition. A sphered random vector is one for which the marginal PDFs of each individual RV have the same mean and variance. Thus, each marginal PDF is simply the grayscale intensity PDF, p(x), of the image. The literature shows that sphered RVs lend themselves to more accurate density estimates. Also, this invention relies on the neighborhoods in natural images having a lower-dimensional topology in the multi-dimensional feature space. This is also a general property for multivariate data. Therefore, locally (in the feature space) the PDFs of images are lower dimensional objects that lend themselves to better density estimation.
The literature shows Parzen windows to be an effective nonparametric density estimation technique. The Parzen-window density estimate p(z), in an n-dimensional space, is

    p(z) ≈ (1/|A|) ∑_{s_j ∈ A} G(z − z_j, Φ)    (3)
where |A| denotes the cardinality of the set A, and z_j is a shorthand for z(s_j). The set A is a randomly selected subset of the sample space. One embodiment of the invention chooses G(z, Φ) as the n-dimensional Gaussian

    G(z, Φ) = (2π)^{−n/2} |Φ|^{−1/2} exp(−z^T Φ^{−1} z / 2)    (4)

where Φ is the n × n covariance matrix. The Gaussian kernel is not a unique choice, but it suits this application for several reasons: it is smooth, relatively efficient to compute (approximated by a look-up table), and entails a small number of free parameters. Other kernels or distributions may be chosen if a given distribution fits a certain application. Having no a priori information on the structure of the PDFs, we choose an isotropic Gaussian of standard deviation σ, i.e. Φ = σ²I, where I is the n × n identity matrix. A proper choice of the parameters σ and |A|, which determine the quality of the density estimate, is important.
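A minimal sketch of the estimate in Equation 3 with the isotropic choice Φ = σ²I follows; the function name is illustrative, and the rows of A hold the sample vectors z_j.

    import numpy as np

    def parzen_density(z, A, sigma):
        """Equation 3: estimate p(z) from the rows of A (the sample set)."""
        n = z.shape[0]
        sq_dists = np.sum((A - z) ** 2, axis=1)          # ||z - z_j||^2
        norm = (2.0 * np.pi * sigma ** 2) ** (n / 2.0)   # Gaussian constant
        return np.mean(np.exp(-sq_dists / (2.0 * sigma ** 2)) / norm)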
Section 5 discusses a strategy for computing the optimal parameter values.
4.1 A Stochastic Approximation for Entropy
Equation 1 gives entropy as an expectation of a RV. The approximation for entropy follows from the result that the sample mean converges, almost surely, to the expectation as the number of samples tends to infinity. Thus
    E_p[log p(Z)] ≈ (1/|B|) ∑_{s_i ∈ B} log p(z_i)    (5)

Equations 1, 3, and 5 give

    h(Z) ≈ −(1/|B|) ∑_{s_i ∈ B} log [ (1/|A|) ∑_{s_j ∈ A} G(z_i − z_j, Φ) ]    (6)
The set A, which generates the density estimate p(z_i), should not contain the point s_i itself, because this biases the entropy estimates. The set B consists of the samples that will move at each iteration, and therefore it must consist of every sample in the image, i.e. B = S. The samples in set A are, typically, a small fraction of those in B, chosen at random. The relatively small cardinality of A has two important implications. First, it significantly reduces the computational cost for the entropy estimation, from O(|B|^2) to O(|A||B|). Second, because A is different for each element of B at each iteration, the entropy estimate, h(Z), is stochastic. Hence, a gradient descent entropy optimization technique results in a stochastic-gradient method. The stochastic gradient effectively overcomes the effects of spurious local maxima introduced in the Parzen-window density estimate using finitely many samples. Thus, the proposed entropy-estimation scheme is important not only for computational efficiency but also for effective entropy minimization.
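The sketch below, reusing parzen_density from Section 4, computes the stochastic estimate of Equation 6: for each sample in B, a fresh random subset A is drawn, excluding the sample itself to avoid bias. The function name and default subset size are illustrative.

    import numpy as np

    def stochastic_entropy(Z, sigma, size_A=500, rng=None):
        """Equation 6: Z holds one feature vector per row (the set B)."""
        rng = rng or np.random.default_rng()
        total = 0.0
        for i in range(Z.shape[0]):
            k = min(size_A, Z.shape[0] - 1)
            idx = rng.choice(Z.shape[0], size=k, replace=False)
            idx = idx[idx != i]                   # exclude s_i from A
            total += np.log(parzen_density(Z[i], Z[idx], sigma))
        return -total / Z.shape[0]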
5 Conditional Entropy Minimization
Entropy minimization in the present method relies on the derivative of the entropy with respect to the center-pixel value of the samples in B. Each pixel intensity in the image undergoes a gradient descent, based on the entropy of the conditional PDF estimated from A. The gradient descent for x_i ≡ x(s_i), for each s_i ∈ B, is
    dx_i/dt = −∂h(X|Y = y_i)/∂x_i = (1/|B|) ∂ log p(x_i|y_i)/∂x_i = (1/|B|) ∂ log p(z_i)/∂x_i    (7)
where ∂z_i/∂x_i is a projection operator that projects an n-dimensional vector z_i onto the dimension associated with the center-pixel intensity x_i. Figure 2 elucidates this process.
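A sketch of the projected gradient in Equation 7: for the Gaussian Parzen estimate, the gradient of log p(z_i) is a kernel-weighted sum of sample differences, and only its center-pixel component (index c) is retained. Names are again illustrative, and A holds the sample vectors z_j as rows.

    import numpy as np

    def center_pixel_gradient(z_i, A, sigma, c):
        """Return d(log p(z_i))/dx_i for the Gaussian Parzen estimate."""
        diffs = z_i - A                                        # (|A|, n)
        g = np.exp(-np.sum(diffs ** 2, axis=1) / (2 * sigma ** 2))
        w = g / g.sum()                                        # kernel weights
        return -(w @ diffs[:, c]) / sigma ** 2                 # projected component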
If we choose σ > 0, the entropy for a finite set of samples is always bounded. Because we perform a (projected) gradient descent on a bounded energy function, the process converges (for sufficiently small time steps). Indeed, analysis of simple examples shows the existence of nontrivial steady states (e.g. an image which is a discrete sampling of a linear function f(x, y)). Empirical evidence, using real and synthetic images in Section 6, shows that the filtering method does sometimes converge to interesting results. However, for many applications, convergence is not the goal; as with many other iterative filtering strategies, several iterations of the gradient descent are sufficient for acceptable denoising.
One issue in the present invention is the selection of the scale or size of the Parzen window (the standard deviation of the Gaussian). The Parzen-window density estimate, using a finite number of samples, shows a great deal of sensitivity to different values of σ. Thus, the particular choice of the standard deviation σ, and thereby Φ, for the Gaussian in Equation 3, is a crucial factor that determines the behavior of the entire process of entropy optimization. Furthermore, this choice is related to the sample size |A| in the stochastic approximation. For a particular choice of |A|, we propose to use the σ that minimizes the joint entropy, which we call the optimal scale for a data set. This can be determined automatically at each iteration of the processing. Our experiments show that for sufficiently large |A| the entropy estimates and optimal scale are virtually constant, and thus |A| can also be determined directly from the input data.
A second issue is the choice of stopping criterion. For this we refer to the vast literature on nonlinear methods, which presents an array of choices; these are discussed in more detail in the Appendix. For some of the simpler examples in this paper, steady-state results are quite good. However, for more complicated images (e.g. real-world images) we choose the number of iterations empirically, based on subjective impressions of the quality of the results.
Another issue is the shape of the image neighborhoods. The square neighborhoods described in Section 3.2 produce anisotropic artifacts, and favor features that are aligned with the cardinal directions. To obtain isotropic filtering results we use a metric in the feature space that controls the influence of each neighborhood pixel so that the resulting mask is more rotationally symmetric. In this way, directions in the feature space corresponding to the corners of the neighborhood collapse so that they do not influence the filtering. A similar strategy enables us to handle image boundaries without distorting the statistics of the image. That is, pixels at image boundaries rely on the statistics in lower-dimensional subspaces corresponding to the set of neighborhood pixels lying within the input image.
Finally, the issue of sample selection for the set A also influences the behavior of the filter. While a uniform sampling over the image produces acceptable results for many images, our experiments have shown that the processing of some images (particularly those with spatially inhomogeneous image statistics) benefits from a biased sampling strategy, which estimates the PDF using samples that lie near the pixel being processed. We have found that a Gaussian distribution (centered at the pixel in question) works quite well, and that the results are not particularly sensitive to the size or scale of this sampling function.
5.1 Relationship To the Mean-Shift Procedure
The mean-shift procedure moves each sample in a feature space to a weighted average of other samples using a weighting scheme that is similar to Parzen windowing. This can also be viewed as moving samples uphill on a PDF. Comaniciu and Meer propose an iterative mean-shift method for image intensities (where the PDF does not change with iterations) that provides a mechanism for image segmentation. Each grayscale (or vector) pixel intensity is drawn toward a local maximum in the grayscale (or vector-valued) histogram.
This section shows how this method relates to the mean-shift procedure. We begin by establishing the relationship between the mean-shift procedure and gradient-descent entropy minimization. Consider, as an example, a gradient descent on the entropy of the grayscale pixel intensities. This gives
    dx_i/dt = −∂h(X)/∂x_i    (8)
            = −(1/(|B|σ²)) ∑_{s_j ∈ A} [ G(x_i − x_j, Φ) / ∑_{s_k ∈ A} G(x_i − x_k, Φ) ] (x_i − x_j)    (9)

Finite forward differences x^{m+1}(s) = x^m(s) − λ ∂h/∂x^m(s) with a time step λ = |B|σ² give

    x_i^{m+1} = ∑_{s_j ∈ A} x_j^m W(x_i^m, x_j^m, A, Φ) = ∑_j w_j x_j^m    (10)
Thus each new pixel value is a weighted average of a selection of pixel values from the previous iteration with weights w_j > 0 such that ∑_j w_j = 1. This is exactly the mean-shift update proposed by Fukunaga. Note that here the PDFs on which the samples climb get updated after every iteration. Thus the mean-shift method is a gradient descent on the entropy associated with the grayscale intensities of an image. We observe that samples x(s) are being attracted to every other sample, with a weighting term that diminishes with the distance between the two samples. The updates in this invention have the same form, except
that the weights are influenced not only by the distances/similarities between intensities x(s) but also by the distances/similarities between the neighborhoods y(s). That is, pixels in the image with similar neighborhoods have a relatively larger impact on the weighted means that drive the updates of the center pixels.
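For comparison, a sketch of the plain intensity mean-shift update of Equation 10, with the time step λ = |B|σ² folded in: each new value is a Gaussian-weighted average of sample intensities. In the present invention the weights would instead be computed from the full neighborhood vectors z; the function name is illustrative.

    import numpy as np

    def mean_shift_step(x, samples, sigma):
        """Equation 10: one update of a scalar intensity x."""
        g = np.exp(-(x - samples) ** 2 / (2 * sigma ** 2))
        w = g / g.sum()                 # weights w_j >= 0 with sum_j w_j = 1
        return w @ samples              # weighted average of sample intensities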
6 Examples
This Section provides examples of filtering on real and synthetic images and analyzes the behavior of the present invention on the same. Parzen windowing in all of the examples uses, unless otherwise stated, a uniform random sampling of the image domain with 500 samples (i.e. |A| = 500), as explained in Appendix E. The noise in the synthetic-image examples is, unless otherwise stated, additive, zero mean, independent, and Gaussian. Furthermore, the amount of noise is high enough that thresholding the noisy image can never yield the noiseless image. Note that the present invention, in its formulation, does not assume any particular distribution on the noise. Because of interactions of neighborhoods from one iteration to the next, the time step λ = |B|σ² can lead to oscillations in the results. We have found that a smaller time step eliminates this effect. The size of the Parzen window σ is recomputed after each iteration to minimize the entropy of the processed image.
Figure 3 shows the result of 11 iterations of the present invention on the Lena image with spatially local sampling (explained in Appendix E). The present method preserves and enhances fine structures, such as strands of hair or feathers in the hat, while removing random noise. The results are noticeably better than any of those obtained using the other methods shown in Figure 1. A relatively small number of iterations produces subjectively good results for this image; further processing oversimplifies the image and removes significant details.
The fingerprint image in Figure 4 shows another example of the structure-enhancing tendencies of the present system and method, which enhances the contrast of the light and
dark lines without significant shrinkage. The method performs a kind of multidimensional classification of image neighborhoods; features in the top-left are therefore lost because they resemble the background more than the ridges. Figure 5 presents the results of other denoising strategies for visual comparison with the present method. We see that the piecewise-smooth image models associated with anisotropic smoothing, bilateral filtering, and curvature flow (Figure 5(a)-(c)) are clearly inappropriate for this image. A mean-shift procedure (Figure 5(d)) on image intensities (with the PDF not changing with iterations) yields a thresholded image retaining most of the noise. Weickert's coherence enhancing filter (which is as well suited to this image as virtually any other) does not succeed in retaining or enhancing the light-dark contrast boundaries, and yet it forces elongated structures to grow or connect (Figure 5(e)). Thus, the present method (Figure 5(f)) appears to remain more faithful to the underlying data.
Figure 6 shows the results of processing an MRI image of a human head for 8 iterations. This example employs the local sampling strategy (explained in Appendix E), and shows the ability of the present invention to adapt to a variety of grayscale features. It enhances structure while removing noise, without imposing a piecewise constant intensity profile. As with the Lena example, more iterations tend to slowly erode important features.
Figure 7 gives a denoising example involving large amounts of noise. The checks are 4 × 4 pixels in size and the defined neighborhoods are 5 × 5 pixels. All of the edges can be restored, and the corners and the image boundaries show no signs of artifacts. Figure 10(a) shows that the RMS error (root of the mean squared difference between pixel intensities in the filtered image and the noiseless image) decreases by 90 percent. Figure 12(c) shows that the joint entropy and the Parzen window size, σ, decrease monotonically as the filtering progresses. For this example, a multi-threaded implementation takes roughly 1 hour for 100 iterations with a pair of Intel® Xeon™ 2.66 GHz Pentium 4 processors (shared memory).
Figure 8 shows the results of applying this invention to another corrupted binary image. The smoothness of the resulting circle boundary demonstrates the effectiveness of this method in preserving rotational invariance, as explained in Appendix C. The edges of the square are also well restored, but, unlike the checkerboard example, the corners are rounded. The many similar corners in the checkerboard image form a well defined pattern in feature space, whereas the corners of the square are unique features (in that image); hence the method treats them more like noise, and those points in the feature space are attracted to more prevalent (less curved) features. Figure 9 shows the application of 15 iterations of this method to an image of hand-drawn curves (with noise). The present invention learns the pattern of black-on-white curves and forces the image to adhere to this pattern. However, the present method may make mistakes when curves become too close, exhibit a very sharp bend, or when the noise introduces ambiguous gaps.
To study the effect of the neighborhood size we performed filtering with different neighborhood sizes on the checkerboard image from Figure 7, but with significantly more noise. Figure 10(b) shows comparisons of the performance with three different region sizes, i.e. 3 × 3, 5 × 5 and 7 × 7, that reflect the advantage of larger neighborhoods. For higher levels of noise, we find that larger neighborhoods are able to better discern patterns in image regions and yield superior denoised images.
The present method is effective for removing various kinds of noise. Filtering of the checkerboard image with correlated noise (obtained by applying a low-pass filter to zero-mean, independent, Gaussian noise) shows a significant improvement in RMS error (see Figure 10(c)), but the reduction in RMS error is not as good as in the examples with uncorrelated noise. Correlated noise modifies the PDF p(x, y) by creating more local maxima and thereby fractures the manifold associated with the original data. The strategy for automatically choosing the Parzen window size, σ, together with the joint entropy minimization scheme, is unable to remove these new feature-space structures. We can verify this by artificially increasing the size of the Parzen window. Multiplying the optimal σ by 2 gives a lower RMS error (see Figure 10(c)). This example brings out a weakness in the present filter and the choice of a single isotropic Parzen window, an area of possible improvement. Experiments also show that this method is very effective at removing replacement noise (e.g. salt and pepper noise), a somewhat easier denoising task (see Figure 10(d)).
Appendix
A Automatic Scale Selection for Parzen Windowing
Figure 11 shows that Parzen windowing, using a finite number of samples, is very sensitive to the value of σ. Many methods/applications with low dimensional feature spaces (e.g. 2 or 3 dimensions) operate by manually tuning the scale parameter. However, because the present method relies on a sparsely populated high dimensional space, it is very difficult to manually find values for σ that properly "connect" the data without excessively smoothing the PDF. Due to the iterative and dynamic approach of this invention, the best scale parameter changes every iteration, and σ is found via a data-driven approach. Because the goal is to minimize joint entropy, a logical choice is the value of σ that minimizes the same. Figure 12(a) confirms the existence of a unique minimum. Figure 12(b) shows that the choice of σ is not sensitive to the value of |A| for sufficiently large |A|; hence |A| can be automatically fixed to an appropriate value before the filtering begins.
Both differential (Newton's method) and discrete (Fibonacci search) methods have been implemented, and both offer acceptable results. Figure 12(c) depicts the decreasing trend for σ as the filtering progresses, which is common to every example and is consistent with entropy-reducing action bringing samples closer in the feature space.
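A sketch of this data-driven selection follows, using SciPy's bounded one-dimensional minimizer in place of the Newton or Fibonacci methods mentioned above, and reusing the stochastic_entropy sketch from Section 4.1. The bracketing interval is an illustrative assumption.

    from scipy.optimize import minimize_scalar

    def optimal_sigma(Z, lo=1e-3, hi=1.0):
        """Choose the Parzen window size that minimizes the joint entropy."""
        result = minimize_scalar(lambda s: stochastic_entropy(Z, s),
                                 bounds=(lo, hi), method="bounded")
        return result.x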
B Stopping Criteria
Like many iterative filtering strategies, the steady states of this method can produce simple images that do not adequately reflect important structures in the input image. There are many options for stopping criteria. One possibility is to use an information-theoretic choice based on h(X, Y) or σ, both of which quantify the complexity in image neighborhoods. Figure 12(c) shows that both decrease monotonically. Hence, a stopping rule could be based on their absolute values, values relative to the input, or relative change between iterations.
Another approach is to rely on knowledge of the noise level in the input image. In this case the method may terminate when the residual (RMS difference between input and output) equals the noise level. Lastly, because this method is a filtering process on the image, termination can be based on visual inspection of images in the filtered image sequence.
An alternative is to use the system and method of the present invention as part of a reconstruction process where entropy minimization acts as a prior that is combined with an image-data or fidelity term. This would allow the method to run to a steady state, and rather than a stopping criterion one can choose the relative weights of the prior and data, a so-called meta parameter. If the noise level is known, one can avoid the meta parameter and treat residual magnitude as a constraint.
C Rotational Invariance
Rotational invariance does not follow from the present method's formulation, as illustrated in FIG. 13, because the samples are taken on a rectilinear grid. Square neighborhoods generate results with artifacts exhibiting preferences for grid-aligned features. A solution is to weight the intensities, making neighborhoods more isotropic. Fuzzy weights can also be incorporated by using an anisotropic feature-space distance metric, ||z||_M = √(z^T M z), where M is a diagonal matrix. The diagonal elements, m_1, . . . , m_n, are the appropriate weights on the influence of the neighbors on the center pixel. To keep the RV Z sphered (which aids in density estimation; Section 4), the weights used should be somewhat homogeneous. Figure 13(a)-(b) shows a disk-shaped mask that achieves this balance. The intensities near the center are unchanged (m_i = 1) while the intensities near the corners are weighted by the fraction (m_i < 1) of the pixel area overlapping with a hypothetical circle touching all four sides of the square neighborhood. The proposed isotropic mask is a grayscale version of the DUDE strategy of using a binary disc-shaped mask for discrete (half-toned) images. Note that scaling the center-pixel intensity more than its neighbors leads to a space (X', Y') elongated along X'; in the limit, when all neighbors are weighted zero, this leads to thresholding as in the mean-shift method.
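The fractional-area weights described above can be approximated by supersampling each pixel against the inscribed circle. The sketch below is illustrative (the supersampling rate is an assumption, not part of the original description), and returns the diagonal of the metric matrix M.

    import numpy as np

    def disk_mask(d, supersample=16):
        """Approximate m_i: fraction of each pixel inside the inscribed circle."""
        size = 2 * d + 1
        radius = size / 2.0                 # circle touches all four sides
        ticks = (np.arange(supersample) + 0.5) / supersample
        m = np.zeros((size, size))
        for i in range(size):
            for j in range(size):
                xs = i + ticks[:, None] - radius   # sub-pixel coordinates
                ys = j + ticks[None, :] - radius   # relative to the center
                m[i, j] = np.mean(xs ** 2 + ys ** 2 <= radius ** 2)
        return m.ravel()                    # diagonal entries of M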
D Anisotropic Neighborhoods at Boundaries
Typical image boundary conditions, e.g. replicating pixels or toroidal topologies, can produce neighborhoods that distort the feature-space statistics. Boundary neighborhoods can be handled using a strategy similar to that in Appendix C, by collapsing the feature space along the dimensions corresponding to the neighbors falling outside the image. The square regions crossing image boundaries can be cropped and processed in the lower-dimensional subspace. This strategy results in important modifications in two stages of processing. First, the cropped intensity vectors take part in a mean-shift process reducing entropies of the conditional PDFs in the particular subspace where they reside. Second, a Parzen window size σ can be chosen based only on the regions lying completely in the image interior.
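In code terms, collapsing the feature space amounts to dropping the dimensions whose offsets fall outside the image before estimating the density. The fragment below is illustrative: valid_dims would list the in-image neighborhood offsets for a given boundary pixel, and the reduced vectors feed the same Parzen machinery sketched in Section 4.

    def collapse_to_subspace(z, A, valid_dims):
        """Keep only the feature dimensions lying inside the image."""
        return z[valid_dims], A[:, valid_dims]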
E Selecting Random Samples
Parzen-window density estimation entails the selection of a set of samples belonging to the density in question, as seen in Section 4. Nominally, this set comprises random samples drawn from a uniform PDF on the sample space. This strategy works well if the image statistics are more or less uniform over the domain, e.g. the fingerprint image in Figure 4. However, it fails if the statistics of different parts of the image are diverse, e.g. the Lena image in Figure 3. This is because distant parts of such an image produce samples lying in distant regions of the feature space. To alleviate this problem we estimate p(z(s_i)) by selecting random samples s_j in a way that favors nearby points in the image domain, using a Gaussian distribution centered at s_i with a relatively small standard deviation (10 pixels). This strategy is appropriate for images having locally consistent neighborhood statistics.
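A sketch of this local sampling, drawing pixel locations from a Gaussian centered at the pixel being processed; the clipping at the image border and the function name are illustrative simplifications.

    import numpy as np

    def local_sample_indices(center, shape, num=500, std=10.0, rng=None):
        """Draw sample locations favoring points near `center` (row, col)."""
        rng = rng or np.random.default_rng()
        rows = np.clip(np.round(rng.normal(center[0], std, num)), 0, shape[0] - 1)
        cols = np.clip(np.round(rng.normal(center[1], std, num)), 0, shape[1] - 1)
        return rows.astype(int), cols.astype(int)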