WO2019076867A1 - Semantic segmentation of an object in an image - Google Patents
- Publication number
- WO2019076867A1 (PCT/EP2018/078192, EP2018078192W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- image frame
- neuronal network
- predefined
- high priority
- convolutional neuronal
- Prior art date
Classifications
- G06V20/58—Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
- G06F18/24—Classification techniques
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
- G06V20/70—Labelling scene content, e.g. deriving syntactic or semantic representations
Abstract
The present invention relates to a method for semantic segmentation of an object (5, 8) in an image, comprising the following method steps: - consecutively acquiring image frames (6, 7, 11), - inputting a first image frame (6) of the consecutively acquired image frames (6, 7, 11) into a convolutional neuronal network in real time, - examining by the convolutional neuronal network whether any object (5, 8) can be detected in the first image frame (6), - semantically classifying the detected objects (5, 8) by the convolutional neuronal network by assigning each detected object (5, 8) to one of a list of predefined object classes, - providing a lookup-table with a priority list which comprises a priority level for each of the predefined object classes, respectively, - determining a respective priority level of the detected objects (5, 8) by comparison with the lookup-table, - determining one or more object(s) (5) which have a predefined priority level, - determining a high priority area (9) of the image frame (6) which relates to the or an object (5) with the predefined priority level, - inputting a next image frame (7) of the consecutively acquired image frames (6, 7, 11) into the convolutional neuronal network in real time, - analyzing only the high priority area (9) in the next image frame (7) by the convolutional neuronal network. In this way, the invention provides an efficient CNN architecture design that can be applied to an automotive camera (3) with a large field of view, taking advantage of that large field of view.
Description
Semantic Segmentation of an Object in an Image
The invention relates to a method for semantic segmentation of an object in an image, comprising the following method steps:
consecutively acquiring image frames,
inputting a first image frame of the consecutively acquired image frames into a convolutional neuronal network in real time, and
examining by the convolutional neuronal network whether any object can be detected in the first image frame for semantic segmentation.
One of the most fundamental problems in automotive computer vision is the semantic segmentation of objects in an image. Semantic segmentation refers to the problem of associating every pixel with its corresponding object class. In recent times, there has been a surge of convolutional neural network (CNN) research and design, aided by increases in computational power in computer architectures and the availability of large annotated datasets.
CNNs are highly successful at classification and categorization tasks, but much of the research is on standard photometric RGB images and is not focused on embedded automotive devices. Automotive hardware devices need to meet low power consumption requirements and thus offer only limited computational power.
In machine learning, a convolutional neural network is a class of deep, feed-forward artificial neural networks that has successfully been applied to analyzing visual imagery. CNNs use a variation of multilayer perceptrons designed to require minimal preprocessing. Convolutional networks were inspired by biological processes in that the connectivity pattern between neurons resembles the organization of the animal visual cortex. Individual cortical neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. The receptive fields of different neurons partially overlap such that they cover the entire visual field.
CNNs use relatively little pre-processing compared to other image classification algorithms. This means that the network learns the filters that in traditional algorithms were hand-engineered. This independence from prior knowledge and human effort in feature design is a major advantage. CNNs have applications in image and video recognition, recommender systems and natural language processing.
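To make the preceding description concrete, the following minimal sketch shows a tiny fully convolutional network that emits one class score per pixel, which is the output form semantic segmentation requires. It is purely illustrative: the patent discloses no architecture or framework, so the use of PyTorch, the layer sizes and the four example classes are all assumptions.

```python
# Minimal fully convolutional network for per-pixel classification (illustrative
# only; the patent does not disclose a concrete architecture). Assumes PyTorch.
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                          # halve spatial resolution
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        # a 1x1 convolution maps features to per-pixel class scores
        self.classifier = nn.Conv2d(32, num_classes, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scores = self.classifier(self.encoder(x))
        # upsample back to the input resolution: one class score per pixel
        return nn.functional.interpolate(scores, size=x.shape[2:],
                                         mode="bilinear", align_corners=False)

net = TinySegNet(num_classes=4)            # e.g. person, car, wall, tree
frame = torch.randn(1, 3, 240, 320)        # dummy RGB image frame
class_map = net(frame).argmax(dim=1)       # (1, 240, 320): a label per pixel
```

A production automotive network would of course be deeper and optimized for embedded hardware, in line with the low-power constraints noted above.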
In this respect, US 2017/0200063 A1 teaches applying a set of sections spanning a down-sampled version of an image of a road-scene to a low-fidelity classifier to determine a set of candidate sections for depicting one or more objects in a set of classes. The set of candidate sections of the down-sampled version may be mapped to a set of potential sectors in a high-fidelity version of the image. A high-fidelity classifier may be used to vet the set of potential sectors, determining the presence of one or more objects from the set of classes. The low-fidelity classifier may include a first convolutional neural network trained on a first training set of down-sampled versions of cropped images of objects in the set of classes. Similarly, the high-fidelity classifier may include a second CNN trained on a second training set of high-fidelity versions of cropped images of objects in the set of classes.
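The mechanics of such a two-stage cascade can be sketched as follows. This is a hedged reconstruction, not code from US 2017/0200063 A1: the classifier callables `low_fi` and `high_fi` and the fixed-grid sectioning are hypothetical simplifications.

```python
import numpy as np

def cascade_classify(image: np.ndarray, low_fi, high_fi,
                     scale: int = 4, section: int = 32):
    # Stage 1 runs a cheap classifier over sections of a down-sampled copy;
    # stage 2 vets only the flagged sectors at full resolution.
    # low_fi(patch) -> bool and high_fi(patch) -> label-or-None are assumed
    # callables, not APIs defined by the cited patent.
    small = image[::scale, ::scale]                   # crude down-sampling
    step = section // scale
    detections = []
    for y in range(0, small.shape[0] - step + 1, step):
        for x in range(0, small.shape[1] - step + 1, step):
            if not low_fi(small[y:y + step, x:x + step]):
                continue                              # cheap classifier says "empty"
            Y, X = y * scale, x * scale               # map section -> full-res sector
            label = high_fi(image[Y:Y + section, X:X + section])
            if label is not None:                     # expensive classifier confirms
                detections.append(((X, Y, section, section), label))
    return detections
```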
From US 2017/0099200 A1 it is known that data is received characterizing a request for agent computation of sensor data. The request includes a required confidence and required latency for completion of the agent computation. Agents to query are determined based on the required confidence. Data is transmitted to query the determined agents to provide analysis of the sensor data.
US 9,704,054 B1 describes that image classification and related imaging tasks performed using machine learning tools may be accelerated by using such tools to associate an image with a cluster of labels or categories, and then to select one of the labels or categories of the cluster as associated with the image. The clusters of labels or categories may comprise labels that are mutually confused for one another, e.g. two or more labels or categories that have been identified as associated with a single image. By defining clusters of labels or categories, and configuring a machine learning tool to associate an image with one of the clusters, processes for identifying labels or categories associated with images may be accelerated because computations associated with labels or categories not included in the cluster may be omitted.
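The acceleration described there amounts to restricting the expensive per-label computation to one cluster of mutually confusable labels. A minimal, assumed sketch follows; the cluster table, `coarse_model` and `fine_score` are hypothetical stand-ins, not APIs from US 9,704,054 B1.

```python
# Hedged sketch of cluster-based label selection. `coarse_model` maps an image
# to a cluster id; `fine_score` scores an image against a single label.
CLUSTERS = {0: ["wolf", "husky", "malamute"], 1: ["car", "van", "truck"]}

def classify(image, coarse_model, fine_score):
    cluster = CLUSTERS[coarse_model(image)]     # cluster of confusable labels
    # labels outside the cluster are never scored, which saves computation
    return max(cluster, key=lambda label: fine_score(image, label))
```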
It is an objective of the present invention to provide an efficient CNN architecture design that can be applied to an automotive camera with a large field of view while taking advantage of that large field of view.
This object is addressed by the subject matter of the independent claims. Preferred embodiments are described in the sub claims.
Therefore, the invention provides a method for semantic segmentation of an object in an image, comprising the following method steps:
consecutively acquiring image frames,
inputting a first image frame of the consecutively acquired image frames into a convolutional neuronal network in real time,
examining by the convolutional neuronal network whether any object can be detected in the first image frame,
semantically classifying the detected objects by the convolutional neuronal network by assigning each detected object to one of a list of predefined object classes,
providing a lookup-table with a priority list which comprises a priority level for each of the predefined object classes, respectively,
determining a respective priority level of the detected objects by comparison with the lookup-table,
determining one or more object(s) which have a predefined priority level,
determining a high priority area of the image frame which relates to the or an object with the predefined priority level,
inputting a next image frame of the consecutively acquired image frames into the convolutional neuronal network in real time,
analyzing only the high priority area in the next image frame by the convolutional neuronal network.
Thus, it is an essential idea of the invention that, instead of regularly processing whole images, only a section of the image may be processed with higher resolution for semantic segmentation of objects in the image. Especially, instead of always analyzing the complete image, a high priority area of the image is determined in a first image frame based on the priority levels of the objects detected in the image. Then, in a next image frame, only the high priority area of the image is processed, which makes the method considerably more efficient. Preferably, the priority levels of the different object classes are defined based on an order of safety, e.g. objects belonging to the object class "person" might be more important than objects belonging to the object class "curbside".
Preferably, at the beginning of this method, the high priority area would be defined by the object(s) with the highest priority level, i.e. the predefined priority level would be the highest priority level. If these objects have been classified in a trustworthy way, areas of the image with objects having lower priority levels may be processed.
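A minimal sketch of this control flow is given below. Everything in it is an assumption layered on the patent's prose: `detect_and_classify` stands in for the CNN's combined detection and semantic classification step, the lookup-table is the example table used later in the description, and a "high priority area" is realized as the union of the bounding boxes of the top-priority objects. Coordinate offsets between a cropped region and the full frame are omitted for brevity.

```python
# Illustrative control flow of the claimed method; detect_and_classify is a
# hypothetical stand-in returning [((x0, y0, x1, y1), object_class), ...].
PRIORITY = {"person": 1, "car": 2, "wall": 3, "tree": 4}  # lower = more urgent

def union_box(boxes):
    # smallest axis-aligned box enclosing all (x0, y0, x1, y1) boxes
    xs0, ys0, xs1, ys1 = zip(*boxes)
    return min(xs0), min(ys0), max(xs1), max(ys1)

def crop(frame, box):
    x0, y0, x1, y1 = box
    return frame[y0:y1, x0:x1]

def process_stream(frames, detect_and_classify):
    roi = None                                 # current high priority area
    for frame in frames:                       # consecutively acquired frames
        view = frame if roi is None else crop(frame, roi)
        objects = detect_and_classify(view)
        if not objects:
            roi = None                         # nothing found: analyze full frames again
            continue
        # predefined priority level = highest priority among detected classes
        top = min(PRIORITY[cls] for _, cls in objects)
        roi = union_box([box for box, cls in objects if PRIORITY[cls] == top])
        yield objects, roi
```

Resetting `roi` when nothing is detected corresponds to falling back to analyzing the complete next frame.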
The step of analyzing only the high priority area in the next image frame by the convolutional neuronal network may be performed in different ways as set out in the following. According to a preferred embodiment of the invention, analyzing only the high priority area in the next image frame by the convolutional neuronal network is performed by
examining by the convolutional neuronal network whether any object can be detected in the high priority area,
semantically classifying the detected objects by the convolutional neuronal network by assigning each detected object to one of the list of predefined object classes,
determining a respective priority of the detected objects by comparison with the lookup-table,
determining the one or more object(s) with the predefined priority level,
determining a new high priority area of the image frame which relates to the or an object with the predefined priority level,
inputting a next image frame of the consecutively acquired image frames into the convolutional neuronal network in real time, and
analyzing only the new high priority area in the next image frame by the convolutional neuronal network.
Preferably, the step of analyzing only the high priority area in the next image frame by the convolutional neuronal network by
examining by the convolutional neuronal network whether any object can be detected in the high priority area,
semantically classifying the detected objects by the convolutional neuronal network by assigning each detected object to one of the list of predefined object classes,
determining a respective priority of the detected objects by comparison with the lookup-table,
determining the one or more object(s) with the predefined priority level,
determining a new high priority area of the image frame which relates to the or an object with the predefined priority level,
inputting a next image frame of the consecutively acquired image frames into the convolutional neuronal network in real time, and
analyzing only the new high priority area in the next image frame by the convolutional neuronal network, is repeated at least once.
In this way, a high priority area with objects which should be classified may be defined in a multi-step process. However, according to another preferred embodiment of the invention, such classification may also be performed directly after the first definition of the high priority area. Therefore, according to a preferred embodiment of the invention analyzing only the high priority area in the next image frame by the convolutional neuronal network is performed by semantically classifying the object by assigning the object to one of the list of predefined object classes. In this respect, preferably the following step is performed:
accepting the object class the object has been assigned to when analyzing only the high priority area in the next image frame as a trustworthy object class. If such a trustworthy classification of objects with a certain priority level has been achieved, preferably areas with objects of the next lower priority level are processed.
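The patent leaves open how a classification comes to be trustworthy. One plausible, purely assumed realization thresholds the network's confidence and only then descends to the next lower priority level:

```python
CONFIDENCE_THRESHOLD = 0.9   # assumed value; the patent names no threshold

def next_priority_level(results, current_level):
    # results: [(object_class, confidence, priority_level), ...]
    pending = [r for r in results if r[2] == current_level]
    if pending and all(conf >= CONFIDENCE_THRESHOLD for _, conf, _ in pending):
        return current_level + 1   # all accepted: process next lower priority
    return current_level           # keep refining the current level
```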
In general, inputting a next image frame of the consecutively acquired image frames into the convolutional neuronal network in real time may be performed by inputting the complete image frame. However, according to a preferred embodiment of the invention, inputting a next image frame of the consecutively acquired image frames into the convolutional neuronal network in real time is performed by inputting only the high priority area of the next image frame into the convolutional neuronal network.
Further, according to a preferred embodiment of the invention, the step of consecutively acquiring image frames is performed by a camera with a field of view of more than 150°, yielding respective image frames covering an image angle of more than 150°. More preferably, the camera has a field of view of more than 180°, yielding respective image frames covering an image angle of more than 180°. In this way, a large field of view may be monitored while the mere amount of pixels of the images acquired by such a camera does not slow down processing speed appreciably, since not the complete images have to be processed for all image frames.
The invention also relates to the use of a method as described above in an automotive vehicle.
The invention further relates to a sensor arrangement for an automotive vehicle configured for performing a method as described above.
The invention also relates to a non-transitory computer-readable medium, comprising instructions stored thereon, that when executed on a processor, induce a sensor arrangement of an automotive vehicle to perform a method as described above.
These and other aspects of the invention will be apparent from and elucidated with reference to the embodiments described hereinafter. Individual features disclosed in the embodiments can constitute, alone or in combination, an aspect of the present invention. Features of the different embodiments can be carried over from one embodiment to another embodiment.
In the drawings:
Fig. 1 schematically depicts a vehicle with a sensor arrangement according to a preferred embodiment of the invention,
Figs. 2a, b schematically depict the processing of image frames according to a
preferred embodiment of the invention, and
Figs. 3a - d schematically depict a further aspect of the processing of image frames according to a preferred embodiment of the invention.
Fig. 1 schematically depicts an automotive vehicle 1 with a sensor arrangement 2 which is comprised of a camera 3 and an evaluation unit 4. The sensor arrangement 2 is adapted for semantic segmentation of images of objects 5 captured by camera 3. The evaluation unit 4 may be part of an advanced driver-assistance system for helping a
driver of the automotive vehicle 1 in the driving process. The camera 3 is a large field-of- view camera 3 and may have a viewing angle which is larger than 180°.
The method performed by the sensor arrangement 2 according to the preferred embodiment of the invention is as described in the following:
The camera 3 consecutively acquires image frames. The frequency of acquiring image frames may be as high as 30 frames/second. However, for effectively processing the image frames, a processing frequency of 5 frames/second has been shown to be sufficient. For processing the image frames, a first image frame 6 of the consecutively acquired image frames is input into a convolutional neural network in real time. The convolutional neural network is provided in the evaluation unit 4, to which the image frames of the camera 3 are transmitted.
In the convolutional neural network it is examined whether any object 5 which is not part of the ground area the automotive vehicle 1 is driving on can be detected in the first image frame 6. If such objects 5 can be detected in the first image frame 6, these objects are semantically classified by the convolutional neural network by assigning each detected object to one of a list of predefined object classes.
According to the preferred embodiment described here, these object classes may be "person", "car", "wall", "tree", etc. Such semantic classification of objects by a convolutional neural network is well-known to the person skilled in the art and does not require any further explanation here.
However, differently from conventional methods, according to the preferred embodiment of the invention, a lookup-table with a priority list which comprises a priority level for each of the predefined object classes, respectively, is provided. In the present case, this priority list looks as follows:

| Object class | Priority level |
|---|---|
| person | 1 |
| car | 2 |
| wall | 3 |
| tree | 4 |
This priority list may have further object classes which are related to respective priorities. A respective priority level is determined for each object which has been detected in the first image frame 6 by comparison with the lookup table.
A respective image frame 6 can be seen from Fig. 2a. In this image frame 6 two persons are detected as one object 5, and further a wall is detected as another object 8. Since the object class "person'' has a higher priority than the object class "wall" a high priority area 9 is determined which relates to the object 5 which belong to the object class "person ".
Then, a next image frame 7 of the consecutively acquired image frames is input into the convolutional neural network in real time, wherein only the high priority area 9 in the next image frame 7 is analyzed by the convolutional neural network. This is shown in Fig. 2b, in which the image frame 7 which is processed by the convolutional neural network for semantic segmentation of the objects 5 relates to the high priority area determined in the previous method step in image frame 6. In this way, the objects 5 can be processed at much higher resolution, which makes semantic segmentation of the objects 5, i.e.
assigning the objects 5 to one of the list of predefined object classes, easier and, thus, more trustworthy.
However, according to a preferred embodiment of the invention, a high priority area with objects which should be classified may also be defined in a multi-step process as described in the following with respect to Figs. 3a to d.
In Fig. 3a it is shown that a high priority area 9 is defined which comprises two objects 5, 8 which belong to different object classes, i.e. "person" and "wall". Instead of directly focusing on object 5, which is the object with the higher priority, a high priority area 9 is defined which comprises both objects 5, 8, which then, in the next image frame 7 shown in Fig. 3b, can be analyzed with higher resolution.
This analysis with higher resolution makes it possible to clearly distinguish between the two objects 5, 8, and to define a new high priority area 10 which only relates to the object 5 which belongs to the object class with the highest priority, i.e. "person", as shown in Fig. 3c.
Then, in a further image frame 11 shown in Fig. 3d, only this new high priority area 10 is examined, i.e. semantic segmentation is only performed for object 5 in order to verify that the object 5 detected here does actually belong to the object class "person".
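Under the same assumptions as the earlier sketch (and reusing its `union_box` and `crop` helpers, with coordinate offsets between crops and full frames again omitted), the multi-step refinement of Figs. 3a to 3d could look like this:

```python
def refine_multistep(frames, detect_and_classify, priority):
    # Fig. 3a: area 9 is first drawn around *both* objects 5 and 8
    objs = detect_and_classify(next(frames))
    roi = union_box([box for box, _ in objs])
    # Fig. 3b: the cropped area is re-analyzed at higher resolution
    objs = detect_and_classify(crop(next(frames), roi))
    # Fig. 3c: new area 10 keeps only the top-priority object ("person")
    top = min(priority[cls] for _, cls in objs)
    roi = union_box([box for box, cls in objs if priority[cls] == top])
    # Fig. 3d: the further frame 11 is examined only inside area 10,
    # verifying that object 5 indeed belongs to the class "person"
    return detect_and_classify(crop(next(frames), roi))
```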
Reference signs list
1 automotive vehicle
2 sensor arrangement
3 camera
4 evaluation unit
5 persons
6 first image frame
7 next image frame
8 wall
9 high priority area
10 new high priority area
11 further next image frame
Claims
1. Method for semantic segmentation of an object (5, 8) in an image, comprising the following method steps:
consecutively acquiring image frames (6, 7, 11),
inputting a first image frame (6) of the consecutively acquired image frames (6, 7, 11) into a convolutional neuronal network in real time,
examining by the convolutional neuronal network whether any object (5, 8) can be detected in the first image frame (6),
semantically classifying the detected objects (5, 8) by the convolutional neuronal network by assigning each detected object (5, 8) to one of a list of predefined object classes,
providing a lookup-table with a priority list which comprises a priority level for each of the predefined object classes, respectively,
determining a respective priority level of the detected objects (5, 8) by comparison with the lookup-table,
determining one or more object(s) (5) which have a predefined priority level,
determining a high priority area (9) of the image frame (6) which relates to the or an object (5) with the predefined priority level,
inputting a next image frame (7) of the consecutively acquired image frames (6, 7, 11) into the convolutional neuronal network in real time,
analyzing only the high priority area (9) in the next image frame (7) by the convolutional neuronal network.
2. Method according to claim 1, wherein analyzing only the high priority area (9) in the next image frame (7) by the convolutional neuronal network is performed by
examining by the convolutional neuronal network whether any object (5, 8) can be detected in the high priority area (9),
semantically classifying the detected objects (5, 8) by the convolutional neuronal network by assigning each detected object (5, 8) to one of the list of predefined object classes,
determining a respective priority of the detected objects (5, 8) by comparison with the lookup-table,
determining the one or more object(s) (5) with the predefined priority level,
determining a new high priority area (10) of the image frame which relates to the or an object (5) with the predefined priority level,
inputting a next image frame (11) of the consecutively acquired image frames (6, 7, 11) into the convolutional neuronal network in real time, and
analyzing only the new high priority area (10) in the next image frame (11) by the convolutional neuronal network.
3. Method according to claim 2, by repeating at least once the step of analyzing only the high priority area in a further next image frame by the convolutional neuronal network by
examining by the convolutional neuronal network whether any object (5, 8) can be detected in the high priority area,
semantically classifying the detected objects (5, 8) by the convolutional neuronal network by assigning each detected object to one of the list of predefined object classes,
determining a respective priority of the detected objects (5, 8) by comparison with the lookup-table,
determining the one or more object(s) (5) with the predefined priority level,
determining a new high priority area of the image frame which relates to the or an object (5) with the predefined priority level,
inputting a further next image frame of the consecutively acquired image frames into the convolutional neuronal network in real time, and
analyzing only the new high priority area in the next image frame by the
convolutional neuronal network.
4. Method according to any of claims 1 to 3, wherein analyzing only the high priority area (9, 10) in the next image frame (7, 11) by the convolutional neuronal network is performed by semantically classifying the object (5, 8) by assigning the object (5, 8) to one of the list of predefined object classes.
5. Method according to claim 4 comprising the following method step:
accepting the object class the object (5, 8) has been assigned to when analyzing only the high priority area in the next image frame (7, 11) as a trustworthy object class.
6. Method according to any of the previous claims, wherein inputting a next image frame (7, 11) of the consecutively acquired image frames (6, 7, 11) into the convolutional neuronal network in real time is performed by inputting only the high priority area (9, 10) of the next image frame (7, 11) into the convolutional neuronal network.
7. Method according to any of the previous claims, wherein the step of consecutively acquiring image frames (6, 7, 11) is performed by a camera (3) with a field of view of more than 150°, yielding respective image frames (6, 7, 11) covering an image angle of more than 150°.
8. Use of the method according to any of the previous claims in an automotive vehicle (1).
9. Sensor arrangement (2) for an automotive vehicle (1) configured for performing the method according to any of claims 1 to 8.
10. Non-transitory computer-readable medium, comprising instructions stored thereon, that when executed on a processor, induce a sensor arrangement (2) of an automotive vehicle (1) to perform the method of any of claims 1 to 8.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
DE102017124600.2 | 2017-10-20 | ||
DE102017124600.2A DE102017124600A1 (en) | 2017-10-20 | 2017-10-20 | Semantic segmentation of an object in an image |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019076867A1 true WO2019076867A1 (en) | 2019-04-25 |
Family
ID=63896158
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/EP2018/078192 WO2019076867A1 (en) | 2017-10-20 | 2018-10-16 | Semantic segmentation of an object in an image |
Country Status (2)
Country | Link |
---|---|
DE (1) | DE102017124600A1 (en) |
WO (1) | WO2019076867A1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113392837A (en) * | 2021-07-09 | 2021-09-14 | 超级视线科技有限公司 | License plate recognition method and device based on deep learning |
GB2607420A (en) * | 2021-04-06 | 2022-12-07 | Canon Kk | Image processing apparatus and method for controlling the same |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
DE102021003439A1 (en) | 2021-07-02 | 2021-08-19 | Daimler Ag | Method for drawing attention to at least one occupant in a vehicle |
DE102021004931A1 (en) | 2021-10-01 | 2021-12-09 | Daimler Ag | Method for processing environmental data in a vehicle |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170099200A1 (en) | 2015-10-06 | 2017-04-06 | Evolv Technologies, Inc. | Platform for Gathering Real-Time Analysis |
US9704054B1 (en) | 2015-09-30 | 2017-07-11 | Amazon Technologies, Inc. | Cluster-trained machine learning for image processing |
US20170200063A1 (en) | 2016-01-13 | 2017-07-13 | Ford Global Technologies, Llc | Low- and high-fidelity classifiers applied to road-scene images |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6400831B2 (en) * | 1998-04-02 | 2002-06-04 | Microsoft Corporation | Semantic video object segmentation and tracking |
US6697502B2 (en) * | 2000-12-14 | 2004-02-24 | Eastman Kodak Company | Image processing method for detecting human figures in a digital image |
US9607224B2 (en) * | 2015-05-14 | 2017-03-28 | Google Inc. | Entity based temporal segmentation of video streams |
- 2017-10-20: DE application DE102017124600.2A filed (patent DE102017124600A1, active, pending)
- 2018-10-16: PCT application PCT/EP2018/078192 filed (WO2019076867A1, active, application filing)
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9704054B1 (en) | 2015-09-30 | 2017-07-11 | Amazon Technologies, Inc. | Cluster-trained machine learning for image processing |
US20170099200A1 (en) | 2015-10-06 | 2017-04-06 | Evolv Technologies, Inc. | Platform for Gathering Real-Time Analysis |
US20170200063A1 (en) | 2016-01-13 | 2017-07-13 | Ford Global Technologies, Llc | Low- and high-fidelity classifiers applied to road-scene images |
Non-Patent Citations (1)
Title |
---|
SERGI CAELLES ET AL: "Semantically-Guided Video Object Segmentation", 6 April 2017 (2017-04-06), XP055543131, Retrieved from the Internet <URL:https://arxiv.org/pdf/1704.01926v1.pdf> [retrieved on 20190116] * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2607420A (en) * | 2021-04-06 | 2022-12-07 | Canon Kk | Image processing apparatus and method for controlling the same |
GB2607420B (en) * | 2021-04-06 | 2024-08-21 | Canon Kk | Image processing apparatus and method for controlling the same |
GB2629706A (en) * | 2021-04-06 | 2024-11-06 | Canon Kk | Image processing apparatus and method for controlling the same |
CN113392837A (en) * | 2021-07-09 | 2021-09-14 | 超级视线科技有限公司 | License plate recognition method and device based on deep learning |
Also Published As
Publication number | Publication date |
---|---|
DE102017124600A1 (en) | 2019-04-25 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 18789091; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 18789091; Country of ref document: EP; Kind code of ref document: A1 |