GB2630812A - Occupancy status detection for a vehicle - Google Patents
Occupancy status detection for a vehicle
- Publication number
- GB2630812A GB2308652.3A GB202308652A
- Authority
- GB
- United Kingdom
- Prior art keywords
- occupancy
- vehicle
- location
- input
- values
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60N—SEATS SPECIALLY ADAPTED FOR VEHICLES; VEHICLE PASSENGER ACCOMMODATION NOT OTHERWISE PROVIDED FOR
- B60N2/00—Seats specially adapted for vehicles; Arrangement or mounting of seats in vehicles
- B60N2/002—Seats provided with an occupancy detection means mounted therein or thereon
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/26—Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60N—SEATS SPECIALLY ADAPTED FOR VEHICLES; VEHICLE PASSENGER ACCOMMODATION NOT OTHERWISE PROVIDED FOR
- B60N2210/00—Sensor types, e.g. for passenger detection systems or for controlling seats
- B60N2210/10—Field detection presence sensors
- B60N2210/16—Electromagnetic waves
- B60N2210/22—Optical; Photoelectric; Lidar [Light Detection and Ranging]
- B60N2210/24—Cameras
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
Abstract
The occupancy status of a vehicle 100 is determined by receiving audio data 201-1..r from distributed microphones 320-1..r, generating a frequency distribution spectrogram for each audio data over a time window (eg. p FFT’d sliding windows), inputting these spectra into a classification model (eg. an artificial neural network comprising a head sub-network for each location with weighted connection nodes) to classify an occupancy status at each discrete location i=1..N in the vehicle (eg. seats, trunk, footwells) as one of a predetermined set of occupancy types j=1..M (eg. empty, adult, child, dog, cat) and outputting values Oij indicating a probability of the occupancy of location i being of type j, for use by the vehicle control system. Visual data for each location Vij may also be input to the classification.
Description
OCCUPANCY STATUS DETECTION FOR A VEHICLE
TECHNICAL FIELD
The present disclosure relates to occupancy status detection for a vehicle. Aspects of the invention relate to methods, to a control system, to an occupancy status detection system, to a vehicle, and to computer-readable instructions.
BACKGROUND
Various vehicle control functions may be adapted depending on an occupancy status of the vehicle. It is therefore desirable to accurately classify, or detect, the occupancy status at various locations within the vehicle.
Occupancy status can be determined using load sensors at each location to determine whether an occupant is present. However, this method is not robust to distinguishing an occupant from any object of a comparable weight, for example, and so may inaccurately infer that a location is occupied. Alternatively, occupancy status can be determined using vision or radar sensors disposed about the vehicle. Captured images can be processed to identify or categorise an occupancy status at each location. However, this method may not be robust, in particular if an occupant is occluded from view.
It is an aim of the present invention to address one or more of the disadvantages associated with the prior art.
SUMMARY OF THE INVENTION
Aspects and embodiments of the invention provide a method for occupancy status detection, a control system, an occupancy status detection system, a vehicle and a method of training a classification model as claimed in the appended claims.
According to an aspect of the present invention there is provided a computer-implemented method for occupancy status detection of a vehicle, the method comprising: receiving, from each of a plurality of audio input devices disposed about the vehicle, respective audio data; determining input data in dependence on each respective audio data; inputting the input data into a classification model for classifying an occupancy status of a vehicle at each of a set of N discrete locations within the vehicle i = 1, ..., N as one of a predetermined set of M occupancy types j = 1, ..., M; obtaining, from the classification model, one or more values Oij indicative of a probability of the occupancy of the location i being of type j; and outputting an occupancy classification signal indicative of the one or more values Oij. Advantageously, utilising audio data from about a vehicle cabin for classifying occupancy at each location facilitates the identification of occupancy type even when the occupant or location is obscured from visual identification.
According to another aspect of the present invention there is provided a computer-implemented method for occupancy status detection of a vehicle, the method comprising: receiving, from each of a plurality of audio input devices disposed about the vehicle, respective audio data; for each respective audio data: determining, in dependence on the audio data, a spectrogram indicative of a frequency distribution of the audio data over a selected time window; inputting each determined spectrogram into a classification model for classifying an occupancy status of a vehicle at each of a set of N discrete locations within the vehicle i = 1, ..., N as one of a predetermined set of M occupancy types j = 1, ..., M; obtaining, from the classification model, one or more values Oij indicative of a probability of the occupancy of the location i being of type j; and outputting an occupancy classification signal indicative of the one or more values Oij. Advantageously, utilising vehicle audio for classifying occupancy at each location facilitates the identification of occupancy type even when the occupant or location is obscured from visual identification. Furthermore, the use of spectrograms as input to the classification model enables the decomposition of audio features at each audio input device, which enables the spatial location of these features to be encoded in the temporal discrepancies between each spectrogram.
The set of N discrete locations may comprise one or more of: a front left seat of the vehicle; a front right seat of the vehicle; a rear left seat of the vehicle; a rear right seat of the vehicle; a trunk of the vehicle; a front left footwell of the vehicle; a front right footwell of the vehicle; a rear left footwell of the vehicle; and a rear right footwell of the vehicle.
The predetermined set of M occupancy types j = 1, ..., M may comprise one or more of: no occupant; an adult; a child; a dog; and a cat.
The one or more values may comprise a respective output vector Oi = {Oi1, Oi2, ..., OiM} for each of the set of N discrete locations i = 1, ..., N. The classification model may receive each determined spectrogram and output the N output vectors.
The audio input devices may be disposed about an interior of the vehicle or otherwise arranged to record audio internal to the vehicle. Optionally, the method comprises obtaining a waveform corresponding to a selected time window of the audio data, and determining the spectrogram in dependence on the waveform.
The method may comprise determining a prediction for the occupancy type j at each location i in dependence on the one or more values Oij; and outputting the occupancy classification signal indicative of the determined prediction. In this way, a definitive occupancy type can be determined at each location, to facilitate occupancy dependent vehicle control. The method may further comprise outputting a control signal to control one or more vehicle settings in dependence on the determined prediction. Beneficially, this enables vehicle settings to be controlled based on occupancy, even when occupancy is not readily visible or discernible using other vehicle systems. The control signal may be for example to control a notification system (such as an alarm, horn or app) to alert a user to an occupancy status. For example, this may be desired if a child or pet is unsupervised in the vehicle. In other examples, the control signal may be for limiting acceleration, speed or volume in dependence on the occupancy status, e.g. if a child is in the vehicle. In other examples, the control signal may be for controlling an audio-visual or HVAC system in dependence on the occupancy status. For example, the audio-visual or HVAC system may be switched off in unoccupied areas of the vehicle, or adjusted differently for adult, child or pet occupancy.
Optionally, determining each spectrogram comprises dividing the audio data into p sliding time windows; and applying a Fast Fourier Transform with q frequency bins to each of the p time windows to determine a spectrogram having dimensions p and q. The sliding time windows may be overlapping. Advantageously, this ensures spectral information is captured over the whole of the duration of the audio data. In some embodiments, the selected time window of the audio data (i.e. the entirety of the spectrogram) has a duration of between 0.1 and 5 seconds, such as 3 seconds, 2 seconds or 0.5 seconds. Each sliding time window may have a length of between 10ms and 50ms, such as 10ms, 20ms or 30ms. Each sliding time window may have an overlap of between 30% and 70% of the sliding time window length, such as 40% or 50%.
The method may comprise receiving respective audio data from r audio input devices, combining each spectrogram to obtain an input tensor I of dimensions p, q and r, and inputting the input tensor I into the classification model. Optionally, combining comprises concatenating each spectrogram. Advantageously, this ensures distinct features within the spectrogram are spatially aligned and the location is encoded in slight temporal misalignment, enabling accurate location identification of occupancy types.
The classification model may comprise an artificial neural network for classifying the occupancy status of a vehicle at each of the set of N discrete locations within the vehicle i = 1, ..., N as one of the predetermined set of M occupancy types j = 1, ..., M; and the method may comprise: determining an input tensor for the artificial neural network in dependence on each of the determined spectrograms; inputting the input tensor to input nodes of the neural network; and mapping the input tensor through connecting nodes of the neural network to output nodes of each of a plurality of output layers Oi to obtain each of the one or more values Oij. Advantageously, an artificial neural network (ANN) may perform automatic feature extraction, in contrast to other machine learning classifiers. This facilitates the input of spectrograms as raw input to the model. The artificial neural network has been trained to predict a classification of the occupancy type at each of the discrete locations using a set of training data, the set of training data comprising audio data from a plurality of audio input devices disposed about a training vehicle under a set of conditions having each occupancy type j at each location i. The training vehicle is equivalent to the vehicle, with a corresponding arrangement of audio input devices about the training vehicle.
The artificial neural network may comprise a backbone sub-network having connecting nodes connecting the input layer to a feature map layer, the connections of the connecting nodes having weights. Optionally, the feature map is downsampled. Advantageously, this layer of the artificial neural network can perform automated feature extraction, using any backbone network suitable for image processing.
Optionally, the artificial neural network comprises a respective head sub-network for each discrete location i in the set i = 1, ..., N, wherein each respective head sub-network comprises connecting nodes between the feature map layer and the output layer Oi corresponding to the discrete location i in the set, the connections of the connecting nodes having weights. Advantageously, this provides a multi-headed structure wherein each head network can function independently to classify the occupancy status at the respective location i using a common set of automatically extracted features, enabling improved accuracy in classification for each specific location.
Optionally, the artificial neural network further comprises a respective attention pooling sub-network for each discrete location i in the set i = 1, ..., N, wherein each respective attention pooling sub-network comprises connecting nodes connecting the feature map layer to an input layer of the respective head sub-network; wherein the input layer of each respective head sub-network is indicative of a location specific audio feature fa,i.
Advantageously, the network structure facilitates location specific attention pooling, facilitating the extraction of audio features specific to each discrete location for input into the relevant head sub-network. The artificial neural network may further comprise a mean pooling sub-network connecting the feature map layer to a mean pooled layer, wherein the mean pooling sub-network acts to pool the feature map output over the frequency dimension q. Each attention pooling sub-network may connect the mean pooled layer to the respective input layer of the respective head sub-network.
The method may further comprise: receiving, from a visual classification system of the vehicle, visual data comprising a further one or more values Vij indicative of a probability of the occupancy of the location i being of type j according to the visual classification system; and determining, in dependence on the one or more values Oij and the further one or more values Vij, an overall probability of the occupancy of the location i being of type j. The overall probability may be determined as a weighted sum of Oij and Vij. Advantageously, utilising both audio and visual classification results in improved confidence compared to either modality individually. Further, decoupling the modalities leads to a higher classification confidence as the predictions are independent.
The method may comprise: receiving, from the visual classification system of the vehicle, at least one location specific visual feature fv,i associated with each of the N discrete locations i = 1, ..., N; for each of the N discrete locations i = 1, ..., N: receiving a fusion neural network for predicting a confidence of each of the one or more values Oij and the further one or more values Vij at the location i; inputting, to input nodes of an input layer of the fusion neural network, the one or more values Oij, the location specific audio feature fa,i, the further one or more values Vij and the location specific visual feature fv,i; mapping the input through connecting nodes of the fusion neural network to output nodes of an output layer to obtain one or more weight values; and determining a fused classification probability for each occupancy type j as a combination of the audio probability Oij and the visual probability Vij weighted according to the weight values; and outputting the occupancy classification signal in dependence on each fused classification probability.
Advantageously, providing a respective fusion neural network for each location facilitates the flexible adaption of weight values for combining the modalities depending on the particular circumstances, characterised by the location specific features and probabilities. In this way, the fusion neural network can predict the confidence of each modality for each occupancy type. By decoupling these modalities initially and then providing a separate fusion network at the end, this allows the training of the original neural network on a large single modality dataset, requiring only a relatively small cross-modality training set for the fusion.
Each fusion neural network may comprise: an input layer having input nodes for receiving the one or more values Oij, the location specific audio feature fa,i, the further one or more values Vij and the location specific visual feature fv,i; an output layer having output nodes for outputting the weight values Wij = {W1,a, W2,a, ..., WM,a, W1,v, W2,v, ..., WM,v}, wherein a first subset of the weight values Wj,a are indicative of a weight to be applied to each probability from the one or more values Oi = {Oi1, Oi2, ..., OiM}, and wherein a second subset of the weight values Wj,v are indicative of a weight to be applied to each probability from the further one or more values Vi = {Vi1, Vi2, ..., ViM}; one or more hidden layers having connecting nodes connecting the input layer to the output layer; wherein the fusion neural network has been trained to predict a relative confidence of the audio classification and the visual classification using a set of training data, the set of training data comprising audio data from a plurality of audio input devices disposed about a training vehicle and visual data from one or more imaging devices disposed about the training vehicle under a set of conditions having each occupancy type j at the discrete location i.
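By way of illustration only, one possible shape for such a per-location fusion network is sketched below in PyTorch: it takes the audio probabilities Oi, the location specific audio feature fa,i, the visual probabilities Vi and the location specific visual feature fv,i, and outputs 2M weight values used to blend the two modalities. The hidden-layer size, feature dimensions and the softmax pairing of the audio and visual weights for each type j are assumptions, not details taken from the claims.

```python
# Hedged sketch of a per-location fusion neural network (assumed sizes and pairing).
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    def __init__(self, m=5, audio_feat=64, visual_feat=64, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(m + audio_feat + m + visual_feat, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2 * m))          # weights [W_1,a..W_M,a, W_1,v..W_M,v]

    def forward(self, o_i, f_a, v_i, f_v):     # all tensors have a leading batch dim
        w = self.mlp(torch.cat([o_i, f_a, v_i, f_v], dim=-1))
        w_audio, w_visual = w.chunk(2, dim=-1)
        # normalise the audio/visual weights per occupancy type j (assumption)
        w = torch.softmax(torch.stack([w_audio, w_visual], dim=-1), dim=-1)
        # fused classification probability for each type j
        return w[..., 0] * o_i + w[..., 1] * v_i
```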
According to another aspect, there is provided a control system for controlling an occupancy status detection system for a vehicle, the control system comprising one or more controller, the control system configured to: receive, from each of a plurality of audio input devices disposed about the vehicle, respective audio data; for each respective audio data: determine, in dependence on the audio data, a spectrogram indicative of a frequency distribution of the audio data over a selected time window; input each determined spectrogram into a classification model for classifying an occupancy status of a vehicle at each of a set of N discrete locations within the vehicle i = 1, ..., N as one of a predetermined set of M occupancy types j = 1, ..., M; obtain, from the classification model, one or more values Oij indicative of a probability of the occupancy of the location i being of type j; and output an occupancy classification signal indicative of the one or more values Oij. The control system may be configured to perform the method according to the above aspect.
The control system comprises one or more controllers collectively comprising at least one electronic processor having an electrical input for receiving an input signal; and at least one memory device electrically coupled to the at least one electronic processor and having instructions stored therein; and wherein the at least one electronic processor is configured to access the at least one memory device and execute the instructions thereon so as to: receive, from each of a plurality of audio input devices disposed about the vehicle, respective audio data; for each respective audio data: determine, in dependence on the audio data, a spectrogram indicative of a frequency distribution of the audio data over a selected time window; input each determined spectrogram into a classification model for classifying an occupancy status of a vehicle at each of a set of N discrete locations within the vehicle i = 1, ..., N as one of a predetermined set of M occupancy types j = 1, ..., M; obtain, from the classification model, one or more values Oij indicative of a probability of the occupancy of the location i being of type j; and output an occupancy classification signal indicative of the one or more values Oij.

According to another aspect there is provided an occupancy status detection system for a vehicle comprising: the control system of the above aspect; and a plurality of audio input devices arranged to be disposed about the vehicle, wherein the plurality of audio input devices are arranged to communicate the respective audio data to the control system.
According to another aspect there is provided a vehicle comprising the control system or the occupancy status detection system of the above aspects.
According to another aspect there is provided a computer implemented method of training a classification model for classifying an occupancy status of a vehicle at each of a set of N discrete locations within the vehicle i = 1, ..., N as one of a predetermined set of M occupancy types j = 1, ..., M; wherein the method comprises: receiving a set of training data to train the classification model, the set of training data comprising audio data from a plurality of audio input devices disposed about a training vehicle under a set of conditions having each occupancy type j at each location i, the set of training data comprising input-output pairs, wherein each input-output pair corresponds to a different occupancy type j at each location i, each pair comprising audio data from each of the audio input devices and corresponding classification data indicative of a classification of the occupancy of each location i being of one of the types j for the associated audio data; for each input-output pair: determining a respective spectrogram indicative of a frequency distribution of the audio data from each of the audio input devices over a selected time window; determining an input tensor I for the input-output pair in dependence on each of the determined spectrograms; inputting the input tensor I into the classification model; obtaining, from the classification model, a predicted set of values Pij indicative of a predicted probability of the occupancy of the location i being of type j; determining an objective function characterising an error between the values of the predicted set of values Pij compared to the corresponding classification data for each of the one or more input-output pairs; and using an appropriate optimisation algorithm operating on the objective function, updating the classification model to seek to minimise the objective function. When the classification model comprises an artificial neural network, updating the classifier comprises updating the weights between connecting nodes of the artificial neural network.
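As an illustration only, a minimal training loop along these lines might look as follows in Python/PyTorch, assuming a model that returns a list of N per-location logit vectors, a data loader yielding input tensors with per-location occupancy-type labels, a cross-entropy objective and the Adam optimiser; the actual objective function and optimisation algorithm are not limited to these assumed choices.

```python
# Hedged sketch of the supervised training loop described above (assumed loss/optimiser).
import torch
import torch.nn as nn

def train_classifier(model, loader, epochs=10, lr=1e-3):
    # loader yields (input_tensor, labels), where labels is a [batch, N] tensor
    # of ground-truth occupancy-type indices j for each discrete location i.
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    objective = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, labels in loader:
            predictions = model(x)                 # list of N [batch, M] logit tensors
            loss = sum(objective(p, labels[:, i])  # error vs the classification data
                       for i, p in enumerate(predictions))
            optimiser.zero_grad()
            loss.backward()                        # gradients of the objective function
            optimiser.step()                       # update weights to minimise it
```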
According to another aspect there is provided computer readable instructions which, when executed by a computer, are arranged to perform a method according to the above aspects.
Within the scope of this application it is expressly intended that the various aspects, embodiments, examples and alternatives set out in the preceding paragraphs, in the claims and/or in the following description and drawings, and in particular the individual features thereof, may be taken independently or in any combination. That is, all embodiments and/or features of any embodiment can be combined in any way and/or combination, unless such features are incompatible. The applicant reserves the right to change any originally filed claim or file any new claim accordingly, including the right to amend any originally filed claim to depend from and/or incorporate any feature of any other claim although not originally claimed in that manner.
BRIEF DESCRIPTION OF THE DRAWINGS
One or more embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings, in which:
Figure 1A shows a vehicle in accordance with an embodiment of the invention;
Figure 1B shows a plan view of a vehicle in accordance with an embodiment;
Figure 2 shows a flow chart of a method according to an embodiment;
Figure 3A shows a schematic of an occupancy status detection system according to an embodiment;
Figure 3B shows a first example arrangement of audio input devices according to an embodiment of the occupancy status detection system;
Figure 3C shows a second example arrangement of audio input devices according to an embodiment of the occupancy status detection system;
Figure 4 shows a schematic of a control system according to an embodiment;
Figure 5 illustrates the processing of audio data according to an embodiment;
Figure 6 illustrates a classification model according to an embodiment;
Figure 7 illustrates a mean pooling layer of a classification model according to an embodiment;
Figure 8 illustrates an attention pooling sub-network of a classification model according to an embodiment;
Figure 9 illustrates a fusion method according to an embodiment;
Figure 10 illustrates a further embodiment of the fusion method; and
Figure 11 illustrates a training method in accordance with an embodiment.
DETAILED DESCRIPTION
Various vehicle control functions may be adapted depending on an occupancy status of the vehicle. It is therefore desirable to accurately classify, or detect, the occupancy status at various locations within the vehicle. The occupancy status may define whether or not an occupant is present at each location within the vehicle, such as at each seat of the vehicle. However, further delineation of occupancy status to determine a category of occupant can also be beneficial, such as "child", "adult" or "pet", for example.
Occupancy status can be determined using load sensors at each location to determine whether an occupant is present. However, this method is not robust to distinguishing an occupant from any object of a comparable weight, for example, and so may inaccurately infer that a location is occupied. Alternatively, occupancy status can be determined using vision or radar sensors disposed about the vehicle. Captured images can be processed to identify or categorise an occupancy status at each location. However, this method may not be robust, in particular if an occupant is occluded from view.
According to the present invention, there is provided an improved system and method for determining an occupancy status in a vehicle, such as the vehicle 100 illustrated in Figures 1A and 1B. The vehicle 100 illustrated is an automobile, but the present invention may be implemented with any type of vehicle, including land vehicles, watercraft, aircraft or the like.
Figure 1B illustrates a plan view of the vehicle 100 within which N discrete locations 101, 102, 103, 104, 105, 106, 107, 108, 109 have been labelled. According to the present invention, an occupancy status may be determined for each of these N discrete locations within the vehicle. The discrete locations are predefined. Although nine discrete locations are shown in Figure 1B, it will be appreciated that more or fewer locations may be defined in any arrangement suitable for the particular vehicle.
The N discrete locations illustrated in Figure 1B comprise: a front left seat of the vehicle 101, a front right seat of the vehicle 102, a rear left seat of the vehicle 103, a rear right seat of the vehicle 104, a front left footwell of the vehicle 105, a front right footwell of the vehicle 106, a rear left footwell of the vehicle 107, a rear right footwell of the vehicle 108 and a trunk of the vehicle 109. This set of discrete locations is merely illustrative, and in other embodiments the invention may be adapted to determine the occupancy status of a different set of N discrete locations. For example, the N discrete locations may comprise fewer locations than shown, such as only a subset of the seats of the vehicle, and may not include any footwells.
Additionally or alternatively, the N discrete locations may include one or more locations not shown in Figure 1B, such as a frunk of the vehicle or a middle seat of the vehicle. The N discrete locations may include further subdivisions not shown, such as a left portion of the trunk and a right portion of the trunk.
Figure 2 illustrates a method 200 according to an embodiment of the present invention. The method 200 is a method of detecting an occupancy status of a vehicle 100, such as the vehicle 100 illustrated in Figures 1A and 1B. In particular, the method 200 is a method of classifying the occupancy status of each of the set of N discrete locations within the vehicle. The method 200 utilises audio data received from audio input devices disposed about the vehicle to classify the occupancy status at each of the N locations. Beneficially, classifying occupancy status using audio data facilitates the identification of occupancy status even when the occupant is obscured from view of any camera, as diffraction enables sound from each location to traverse obstacles.
The method 200 may be performed, at least in part, by a control system 310 for an occupancy status detection system 300, as illustrated in Figure 3A. The occupancy status detection system 300 is for use in a vehicle, such as the vehicle 100, and may be installed at least partially within the vehicle 100. The occupancy status detection system 300 comprises the control system 310 and a plurality of r audio input devices 320-1, ..., 320-r arranged to be disposed about the vehicle. Each of the r audio input devices is arranged to record audio at a respective location within the vehicle and communicate respective audio data to the control system 310 for use in the method 200, as will be explained. The audio input devices 320-1, ..., 320-r are, in use, arranged to record audio internal to the vehicle. For example, they may be installed at various locations about an interior of the vehicle.
Two example arrangements for the audio input devices are illustrated in Figures 3B and 3C. According to a first arrangement illustrated in Figure 3B, the audio input devices are arranged about side pillars A, B, C, D of the vehicle. A respective audio input device 320-1, ..., 320-6 may be disposed at each left and right side pillar A, B, C, D. In this way, the audio input devices surround a perimeter of a vehicle cabin. Each illustrated audio input device may be a stereo input device, such as a stereo microphone, to further improve sound localisation. As such, each illustrated audio input device 320-1, ..., 320-6 may actually represent two (or more) audio input devices, in some embodiments.
According to a second arrangement illustrated in Figure 3C, the audio input devices 320-1, 320-2, 320-3 are arranged along a center line of the vehicle. Again, each illustrated audio input device may be a stereo input device, such as a stereo microphone, to further improve sound localisation in a left-right plane of the vehicle, and thus may comprise a plurality of input devices. The arrangement shown in Figure 3C may utilise fewer audio input devices than that shown in Figure 3B. The provision of additional audio input devices can provide increased accuracy in occupant localisation; however, it may be desired to keep the number of audio input devices to a minimum to reduce cost. The arrangements shown in Figures 3B and 3C have been found to provide a good accuracy. However, alternative arrangements can be envisaged. In particular, the minimum requirement is that at least two audio input devices be provided, to enable sounds to be localised.
Returning to Figure 2, the method 200 comprises receiving respective audio data 201-1, ..., 201-r from each of a plurality r of audio input devices disposed about a vehicle. The r audio input devices may be the devices 320-1, ..., 320-r of the system 300 shown in Figures 3A to 3C. Each audio data 201-n is indicative of audio recorded within the vehicle by the respective audio input device. The method 200 may be performed on an ad hoc basis; however, in some embodiments the method 200 may be performed continuously or periodically. Thus, the audio data 201-1, ..., 201-r may each comprise a respective audio stream communicated continuously or at intervals from each audio input device.
The method 200 then comprises a step 210 of determining, in dependence on each respective audio data, a respective spectrogram indicative of a frequency distribution of the audio data over a selected time window. The selected time window may be a predefined interval, such as a most recent interval of a predefined length. For example, the selected time window may define the most recent 0.1 to 5 seconds of audio data, such as the most recent 3 seconds, 2 seconds or 0.5 seconds. In this way, if the method 200 is performed periodically, the resultant classification can be continually updated based on the most recent audio received.
An example of the step 210 according to an embodiment of the invention is illustrated in Figure 5. With reference to Figure 5, four audio input devices are implemented and so four pieces of audio data have been received, but it will be appreciated that the example may be extrapolated to other embodiments having different numbers of audio input devices.
A respective waveform 511, 512, 513, 514 is obtained corresponding to the selected time window T of each piece of audio data. The waveform 511, 512, 513, 514 can be obtained by extracting the selected time window T from the audio data, for example by extracting the last 3 seconds of audio.
In some embodiments, the selected time window T may correspond to the entirety of the audio data, for example if only audio data corresponding to the selected time window T is received. Thus, a separate step of extracting the respective waveforms 511, 512, 513, 514 may be omitted.
A respective spectrogram 521, 522, 523, 524 may then be determined in dependence on each waveform 511, 512, 513, 514. Any signal processing technique for converting audio from an amplitude to a frequency domain may be utilised to determine each spectrogram. For example, with reference to the illustrated embodiment, a Fast Fourier Transform technique is described. In other examples, alternative techniques such as a Short-time Fourier Transform may be used. In the illustrated embodiment, each waveform is divided into p sliding time windows. Each of the p sliding time windows may have a length of between 5ms and 100ms, for example 10ms, 20ms, 30ms or 40ms. The p sliding time windows may be arranged to be overlapping. For example, the windows may overlap by between 30% and 70% of the time window length, such as 40%, 50% or 60%. Overlapping the windows beneficially ensures spectral information is captured over the whole length of the selected time window T. Then, a Fast Fourier Transform having q frequency bins is applied to each of the p time windows. The result can be represented as a spectrogram having dimensions p and q, as illustrated in Figure 5.
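By way of illustration only, a minimal sketch of this sliding-window FFT in Python/NumPy is given below. The sample rate, window length, overlap and number of frequency bins are hypothetical values chosen from the illustrative ranges above, and the Hann window and magnitude output are assumptions rather than requirements of the described method.

```python
# Hedged sketch: build a p x q spectrogram from one microphone's waveform
# using overlapping sliding windows and an FFT (all parameter values assumed).
import numpy as np

def spectrogram(waveform, sample_rate=16_000, win_ms=20, overlap=0.5, q=512):
    win = int(sample_rate * win_ms / 1000)           # samples per sliding window
    hop = max(1, int(win * (1 - overlap)))           # step between window starts
    starts = range(0, len(waveform) - win + 1, hop)  # the p window positions
    frames = np.stack([waveform[s:s + win] for s in starts])   # shape (p, win)
    frames = frames * np.hanning(win)                # taper each window (assumed)
    # magnitude FFT with q frequency bins per window -> spectrogram of shape (p, q)
    return np.abs(np.fft.rfft(frames, n=2 * (q - 1), axis=1))
```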
Returning to Figure 2, in step 220, each of the determined spectrograms is provided as input into a classification model. The classification model is a model for classifying an occupancy status of the vehicle at each of the N locations as one of a predetermined set of M occupancy types. The M occupancy types may in some embodiments comprise an occupied classification and a non-occupied classification. In other embodiments, the M occupancy types may comprise more granular occupancy types. For example, according to one embodiment the predetermined set of M occupancy types comprises five occupancy types: no occupant; an adult; a child; a dog; and a cat. In other embodiments, the M occupancy types may comprise only a subset of these five types, and/or may comprise additional types not mentioned, such as bird, hamster or the like. In another example, the "child" occupancy type could be further subdivided into a baby, a toddler, and an older child.
When respective audio data is obtained from r audio input devices as described with reference to Figure 5, each of the r spectrograms can be combined to obtain an input tensor I of dimensions p, q and r. For example, the r spectrograms may be combined by concatenating the spectrograms. This ensures distinct features within each spectrogram are spatially aligned in the input tensor I and the location is encoded in slight temporal misalignment, enabling accurate location identification of occupancy types. The input tensor I can therefore be provided as input to the classification model in step 220. Any suitable classification model may be used, for example an artificial neural network (ANN). An ANN based classifier can be particularly beneficial as the ANN can be constructed to perform automatic extraction of audio features, as will be explained. An ANN according to an embodiment will be described with reference to Figure 6. In other embodiments, audio features may be predefined for extraction from the audio data. The predefined audio features may include, for example, peak amplitude, Root Mean Square (RMS) value or power, however any audio feature may be extracted. The predefined audio features may then be provided as input to any traditional machine learning based classifier, such as a Support Vector Machine (SVM) or the like.
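For illustration, the input tensor could be assembled from the per-microphone spectrograms roughly as sketched below, reusing the spectrogram helper sketched earlier; the axis ordering shown follows the illustrative [r, q, p] layout used later in this description and is an assumption rather than a requirement.

```python
# Hedged sketch: stack r per-microphone spectrograms into the input tensor I.
import numpy as np

def build_input_tensor(waveforms, **spectrogram_kwargs):
    # waveforms: list of r time-aligned waveforms, one per audio input device
    specs = [spectrogram(w, **spectrogram_kwargs) for w in waveforms]  # each (p, q)
    return np.stack([s.T for s in specs], axis=0)                      # (r, q, p)

# e.g. four microphones and 3 s of audio at an assumed 16 kHz sample rate:
# I = build_input_tensor([w1, w2, w3, w4])   # roughly (4, 512, ~300)
```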
In some examples, the classification model may comprise a plurality of classification models. For example, the classification model may comprise a plurality of one dimensional convolutional neural network (CNN) based models, each one dimensional CNN model being configured to extract features from the audio data using a different kernel length. The classification model may further comprise additional models, such as the ANN of Figure 6, or an SVM based model as discussed above. The output of each of the plurality of classification models may be advantageously combined in an analogous method to that described in Figures 9 and 10. Using multiple classification models and subsequently combining the results can provide a higher confidence than any individual model.
Typically, the (or each) classification model is trained using any suitable supervised method to classify each of the locations N as one of the M occupancy types using annotated training data, as will be described with reference to Figure 11. The annotated training data comprises audio data obtained under a set of conditions having each of the M occupancy types at each of the N locations in each possible permutation. The amount of audio data required will depend on the characteristics of the occupancy types. For example, highly distinct occupancy types (cat vs adult) may require less training data than similar occupancy types (toddler vs child). For the illustrated example having the five occupancy types outlined above, at least a few minutes, at least 10 minutes or at least 15 minutes of audio data for each permutation of occupancy type at each location may be sufficient to train the classification model. For example, 25 minutes, 35 minutes or 30 minutes may be sufficient. In other embodiments, the amount of training data required may differ to be less or more than these examples. Within the audio data for each permutation, it is desirable to include a diverse set of conditions including at least the following:
* Diversity in location: including audio data from driving on highways, urban areas, small towns, passing traffic lights, static, and transitioning between static and moving;
* Diversity in background noise: including audio data in the presence of other vehicles, going over potholes, through a tunnel, through a city, past trains, having loud music within the cabin, diversity of road surface, diversity of weather;
* Different sounds from each occupancy type: adult/child speaking, singing, shouting, crying, whispering;
* Diversity within the occupancy types: using multiple occupants, different genders, diversity of ages, diversity of breed of dogs/cats.
A method of using this training data to train the classification model according to one embodiment of the invention will be described in further detail with reference to Figure 11.
In step 230, the classification model is configured to output one or more values Oij indicative of a probability of the occupancy of the location i = 1, ..., N being of type j = 1, ..., M. The structure of the output of the classification model may depend on the type of classification model implemented. In some embodiments, the one or more values may comprise a respective output vector Oi = {Oi1, Oi2, ..., OiM} for each of the set of N discrete locations i = 1, ..., N. Example output vectors according to an illustrative example may be as follows. Occupancy types: j=1: no occupant; j=2: an adult; j=3: a child; j=4: a dog; j=5: a cat.
O1 (front right seat) = (0.8, 0.02, 0.05, 0.12, 0.13)
O2 (front left seat) = (0, 0.77, 0.1, 0.02, 0.03)
O3 (rear right seat) = (0.1, 0.2, 0.66, 0.15, ...)
O4 (rear left seat) = (0.11, 0.02, 0.05, 0.73, 0.42)
Finally, the method 200 comprises outputting an occupancy classification signal in step 240 indicative of the one or more values Oij obtained from the classification model. In some embodiments, the occupancy classification signal may correspond directly to the one or more values Oij. For example, the occupancy classification signal may be indicative of the likelihood of each of the M occupancy types being present at each of the N locations. In other embodiments, a prediction for the occupancy type j at each location i may be determined in dependence on the one or more values Oij; and the occupancy classification signal may correspond to the determined prediction. For example, each location i may be predicted to have the occupancy type j having the greatest likelihood Oij. In the above illustrative example, this would correspond to predicting each occupancy type as follows: front right seat = no occupant; front left seat = adult; rear right seat = child; rear left seat = dog. In some embodiments, the occupancy classification signal may comprise a control signal to control one or more vehicle settings in dependence on the determined prediction. Thus, the vehicle may be controlled appropriately depending on the occupancy of each location. For example, according to one embodiment the control signal may be for controlling an alert system of the vehicle to output an alert to the user. This may be triggered, for example, if it is determined that a child or pet is unsupervised in the car. The alert may comprise an alarm, a horn, or an alert on a remote device such as a mobile device of the user.
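Purely as an illustration of this mapping from the values Oij to a per-location prediction, a small Python sketch follows; the location and type labels shown are hypothetical and only mirror the illustrative example above.

```python
# Hedged sketch: pick the most likely occupancy type j at each location i.
OCCUPANCY_TYPES = ["no occupant", "adult", "child", "dog", "cat"]            # j = 1..5
LOCATIONS = ["front right seat", "front left seat", "rear right seat", "rear left seat"]

def predict_occupancy(O):
    # O: list of N output vectors, each holding the M per-type values Oij
    return {loc: OCCUPANCY_TYPES[max(range(len(o)), key=lambda j: o[j])]
            for loc, o in zip(LOCATIONS, O)}

predictions = predict_occupancy([
    [0.8, 0.02, 0.05, 0.12, 0.13],   # front right seat -> "no occupant"
    [0.0, 0.77, 0.1, 0.02, 0.03],    # front left seat  -> "adult"
])
```

An occupancy classification signal or control signal could then be derived from these predictions, for example raising an alert when a child or pet is predicted without an accompanying adult (a hypothetical rule, given only as an example).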
According to another embodiment, a control signal may be for controlling a cabin environment, such as an audio volume or HVAC settings. The cabin environment may be controlled differently at the different locations depending on the determined occupancy status at each location. For example, audio may be selectively output in locations having an occupant, or specifically locations having an adult occupant. Audio may be selectively reduced in volume in locations having no occupant, a child, or a pet. HVAC settings may be adjusted such that climate control is only utilised in occupied regions of the vehicle, or may be adjusted differently depending on the occupant type.
Turning now to Figure 4, there is illustrated a control system 310 for a vehicle. The control system may form part of the system 300 and comprises one or more controller 400. The control system is arranged to perform the method 200. According to some embodiments, at least one of the controllers 400 may be arranged to be installed in the vehicle. However, in other embodiments, one or more of the controllers 400 may be remote to the vehicle and arranged to communicate with other vehicle systems via a remote connection, such as the Internet.
The control system 310 is configured to receive the respective audio data 201-1, ..., 201-r from each of the r audio input devices 320-1, ..., 320-r of the system 300 and determine the spectrograms 521, ..., 524. The control system 310 is then configured to provide the spectrograms as input to the classification model and obtain from the classification model the one or more values Oij indicative of a probability of the occupancy of the location i being of type j; and output the occupancy classification signal. As discussed, the occupancy classification signal may comprise at least one control signal to control one or more vehicle systems in dependence on the one or more values Oij.
The control system 310 as illustrated in Figure 4 comprises one controller 400, although it will be appreciated that this is merely illustrative. The controller 400 comprises processing means 410 and memory means 420. The processing means 410 may be one or more electronic processing device 410 which operably executes computer-readable instructions. The memory means 420 may be one or more memory device 420. The memory means 420 is electrically coupled to the processing means 410. The memory means 420 is configured to store instructions, and the processing means 410 is configured to access the memory means 420 and execute the instructions stored thereon.
The controller 400 comprises an input means 430 and an output means 440. The input means 430 may comprise an electrical input 430 of the controller 400. The output means 440 may comprise an electrical output 440 of the controller 400. The input 430 is arranged to receive a respective audio signal 435 from each of the plurality r of audio input devices 320-1, ..., 320-r. The audio signal 435 is an electrical signal which is indicative of the respective audio data 201-1, ..., 201-r. The output 440 is arranged to output the occupancy classification signal 445, such as the control signal 445.
The method 200 may be performed by the control system 310 illustrated in Figure 4. In particular, the memory 420 may comprise computer-readable instructions which, when executed by the processor 410, perform the method 200 according to an embodiment of the invention. According to some embodiments, the memory 420 may be arranged to store the classification model. However, according to other embodiments, the classification model may be stored remotely to the control system 310, for example on a remote server. In such embodiments, the control system 310 is configured to communicate the spectrograms to the remote location of the classification model, and obtain the one or more values Oij from the remote location of the classification model.
Turning now to Figure 6, according to some embodiments of the invention, the classification model comprises an artificial neural network (ANN). Figure 6 illustrates a classification model 600 according to one embodiment in the form of an ANN 600. The ANN 600 comprises a number of bespoke features which tailor the ANN 600 to the problem of classifying occupancy data, and as a result the ANN 600 provides an improved accuracy of classification in comparison to traditional classification techniques such as an off-the-shelf ANN.
The ANN 600 is trained to classify the occupancy status of a vehicle at each of the set of N discrete locations within the vehicle i = 1, ..., N as one of the predetermined set of M occupancy types j = 1, ..., M, as has been described.
Globally, the ANN 600 comprises an input layer 605 having input nodes for receiving the input tensor I, and a plurality of output layers 650 including a respective output layer 65i corresponding to each of the i discrete locations i = 1, ..., N. Each output layer 65i comprises output nodes for outputting the one or more values Oij corresponding to that location i. The ANN 600 comprises one or more hidden layers having connecting nodes connecting the input layer 605 to each of the output layers 65i, the connections of the connecting nodes having weights. The weights are determined during the supervised training of the ANN 600, as will be explained with reference to Figure 11.
The values of the input tensor I are provided as input to the input nodes of the input layer 605. When the spectrograms are determined as described with reference to Figure 5, the input tensor I has dimensions [r, q, p], wherein r is the number of audio input devices, q the number of frequency bins used during the FFT and p the number of sliding windows into which the selected time window is divided. According to an illustrative example with 4 audio input devices, 512 frequency bins and a 3s selected time window divided into 20ms sliding windows having 10ms overlap, the input tensor has dimensions [4, 512, 300].
The ANN 600 comprises a backbone sub-network 610 having connecting nodes connecting the input layer 605 to a feature map layer 615. The purpose of the backbone sub-network is to extract at least one audio feature map from the spectrogram input. Because each audio signal is converted into a spectrogram, any neural network architecture suitable for computer vision can be employed as the backbone sub-network 610, such as ResNet, MobileNet or EfficientNet. Typical inputs to such architectures are images of size [channels x height x width], comparable to the input tensor [4, 512, 300]. As the amount of data input to the ANN 600 is relatively small compared to typical computer vision applications, a relatively shallow architecture can be employed for the backbone sub-network 610, for example a ResNet-34 having 34 layers.
The audio feature map is typically downsampled in comparison to the input tensor, and may have dimensions [s, q/d, p/d] wherein s is the number of feature map channels, and d is the level of downsampling. For example, in the illustrative example having an input tensor of dimensions [4, 512, 300], the feature map may have dimensions [64, 16, 9] and the feature map layer 615 thus comprises 64 x 16 x 9 nodes. This corresponds to a downsampling level d of 16, and 64 channels. However, in other embodiments the values of s and d may be adjusted. The values d = 16 and s = 64 enable minimal modifications to be made to a typical computer vision backbone architecture such as ResNet-34. These minimal modifications comprise removing the convolution and pooling layers from the backbone architecture and substituting a 1x1 convolution to obtain the 64 channels. These modifications enable application specific pooling to be applied in the ANN 600, as will be explained.
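A hedged PyTorch sketch of one way such a backbone could be set up is given below: a standard torchvision ResNet-34 with its classification head removed, its first convolution adapted to r input channels, and a 1x1 convolution producing s = 64 feature-map channels. The exact layers removed or substituted, and the resulting downsampling factor, are assumptions for illustration rather than the specific modifications described above.

```python
# Hedged sketch of a backbone sub-network along the lines described (assumed layers).
import torch.nn as nn
from torchvision.models import resnet34

class Backbone(nn.Module):
    def __init__(self, r=4, s=64):
        super().__init__()
        net = resnet34(weights=None)
        # accept r spectrogram channels instead of 3 RGB channels
        net.conv1 = nn.Conv2d(r, 64, kernel_size=7, stride=2, padding=3, bias=False)
        # keep the convolutional stages; drop the global pooling and fc head
        self.features = nn.Sequential(*list(net.children())[:-2])
        # 1x1 convolution to obtain the s feature-map channels
        self.to_s_channels = nn.Conv2d(512, s, kernel_size=1)

    def forward(self, x):                               # x: [batch, r, q, p]
        return self.to_s_channels(self.features(x))     # [batch, s, q/d, p/d]
```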
The ANN 600 comprises two pooling sub-networks 620, 630. The first pooling sub-network is a mean pooling sub-network 620 and is common to each of the output layers 650. The mean pooling sub-network 620 acts to pool the frequency dimension q/d of the feature map.
With reference to Figure 7, the mean pooling sub-network 620 connects the feature map layer 615 to a mean pooled layer 710. The mean pooling sub-network acts to pool the feature map output over the frequency dimension q/d such that the output at the mean pooled layer 710 is a mean pooled tensor having dimensions [s, p/d]. Any kind of averaging can be applied over the frequency dimension, for example a mean may be implemented as in the illustrated example.
The second pooling sub-network 630 provides attention pooling over the time dimension, and is applied separately for each of the N discrete locations.
Thus, at this point the ANN 600 splits out into separate branches for each of the N discrete locations, including a respective attention pooling sub-network 63i for each discrete location i.
With reference to Figure 8, each attention pooling sub-network 63i takes as input the values of the mean pooled tensor. The attention pooling sub-network 63i comprises at least one fully connected layer 810 which maps the mean pooled layer 710 to an attention score vector 820 having length p/d. Thus, a separate fully connected layer 810 is utilised to determine an attention score over the time dimension p/d. Such attention pooling has been found to be beneficial as a separate fully connected layer computes the relevance of the feature at a given time. As depicted in the illustrative example of Figure 8, the attention score vector 820 emphasises the last two samples with a score of 0.4 and 0.5.
As the attention pooling sub-network 63i is applied separately for each location i, the ANN 600 can focus on different time spans within the audio data for different locations i. For example, the occupant at a first location may be audible at a different time to the occupant at a second location. The attention pooling sub-network 63i thus acts to identify time points in the audio data which appear to relate to the respective location i. A matrix product 830 is then applied to pool the mean pooled tensor over the time axis according to the attention score vector 820. The matrix product 830 defines a product of the mean pooled tensor [s, p/d] with the attention score vector [p/d, 1] to obtain a location specific audio feature vector fa,i having length s. In this way, a location specific audio feature vector fa,i is obtained for each location i, defining audio features likely to be relevant to the specific location i.
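For illustration, the mean pooling over frequency and the per-location attention pooling over time might be sketched as follows in PyTorch; the single linear scoring layer and the softmax normalisation of the attention scores are assumptions (the scores in the Figure 8 example are not necessarily normalised this way).

```python
# Hedged sketch: mean pooling over frequency followed by attention pooling over time.
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, s=64):
        super().__init__()
        self.score = nn.Linear(s, 1)      # one attention score per time step (assumed)

    def forward(self, fmap):              # fmap: [batch, s, q/d, p/d]
        pooled = fmap.mean(dim=2)         # mean pool over frequency -> [batch, s, p/d]
        scores = self.score(pooled.transpose(1, 2))     # [batch, p/d, 1]
        alpha = torch.softmax(scores, dim=1)            # attention over the time axis
        # matrix product [s, p/d] x [p/d, 1] -> location specific feature f_a,i
        return torch.bmm(pooled, alpha).squeeze(-1)     # [batch, s]
```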
Returning to Figure 6, the ANN 600 comprises a head portion 640. The head portion 640 comprises a respective head sub-network 64i for each discrete location i. The location specific audio feature vector fa,i determined for the location i is provided as input to input nodes of the head sub-network 64i. The head sub-network 64i thus comprises at least s input nodes, one for each channel of the location specific audio feature vector. The head sub-network comprises connecting nodes between the input nodes of the head sub-network 64i and output nodes of the output layer 65i corresponding to the discrete location i in the set. Each output layer 65i comprises at least M output nodes, one for providing each of the one or more values Oij, j = 1, ..., M, indicating a likelihood of the occupancy being of type j.
Thus, each head sub-network 64i is trained to classify the occupancy status of the location i based on the location specific audio feature vector fa,i. Overall, it can be seen that the ANN 600 comprises a multi-headed structure, wherein each head sub-network 64i is tailored to identifying sounds extracted for a particular location from a common audio input. This facilitates efficient multi-location classification of occupancy.
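A hypothetical head sub-network of this kind, with an assumed single hidden layer and sigmoid outputs (consistent with a Binary Cross Entropy objective, but not mandated by the embodiment), could be sketched as:

```python
import torch
import torch.nn as nn

class LocationHead(nn.Module):
    """Sketch of one head sub-network 64i (hidden size is an assumption).

    Maps the location specific audio feature f_a,i (length s) to the M values
    O_ij, one per occupancy type j.
    """
    def __init__(self, channels, num_types, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_types),
        )

    def forward(self, f_a):                  # f_a: [batch, s]
        return torch.sigmoid(self.net(f_a))  # [batch, M] values O_ij
```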
In some circumstances, the use of audio data alone may not accurately classify occupancy type at all times, for example if one or more occupants of the vehicle is silent. According to some embodiments of the invention, this may be improved by fusing the method 200 with output from an external classification method using a different modality, such as from a visual or radar classification system of the vehicle. Including multi-modality input to a single classifier can be problematic due to the tendency for one or more modalities to dominate the classification. Therefore, according to the present invention, each modality provides an independent classification, which may then subsequently be fused according to Figures 9 and 10.
Figure 9 illustrates a flow chart of a method 900 of multi-modality fusion according to an embodiment of the invention. The method 900 may be performed in conjunction with the method 200.
The method comprises receiving audio data 901 comprising the one or more values Oij determined from the classification model of the method 200. As discussed, the one or more values Oij are each indicative of the probability of the occupancy of the location i being of type j according to the audio occupancy status detection system 300.
The method 900 further comprises receiving, from a second classification system of the vehicle, second classification data 902. The second classification system is configured to classify the occupancy type at each of the N locations using a non-audio modality. In some embodiments, as will be described, the second classification system is a visual classification system and the second classification data 902 is visual data 902. However, it will be appreciated that in other embodiments the second classification system may use a non-visual modality, such as radar, lidar or ultrasonic sensing. The visual classification system (not illustrated) comprises a plurality of image sensors configured to capture images of at least the N locations within the vehicle and a control system, which may be the same control system 400 as used in the occupancy status detection system 300, or may be a separate control system. The visual classification system is configured to output one or more values Vij indicative of a probability of the occupancy of the location i being of type j in dependence on the images captured by the image sensors. The visual classification system may employ a visual classification model, such as an ANN trained using images captured by the image sensors or equivalent image sensors. The visual data 902 received comprises at least the one or more values Vij.

The method 900 comprises a step 910 of determining, in dependence on the one or more values Oij and the further one or more values Vij, a fused classification probability of the occupancy of the location i being of type j. That is, step 910 comprises determining a fused prediction of occupancy type at each location in dependence on the probabilities determined independently by the audio occupancy detection system 300 and the visual classification system.
In some embodiments, the fused classification probability of the occupancy of the location i being of type j may be determined as a weighted sum of Oij and Vij. Advantageously, utilising both audio and visual classification results in improved confidence compared to either modality individually. Further, decoupling the modalities leads to a higher classification confidence as the predictions are independent.
Although Figure 9 illustrates only two modalities, the concept may be extended to fuse predictions from three or more modalities, for example from the audio occupancy detection system, a visual classification system and a radar classification system. Respective predictions are obtained from each system and weighted as described.
How the values Oij and Vij are weighted in step 910 may be approached in various ways. In a naive system, the weights may be fixed based on a pre-defined ratio, such as a weighting of 0.7 for the visual values Vij and a weighting of 0.3 for the audio values Oij. These fixed weights may be defined based on the relative average accuracy of each classification system. The weights may be fixed across all j occupancy types, or the weights may be defined differently for the different occupancy types j. For example, one modality may be shown to more accurately identify an occupancy type of "no occupancy" than the other modality. However, the other modality may be shown to more accurately identify an occupancy type of "cat". Therefore, different weights may be applied to combine the probabilities for the different occupancy types.
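A worked sketch of such a fixed-weight fusion for a single location, using entirely hypothetical probabilities and weights, is:

```python
import numpy as np

# Illustrative fixed-weight fusion for one location i. The weights and the
# probabilities below are assumptions for the example, not values from the source.
O = np.array([0.10, 0.70, 0.20])      # hypothetical audio probabilities O_ij, j = 1..M
V = np.array([0.05, 0.85, 0.10])      # hypothetical visual probabilities V_ij, j = 1..M
w_audio = np.array([0.3, 0.3, 0.5])   # e.g. trust audio more for one occupancy type
w_visual = 1.0 - w_audio

fused = w_audio * O + w_visual * V    # fused classification probability per type j
prediction = fused.argmax()           # predicted occupancy type at location i
```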
In other embodiments, a more sophisticated method may be employed to define the weights in the method 900, as illustrated in Figure 10. According to the illustrated embodiment of Figure 10, one or more fusion neural networks 1011, ..., 101N may be implemented to predict the weights. In this way, the confidence of each modality can be predicted in real time depending on the specific output of each of the audio occupancy detection system 300 and the visual classification system.
In this embodiment, the received audio data 901 comprises the one or more values Oij as well as each location specific audio feature fa,i. The received visual data 902 comprises the further one or more values Vij as well as a location specific visual feature fv,i associated with each of the N discrete locations i = 1, ..., N and extracted from the classification model used by the visual classification system.
As illustrated, a respective fusion neural network 1011, ..., 101N is defined for each of the locations i = 1, ..., N. The method will be described with reference to the first fusion neural network 1011; however, the description may be extrapolated to each of the remaining fusion neural networks 1012, ..., 101N using the relevant data corresponding to each location.
The first fusion neural network 1011 is trained to predict a confidence for each of the audio data 901 and the visual data 902 for classifying each of the M occupancy types at the first location i = 1. The fusion neural network 1011 comprises an input layer having input nodes for receiving the values O11, ..., O1M, the location specific audio feature fa,1, the values V11, ..., V1M and the location specific visual feature fv,1. The fusion neural network 1011 comprises an output layer having output nodes arranged to output one or more weight values for weighting the visual values V1j and the audio values O1j. At least one respective weight value is output for each of the M occupancy types j = 1, ..., M. In some embodiments, an audio weight value Wa,j to be applied to the audio value O1j and a visual weight value Wv,j to be applied to the visual value V1j are each output at the output nodes. However, in other embodiments a single weight value, for example Wa,j, is output, and the weight value to be applied to the other modality can be inferred through a constraint, e.g. with the constraint that Wa,j + Wv,j = 1.
The fusion neural network 1011 comprises one or more hidden layers having connecting nodes connecting the input layer to the output layer. The connections of the connecting nodes have weights which are adjusted during training of the fusion neural network. The fusion neural network 1011 has been trained to predict the confidence of each modality, i.e. the relative weights to be applied, at the first location using a set of cross-modality training data. The set of cross-modality training data comprises audio data from a plurality of audio input devices disposed about a training vehicle and visual data from one or more imaging devices disposed about the training vehicle, wherein the training vehicle is arranged equivalently to the vehicle 100, having both an audio occupancy status detection system 300 and a visual classification system. The cross-modality training data is captured under a set of conditions having each occupancy type j at each discrete location i.
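A minimal sketch of one such per-location fusion network, assuming a single hidden layer and the constraint Wa,j + Wv,j = 1 (the layer sizes and activation choices are assumptions), might be:

```python
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    """Sketch of one per-location fusion network 101i.

    Input: the M audio probabilities O_i, the audio feature f_a,i (length s),
    the M visual probabilities V_i and the visual feature f_v,i (length s_v).
    Output: the fused probability per occupancy type, using a predicted audio
    weight W_a,j and the inferred visual weight W_v,j = 1 - W_a,j.
    """
    def __init__(self, num_types, audio_dim, visual_dim, hidden=64):
        super().__init__()
        in_dim = 2 * num_types + audio_dim + visual_dim
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_types),
        )

    def forward(self, O_i, f_a, V_i, f_v):
        w_audio = torch.sigmoid(self.net(torch.cat([O_i, f_a, V_i, f_v], dim=-1)))
        w_visual = 1.0 - w_audio
        return w_audio * O_i + w_visual * V_i   # fused probability per type j
```

In such a sketch the sigmoid keeps each predicted weight in [0, 1], so the inferred visual weight is bounded in the same way.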
Decoupling the audio and visual modalities initially, and then providing a separate fusion network 1011, ..., 101N for each discrete location, allows the original classification models (such as the ANN 600 and the equivalent classification model for the visual system) to be trained separately on large single-modality datasets. Only a relatively small cross-modality training set is required to train each fusion network 1011, ..., 101N.
The method 900 may be performed as part of the method 200, as has been explained. Thus, the occupancy classification signal of step 240 may be output in dependence on each fused classification probability determined in step 910. In this way the occupancy classification signal will reflect an overall classification of the occupancy at each location, based on the fusion of the audio classification with a second modality, such as visual classification.
With reference to Figure 11, there is illustrated a supervised method 1100 of training the classification model of the method 200. For example, the method 1100 may be used to train the ANN 600, however it will be appreciated that the training method may be adapted in any embodiment where an alternative classification model is used.
The method comprises receiving a set of training data 1110. The training data 1110 comprises a set of input-output pairs, each pair comprising audio data 1111 and corresponding classification data 1112 indicative of a classification of the occupancy at each location i as one of the types j = 1, ..., M. The audio data 1111 is captured from a plurality of audio input devices disposed about a training vehicle. In some embodiments, the training vehicle may be the same vehicle in which the occupancy status detection system 300 is installed during use. In other embodiments, the training vehicle may be a separate vehicle set up in the same way, with the same arrangement of audio input devices. In this way, the classification model may be centrally trained in a single training vehicle and deployed on a fleet of vehicles having the same arrangement of audio input devices. Each input-output pair may correspond to a different occupancy type j at each location i, and overall the training data should include input-output pairs corresponding to every permutation of occupancy type at each location. Further requirements for the training data, such as diversity of occupant types and diversity of background noise, are as described earlier with reference to Figure 2.
In step 1120, one input-output pair is selected from the set of training data 1110. An input tensor I for the classification model is determined from the audio data 1111 of the input-output pair. The input tensor I may be determined as described with reference to steps 210, 220 and Figure 5. That is, a respective spectrogram is determined indicative of a frequency distribution of the audio data 1111 from each of the audio input devices over a selected time window, and the input tensor I is determined in dependence on the determined spectrograms, e.g. by concatenating the spectrograms. The input tensor I should be determined in the same way in method 1100 as in method 200.
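A minimal sketch of how the input tensor I might be assembled from r microphone signals, using an assumed FFT size and hop length (the exact parameters of the embodiment are not reproduced here), is:

```python
import torch

def input_tensor_from_audio(waveforms, n_fft=1024, hop_length=160):
    """Sketch: build the input tensor I from r microphone signals.

    waveforms: iterable of r 1-D tensors of equal length, one per audio input
    device. The FFT size and hop length are assumptions; with n_fft = 1024 the
    spectrogram has q = 513 frequency bins, so the q = 512 of the illustrative
    example would need a slightly different configuration.
    """
    spectrograms = []
    for x in waveforms:
        stft = torch.stft(x, n_fft=n_fft, hop_length=hop_length,
                          window=torch.hann_window(n_fft), return_complex=True)
        spectrograms.append(stft.abs())        # magnitude spectrogram, [q, p]
    return torch.stack(spectrograms, dim=0)    # input tensor I, [r, q, p]
```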
In step 1130, the input tensor I is input into the classification model, as described with reference to step 220. The classification model trained in method 1100 is the same classification model as used, once trained, in the method 200. In one illustrative embodiment as described, the classification model is the ANN 600 and so reference will be made to the ANN 600 in the context of the method 1100. For the ANN 600, step 1130 comprises providing the input tensor I as input to the input nodes of the input layer 610.
In step 1140, a predicted set of values Pij, i = 1, ..., N, j = 1, ..., M, is obtained from the classification model. The predicted set of values Pij is indicative of a predicted probability of the occupancy of the location i being of type j for the audio data 1111. Step 1140 may be equivalent to obtaining the one or more values Oij in step 230. With reference to the ANN 600, the predicted set of values Pij is obtained by mapping the input tensor I through the connecting nodes of the ANN 600 to each of the output layers 650, such that the values Pij, j = 1, ..., M are obtained from output nodes of the output layer 65i.
In step 1150, an objective function is determined characterising an error between the predicted set of values Pij and the corresponding classification data 1112 of the input-output pair. For example, the objective function may be any suitable loss function for multi-class classification, such as Binary Cross Entropy.
In step 1150, an appropriate optimisation algorithm is used to update the classification model to seek to minimise the objective function. When the classification model is the ANN 600, updating the ANN 600 comprises updating the weights between connecting nodes of the ANN 600. Steps 1120 to 1150 are repeated for each of the input-output pairs, and in step 1150 the optimisation algorithm updates the classification model in light of the objective function calculated for each of the input-output pairs. Thus, the model is iteratively trained. The more training data is utilised, the further the objective function may be minimised, and so ideally as many input-output pairs as possible should be collected. As discussed with reference to Figure 2, in one embodiment around 5 minutes, 15 minutes or 30 minutes for each permutation of occupancy types at each location may facilitate the generation of a sufficient number of input-output pairs to train the ANN 600, but the amount of training data required will vary depending on the characteristics of the occupancy types.
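A minimal sketch of such an iterative training loop, assuming the Adam optimiser, a Binary Cross Entropy objective and hypothetical hyper-parameters (none of which are prescribed by the embodiment), might be:

```python
import torch
import torch.nn as nn

def train(model, pairs, epochs=10, lr=1e-4):
    """Sketch of the training loop.

    `model` maps an input tensor I to the predicted values P_ij for N locations
    and M occupancy types; `pairs` yields (I, targets) input-output pairs, with
    targets holding the classification data as 0/1 values per location and type.
    """
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    bce = nn.BCELoss()                          # Binary Cross Entropy objective
    for _ in range(epochs):
        for I, targets in pairs:                # one input-output pair at a time
            P = model(I.unsqueeze(0))           # predicted values P_ij, [1, N, M]
            loss = bce(P, targets.unsqueeze(0))
            optimiser.zero_grad()
            loss.backward()                     # gradients of the objective function
            optimiser.step()                    # update the connecting-node weights
```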
Thus, the present invention provides a novel, high-accuracy method for classifying occupancy types within a vehicle using audio data captured within the vehicle cabin. Furthermore, the present invention provides a new method for fusing the audio classification with other modalities, such as visual or radar based classification, to further improve the confidence of the classification beyond that of any individual modality.
It will be appreciated that various changes and modifications can be made to the present invention without departing from the scope of the present application. For purposes of this disclosure, it is to be understood that reference to 'the control system being configured to' is to be understood to mean 'the one or more controllers of the control system are collectively configured to'. The controller(s) described herein can each comprise a control unit or computational device having one or more electronic processors, the one or more processors collectively configured to perform the control system functionality set out in the control system claims.
Claims (15)
- CLAIMS

1. A computer-implemented method for occupancy status detection of a vehicle, the method comprising: receiving, from each of a plurality of audio input devices disposed about the vehicle, respective audio data; for each respective audio data: determining, in dependence on the audio data, a spectrogram indicative of a frequency distribution of the audio data over a selected time window; inputting each determined spectrogram into a classification model for classifying an occupancy status of a vehicle at each of a set of N discrete locations within the vehicle i = 1, ..., N as one of a predetermined set of M occupancy types j = 1, ..., M; obtaining, from the classification model, one or more values Oij indicative of a probability of the occupancy of the location i being of type j; and outputting an occupancy classification signal indicative of the one or more values Oij.

2. The method of claim 1, the method comprising: determining a prediction for the occupancy type j at each location i in dependence on the one or more values Oij, and outputting the occupancy classification signal indicative of the determined prediction.

3. The method of claim 2, further comprising outputting a control signal to control one or more vehicle settings in dependence on the determined prediction.

4. The method of any preceding claim, wherein determining each spectrogram comprises: dividing the audio data into p sliding time windows, and applying a Fast Fourier Transform with q frequency bins to each of the p time windows to determine a spectrogram having dimensions p and q.

5. The method of claim 4, comprising receiving respective audio data from r audio input devices, combining each spectrogram to obtain an input tensor I of dimensions p, q and r, and inputting the input tensor I into the classification model.

6. The method of any preceding claim, wherein: the classification model comprises an artificial neural network for classifying the occupancy status of a vehicle at each of the set of N discrete locations within the vehicle i = 1, ..., N as one of the predetermined set of M occupancy types j = 1, ..., M; and the method comprises: determining an input tensor for the artificial neural network in dependence on each of the determined spectrograms; inputting the input tensor to input nodes of an input layer of the neural network; mapping the input tensor through connecting nodes of the neural network to output nodes of each of a plurality of output layers to obtain each of the one or more values Oij.

7. The method of any preceding claim, wherein the artificial neural network comprises a backbone sub-network having connecting nodes connecting the input layer to a feature map layer, the connections of the connecting nodes having weights.

8. The method of claim 7, wherein the artificial neural network comprises a respective head sub-network for each discrete location i in the set, i = 1, ..., N, and wherein each respective head sub-network comprises connecting nodes between the feature map layer and the output layer corresponding to the discrete location i in the set, the connections of the connecting nodes having weights.

9. The method of claim 8, wherein the artificial neural network further comprises a respective attention pooling sub-network for each discrete location i in the set, i = 1, ..., N, wherein each respective attention pooling sub-network comprises connecting nodes connecting the feature map layer to an input layer of the respective head sub-network, wherein the input layer of each respective head sub-network is for receiving values indicative of a location specific audio feature fa,i.

10. The method of any preceding claim, comprising: receiving, from a visual classification system of the vehicle, visual data comprising a further one or more values Vij indicative of a probability of the occupancy of the location i being of type j according to the visual classification system; and determining, in dependence on the one or more values Oij and the further one or more values Vij, an overall probability of the occupancy of the location i being of type j.

11. The method of claim 10, comprising: receiving, from the visual classification system of the vehicle, at least one location specific visual feature fv,i associated with each of the N discrete locations i = 1, ..., N; for each of the N discrete locations i = 1, ..., N: receiving a fusion neural network for predicting a confidence of each of the one or more values Oij and the further one or more values Vij at the location i; inputting, to input nodes of an input layer of the fusion neural network, the one or more values Oij, the location specific audio feature fa,i, the further one or more values Vij and the location specific visual feature fv,i; mapping the input through connecting nodes of the fusion neural network to output nodes of an output layer to obtain one or more weight values; and determining a fused classification probability for each occupancy type j as a combination of the audio probability Oij and the visual probability Vij, weighted according to the weight values; and outputting the occupancy classification signal in dependence on each fused classification probability.

12. A control system for controlling an occupancy status detection system for a vehicle, the control system comprising one or more controllers, the control system configured to: receive, from each of a plurality of audio input devices disposed about the vehicle, respective audio data; for each respective audio data: determine, in dependence on the audio data, a spectrogram indicative of a frequency distribution of the audio data over a selected time window; input each determined spectrogram into a classification model for classifying an occupancy status of a vehicle at each of a set of N discrete locations within the vehicle i = 1, ..., N as one of a predetermined set of M occupancy types j = 1, ..., M; obtain, from the classification model, one or more values Oij indicative of a probability of the occupancy of the location i being of type j; and output an occupancy classification signal indicative of the one or more values Oij.

13. An occupancy status detection system for a vehicle comprising: the control system of claim 12; and a plurality of audio input devices arranged to be disposed about the vehicle, wherein the plurality of audio input devices are arranged to communicate the respective audio data to the control system.

14. A vehicle comprising the control system of claim 12 or the occupancy status detection system of claim 13.

15. A computer implemented method of training a classification model for classifying an occupancy status of a vehicle at each of a set of N discrete locations within the vehicle i = 1, ..., N as one of a predetermined set of M occupancy types j = 1, ..., M, wherein the method comprises: receiving a set of training data to train the classification model, the set of training data comprising audio data from a plurality of audio input devices disposed about a training vehicle under a set of conditions having each occupancy type j at each location i, the set of training data comprising input-output pairs, wherein each input-output pair corresponds to a different occupancy type j at each location i, each pair comprising audio data from each of the audio input devices and corresponding classification data indicative of a classification of the occupancy of each location i being of one of the types j for the associated audio data; for each input-output pair: determining a respective spectrogram indicative of a frequency distribution of the audio data from each of the audio input devices over a selected time window; determining an input tensor I for the input-output pair in dependence on each of the determined spectrograms; inputting the input tensor I into the classifier; obtaining, from the classifier, a predicted set of values Pij indicative of a predicted probability of the occupancy of the location i being of type j; determining an objective function characterising an error between the predicted set of values Pij and the corresponding classification data for each of the one or more input-output pairs; and using an appropriate optimisation algorithm operating on the objective function, updating the classifier to seek to minimise the objective function.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GB2308652.3A GB2630812A (en) | 2023-06-09 | 2023-06-09 | Occupancy status detection for a vehicle |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| GB202308652D0 GB202308652D0 (en) | 2023-07-26 |
| GB2630812A true GB2630812A (en) | 2024-12-11 |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20190259410A1 (en) * | 2016-10-19 | 2019-08-22 | Ford Global Technologies, Llc | Vehicle Ambient Audio Classification Via Neural Network Machine Learning |
| US20230047872A1 (en) * | 2021-08-10 | 2023-02-16 | GM Global Technology Operations LLC | Multimodal occupant-seat mapping for safety and personalization applications |
| US20230178061A1 (en) * | 2021-12-08 | 2023-06-08 | Hyundai Motor Company | Method and device for personalized sound masking in vehicle |
Non-Patent Citations (1)
| Title |
|---|
| HE ET AL. - "Deep Residual Learning for Image Recognition," 10th December 2015, arxiv.org/abs/1512.03385 * |