WO2016068760A1

WO2016068760A1 - Video stream synchronization

Info

Publication number: WO2016068760A1
Application number: PCT/SE2014/051263
Authority: WO
Inventors: Heidi-Maria BACK; Le Wang; Miljenko OPSENICA; Tomas Mecklin
Original assignee: Telefonaktiebolaget L M Ericsson (Publ)
Priority date: 2014-10-27
Filing date: 2014-10-27
Publication date: 2016-05-06

Abstract

Video synchronization is achieved by transmitting timestamps to user devices (1, 2, 3), which return the timestamps together with frame fingerprints. The user devices (1, 2, 3) also transmit video streams (21, 22, 23) that are decoded to get decoded video frames (31, 32, 33). The received frame fingerprints are compared to frame fingerprints generated for the decoded video frames (31, 32, 33) in order to find a match. The decoded video frames (31, 32, 33) that generated a respective frame fingerprint match are assigned a respective estimated capture time determined based on the time stamps and current system times. The video streams (21, 22, 23) from the different user devices (1, 2, 3) are time aligned based on the estimated capture times assigned to decoded video frames (31, 32, 33).

Description

VIDEO STREAM SYNCHRONIZATION

TECHNICAL FIELD

The present embodiments generally relate to video stream synchronization, and in particular to synchronization of video streams originating from multiple user devices recording a scene.

BACKGROUND

The advance of high-speed mobile Internet and capacity of user devices, such as mobile phones, smartphones and tablets, has given rise to a new way of consuming mobile live video streaming services. There is also a high demand from users to film a social event, e.g. a football game or a music festival in order to present the users' own version of storytelling. The emerging applications allow users to produce videos collaboratively using multiple mobile cameras, in a manner similar to how professional live TV is produced. As shown in Fig. 1 , the scenario includes three user roles, namely producers, directors and consumers. The producers are users with user devices 1 , 2, 3 who collaboratively record and stream video feeds, for example, in a stadium to application servers or a server system 10. A mixed view of video feeds enables the directors to conduct video direction and rich-content assertion. The consumers, thus, are able to watch live broadcast of the event from different viewpoints based on the directors' selection rather than only few options provided by traditional TV broadcasting.

In a social multimedia environment, it is desirable for directors to monitor synchronized video streams from the producers. Simply simultaneously sending each video stream to its physical output hardware will not necessarily ensure synchronization. In professional live video production, the synchronization among multiple camera feeds is done by specialized hardware. However, this approach is not practical when streaming video from user devices 1 , 2, 3 via wireless connection. The reason being that delay is an inherent feature of wireless networks and network congestion often happens when the volume of data traffic goes up. This implies that each user device 1 , 2, 3 experiences different network delays, which may further vary for a given user device 1 , 2, 3 over time. As a consequence, the differences and variations in network delay cause the arrival time of each video stream at the server system 10 to be different. The divergence in arrival time has great impact on the perceived video frames resulting in asynchrony in the live feeds presented to the directors. This means that the directors will not be able to edit the multiple video streams in synchrony. As shown in Fig. 2 illustrating video streams 21 , 22, 23 from user devices 1 , 2, 3, the marked video frames 31 , 32, 33 are taken by cameras of the user devices 1 , 2, 3 at the same time. Due to the network delay, the time when the marked video frames 31 , 32, 33 arrive at the server system 10 is different. Thus, one of the most import requirements of social video streaming is adequate synchronization so that each video stream is aligned to each other. The multi-producer video filming turns out to be a problem of asynchrony, which has to be solved. Various techniques for achieving synchronization among video streams have been proposed in the art.

In a solution clock synchronization is used. Synchronization offsets are calculated using timestamps generated by the cameras' internal clocks on the user devices 1 , 2, 3. This solution is one of the most processing efficient methods. However, some user devices 1 , 2, 3 do not have an internal high- resolution clock. Thus, clock drift and skew may cause the user devices 1 , 2, 3 out of synchronization. In addition, the solution requires all the user devices 1, 2, 3 to synchronize with a centralized Network Time Protocol (NTP) server. The transmission delay between each user device 1 , 2, 3 and the server system 10 would also vary from each other, especially when wireless network is highly congested. Hence, this solution will not be practicable for a typical user case to achieve video stream synchronization involving multiple user devices 1 , 2, 3.

In another solution audio fingerprints are extracted from audio streams and compared to find a match among all the audio streams when multiple cameras are recording the same event. By comparing the occurrence of similar sound matches, it may be possible to calculate the synchronization offset. However, this solution requires all the user devices 1 , 2, 3 to be close enough to the event since the speed of sound is much slower than the speed of light. The sound, recorded by a user device 1 , 2, 3 that is closer to the sound source, could be up to one second ahead as compared to the sound recorded by another user device 1 , 2, 3, when watching a sport game in large stadium. Furthermore, the noise generated by the crowds would also decrease the accuracy of finding suitable audio fingerprints. This means that audio fingerprinting will generally not be very reliable to achieve video stream synchronization involving multiple user devices 1, 2, 3.

In a further solution external hardware synchronized cameras or so-called inter-camera synchronization is assumed. Such a solution requires physically connecting the cameras of the user devices 1 , 2, 3 to external synchronization hardware. It is often used in professional live video production. However, in the social video streaming scenario, synchronizing all users' user devices 1 , 2, 3 in a social event is not practical and nearly impossible. In yet another solution timestamps are added to the video streams by having new features implemented in base stations in mobile communication networks. However, a problem is that not all user devices 1 , 2, 3 are connected to the Internet via same network provider, and some of them may be connected via Wireless Local Area Network (WLAN) provided by the event organizer. In order to overcome such a problem, this solution has to access each base station and WLAN access provider, which introduces complicated management issues in heterogeneous networks and increases corresponding cost.

A further solution involves analyzing the incoming video streams, and monitoring the sequence of video frames for the occurrence of at least one of a plurality of different types of visual events. The occurrence of a selected visual event should be detected among all the video streams and taken as a marker to synchronize all video streams. However, this solution requires all user devices 1 , 2, 3 recording at least one common visual event in order to find the marker among all the video streams from each user device 1 , 2, 3. If the user devices 1 , 2, 3 are focusing on different parts of the event, there is no way for this solution to identify the marker.

US 2011/0043691 discloses a method for synchronizing at least two video streams originating from at least two cameras having a common visual field. This solution requires studying trajectories of objects of a scene. It is not adapted for a situation where multiple users are filming at the same time but at different parts of an event.

A further shortcoming of several of the above mentioned prior art solutions is that they require the user and/or viewer to install proprietary applications on their user devices 1 , 2, 3, which make the solutions less desirable to provide video stream synchronization.

There is therefore a need for an efficient solution to achieve synchronization of video streams originating from different user devices 1 , 2, 3.

SUMMARY

It is a general objective to achieve video synchronization between video streams originating from different user devices.

This and other objectives are met by embodiments as disclosed herein. An aspect of the embodiments relates to a video synchronization method comprising, for each user device of multiple user devices, receiving a video stream of encoded video frames over a wireless media channel from the user device. The method also comprises transmitting, to the user device and over a wireless peer-to-peer channel, a timestamp generated based on a current system time. The method further comprises receiving a frame fingerprint and the timestamp from the user device over the wireless peer-to-peer channel. The method additionally comprises determining an estimated capture time of a video frame, used to generate the received frame fingerprint, based on the timestamp and a current system time. The method also comprises decoding the video stream to get decoded video frames. The method further comprises comparing the received frame fingerprint with a respective frame fingerprint generated for the decoded video frames. The method further comprises assigning the estimated capture time to a decoded video frame based on the comparison. The method also comprises time aligning video streams from the multiple user devices based on the assigned estimated capture times. Another aspect of the embodiments relates to a method for enabling video synchronization. The method comprises a user device transmitting a video stream of encoded video frames to a video synchronization server system over a wireless media channel. The method also comprises the user device generating a frame fingerprint of a current video frame upon reception of a timestamp from the video synchronization server system over a wireless peer-to-peer channel. The method further comprises the user device transmitting the frame fingerprint and the timestamp to the video synchronization server system over the wireless peer-to-peer channel.

A further aspect of the embodiments relates to a video synchronization server system. The video synchronization server system is configured to receive a video stream of encoded video frames over a wireless media channel from each user device of multiple user devices. The video synchronization server system is also configured to transmit, to each user device of the multiple user devices and over a wireless peer-to-peer channel, a timestamp generated based on a current system time. The video synchronization server system is further configured to receive a frame fingerprint and the timestamp from each user device of the multiple user devices over the wireless peer-to-peer channel. The video synchronization server system is additionally configured to determine, for each user device of the multiple user devices, an estimated capture time of a video frame, used to generate the received frame fingerprint, based on the timestamp and a current system time. The video synchronization server system is also configured to decode, for each user device of the multiple user devices, the video stream to get decoded video frames. The video synchronization server system is configured to compare, for each user device of the multiple user devices, the received frame fingerprint with a respective frame fingerprint generated for the decoded video frames. The video synchronization server system is further configured to assign, for each user device of the multiple user devices, the estimated capture time to a decoded video frame based on the comparison. The video synchronization server system is also configured to time align video streams from the multiple user devices based on the assigned estimated capture times.

Yet another aspect of the embodiments relates to a video synchronization server system comprising a timestamp generator for generating, for each user device of multiple user devices, a timestamp based on a current system time. The timestamp is output for transmission to the user device over a wireless peer-to-peer channel. The video synchronization server system also comprises a time estimator for determining, for each user device of the multiple user devices, an estimated capture time of a video frame, used to generate a frame fingerprint, received from the user device with a timestamp over the wireless peer-to-peer channel, based on the timestamp and a current system time. The video synchronization server system further comprises a decoder for decoding, for each user device of the multiple user devices, a video stream of encoded video frames, received from the user device over a wireless media channel, to get decoded video frames. The video synchronization server system additionally comprises a comparator for comparing, for each user device of the multiple user devices, the received frame fingerprint with a respective frame fingerprint generated for the decoded video frames. The video synchronization server system further comprises an assigning unit for assigning, for each user device of the multiple user devices, the estimated capture time to a decoded video frame based on the comparison. The video synchronization server system additionally comprises a time aligner for time aligning video streams from the multiple user devices based on the assigned estimated capture times.

A further of the embodiments relates to a user device that is configured to transmit a video stream of encoded video frames to a video synchronization server system over a wireless media channel. The user device is also configured to generate a frame fingerprint of a current video frame upon reception of a timestamp from the video synchronization server system over a wireless peer-to-peer channel. The user device is further configured to transmit the frame fingerprint and the timestamp to the video synchronization server system over the wireless peer-to-peer channel.

Yet another aspect of the embodiments relates to a user device comprising an encoder for generating a video stream of encoded video frames for transmission to a video synchronization server system over a wireless media channel. The user device also comprises a fingerprint generator for generating a frame fingerprint of a current video frame upon reception of a timestamp from the video synchronization server system over a wireless peer-to-peer channel. The user device further comprises an associating unit for associating the timestamp with the frame fingerprint for transmission to the video synchronization server system over the wireless peer-to-peer channel.

A further aspect of the embodiments relates to a computer program comprising instructions, which when executed by a processor, cause the processor to generate, for each user device of multiple user devices, a timestamp based on a current system time. The timestamp is output for transmission to the user device over a wireless peer-to-peer channel. The processor is also caused to determine, for each user device of the multiple user devices, an estimated capture time of a video frame, used to generate a frame fingerprint, received from the user device with the timestamp over the wireless peer-to-peer channel, based on the timestamp and a current system time. The processor is further caused to decode, for each user device of the multiple user devices, a video stream of encoded video frames, received from the user device over a wireless media channel, to get decoded video frames. The processor is additionally caused to compare, for each user device of the multiple user devices, the received frame fingerprint with a respective frame fingerprint generated for the decoded video frames and to assign, for each user device of the multiple user devices, the estimated capture time to a decoded video frame based on the comparison. The processor is further caused to time align video streams from the multiple user devices based on the assigned estimated capture times.

Another aspect of the embodiments relates to a computer program comprising instructions, which when executed by a processor, cause the processor to generate a video stream of encoded video frames for transmission to a video synchronization server system over a wireless media channel. The processor is also caused to generate a frame fingerprint of a current video frame upon reception of a timestamp from the video synchronization server system over a wireless peer-to-peer channel. The processor is further caused to associate the timestamp with the frame fingerprint for transmission to the video synchronization server system over the wireless peer-to-peer channel. A related aspect of the embodiments defines a carrier comprising a computer program as defined above. The carrier is one of an electronic signal, an optical signal, an electromagnetic signal, a magnetic signal, an electric signal, a radio signal, a microwave signal, or a computer-readable storage medium. The present embodiments address problems that video frames originating from different user devices recording a scene are out-of-synchronization, for instance, in social media environments. The embodiments achieve a reliable and implementation friendly, i.e. low complexity, solution to synchronize video stream from multiple user devices. The solution does not require installation of any proprietary applications on the user devices and is applicable to all kinds of social events.

BRIEF DESCRIPTION OF THE DRAWINGS

The embodiments, together with further objects and advantages thereof, may best be understood by making reference to the following description taken together with the accompanying drawings, in which:

Fig. 1 illustrates social video streaming of a sports event;

Fig. 2 schematically illustrates lack of synchronization of video streams sent from multiple user devices; Fig. 3 is a flow chart illustrating a method for enabling video synchronization according to an embodiment;

Fig. 4 is a flow chart illustrating additional, optional steps of the method illustrated in Fig. 3; Fig. 5 is a flow chart illustrating an embodiment of the fingerprint generating step illustrated in Fig. 3; Fig. 6 is a flow chart illustrating an additional, optional step of the method illustrated in Fig. 4; Fig. 7 is a flow chart illustrating a video synchronization method according to an embodiment;

Fig. 8 is a flow chart illustrating an embodiment of the time determining step illustrated in Fig. 7;

Fig. 9 is a flow chart illustrating additional, optional steps of the method illustrated in Fig. 7; Fig. 10 is a flow chart illustrating an embodiment of the comparing step illustrated in Fig. 7;

Fig. 11 schematically illustrates a system comprising a user device and a video synchronization server system and the operation flow in order to achieve synchronization of video streams according to an embodiment; Fig. 12 schematically illustrates a block diagram of a user device according to an embodiment; Fig. 13 schematically illustrates a block diagram of a user device according to another embodiment;

Fig. 14 schematically illustrates a block diagram of a user device according to a further embodiment;

Fig. 15 schematically illustrates a block diagram of a video synchronization server system according to an embodiment;

Fig. 16 schematically illustrates a block diagram of a video synchronization server system according to another embodiment;

Fig. 17 schematically illustrates a block diagram of a video synchronization server system according to a further embodiment; and

Fig. 18 schematically illustrates a computer program implementation according to an embodiment.

DETAILED DESCRIPTION

Throughout the drawings, the same reference numbers are used for similar or corresponding elements.

The present embodiments generally relate to video stream synchronization, and in particular to synchronization of video streams originating from multiple user devices recording a scene. The embodiments thereby enable video frame synchronization for video streaming of multiple user devices, for instance, in connection with a social event, such as a game or concert. As a consequence of the video stream synchronization it is possible to conduct video direction and rich-content assertion by providing video of the event from different viewpoints corresponding to the users' positions relative to the recorded scene.

In the following, a video frame is used to denote a picture or image of a video stream. Hence, a video frame could alternatively be denoted (video) picture or (video) image in the art. As is known in the art of video coding, a video frame is encoded according to a video coding standard to get an encoded video frame, such as an intra-coded frame, or I frame or picture, or an inter-coded frame, or P or B frame or picture.

Fig. 7 is a flow chart illustrating a video synchronization method according to the embodiment. The steps S40 to S46 as shown in the figure are performed for each user device of multiple user devices. Step S40 comprises receiving a video stream of encoded video frames over a wireless media channel from the user device. A next step S41 comprises transmitting, to the user device and over a wireless peer-to-peer (P2P) channel, a timestamp generated based on a current system time. A frame fingerprint and the timestamp are received from the user device over the wireless peer-to-peer channel in step S42. A next step S43 comprises determining an estimated capture time of a video frame, used to generate the received frame fingerprint, based on the timestamp and a current system time. The method also comprises decoding the video stream in step S44 to get decoded video frames. The received frame fingerprint is compared with a respective frame fingerprint generated for the decoded video frames in step S45. Step S46 then comprises assigning the estimated capture time to a decoded video frame based on the comparison. Finally, the video synchronization method further comprises time aligning video streams from the multiple user devices in step S47 based on the assigned estimated capture times.

Hence, synchronization of video streams originating from different user devices is achieved by generating and transmitting timestamps to the user devices on wireless peer-to-peer channels running in parallel to the wireless media channels used by the user devices to transmit video streams of encoded video frames. The timestamps trigger the user devices to generate a frame fingerprint of a current video frame and return the frame fingerprint with the associated timestamp on the peer-to-peer channel. The timestamp enables estimation of the capture time of the video frame at the user device. The decoded video frames obtained by decoding the video stream are then used to generate frame fingerprints that are compared to the frame fingerprint received over the peer-to-peer channel. When a match is found between a generated frame fingerprint and the received fingerprint the estimated capture time can be assigned to the decoded video frame used to generate the matching frame fingerprint. Hence, a correct time expressed in the system time is obtained for the position within the video stream corresponding to this decoded video frame. Once such a correct time has been assigned to at least one video frame in each of the video streams, the video streams can be time aligned so that video frames recorded at the same time at the different user devices will be time aligned. The embodiments thereby achieve synchronization of video streams from multiple sources by using a feedback channel, i.e. the peer-to-peer channel, to send timestamps and frame fingerprints from the user devices for calculating video frame arrival time of each video stream. The peer-to-peer channel can thereby be used to send timestamps, which are embedded along with frame fingerprints of sampled pictures or frames of recorded video into a packet on the user device. When the packet is sent back on the peer-to-peer channel, the timestamp is used to calculate when the sampled picture or frame was taken. By finding the fingerprinted frame in the received video stream, it is possible to determine when the video frame was taken in local time. i.e. system time. Thus, all the sampled video frames can be time stamped in the system time when they arrive on the wireless media channel. These video frames can then be used as pointers to align all the video streams.

In an embodiment, each user device transmits a video stream, or bitstream, of encoded video frames over a respective wireless media channel. Hence, as the camera of or connected to the user device records the scene and the user device, encodes the frames or pictures from the camera they are transmitted over the wireless media channel to a video synchronization server system. Each user device has, in addition, a wireless peer-to-peer channel established with the video synchronization server system to receive timestamps and to return the timestamps together with generated frame fingerprints. The synchronization server system can determine an estimated capture time of the video frame used to generate the frame fingerprint based on the timestamp received together with the frame fingerprint and the current system time when the frame fingerprint and the timestamp were received at the video synchronization server system. The video synchronization video server system decodes the encoded video streams of the video streams and generate respective frame fingerprints for the decoded video frames. Hence, the video synchronization video server performs the same or at least similar fingerprinting procedure on the decoded video frames as the user device performed on the current video frame when it received the timestamp from the video synchronization server system. The received frame fingerprint is then compared to the generated frame fingerprints in order to identify a generated frame fingerprint that matches, i.e. is sufficiently equal to, the received frame fingerprint. The video server synchronization system will then find, based on the comparison, the position within the video stream at which the user device received the timestamp sent over the wireless peer-to-peer channel. The estimated capture time determined in system time based on the timestamp and the arrival time of the frame fingerprint and timestamp at the video synchronization video server system can thereby be assigned to one of the decoded video frames. This means that the video sever synchronization system knows the correct time of at least one decoded video frame in the video stream. By performing this operation for each video stream from the multiple user devices, the video synchronization server system can use the determined estimated capture times to correctly time align decoded frames between the video streams in order to achieve video synchronization. As a consequence, video frames captured at the same time at the different user devices become time aligned at the video synchronization server system.

System time as used herein represents and denotes the time as recorded by the video synchronization server system, typically using an internal clock of the video synchronization server system. However, it could be possible to use another time reference than an internal clock as long as one and the same time reference is used by the video synchronization server system to generate timestamps and record current times of arrival of video frame and of the timestamp and the frame fingerprint.

Fig. 8 is a flow chart illustrating a particular embodiment of determining estimated capture time in Fig. 7. The method continues from step S42 of Fig. 7. The following step S50 comprises estimating a one-way transmission delay based on the timestamp and a reception time, in the system time, of the received frame fingerprint and the timestamp. A next step S51 comprises calculating the estimated capture time based on the one-way transmission delay and a reception time, in the system time, of a video frame used to generate the frame fingerprint. The method then continues to step S44 of Fig. 7.

In this embodiment, the timestamp and the reception time, i.e. the time at which the frame fingerprint and the timestamp were received on the wireless peer-to-peer channel expressed in the system time, are used to estimate a one-way transmission delay. This one-way transmission delay represents the transmission delay from transmission of a data packet from the user device until the data packet is received at the video synchronization server system.

In particular, the one-way transmission delay can be estimated by comparing the received timestamp with the current system time at which the timestamp and the frame fingerprint were received on the wireless peer-to-peer channels. For instance, the one-way delay could be estimated as disclosed in [1].

Once the one-way transmission delay has been estimated in step S50 it can be used together with the system time at which the video frame used to generate the frame fingerprint was received on the wireless media channel to calculate the estimated capture time in step S51. Hence, in an embodiment each video frame received on the wireless media channel is time stamped (or more correctly the data packets carrying the video frames) with the current reception time in the system time reference. The relevant reception time used in the calculation of step S51 is the reception time of the video frame that generates a frame fingerprint that matches the received frame fingerprint obtained from the user device over the wireless peer-to-peer channel. Hence, this relevant video frame was the one that resulted in the best match in the comparison of step S45. The video synchronization server system can then calculate the estimated capture time using the one-way transmission delay and the reception time, typically as - to, wherein represents the reception time in system time of the video frame and to represents the estimated one-way transmission delay. In an embodiment, step S41 of Fig. 7 comprises periodically transmitting, to the user device and over the wireless peer-to-peer channel, a timestamp generated based on a current system time.

In this embodiment, timestamps are periodically generated for the user devices and transmitted thereto on the wireless peer-to-peer channels. In such a case, a same periodicity could be used for all user devices. For instance, the timestamp representing the current system time is generated and sent to all user devices over the respective wireless peer-to-peer channels. In an embodiment, the periodicity is configurable by the video synchronization server system. Alternatively, the generation of timestamps could be performed on request of the video synchronization server system, such as when an increase in accuracy of estimating one-way transmission delays.

The embodiments are, however, not limited to periodic transmission of timestamps. For instance, the periodicity of transmitting timestamps or the occasions at which timestamp transmission are scheduled could be individually determined for the user devices and hence be different for different user devices. For instance, the scheduling of timestamp transmissions could be adapted to the particular network conditions experienced by the individual user devices. In such a case, the scheduling could be adapted to a trend in the one-way transmission delays estimated for the user devices.

For instance, if the one-way transmission delay estimated for a user device is substantially constant or does not change much over time then timestamps could be sent more seldom to this particular user device since the network conditions seem to be fairly constant for the user device. However, if the oneway transmission delays estimated for a user device change much then it may be more appropriate to decrease the time periods between transmitting timestamps to the user device. Fig. 9 is a flow chart illustrating an embodiment of such adaptation in scheduling transmissions of timestamps. The method continues from step S47 in Fig. 7. A next step S60 comprises storing the oneway transmission delay estimated for the current user device. Any trend in one-way transmission delay is then determined in step S61 and used in step S62 to schedule transmission of timestamps. The method then continues to step S40 of Fig. 7, where the scheduling is used to determine when timestamps are to be sent to the user device.

Hence, in this embodiment information of estimated one-way transmission delays, such as determined in step S50 of Fig. 8, are stored at the video synchronization server system. By storing such estimated one-way transmission delays the video synchronization server system will, over time, have access to multiple one-way transmission delays estimated for a given user device. It is then possible to determine any trend in the one-way transmission delay, such as fairly constant one-way transmission delay, increasing one-way transmission delay, decreasing one-way transmission delay or fluctuating one-way transmission delay. The transmission occasions for timestamps to the particular user device can then be scheduled in step S62 based on the determined trend in one-way transmission delay.

The scheduling could, in one approach, be simply in terms of i) keeping a current periodicity in timestamp transmissions, ii) reducing the time period between transmission occasions, or iii) increasing the time period between transmission occasions. More elaborative scheduling, including usage of non- periodic timestamp transmissions, based on the trend determined in step S61 are possible and within the scope of the embodiments.

Fig. 10 is a flow chart illustrating an embodiment of comparing fingerprints in Fig. 7. The method continues from step S44 in Fig. 7. A next step S70 comprises calculating a respective difference metric between the received frame fingerprint and the respective frame fingerprint generated for the decoded video frames. The following step S71 comprises selecting a decoded video frame that results in a difference metric representing a difference between the received frame fingerprint and a frame fingerprint generated for the decoded video frame that is smaller than a threshold difference. Hence, in this embodiment the video synchronization server system generates a respective frame fingerprint for each decoded video frame output from the decoding process in step S44 of Fig. 7. Each such frame fingerprint is then compared to the frame fingerprint received in step S42 of Fig. 7 by calculating a difference metric representing a difference between the generated frame fingerprint and the received frame fingerprint. The difference metric is then compared to a threshold value to determine whether the difference between the generated frame fingerprint and the received frame fingerprint is smaller than a threshold difference. If the difference is smaller than the threshold difference then the video synchronization server system assumes that the generated frame fingerprint and the received frame fingerprint are the same or more correctly that the decoded video frame used to generate the generated frame fingerprint at the video synchronization server system represents the same picture or frame as the video frame used to generate the received frame fingerprint at the user device.

In a typical embodiment, a low value of the difference metric represents a small difference between the frame fingerprints and a high value of the difference metric represents a large difference between the frame fingerprints. Non-limiting examples of difference metrics that can be used include sum of absolute differences (SAD) or sum of squared differences (SSD) between corresponding pixels or sample values in the frame fingerprints. In such a case, a generated frame fingerprint and the received frame fingerprint are assumed to be fingerprints of a same video frame if the SAD value or SSD value is smaller than a defined threshold value.

Once a decoded video frame has been selected in step S71 the method continues to step S46 of Fig. 7, in which the estimated capture time is assigned to the selected decoded video frame.

Determination of frame fingerprints does not necessarily have to be performed on all decoded video frames output from the decoder for a given video stream. In clear contrast, it could be sufficient to determine frame fingerprints for decoded video frames present within a time period selected to most likely encompass the video frame used to generate, at the user device, the frame fingerprint received in step S42 of Fig. 7. For instance, the expected arrival time ( ), in system time, of the relevant video frame would be = tr + tD, wherein tr represents the timestamp sent to the user device in step S41 and to represents an estimated one-way transmission delay for transmissions from the user device to the video synchronization server system. In this case, no one-way transmission delay for transmissions from the video synchronization server system to the user device has been assumed. If such one-way transmission delay is also regarded then the expected arrival time would be = tr + 2xtD (assuming symmetrical one-way transmission delays) or = tr + tD + (assuming asymmetrical one-way transmission delays), wherein k represents an estimated one-way transmission delay for transmission from the video synchronization server system to the user device. It could then be sufficient to determine frame fingerprints for the video frames having arrival time, in system time, at the video synchronization sever system within a time period of [ - Δ, + Δ] or [ΪΑ - ΔΙ, + Δ2] for some defined value of Δ or of Δ1 and Δ2. In an embodiment, step S47 of Fig. 7 comprises time aligning the video streams from the multiple user devices based on the assigned estimated capture times so that video frames in the video streams having a same capture time are time aligned.

Fig. 3 is a flow chart illustrating a method for enabling video synchronization. The method is performed by a user device communicating with a video synchronization server system. The method comprises a user device transmitting, in step S1 , a video stream of encoded video frames to a video synchronization server system over a wireless media channel. A next step S2 comprises the user device generating a frame fingerprint of a current video frame upon reception of a timestamp from the video synchronization server system over a wireless peer-to-peer channel. The next step S3 comprises the user device transmitting the frame fingerprint and the timestamp to the video synchronization server system over the wireless peer-to-peer channel. Step S1 of Fig. 3 is preferably performed throughout the communication session. Hence, as the user device outputs encoded video frames they are preferably packetized and transmitted to the video synchronization server system over the media channel. Steps S2 and S3 are, however, performed each time the user device receives a timestamp from the video synchronization sever system over the wireless peer-to-peer channel as schematically illustrated by the hashed line in the figure.

Fig. 4 is a flow chart illustrating additional, optional steps of the method for enabling video synchronization in Fig. 3. The method starts in step S10, which comprises the user device recording a scene with a camera of or connected to the user device to produce video frames. The user device then generates, in step S11 , the video stream by encoding the video frames. The method continues to step S1 of Fig. 3.

In a preferred embodiment, the user device therefore comprises a camera or is at least connected, wirelessly or by a wired connection, to a camera that is used to record a scene. The video frames output from the camera are encoded to generate the video stream that is transmitted to the video synchronization server system over the media channel in step S1 of Fig. 3.

Fig. 5 is a flow chart illustrating an embodiment of generating frame fingerprint in Fig. 3. The method continues from step S1 in Fig. 3. A next step S20 comprises the user device providing a video frame of a current scene recorded by the camera upon reception of the timestamp from the video synchronization server system over the wireless peer-to-peer channel. The user device then generates the frame fingerprint of the video frame in step S21. The method continues to step S3 of Fig. 3, where the generated frame fingerprint is transmitted together with the timestamp to the video synchronization server system over the wireless peer-to-peer channel.

In this embodiment, the user device fetches or retrieves a current picture or video frame as output from the camera when it receives the timestamp over the wireless peer-to-peer channel. This current picture or video frame thereby represents a snapshot of the current scene at the moment when the user device received the timestamp. This current picture or video frame is then processed to generate a frame fingerprint that is transmitted together with the received timestamp to the video synchronization server system.

The generation of a frame fingerprint in step S2 of Fig. 3, step S21 of Fig. 5 and step S45 of Fig. 7 can be performed according to various embodiments traditionally employed for generating fingerprints of images, pictures and/or video frames.

In an embodiment, the generation of the frame fingerprint comprises:

1 ) resample the frame or picture to a standardized size;

2) reduce saturation, normalize and equalize the resampled frame to increase contrast; and 3) resample again and take raw pixel data as the frame fingerprint.

In an alternative embodiment, a luminosity histogram of the original video frame can be used as frame fingerprint. Further alternatives include use a Radon transformation of the video frame to produce a normative mapping of the frame data as frame fingerprint, take a Haar wavelet of the video frame, etc. Other examples of generating frame fingerprints include image hashing, such as disclosed in [2-7].

The particular technique used to generate frame fingerprints is not essential to the embodiments and various prior art fingerprinting techniques can be used. It is, though, preferred that the fingerprinting is not too computational complex to allow battery-driven user devices to generate frame fingerprints. Furthermore, the video synchronization server system preferably generates a respective frame fingerprint on all or at least some of the decoded video frames. Hence, the generation of frame fingerprints should preferably be sufficient fast to enable generation of such fingerprints in real-time or at least near real-time as decoded video frames are output from the video decoder. The methods of the embodiments can advantageously be implemented using Web Real-Time Communication (WebRTC). Web RTC is an application programming interface (API) that supports browser-to-browser applications for, among others, voice calling, video chat and peer-to-peer file sharing without plugins. WebRTC is thereby a secure and reliable solution to transmit video and audio streams from user devices to backend servers, such as the video synchronization server system. Currently WebRTC is gaining widely support from, for instance, Firefox and Chrome on Android.

The present embodiments can thereby be used to achieve live delivery of video streams from WebRTC-capable user devices to the video synchronization server system where video streams from different WebRTC-capable user devices are time aligned and synchronized.

Fig. 6 is a flow chart illustrating an additional, optional step of the method when implemented using WebRTC. The method starts in step S30, which comprises the user device initiating a browser-based application service to activate a WebRTC getilserMedia API to access the camera of or connected to the user device. The browser-based application service also activates a WebRTC MediaStream API to transmit the video stream to the video synchronization server system over the wireless media channel using Real-time Transport Protocol (RTP). The browser-based application service further activates a WebRTC DataChannel API to establish the wireless peer-to-peer channel with the video synchronization server system. The method then continues to step S10 of Fig. 4.

Although WebRTC is a suitable technology of implementing the video synchronization, the embodiments are not limited thereto.

Fig. 11 schematically illustrates a system comprising a user device 1 and a video synchronization server system 10 and the operation flow in order to achieve synchronization of video streams according to an embodiment. Generally, the user device 1 opens a Web application service that calls WebRTC getilserMedia API to access the camera 5 of the user device 1 and uses WebRTC MediaStream API to stream encoded video frames to the video synchronization server system 10 via RTP/ RTP Control Protocol (RTCP) over the wireless media channel 40. The Web application service uses WebRTC DataChannel API to create a data channel 45, i.e. the wireless peer-to-peer channel 45, to the video synchronization server system 10. The video synchronization server system 10 generates a timestamp and sends it to the user device 1 , such as periodically, over the data channel 45. Upon receiving the timestamp, the user device 1 takes a screenshot or snapshot of the current scene 7 and generates a frame fingerprint (denoted image fingerprint in the figure) of the screenshot or snapshot. The user device 1 then sends the frame fingerprint together with the timestamp to the video synchronization server system 10 through the data channel 45.

The video synchronization server system 10 retrieves the frame fingerprint and the timestamp. By comparing the timestamp with the current system time, the video synchronization server system 10 is able to accurately estimate the one-way delay. Thus, the time when the screenshot or snapshot was taken can be derived. Meanwhile the video synchronization server system 10 decodes the received video stream. The video synchronization server system 10 produces a frame fingerprint for each decoded video frame. By comparing the produced frame fingerprints with the received frame fingerprint, the video synchronization server system 10 can determine when a video frame producing a frame fingerprint that matches the received frame fingerprint was captured on the user device 1. The video synchronization server system 10 performs this operation for each received video stream. Once the timestamp of each video stream is derived, the video synchronization server system is able to align them with the system time to achieve full video synchronization.

The video synchronization server system 10 could be a backend server capable of communicating with user devices 1 over the wireless media channel 40 and the wireless peer-to-peer channel 45, such as using WebRTC communication technology. The video synchronization server system 10 could alternatively be implemented as a group or cluster of multiple, i.e. at least two, backend servers that are interconnected by wired or wireless connections. The multiple backend servers could be locally arranged at the video synchronization service provider or be distributed among multiple locations. Also cloud-based implementations of the video synchronization server system 10 are possible and within the scope of the embodiments. A further aspect of the embodiments relates to a video synchronization server system. The video synchronization server system is configured to receive a video stream of encoded video frames over a wireless media channel from each user device of multiple user devices. The video synchronization server system is also configured to transmit, to each user device of the multiple user devices and over a wireless peer-to-peer channel, a timestamp generated based on a current system time. The video synchronization server system is further configured to receive a frame fingerprint and the timestamp from each user device of the multiple user devices over the wireless peer-to-peer channel. The video synchronization server system is additionally configured to determine, for each user device of the multiple user devices, an estimated capture time of a video frame, used to generate the received frame fingerprint, based on the timestamp and a current system time. The video synchronization server system is also configured to decode, for each user device of the multiple user devices, the video stream to get decoded video frames. The video synchronization server system is configured to compare, for each user device of the multiple user devices, the received frame fingerprint with a respective frame fingerprint generated for the decoded video frames. The video synchronization server system is further configured to assign, for each user device of the multiple user devices, the estimated capture time to a decoded video frame based on the comparison. The video synchronization server system is also configured to time align video streams from the multiple user devices based on the assigned estimated capture times. In an embodiment, the video synchronization server system is configured to estimate, for each user device of the multiple user devices, a one-way transmission delay based on the timestamp and a reception time of the received frame fingerprint and the timestamp in the system time. In this embodiment, the video synchronization server system is also configured to calculate, for each user device of the multiple user devices, the estimated capture time based on the one-way transmission delay and a reception time of a video frame used to generate the frame fingerprint in the system time.

In an embodiment, the video synchronization server system is configured to periodically transmit, to each user device of the multiple user devices and over the wireless peer-to-peer channel, a timestamp generated based on a current system time.

In an embodiment, the video synchronization server system is configured to store, for each user device of the multiple user devices, the one-way transmission delay. The video synchronization server system is also configured to determine, for each user device of the multiple user devices, any trend in one-way transmission delay. The video synchronization server system is further configured to schedule, for each user device of the multiple user devices, transmission of timestamps based on the trend in one-way transmission delay.

In an embodiment, the video synchronization server system is configured to calculate, for each user device of the multiple user devices, a respective difference metric between the received frame fingerprint and the respective frame fingerprint generated for the decoded video frames. The video synchronization server system is also configured to select, for each user device of the multiple user devices, a decoded video frame that results in a difference metric representing a difference between the received frame fingerprint and a frame fingerprint generated for the decoded video frame that is lower than a threshold difference. In an embodiment, the video synchronization server system is configured to time align the video stream from the multiple user devices based on the assigned estimated capture times so that video frames in the video streams having a same capture time are time aligned.

It will be appreciated that the methods and devices described herein can be combined and re-arranged in a variety of ways.

For example, embodiments may be implemented in hardware, or in software for execution by suitable processing circuitry, or a combination thereof.

The steps, functions, procedures, modules and/or blocks described herein may be implemented in hardware using any conventional technology, such as discrete circuit or integrated circuit technology, including both general-purpose electronic circuitry and application-specific circuitry.

Particular examples include one or more suitably configured digital signal processors and other known electronic circuits, e.g. discrete logic gates interconnected to perform a specialized function, or Application Specific Integrated Circuits (ASICs). Fig. 15 illustrates a particular hardware implementation of the video synchronization server system 400. In an embodiment, the video synchronization server system 400 comprises a receiver 410 configured to receive the video stream over the wireless media channel and receive the frame fingerprint and the timestamp over the wireless peer-to-peer channel. The video synchronization server system 400 also comprises a transmitter 410 configured to transmit the timestamp over the wireless peer-to-peer channel. A time estimator 420 of the video synchronization server system 400 is configured to determine the estimated capture time. The video synchronization server system 400 further comprises a decoder 430 configured to decode the video stream and a comparator 440 configured to compare the received frame fingerprint with the respective frame fingerprint. An assigning unit 450 of the video synchronization server system 400 is configured to assign the estimated capture time to the decoded video frame and a time aligner 460 is configured to time align the video streams.

In the figure, the receiver and transmitter have been exemplified by a transceiver (TX/RX) 410. In alternative embodiments, the video synchronization server system 400 could comprise a dedicated receiver 410 and a dedicated transmitter 410, or a first receiver used for reception of encoded video frames on the wireless media channel and a second receiver used for reception of timestamps and frame fingerprints on the wireless peer-to-peer channel in addition to the transmitter 410 or multiple transmitters. The receiver 410 is preferably connected to the decoder 430 and the comparator 440 for forwarding received encoded video frame thereto and to the time estimator 420 for forwarding the received timestamp thereto. The comparator 440 is connected to the decoder 430 for receiving decoded video frames therefrom and to the assigning unit 450 for forwarding a selected decoded video frame thereto. The assigning unit 450 is connected to the time estimator 420 for receiving the estimated capture time therefrom and to the time aligner 460 for forwarding a decoded video frame with assigned estimated capture time thereto. The time aligner 460 is also connected to the decoder 430 for receiving the decoded video frames to be time aligned therefrom.

In an embodiment, the comparator 440 of the video synchronization server system 400 is configured to generate frame fingerprints of the decoded video frames output from the decoder 430. In an alternative embodiment, the video synchronization server system 400 comprises a fingerprint generator (not illustrated) configured to generate frame fingerprints of the decoded video frames output from the decoder 430. The figure also shows a system clock 15 used as internal time reference by the video synchronization server system 400. This system clock 15 is preferably used to generate timestamps and to record reception times of encoded video frames and of the frame fingerprints in the system time.

Alternatively, at least some of the steps, functions, procedures, modules and/or blocks described herein may be implemented in software such as a computer program for execution by suitable processing circuitry such as one or more processors or processing units.

Examples of processing circuitry includes, but is not limited to, one or more microprocessors, one or more Digital Signal Processors (DSPs), one or more Central Processing Units (CPUs), video acceleration hardware, and/or any suitable programmable logic circuitry such as one or more Field Programmable Gate Arrays (FPGAs), or one or more Programmable Logic Controllers (PLCs).

It should also be understood that it may be possible to re-use the general processing capabilities of any conventional device or unit in which the proposed technology is implemented. It may also be possible to re-use existing software, e.g. by reprogramming of the existing software or by adding new software components.

In a particular example, the video synchronization server system 500, see Fig. 16, comprises a processor 510 and a memory 520 comprising instructions executable by the processor 510. The processor 510 is operative to cause a receiver 530 to receive the video stream over the wireless media channel and receive the frame fingerprint and the timestamp over the wireless peer-to-peer channel.

The processor 510 is also operative to cause a transmitter 530 to transmit the timestamp over the wireless peer-to-peer channel. The processor 510 is further operative to determine the estimated capture time, to decode the video stream and to compare the received frame fingerprint with the respective frame fingerprint. The processor 510 is additionally operative to assign the estimated capture time to the decoded video frame and to time align the video streams.

In a particular embodiment, the processor 510 is operative, when executing the instructions stored in the memory 520, to perform the above described operations. The processor 510 is thereby interconnected to the memory 520 to enable normal software execution.

Fig. 16 illustrates the video synchronization server system 500 as comprising a transceiver 530. In alternative embodiments, the video synchronization server system 500 could instead comprises one or more receivers and one or more transmitters.

The figure also illustrates the previously mentioned system clock 15 that may be implemented in the video synchronization server system 500. Fig. 18 is, in an embodiment, a schematic block diagram illustrating an example of a video synchronization server system 700 comprising a processor 710, an associated memory 720 and a communication circuitry 730.

In this particular example, at least some of the steps, functions, procedures, modules and/or blocks described herein are implemented in a computer program 740, which is loaded into the memory 720 for execution by processing circuitry including one or more processors 710. The processor 710 and memory 720 are interconnected to each other to enable normal software execution. A communication circuitry 730 is also interconnected to the processor 710 and/or the memory 720 to enable input and/or output of encoded video frames, timestamps and frame fingerprints. The term 'processor' should be interpreted in a general sense as any system or device capable of executing program code or computer program instructions to perform a particular processing, determining or computing task.

5

The processing circuitry including one or more processors is thus configured to perform, when executing the computer program, well-defined processing tasks such as those described herein.

The processing circuitry does not have to be dedicated to only execute the above-described steps, 10 functions, procedure and/or blocks, but may also execute other tasks.

In an embodiment, the computer program 740 comprises instructions, which when executed by the processor 710, cause the processor 710 to generate, for each user device of multiple user devices, a timestamp based on a current system time. The timestamp is output for transmission to the user device

15 over a wireless peer-to-peer channel. The processor 710 is also caused to determine, for each user device of the multiple user devices, an estimated capture time of a video frame, used to generate a frame fingerprint, received from the user device with the timestamp over the wireless peer-to-peer channel, based on the timestamp and a current system time. The processor 710 is further caused to decode, for each user device of the multiple user devices, a video stream of encoded video frames,

20 received from the user device over a wireless media channel, to get decoded video frames. The processor 710 is additionally caused to compare, for each user device of the multiple user devices, the received frame fingerprint with a respective frame fingerprint generated for the decoded video frames and to assign, for each user device of the multiple user devices, the estimated capture time to a decoded video frame based on the comparison. The processor 710 is further caused to time align video

25 streams from the multiple user devices based on the assigned estimated capture times.

The proposed technology also provides a carrier 750 comprising the computer program 740. The carrier 750 is one of an electronic signal, an optical signal, an electromagnetic signal, a magnetic signal, an electric signal, a radio signal, a microwave signal, or a computer-readable storage medium 30 750.

By way of example, the software or computer program 740 may be realized as a computer program product, which is normally carried or stored on a computer-readable medium 750, preferably nonvolatile computer-readable storage medium 750. The computer-readable medium 750 may include one or more removable or non-removable memory devices including, but not limited to a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc (CD), a Digital Versatile Disc (DVD), a Blue-ray disc, a Universal Serial Bus (USB) memory, a Hard Disk Drive (HDD) storage device, a flash memory, a magnetic tape, or any other conventional memory device. The computer program 5 740 may thus be loaded into the operating memory of a computer or equivalent processing device, represented by the video synchronization server system 700 in Fig. 18, for execution by the processor 710 thereof.

The flow diagram or diagrams presented herein may therefore be regarded as a computer flow diagram0 or diagrams, when performed by one or more processors. A corresponding video synchronization server system may be defined as a group of function modules, where each step performed by the processor corresponds to a function module. In this case, the function modules are implemented as a computer program running on the processor. Hence, the video synchronization server system may alternatively be defined as a group of function modules, where the function modules are implemented5 as a computer program running on at least one processor.

The computer program residing in memory may thus be organized as appropriate function modules configured to perform, when executed by the processor, at least part of the steps and/or tasks described herein. An example of such function modules is illustrated in Fig. 17 illustrating a schematic0 block diagram of a video synchronization server system 600 with function modules. The video synchronization server system 600 comprises a timestamp generator 610 for generating, for each user device of multiple user devices, a timestamp based on a current system time. The timestamp is output for transmission to the user device over a wireless peer-to-peer channel. The video synchronization server system 600 also comprises a time estimator 620 for determining, for each user device of the5 multiple user devices, an estimated capture time of a video frame, used to generate a frame fingerprint, received from the user device with a timestamp over the wireless peer-to-peer channel, based on the timestamp and a current system time. The video synchronization server system 600 further comprises a decoder 630 for decoding, for each user device of the multiple user devices, a video stream of encoded video frames, received from the user device over a wireless media channel, to get decoded0 video frames. The video synchronization server system 600 additionally comprises a comparator 640 for comparing, for each user device of the multiple user devices, the received frame fingerprint with a respective frame fingerprint generated for the decoded video frames. The video synchronization server system 600 further comprises an assigning unit 650 for assigning, for each user device of the multiple user devices, the estimated capture time to a decoded video frame based on the comparison. The video synchronization server system 600 additionally comprises a time aligner 660 for time aligning video streams from the multiple user devices based on the assigned estimated capture times.

In an embodiment, the video synchronization server system 600 also comprises a receiver 670 for receiving the video streams over the wireless media channel from each user device of the multiple user devices and for receiving the frame fingerprint and the timestamp from each user device of the multiple user devices over the wireless peer-to-peer channel. The video synchronization server system 600 also comprises a transmitter 670 for transmitting, to each user device of the multiple user devices, the timestamp over the wireless peer-to-peer channel.

The receiver and transmitter can be implemented as a transceiver or one and more receivers and one or more transmitters.

Yet another aspect of the embodiments relates to a user device that is configured to transmit a video stream of encoded video frames to a video synchronization server system over a wireless media channel. The user device is also configured to generate a frame fingerprint of a current video frame upon reception of a timestamp from the video synchronization server system over a wireless peer-to- peer channel. The user device is further configured to transmit the frame fingerprint and the timestamp to the video synchronization server system over the wireless peer-to-peer channel.

In an embodiment, the user device is configured to record a scene with a camera of or connected to the user device to produce video frames. The user device is also configured to generate the video stream by encoding the video frames. In an embodiment, the user device is configured to provide a video frame of a current scene recorded by the camera upon reception of the timestamp from the video synchronization server system over the wireless peer-to-peer channel. The user device is also configured to generate the frame fingerprint of the video frame. In an embodiment, the user device is configured to initiate a browser-based application service to active a WebRTC getilserMedia API to access the camera. The browser-based application service also actives a WebRTC MediaStream API to transmit the video stream to the video synchronization server system over the wireless media channel using RTP. The browser-based application service further activates a WebRTC DataC annel API to establish the wireless peer-to-peer channel with the video synchronization server system.

Fig. 12 is a schematic block diagram of a hardware implementation of the user device 100. The user device 100 comprises a transmitter 110 configured to transmit the video stream to the video synchronization server system over the wireless media channel and transmit the frame fingerprint and the timestamp to the video synchronization server system over the wireless peer-to-peer channel. The user device 100 also comprises a receiver 110 configured to receive the timestamp from the video synchronization server system over the wireless peer-to-peer channel. The user device 100 further comprises a fingerprint generator 120 configured to generate the frame fingerprint.

The user device 100 optionally comprises a camera 5 used to record a scene and output video frames. The camera 5 does not necessarily have to be part of the user device 100 but could, alternatively, be connected thereto through a wired or wireless connection. An encoder (not illustrated), such as implemented in the camera 5 or in the user device 100, is used to encode the video frames to get encoded video frames of the video stream.

The fingerprint generator 120 is preferably connected to the receiver 110 to get a notification of generating a frame fingerprint upon reception of a timestamp by the receiver 110. The fingerprint generator 120 is preferably also connected to the camera 5 to retrieve a current video frame or picture therefrom to generate the frame fingerprint. This frame fingerprint is forwarded to the connected transmitter 110 for transmission together with the timestamp to the video synchronization server system. The receiver and transmitter can be implemented as a transceiver 110 or one and more receivers and one or more transmitters.

In a particular example, the user device 200, see Fig. 13, comprises a processor 210 and a memory 220 comprising instructions executable by the processor 210. The processor 210 is operative to cause a transmitter 230 to transmit the video stream to the video synchronization server system over the wireless media channel and to transmit the frame fingerprint and the timestamp to the video synchronization server system over the wireless peer-to-peer channel. The processor 210 is also operative to generate the frame fingerprint. The receiver and transmitter can be implemented as a transceiver 230 or one and more receivers and one or more transmitters.

In a particular embodiment, the processor 210 is operative, when executing the instructions stored in the memory 220, to perform the above described operations. The processor 210 is thereby interconnected to the memory 220 to enable normal software execution.

The user device 200 may optionally also comprise a camera 5 configured to record a scene to produce video frames.

Fig. 18 is, in an embodiment, a schematic block diagram illustrating an example of a user device 700 comprising a processor 710, an associated memory 720 and a communication circuitry 730.

In this particular example, at least some of the steps, functions, procedures, modules and/or blocks described herein are implemented in a computer program 740, which is loaded into the memory 720 for execution by processing circuitry including one or more processors 710. The processor 710 and memory 720 are interconnected to each other to enable normal software execution. A communication circuitry 730 is also interconnected to the processor 710 and/or the memory 720 to enable input and/or output of encoded video frames, timestamps and frame fingerprints.

In an embodiment, the computer program 740 comprises instructions, which when executed by the processor 710, cause the processor 710 to generate a video stream of encoded video frames for transmission to a video synchronization server system over a wireless media channel. The processor 710 is also caused to generate a frame fingerprint of a current video frame upon reception of a timestamp from the video synchronization server system over a wireless peer-to-peer channel. The processor 710 is further caused to associate the timestamp with the frame fingerprint for transmission to the video synchronization server system over the wireless peer-to-peer channel.

The computer program 740 may be comprised in the previously described carrier 750.

Associating the timestamp with the frame fingerprint may, for instance, be achieved by including the timestamp and the frame fingerprint in a data packet that is transmitted over the wireless peer-to-peer channel to the video synchronization server system. The computer program residing in memory may thus be organized as appropriate function modules configured to perform, when executed by the processor, at least part of the steps and/or tasks described herein. An example of such function modules is illustrated in Fig. 14 illustrating a schematic block diagram of a user device 300 with function modules. The user device 300 comprises an encoder

5 310 for generating a video stream of encoded video frames for transmission to a video synchronization server system over a wireless media channel. The user device 300 also comprises a fingerprint generator 320 for generating a frame fingerprint of a current video frame upon reception of a timestamp from the video synchronization server system over a wireless peer-to-peer channel. The user device 300 further comprises an associating unit 330 for associating the timestamp with the frame fingerprint0 for transmission to the video synchronization server system over the wireless peer-to-peer channel.

In an embodiment, the user device 300 also comprises a transmitter 340 for transmitting the video stream to the video synchronization server system over the wireless media channel and for transmitting the timestamp and the frame fingerprint to the video synchronization server system over the wireless5 peer-to-peer channel. The user device 300 preferably further comprises a receiver 340 for receiving the timestamp from the video synchronization server system over the wireless peer-to-peer channel.

The transmitter and receiver could be implemented as a transceiver 340 or as one or more transmitters and one or more receivers.

0

The user device is preferably in the form of a mobile or portable user device, such a mobile telephone, a smart phone, a tablet, a laptop, a video camera with wireless communication circuitry, etc.

The embodiments described above are to be understood as a few illustrative examples of the present5 invention. It will be understood by those skilled in the art that various modifications, combinations and changes may be made to the embodiments without departing from the scope of the present invention. In particular, different part solutions in the different embodiments can be combined in other configurations, where technically possible. The scope of the present invention is, however, defined by the appended claims.

0

REFERENCES

[1] Jin-Hee Choi and Chunck Yoo, "Analytic End-to-End Estimation for the One-Way Delay and Its Variation", Consumer Communications and Networking Conference, 2005, pages: 527 - 532 [2] Venkatesan, "Robust image hasing", Microsoft Research publications, 2000,

(http://research.microsoft.com/pubs/77279/verikie00robust.pdf) - retrieved on October 24, 2014 [3] Saminathan et al., "Robust and secure image hashing", IEEE Transactions on Information

Forensics and Security, 2006, Vol. 1 , Issue 2, pages 215-230

[4] Picket, "Simple Image Hashing Withy Python", 2003,

(https://bloq.safaribooksonline.com/2013/11/28/imaqe-hashing-with-python/) - retrieved on

October 24, 2014

[5] Zhao et al., "A Robust Image Hashing Method Based on Zernike Moments", Journal of Computational Information Systems, 2010, Vol. 6, Issue 3, pages 717-725

[6] Hernandez et al., "Robust Image Hashing Using Image Normalization and SVD Decomposition", IEEE 54^th International Midwest Symposium on Circuits and Systems (MWSCAS), 2011 , pages 1-4

[7] Liu and Xiao, "A Robust Image Hashing Algorithm Resistant Against Geometrical Attacks", Radioengineering, 2013, Vol. 22, Issue 4, pages 1072-1081

Claims

1. A video synchronization method comprising, for each user device (1 ) of multiple user devices (1 ,

2. 3):

receiving (S40) a video stream (21) of encoded video frames (31) over a wireless media channel (40) from said user device (1);

transmitting (S41), to said user device (1) and over a wireless peer-to-peer channel (45), a timestamp generated based on a current system time;

receiving (S42), from said user device (1) and over said wireless peer-to-peer channel (45), a frame fingerprint and said timestamp;

determining (S43) an estimated capture time of a video frame, used to generate said received frame fingerprint, based on said timestamp and a current system time;

decoding (S44) said video stream (21) to get decoded video frames (31);

comparing (S45) said received frame fingerprint with a respective frame fingerprint generated for said decoded video frames (31); and

assigning (S46) said estimated capture time to a decoded video frame (31) based on said comparison, wherein said video synchronization method further comprising time aligning (S47) video streams (21 , 22, 23) from said multiple user devices (1, 2, 3) based on said assigned estimated capture times. 2. The video synchronization method according to claim 1 , wherein transmitting (S41) said timestamp comprises periodically transmitting (S41), to said user device (1) and over said wireless peer-to-peer channel (45), a timestamp generated based on a current system time.

3. The video synchronization method according to claim 1 or 2, wherein determining (S43) said estimated capture time comprises:

estimating (S50) a one-way transmission delay based on said timestamp and a reception time of said received frame fingerprint and said timestamp in said system time; and

calculating (S51) said estimated capture time based on said one-way transmission delay and a reception time of a video frame used to generate said frame fingerprint in said system time.

4. The video synchronization method according to claim 3, further comprising, for each user device (1) of said multiple user devices (1 , 2, 3):

storing (S60) said one-way transmission delay;

determining (S61) any trend in one-way transmission delay; and scheduling (S62) transmission of timestamps based on said trend in one-way transmission delay.

5. The video synchronization method according to any of the claims 1 to 4, wherein comparing (S45) said frame fingerprint comprises:

calculating (S70) a respective difference metric between said received frame fingerprint and said respective frame fingerprint generated for said decoded video frames (31); and

selecting (S71) a decoded video frame (31) that results in a difference metric representing a difference between said received frame fingerprint and a frame fingerprint generated for said decoded video frame (31) that is smaller than a threshold difference.

6. The video synchronization method according to any of the claims 1 to 5, wherein time aligning (S47) said video streams (21 , 22, 23) comprises time aligning (S47) said video streams (21, 22, 23) from said multiple user devices (1 , 2, 3) based on said assigned estimated capture times so that video frames (31 , 32, 33) in said video streams (21 , 22, 23) having a same capture time are time aligned.

7. A method for enabling video synchronization comprising:

a user device (1) transmitting (S1) a video stream (21) of encoded video frames (31) to a video synchronization server system (10) over a wireless media channel (40);

said user device (1) generating (S2) a frame fingerprint of a current video frame (31) upon reception of a timestamp from said video synchronization server system (10) over a wireless peer-to- peer channel (45); and

said user device (1) transmitting (S3) said frame fingerprint and said timestamp to said video synchronization server system (10) over said wireless peer-to-peer channel (45).

8. The method according to claim 7, further comprising:

said user device (1) recording (S10) a scene (7) with a camera (5) of or connected to said user device (1) to produce video frames (31); and

said user device (1) generating (S11) said video stream (21) by encoding said video frames (31).

9. The method according to claim 8, wherein said user device (1) generating (S2) said frame fingerprint comprises:

said user device (1) providing (S20) a video frame of a current scene (7) recorded by said camera (5) upon reception of said timestamp from said video synchronization server system (10) over said wireless peer-to-peer channel (45); and said user device (1) generating (S21) said frame fingerprint of said video frame.

10. The method according to claim 8 or 9, further comprising said user device (1) initiating (S30) a browser-based application service to:

activate a Web Real Time Communication, WebRTC, getUserMedia Application Programming Interface, API, to access said camera (5);

activate a WebRTC MediaStream API to transmit said video stream (21) to said video synchronization server system (10) over said wireless media channel (40) using Real-time Transport Protocol, RTP; and

activate a WebRTC DataChannel API to establish said wireless peer-to-peer channel (45) with said video synchronization server system (10).

11. A video synchronization server system (10, 400, 500), wherein

said video synchronization server system (10, 400, 500) is configured to receive a video stream (21) of encoded video frames (31) over a wireless media channel (40) from each user device (1) of multiple user devices (1 , 2, 3);

said video synchronization server system (10, 400, 500) is configured to transmit, to each user device (1) of said multiple user devices (1 , 2, 3) and over a wireless peer-to-peer channel (45), a timestamp generated based on a current system time;

said video synchronization server system (10, 400, 500) is configured to receive a frame fingerprint and said timestamp from each user device (1) of said multiple user devices (1 , 2, 3) over said wireless peer-to-peer channel (45);

said video synchronization server system (10, 400, 500) is configured to determine, for each user device (1) of said multiple user devices (1 , 2, 3), an estimated capture time of a video frame, used to generate said received frame fingerprint, based on said timestamp and a current system time;

said video synchronization server system (10, 400, 500) is configured to decode, for each user device (1) of said multiple user devices (1 , 2, 3), said video stream (21) to get decoded video frames (31);

said video synchronization server system (10, 400, 500) is configured to compare, for each user device (1) of said multiple user devices (1 , 2, 3), said received frame fingerprint with a respective frame fingerprint generated for said decoded video frames (31);

said video synchronization server system (10, 400, 500) is configured to assign, for each user device (1) of said multiple user devices (1 , 2, 3), said estimated capture time to a decoded video frame (31) based on said comparison; and said video synchronization server system (10, 400, 500) is configured to time align video streams (21 , 22, 23) from said multiple user devices (1 , 2, 3) based on said assigned estimated capture times.

12. The video synchronization server system according to claim 11 , wherein said video synchronization server system (10, 400, 500) is configured to periodically transmit, to each user device

(1 ) of said multiple user devices (1 , 2, 3) and over said wireless peer-to-peer channel (45), a timestamp generated based on a current system time.

13. The video synchronization server system according to claim 11 or 12, wherein

said video synchronization server system (10, 400, 500) is configured to estimate, for each user device (1) of said multiple user devices (1 , 2, 3), a one-way transmission delay based on said timestamp and a reception time of said received frame fingerprint and said timestamp in said system time; and

said video synchronization server system (10, 400, 500) is configured to calculate, for each user device (1) of said multiple user devices (1 , 2, 3), said estimated capture time based on said one-way transmission delay and a reception time of a video frame used to generate said frame fingerprint in said system time.

14. The video synchronization server system according to claim 13, wherein

said video synchronization server system (10, 400, 500) is configured to store, for each user device (1) of said multiple user devices (1 , 2, 3),) said one-way transmission delay;

said video synchronization server system (10, 400, 500) is configured to determine, for each user device (1) of said multiple user devices (1 , 2, 3), any trend in one-way transmission delay; and

said video synchronization server system (10, 400, 500) is configured to schedule, for each user device (1) of said multiple user devices (1 , 2, 3), transmission of timestamps based on said trend in one-way transmission delay.

15. The video synchronization server system according to any of the claims 11 to 14, wherein

said video synchronization server system (10, 400, 500) is configured to calculate, for each user device (1) of said multiple user devices (1 , 2, 3), a respective difference metric between said received frame fingerprint and said respective frame fingerprint generated for said decoded video frames (31); and

said video synchronization server system (10, 400, 500) is configured to select, for each user device (1) of said multiple user devices (1 , 2, 3), a decoded video frame (31) that results in a difference metric representing a difference between said received frame fingerprint and a frame fingerprint generated for said decoded video frame (31) that is smaller than a threshold difference.

16. The video synchronization server system according to any of the claims 11 to 15, wherein said 5 video synchronization server system (10, 400, 500) is configured to time align said video streams (21 , 22, 23) from said multiple user devices (1 , 2, 3) based on said assigned estimated capture times so that video frames (31, 32, 33) in said video streams (21, 22, 23) having a same capture time are time aligned.

10 17. The video synchronization server system according to any of the claims 11 to 16, further comprising:

a receiver (410) configured to receive said video stream (21) over said wireless media channel (40) and receive said frame fingerprint and said timestamp over said wireless peer-to-peer channel (45);

15 a transmitter (410) configured to transmit said timestamp over said wireless peer-to-peer channel (45);

a time estimator (420) configured to determine said estimated capture time;

a decoder (430) configured to decode said video stream (21);

a comparator (440) configured to compare said received frame fingerprint with said respective 20 frame fingerprint;

an assigning unit (450) configured to assign said estimated capture time to said decoded video frame (31); and

a time aligner (460) configured to time align said video streams (21 , 22, 23).

25 18. The video synchronization server system according to any of the claims 11 to 16, further comprising:

a processor (510); and

a memory (520) comprising instructions executable by said processor (510), wherein said processor (510) is operative to cause a receiver (530) to receive said video stream (21) over 30 said wireless media channel (45) and receive said frame fingerprint and said timestamp over said wireless peer-to-peer channel (45);

said processor (510) is operative to cause a transmitter (530) to transmit said timestamp over said wireless peer-to-peer channel (45);

said processor (510) is operative to determine said estimated capture time; said processor (510) is operative to decode said video stream (21);

said processor (510) is operative to compare said received frame fingerprint with said respective frame fingerprint;

said processor (510) is operative to assign said estimated capture time to said decoded video frame (31); and

said processor (510) is operative to time align said video streams (21 , 22, 23).

19. A video synchronization server system (600) comprising:

a timestamp generator (610) for generating, for each user device (1) of multiple user devices (1 , 2, 3), a timestamp based on a current system time, said timestamp is output for transmission to said user device (1) over a wireless peer-to-peer channel (45);

a time estimator (620) for determining, for each user device (1) of said multiple user devices (1 , 2, 3), an estimated capture time of a video frame, used to generate a frame fingerprint, received from said user device (1) with said timestamp over said wireless peer-to-peer channel (45), based on said timestamp and a current system time;

a decoder (630) for decoding, for each user device (1) of said multiple user devices (1 , 2, 3), a video stream (21) of encoded video frames (31), received from said user device (1) over a wireless media channel (40), to get decoded video frames (31);

a comparator (640) for comparing, for each user device (1) of said multiple user devices (1 , 2, 3), said received frame fingerprint with a respective frame fingerprint generated for said decoded video frames (31);

an assigning unit (650) for assigning, for each user device (1 ) of said multiple user devices (1 , 2, 3), said estimated capture time to a decoded video frame (31) based on said comparison; and

a time aligner (660) for time aligning video streams (21 , 22, 23) from said multiple user devices (1, 2, 3) based on said assigned estimated capture times.

20. The video synchronization server system according to claim 19, further comprising:

a receiver (670) for receiving said video stream (21) over said wireless media channel (40) from each user device (1) of multiple user devices (1 , 2, 3) and receiving said frame fingerprint and said timestamp from each user device (1) of said multiple user devices (1 , 2, 3) over said wireless peer-to- peer channel (45); and

a transmitter (670) for transmitting, to each user device (1) of said multiple user devices (1 , 2, 3), said timestamp over said wireless peer-to-peer channel (45).

21. A user device (1 , 100, 200), wherein

said user device (1 , 100, 200) is configured to transmit a video stream (21) of encoded video frames (31) to a video synchronization server system (10) over a wireless media channel (40);

said user device (1 , 100, 200) is configured to generate a frame fingerprint of a current video 5 frame (31) upon reception of a timestamp from said video synchronization server system (10) over a wireless peer-to-peer channel (45); and

said user device (1 , 100, 200) is configured to transmit said frame fingerprint and said timestamp to said video synchronization server system (10) over said wireless peer-to-peer channel (45).

10 22. The user device according to claim 21 , wherein

said user device (1 , 100, 200) is configured to record a scene (7) with a camera (5) of or connected to said user device (1) to produce video frames (31);

said user device (1 , 100, 200) is configured to generate said video stream (21) by encoding said video frames (31).

15

23. The user device according to claim 22, wherein

said user device (1 , 100, 200) is configured to provide a video frame of a current scene (7) recorded by said camera (5) upon reception of said timestamp from said video synchronization server system (10) over said wireless peer-to-peer channel (45); and

20 said user device (1 , 100, 200) is configured to generate said frame fingerprint of said video frame.

24. The user device according to claim 22 or 23, wherein said user device (1 , 100, 200) is configured to initiate a browser-based application service to:

25 activate a Web Real Time Communication, WebRTC, getUserMedia Application Programming Interface, API, to access said camera (5);

30 activate a WebRTC DataChannel API to establish said wireless peer-to-peer channel (45) with said video synchronization server system (10).

The user device according to any of the claims 21 to 24, further comprising: a transmitter (110) configured to transmit said video stream (21) to said video synchronization server system (10) over said wireless media channel (40) and transmit said frame fingerprint and said timestamp to said video synchronization server system (10) over said wireless peer-to-peer channel (45);

a receiver (110) configured to receive said timestamp from said video synchronization server system (10) over said wireless peer-to-peer channel (45); and

a fingerprint generator (120) configured to generate said frame fingerprint.

26. The user device according to any of the claims 21 to 24, further comprising:

a processor (210); and

a memory (220) comprising instructions executable by said processor (210), wherein said processor (210) is operative to cause a transmitter (230) to transmit said video stream (21) to said video synchronization server system (10) over said wireless media channel (40) and transmit said frame fingerprint and said timestamp to said video synchronization server system (10) over said wireless peer-to-peer channel (45); and

said processor (210) is operative to generate said frame fingerprint.

27. The user device according to claim 25 or 26, further comprising a camera (5) configured to record a scene (7) to produce video frames (31).

28. A user device (300) comprising:

an encoder (310) for generating a video stream (21) of encoded video frames (31) for transmission to a video synchronization server system (10) over a wireless media channel (40);

a fingerprint generator (320) for generating a frame fingerprint of a current video frame (31) upon reception of a timestamp from said video synchronization server system (10) over a wireless peer-to- peer channel (45); and

an associating unit (330) for associating said timestamp with said frame fingerprint for transmission to said video synchronization server system (10) over said wireless peer-to-peer channel (45).

29. The user device according to claim 28, further comprising:

a transmitter (340) for transmitting said video stream (21) to said video synchronization server system (10) over said wireless media channel (40) and transmitting said timestamp and said frame fingerprint to said video synchronization server system (10) over said wireless peer-to-peer channel (45); and

a receiver (340) for receiving said timestamp from said video synchronization server system (10) over said wireless peer-to-peer channel (45).

5

30. A computer program (740) comprising instructions, which when executed by a processor (710), cause said processor (710) to

generate, for each user device (1) of multiple user devices (1 , 2, 3), a timestamp based on a current system time, said timestamp is output for transmission to said user device (1) over a wireless 10 peer-to-peer channel (45);

determine, for each user device (1) of said multiple user devices (1 , 2, 3), an estimated capture time of a video frame, used to generate a frame fingerprint, received from said user device (1) with said timestamp over said wireless peer-to-peer channel (45), based on said timestamp and a current system time;

15 decode, for each user device (1) of said multiple user devices (1 , 2, 3), a video stream (21) of encoded video frames (31), received from said user device (1) over a wireless media channel (40), to get decoded video frames (31);

compare, for each user device (1) of said multiple user devices (1 , 2, 3), said received frame fingerprint with a respective frame fingerprint generated for said decoded video frames (31);

20 assign, for each user device (1) of said multiple user devices (1 , 2, 3), said estimated capture time to a decoded video frame (31) based on said comparison; and

time align video streams (21, 22, 23) from said multiple user devices (1 , 2, 3) based on said assigned estimated capture times.

25 31. A computer program (740) comprising instructions, which when executed by a processor (710), cause said processor (710) to

generate a video stream (21) of encoded video frames (31) for transmission to a video synchronization server system (10) over a wireless media channel (40);

generate a frame fingerprint of a current video frame (31) upon reception of a timestamp from 30 said video synchronization server system (10) over a wireless peer-to-peer channel (45); and

associate said timestamp with said frame fingerprint for transmission to said video synchronization server system (10) over said wireless peer-to-peer channel (45).

32. A carrier (750) comprising a computer program (740) of claim 30 or 31 , wherein said carrier (750) is one of an electronic signal, an optical signal, an electromagnetic signal, a magnetic signal, an electric signal, a radio signal, a microwave signal, or a computer-readable storage medium.