
CN119788920A - Video editing method, device, electronic device and storage medium - Google Patents


Info

Publication number
CN119788920A
Authority
CN
China
Prior art keywords
scene
target
editing
video frame
sample video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311295541.7A
Other languages
Chinese (zh)
Inventor
王光伟
龙超
刘策龙
李沛霖
李云珠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zitiao Network Technology Co Ltd
Lemon Inc Cayman Island
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Lemon Inc Cayman Island
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zitiao Network Technology Co Ltd and Lemon Inc Cayman Island
Priority to CN202311295541.7A
Priority to PCT/CN2024/123443 (WO2025077695A1)
Publication of CN119788920A
Legal status: Pending


Classifications

    • G06N3/04: Computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology
    • G06N3/08: Computing arrangements based on biological models; neural networks; learning methods
    • H04N21/44: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/4402: Processing of video elementary streams involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N21/472: End-user interface for requesting content, additional data or services; end-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • H04N5/262: Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; cameras specially adapted for the electronic generation of special effects

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Biophysics (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The embodiments of the disclosure disclose a video editing method and apparatus, an electronic device, and a storage medium. The method includes: performing attribute analysis on a scene in sample video frames to obtain scene attribute information, where the scene attribute information includes at least spatial information and semantic information corresponding to the scene in the sample video frames; determining a target editing effect, where the target editing effect includes at least one of editing the scene's camera movement and editing the scene's shape; and generating target video frames, through a neural radiance field model, based on the scene attribute information and the target editing effect, where the neural radiance field model is constructed based on the sample video frames. Diversified video editing can thereby be realized.

Description

Video editing method and device, electronic equipment and storage medium
Technical Field
The embodiments of the disclosure relate to the field of computer technology, and in particular to a video editing method, a video editing apparatus, an electronic device, and a storage medium.
Background
Existing video editing technology can generally perform cutting, splicing, audio adjustment and speed change on videos, and add special effects (such as captions, stickers, filters, transitions, and the like). In the related art, the available modes of video editing are quite limited.
Disclosure of Invention
The embodiment of the disclosure provides a video editing method, a video editing device, electronic equipment and a storage medium, which can realize diversified video editing.
In a first aspect, an embodiment of the present disclosure provides a video editing method, including:
performing attribute analysis on a scene in sample video frames to obtain scene attribute information, wherein the scene attribute information at least comprises spatial information and semantic information corresponding to the scene in the sample video frames;
determining a target editing effect, wherein the target editing effect comprises at least one of editing the scene's camera movement and editing the scene's shape;
and generating target video frames, through a neural radiance field model, based on the scene attribute information and the target editing effect, wherein the neural radiance field model is constructed based on the sample video frames.
In a second aspect, an embodiment of the present disclosure further provides a video editing apparatus, including:
an analysis module, used for performing attribute analysis on a scene in sample video frames to obtain scene attribute information, wherein the scene attribute information at least comprises spatial information and semantic information corresponding to the scene in the sample video frames;
an effect determining module, used for determining a target editing effect, wherein the target editing effect comprises at least one of editing the scene's camera movement and editing the scene's shape;
and an editing module, used for generating target video frames, through a neural radiance field model, based on the scene attribute information and the target editing effect, wherein the neural radiance field model is constructed based on the sample video frames.
In a third aspect, embodiments of the present disclosure further provide an electronic device, including:
one or more processors;
storage means for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the video editing method as described in any of the embodiments of the present disclosure.
In a fourth aspect, the disclosed embodiments also provide a storage medium containing computer-executable instructions, which when executed by a computer processor, are for performing a video editing method as described in any of the disclosed embodiments.
According to the technical solution of the embodiments of the disclosure, attribute analysis is performed on the scene in the sample video frames to obtain scene attribute information, where the scene attribute information includes at least spatial information and semantic information corresponding to the scene in the sample video frames; a target editing effect is determined, where the target editing effect includes at least one of editing the scene's camera movement and editing the scene's shape; and target video frames are generated, through a neural radiance field model, based on the scene attribute information and the target editing effect, where the neural radiance field model is constructed based on the sample video frames.
Because the neural radiance field model is constructed based on the sample video frames, the model has the ability to present the scene in the sample video frames at any angle and distance. Performing attribute analysis on the scene in the sample video frames yields at least spatial and semantic scene attribute information, so scene understanding can be achieved. On the basis of the scene attribute information, the neural radiance field model can edit the camera movement and/or the shape of the scene in the sample video frames according to the target editing effect and generate the target video frames, so that diversified video editing can be realized.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale.
Fig. 1 is a flowchart of a video editing method according to an embodiment of the disclosure;
fig. 2 is a schematic flow chart of a scene coordinate reconstruction in a video editing method according to an embodiment of the disclosure;
Fig. 3 is a flowchart of a video editing method according to an embodiment of the disclosure;
Fig. 4 is a schematic diagram of changing a scene's camera movement in a video editing method according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of changing scene shapes in a video editing method according to an embodiment of the present disclosure;
FIG. 6 is a block flow diagram of a video editing method according to an embodiment of the present disclosure;
FIG. 7 is a block flow diagram of a video editing method according to an embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a new target video frame of a video editing method according to an embodiment of the disclosure;
Fig. 9 is a schematic structural diagram of a video editing apparatus according to an embodiment of the disclosure;
Fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the accompanying drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and completely. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are open-ended, i.e., "including, but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; "another embodiment" means "at least one additional embodiment"; and "some embodiments" means "at least some embodiments". Relevant definitions of other terms will be given in the description below.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that references to "a", "an" and "a plurality" in this disclosure are intended to be illustrative rather than limiting, and those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
It will be appreciated that the data involved in the present technical solution (including but not limited to the data itself and the acquisition or use of the data) should comply with the corresponding laws, regulations and requirements of the relevant provisions.
Fig. 1 is a schematic flow chart of a video editing method according to an embodiment of the present disclosure. The embodiment of the present disclosure is applicable to video editing situations, for example, changing the camera movement with which a scene in a video was shot, and/or changing the three-dimensional shape of the scene. The method may be performed by a video editing apparatus, which may be implemented in software and/or hardware and may be configured in an electronic device, for example in a computer or the like.
As shown in fig. 1, the video editing method provided in this embodiment may include:
S110, performing attribute analysis on scenes in the sample video frames to obtain scene attribute information.
In the embodiment of the disclosure, the video frames of the video to be edited can be obtained using existing techniques, and the sample video frames can be determined from those video frames. For example, one frame may be extracted as a sample video frame every predetermined number of frames or predetermined time interval.
In some optional implementations, before performing attribute analysis on the scene in the sample video frames, the method may further include selecting the sample video frames based on at least one of a co-visibility relationship between the video frames and the content of each video frame.
In these alternative implementations, pose estimation methods such as Structure from Motion (SfM) may be used to determine the camera position for each video frame. When the change in camera position between frames is smaller than a preset distance, the frames can be considered to have a co-visibility relationship. Further, one frame may be selected from the video frames having the co-visibility relationship as a sample video frame, so that there is sufficient camera motion, i.e., viewpoint change, between the selected sample video frames. Selecting sample video frames according to the co-visibility relationship between video frames allows the sample video frames to carry more complete and comprehensive scene attribute information.
Selecting the sample video frames according to the content of the video frames may include selecting them according to at least one indicator of the frame content, such as sharpness, brightness, or noise. For example, a Fourier transform may be used to analyze whether a video frame is sharp; comparatively blurred video frames are filtered out, and the remaining video frames are taken as sample video frames. Selecting sample video frames according to frame content improves the quality of the sample video frames, so that each sample video frame carries more accurate scene attribute information.
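As a concrete illustration, a minimal frame-selection sketch in Python is given below. It assumes camera positions are already available from an SfM tool and substitutes the variance of the Laplacian for the Fourier-based sharpness check described above; the function names and thresholds are illustrative only and are not part of the disclosed method.

```python
import cv2
import numpy as np

def sharpness_score(frame_bgr):
    """Variance of the Laplacian as a simple sharpness proxy
    (a stand-in for the Fourier-based blur check described above)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def select_sample_frames(frames, camera_positions,
                         min_baseline=0.2, min_sharpness=50.0):
    """Keep frames that are reasonably sharp and that add enough camera motion
    (baseline) relative to the previously selected frame (co-visibility check)."""
    selected, last_pos = [], None
    for frame, pos in zip(frames, camera_positions):
        if sharpness_score(frame) < min_sharpness:
            continue  # discard comparatively blurred frames
        if last_pos is not None and np.linalg.norm(pos - last_pos) < min_baseline:
            continue  # too little viewpoint change relative to the last kept frame
        selected.append(frame)
        last_pos = pos
    return selected
```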
The scene attribute information includes at least spatial information and semantic information corresponding to the scene in the sample video frames. The semantic information may include low-level semantic information of each object in the scene (e.g., contour, edge, color, texture, shape), mid-level semantic information (e.g., object state), high-level semantic information (e.g., object category, such as cat, person, sky, building), and the like. The sample video frames can be processed by existing image analysis methods to obtain scene attribute information that includes at least the spatial information and semantic information corresponding to the scene in the sample video frames. By analyzing the attributes of the scene in the sample video frames, scene understanding can be achieved, laying a foundation for subsequently rendering the target video frames according to the determined target editing effect.
S120, determining a target editing effect.
The target editing effect includes at least one of editing the scene's camera movement and editing the scene's shape. Editing the scene's camera movement can be understood as editing the camera movement with which the scene in the sample video frames was shot, and editing the scene's shape can be understood as editing the three-dimensional shape of the scene in the sample video frames.
When the target editing effect includes editing the scene's camera movement, the target editing effect may include at least one of a first-person point-of-view shot, a shot orbiting a target object, a Hitchcock dolly zoom (also referred to as a push-pull or slide zoom), and the like. When the target editing effect is a first-person point-of-view shot, the sense of immersion of the edited video can be enhanced; when the target editing effect is a shot orbiting a target object, the target object in the edited video can be emphasized; when the target editing effect is a dolly zoom, the size of the main subject in the edited video remains unchanged while the background perspective changes dramatically, presenting the visual effect of the background receding from or approaching the subject, which can create a suspenseful, tense atmosphere for the edited video or highlight a character's strong inner emotional changes.
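For intuition only, the sketch below shows the geometric relationship behind a dolly zoom: the focal length is scaled in proportion to the camera-to-subject distance so that the subject's image size stays constant while the background perspective changes. The function name and numbers are hypothetical and are not taken from the disclosure.

```python
import numpy as np

def dolly_zoom_track(d0, f0, d_end, num_frames=60):
    """Generate (camera distance, focal length) pairs for a dolly zoom.
    Holding f/d constant keeps the subject's on-screen size fixed while the
    background appears to recede from or rush toward the subject."""
    distances = np.linspace(d0, d_end, num_frames)
    focals = f0 * distances / d0  # keep focal length proportional to distance
    return list(zip(distances, focals))

# e.g. pull the camera back from 2 m to 6 m while zooming in accordingly
track = dolly_zoom_track(d0=2.0, f0=35.0, d_end=6.0)
```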
When the target editing effect includes editing the scene's shape, the target editing effect may include at least one of enlarging/shrinking at least part of the scene, stretching/compressing at least part of the scene in a predetermined direction, shaking or bending at least part of the scene around a predetermined coordinate axis, and the like. When the target editing effect is enlarging/shrinking at least part of the scene, an enlargement operation and/or a reduction operation can be performed on at least part of the scene after it is mapped to three-dimensional space, creating for the edited video a "giant world/miniature world" look that contrasts with the surrounding environment. When the target editing effect is stretching/compressing at least part of the scene in a predetermined direction, a stretching operation and/or a compression operation can be performed on at least part of the scene after it is mapped to three-dimensional space; continuously stretching and compressing the scene mapped to three-dimensional space can create an undulating-scene effect for the edited video, while stretching the scene can create effects such as passing through a wormhole or high-speed travel. When the target editing effect is shaking or bending at least part of the scene around a predetermined coordinate axis, at least part of the objects in the scene mapped to three-dimensional space can be shaken and/or bent around the predetermined coordinate axis, creating a surreal scene for the edited video.
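One common way to realize such shape edits with a radiance field is to warp the sample points along each camera ray before the field is queried; the sketch below, which enlarges or shrinks a spherical region, is a simplified illustration under that assumption rather than the implementation specified by this disclosure.

```python
import numpy as np

def scale_region(points, center, radius, scale):
    """Inverse warp that makes everything inside a sphere appear `scale` times
    larger (or smaller) in the rendered image: sample points that land inside
    the region are pulled toward the center by 1/scale before the radiance
    field is queried."""
    offset = points - center
    dist = np.linalg.norm(offset, axis=-1, keepdims=True)
    inside = dist < radius
    warped = np.where(inside, center + offset / scale, points)
    return warped
```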
In embodiments of the present disclosure, a video editing apparatus may provide at least one layer of user interface to interact with a user during video editing. The user interface may include, for example, a selection interface of the target editing effects, where a selection control of at least one editing effect may be presented in the selection interface, and the target editing effect may be determined from the editing effects in response to a triggering operation of the selection control by the user.
There is no strict ordering between step S110 and step S120. For example, the video editing apparatus may push at least one editing effect for the user to choose from on the basis of the scene attribute information obtained in step S110, and then determine the target editing effect from the pushed editing effects in response to the user's triggering operation on the selection control. For another example, the video editing apparatus may first determine the target editing effect from the editing effects in response to the user's triggering operation on the selection control, and then analyze the relevant scene attribute information according to the target editing effect.
S130, generating target video frames based on the scene attribute information and the target editing effect through the neural radiance field model.
The neural radiance field model is constructed based on the sample video frames. A neural radiance field (NeRF) is a computer vision technique that uses deep learning to extract the geometric shape and texture information of objects from images at multiple viewpoints, and then uses this information to generate a continuous three-dimensional radiance field that can represent a highly realistic three-dimensional model/scene at any angle and distance. In this embodiment, the neural radiance field model can be constructed using only the sample video frames of a single captured video, automatically performing implicit reconstruction of the scene in the video, i.e., learning the scene's three-dimensional information.
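The following is a minimal PyTorch sketch of the kind of network a radiance field model uses: a positional encoding followed by an MLP that maps a 3D position and viewing direction to a volume density and color. It is a generic illustration of the NeRF idea with hypothetical layer sizes, not the specific model of this disclosure.

```python
import torch
import torch.nn as nn

def positional_encoding(x, num_freqs=10):
    """Map coordinates to sin/cos features so the MLP can fit high-frequency detail."""
    feats = [x]
    for i in range(num_freqs):
        feats += [torch.sin((2.0 ** i) * x), torch.cos((2.0 ** i) * x)]
    return torch.cat(feats, dim=-1)

class TinyRadianceField(nn.Module):
    """Minimal NeRF-style network: 3D position + view direction -> (density, RGB)."""
    def __init__(self, num_freqs=10, hidden=256):
        super().__init__()
        in_dim = 3 + 3 * 2 * num_freqs  # raw position plus encoded features
        self.backbone = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)
        self.color_head = nn.Sequential(
            nn.Linear(hidden + 3, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),
        )

    def forward(self, xyz, view_dir):
        h = self.backbone(positional_encoding(xyz))
        sigma = torch.relu(self.density_head(h))                 # volume density
        rgb = self.color_head(torch.cat([h, view_dir], dim=-1))  # view-dependent color
        return sigma, rgb
```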
Scene understanding can be achieved through the scene attribute information, and on that basis a camera movement trajectory and/or a deformation operation for the scene can be generated according to the target editing effect. For example, when the target editing effect is orbiting a target object with the camera, the target object in the scene can be determined from the scene attribute information, and a new camera movement trajectory can be generated on the basis of the scene attribute information.
Once the camera movement trajectory and/or the deformation operation for the scene is determined, the neural radiance field model can render the deformed or undeformed scene at every viewing angle along the trajectory to obtain the target video frames. A new video can then be generated from the target video frames using existing methods, so that various video editing effects can be realized easily. Video can be produced with a low barrier to entry and little time, without repeatedly reshooting to obtain the desired camera movement or having professionals edit the video with professional editing tools.
In addition, by implicitly reconstructing the scene in the sample video frames with NeRF, a highly realistic rendering effect can be ensured, and less time is consumed than with traditional techniques that build a three-dimensional model from depth point clouds; hour-level reconstruction can be achieved.
According to the technical solution of the embodiments of the disclosure, attribute analysis is performed on the scene in the sample video frames to obtain scene attribute information, where the scene attribute information includes at least spatial information and semantic information corresponding to the scene in the sample video frames; a target editing effect is determined, where the target editing effect includes at least one of editing the scene's camera movement and editing the scene's shape; and target video frames are generated, through a neural radiance field model, based on the scene attribute information and the target editing effect, where the neural radiance field model is constructed based on the sample video frames.
Because the neural radiance field model is constructed based on the sample video frames, the model has the ability to present the scene in the sample video frames at any angle and distance. Performing attribute analysis on the scene in the sample video frames yields at least spatial and semantic scene attribute information, so scene understanding can be achieved. On the basis of the scene attribute information, the neural radiance field model can edit the camera movement and/or the shape of the scene in the sample video frames according to the target editing effect and generate the target video frames, so that diversified video editing can be realized.
Embodiments of the present disclosure may be combined with each of the alternatives in the video editing method provided in the above embodiments. The video editing method provided by the embodiment describes the process of performing attribute analysis on the scene in the sample video frame in detail. By performing at least one analysis operation of three-dimensional coordinate reconstruction, symmetry analysis and semantic analysis, a foundation can be laid for scene understanding.
In the video editing method provided by this embodiment, the attribute analysis of the scene in the sample video frames may include at least one of: reconstructing three-dimensional coordinates corresponding to the scene in the sample video frames; analyzing the symmetry of the scene in the sample video frames; and performing semantic analysis on the scene in the sample video frames.
The constructed neural radiance field model can reconstruct the scene in three dimensions, but it lacks a unified reference coordinate system when rendering the scene at each viewing angle. Therefore, when performing attribute analysis on the scene in the sample video frames, the three-dimensional coordinates corresponding to the scene can be reconstructed so that the rendering positions of the scene under each viewing angle are consistent. Reconstructing the three-dimensional coordinates corresponding to the scene in the sample video frames may, for example, involve determining the world coordinates corresponding to the scene based on the camera position and camera parameters of each sample video frame, or mapping the scene to three-dimensional space with the constructed neural radiance field model and determining a three-dimensional coordinate system from straight lines of the scene in that space, and the like.
The symmetry of the scene in the sample video frames can be analyzed, for example, by identifying whether a symmetry axis exists in the sample video frames using existing image recognition methods (such as pattern matching or optimized search). The symmetry axis may be a symmetry axis of the entire scene or of a local part of the scene. For example, referring to fig. 5a, the whole scene in fig. 5a may be considered left-right symmetric with the road centerline as the symmetry axis. Analyzing the symmetry of the scene in the sample video frames provides a reference for subsequently editing the scene's camera movement or the scene's shape. For example, when editing the camera movement, at least part of the edited camera path can be made parallel to the symmetry axis; when editing the scene's shape, the scene can be stretched or compressed along the symmetry axis, and so on.
Semantic analysis of the scene in the sample video frames can be performed, for example, by semantically segmenting the scene with existing image segmentation methods. For example, for an indoor scene, walls, floor, ceiling and furniture may be segmented; for an outdoor scene, buildings, roads and so on may be segmented. Semantic analysis of the scene in the sample video frames lays a foundation for selecting objects when subsequently editing the scene's camera movement or the scene's shape. For example, when editing the camera movement, the object to be orbited can be determined from the semantic segmentation result; when editing the scene's shape, the object whose shape is to be edited can be determined.
In these alternative implementations, the scene attribute analysis process may include at least one of three-dimensional coordinate reconstruction, symmetry analysis, and semantic analysis, which lays a foundation for scene understanding. In addition, other manners of analyzing the attributes of the scenes may be applied to the method, which is not exhaustive herein.
Fig. 2 is a schematic flow chart of a scene coordinate reconstruction in a video editing method according to an embodiment of the disclosure. As shown in fig. 2, in some alternative implementations, reconstructing three-dimensional coordinates corresponding to a scene in a sample video frame may include:
S210, detecting straight lines in the scene in each sample video frame;
S220, determining the spatial position of each straight line through the neural radiance field model;
S230, clustering the straight lines based on their spatial positions to obtain target straight lines;
S240, identifying a target plane in the scene, and determining a three-dimensional coordinate system of the scene from the target plane and the target straight lines.
In this embodiment, a detection method such as the Hough transform may be used to detect the straight lines of the scene in each sample video frame, yielding two-dimensional straight lines in the sample video frames.
The depth information of a two-dimensional straight line can be rendered using the neural radiance field model; on that basis, because the camera position is known, the two-dimensional straight line can be mapped to a three-dimensional straight line in space, giving the spatial position of the three-dimensional straight line.
After the two-dimensional straight lines of the scene in each sample video frame are mapped to straight lines in space, multiple lines will lie near each ground-truth line because of errors. Cluster analysis can be performed on the lines according to their spatial positions to obtain target straight lines with high confidence in space.
The result of the semantic analysis of the scene in the sample video frames can be used to identify the target plane corresponding to the scene, such as the ground or a tabletop, where the target plane may be used to represent the horizontal plane. On this basis, the target straight lines can be analyzed: for example, mutually perpendicular straight lines are extracted, two of which may be parallel to the target plane. The two straight lines parallel to the target plane can then represent the two axes in the horizontal plane, and a straight line perpendicular to the target plane can represent the axis perpendicular to the horizontal plane, ensuring that the established three-dimensional coordinate system is consistent with the semantic information of the scene itself.
In these alternative implementations, the constructed neural radiance field model can be used to map the scene to three-dimensional space, and the straight lines in the scene can be combined with the semantic analysis to determine a three-dimensional coordinate system, ensuring that the established coordinate system is consistent with the semantic information of the scene itself.
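As an illustration of the final step, the sketch below builds a right-handed coordinate frame from an identified ground-plane normal and one clustered 3D line that is roughly parallel to that plane; it is a simplified construction under those assumptions, not the exact procedure claimed here.

```python
import numpy as np

def scene_axes(ground_normal, line_dir):
    """Build a right-handed scene coordinate frame from a detected ground plane
    and a high-confidence 3D line approximately parallel to that plane."""
    z = ground_normal / np.linalg.norm(ground_normal)  # up axis from the target plane
    x = line_dir - np.dot(line_dir, z) * z             # project the line onto the plane
    x = x / np.linalg.norm(x)
    y = np.cross(z, x)                                 # completes the orthonormal frame
    return np.stack([x, y, z])                         # rows are the scene's x, y, z axes
```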
In some alternative implementations, analyzing the symmetry of the scene in the sample video frames may include mapping each symmetry axis of the scene in the sample video frames to three-dimensional space using the constructed neural radiance field model, and determining a target symmetry axis from the symmetry axes in three-dimensional space that are parallel to any axis of the three-dimensional coordinate system.
In these alternative implementations, the symmetry axis can be determined on the basis of the established three-dimensional coordinate system. Preferentially selecting a symmetry axis parallel to a coordinate axis of the reconstructed three-dimensional coordinate system as the basis for subsequently determining the camera movement editing effect can enrich the editing effects.
According to the technical scheme, the process of analyzing the attributes of the scenes in the sample video frames is described in detail. By performing at least one analysis operation of three-dimensional coordinate reconstruction, symmetry analysis and semantic analysis, a foundation can be laid for scene understanding. The video editing method provided by the embodiment of the present disclosure belongs to the same disclosure concept as the video editing method provided by the above embodiment, technical details which are not described in detail in the present embodiment can be seen in the above embodiment, and the same technical features have the same beneficial effects in the present embodiment and the above embodiment.
Embodiments of the present disclosure may be combined with each of the alternatives in the video editing method provided in the above embodiments. The video editing method provided by this embodiment further specifies how the target editing effect is determined. Editing the scene's camera movement and/or editing the scene's shape may be determined as the target editing effect based on the scene attribute information, which can provide more automated video editing capability.
Fig. 3 is a schematic flow chart of a video editing method according to an embodiment of the disclosure. As shown in fig. 3, the video editing method provided in this embodiment may include:
s310, performing attribute analysis on scenes in the sample video frames to obtain scene attribute information.
The scene attribute information at least comprises spatial information and semantic information corresponding to a scene in the sample video frame.
S320, determining a target editing effect according to the scene attribute information.
The target editing effect includes at least one of editing the scene's camera movement and editing the scene's shape.
A camera movement trajectory can be generated automatically according to the semantic information in the scene attribute information; in this case the target editing effect may include editing the scene's camera movement.
For example, the camera can be made to orbit a target object in the scene. The process of determining the target object may include segmenting the different objects in each sample video frame through semantic analysis and selecting the target object in response to a user's target-object selection operation;
or the target object can be determined automatically from the result of the semantic analysis. For example, a target category of interest is predefined and an object belonging to that category in the semantic segmentation result is taken as the target object; or an object whose frequency of occurrence across the sample video frames is higher than a preset frequency is taken as the target object.
For another example, a camera orbit trajectory may be generated automatically according to the distribution of the objects in the scene. In this way, film-and-television-grade camera movement effects can be generated automatically with only simple user interaction, or with no interaction at all.
A camera movement trajectory can also be generated automatically according to the symmetry plane in the scene attribute information; here too the target editing effect may include editing the scene's camera movement. For example, a Hitchcock dolly zoom can be performed with respect to the symmetry plane to create a suspenseful, tense atmosphere, to highlight a character's strong inner emotional changes, and so on.
The target object may also be deformed according to the semantic information in the scene attribute information; in this case the target editing effect may include editing the scene's shape. For example, the target object may be enlarged, shaken, and so on. The process of determining the target object may refer to the description above and is not repeated here.
Fig. 4 is a schematic diagram illustrating changing a scene's camera movement in a video editing method according to an embodiment of the present disclosure. Referring to fig. 4, the arrow curve in fig. 4a is the original camera trajectory, and the arrow curve in fig. 4b is the camera trajectory after editing the scene's camera movement. By orbiting the building with the camera, a grand, imposing atmosphere can be created.
Fig. 5 is a schematic diagram illustrating changing a scene's shape in a video editing method according to an embodiment of the present disclosure. Referring to fig. 5, fig. 5a shows the original street shape, and fig. 5b shows the street shape after editing the scene's shape. By bending the street upwards, a surreal atmosphere can be created.
S330, generating target video frames based on the scene attribute information and the target editing effect through the neural radiance field model.
The neural radiance field model is constructed based on the sample video frames.
The technical solution of this embodiment of the disclosure further specifies how the target editing effect is determined. Editing the scene's camera movement and/or editing the scene's shape may be determined as the target editing effect based on the scene attribute information, which can provide more automated video editing capability. The video editing method provided by this embodiment of the present disclosure belongs to the same disclosed concept as the video editing methods provided by the above embodiments; technical details not described in detail in this embodiment can be found in the above embodiments, and the same technical features have the same beneficial effects here as in the above embodiments.
Embodiments of the present disclosure may be combined with each of the alternatives in the video editing method provided in the above embodiments. The video editing method provided by this embodiment can be combined with existing image generation models and the traditional rendering pipeline during video editing, so that richer video editing effects can be presented.
Fig. 6 is a block flow diagram of a video editing method according to an embodiment of the disclosure. As shown in fig. 6, in some implementations, the video editing process may generate the target video frames through the NeRF model and then process them in the video dimension based on an existing image generation model, which may include:
S610, performing attribute analysis on scenes in the sample video frames to obtain scene attribute information.
The scene attribute information at least comprises spatial information and semantic information corresponding to a scene in the sample video frame.
S620, determining a target editing effect.
The target editing effect includes at least one of editing the scene's camera movement and editing the scene's shape.
S630, generating target video frames based on the scene attribute information and the target editing effect through the neural radiance field model.
The neural radiance field model is constructed based on the sample video frames.
S640, editing the key frame of the target video frame to obtain an updated key frame.
After the target video frames of each angle of the scene are rendered using the NeRF model, at least some of the target video frames can be taken as key frames. The key frames can then be edited, for example modified or enhanced based on an existing image generation model. Editing the key frames may include stylizing them.
In this embodiment, stylizing the key frames with the image generation model is taken as an example for a unified description; other editing modes are handled similarly and are not described exhaustively. After the key frames are stylized, video frames in a different style from the original sample video frames can be obtained; for example, a sample video frame in a realistic style may be edited into a new video frame in another artistic style.
S650, determining optical flow information between the target video frames through the neural radiance field model.
There is no strict ordering between step S640 and step S650.
In this embodiment, because each pixel in the scene rendered by the NeRF model has unique spatial coordinates, a pixel in a generated target video frame has a spatial correspondence with pixels in the preceding and following target video frames, i.e., a correct optical flow. The optical flow information between target video frames can be determined from the change in position between a pixel in the current target video frame and the corresponding pixel in an adjacent video frame. Optical flow information can be understood as information related to the motion of pixels, such as motion direction and displacement.
And S660, interpolating the updated key frames based on the optical flow information to obtain intermediate video frames.
After the optical flow information between the target video frames is determined, the optical flow information of the video frames between adjacent key frames may be used as the optical flow information of the video frames between the corresponding adjacent updated key frames. Interpolation can then be performed between adjacent updated key frames based on this optical flow information to obtain the intermediate video frames.
For example, the optical flow information of the video frames between adjacent key frames before stylization may be used as the optical flow information of the intermediate video frames between the corresponding adjacent key frames after stylization. The adjacent stylized key frames can then be interpolated according to the optical flow information of the intermediate frames, giving stylized intermediate video frames.
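For illustration, the sketch below warps the two neighbouring stylized keyframes toward an intermediate time using backward optical flow and blends them. The flow fields are assumed to be the ones derived from the radiance field's known per-pixel 3D coordinates, and the function names are hypothetical.

```python
import cv2
import numpy as np

def warp_with_flow(stylized_key, flow_to_key):
    """Backward-warp a stylized keyframe into the intermediate frame.
    flow_to_key[y, x] is the displacement from pixel (x, y) of the intermediate
    frame to its corresponding pixel in the keyframe."""
    h, w = flow_to_key.shape[:2]
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow_to_key[..., 0]).astype(np.float32)
    map_y = (grid_y + flow_to_key[..., 1]).astype(np.float32)
    return cv2.remap(stylized_key, map_x, map_y, cv2.INTER_LINEAR)

def interpolate_stylized(stylized_prev, stylized_next, flow_to_prev, flow_to_next, t):
    """Blend the warps from the two neighbouring stylized keyframes (0 <= t <= 1)."""
    a = warp_with_flow(stylized_prev, flow_to_prev)
    b = warp_with_flow(stylized_next, flow_to_next)
    return cv2.addWeighted(a, 1.0 - t, b, t, 0.0)
```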
And S670, taking the updated key frames and the intermediate video frames as new target video frames.
From each updated key frame, and intermediate video frames between adjacent updated key frames, a new target video frame may be generated. For example, a stylized target video frame may be generated from the stylized key frames, and the stylized intermediate video frames between the stylized key frames.
In conventional video stylization, the image generation model can be viewed as processing each frame independently, so the same point in the scene may appear slightly differently, or at a slightly different position, in each stylized frame; as a result the processed video may jitter and flicker.
In this alternative embodiment, because each pixel in the scene rendered by the NeRF model has unique spatial coordinates, the pixels of a generated image correspond to pixels in the preceding and following images, i.e., have correct optical flow. The NeRF model can use the correct optical flow between the target video frames to interpolate between them and obtain intermediate frames, so jitter and flicker in the edited video can be reduced.
Fig. 7 is a block flow diagram of a video editing method according to an embodiment of the disclosure. As shown in fig. 7, in some implementations, the video editing process may generate the target video frames through the NeRF model, process the target video frames based on an existing image generation model, and then feed the result back to the NeRF model, so as to process the video in the spatial dimension, which may include:
S710, performing attribute analysis on the scenes in the sample video frames to obtain scene attribute information.
The scene attribute information at least comprises spatial information and semantic information corresponding to a scene in the sample video frame.
S720, determining a target editing effect.
The target editing effect includes at least one of editing the scene's camera movement and editing the scene's shape.
S730, generating target video frames based on the scene attribute information and the target editing effect through the neural radiance field model.
The neural radiance field model is constructed based on the sample video frames.
S740, editing the key frame of the target video frame to obtain an updated key frame.
After the target video frames of each angle of the scene are rendered using the NeRF model, at least some of the target video frames can be taken as key frames. The key frames can then be edited, for example modified or enhanced based on an existing image generation model. Editing the key frames may include stylizing them. In this embodiment, stylizing the key frames with the image generation model is taken as an example for a unified description; other editing modes are handled similarly and are not described exhaustively.
S750, adjusting parameters in the neural radiance field model using the updated key frames, so that the neural radiance field model with the adjusted parameters generates new target video frames.
After the stylized key frames are obtained, they can be mapped back into the NeRF model to modify the color information of each pixel after the scene is mapped to three-dimensional space (i.e., to modify the texture of the scene in the spatial dimension). Subsequent rendering then continues with the updated NeRF model, realizing stylization of the original video.
In these alternative embodiments, by changing the texture of the target video frames and projecting that texture back into the NeRF model, the NeRF model can be made to render target video frames with consistent texture, so the consistency of the original video after stylization can be ensured.
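One way to realize this feedback step is to continue optimizing the radiance field against the stylized keyframes with a photometric loss. The sketch below assumes a render_rays volume-rendering routine from the original reconstruction and a ray sampler over the stylized keyframes; both are hypothetical names, and the sketch is not the exact procedure disclosed here.

```python
import torch

def finetune_on_stylized(model, render_rays, optimizer, ray_batches, num_steps=2000):
    """Continue optimizing the radiance field so that its renderings match the
    stylized keyframes. `ray_batches` yields (ray origins, ray directions,
    stylized target colors) sampled from those keyframes; `render_rays` is the
    volume-rendering routine used during the original reconstruction (assumed)."""
    for _ in range(num_steps):
        rays_o, rays_d, target_rgb = next(ray_batches)
        pred_rgb = render_rays(model, rays_o, rays_d)    # hypothetical renderer
        loss = torch.mean((pred_rgb - target_rgb) ** 2)  # photometric (MSE) loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```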
In some optional implementations, after the target video frames are generated, the method may further include generating new target video frames containing a target three-dimensional model, based on the depth information of the scene in the target video frames and the depth information of the target three-dimensional model.
A three-dimensional model can be understood as a three-dimensional special-effect asset that can be added to the scene in the sample video frames. The three-dimensional model to be added can be obtained before generating the new target video frames containing it. For example, the three-dimensional model to be added may be obtained from a preset three-dimensional model library according to the scene attribute information of the scene in the sample video frames. Specifically, a correspondence between three-dimensional models and scene semantics can be preset, and based on this correspondence a three-dimensional model matching the semantic information of the scene in the sample video frames can be determined from the preset library as the three-dimensional model to be added.
Further, a new target video frame including the target three-dimensional model may be generated based on depth information of the scene in the target video frame and depth information of the target three-dimensional model.
In a traditional rendering pipeline, rendering can be performed after a three-dimensional model (e.g., a triangle-mesh model) is constructed. The spatial position of the three-dimensional model in the scene can be predefined, or it can be determined from a user setting received through a predefined user interface. The depth information corresponding to the scene of the target video frame and the depth information of the three-dimensional model at its spatial position can be obtained through the NeRF model. On this basis, because the geometric information of the three-dimensional model is known, the three-dimensional model and the scene can have a reasonable occlusion relationship in the new target video frame obtained after the model is rendered into the three-dimensional scene. The depth information of the three-dimensional model can be compared with the depth information corresponding to the scene in the target video frame, and for each pixel the one with the smaller depth is rendered, so the three-dimensional model can be added to the scene of the target video frame with a reasonable occlusion relationship.
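The per-pixel depth comparison amounts to a z-test between the rendered scene and the inserted asset; below is a minimal sketch, assuming the scene and the asset have already been rendered into RGB and depth buffers of the same resolution.

```python
import numpy as np

def composite_asset(scene_rgb, scene_depth, asset_rgb, asset_depth, asset_mask):
    """Per-pixel occlusion test: where the inserted 3D asset is closer to the
    camera than the reconstructed scene, its pixels win; elsewhere the scene
    occludes it, giving a plausible occlusion relationship."""
    asset_in_front = asset_mask & (asset_depth < scene_depth)
    out = scene_rgb.copy()
    out[asset_in_front] = asset_rgb[asset_in_front]
    return out
```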
Fig. 8 is a schematic diagram of a new target video frame in a video editing method according to an embodiment of the disclosure. Referring to fig. 8, when generating the target video frame, the football model can be added to the scene according to the depth information of the scene (a football goal) and the depth information of the football model in the target video frame, so that the scene content can be enriched.
In these alternative implementations, the target video frames can also be combined with a traditional rendering pipeline to generate new target video frames, presenting richer video editing effects.
According to the technical solution of the embodiments of the disclosure, the video editing process can be combined with generative artificial intelligence models and the traditional rendering pipeline, and richer video editing effects can be presented. The video editing method provided by this embodiment of the present disclosure belongs to the same disclosed concept as the video editing methods provided by the above embodiments; technical details not described in detail in this embodiment can be found in the above embodiments, and the same technical features have the same beneficial effects here as in the above embodiments.
Fig. 9 is a schematic structural diagram of a video editing apparatus according to an embodiment of the disclosure. The video editing apparatus provided in this embodiment is suitable for video editing situations, for example, changing the camera movement with which a scene in a video was shot, changing the three-dimensional shape of the scene, and the like.
As shown in fig. 9, a video editing apparatus provided in an embodiment of the present disclosure may include:
an analysis module 910, used for performing attribute analysis on a scene in sample video frames to obtain scene attribute information, where the scene attribute information includes at least spatial information and semantic information corresponding to the scene in the sample video frames;
an effect determining module 920, used for determining a target editing effect, where the target editing effect includes at least one of editing the scene's camera movement and editing the scene's shape;
and an editing module 930, used for generating target video frames based on the scene attribute information and the target editing effect through a neural radiance field model, where the neural radiance field model is constructed based on the sample video frames.
In some alternative implementations, the analysis module may be configured to perform at least one of the following attribute analyses on the scenes in the sample video frames:
reconstructing three-dimensional coordinates corresponding to a scene in a sample video frame;
analyzing symmetry of scenes in the sample video frames;
and carrying out semantic analysis on the scene in the sample video frame.
In some alternative implementations, the analysis module may be configured to:
detecting straight lines in scenes in each sample video frame;
determining the spatial position of each straight line through the neural radiance field model;
clustering the straight lines based on the space positions of the straight lines to obtain target straight lines;
A target plane in the scene is identified and a three-dimensional coordinate system of the scene is determined from the target plane and the target line.
In some alternative implementations, the effect determination module may be configured to:
and determining the target editing effect according to the scene attribute information.
In some alternative implementations, the video editing apparatus may further include:
The sample video frame selecting module is used for selecting the sample video frame based on at least one of the following modes before performing attribute analysis on the scene in the sample video frame:
selecting the sample video frames according to the co-visibility relationship between the video frames;
and selecting the sample video frames according to the content of each video frame.
In some alternative implementations, the video editing apparatus may further include:
the post editing module is used for editing the key frames of the target video frames after the target video frames are generated, so as to obtain updated key frames;
determining optical flow information between the target video frames through the neural radiance field model;
Interpolating the updated key frames based on the optical flow information to obtain intermediate video frames;
and taking the updated key frame and the intermediate video frame as new target video frames.
In some optional implementations, the post-editing module may be further configured to edit the key frame of the target video frame after the target video frame is generated, to obtain an updated key frame;
Correspondingly, a model updating module can be used for adjusting parameters in the neural radiance field model using the updated key frames, so that the neural radiance field model with the adjusted parameters generates new target video frames.
In some alternative implementations, the post-editing module may also be used to:
After the target video frame is generated, a new target video frame containing the target three-dimensional model is generated according to the depth information of the scene in the target video frame and the depth information of the target three-dimensional model.
The video editing device provided by the embodiment of the disclosure can execute the video editing method provided by any embodiment of the disclosure, and has the corresponding functional modules and beneficial effects of the execution method.
It should be noted that the units and modules included in the above apparatus are divided only according to functional logic and are not limited to this division, as long as the corresponding functions can be implemented; the specific names of the functional units are used only to distinguish them from one another and do not limit the protection scope of the embodiments of the present disclosure.
Referring now to fig. 10, a schematic diagram of a configuration of an electronic device (e.g., a terminal device or server in fig. 10) 1000 suitable for use in implementing embodiments of the present disclosure is shown. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 10 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
As shown in fig. 10, the electronic apparatus 1000 may include a processing device (e.g., a central processor, a graphics processor, etc.) 1001 that may perform various appropriate actions and processes according to a program stored in a Read-Only Memory (ROM) 1002 or a program loaded from a storage device 1008 into a random access Memory (Random Access Memory, RAM) 1003. In the RAM 1003, various programs and data necessary for the operation of the electronic apparatus 1000 are also stored. The processing device 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to bus 1004.
In general, the following devices may be connected to the I/O interface 1005: input devices 1006 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, and gyroscope; output devices 1007 including, for example, a liquid crystal display (LCD), speaker, and vibrator; storage devices 1008 including, for example, a magnetic tape and a hard disk; and a communication device 1009. The communication device 1009 may allow the electronic device 1000 to communicate wirelessly or by wire with other devices to exchange data. While fig. 10 shows an electronic device 1000 having various devices, it is to be understood that not all of the illustrated devices are required to be implemented or provided; more or fewer devices may alternatively be implemented or provided.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 1009, or installed from the storage device 1008, or installed from the ROM 1002. When the computer program is executed by the processing apparatus 1001, the above-described functions defined in the video editing method of the embodiment of the present disclosure are performed.
The electronic device provided by the embodiment of the present disclosure and the video editing method provided by the foregoing embodiment belong to the same disclosure concept, and technical details not described in detail in the present embodiment may be referred to the foregoing embodiment, and the present embodiment has the same beneficial effects as the foregoing embodiment.
The present disclosure provides a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the video editing method provided by the above embodiments.
It should be noted that the computer-readable medium described in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to, an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM) or flash memory, an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination of the foregoing. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to electrical wiring, optical fiber cable, RF (radio frequency), or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed networks.
The computer readable medium may be included in the electronic device or may exist alone without being incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to:
perform attribute analysis on a scene in a sample video frame to obtain scene attribute information, where the scene attribute information at least includes spatial information and semantic information corresponding to the scene in the sample video frame; determine a target editing effect, where the target editing effect includes at least one of editing a camera movement of the scene and editing a shape of the scene; and generate a target video frame based on the scene attribute information and the target editing effect through a neural radiance field model, where the neural radiance field model is constructed based on the sample video frame (a hypothetical driver tying these steps together is sketched below).
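Putting the three steps together, a hypothetical top-level driver could look as follows; it reuses the helper functions from the earlier sketches, and every name it calls is an assumption rather than an API defined by the disclosure.

```python
# Hypothetical end-to-end driver tying the earlier sketches together; every name
# here is an assumption, not part of the disclosure.
def edit_video(sample_frames, nerf_model, render_video):
    """`nerf_model` is a radiance field already fitted to `sample_frames`;
    `render_video` renders target frames from the model for the chosen effects."""
    # 1) Attribute analysis: spatial and semantic information for the scene.
    scene_attrs = {
        "symmetry": max(symmetry_score(f) for f in sample_frames),
        "has_dominant_plane": True,        # e.g. from reconstruct_scene_axes(...)
        "semantic_classes": {"person"},    # e.g. summarised from semantic_labels(...)
    }
    # 2) Determine the target editing effect from the scene attributes.
    effects = determine_target_effect(scene_attrs)
    # 3) Generate target video frames with the radiance field under those effects.
    return render_video(nerf_model, effects)
```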
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages or combinations thereof, including, but not limited to, object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present disclosure may be implemented by means of software or by means of hardware. In some cases, the names of the units and modules do not limit the units and modules themselves.
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that can be used include field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, there is provided a video editing method, the method including:
performing attribute analysis on scenes in a sample video frame to obtain scene attribute information, wherein the scene attribute information at least comprises space information and semantic information corresponding to the scenes in the sample video frame;
determining a target editing effect, wherein the target editing effect comprises at least one of editing a camera movement of the scene and editing a shape of the scene;
and generating a target video frame based on the scene attribute information and the target editing effect through a neural radiance field model, wherein the neural radiance field model is constructed based on the sample video frame.
According to one or more embodiments of the present disclosure, there is provided a video editing method, further comprising:
In some optional implementations, the analyzing the properties of the scene in the sample video frame includes at least one of:
Reconstructing three-dimensional coordinates corresponding to a scene in the sample video frame;
analyzing symmetry of a scene in the sample video frame;
and carrying out semantic analysis on the scene in the sample video frame.
According to one or more embodiments of the present disclosure, there is provided a video editing method, further comprising:
in some optional implementations, reconstructing three-dimensional coordinates corresponding to a scene in the sample video frame includes:
Detecting a straight line in a scene in each of the sample video frames;
determining the spatial position of each straight line through the neural radiance field model;
clustering each straight line based on the spatial position of each straight line to obtain a target straight line;
and identifying a target plane corresponding to the scene in the sample video frame, and determining three-dimensional coordinates corresponding to the scene according to the target plane and the target straight line.
According to one or more embodiments of the present disclosure, there is provided a video editing method, further comprising:
In some optional implementations, the determining the target editing effect includes:
and determining the target editing effect according to the scene attribute information.
According to one or more embodiments of the present disclosure, there is provided a video editing method, further comprising:
in some optional implementations, before the performing attribute analysis on the scene in the sample video frame, the method further comprises:
selecting the sample video frames based on at least one of:
selecting the sample video frames according to the co-visibility relationship among the video frames;
And selecting the sample video frames according to the content of each video frame.
According to one or more embodiments of the present disclosure, there is provided a video editing method, further comprising:
in some alternative implementations, after the generating the target video frame, further comprising:
editing the key frame of the target video frame to obtain an updated key frame;
determining optical flow information between the target video frames through the neural radiance field model;
Interpolating the updated key frame based on the optical flow information to obtain an intermediate video frame;
and taking the updated key frame and the intermediate video frame as new target video frames.
According to one or more embodiments of the present disclosure, there is provided a video editing method, further comprising:
in some alternative implementations, after the generating the target video frame, further comprising:
editing the key frame of the target video frame to obtain an updated key frame;
and adjusting parameters in the neural radiance field model by using the updated key frame, so that the neural radiance field model with the adjusted parameters generates a new target video frame.
According to one or more embodiments of the present disclosure, there is provided a video editing method, further comprising:
in some alternative implementations, after the generating the target video frame, further comprising:
And generating a new target video frame containing the target three-dimensional model according to the depth information of the scene in the target video frame and the depth information of the target three-dimensional model.
According to one or more embodiments of the present disclosure, there is provided a video editing apparatus including:
the analysis module is used for carrying out attribute analysis on scenes in the sample video frames to obtain scene attribute information, wherein the scene attribute information at least comprises space information and semantic information corresponding to the scenes in the sample video frames;
The effect determining module is configured to determine a target editing effect, wherein the target editing effect comprises at least one of editing a camera movement of the scene and editing a shape of the scene;
and the editing module is configured to generate a target video frame based on the scene attribute information and the target editing effect through a neural radiance field model, wherein the neural radiance field model is constructed based on the sample video frame.
The foregoing description is only of the preferred embodiments of the present disclosure and an explanation of the technical principles employed. It will be appreciated by persons skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the specific combinations of the features described above, and also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in the present disclosure (but not limited thereto).
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.

Claims (11)

1. A video editing method, comprising:
performing attribute analysis on scenes in a sample video frame to obtain scene attribute information, wherein the scene attribute information at least comprises space information and semantic information corresponding to the scenes in the sample video frame;
determining a target editing effect, wherein the target editing effect comprises at least one of editing a camera movement of the scene and editing a shape of the scene;
and generating a target video frame based on the scene attribute information and the target editing effect through a neural radiance field model, wherein the neural radiance field model is constructed based on the sample video frame.
2. The method of claim 1, wherein the analyzing the scene in the sample video frame for attributes comprises at least one of:
Reconstructing three-dimensional coordinates corresponding to a scene in the sample video frame;
analyzing symmetry of a scene in the sample video frame;
and carrying out semantic analysis on the scene in the sample video frame.
3. The method of claim 2, wherein reconstructing three-dimensional coordinates corresponding to a scene in the sample video frame comprises:
Detecting a straight line in a scene in each of the sample video frames;
determining the spatial position of each straight line through the neural radiance field model;
clustering each straight line based on the spatial position of each straight line to obtain a target straight line;
and identifying a target plane corresponding to the scene in the sample video frame, and determining three-dimensional coordinates corresponding to the scene according to the target plane and the target straight line.
4. The method of claim 1, wherein the determining the target editing effect comprises:
and determining the target editing effect according to the scene attribute information.
5. The method of claim 1, further comprising, prior to said analyzing the properties of the scenes in the sample video frames:
selecting the sample video frames based on at least one of:
selecting the sample video frames according to the co-visibility relationship among the video frames;
And selecting the sample video frames according to the content of each video frame.
6. The method of claim 1, further comprising, after the generating the target video frame:
editing the key frame of the target video frame to obtain an updated key frame;
determining optical flow information between the target video frames through the neural radiance field model;
Interpolating the updated key frame based on the optical flow information to obtain an intermediate video frame;
and taking the updated key frame and the intermediate video frame as new target video frames.
7. The method of claim 1, further comprising, after the generating the target video frame:
editing the key frame of the target video frame to obtain an updated key frame;
and adjusting parameters in the neural radiance field model by using the updated key frame, so that the neural radiance field model with the adjusted parameters generates a new target video frame.
8. The method of claim 1, further comprising, after the generating the target video frame:
And generating a new target video frame containing the target three-dimensional model according to the depth information of the scene in the target video frame and the depth information of the target three-dimensional model.
9. A video editing apparatus, comprising:
the analysis module is used for carrying out attribute analysis on scenes in the sample video frames to obtain scene attribute information, wherein the scene attribute information at least comprises space information and semantic information corresponding to the scenes in the sample video frames;
the effect determining module is configured to determine a target editing effect, wherein the target editing effect comprises at least one of editing a camera movement of the scene and editing a shape of the scene;
and the editing module is configured to generate a target video frame based on the scene attribute information and the target editing effect through a neural radiance field model, wherein the neural radiance field model is constructed based on the sample video frame.
10. An electronic device, the electronic device comprising:
one or more processors;
Storage means for storing one or more programs,
The one or more programs, when executed by the one or more processors, cause the one or more processors to implement the video editing method of any of claims 1-8.
11. A storage medium containing computer executable instructions for performing the video editing method of any of claims 1-8 when executed by a computer processor.
CN202311295541.7A 2023-10-08 2023-10-08 Video editing method, device, electronic device and storage medium Pending CN119788920A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202311295541.7A CN119788920A (en) 2023-10-08 2023-10-08 Video editing method, device, electronic device and storage medium
PCT/CN2024/123443 WO2025077695A1 (en) 2023-10-08 2024-10-08 Video editing method and apparatus, and electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311295541.7A CN119788920A (en) 2023-10-08 2023-10-08 Video editing method, device, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN119788920A true CN119788920A (en) 2025-04-08

Family

ID=95231173

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311295541.7A Pending CN119788920A (en) 2023-10-08 2023-10-08 Video editing method, device, electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN119788920A (en)
WO (1) WO2025077695A1 (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114979785B (en) * 2022-04-15 2023-09-08 荣耀终端有限公司 Video processing method, electronic device and storage medium
CN115482323A (en) * 2022-08-10 2022-12-16 上海大学 Stereoscopic video parallax control and editing method based on neural radiance field
CN115941875B (en) * 2022-12-27 2025-08-22 上海交通大学 Ultra-low bandwidth video call transmission system and method based on neural radiance field

Also Published As

Publication number Publication date
WO2025077695A1 (en) 2025-04-17

Similar Documents

Publication Publication Date Title
CN109618222B (en) A kind of splicing video generation method, device, terminal device and storage medium
US20160198097A1 (en) System and method for inserting objects into an image or sequence of images
CN111399729A (en) Image drawing method and device, readable medium and electronic equipment
CN112182299B (en) Method, device, equipment and medium for acquiring highlight in video
CN111597953A (en) Multi-path image processing method and device and electronic equipment
CN112954450A (en) Video processing method and device, electronic equipment and storage medium
WO2020248900A1 (en) Panoramic video processing method and apparatus, and storage medium
CN110070896A (en) Image processing method, device, hardware device
CN111626919B (en) Image synthesis method and device, electronic equipment and computer readable storage medium
WO2020211573A1 (en) Method and device for processing image
JP7592954B2 (en) Method, device, storage medium, and program product for changing background in a screen
CN114419213A (en) Image processing method, apparatus, device and storage medium
CN114531553A (en) Method and device for generating special effect video, electronic equipment and storage medium
CN111246196B (en) Video processing method and device, electronic equipment and computer readable storage medium
CN114630057A (en) Method, device, electronic device and storage medium for determining special effects video
CN112906553A (en) Image processing method, apparatus, device and medium
CN109816791B (en) Method and apparatus for generating information
JP7540073B2 (en) Video recording method, device, equipment and storage medium
CN119788920A (en) Video editing method, device, electronic device and storage medium
CN110209861A (en) Image processing method, device, electronic equipment and computer readable storage medium
JP7574400B2 (en) Character display method, device, electronic device, and storage medium
CN113963000B (en) Image segmentation method, device, electronic equipment and program product
CN112257653B (en) Method and device for determining space decoration effect graph, storage medium and electronic equipment
CN110807728B (en) Object display method and device, electronic equipment and computer-readable storage medium
CN119759232B (en) Model display method, device, electronic device and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination