KR20200023013A - Video Service device for supporting search of video clip and Method thereof

Info

Publication number: KR20200023013A
Application number: KR1020180099343A
Authority: KR
Inventors: 이혜정
Original assignee: 에스케이텔레콤 주식회사
Priority date: 2018-08-24
Filing date: 2018-08-24
Publication date: 2020-03-04
Anticipated expiration: 2038-08-24
Also published as: KR102199446B1

Abstract

According to an embodiment of the present invention, a video service device can comprise: a memory storing a video clip; and a processor functionally connected to the memory. The processor can be configured to acquire a video clip stored in the memory, divide the video clip in scene units, extract candidate keywords by performing at least one recognition operation on at least one divided scene of the video clip, and select a representative keyword for the video clip based on the extracted candidate keywords.

Description

Video service device for supporting video content search and method for supporting video content search {Video Service device for supporting search of video clip and Method thereof}

The present invention relates to video content search, and in particular to a video service device and a video content search support method that extract a representative keyword from video content and map it to that content so that the content can be retrieved easily.

In recent years, content consumption has been shifting rapidly from text to images and video, and consumption of video content in particular is growing sharply. Whereas video content used to be merely a medium for media consumption, it is now also used as a tool for communication and information retrieval.

Meanwhile, although video content is abundant, busy lifestyles make it increasingly difficult to watch or rewatch one to two hours of video content. As a result, rather than replaying entire programs, content that condenses the core of a long program (e.g., 1 to 2 hours) into roughly 2 to 3 minutes has recently come to the fore. To match this consumption trend, broadcasters now produce and provide, in addition to the full 1 to 2 hour video, short clip videos of about 3 to 5 minutes extracted from it.

For original media video, the basic metadata prepared by the content producer (cast, filming location, synopsis, genre, and so on) tends to be relatively rich and accurate. For a short clip re-edited from the original to about 2 to 3 minutes, however, it is difficult to obtain metadata about the characters appearing in the clip, the clip's plot, its setting, and so on. In addition, the source information needed to extract metadata for a short clip cannot be obtained. In the case of UGC (user-generated content), which is produced by ordinary users rather than professional producers, metadata such as the video title or detailed description is often not entered properly.

Consequently, when a video content streaming service covering both full videos and clips is provided, the main keyword information for each item (hashtags (#) that condense the video information and are used for search or recommendation) is entered manually by people, so the information is often inaccurate and providing it is very inefficient.

Korean Laid-open Patent No. 10-1640317, registered on July 11, 2016 (Title: Apparatus and method for storing and retrieving video including audio and video data)

The present invention is intended to meet the need described above, and provides a video service device and a video content search support method that can automatically generate, based on recognition technology, representative keywords (hashtags) describing video content such as movies or broadcasts, or UGC content.

The present invention also provides a video service device and a video content search support method that give viewers summary information about broadcast clip content and, further, allow individual broadcast clips to be searched and recommended effectively.

A video service device according to an embodiment of the present invention may include a memory that stores a video clip and a processor functionally connected to the memory. The processor may be configured to acquire the video clip stored in the memory, divide the video clip into scenes, extract at least one candidate keyword by performing at least one recognition operation on the divided scenes or on at least one of them, and select a representative keyword for the video clip based on the at least one candidate keyword.

Alternatively, the at least one candidate keyword may be extracted from graphic subtitles applied to the at least one scene.

The at least one candidate keyword may be extracted by jointly applying the feature vector extracted for each frame included in the at least one scene and the original video feature vector (global feature) of the video clip.

The at least one candidate keyword may be assigned a priority by a score model.

In particular, the processor may be configured to select the highest-priority candidate keyword as the representative keyword for the video clip.

The processor may also be configured to perform at least one recognition operation among face recognition, time or season recognition, place recognition, dialogue recognition, music recognition, and situation or event recognition.

A video content search support method according to an embodiment of the present invention may include the steps of: acquiring, by a video service device, a video clip for which keyword selection is required; dividing the video clip into scenes; extracting at least one candidate keyword by performing at least one recognition operation on at least one divided scene of the video clip; and selecting a representative keyword for the video clip based on the at least one candidate keyword.

In the video content search support method, the video clip may include video content that is prepared by editing at least one scene taken from part of an original video and that has a playback length of several minutes to several tens of minutes.

The present invention applies recognition technologies such as image recognition, subtitle recognition, and speech recognition not only to 2 to 3 minute video clips or UGC but also to video content of a given length, automatically generates scene meta information (e.g., candidate keywords) scene by scene, and analyzes it to automatically select keywords that represent the video content. This makes keyword selection more accurate and the work more efficient.

FIG. 1 is a diagram illustrating an example of a network environment according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating a schematic configuration of a video service device according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating an example of a processor of a video service device according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating an example of a scene meta extractor of the present invention.
FIG. 5 is a diagram illustrating an example of a scene recognition method according to an embodiment of the present invention.
FIG. 6 is a diagram illustrating an example of a keyword map generation method related to video content search support according to an embodiment of the present invention.

Hereinafter, various embodiments of the present invention are described with reference to the accompanying drawings. This is not intended to limit the present invention to particular embodiments, and the description should be understood to cover various modifications, equivalents, and/or alternatives of the embodiments. In the description of the drawings, similar reference numerals may be used for similar components.

FIG. 1 is a diagram illustrating an example of a network environment according to an embodiment of the present invention.

Referring to FIG. 1, a network environment 10 according to an embodiment of the present invention may include a network 50, an external server device 300, a video service device 200, and a terminal 100.

The network environment of the present invention described above analyzes the information contained in media, in particular video content, and provides representative keywords. It automatically generates keyword information for describing and expressing video content effectively and, based on that keyword information, supports uses such as summary description, search, and recommendation of videos. In particular, the present invention provides a method that divides video content into scenes and, based on the detailed metadata or context information obtained for each scene through image recognition and speech recognition (or subtitle recognition, OCR, etc.), selects a small number of core keywords that characterize the video. The invention thereby goes beyond the limits of conventional approaches, in which simple face recognition applied scene by scene yields only simple actor-related information (e.g., the scenes or time points at which a particular actor appears), and makes it possible to derive representative keyword information that characterizes the video.

The network 50 may consist of an IP-based wired network such as the Internet, a mobile communication network such as an LTE (Long Term Evolution) or WCDMA network, various kinds of wireless networks such as Wi-Fi, or a combination of these. That is, the network environment 10 related to video content search support according to the present invention applies equally to wired and wireless networks. Specifically, the network 50 may form a communication channel between the video service device 200 and the terminal 100, or between the video service device 200 and the external server device 300. For example, the network 50 may support at least one of the 3G, 4G, and 5G wireless mobile communication schemes that the video service device 200, the external server device 300, or the terminal 100 can operate. Alternatively, the network 50 may form, on a wired basis, the communication channel between the video service device 200 and the external server device 300 or the communication channel between the terminal 100 and the video service device 200. The network 50 should be interpreted as encompassing wired networks, wireless networks, and combinations thereof, whether already developed and commercialized or to be developed and commercialized in the future.

The terminal 100 may access the video service device 200 through the network 50. The terminal 100 may act as a consumer that searches for, watches, or downloads the various video content provided by the video service device 200. To this end, the terminal 100 may include a terminal communication circuit for accessing the video service device 200, a display capable of outputting a web page, received from the video service device 200, through which video content can be searched, an input unit capable of generating input signals related to video content search, a memory capable of storing received video temporarily or semi-permanently, and a control unit capable of controlling these components (the terminal communication circuit, display, input unit, memory, and so on). The terminal 100 may also act as a provider that uploads video content it has produced (e.g., a short clip video) to the video service device 200. In this case, the terminal 100 may provide an editing function for producing a short clip video from stored long-form video content (e.g., video lasting from tens of minutes to several hours).

The external server device 300 may store at least one item of video content and provide it at the request of the video service device 200. Alternatively, the external server device 300 may provide the various algorithms or meta candidate information needed while the video service device 200 generates keyword information for video content of a given length (e.g., a short clip video, hereinafter "video clip"). Alternatively, the external server device 300 may store the various algorithms and data needed for scene analysis of a video clip (e.g., a face recognition algorithm and face recognition DB, an audio recognition algorithm and audio recognition DB, a background analysis algorithm and background DB, a place analysis algorithm and place DB, and a music recognition algorithm and music DB) and provide them to the video service device 200. The external server device 300 may also be included in the video service device 200.

The video service device 200 may receive at least one video clip from the external server device 300 or the terminal 100 and store it. Alternatively, the video service device 200 may receive and store at least one video clip through a separate data reception path (e.g., an input interface or a connected memory device). The video service device 200 may divide a received video clip into scenes, apply various recognition modules to the divided scenes to obtain various meta information (e.g., candidate keywords), and analyze the obtained meta information with respect to the video clip to select at least one representative keyword that can represent the clip. The video service device 200 may map the selected keyword to the video clip and store it. In doing so, the video service device 200 may store and manage the selected keyword as a hashtag of the video clip so that the clip can be searched by that keyword.
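The flow described above (divide into scenes, run recognizers, collect candidate keywords, select representatives, store them as hashtags) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the `Scene` representation, the toy recognizers, and the rule of ranking keywords by the number of scenes in which they appear are all hypothetical choices made for the example.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List

# Hypothetical scene representation: recognition results already attached per scene.
Scene = Dict[str, List[str]]
Recognizer = Callable[[Scene], Iterable[str]]


def extract_candidates(scenes: List[Scene], recognizers: List[Recognizer]) -> Counter:
    """Count, for each candidate keyword, the number of scenes it was found in."""
    counts: Counter = Counter()
    for scene in scenes:
        found = set()
        for recognize in recognizers:
            found.update(recognize(scene))
        counts.update(found)  # each keyword counts at most once per scene
    return counts


def tag_clip(scenes: List[Scene], recognizers: List[Recognizer], top_n: int = 3) -> List[str]:
    """Return the top_n keywords as hashtags (assumed ranking: scene count, then name)."""
    counts = extract_candidates(scenes, recognizers)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ["#" + keyword for keyword, _ in ranked[:top_n]]


# Toy recognizers standing in for face and place recognition modules.
face_recognizer: Recognizer = lambda scene: scene.get("faces", [])
place_recognizer: Recognizer = lambda scene: scene.get("places", [])

scenes = [
    {"faces": ["actor_a"], "places": ["paris"]},
    {"faces": ["actor_a"], "places": ["cafe"]},
    {"faces": ["actor_b"], "places": ["paris"]},
]
hashtags = tag_clip(scenes, [face_recognizer, place_recognizer], top_n=2)
```

In a real system each recognizer would wrap a trained model, and the ranking rule would be replaced by whatever analysis the device applies to the collected meta information.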

FIG. 2 is a diagram illustrating a schematic configuration of a video service device according to an embodiment of the present invention.

Referring to FIG. 2, the video service device 200 of the present invention may include a communication circuit 210, a memory 230, and a processor 250.

The communication circuit 210 may form the communication channels of the video service device 200. For example, at regular intervals or at an administrator's request, the communication circuit 210 may form a communication channel with the external server device 300 and receive at least one video clip from it. The communication circuit 210 may also form a communication channel with the external server device 300 to collect external data about at least one scene included in a video clip 231 while the video clip 231 stored in the memory 230 is being analyzed. In addition, the communication circuit 210 may form a communication channel that supports access by the terminal 100 through, for example, the network 50. In this case, the communication circuit 210 may provide the connected terminal 100 with a screen on which a search term can be entered, and receive from the terminal 100 a search term (or keyword) for finding a particular video clip 231. Under the control of the processor 250, the communication circuit 210 may provide the terminal 100 with the video clip 231 corresponding to the keyword supplied by the terminal 100.

The memory 230 may store at least one video clip 231, a meta DB 233, and a keyword map 235. The video clip 231 may be received from the external server device 300 or the terminal 100, or may be produced and stored by the video service device 200 itself. The video clip 231 may include, for example, short video content generated from long-form video content and having a relatively short playback time (e.g., several minutes to several tens of minutes, or several hours). For example, the video clip 231 may include the short clip video or UGC video mentioned above. Alternatively, the video clip 231 may include video content whose meta information is no larger than a specified amount. The memory 230 may store and manage a plurality of video clips 231. Each video clip 231 may be processed by the various recognition modules and assigned at least one keyword selected on the basis of the recognition results, that is, the meta information. The keyword may be used to search for the video clip 231.

The video clip 231 may include at least one scene. The at least one scene may be, for example, the plurality of frames constituting the video, or the I (intra) frames among them. Alternatively, a scene may consist of a plurality of frames across which the background does not change by more than a certain area. Accordingly, when objects such as at least one person, animal, or thing appear in the video clip 231, the frames may be treated as a single scene even if those objects move, as long as the background is maintained over at least a certain area.
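One way to picture the "background maintained over a certain area" criterion is a simple frame-difference rule, sketched below. It is a simplification made for illustration: it does not separate background from foreground, it only measures what fraction of pixels changed between consecutive frames. The flattened grayscale frame format and the thresholds `pixel_tol` and `max_changed` are assumptions, not values taken from the text.

```python
from typing import List, Sequence

Frame = Sequence[int]  # hypothetical: flattened grayscale pixel values, 0 to 255


def changed_fraction(prev: Frame, cur: Frame, pixel_tol: int = 10) -> float:
    """Fraction of pixels whose value differs by more than pixel_tol."""
    if len(prev) != len(cur) or not prev:
        raise ValueError("frames must be non-empty and the same size")
    changed = sum(1 for a, b in zip(prev, cur) if abs(a - b) > pixel_tol)
    return changed / len(prev)


def split_into_scenes(frames: List[Frame], max_changed: float = 0.5) -> List[List[int]]:
    """Group frame indices into scenes; a new scene starts when more than
    max_changed of the picture area differs from the previous frame."""
    scenes: List[List[int]] = []
    for i, frame in enumerate(frames):
        if i == 0 or changed_fraction(frames[i - 1], frame) > max_changed:
            scenes.append([i])
        else:
            scenes[-1].append(i)
    return scenes


frames = [
    [0] * 10,             # frame 0: dark background
    [0] * 8 + [200] * 2,  # frame 1: a small object moves in (20% of the area changes)
    [255] * 10,           # frame 2: the whole picture changes, so a new scene starts
    [255] * 10,           # frame 3: unchanged
]
scene_indices = split_into_scenes(frames)
```

With these toy frames, the small moving object in frame 1 does not start a new scene, while the full change at frame 2 does.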

The meta DB 233 may store the meta information that the various recognition modules recognize for the video clip 231. This meta information may be used to select keywords for the video clip 231. The meta DB 233 may also store at least one piece of keyword information to be assigned to the video clip 231, and each keyword may be mapped to at least one piece of meta information. Accordingly, the processor 250 may derive at least one keyword from the meta information obtained while recognizing the video clip 231, and select a representative keyword based on, for example, when the derived keywords appear, how often they appear, or their importance. The meta information and keywords stored in the meta DB 233 may be updated with information provided by the external server device 300.
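The selection criteria named above (appearance time, number of appearances, importance) could be combined into a single score, for example as a weighted sum. The sketch below is only one possible reading: the text does not specify a formula, so the `Candidate` fields, the weights, and the idea that earlier appearance scores higher are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    keyword: str
    first_seen_s: float  # time of first appearance in the clip, in seconds
    count: int           # number of appearances
    importance: float    # hypothetical importance in the range 0..1


def score(c: Candidate, clip_len_s: float,
          w_count: float = 1.0, w_early: float = 0.5, w_importance: float = 2.0) -> float:
    """Weighted sum of the three criteria; the weights are illustrative only."""
    if clip_len_s <= 0:
        raise ValueError("clip length must be positive")
    earliness = 1.0 - min(max(c.first_seen_s / clip_len_s, 0.0), 1.0)
    return w_count * c.count + w_early * earliness + w_importance * c.importance


def rank(candidates: List[Candidate], clip_len_s: float) -> List[Candidate]:
    """Highest score first; ties are broken alphabetically for determinism."""
    return sorted(candidates, key=lambda c: (-score(c, clip_len_s), c.keyword))


candidates = [
    Candidate("stadium", first_seen_s=60.0, count=2, importance=0.4),
    Candidate("soccer", first_seen_s=0.0, count=5, importance=0.9),
    Candidate("goal", first_seen_s=100.0, count=3, importance=0.5),
]
ranked = rank(candidates, clip_len_s=120.0)
representative = ranked[0].keyword
```

The first element of the ranking plays the role of the representative keyword; the remaining elements could be kept as additional hashtags.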

The keyword map 235 may include information mapping the identification information of each video clip 231 to at least one keyword. The keyword map 235 may also include information mapping the identification information of a particular video clip 231 to the representative keyword information for that clip. The keyword map 235 may be updated as recognition and keyword derivation are performed after a new video clip 231 is received.
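A keyword map of this kind can be pictured as two dictionaries keyed by clip identifier plus a reverse index for search. The class below is a hypothetical in-memory sketch; the names `KeywordMap`, `update`, and `search` are invented for the example, and a deployed service would more likely keep this in a database.

```python
from collections import defaultdict
from typing import Dict, List, Set


class KeywordMap:
    """Maps clip identifiers to keywords and a representative keyword,
    and keeps a reverse index so clips can be found by keyword."""

    def __init__(self) -> None:
        self._keywords: Dict[str, List[str]] = {}
        self._representative: Dict[str, str] = {}
        self._index: Dict[str, Set[str]] = defaultdict(set)

    def update(self, clip_id: str, keywords: List[str], representative: str) -> None:
        """Insert or replace the entry for clip_id."""
        if representative not in keywords:
            raise ValueError("representative keyword must be one of the keywords")
        # Drop index entries left over from an earlier analysis of the same clip.
        for old in self._keywords.get(clip_id, []):
            self._index[old].discard(clip_id)
        self._keywords[clip_id] = list(keywords)
        self._representative[clip_id] = representative
        for keyword in keywords:
            self._index[keyword].add(clip_id)

    def representative(self, clip_id: str) -> str:
        return self._representative[clip_id]

    def search(self, keyword: str) -> List[str]:
        """Return the identifiers of all clips tagged with keyword, sorted."""
        return sorted(self._index.get(keyword, set()))


keyword_map = KeywordMap()
keyword_map.update("clip-1", ["soccer", "goal"], representative="soccer")
keyword_map.update("clip-2", ["soccer"], representative="soccer")
soccer_clips = keyword_map.search("soccer")
```

Calling `update` again for an existing clip replaces its keywords, which mirrors the map being refreshed after a clip is re-analyzed.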

The processor 250 may provide the keyword selection and video clip 231 search functions of the video service device 200. To this end, the processor 250 may receive at least one video clip from the external server device 300 or the terminal 100, at regular intervals or in response to a request from an administrator or from the external server device 300, and store it in the memory 230. When a new video clip is stored in the memory 230, or when a specified time arrives, the processor 250 may perform keyword selection on the video clip 231 stored in the memory 230. In the keyword selection operation, the processor 250 may divide the video clip 231 into scenes, apply various recognition processes to the divided scenes, classify and analyze the meta information obtained through recognition, and select at least one corresponding keyword or a representative keyword according to the analysis results. The processor 250 may provide a search function for the video clip 231 by storing and managing the selected representative keyword as a hashtag of the clip.

FIG. 3 is a diagram illustrating an example of a processor of a video service device according to an embodiment of the present invention.

Referring to FIG. 3, the processor 250 may include a scene meta extractor 251, an integrated video meta extractor 253, an external data extractor 255, an analyzer 257, and a keyword selector 259. The processor 250 of the present invention may automatically generate representative keywords for a video clip based on image recognition technology. Alternatively, the processor 250 may automatically generate summary information describing the content of the video clip. For example, the processor 250 may provide a multi-label classification task (a function that provides multiple main keywords describing the video) for a single video clip file. In this operation, the processor 250 may select and list the meaningful or highly important keywords from among the tens of thousands of meta items extracted scene by scene with various image recognition technologies, and provide a fixed number of keywords.

The scene meta extractor 251 is a recognizer (or recognition module, or recognition processor) that automatically recognizes scene meta. Using image recognition technology, it may generate meta information that describes the video content well in terms of the five Ws and one H (who, when, where, what, how, and why). For example, the scene meta extractor 251 may use face recognition to automatically generate information on who appears on screen, place recognition to generate information on the setting in which the story unfolds, and situation recognition to generate information on the situation or event taking place in the video content. It may also use object recognition to generate information on the main objects appearing in the video content, and speech recognition (dialogue recognition) and subtitle recognition to analyze the dialogue and the subtitles written on screen and extract additional meta. In addition, the scene meta extractor 251 may recognize, using OCR (optical character recognition), the graphically rendered special subtitles that the media producer inserted on screen, and generate special subtitle information. For example, it may extract detailed place names applied as special subtitles in entertainment programs, such as #Paris Champs-Élysées and #New York Times Square. As described above, the scene meta extractor 251 divides the video clip into scenes and, based on the metadata extracted from each section divided by scene (or by image, or by local unit), may extract the local feature of each scene.

The image integrated meta extractor 253 may recognize the Meta of the video clip as a whole. Rather than looking at information frame image by frame image, the image integrated meta extractor 253 may turn the entire video into a single feature vector and classify it. That is, it may grasp the content of the whole video comprehensively to determine its genre or to extract a single representative keyword. For example, after being trained on a large number of music videos, the image integrated meta extractor 253 may classify a video clip related to a music video under the keyword Music Video. Likewise, after being trained on a large number of soccer videos, it may classify the keyword of a soccer-related video clip as soccer. Here, the image integrated meta extractor 253 may increase the accuracy and diversity of the classification results by combining several approaches: classifying the video content as a whole with supervised learning (Video Feature Vector), concatenating the vectors extracted from each frame of the video content and classifying the result (Aggregate Vector), and using data that is continuous across the whole video, such as background music and dialogue. As described above, the image integrated meta extractor 253 grasps the overall characteristics of the video file and may summarize and extract its global feature. For example, it may classify comprehensive keywords for a video clip by combining the keyword corresponding to the clip's genre, the relationships between preceding and following frames of the video content, and the combined relationship between the image and audio data.

The external data extractor 255 collects data from an external database (e.g., a database of the external server device 300) through a web crawler or the like, and may analyze data from various information sources such as the websites of films or broadcasters, expert rating sites, and blogs. The data collected by the external data extractor 255 is text data, which may be converted into keyword candidate data in noun or adjective form by passing it through a topic model and a morphological analyzer.

The analyzer 257 may generate a model corresponding to the per-frame Meta of the video clip (the data extracted by the scene meta extractor 251) and the integrated video Meta (the data extracted by the image integrated meta extractor 253), and may map (classify) the keywords corresponding to that model to generate multiple candidate keyword lists. Because the candidate keyword lists contain far more values than the representative keywords that characterize a video clip, the analyzer 257 may perform scoring modeling to extract the important and meaningful keywords from among them. The scoring model determines the priority of the extracted keywords, and may use score factors to do so. The score factors may include a frequency score, a confidence score, a co-occurrence score, and a specialty score. The frequency score is the frequency with which each Meta item appears in the video. For example, if a person appearing in a scene is recognized as the actor "Jung Woo-sung", the number of frames in which Jung Woo-sung appears and is recognized may be the frequency score. The confidence score is the confidence value reported when each Meta item is extracted by a recognizer. For example, if a particular scene of the video clip is recognized as a bungee jumping scene, the confidence score may be the confidence that the scene really is a bungee jumping scene. The co-occurrence score raises a keyword's priority when that keyword appears both among the recognized Meta values and in the data collected by external crawling. The specialty score assigns a high score to keywords, such as kiss scene or wedding scene, whose priority the service planner wishes to adjust. The logic for assigning the specialty score may be changed according to the planner's intent and the genre.
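As a non-limiting illustration, the four score factors described above could be combined into a single final score as sketched below. The description does not specify how the factors are combined; the weighted sum, the weight values, the `Candidate` fields, and the `SPECIALTY_BOOST` table are all assumptions introduced only for this example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    keyword: str
    frame_hits: int         # frames in which the Meta item was recognized
    confidence: float       # recognizer confidence in [0, 1]
    in_external_data: bool  # keyword also found in crawled text


# Hypothetical weights; the source does not give a combination formula.
WEIGHTS = {"frequency": 0.4, "confidence": 0.3, "co_occurrence": 0.2, "specialty": 0.1}

# Hypothetical planner-defined boosts (specialty score).
SPECIALTY_BOOST = {"kiss scene": 1.0, "wedding scene": 1.0}


def final_score(c: Candidate, total_frames: int) -> float:
    """Weighted sum of the frequency, confidence, co-occurrence and specialty factors."""
    frequency = c.frame_hits / total_frames if total_frames else 0.0
    co_occurrence = 1.0 if c.in_external_data else 0.0
    specialty = SPECIALTY_BOOST.get(c.keyword, 0.0)
    return (
        WEIGHTS["frequency"] * frequency
        + WEIGHTS["confidence"] * c.confidence
        + WEIGHTS["co_occurrence"] * co_occurrence
        + WEIGHTS["specialty"] * specialty
    )
```

Under these assumed weights, a keyword that is frequent, confidently recognized, and confirmed by crawled data ranks above one that appears in only a few frames, while a planner-boosted keyword such as wedding scene is lifted relative to an otherwise identical candidate.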

The keyword selector 259 may check the final scores of the keywords output by the analyzer 257 through the above process, and select the representative keyword for the video clip according to the final scores. Alternatively, the keyword selector 259 may select at least one neighboring keyword for the video clip according to the scores of the candidate keywords in the candidate keyword list. For example, the keyword selector 259 may select the keyword with the highest score as the representative keyword, and select at least one keyword with the second or third highest score as a neighboring keyword. When keyword selection is complete, the keyword selector 259 may update the keyword map 235. For example, the keyword selector 259 may newly write to the keyword map 235 a mapping value that maps the representative keyword, or at least one neighboring keyword, calculated for a particular video clip to that clip.
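The selection rule described above (highest score as the representative keyword, next-ranked scores as neighboring keywords, then an update of the keyword map) may be sketched as follows. The function names, the dictionary layout of the keyword map, and the default of two neighboring keywords are assumptions made for illustration only.

```python
from typing import Dict, List, Optional, Tuple


def select_keywords(
    scores: Dict[str, float], num_neighbors: int = 2
) -> Tuple[Optional[str], List[str]]:
    """Return (representative keyword, neighboring keywords) ranked by final score."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return None, []
    representative = ranked[0][0]
    neighbors = [keyword for keyword, _ in ranked[1 : 1 + num_neighbors]]
    return representative, neighbors


def update_keyword_map(
    keyword_map: Dict[str, dict], clip_id: str, scores: Dict[str, float]
) -> None:
    """Write the selected keywords for one clip into a (hypothetical) keyword map."""
    representative, neighbors = select_keywords(scores)
    keyword_map[clip_id] = {"representative": representative, "neighbors": neighbors}
```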

The embodiments described above may be modified in various ways. For example, the processor 250 may configure the recognition units, the score model, the external data extractor, and so on differently according to the characteristics or genre of the video clip. The recognition engines (or recognition algorithms, or recognition modules) operated by the processor 250 may also be combined in various ways, and the factors and logic reflected in the score model may be adjusted.

In addition, the conditions under which the processor 250 operates, such as whether to recognize every frame in the per-frame recognition process or to skip frames at a fixed interval, may vary in view of the trade-off among data processing speed, accuracy, and usefulness for the video clip.

FIG. 4 is a diagram illustrating an example of the scene meta extractor of the present invention.

Referring to FIG. 4, the scene meta extractor 251 may include a face analyzer 251a, a situation analyzer 251b, and an OCR recognizer 251C.

The scene meta extractor 251 may extract keywords by image recognition for each scene obtained through scene segmentation. In a video, the people, places, events, and objects of a storyline-based scene tend to be filmed and to appear repeatedly over many frames for some period of time. For this reason, the scene meta extractor 251 may process the frames by passing them through the recognition engines in alternation, frame by frame.

The face analyzer 251a may analyze the faces of the people in a scene and extract meta information such as which actor each face belongs to. When a scene contains several people, the face analyzer 251a may extract meta information (e.g., name information) for each of them.

The situation analyzer 251b may recognize the background of a scene, for example its location or its time-of-day or seasonal setting, and extract meta information corresponding to the recognition result. Alternatively, the situation analyzer 251b may recognize the situation in the scene and extract meta information about the recognized situation (e.g., driving, exercising, a wedding, a shootout). The situation analyzer 251b may also recognize at least one object in the scene and extract meta information corresponding to the recognized object (e.g., desk, chair, mountain, river, sea).

The OCR recognizer 251C may extract the special subtitle information written in a scene.

The scene meta extractor 251 described above may divide the frames of a particular scene of the video clip by timestamp and apply the above analyzers to them in sequence. For example, when a scene consists of several hundred frames, the scene meta extractor 251 may use the face analyzer 251a on the frame at timestamp 1 to recognize the people appearing in the scene through face recognition, and use the situation analyzer 251b on the frame at timestamp 2 to recognize the time-of-day or seasonal setting of the scene through time recognition. The scene meta extractor 251 may then use the situation analyzer 251b on the frame at timestamp 3 to recognize the place appearing in the scene through place recognition, on the frame at timestamp 4 to recognize the situation through situation recognition, and on the frame at timestamp 5 to recognize the main objects through object recognition. In this way the scene meta extractor 251 may alternate among the recognition engines used in the above operation. For example, assuming one scene is three minutes long and timestamps are processed at one-second intervals, the 180 timestamps are shared among the five recognitions, so the scene meta extractor 251 performs face recognition, time recognition, place recognition, situation recognition, and object recognition 36 times each and provides the results. Here, the video length of the video clip and the timestamp interval may be adjusted. For example, at an administrator's request or according to the keyword policy for a video clip, the timestamp interval may be narrowed to extract relatively many keywords, or widened to extract only some keywords. In this case, a given piece of information may be missed at a particular timestamp, but because the processing repeats, each kind of keyword information is still extracted. In addition, the post-processing time needed to filter and refine information that appears many times can be reduced.
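The alternating assignment of recognition engines to timestamps, and the resulting count of 36 runs per engine for a three-minute scene, can be checked with the short sketch below. The engine names and the `schedule` function are hypothetical labels for this example; only the round-robin order and the one-second interval come from the description.

```python
from collections import Counter
from itertools import cycle

# Order taken from the description: timestamp 1 -> face, 2 -> time, ..., 5 -> object.
ENGINES = ["face", "time", "place", "situation", "object"]


def schedule(scene_seconds: int, interval_seconds: int = 1):
    """Pair each timestamp (1-based) with the next recognition engine in rotation."""
    num_timestamps = scene_seconds // interval_seconds
    timestamps = range(1, num_timestamps + 1)
    return list(zip(timestamps, cycle(ENGINES)))


plan = schedule(scene_seconds=180, interval_seconds=1)
runs_per_engine = Counter(engine for _, engine in plan)
```

Widening `interval_seconds` reduces the number of runs per engine, which corresponds to the option described above of extracting only some keywords.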

The meta information processed by the scene meta extractor 251 described above may cover the five Ws and one H.

For example, the "who" concerns which actors or celebrities appear in the broadcast content, and can be addressed by training a face recognition function on a celebrity DB and applying it. When frames pass through the face recognition engine, the scene meta extractor 251 may automatically recognize actors and entertainers already registered in the celebrity DB and generate keywords (hashtags) for those people.

The "when" concerns the time or season in which the broadcast content was filmed, and may be extracted by recognizing particular objects from which the time or season can be inferred. For example, the scene meta extractor 251 may determine #spring when it recognizes cherry blossoms on screen, #winter #snowscape when it recognizes winter padded jackets or a snowy scene, and #autumn when it recognizes autumn leaves or cosmos flowers.
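One simple way to realize this kind of inference is a lookup from recognized object labels to season hashtags, as sketched below. The object label strings and the `season_tags` function are assumptions for illustration; the object-to-season pairs follow the examples given in the description.

```python
# Hypothetical object labels mapped to the season hashtags named in the description.
SEASON_RULES = {
    "cherry_blossom": ["#spring"],
    "padded_jacket": ["#winter", "#snowscape"],
    "snowfall": ["#winter", "#snowscape"],
    "autumn_leaves": ["#autumn"],
    "cosmos_flower": ["#autumn"],
}


def season_tags(detected_labels):
    """Collect season hashtags for the recognized objects, without duplicates."""
    tags = []
    for label in detected_labels:
        for tag in SEASON_RULES.get(label, []):
            if tag not in tags:
                tags.append(tag)
    return tags
```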

The "where" concerns the place in which the broadcast content was filmed, and may be determined by recognizing buildings, landmarks, and the like that indicate a major location. For example, the scene meta extractor 251 may provide place name information by recognizing major facilities such as a church or cathedral, an amphitheater, or an airport, or by recognizing major landmarks such as the Eiffel Tower, the Blue House, or Cheomseongdae.

The "what" concerns what situation is taking place in the broadcast content, and may include a sports or hobby activity, or an event such as a birthday party, a wedding, or a sixtieth birthday celebration. The scene meta extractor 251 may provide meta information for the "what" by recognizing the tools that characterize a sport or hobby, or the objects that characterize an event. For example, the scene meta extractor 251 may provide meta information such as #baseball when it recognizes a baseball bat or a ballpark, #fishing when it recognizes a fishing rod and bait, #birthday party when it recognizes balloons and a birthday cake, and #wedding when it recognizes a wedding dress and a wedding hall.
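Because several of these examples depend on a combination of objects rather than a single object, a rule table with "any of" and "all of" conditions is one possible way to express them. The sketch below is illustrative only; the rule format, the object label strings, and the reading of each example as "any" or "all" are assumptions.

```python
# (hashtag, mode, required object labels); labels and modes are hypothetical.
ACTIVITY_RULES = [
    ("#baseball", "any", {"baseball_bat", "ballpark"}),
    ("#fishing", "all", {"fishing_rod", "bait"}),
    ("#birthday party", "all", {"balloon", "birthday_cake"}),
    ("#wedding", "all", {"wedding_dress", "wedding_hall"}),
]


def activity_tags(detected_labels):
    """Return the hashtags whose object conditions are met by the recognized objects."""
    seen = set(detected_labels)
    tags = []
    for tag, mode, required in ACTIVITY_RULES:
        if mode == "any":
            matched = bool(required & seen)  # at least one required object present
        else:
            matched = required <= seen       # every required object present
        if matched:
            tags.append(tag)
    return tags
```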

The scene meta extractor 251 may use the OCR recognizer 251C to generate richer and more accurate keywords. For example, the graphically rendered special subtitles that a media producer has inserted on screen may be recognized and used. In this case, the scene meta extractor 251 may use the OCR recognizer 251C to recognize special subtitles that are present as graphic data rather than as text data.

As another example, the scene meta extractor 251 may alternate among the recognition engines described above (e.g., face recognition, time recognition, place recognition, situation recognition, and object recognition) while skipping frames arbitrarily, without dividing the clip into scenes. Because each recognition engine already processes the video several times even within a single scene, the division between scenes may not greatly affect overall keyword extraction. Arbitrary frame skipping can therefore offset the effect that the performance of the scene segmentation algorithm would otherwise have on overall keyword extraction. Without dividing the clip into scenes, the scene meta extractor 251 may skip frames across the whole video clip at arbitrary time intervals, extract the meta information produced by each recognition engine, and then extract the keywords that represent the video clip on that basis. The scene meta extractor 251 may also compare the keywords extracted by per-scene analysis with the keywords extracted by analyzing the frames selected under an arbitrary skip policy, and support selection of the representative keyword according to the comparison result.

FIG. 5 is a diagram illustrating an example of a scene recognition method according to an embodiment of the present invention.

Referring to FIG. 5, in the scene recognition method according to an embodiment of the present invention, the processor 250 of the video service device 200 may acquire a video clip 231 in step 501. For example, the processor 250 may read from the memory 230 a video clip 231 for which keyword selection is required.

In step 503, the processor 250 may convert the video clip 231 into images. For example, the processor 250 may convert the video clip 231, which has a video structure, so that it can be viewed frame by frame.

In step 505, the processor 250 may divide the converted video clip into scenes at scene transitions. That is, the processor 250 may recognize the points at which the scene changes and divide the frames into scenes at those points. For example, the processor 250 may compare the intra frames used for detecting scene changes and, when they differ by more than a certain amount of data, recognize that point as a scene split point.
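A minimal sketch of threshold-based scene splitting is given below, assuming each frame is available as a flat list of pixel intensities. The mean absolute difference metric, the threshold value, and the function names are assumptions for this example; the description only states that a change above a certain amount marks a scene split point.

```python
def mean_abs_diff(frame_a, frame_b):
    """Average absolute pixel difference between two equally sized frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def split_scenes(frames, threshold=30.0):
    """Group frame indices into scenes, starting a new scene at each large change."""
    if not frames:
        return []
    scenes = [[0]]
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i - 1], frames[i]) > threshold:
            scenes.append([i])    # change exceeds threshold: scene split point
        else:
            scenes[-1].append(i)  # same scene continues
    return scenes
```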

In step 507, the processor 250 may perform face recognition for each scene. For example, the processor 250 may recognize the face of a character and extract the recognition result (e.g., the name of the actor playing that character).

In step 509, the processor 250 may perform time and season recognition for the scene. For example, the processor 250 may distinguish day from night by the illumination in the scene, or recognize the time and season from snow, rain, and the like, and extract the recognition result.

In step 511, the processor 250 may perform place/landmark recognition for each scene. For example, the processor 250 may recognize a place or landmark using the structures in the scene or the natural scenery visible in the background. Alternatively, the processor 250 may recognize the place or landmark of the scene from subtitles, dialogue, and the like.

In step 513, the processor 250 may recognize the activity or event in each scene. For example, based on the behavior or appearance of the characters, the processor 250 may perform situation recognition covering various activities or events such as weddings, sports, and games.

In step 515, the processor 250 may check whether the current frame is the last frame in the same scene. If it is not the last frame, the processor 250 may branch back to before step 507 and repeat the subsequent operations for another frame.

In step 517, if the current frame is the last frame in the same scene, the processor 250 may collect the recognition results and map them as keywords for that scene.

In step 519, the processor 250 may check whether the current frame is the last frame in the video and, if not, branch back to an earlier step, for example before step 507, and repeat the subsequent operations. If it is the last frame, scene recognition for the video clip 231 may end.

Meanwhile, in the above description, the order of steps 507 to 513 may be changed. The scene recognition performed in steps 507 to 513 may be carried out on every frame in a scene, or one frame may be assigned to each recognition process. Alternatively, one recognition process may be performed on a plurality of frames. Alternatively, a plurality of frames in a scene may be sampled, and the recognition processes of steps 507 to 517 described above may then be applied to the sampled frames singly or in combination.
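The overall flow of FIG. 5 (iterate over scenes, run the recognizers of steps 507 to 513 on each frame, then map the collected results to the scene in step 517) may be sketched as nested loops. The `recognize_clip` function and the representation of recognizers as callables that return a list of keywords per frame are assumptions for illustration.

```python
def recognize_clip(scenes, recognizers):
    """Return one keyword list per scene.

    scenes: list of scenes, each a list of frames (any frame representation).
    recognizers: callables taking a frame and returning a list of keywords;
        these stand in for the face, time/season, place and activity recognizers.
    """
    scene_keywords = []
    for frames in scenes:                  # repeat until the last frame of the video (step 519)
        collected = []
        for frame in frames:               # repeat until the last frame of the scene (step 515)
            for recognize in recognizers:  # steps 507 to 513, in any order
                for keyword in recognize(frame):
                    if keyword not in collected:
                        collected.append(keyword)
        scene_keywords.append(collected)   # step 517: map results to the scene
    return scene_keywords
```

Passing only a sampled subset of each scene's frames in `scenes` corresponds to the sampling variation mentioned above.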

FIG. 6 is a diagram illustrating an example of a keyword map generation method for supporting video content search according to an embodiment of the present invention.

Referring to FIG. 6, in the keyword map generation method for supporting video content search, the processor 250 may acquire a video clip 231 in step 601. For example, the processor 250 may receive a video clip 231 requiring keyword selection from the external server device 300, or acquire a video clip 231 stored in the memory 230.

In step 603, the processor 250 may convert the original video file into vectors. In step 605, the processor 250 may perform video feature vector analysis. The video feature vector analysis may be an approach that classifies the video content as a whole using supervised learning.

In step 607, the processor 250 may perform Aggregated meta analysis. The Aggregated meta analysis may be an approach that concatenates the vectors extracted from each frame of the video content and classifies the result. In step 609, the processor 250 may perform music recognition. Music recognition may recognize the music applied to the whole video clip or at least one piece of music applied to a particular scene. In step 611, the processor 250 may perform dialogue recognition. During dialogue recognition, the processor 250 may extract the nouns, adjectives, and the like contained in the dialogue. The recognition that the processor 250 performs on the video clip 231 need not follow a particular order and may be carried out in parallel. For example, the processor 250 may perform video feature vector analysis, Aggregated meta analysis, music analysis, dialogue recognition, and so on in any order, or in an order other than the one described above.

Meanwhile, in step 602, the processor 250 may perform image conversion as described above with reference to FIG. 5, and in step 604, the processor 250 may perform face recognition. In step 606, the processor 250 may perform situation/event recognition, and in step 608, the processor 250 may perform place/landmark recognition. In step 610, the processor 250 may perform dialogue/subtitle recognition. In step 612, the processor 250 may check whether the current frame is the last frame in the video and, if not, branch back to before step 604 and repeat the subsequent operations. The processor 250 may perform these recognition processes in any order, or in an order other than the one described above.

If the frame currently being recognized in step 612 is the last frame, then in step 613 the processor 250 may assign scores, based on the scoring model, to the extracted meta information or to the candidate keywords in the candidate keyword list, and extract the representative keyword by comparing the scores.

By generating representative keyword information for video clips, the present invention can save the labor cost of producing that information and allow the work to be done efficiently. The present invention also supports near-real-time keyword generation for video clip content through image recognition technology. In addition, after extracting the Multi-Label of a particular video through image recognizers, the present invention can combine the appearance frequency, recognition accuracy, and so on of each label (keyword) to extract representative keywords, and can also be used to derive additional value such as genre analysis. The present invention can further make it easier to collect a training database for a particular keyword: a training database can be extracted from video content (video clips) based on keyword information by character, time/season, place, and situation. For example, when videos of wedding scenes are to be collected in order to learn the wedding situation, the present invention can obtain various tag information for the video clips and use it in turn to build the training DB of the image recognition engine.

The method according to the present invention may be implemented as software readable by various computer means and recorded on a computer-readable recording medium. The recording medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the recording medium may be specially designed and configured for the present invention, or may be known and available to those skilled in computer software. Examples of the recording medium include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM (Compact Disk Read Only Memory) and DVD (Digital Video Disk); magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM (Random Access Memory), and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. Such hardware devices may be configured to operate as one or more software modules in order to perform the operations of the present invention, and vice versa.

In addition, implementations of the functional operations and subject matter described in this specification may be realized in other types of digital electronic circuitry, in computer software, firmware, or hardware including the structures disclosed in this specification and their structural equivalents, or in a combination of one or more of these. Implementations of the subject matter described in this specification may be realized as one or more computer program products, that is, one or more modules of computer program instructions encoded on a tangible program storage medium for execution by, or for controlling the operation of, the device according to the present invention. The computer-readable medium may be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter affecting a machine-readable propagated signal, or a combination of one or more of these.

Although this specification contains many specific implementation details, these should not be understood as limiting the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features described in this specification in the context of separate embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented in multiple embodiments, either separately or in any suitable subcombination. Furthermore, although features may be described as operating in certain combinations and even initially claimed as such, one or more features from a claimed combination may in some cases be excluded from the combination, and the claimed combination may be changed to a subcombination or a variation of a subcombination.

Likewise, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, in order to achieve desirable results. In certain cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments; the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Particular embodiments of the subject matter described in this specification have been described. Other embodiments fall within the scope of the following claims. For example, the operations recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying drawings do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain implementations, multitasking and parallel processing may be advantageous.

The above description presents the best mode of the present invention and provides examples to explain the invention and to enable those skilled in the art to make and use it. The specification thus written does not limit the present invention to the specific terms presented. Accordingly, although the present invention has been described in detail with reference to the above examples, those skilled in the art may make alterations, changes, and modifications to these examples without departing from the scope of the present invention.

Therefore, the scope of the present invention should be determined not by the described embodiments but by the claims.

The present invention applies to the field of communications and relates, in particular, to video content search technology.

In particular, the present invention selects representative keywords for a wide variety of video clips and, by using multiple recognition modules, selects keywords highly relevant to each video clip as its representative keywords. It can thereby provide an efficient keyword selection process and improve the accuracy of video clip search.

10: network environment
100: terminal
200: video service device
210: communication circuit
230: memory
250: processor
251: scene meta extractor
253: video integrated meta extractor
255: external data extractor
257: analyzer
259: keyword selection unit
251a: face analyzer
251b: situation analyzer
251c: OCR analyzer
300: external server device

Claims

1. A video service device comprising:
a memory storing a video clip; and
a processor functionally connected to the memory,
wherein the processor is configured to:
acquire the video clip stored in the memory;
divide the video clip into scene units;
extract at least one candidate keyword by performing at least one recognition operation on at least one divided scene of the video clip; and
select a representative keyword for the video clip based on the extracted at least one candidate keyword.

2. The video service device of claim 1,
wherein the at least one candidate keyword is
extracted from graphic subtitles applied to the at least one scene.

3. The video service device of claim 1,
wherein the at least one candidate keyword is
extracted by applying, in combination, a feature vector extracted for each frame included in the at least one scene and a feature vector of the original video (global feature) of the video clip.

4. The video service device of claim 1,
wherein the at least one candidate keyword is
assigned a priority through a scoring model.

5. The video service device of claim 4,
wherein the processor
selects the candidate keyword having the highest priority as the representative keyword for the video clip.

6. The video service device of claim 1,
wherein the processor
performs at least one of face recognition, time or season recognition, place recognition, dialogue recognition, music recognition, and situation or event recognition.

7. A method for supporting video content search, the method comprising, by a video service device:
acquiring a video clip requiring keyword selection;
dividing the video clip into scene units;
extracting at least one candidate keyword by performing at least one recognition operation on at least one divided scene of the video clip; and
selecting a representative keyword for the video clip based on the at least one candidate keyword.

8. The method of claim 7,
wherein the video clip
comprises video content that is edited from at least one scene of an original video and has a playback length of several minutes to several tens of minutes.