KR102848028B1

KR102848028B1 - Question-answering system based on question category setting using LLM

Info

Publication number: KR102848028B1
Application number: KR1020250006576A
Authority: KR
Inventors: 이홍재; 고형석; 임철홍; 김성묵; 최현호; 오찬수; 안성진; 심경석
Original assignee: (주)유알피; 한국서부발전 주식회사
Filing date: 2025-01-16
Publication date: 2025-08-19
Anticipated expiration: 2045-01-16

Abstract

The present invention relates to a question-answering system that identifies a category entered in natural language, searches for documents related to a question within the set category range, and generates an answer to the question with a large language model (LLM) based on those documents. The system comprises: an input sentence classification unit that receives a natural-language input sentence entered by a user through a provided dialogue input window, determines whether the sentence is a category setting or a question, and instructs a category setting task or an answer generation task accordingly; a category setting unit that, as instructed by the input sentence classification unit, extracts a category from the input sentence and sets the extracted category as the range of documents to be searched; and an answer generation unit that, as instructed by the input sentence classification unit, recognizes the input sentence as a question, searches for documents related to the question within the set category, and generates an answer based on the contents of those documents.

Description

Question-answering system based on question category settings using LLM

The present invention relates to a question-answering system that identifies a category entered in natural language, searches for documents related to a question within the set category range, and generates an answer to the question with a large language model (LLM) based on those documents.

Large language models (LLMs) have become powerful tools for generating answers to a wide variety of questions. However, the broader the scope of the questions and the larger the data sources, the harder it can be for a model to provide accurate answers.

In particular, information handled in a specific domain may contain domain-specific terminology, or may use terms in a sense that differs from their general meaning. In addition, continuously retraining a model to reflect newly created domain knowledge consumes considerable cost and time.

To overcome these problems and improve answer accuracy, the RAG (Retrieval-Augmented Generation) approach is used: instead of training the model, data related to the question is retrieved, and the LLM generates an answer based on that data.

However, even the same question may call for a different answer depending on the context or topic, and when documents or information related to a question are retrieved, material on different topics may be mixed together, which increases the likelihood of an inaccurate answer. There is therefore a need for technology that identifies the category the user wants and performs retrieval only within that category, thereby improving the quality of LLM answers.

To solve the above problems, the present invention provides a function that lets a user set a category for questions and answers through natural-language input, and generates answers by searching for domain knowledge related to the question only in documents within the category range set by the user. Its object is thereby to increase the accuracy of the answers a large language model generates and to improve retrieval performance.

A question-answering system based on question category setting using an LLM according to one embodiment of the present invention may include: an input sentence classification unit that receives a natural-language input sentence entered by a user through a provided dialogue input window, determines whether the sentence is a category setting or a question, and instructs a category setting task or an answer generation task accordingly; a category setting unit that, as instructed by the input sentence classification unit, extracts a category from the input sentence and sets the extracted category as the range of documents to be searched; and an answer generation unit that, as instructed by the input sentence classification unit, recognizes the input sentence as a question, searches for documents related to the question within the set category, and generates an answer based on the contents of those documents.

In addition, the input sentence classification unit analyzes the input sentence with an input sentence classification model to determine whether it is a category setting or a question. The input sentence classification model may be a large language model trained on training data consisting of category setting sentences, covering a number of sentence forms used to set a category, and question sentences, covering a number of question forms.

In addition, the category setting unit analyzes a given input sentence with a category extraction model to extract a category. The category extraction model may be a large language model trained on training data that includes a number of category setting sentences and the categories contained in those sentences.

In addition, after setting the extracted category as the range of documents to be searched, the category setting unit displays the set category on the dialogue input window screen so that the user can see which category is currently set.

In addition, the category setting unit may provide a user interface that lets the user select a category displayed on the dialogue input window screen and clear the currently set category.

In addition, the answer generation unit searches a knowledge database storing domain-related knowledge for documents that belong to the category set by the user and are highly relevant to the question. If no category is set, it may search all documents in the knowledge database.

In addition, the knowledge database includes document content refined so that a large language model can use it, vector data obtained by vectorizing the document content for similarity comparison with the question content, and metadata about the document content. The metadata may include a category item.

In addition, the answer generation unit may pass the question entered by the user and the retrieved documents to a large language model, so that the model generates an answer to the question based on the supplied document content, and may output the answer on the dialogue input window screen.

The user can enter both category settings and questions in a single dialogue window, without having to set a category through a separate interface, and can thus obtain the desired answer quickly.

In addition, retrieving only the information (documents) of the specific category that has been set can increase the accuracy of the answers the LLM generates.

FIG. 1 is an overall configuration diagram of a question-answering system based on question category setting using an LLM according to one embodiment of the present invention.
FIG. 2 is a block diagram showing the functions of a question-answering system based on question category setting using an LLM according to one embodiment of the present invention.
FIG. 3 is a diagram showing the data structure stored in the knowledge database of a question-answering system based on question category setting using an LLM according to one embodiment of the present invention.
FIG. 4 is a diagram showing category setting, question input, and answer output on the dialogue input screen provided to a user by a question-answering system based on question category setting using an LLM according to one embodiment of the present invention.
FIG. 5 is a diagram showing question input and answer output without category setting on the dialogue input screen provided to a user by a question-answering system based on question category setting using an LLM according to one embodiment of the present invention.

Specific embodiments of the present invention are described in detail below with reference to the drawings. However, the spirit of the present invention is not limited to the embodiments presented. A person skilled in the art who understands that spirit could readily propose, by adding, changing, or deleting components within the same spirit, other regressive inventions or other embodiments that fall within the scope of the spirit of the present invention, and these too are to be regarded as falling within that scope.

The terms used below are defined in consideration of their functions in the present invention and may vary according to the intention or practice of the inventor, so their definitions should be based on the content of this specification as a whole. Where a detailed description of a known configuration or function related to the present invention is judged likely to obscure the gist of the invention, that description is omitted.

Hereinafter, the question-answering system (100) based on question category setting using an LLM according to the present invention is described with reference to the drawings.

FIG. 1 is an overall configuration diagram of a question-answering system based on question category setting using an LLM according to one embodiment of the present invention.

The question-answering system based on question category setting (hereinafter, the question-answering system) may be connected to a user terminal (200) over a network.

The network referred to in the present invention may be a core network integrated with a wired public network, a wireless mobile communication network, the mobile Internet, or the like. It may also mean the worldwide open computer network architecture that provides the TCP/IP protocol and the various services in its upper layers, such as HTTP (Hypertext Transfer Protocol), HTTPS (Hypertext Transfer Protocol Secure), Telnet, and FTP (File Transfer Protocol). The term is not limited to these examples and broadly covers any data communication network capable of transmitting and receiving data in various forms.

In the present invention, the question-answering system (100) provides a function that lets a user set a category for questions and answers through natural-language input, and generates answers by searching for domain knowledge related to the question only in documents within the category range set by the user. Its object is thereby to increase the accuracy of the answers a large language model (LLM) generates.

More specifically, the system receives a natural-language input sentence entered by the user through the dialogue input window provided to the user, determines whether that sentence is a category setting or a question, and performs the corresponding next task.

If the natural-language sentence entered by the user is determined to be a category setting, the system extracts the category from the input sentence and sets the extracted category as the range of documents to be searched.

If the natural-language sentence entered by the user is determined to be a question, the system searches for documents related to the question within the previously set category, generates an answer based on those documents, and outputs it on the screen.

Here, searching means extracting, from the knowledge database in which domain-related documents are stored, the documents that fall within the category set by the user and are highly relevant to the question. It corresponds to the retrieval step of RAG (Retrieval-Augmented Generation).
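
The branching described above can be sketched as follows. This is a minimal illustration, not the patented implementation: `Session`, `handle_input`, and the three `toy_*` stand-ins are hypothetical names, and the stand-ins replace the trained models with trivial string rules so that the sketch runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Session:
    """Per-user state: the categories currently limiting the search range."""
    categories: List[str] = field(default_factory=list)


def handle_input(
    sentence: str,
    session: Session,
    classify: Callable[[str], str],
    extract_category: Callable[[str], str],
    generate_answer: Callable[[str, List[str]], str],
) -> str:
    """Route one input sentence to category setting or answer generation."""
    if classify(sentence) == "category_setting":
        category = extract_category(sentence)
        if category not in session.categories:
            session.categories.append(category)
        return f"Category set: {category}"
    # Otherwise the sentence is a question; an empty category list
    # means that all documents are searched.
    return generate_answer(sentence, list(session.categories))


# Toy stand-ins (assumptions) used only to make the sketch runnable.
def toy_classify(sentence: str) -> str:
    return "category_setting" if sentence.startswith("Set ") else "question"


def toy_extract(sentence: str) -> str:
    return sentence[len("Set "):].split(" as ")[0]


def toy_answer(question: str, categories: List[str]) -> str:
    scope = ", ".join(categories) if categories else "all documents"
    return f"[answer to '{question}' from {scope}]"
```

Passing the classifier, extractor, and answer generator as callables keeps the routing logic independent of whichever models are actually used.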

In the present invention, the user terminal (200) should be interpreted as including various communication devices such as desktops, tablets, laptops, smartphones, and wearable smart devices, and can run the functions provided by the question-answering system (100) through the web or through separate software or an application.

FIG. 2 is a block diagram showing the functions of a question-answering system based on question category setting using an LLM according to one embodiment of the present invention.

The functions of the question-answering system (100) are described with reference to FIG. 2.

The question-answering system (100) includes a request receiving unit (110), an input sentence classification unit (120), a category setting unit (130), an answer generation unit (140), a retrieval data configuration unit (150), and a knowledge database (160).

First, the question-answering system (100) may provide the user terminal (200) with a screen containing a dialogue input window so that the user can enter a question related to a specific domain and view the answer.

The request receiving unit (110) receives various requests from the user terminal (200) and forwards them to the backend process that handles them.

First, it may receive a natural-language sentence entered in the provided dialogue input window and forward it to the input sentence classification unit (120).

It may also receive a request to delete a category displayed in the dialogue input window and forward it to the category setting unit (130).

The input sentence classification unit (120) receives the natural-language input sentence forwarded by the request receiving unit (110), determines whether it is a category setting or a question, and instructs a category setting task or an answer generation task according to the result.

This determination is made by an input sentence classification model: a large language model (LLM) trained on training data consisting of category setting sentences, covering a number of sentence forms used to set a category, and question sentences, covering a number of question forms.

The input sentence classification model may be a model that uses an LLM as its base model and is fine-tuned on a training dataset built to suit the characteristics of the domain.

In addition, a simple classifier may be added to the output layer of the language model so that sentences are classified into two classes.

The training dataset contains category setting sentences and question sentences, and each sentence may be assigned a label classifying it as a category setting sentence or a question sentence.

For example, a category setting sentence may take a form such as "I will ask about ~.", "You are an expert on ~.", "From now on, I will ask about ~.", or "Set ~ as the category."

For example, a question may take a form such as "Find ~.", "Explain ~." / "Tell me about ~.", or "What is ~?"
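
As an illustration of how such a labeled training dataset could be assembled from the sentence forms above, the following sketch fills English renderings of the templates with sample categories and topics. The template lists, the label strings, and `build_dataset` are assumptions made for this example; the specification does not prescribe how the dataset is constructed.

```python
from typing import Dict, List

# English renderings of the sentence forms given as examples above.
CATEGORY_TEMPLATES = [
    "I will ask about {x}.",
    "You are an expert on {x}.",
    "From now on, I will ask about {x}.",
    "Set {x} as the category.",
]
QUESTION_TEMPLATES = [
    "Find {x}.",
    "Explain {x}.",
    "What is {x}?",
]


def build_dataset(categories: List[str], topics: List[str]) -> List[Dict[str, str]]:
    """Return labeled rows for two-class training (hypothetical label names)."""
    rows: List[Dict[str, str]] = []
    for category in categories:
        for template in CATEGORY_TEMPLATES:
            rows.append({"text": template.format(x=category),
                         "label": "category_setting"})
    for topic in topics:
        for template in QUESTION_TEMPLATES:
            rows.append({"text": template.format(x=topic),
                         "label": "question"})
    return rows
```

Each row pairs a sentence with one of the two class labels, which is the shape a fine-tuning job for a two-class classifier would typically consume.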

The category setting unit (130), as instructed by the input sentence classification unit (120), extracts a category from the input sentence it receives and sets the extracted category as the range of documents to be searched.

The category setting unit (130) includes a category extraction unit (131) and a screen output unit (132).

The category extraction unit (131) analyzes the given input sentence with a category extraction model and extracts the category.

The category extraction model may be a large language model trained, on training data that includes a number of category setting sentences and their categories, to analyze an input sentence and extract the category.

Category extraction can generally be approached as a sequence labeling or token classification problem.

For example, the training dataset for the category extraction model may consist of category setting sentences that contain the categories used to classify the documents in the knowledge database (160), together with the corresponding ground-truth labels (the extracted categories).

In addition, after the word judged to be a category is extracted from the category setting sentence entered by the user, if no category matches that word, the system may check whether a similar word or synonym exists and output the corresponding category.

For example, if the user enters "I will ask about the regulations.", the category extraction model may output "company regulations", the category most similar to "regulations" in the list of categories assigned to the documents in the knowledge database (160).
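
The exact-match-then-synonym fallback can be sketched as below. The specification leaves the similarity mechanism open, so this example swaps in a plain lookup table; `resolve_category` and the contents of the synonym table are hypothetical.

```python
from typing import Dict, List, Optional


def resolve_category(
    term: str,
    categories: List[str],
    synonyms: Dict[str, str],
) -> Optional[str]:
    """Map an extracted word to a known category, or None if nothing fits."""
    normalized = term.strip().lower()
    # 1. Exact (case-insensitive) match against the predefined category list.
    for category in categories:
        if category.lower() == normalized:
            return category
    # 2. Fall back to a synonym table (assumed here; could be an embedding
    #    similarity search in a real system).
    mapped = synonyms.get(normalized)
    if mapped in categories:
        return mapped
    return None


CATEGORIES = ["work manuals", "company regulations", "standards"]
SYNONYMS = {"regulations": "company regulations"}  # illustrative entry only
```

With these sample tables, `resolve_category("regulations", CATEGORIES, SYNONYMS)` yields `"company regulations"`, mirroring the example in the text.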

The extracted category is then designated as the range of documents consulted when generating answers to the questions the user subsequently enters.

The screen output unit (132) displays the set category on the dialogue input window screen so that the user can see which category is currently set.

FIG. 4 is a diagram showing category setting, question input, and answer output on the dialogue input screen provided to a user by a question-answering system based on question category setting using an LLM according to one embodiment of the present invention.

Referring to FIG. 4, the screen output unit (132) outputs, on the dialogue input window screen, a message (②) indicating that the category "company regulations" has been set in response to the category setting sentence entered by the user.

The set category is also displayed at the bottom of the dialogue input window (③) so that the user is aware of the currently set category.

If a category is already set and a new category is set, the new category may be added alongside it, and the multiple set categories may then be displayed in the dialogue input window.

A user interface element such as an "X" button is also provided next to each category so that the user can clear the currently set category.
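
The category state behind this interface can be modeled as a small ordered collection that supports adding and clearing. The class name `CategorySelection` and its methods are hypothetical; the sketch only shows the state transitions, not any rendering.

```python
from typing import List


class CategorySelection:
    """Ordered set of active categories shown under the dialogue input window."""

    def __init__(self) -> None:
        self._active: List[str] = []

    def add(self, category: str) -> None:
        """Add a newly set category, keeping any categories already set."""
        if category not in self._active:
            self._active.append(category)

    def remove(self, category: str) -> None:
        """Clear one category, e.g. when its 'X' button is clicked."""
        if category in self._active:
            self._active.remove(category)

    def active(self) -> List[str]:
        """Return a copy of the categories to display and to search within."""
        return list(self._active)
```

An empty result from `active()` corresponds to the case where no category is set and the whole knowledge database is searched.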

The answer generation unit (140), as instructed by the input sentence classification unit (120), recognizes the sentence the user entered in the dialogue window as a question, searches for documents related to the question within the set category, and generates an answer based on those documents.

The answer generation unit (140) includes a retrieval unit (141), a re-ranking unit (142), and an LLM answer generation unit (143).

The retrieval unit (141) searches the knowledge database (160), in which domain-related knowledge is stored, for documents that belong to the category set by the user and are highly relevant to the question.

The knowledge database (160) stores and manages knowledge about a specific domain in a form of data that a large language model (LLM) can reference.

FIG. 3 is a diagram showing the data structure stored in the knowledge database (160) of a question-answering system based on question category setting using an LLM according to one embodiment of the present invention. Referring to FIG. 3, the data stored in the knowledge database (160) may consist of retrieval data: text split into chunks of a predetermined size, chosen in consideration of the model's input data size, so that a large language model (LLM) can reference it.

The retrieval data may consist of metadata about the document content, refined text obtained by preprocessing the document content so that a large language model (LLM) can use it, keywords contained in the document content, and vector data obtained by vectorizing the document content for similarity comparison with the question content.

The metadata may include the category of the original document that is the source of the knowledge, the position of the text within the original document, the document type (pdf, hwp, doc, etc.), tags, and the date the retrieval data was created.
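
The retrieval data structure described above can be written down as a pair of record types. The field names, and the choice of an integer index for the position of the text, are assumptions made for this sketch; FIG. 3 is the authoritative description of the structure.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Metadata:
    category: str    # category of the source document, e.g. "company regulations"
    position: int    # position of the text in the source (assumed: chunk index)
    doc_type: str    # document type such as "pdf", "hwp", or "doc"
    tags: List[str]
    created: date    # date the retrieval data was created


@dataclass
class RetrievalRecord:
    metadata: Metadata
    text: str             # refined text the LLM can consume
    keywords: List[str]   # keywords contained in the document content
    vector: List[float]   # embedding used for similarity comparison


record = RetrievalRecord(
    metadata=Metadata("company regulations", 0, "pdf", ["quality"], date(2025, 1, 16)),
    text="Quality evaluation is carried out according to ...",
    keywords=["quality evaluation"],
    vector=[0.1, 0.2, 0.3],
)
```

Keeping the category inside the metadata is what later allows retrieval to be restricted to the categories the user has set.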

The categories assigned to the retrieval data come from a category list defined in advance according to the characteristics of the knowledge data stored in the knowledge database (160).

When knowledge data is added to the knowledge database, its category is determined from the content of the document, and that category is set in the category item of the metadata.

For example, the category list may include work manuals, company regulations, standards, guidelines, procedures, standard purchase specifications, licensing and permit documents, and laws and regulations.

Because the knowledge data stored in the knowledge database (160) thus includes the category information of its source document, the search can be limited to document content that falls under the category set by the user.

If no category is set, all documents in the knowledge database (160) are searched.

Within the document content (knowledge data) that falls under the set category, the retrieval unit searches for data that is related to the question and has high similarity to it.

To do so, it embeds the question sentence, compares the similarity between the vectorized question sentence and the vector data stored in the knowledge database (160), and extracts the retrieval data whose similarity is at or above a set value.
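
Category-restricted retrieval with a similarity threshold can be sketched as follows. The specification does not name a similarity measure, so cosine similarity is an assumption here, as are the function names and the dictionary layout of the records. The query vector is taken as given, since the embedding model is outside the scope of this sketch.

```python
import math
from typing import Dict, List, Sequence, Set


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(
    query_vector: Sequence[float],
    records: List[Dict],
    categories: Set[str],
    threshold: float,
) -> List[Dict]:
    """Return records in the set categories whose similarity meets the threshold."""
    hits = []
    for record in records:
        # An empty category set means the whole knowledge database is searched.
        if categories and record["category"] not in categories:
            continue
        if cosine_similarity(query_vector, record["vector"]) >= threshold:
            hits.append(record)
    return hits
```

Filtering on the category before computing similarity also avoids scoring documents that could never be returned.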

The re-ranking unit (142) assigns each item of retrieval data extracted by the retrieval unit (141) a numerical score for its relevance to the question, sorts the items in descending order of score, and passes only the retrieval data within a set rank to the large language model (LLM).

Here, the relevance score may be a similarity probability value between the vector of the question sentence and the retrieval data.
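
The score, sort, and cut-off steps can be sketched as below. How the relevance score is computed is left to a caller-supplied function, because the specification only says it may be a similarity probability value; `rerank`, `score`, and `top_k` are names chosen for this example.

```python
from typing import Callable, List, Tuple


def rerank(
    candidates: List[str],
    score: Callable[[str], float],
    top_k: int,
) -> List[Tuple[str, float]]:
    """Score each candidate, sort by descending score, keep the top_k."""
    scored = [(candidate, score(candidate)) for candidate in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

Only the items returned by `rerank` would be passed on to the LLM as context.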

The retrieval data passed on is used as the context that the large language model references when generating an answer to the question.

The LLM answer generation unit (143) generates an answer based on the question it receives and on the retrieval data found as document content related to that question.

Because the large language model (LLM) generates the answer to the question from the content of the retrieval data, the reliability of the answer can be ensured.

In addition, when the generated answer is output, information about the source documents on which it is based is provided with it, so that a user who wants more detail can check the related documents directly, which improves user convenience.
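
One way to combine the question, the retrieved context, and the source information is sketched below. The prompt wording, the `build_prompt` name, and the `text`/`source` keys are assumptions; the call to the LLM itself is deliberately left out because the specification does not tie the system to a particular model or API.

```python
from typing import Dict, List, Tuple


def build_prompt(question: str, chunks: List[Dict[str, str]]) -> Tuple[str, List[str]]:
    """Build an LLM prompt from retrieved chunks and collect their sources."""
    context = "\n\n".join(
        f"[{index}] {chunk['text']}" for index, chunk in enumerate(chunks, start=1)
    )
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    # Deduplicate sources while preserving the order of first appearance.
    sources: List[str] = []
    for chunk in chunks:
        if chunk["source"] not in sources:
            sources.append(chunk["source"])
    return prompt, sources
```

The returned source list can be shown next to the generated answer so that the user can open the underlying documents.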

FIG. 4 shows category setting, question input, and answer output on the dialogue input screen provided to the user, and FIG. 5 shows question input and answer output without category setting.

Comparing FIG. 4 and FIG. 5 shows that the content of the answer generated by the large language model (LLM) differs depending on whether a category is set.

도 4에서 '품질 평가에 대해 알려줘.'라는 질문에 대해 '사규'라는 카테고리를 한정함으로써, 사규에 명시된 품질 평가 내용을 답변으로 제공한다.In Fig. 4, by limiting the category to ‘private regulations’ in response to the question ‘Tell me about quality evaluation’, the quality evaluation content specified in the private regulations is provided as an answer.

이에 반해 도 5에서는 카테고리가 설정되지 않았으므로, 품질 평가에 대한 일반적인 내용을 제공한다.In contrast, Figure 5 does not have categories set, so it provides general information on quality assessment.
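The difference between the two figures comes down to which documents are eligible for retrieval. A minimal sketch of that scoping step is given below, assuming each stored record is a dictionary whose `metadata` carries a `category` item; the function name `select_search_scope` and the sample records are hypothetical.

```python
from typing import Dict, List, Optional


def select_search_scope(records: List[Dict], category: Optional[str]) -> List[Dict]:
    """Return the records eligible for retrieval.

    With a category set, only records in that category are searched;
    with no category set, every record in the knowledge database is searched.
    """
    if category is None:
        return list(records)
    return [r for r in records if r.get("metadata", {}).get("category") == category]


if __name__ == "__main__":
    knowledge_db = [
        {"content": "Quality evaluation rules.", "metadata": {"category": "company regulations"}},
        {"content": "General quality concepts.", "metadata": {"category": "general"}},
    ]
    print(len(select_search_scope(knowledge_db, "company regulations")))
    print(len(select_search_scope(knowledge_db, None)))
```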

The retrieval data configuration unit (150) structures domain documents collected from outside sources to match the structure of the knowledge database (160) and adds them to it.

The retrieval data configuration unit (150) collects domain-related documents from open websites and from the business systems of specific companies or organizations, splits the content of each document to fit the text input size that the large-scale language model (LLM) can accept, processes the data into the retrieval data structure shown in Fig. 3, and then stores it in the database.

The split size may be set to the maximum length that can be passed to the large-scale language model as input data.

The retrieval data structure has been described in detail above, so its description is omitted here.
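The splitting and structuring steps can be sketched as follows. This is only an illustration under stated assumptions: the split is done by character count rather than by model tokens, the record layout (content, vector, metadata with a category item) is a simplified guess at the retrieval data structure, and `split_text`, `build_records`, and the `embed` callable are hypothetical names.

```python
from typing import Callable, Dict, List, Sequence


def split_text(text: str, max_chars: int) -> List[str]:
    """Split text into consecutive pieces of at most max_chars characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def build_records(
    text: str,
    category: str,
    source: str,
    max_chars: int,
    embed: Callable[[str], Sequence[float]],
) -> List[Dict]:
    """Turn one document into retrieval records: content, vector, and metadata."""
    return [
        {
            "content": chunk,
            "vector": list(embed(chunk)),  # embed is an assumed embedding function
            "metadata": {"category": category, "source": source, "chunk_index": i},
        }
        for i, chunk in enumerate(split_text(text, max_chars))
    ]


if __name__ == "__main__":
    # A toy embedding (text length only) stands in for a real embedding model.
    records = build_records(
        "abcdefghij", "company regulations", "doc.txt", 4, lambda s: [float(len(s))]
    )
    for record in records:
        print(record)
```

In practice `max_chars` would be derived from the input limit of the language model in use, leaving room for the question and prompt text.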

Meanwhile, the question-answering system (100) may be built as an application server that includes a web server and a database server, and is configured to include a central processing unit, memory, a hard disk, a user interface, a network interface, and the like.

In addition, each step of the configuration or method described above may be implemented as computer-readable code on a computer-readable recording medium or transmitted via a transmission medium. A computer-readable recording medium is a data storage device capable of storing data that can be read by a computer system.

Examples of computer-readable recording media include, but are not limited to, databases, ROM, RAM, CD-ROMs, DVDs, magnetic tapes, floppy disks, and optical data storage devices. Transmission media may include carrier waves transmitted over the Internet or various types of communication channels. Computer-readable recording media may also be distributed across network-coupled computer systems so that the computer-readable code is stored and executed in a distributed manner.

In addition, at least one component applied to the present invention may include, or be implemented by, a processor such as a central processing unit (CPU) or a microprocessor that performs the respective function, and two or more of the components may be combined into a single component that performs all operations or functions of the combined components. Part of at least one component applied to the present invention may also be performed by another of these components. Communication between the components may be performed via a bus (not shown).

Although the configuration and features of the present invention have been described above with reference to embodiments of the present invention, the present invention is not limited thereto. It is obvious to those skilled in the art to which the present invention pertains that various changes or modifications can be made within the spirit and scope of the present invention, and such changes or modifications therefore fall within the scope of the appended claims.

100: Question-answering system based on question category settings using an LLM
110: Request receiving unit
120: Input sentence classification unit
130: Category setting unit
131: Category extraction unit 132: Screen output unit
140: Answer generation unit
141: Retrieval unit 142: Ranking readjustment unit
143: LLM answer generation unit
150: Retrieval data configuration unit
160: Knowledge database
200: User terminal

Claims

An input sentence classification unit that analyzes a natural language input sentence entered by a user through a dialogue input window, determines whether it is a category setting or a question, and instructs a category setting task or an answer generation task accordingly;
A category setting unit that extracts a category from the input sentence according to the instruction of the input sentence classification unit and sets the extracted category as the scope of documents to be searched; and
An answer generation unit that recognizes the input sentence as a question according to the instruction of the input sentence classification unit, searches for documents related to the question within the set category, and generates an answer based on the contents of the documents;
wherein the input sentence classification unit
analyzes the natural language sentence entered by the user through an input sentence classification model to determine whether it is a category setting or a question,
the input sentence classification model being a large-scale language model trained using training data consisting of category setting sentences, which include multiple sentence forms for setting a category, and question sentences, which include multiple question forms, and
wherein the category setting unit,
after setting the extracted category as the scope of documents to be searched, displays the set category on the dialogue input window screen so that the user can recognize the currently set category, and provides a user interface that allows the user to click the category displayed on the dialogue input window screen to release the currently set category.
A question-answering system based on question category settings using LLM.

delete

According to claim 1,
wherein the category setting unit
analyzes the given input sentence through a category extraction model to extract the category,
the category extraction model being a large-scale language model trained using training data that includes multiple category setting sentences and the categories contained in those sentences.
A question-answering system based on question category settings using LLM.

delete

According to claim 1,
wherein the answer generation unit
searches a knowledge database in which domain-related knowledge is stored for documents highly relevant to the question, restricting the search to documents belonging to the category set by the user and, if no category is set, searching all documents in the knowledge database.
A question-answering system based on question category settings using LLM.

According to claim 6,
wherein the knowledge database includes refined document content that can be used by a large-scale language model, vector data obtained by vectorizing the document content for similarity comparison with the question content, and metadata about the document content, characterized in that the metadata includes a category item.
A question-answering system based on question category settings using LLM.

According to claim 6,
wherein the answer generation unit
passes the question entered by the user and the retrieved documents to a large-scale language model, and the large-scale language model generates an answer to the question based on the contents of the documents passed to it and displays the answer on the dialogue input window screen.
A question-answering system based on question category settings using LLM.