WO2019227581A1 - Interest point recognition method, apparatus, terminal device, and storage medium - Google Patents
Interest point recognition method, apparatus, terminal device, and storage medium
- Publication number
- WO2019227581A1 (application PCT/CN2018/094372)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- point
- interest
- pronunciation
- information
- sequence
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/142—Hidden Markov Models [HMMs]
- G10L15/144—Training of HMMs
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1822—Parsing for meaning understanding
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
Definitions
- the present application relates to the field of computer technology, and in particular, to a method, a device, a terminal device, and a storage medium for identifying a point of interest.
- the embodiments of the present application provide a method, an apparatus, a terminal device, and a storage medium for identifying interest points, so as to solve the problems of low recognition accuracy and low recognition efficiency of interest points in natural language voice information.
- an embodiment of the present application provides a method for identifying a point of interest, including:
- an N-gram model is used to analyze the preset training corpus to obtain word sequence data of the preset training corpus, wherein the word sequence data includes word sequences and the word sequence frequency of each of the word sequences;
- the speech information to be identified is parsed to obtain M pronunciation sequences of the speech information to be identified, where M is a positive integer greater than 1;
- the point of interest information corresponding to the target pronunciation sequence is obtained from the point of interest information database as a point of interest recognition result of the speech information to be recognized.
- an embodiment of the present application provides a device for identifying a point of interest, including:
- a training corpus acquisition module configured to acquire a preset training corpus;
- a training corpus analysis module configured to analyze the preset training corpus using an N-gram model to obtain word sequence data of the preset training corpus, wherein the word sequence data includes word sequences and the word sequence frequency of each word sequence;
- a voice information parsing module configured to parse the voice information to be recognized when it is received, to obtain M pronunciation sequences of the voice information to be recognized, where M is a positive integer greater than 1;
- an occurrence probability calculation module configured to calculate, for each of the pronunciation sequences and according to the word sequence data, the occurrence probability of that pronunciation sequence, so as to obtain the occurrence probabilities of the M pronunciation sequences;
- a pronunciation sequence confirmation module configured to select, from the occurrence probabilities of the M pronunciation sequences, the pronunciation sequence whose occurrence probability reaches a preset probability threshold as a target pronunciation sequence;
- a recognition result obtaining module configured to obtain the point of interest information corresponding to the target pronunciation sequence from the point of interest information database as a point of interest recognition result of the speech information to be recognized.
- an embodiment of the present application provides a terminal device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the steps of the point of interest identification method when executing the computer-readable instructions.
- embodiments of the present application provide one or more non-volatile readable storage media storing computer-readable instructions; when the computer-readable instructions are executed by one or more processors, the one or more processors execute the steps of the point of interest identification method.
- FIG. 1 is an implementation flowchart of an interest point identification method according to an embodiment of the present application.
- FIG. 2 is a flowchart of implementing step S4 in the method of identifying a point of interest according to an embodiment of the present application.
- FIG. 3 is an implementation flowchart of obtaining a training corpus in a method of identifying interest points provided by an embodiment of the present application.
- FIG. 4 is an implementation flowchart of constructing a point of interest information database in a method of identifying a point of interest provided by an embodiment of the present application.
- FIG. 5 is an implementation flowchart of generating a supplementary corpus in the method of identifying interest points provided by an embodiment of the present application.
- FIG. 6 is a schematic diagram of a point of interest identification device according to an embodiment of the present application.
- FIG. 7 is a schematic diagram of a terminal device according to an embodiment of the present application.
- FIG. 1 illustrates a flowchart of implementing an interest point identification method provided by an embodiment of the present application.
- the point of interest recognition method is applied to a scene of recognition of points of interest in speech information of natural language.
- the identification scenario includes a server and a client, where the server and the client are connected through a network, and the user sends voice information in natural language through the client.
- the client may specifically be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, or portable wearable device; the server can be implemented as an independent server or as a server cluster composed of multiple servers.
- the method for identifying a point of interest provided in the embodiments of the present application is applied to a server, as follows:
- the training corpus is used to evaluate the speech information in natural language, and is a corpus obtained by training using related corpora.
- the content in the training corpus in the embodiment of the present application includes, but is not limited to, point of interest information, a general corpus, and so on.
- A corpus is a large-scale electronic text library that has been scientifically sampled and processed.
- Corpora are the basic resource of linguistic research and the main resource of empirical language research methods. They are used in dictionary compilation, language teaching, traditional language research, and statistical or example-based research in natural language processing. A corpus entry, that is, a piece of language material, is the object of linguistic research and the basic unit of a corpus.
- S2: An N-gram model is used to analyze the preset training corpus to obtain the word sequence data of the preset training corpus, where the word sequence data includes the word sequences and the word sequence frequency of each word sequence.
- the word sequence refers to a sequence composed of at least two corpus entries (words) in a certain order.
- the frequency of the word sequence refers to the proportion of the number of times that the word sequence appears in the entire corpus.
- the word segmentation here is a word sequence obtained by combining consecutive words in a preset combination manner. For example, if the word sequence "love tomatoes" appears 100 times in the entire corpus, and the total number of occurrences of all participles in the entire corpus is 100,000, the word sequence frequency of "love tomatoes" is 100 / 100,000 = 0.001.
- the N-gram model is a language model commonly used in large-vocabulary continuous speech recognition. It uses the collocation information between adjacent words in the context: when continuous, unspaced pinyin needs to be converted into a Chinese character string (that is, a sentence), the model calculates the sentence with the highest probability, so that the conversion to Chinese characters is automatic, requires no manual selection by the user, and avoids the ambiguity problem of many Chinese characters corresponding to the same pinyin.
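The word sequence frequency described in step S2 can be illustrated with a minimal sketch. It assumes the corpus has already been segmented into lists of words; the function name `word_sequence_frequencies` and the default `n=2` are illustrative choices, not taken from the application.

```python
from collections import Counter


def word_sequence_frequencies(sentences, n=2):
    """Count every run of n consecutive words and return its share of all runs.

    `sentences` is an iterable of already-segmented sentences (lists of words).
    """
    counts = Counter()
    for words in sentences:
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    # Frequency = occurrences of this word sequence / occurrences of all word sequences.
    return {seq: count / total for seq, count in counts.items()}
```

Because the frequencies are computed once from the corpus, they can be stored and looked up later instead of being recomputed for every recognition request.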
- each Chinese pronunciation corresponds to one or more Chinese characters.
- after receiving the to-be-recognized voice information input by the user on the client, the server decodes it with an acoustic decoder and converts it into multiple pronunciation sequences.
- the pronunciation sequence refers to a text sequence, including at least two participles, obtained by converting the speech information.
- for example, acoustically decoding the to-be-recognized voice information "woxihuanchizhongguomeishi" can yield pronunciation sequence A: "I", "like", "eat", "China", "food"; pronunciation sequence B: "I", "like", "Chi Zhong", "Guomei", "food"; or pronunciation sequence C: "I", "Western ring", "hold", "China", "all right"; and so on.
- an occurrence probability calculation is performed for each pronunciation sequence to obtain the occurrence probabilities of the M pronunciation sequences.
- the Markov assumption can be used to calculate the occurrence probability of a pronunciation sequence.
- under this assumption, the occurrence of the Y-th word is related only to the preceding Y-1 words and to no other words, so the probability of the entire sentence is the product of the occurrence probabilities of its words.
- here P(T) is the probability of the entire sentence appearing, and P(W_Y | W_1 W_2 ... W_{Y-1}) is the probability that the Y-th participle appears after the word sequence composed of the preceding Y-1 participles.
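The formula that this passage defines its symbols for did not survive extraction. Given those definitions, it is presumably the usual chain-rule decomposition, written out here as a reconstruction rather than a quotation of the original formula (1):

```latex
P(T) = P(W_1 W_2 \cdots W_Y)
     = P(W_1)\, P(W_2 \mid W_1) \cdots P(W_Y \mid W_1 W_2 \cdots W_{Y-1})
```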
- S5: From the occurrence probabilities of the M pronunciation sequences, the pronunciation sequence whose occurrence probability reaches a preset probability threshold is selected as the target pronunciation sequence.
- for each pronunciation sequence, an occurrence probability is obtained through the calculation of step S4, giving M occurrence probabilities in total.
- the occurrence probabilities of the M pronunciation sequences are each compared with the preset probability threshold; those greater than or equal to the threshold are taken as effective occurrence probabilities, and the pronunciation sequences corresponding to the effective occurrence probabilities are used as the target pronunciation sequences.
- the pronunciation sequences whose occurrence probability does not meet the requirement are filtered out, so that the selected target pronunciation sequences are closer to the meaning expressed in natural speech, and the accuracy of interest point recognition is improved.
- the voice information record is collected and sent to the background management staff. If the number of target pronunciation sequences is greater than a preset number, they are sorted by their corresponding occurrence probabilities, and the first preset number of pronunciation sequences in the sorted order are selected as the target pronunciation sequences. For example, if the preset number is 5, then after the effective occurrence probabilities are sorted, the first five are selected, and the pronunciation sequences corresponding to these five occurrence probabilities are used as the target pronunciation sequences.
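The threshold filter and the cap on the number of target sequences can be sketched as follows. The function name, the `(sequence, probability)` pair representation, and the descending sort order are assumptions made for illustration; the text only says that the sequences are sorted and the first ones kept.

```python
def select_target_sequences(scored_sequences, threshold, max_count=5):
    """Keep sequences whose probability reaches the threshold, at most max_count of them.

    `scored_sequences` is a list of (sequence, occurrence_probability) pairs.
    """
    # Effective occurrence probabilities: greater than or equal to the threshold.
    effective = [(seq, p) for seq, p in scored_sequences if p >= threshold]
    # Assumed ordering: highest probability first, so the cap keeps the best candidates.
    effective.sort(key=lambda item: item[1], reverse=True)
    return [seq for seq, _ in effective[:max_count]]
```

An empty result corresponds to the case where no pronunciation sequence reaches the threshold.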
- S6: Obtain the point of interest information corresponding to the target pronunciation sequence from the point of interest information database as the point of interest recognition result of the speech information to be recognized.
- the point of interest information contained in the target pronunciation sequence is obtained from the point of interest information database, and the point of interest information is pushed to the user as the point of interest recognition result of the voice information.
- the preset training corpus is obtained and then analyzed with the N-gram model to obtain its word sequence data. Because all word sequence data is calculated and analyzed in advance, it can be used directly in the subsequent calculation of occurrence probabilities, which saves calculation time and improves efficiency.
- when the voice information to be recognized is received, it is parsed to obtain M pronunciation sequences; for each pronunciation sequence, the occurrence probability is calculated according to the word sequence data, and the pronunciation sequences whose occurrence probability reaches the preset probability threshold are selected from the M occurrence probabilities as the target pronunciation sequences.
- by calculating the occurrence probability of each pronunciation sequence and screening on that probability, this method identifies the meaning of the speech information accurately, thereby improving the accuracy of interest point recognition.
- FIG. 2 illustrates a specific implementation process of step S4 provided by an embodiment of the present application, which is detailed as follows:
- the participles in the pronunciation sequence are obtained in word order from front to back. For example, for the pronunciation sequence "I love China", segmenting in word order from front to back yields the first participle "I", the second participle "love", and the third participle "China".
- it can be known from step S2 that the word sequence frequency of each word sequence has already been obtained through the analysis of the training corpus by the N-gram model, so only the calculation according to formula (2) is required here.
- under the bigram model, formula (2) is used to calculate the probability A_1 that participle a_2 appears after participle a_1, the probability A_2 that participle a_3 appears after participle a_2, ..., and the probability A_{n-1} that participle a_n appears after participle a_{n-1}; formula (3) is then used to calculate the occurrence probability of the entire word sequence (a_1 a_2 ... a_{n-1} a_n).
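Formulas (2) and (3) are not present in the extracted text. As a hedged illustration of the bigram calculation described here, the sketch below assumes the standard maximum-likelihood estimate, P(a_i | a_{i-1}) = count(a_{i-1} a_i) / count(a_{i-1}), and multiplies the conditional probabilities together. The function names and the unsmoothed estimate are assumptions, not the application's own definitions.

```python
from collections import Counter


def train_bigram_counts(sentences):
    """Count single words and adjacent word pairs in a segmented corpus."""
    unigrams = Counter()
    bigrams = Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def sequence_probability(words, unigrams, bigrams):
    """Occurrence probability of a word sequence under an unsmoothed bigram model."""
    total = sum(unigrams.values())
    if not words or total == 0:
        return 0.0
    # Probability of the first word, then one conditional probability per adjacent pair.
    prob = unigrams[words[0]] / total
    for prev, cur in zip(words, words[1:]):
        if unigrams[prev] == 0:
            return 0.0  # Unseen history: no estimate without smoothing.
        prob *= bigrams[(prev, cur)] / unigrams[prev]
    return prob
```

Without smoothing, any pair that never occurs in the training corpus drives the whole product to zero, which is one reason a large and domain-relevant training corpus matters for this method.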
- a training corpus may also be constructed.
- the method for identifying interest points further includes:
- the point of interest information database contains the point of interest information of each point of interest.
- the specific method is not specifically limited here.
- the method adopted in the embodiment of the present application is to obtain a point of interest by using a web crawler to build a point of interest information database.
- the point of interest information includes, but is not limited to, the name of the point of interest, the category to which the point of interest belongs, and the address of the point of interest.
- S72: Generate a supplementary corpus based on the interest point information database.
- the point of interest information in the point of interest information database is extracted, and all the obtained point of interest information is processed according to a preset processing method as a supplementary corpus.
- the specific processing method may be segmentation of interest points, or semantic statistics of interest point information, etc., which may be specifically selected according to actual needs, and is not limited here.
- S73: Combine the supplementary corpus with a preset basic corpus to obtain a training corpus.
- the training corpus must contain a large amount of language material so that it can evaluate whether a sentence is reasonable; therefore a preset basic corpus with sufficient material needs to be combined with the supplementary corpus to obtain the training corpus.
- the preset basic corpus is selected according to actual needs. For example, Sohu news on finance, sports, and current affairs from the past three years is selected, and the corpus generated after text cleaning and collation is used as the basic corpus.
- the supplementary corpus is combined with the preset basic corpus to obtain a training corpus for the N-gram model. The training corpus analyzed by the N-gram model not only can evaluate whether a sentence is reasonable, but also contains point of interest information, so that it can accurately evaluate whether a sentence contains points of interest, which helps improve both the accuracy of natural language speech recognition and the accuracy of point of interest identification.
- a specific embodiment is used to describe in detail a specific implementation method of constructing a point of interest information base mentioned in step S71.
- FIG. 4 illustrates a specific implementation process of step S71 provided by an embodiment of the present application, which is detailed as follows:
- S711: Classify the preset basic interest points according to a preset classification method to obtain the basic classifications of the interest point information database.
- the basic interest points are classified according to a preset classification method for interest points, the resulting classes are used as the basic classifications of the interest point information database, and the interest point information contained in each basic classification is stored in the point of interest information database.
- the corresponding position and classification method can be set according to actual needs, and there is no limitation here.
- the basic interest points refer to the small categories of points of interest, and the basic classifications refer to the large categories of points of interest.
- for example, a basic classification contained in the point of interest information database is "cuisine", and the basic interest points under this basic classification are "breakfast", "fast food", "hot pot", "buffet", and "hotel".
- S712: For each basic classification, use web crawling to obtain, for each administrative region of the country, the point of interest information of all basic interest points under that basic classification, thereby obtaining the point of interest information of the basic classification in each administrative region of the country.
- for each basic classification, a web crawler crawls each administrative region of the country in turn to obtain the information of all basic interest points under this basic classification in that region, thereby obtaining the point of interest information of the basic classification in each administrative region of the country.
- in this way, the point of interest information of all basic classifications in each administrative region of the country is obtained.
- this kind of web crawler is also called a Scalable Web Crawler.
- its crawling objects expand from a set of seed URLs (Uniform Resource Locators) to the entire Web (World Wide Web), and it mainly collects data for portal site search engines and large web service providers.
- the crawling scope and volume of such web crawlers are huge, and the requirements for crawling speed and storage space are high.
- the requirement on the order in which pages are crawled is relatively low.
- its components include a page crawling module, a page analysis module, a link filtering module, a page database, a URL queue, and an initial URL set.
- Common crawling strategies are: depth-first strategy, breadth-first strategy.
- the basic method of the depth-first strategy is to visit the next level of web links in order from the lowest to the highest depth, until it cannot be further deepened. After completing a crawling branch, the crawler returns to the previous link node to further search for other links. When all links have been traversed, the crawling task ends.
- the breadth-first strategy is to crawl pages according to the depth of the content directory level of the web page, and the pages at the shallower directory level are crawled first. After the pages in the same level are crawled, the crawler goes deeper and continues to crawl.
- This strategy can effectively control the crawling depth of the page, avoid the problem that crawling cannot be ended when encountering an infinite deep branch, and it is easy to implement without storing a large number of intermediate nodes.
- the crawling strategy adopted in the embodiment of the present application is a breadth-first strategy.
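A breadth-first crawl of the kind described above can be sketched without any network access by passing in a function that returns the links found on a page. The names `breadth_first_crawl`, `get_links`, and `max_depth` are illustrative assumptions; a real crawler would also need fetching, parsing, and politeness controls that are outside this sketch.

```python
from collections import deque


def breadth_first_crawl(seed_urls, get_links, max_depth=2):
    """Visit pages level by level, starting from the seed URLs.

    `get_links(url)` returns the URLs linked from `url`. Pages deeper than
    `max_depth` are not expanded, which bounds the crawl depth.
    """
    seeds = list(dict.fromkeys(seed_urls))  # De-duplicate while keeping order.
    visited = set(seeds)
    queue = deque((url, 0) for url in seeds)
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order
```

The queue guarantees that all pages at one level are visited before any page at the next level, and the depth limit is what prevents an endlessly deep branch from stalling the crawl.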
- Step A: Obtain information on administrative regions at all levels throughout the country and the latitude and longitude corresponding to each administrative region.
- the administrative region information of each city-level unit in the country is obtained first, then the county-level administrative regions included in each city-level region, and then the sub-districts, streets, townships, and other offices included in each county-level region.
- the administrative region information includes, but is not limited to, administrative region name, administrative region code, higher administrative region information, and lower administrative region information.
- for example, the obtained administrative region code is 440300.
- latitude and longitude together form a coordinate system, called the geographic coordinate system.
- it is a spherical coordinate system that uses a three-dimensional sphere to define space on the earth, and can mark any position on the earth.
- the longitude and latitude coordinate systems commonly used in China include, but are not limited to: WGS84 coordinate system (World Geodetic System 1984, World Geodetic Coordinate System), Beijing 54 coordinate system (BJZ54), Xi'an 80 coordinate system (XIAN80).
- the latitude and longitude coordinate system used in the embodiment of the present application is a WGS84 coordinate system.
- Step B: For each administrative area K, divide the administrative area by latitude and longitude according to the preset division side length to obtain a list of n rectangles of the same size.
- the national administrative region list includes several administrative regions, and the sizes of different administrative regions are different.
- for each administrative region, the latitude and longitude of its four extreme points are obtained and used to determine the four vertices of a large rectangle, and the large rectangle is then divided according to the preset side length into n rectangles.
- the preset division side length can be selected according to the actual situation. For administrative regions with dense points of interest, a smaller side length can be preset; for administrative regions with sparse points of interest, a larger side length can be preset. This facilitates the subsequent crawling of points of interest and improves crawling speed, thereby improving the efficiency of point of interest acquisition.
- the latitude and longitude of the vertices of the large rectangle are converted into rectangular coordinates and divided according to the preset side length. For example, if the lower-left corner is (lat_1, lon_1), the upper-right corner is (lat_2, lon_2), and the division side length is len, then the first rectangle has lower-left corner (lat_1, lon_1) and upper-right corner (lat_1 + len, lon_1 + len); the second rectangle has lower-left corner (lat_1, lon_1 + len) and upper-right corner (lat_1 + len, lon_1 + 2·len); and so on.
- the number of rectangles generated is:
- int is a rounding function that truncates the fractional part; for example, int(1.334) = 1.
- for example, the latitude and longitude range obtained for Shenzhen is 113°46′ to 114°37′ east longitude and 22°27′ to 22°52′ north latitude, which converts to rectangular coordinates with lower-left corner (22.45, 113.769444) and upper-right corner (22.86667, 114.619444).
- the side length can be set to 0.04.
- the first rectangle then has lower-left corner (22.45, 113.769444) and upper-right corner (22.49, 113.809444), and the second rectangle has lower-left corner (22.45, 113.809444) and upper-right corner (22.49, 113.849444).
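The division of an administrative region's bounding rectangle into equal squares can be sketched as below. The formula for the number of rectangles is missing from the extracted text, so this sketch assumes that a partial strip at the edge still gets its own rectangle (rounding the row and column counts up); the function name and the row-by-row ordering are likewise assumptions.

```python
import math


def split_into_rectangles(lat_1, lon_1, lat_2, lon_2, side):
    """Split the box (lat_1, lon_1)-(lat_2, lon_2) into squares of edge `side`.

    Returns (low_lat, low_lon, high_lat, high_lon) tuples, ordered along
    longitude first and then latitude.
    """
    # Assumption: round up so that the whole region is covered.
    rows = math.ceil((lat_2 - lat_1) / side)
    cols = math.ceil((lon_2 - lon_1) / side)
    rectangles = []
    for r in range(rows):
        for c in range(cols):
            low_lat = lat_1 + r * side
            low_lon = lon_1 + c * side
            rectangles.append((low_lat, low_lon, low_lat + side, low_lon + side))
    return rectangles
```

With the Shenzhen bounds quoted in the text and a side length of 0.04, the first tuple returned is the rectangle from (22.45, 113.769444) to (22.49, 113.809444). A smaller side length yields more, smaller rectangles, which suits regions where points of interest are dense.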
- Step C: For basic classification J, generate a URL list for administrative area K from its rectangle list.
- the web crawler traverses the rectangle list and, for each rectangle, crawls the points of interest in it that fall under basic classification J.
- for example, to use a web crawler to crawl Baidu Maps for the basic interest point "secondary" within the rectangular area with bottom-left coordinates (22.53, 113.809444) and top-right coordinates (22.53, 113.809444), code with the following parameters may be used:
- page_size refers to the preset number of items contained in each page;
- page_num refers to the page number;
- ak is the developer's Baidu Maps API console key.
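The code that originally accompanied this step is not present in the extracted text. The sketch below only builds request URLs and performs no network access. Only `page_size`, `page_num`, and `ak` are named in the text; the endpoint and the `query`, `bounds`, and `output` parameters are assumptions based on the publicly documented Baidu Place search interface and should be checked against the current Baidu documentation before use.

```python
from urllib.parse import urlencode

# Assumed endpoint; not stated in the source text.
BASE_URL = "https://api.map.baidu.com/place/v2/search"


def build_search_urls(query, rect, ak, page_size=20, page_count=1):
    """Build one URL per result page for a keyword search inside a rectangle.

    `rect` is (low_lat, low_lon, high_lat, high_lon); `ak` is the developer key.
    """
    low_lat, low_lon, high_lat, high_lon = rect
    urls = []
    for page_num in range(page_count):
        params = {
            "query": query,  # Assumed parameter name for the search keyword.
            "bounds": f"{low_lat},{low_lon},{high_lat},{high_lon}",  # Assumed name.
            "output": "json",  # Assumed parameter.
            "page_size": page_size,
            "page_num": page_num,
            "ak": ak,
        }
        urls.append(f"{BASE_URL}?{urlencode(params)}")
    return urls
```

Collecting the URLs for every rectangle of an administrative area gives the URL list that the next step parses.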
- Step D: By parsing the URL list, determine the distribution of the points of interest of basic classification J in administrative area K, and obtain the information of the points of interest belonging to basic classification J contained in the administrative area.
- the pages in the URL list obtained in step C are parsed to obtain the basic point of interest information contained in each URL, thereby obtaining the point of interest information contained in each administrative area.
- for example, the obtained URL list includes 26 URLs, each containing 20 pieces of point of interest information, one of which is as follows:
- the name of the point of interest is "Kunming Eighth Middle School";
- its specific address is "No. 628 Longquan Road, Wuhua District, Kunming City, Yunnan Province";
- its administrative area is "Wuhua District", and its street number is "35debf29e6063d3aa7da399b".
- Step E: Store the acquired point of interest information in the corresponding location in the point of interest information database.
- the acquired point of interest information is stored in the corresponding position in the point of interest information database according to the basic classification under which it was acquired.
- for example, the point of interest information named "Kunming Eighth Middle School" is stored under the basic point of interest "middle school" in the basic classification "school".
- the preset basic interest points are classified according to a preset classification method to obtain the basic classifications of the interest point information database. Then, for each basic classification, web crawling is used to obtain the point of interest information of all basic interest points of that classification in each administrative region of the country, so that all point of interest information in each administrative region of the country is obtained.
- when points of interest are subsequently identified, the database can therefore provide accurate and comprehensive point of interest information, which helps improve the accuracy of point of interest recognition.
- a specific embodiment is used to describe in detail a specific implementation method of generating a supplementary corpus based on the point of interest information base mentioned in step S72.
- FIG. 5 illustrates a specific implementation process of step S72 provided by an embodiment of the present application, which is detailed as follows:
- the basic interest points in each basic classification and the interest point information contained in the basic interest points are extracted from the interest point information database.
- S722: Perform word segmentation on the point of interest information to obtain the point of interest participles.
- Chinese word segmentation is performed on the point of interest information to obtain its point of interest participles.
- Chinese word segmentation refers to cutting a sequence of Chinese characters into individual words.
- Word segmentation is the process of recombining a continuous character sequence into a word sequence in accordance with certain specifications.
- Existing word segmentation algorithms can be divided into three categories: methods based on string matching, methods based on understanding, and methods based on statistics. According to whether they are combined with part-of-speech tagging, they can also be divided into pure word segmentation methods and integrated methods that combine segmentation and tagging.
- the word segmentation algorithm adopted by the embodiment of the invention is an understanding-based word segmentation method.
- for example, if the obtained point of interest information is "basic classification: food; basic point of interest: fast food; point of interest name: Yanjin pot; point of interest address: Bagua Second Road, Luohu District, Shenzhen, Guangdong Province", word segmentation yields the participles "Food", "Fast Food", "Guangdong Province", "Shenzhen", "Luohu District", "Bagua Erlu", and "Yanjin Pot".
- S723: Establish a mapping relationship between each point of interest participle and the corresponding point of interest information, and save the participles, the point of interest information, and the mapping relationship in the supplementary corpus.
- each acquired point of interest participle is associated with its point of interest information to form a mapping, and the participles, point of interest information, and mapping relationship are stored in the supplementary corpus.
- for example, if the point of interest information is "basic classification: cuisine; basic point of interest: fast food; point of interest name: Yanjin pot; point of interest address: Bagua Second Road, Luohu District, Shenzhen, Guangdong Province",
- then the point of interest Yanjin pot is mapped to the set of participles {"Cuisine", "Fast food", "Guangdong Province", "Shenzhen", "Luohu District", "Bagua Erlu", "Yanjin pot"}.
- the point of interest participles are obtained, and a mapping relationship between the participles and the corresponding point of interest information is established. The participles, point of interest information, and mapping relationship are saved to the supplementary corpus, so that the supplementary corpus contains all three.
- in subsequent point of interest recognition, the corresponding point of interest information can then be found directly from a point of interest participle, thereby improving the efficiency of point of interest recognition.
- after the point of interest information base mentioned in step S71 is constructed, the point of interest information may be further updated, and the point of interest recognition method further includes:
- the point of interest information base is updated in real time, or the point of interest information base is automatically updated according to a preset condition.
- point of interest information changes over time. If the point of interest information base is not updated after a point of interest changes, that point of interest may fail to be recognized, or the recognized information may be incorrect. Therefore, the point of interest information base needs to be updated.
- the embodiments of the present application provide two ways of updating the point of interest information database, which are to update according to preset conditions, and to perform real-time update when an update instruction sent by a user is received.
- updating according to a preset condition refers to triggering an automatic update procedure to perform an automatic update after the preset condition is reached.
- the preset condition may be a preset update period, for example 7 days. It may also be the detection of a change in the URL list crawled in step S712; for example, under the same crawling conditions, the number of crawled results changes from 16000 to 17600, at which point the point of interest information base is updated.
- the specific preset conditions can be set flexibly according to the actual situation, and are not specifically limited here.
- the point of interest information base is updated in real time or automatically, so that the point of interest information it contains remains accurate. Subsequent point of interest recognition can therefore draw on accurate and comprehensive point of interest information, which helps improve recognition accuracy.
- FIG. 6 shows a point of interest recognition device that corresponds one-to-one to the point of interest recognition method provided by the above method embodiment. For ease of description, only the parts related to the embodiments of this application are shown.
- the point of interest recognition device includes a training corpus acquisition module 10, a training corpus analysis module 20, a voice information analysis module 30, an occurrence probability calculation module 40, a pronunciation sequence confirmation module 50, and a recognition result acquisition module 60.
- the detailed description of each function module is as follows:
- a training corpus acquisition module 10 configured to acquire a preset training corpus
- a training corpus analysis module 20 is configured to analyze the preset training corpus using an N-gram model to obtain word sequence data of the preset training corpus, where the word sequence data includes word sequences and the word sequence frequency of each word sequence;
- the voice information analysis module 30 is configured to parse the voice information to be recognized if the voice information to be recognized is received, to obtain M pronunciation sequences of the voice information to be recognized, where M is a positive integer greater than 1;
- the occurrence probability calculation module 40 is configured to calculate the occurrence probability of each pronunciation sequence according to the word sequence data for each pronunciation sequence, thereby obtaining the occurrence probability of M pronunciation sequences;
- the pronunciation sequence confirmation module 50 is configured to select, from the occurrence probabilities of M pronunciation sequences, a pronunciation sequence corresponding to an occurrence probability reaching a preset probability threshold as a target pronunciation sequence;
- the recognition result obtaining module 60 is configured to obtain the point of interest information corresponding to the target pronunciation sequence from the point of interest information database as a point of interest recognition result of the speech information to be recognized.
- the occurrence probability calculation module 40 includes:
- the segmentation sequence extraction unit 41 is configured to obtain, for each pronunciation sequence, all the segmentations $a_1, a_2, \ldots, a_{n-1}, a_n$ within the pronunciation sequence, where n is a positive integer greater than 1;
- the occurrence probability calculation unit 42 is configured to calculate, according to the word sequence data and using the following formula, the probability that the n-th segmentation $a_n$ of the n segmentations appears after the word sequence $(a_1 a_2 \ldots a_{n-1})$, and to use this probability as the occurrence probability of the pronunciation sequence:
- the interest point recognition device further includes:
- the point of interest information base construction unit 71 is configured to construct a point of interest information base
- a supplementary corpus acquisition unit 72 configured to generate a supplementary corpus based on the point of interest information base
- the training corpus generating unit 73 is configured to combine a supplementary corpus with a preset basic corpus to obtain a training corpus.
- the point of interest information base construction unit 71 includes:
- a classification division subunit 711 configured to classify a preset basic interest point according to a preset classification method to obtain a basic classification of an interest point information database;
- an information acquisition subunit 712 configured to obtain, for each basic classification, by web crawling, the point of interest information of all basic points of interest belonging to that basic classification in each administrative region of the country.
- the supplementary corpus acquisition unit 72 includes:
- a corpus acquisition subunit 723 is configured to establish a mapping relationship between a point of interest segmentation and corresponding point of interest information, and correspondingly save the point of interest segmentation, point of interest information, and mapping relationship in a supplementary corpus.
- the interest point recognition device further includes:
- the information base update module 80 is configured to update the point of interest information base in real time if an update instruction is received, or automatically update the point of interest information base according to a preset condition.
- This embodiment provides one or more nonvolatile readable storage media storing computer readable instructions.
- the nonvolatile readable storage medium stores computer-readable instructions. When the computer-readable instructions are executed by one or more processors, the point of interest recognition method in the foregoing method embodiment is implemented, or the functions of each module / unit in the point of interest recognition device in the foregoing device embodiment are implemented. To avoid repetition, details are not described here again.
- the non-volatile readable storage medium may include: any entity or device capable of carrying the computer-readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier signals, and telecommunication signals.
- FIG. 7 is a schematic diagram of a terminal device according to an embodiment of the present application.
- the terminal device 90 of this embodiment includes a processor 91, a memory 92, and computer-readable instructions 93 stored in the memory 92 and executable on the processor 91, such as a point of interest recognition program.
- when the processor 91 executes the computer-readable instructions 93, the steps in the foregoing embodiments of the point of interest recognition method are implemented, for example, steps S1 to S6 shown in FIG.
- alternatively, when the processor 91 executes the computer-readable instructions 93, the functions of each module / unit in the foregoing device embodiments are implemented, for example, the functions of the modules 10 to 60 shown in FIG. 6.
- the computer-readable instructions 93 may be divided into one or more modules / units, and the one or more modules / units are stored in the memory 92 and executed by the processor 91 to complete the present application.
- One or more modules / units may be instruction segments of a series of computer-readable instructions capable of performing specific functions, and the instruction segments are used to describe the execution process of the computer-readable instructions 93 in the terminal device 90.
- the computer-readable instructions 93 may be divided into a training corpus acquisition module, a training corpus analysis module, a voice information analysis module, an occurrence probability calculation module, a pronunciation sequence confirmation module, and a recognition result acquisition module. The specific functions of each module are as shown in the device embodiment. To avoid repetition, details are not described here.
- the terminal device 90 may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server.
- the terminal device 90 may include, but is not limited to, a processor 91 and a memory 92.
- FIG. 7 is only an example of the terminal device 90, and does not constitute a limitation on the terminal device 90.
- the terminal device 90 may include more or fewer components than shown in the figure, may combine some components, or may use different components; for example, the terminal device 90 may further include an input / output device, a network access device, and a bus.
- the processor 91 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
- a general-purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
- the memory 92 may be an internal storage unit of the terminal device 90, such as a hard disk or a memory of the terminal device 90.
- the memory 92 may also be an external storage device of the terminal device 90, such as a plug-in hard disk provided on the terminal device 90, a Smart Memory Card (SMC), a Secure Digital (SD) card, or a flash memory card (Flash Card).
- the memory 92 may include both an internal storage unit of the terminal device 90 and an external storage device.
- the memory 92 is used to store computer-readable instructions and other programs and data required by the terminal device 90.
- the memory 92 may also be used to temporarily store data that has been output or is to be output.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Probability & Statistics with Applications (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Disclosed in the present application are an interest point recognition method, an apparatus, a terminal device, and a storage medium. The method comprises: acquiring a preset training corpus; using an N-gram model to analyze the training corpus to obtain word sequence data; upon receiving voice information to be recognized, parsing the voice information to obtain M pronunciation sequences of the voice information; for each pronunciation sequence, calculating its occurrence probability according to the word sequence data, thereby obtaining the occurrence probabilities of the M pronunciation sequences; selecting, from the occurrence probabilities of the M pronunciation sequences, the pronunciation sequence whose occurrence probability reaches a preset probability threshold as a target pronunciation sequence; and acquiring, from an interest point information base, the interest point information corresponding to the target pronunciation sequence as the interest point recognition result of the voice information. The present application accurately recognizes the meaning of voice information, improving the accuracy and efficiency of interest point recognition.
Description
This application is based on, and claims priority to, Chinese invention patent application No. 201810529490.2, filed on May 29, 2018 and entitled "Interest Point Recognition Method, Device, Terminal Equipment and Storage Medium".
The present application relates to the field of computer technology, and in particular to a point of interest recognition method, an apparatus, a terminal device, and a storage medium.
With social progress and economic development, many people travel frequently on business, and others travel for leisure in their spare time. In unfamiliar places they often need to look up addresses or points of interest through smart devices. For convenience, many smart devices provide a speech recognition function for point of interest recognition.
Most speech recognition functions currently provided by smart devices use a general-purpose model to convert the acquired natural language information from speech to text in order to recognize the preset points of interest it contains. However, natural language often contains many words that interfere with the preset points of interest, and, owing to differences in each person's manner of expression, accent, and so on, the recognition of points of interest in natural-language voice information has low accuracy and low efficiency.
Summary of the Invention
The embodiments of the present application provide a point of interest recognition method, an apparatus, a terminal device, and a storage medium, to solve the problems of low accuracy and low efficiency in recognizing points of interest in natural-language voice information.
In a first aspect, an embodiment of the present application provides a point of interest recognition method, including:
acquiring a preset training corpus;
analyzing the preset training corpus using an N-gram model to obtain word sequence data of the preset training corpus, where the word sequence data includes word sequences and the word sequence frequency of each word sequence;
if voice information to be recognized is received, parsing the voice information to be recognized to obtain M pronunciation sequences of the voice information to be recognized, where M is a positive integer greater than 1;
for each pronunciation sequence, calculating the occurrence probability of the pronunciation sequence according to the word sequence data, thereby obtaining the occurrence probabilities of the M pronunciation sequences;
selecting, from the occurrence probabilities of the M pronunciation sequences, the pronunciation sequence corresponding to an occurrence probability that reaches a preset probability threshold as a target pronunciation sequence;
acquiring, from a point of interest information base, the point of interest information corresponding to the target pronunciation sequence as the point of interest recognition result of the voice information to be recognized.
In a second aspect, an embodiment of the present application provides a point of interest recognition device, including:
a training corpus acquisition module, configured to acquire a preset training corpus;
a training corpus analysis module, configured to analyze the preset training corpus using an N-gram model to obtain word sequence data of the preset training corpus, where the word sequence data includes word sequences and the word sequence frequency of each word sequence;
a voice information parsing module, configured to, if voice information to be recognized is received, parse the voice information to be recognized to obtain M pronunciation sequences of the voice information to be recognized, where M is a positive integer greater than 1;
an occurrence probability calculation module, configured to calculate, for each pronunciation sequence and according to the word sequence data, the occurrence probability of the pronunciation sequence, thereby obtaining the occurrence probabilities of the M pronunciation sequences;
a pronunciation sequence confirmation module, configured to select, from the occurrence probabilities of the M pronunciation sequences, the pronunciation sequence corresponding to an occurrence probability that reaches a preset probability threshold as a target pronunciation sequence;
a recognition result acquisition module, configured to acquire, from a point of interest information base, the point of interest information corresponding to the target pronunciation sequence as the point of interest recognition result of the voice information to be recognized.
In a third aspect, an embodiment of the present application provides a terminal device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the steps of the point of interest recognition method when executing the computer-readable instructions.
In a fourth aspect, an embodiment of the present application provides one or more non-volatile readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the point of interest recognition method.
Details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below. Other features and advantages of the present application will become apparent from the description, the drawings, and the claims.
To explain the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are merely some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is an implementation flowchart of the point of interest recognition method provided by an embodiment of the present application;
FIG. 2 is an implementation flowchart of step S4 in the point of interest recognition method provided by an embodiment of the present application;
FIG. 3 is an implementation flowchart of obtaining the training corpus in the point of interest recognition method provided by an embodiment of the present application;
FIG. 4 is an implementation flowchart of constructing the point of interest information base in the point of interest recognition method provided by an embodiment of the present application;
FIG. 5 is an implementation flowchart of generating the supplementary corpus in the point of interest recognition method provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of the point of interest recognition device provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of the terminal device provided by an embodiment of the present application.
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in this application without creative effort fall within the protection scope of this application.
Refer to FIG. 1, which shows an implementation flowchart of the point of interest recognition method provided by an embodiment of the present application. The method is applied to scenarios in which points of interest are recognized in natural-language voice information. Such a scenario includes a server and a client connected through a network, and the user sends natural-language voice information through the client. The client may be, but is not limited to, a personal computer, a notebook computer, a smartphone, a tablet computer, or a portable wearable device. The server may be implemented as an independent server or as a server cluster composed of multiple servers. The point of interest recognition method provided in the embodiments of the present application is applied to the server and is detailed as follows:
S1: Acquire a preset training corpus.
Specifically, the training corpus is a corpus obtained by training with relevant language material for the purpose of evaluating voice information in natural language. In the embodiments of the present application, the content of the training corpus includes, but is not limited to, point of interest information and a general corpus.
A corpus is a large-scale electronic text collection that has been scientifically sampled and processed. Corpora are a basic resource for linguistic research and the main resource for empirical approaches to language study. They are used in lexicography, language teaching, traditional linguistic research, and statistics-based or example-based research in natural language processing. Corpus material, that is, language material, is the object of linguistic research and the basic unit of which a corpus is composed.
S2: Analyze the preset training corpus using an N-gram model to obtain word sequence data of the preset training corpus, where the word sequence data includes word sequences and the word sequence frequency of each word sequence.
Specifically, the N-gram model is used to statistically analyze each item of corpus material in the preset training corpus, giving the number of times that one item H appears after another item I in the preset training corpus, and thereby the word sequence data for the word sequence formed by "item I + item H".
A word sequence is a sequence formed by combining at least two items of corpus material in a certain order. The word sequence frequency is the ratio of the number of occurrences of the word sequence to the number of occurrences of all segmented words (word segmentation) in the entire corpus, where a segmented word is a word obtained by combining a continuous character sequence according to a preset combination method. For example, if the word sequence "爱吃西红柿" ("love eating tomatoes") occurs 100 times in the entire corpus, and the occurrences of all segmented words in the entire corpus total 100000, then the word sequence frequency of "爱吃西红柿" is 100 / 100000 = 0.001.
The N-gram model is a language model commonly used in large-vocabulary continuous speech recognition. It uses the collocation information between adjacent words in context: when continuous pinyin without spaces needs to be converted into a string of Chinese characters (that is, a sentence), the sentence with the highest probability can be calculated, so that conversion to Chinese characters is automatic and requires no manual selection by the user. This avoids the ambiguity that arises because many Chinese characters share the same pinyin.
Analyzing the word sequence data of the preset training corpus in advance with the N-gram model means that this data can be used directly when occurrence probabilities are calculated later, which saves computation time and improves the efficiency of point of interest recognition.
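The statistics gathered in step S2 can be illustrated with a minimal Python sketch. It assumes the corpus has already been segmented into words; the function name `word_sequence_data`, the choice of N = 2, and the toy corpus are hypothetical and are not taken from the application.

```python
from collections import Counter

def word_sequence_data(corpus, n=2):
    """Count n-word sequences and compute each sequence's frequency.

    corpus: list of sentences, each already segmented into a list of words.
    Returns {word_sequence_tuple: (count, frequency)}, where frequency is the
    count divided by the total number of segmented words in the corpus,
    following the definition of word sequence frequency given for step S2.
    """
    total_words = sum(len(sentence) for sentence in corpus)
    counts = Counter()
    for sentence in corpus:
        # Slide a window of n words over the sentence.
        for i in range(len(sentence) - n + 1):
            counts[tuple(sentence[i:i + n])] += 1
    return {seq: (c, c / total_words) for seq, c in counts.items()}

# Toy, hand-segmented corpus (hypothetical data, for illustration only).
corpus = [
    ["我", "爱吃", "西红柿"],
    ["他", "爱吃", "西红柿"],
    ["我", "爱吃", "苹果"],
    ["苹果"],
]
data = word_sequence_data(corpus, n=2)
print(data[("爱吃", "西红柿")])
```

Because the counts and frequencies are computed once and stored, later steps only need dictionary lookups, which is the time saving the paragraph above describes.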
S3: If voice information to be recognized is received, parse the voice information to be recognized to obtain M pronunciation sequences of the voice information to be recognized, where M is a positive integer greater than 1.
Specifically, each Chinese pronunciation corresponds to one or more Chinese characters. After receiving the voice information to be recognized that the user entered on the client, the server decodes it with an acoustic decoder and converts it into multiple pronunciation sequences.
A pronunciation sequence is a text sequence containing at least two segmented words, obtained by converting the voice information.
For example, in a specific embodiment, the pronunciation sequences extracted by acoustic decoding of the voice information "wo xi huan chi zhong guo mei shi" may be pronunciation sequence A: "我" (I), "喜欢" (like), "吃" (eat), "中国" (China), "美食" (fine food); pronunciation sequence B: "我", "喜欢", "驰中", "国美", "食"; or pronunciation sequence C: "我", "西环", "持", "中国", "没事"; and so on.
S4: For each pronunciation sequence, calculate the occurrence probability of the pronunciation sequence according to the word sequence data, thereby obtaining the occurrence probabilities of the M pronunciation sequences.
Specifically, the occurrence probability is calculated for each pronunciation sequence according to the word sequence data obtained in step S2, giving the occurrence probabilities of the M pronunciation sequences.
The occurrence probability of a pronunciation sequence may be calculated using the Markov assumption: the occurrence of the Y-th word depends only on the preceding Y-1 words and on no other words, and the probability of the whole sentence is the product of the occurrence probabilities of the individual words. These probabilities can be obtained by directly counting, in the corpus, the number of times the Y words occur together. That is:
$$P(T) = P(W_1 W_2 \ldots W_Y) = P(W_1)\,P(W_2 \mid W_1) \cdots P(W_Y \mid W_1 W_2 \ldots W_{Y-1}) \qquad \text{Formula (1)}$$
Here, $P(T)$ is the probability of the whole sentence occurring, and $P(W_Y \mid W_1 W_2 \ldots W_{Y-1})$ is the probability that the $Y$-th segmented word appears after the word sequence formed by the preceding $Y-1$ segmented words.
For example, after speech recognition of the sentence "中华民族是一个有着悠久文明历史的民族" ("The Chinese nation is a nation with a long history of civilization"), one resulting pronunciation sequence is: "中华民族", "是", "一个", "有着", "悠久", "文明", "历史", "的", "民族", which contains 9 segmented words in total. When n = 9, what is calculated is the probability that the segmented word "民族" appears after the word sequence "中华民族是一个有着悠久文明历史的".
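As a rough illustration of the chain-rule product used in step S4, the following Python sketch estimates each conditional probability as a ratio of counts. The estimator (a plain maximum-likelihood ratio over sentence-initial word sequences, with no smoothing), the function names, and the toy corpus are assumptions made for this example; the application itself does not specify them.

```python
from collections import Counter

def prefix_counts(corpus):
    """Count every sentence-initial word sequence (prefix) in the corpus."""
    counts = Counter()
    for sentence in corpus:
        for end in range(1, len(sentence) + 1):
            counts[tuple(sentence[:end])] += 1
    return counts

def sequence_probability(words, counts, total_sentences):
    """Chain-rule product P(W_1) P(W_2|W_1) ... P(W_Y|W_1...W_{Y-1}).

    Each factor is estimated as count(W_1..W_k) / count(W_1..W_{k-1});
    the first factor uses the number of sentences as its denominator.
    """
    probability = 1.0
    previous = total_sentences
    for end in range(1, len(words) + 1):
        current = counts[tuple(words[:end])]
        if current == 0:
            return 0.0  # unseen word sequence; this sketch applies no smoothing
        probability *= current / previous
        previous = current
    return probability

# Toy, hand-segmented corpus (hypothetical data, for illustration only).
corpus = [
    ["我", "喜欢", "吃", "中国", "美食"],
    ["我", "喜欢", "吃", "中国", "美食"],
    ["我", "喜欢", "吃", "苹果"],
    ["他", "喜欢", "中国"],
]
counts = prefix_counts(corpus)
p_a = sequence_probability(["我", "喜欢", "吃", "中国", "美食"], counts, len(corpus))
p_b = sequence_probability(["我", "喜欢", "驰中", "国美", "食"], counts, len(corpus))
print(p_a, p_b)
```

With this toy data the implausible candidate scores zero because one of its word sequences never occurs in the corpus. An N-gram model in practice conditions each word on only the preceding N-1 words, which keeps the count tables small.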
S5: From the occurrence probabilities of the M pronunciation sequences, select each pronunciation sequence whose occurrence probability reaches a preset probability threshold as a target pronunciation sequence.
Specifically, step S4 yields one occurrence probability for each pronunciation sequence, giving M occurrence probabilities in total. Each of them is compared with the preset probability threshold. Every occurrence probability greater than or equal to the threshold is taken as a valid occurrence probability, and the pronunciation sequences corresponding to the valid occurrence probabilities are taken as the target pronunciation sequences.
Comparing against the preset probability threshold filters out pronunciation sequences whose occurrence probability does not meet the requirement, so that the selected target pronunciation sequences are closer to the meaning expressed in the natural speech, which improves the accuracy of point-of-interest recognition.
It should be noted that if the occurrence probabilities of all M pronunciation sequences are below the preset probability threshold, a reminder is pushed to the user, for example "No target location was found; please check your pronunciation and try again", and the voice information is recorded and sent to back-end administrators. If the number of target pronunciation sequences exceeds a preset number, they are sorted by occurrence probability and only the top-ranked preset number of them are kept as target pronunciation sequences. For example, if the preset number is 5, the valid occurrence probabilities are sorted, the five highest are selected, and the pronunciation sequences corresponding to those five probabilities are taken as the target pronunciation sequences.
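The selection rule of step S5 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `select_target_sequences`, the dictionary layout, and the sample probabilities are assumptions made for the example.

```python
from typing import Dict, List


def select_target_sequences(probabilities: Dict[str, float],
                            threshold: float,
                            max_targets: int = 5) -> List[str]:
    """Return the sequences whose probability reaches the threshold, best first."""
    # Keep only sequences whose occurrence probability is >= the threshold.
    valid = [(seq, p) for seq, p in probabilities.items() if p >= threshold]
    # Sort by probability, highest first, and keep at most max_targets entries.
    valid.sort(key=lambda item: item[1], reverse=True)
    return [seq for seq, _ in valid[:max_targets]]


# Hypothetical occurrence probabilities for M = 3 candidate sequences.
candidates = {"seq_a": 0.42, "seq_b": 0.07, "seq_c": 0.31}
targets = select_target_sequences(candidates, threshold=0.10, max_targets=2)

if not targets:
    # Corresponds to the reminder pushed when no sequence reaches the threshold.
    print("No target location was found; please check your pronunciation and try again")
```

An empty result corresponds to the case in which every occurrence probability is below the threshold and the reminder is pushed to the user.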
S6: Obtain the point-of-interest information corresponding to the target pronunciation sequence from the point-of-interest information database as the point-of-interest recognition result for the voice information to be recognized.
Specifically, once the target pronunciation sequence has been obtained, the point-of-interest information contained in it is retrieved from the point-of-interest information database and pushed to the user as the point-of-interest recognition result for the voice information.
In the embodiment corresponding to FIG. 1, a preset training corpus is obtained and analyzed with an N-gram model to produce the word sequence data of the training corpus. Because all word sequence data are computed in advance, they can be used directly when occurrence probabilities are calculated later, which saves computation time and improves efficiency. When voice information to be recognized is received, it is parsed into M pronunciation sequences; the occurrence probability of each pronunciation sequence is calculated from the word sequence data; the pronunciation sequences whose occurrence probability reaches the preset probability threshold are selected as target pronunciation sequences; and the point-of-interest information corresponding to the target pronunciation sequences is retrieved from the point-of-interest information database as the point-of-interest recognition result. Screening candidates by calculating the probability of each pronunciation sequence and keeping those that satisfy the condition allows the meaning of the voice information to be recognized accurately, which improves the accuracy of point-of-interest recognition.
Next, on the basis of the embodiment corresponding to FIG. 1, a specific embodiment is used to describe in detail how step S4 calculates, for each pronunciation sequence, the occurrence probability of that sequence from the word sequence data.
Referring to FIG. 2, FIG. 2 shows a specific implementation flow of step S4 provided by an embodiment of the present application, detailed as follows:
S41: For each pronunciation sequence, obtain all word segments $a_1, a_2, \ldots, a_{n-1}, a_n$ in the pronunciation sequence, where $n$ is a positive integer greater than 1.
It should be noted that the word segments of a pronunciation sequence are obtained one by one in word order, from first to last. For example, for the pronunciation sequence "我爱中国" ("I love China"), extracting word segments in word order yields the first word segment "我" ("I"), the second word segment "爱" ("love"), and the third word segment "中国" ("China").
S42: According to the word sequence data, use Formula (2) to calculate the probability that the $n$-th word segment $a_n$ of the $n$ word segments appears after the word sequence $(a_1 a_2 \cdots a_{n-1})$, and take this probability as the occurrence probability of the pronunciation sequence:
$$P(a_n \mid a_1 a_2 \cdots a_{n-1}) = \frac{C(a_1 a_2 \cdots a_{n-1} a_n)}{C(a_1 a_2 \cdots a_{n-1})} \qquad \text{Formula (2)}$$
Here, $P(a_n \mid a_1 a_2 \cdots a_{n-1})$ is the probability that the $n$-th word segment $a_n$ appears after the word sequence $(a_1 a_2 \cdots a_{n-1})$, $C(a_1 a_2 \cdots a_{n-1} a_n)$ is the word sequence frequency of the word sequence $(a_1 a_2 \cdots a_{n-1} a_n)$, and $C(a_1 a_2 \cdots a_{n-1})$ is the word sequence frequency of the word sequence $(a_1 a_2 \cdots a_{n-1})$.
Specifically, as described in step S2, the word sequence frequency of every word sequence has already been obtained by analyzing the training corpus with the N-gram model, so here it is only necessary to evaluate Formula (2).
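The ratio of word sequence frequencies described for step S42 can be illustrated with a small sketch. It assumes the frequencies are held in a `Counter` keyed by tuples of word segments; the helper names and the three-sentence toy corpus are hypothetical, and no smoothing is applied.

```python
from collections import Counter
from typing import Iterable, Sequence


def count_word_sequences(corpus: Iterable[Sequence[str]], max_n: int) -> Counter:
    """Count every contiguous word sequence of length 1..max_n in the corpus."""
    counts: Counter = Counter()
    for sentence in corpus:
        for n in range(1, max_n + 1):
            for i in range(len(sentence) - n + 1):
                counts[tuple(sentence[i:i + n])] += 1
    return counts


def conditional_probability(counts: Counter, segments: Sequence[str]) -> float:
    """P(a_n | a_1 ... a_{n-1}) = C(a_1 ... a_n) / C(a_1 ... a_{n-1})."""
    history = tuple(segments[:-1])
    if counts[history] == 0:
        return 0.0  # unseen history; this sketch applies no smoothing
    return counts[tuple(segments)] / counts[history]


# Toy corpus of pre-segmented sentences (an assumption for the example).
corpus = [["我", "爱", "中国"], ["我", "爱", "北京"], ["我", "去", "中国"]]
counts = count_word_sequences(corpus, max_n=3)

# C(我 爱 中国) = 1 and C(我 爱) = 2, so the probability is 0.5.
p = conditional_probability(counts, ["我", "爱", "中国"])
```

Counting the frequencies once and reusing them matches the description's point that the word sequence data are computed in advance.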
It is worth noting that the training corpus used by the N-gram model is very large, data sparsity is severe, time complexity is high, and the occurrence probabilities calculated for points of interest tend to be small, so a bigram model may also be used to calculate the occurrence probability.
In the bigram model, Formula (2) is used to calculate the probability $A_1$ that word segment $a_2$ appears after word segment $a_1$, the probability $A_2$ that word segment $a_3$ appears after word segment $a_2$, ..., and the probability $A_{n-1}$ that word segment $a_n$ appears after word segment $a_{n-1}$. Formula (3) then gives the occurrence probability of the whole word sequence $(a_1 a_2 \cdots a_{n-1} a_n)$:
$$P(T') = A_1 A_2 \cdots A_{n-1} \qquad \text{Formula (3)}$$
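As a rough illustration of the bigram calculation, the sketch below multiplies the pairwise probabilities $A_1 A_2 \cdots A_{n-1}$, each taken as a ratio of counts. The function name and the toy corpus are assumptions, and unseen pairs simply yield a probability of zero because no smoothing is applied.

```python
from collections import Counter
from typing import Iterable, Sequence


def bigram_sequence_probability(corpus: Iterable[Sequence[str]],
                                segments: Sequence[str]) -> float:
    """Return A_1 * A_2 * ... * A_{n-1}, where A_i = C(a_i a_{i+1}) / C(a_i)."""
    unigrams: Counter = Counter()
    bigrams: Counter = Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))

    probability = 1.0
    for prev, cur in zip(segments, segments[1:]):
        if unigrams[prev] == 0:
            return 0.0  # the preceding segment never occurs in the corpus
        probability *= bigrams[(prev, cur)] / unigrams[prev]
    return probability


# Toy corpus of pre-segmented sentences (an assumption for the example).
corpus = [["我", "爱", "中国"], ["我", "爱", "北京"], ["我", "去", "中国"]]

# A_1 = C(我 爱) / C(我) = 2/3 and A_2 = C(爱 中国) / C(爱) = 1/2.
p_sentence = bigram_sequence_probability(corpus, ["我", "爱", "中国"])
```

Because each factor depends only on one preceding segment, far fewer distinct word sequences have to be counted than with long histories, which is the motivation the text gives for the bigram variant.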
In the embodiment corresponding to FIG. 2, all word segments of each pronunciation sequence are obtained, and the probability that the last word segment appears after the word sequence formed by all preceding word segments is calculated to give the probability of the whole sentence. This probability is used to judge whether the sentence is reasonable, so that the semantics of the natural-language voice information can be recognized and the relevant information, such as the name of the point of interest to be retrieved, can be obtained, which effectively improves the accuracy of point-of-interest recognition.
On the basis of the embodiment corresponding to FIG. 1 or FIG. 2, a training corpus may also be constructed before the preset training corpus is obtained in step S1. As shown in FIG. 3, the point-of-interest recognition method further includes:
S71: Construct a point-of-interest information database.
Specifically, to ensure accurate recognition, a point-of-interest information database with comprehensive coverage must be built before point-of-interest recognition is performed. The database contains the point-of-interest information of every point of interest. It may be generated from the points of interest contained in an existing general model, built from manually collected points of interest, or built from points of interest gathered by a web crawler; the approach is not specifically limited here.
Preferably, the embodiment of the present application builds the point-of-interest information database from points of interest gathered by a web crawler.
The point-of-interest information includes, but is not limited to, the name of the point of interest, the category to which it belongs, and its address.
S72: Generate a supplementary corpus based on the point-of-interest information database.
Specifically, the point-of-interest information is extracted from the point-of-interest information database, all of it is processed in a preset manner, and the result is used as the supplementary corpus.
The processing may be word segmentation of the points of interest, semantic statistics on the point-of-interest information, and so on; it can be chosen according to actual needs and is not limited here.
S73: Combine the supplementary corpus with a preset base corpus to obtain the training corpus.
Specifically, because the training corpus is analyzed with an N-gram model, it must be large enough for the model to judge whether a sentence is reasonable. A preset base corpus with sufficient material is therefore combined with the supplementary corpus to obtain the training corpus.
The preset base corpus is selected according to actual needs. For example, Sohu news from the past three years in areas such as finance, sports, and current affairs can be collected, cleaned, and organized into a corpus that serves as the base corpus.
In the embodiment corresponding to FIG. 3, a point-of-interest information database is constructed, a supplementary corpus is generated from it, and the supplementary corpus is combined with the preset base corpus to obtain the training corpus. The training corpus analyzed by the N-gram model can therefore both judge whether a sentence is reasonable and supply information about points of interest, so that whether a sentence contains a point of interest can be assessed accurately. This helps improve the accuracy of recognizing natural-language voice information and of recognizing point-of-interest information.
On the basis of the embodiment corresponding to FIG. 3, a specific embodiment is used below to describe in detail how the point-of-interest information database mentioned in step S71 is constructed.
Referring to FIG. 4, FIG. 4 shows a specific implementation flow of step S71 provided by an embodiment of the present application, detailed as follows:
S711: Classify the preset basic points of interest according to a preset classification scheme to obtain the basic categories of the point-of-interest information database.
Specifically, the basic points of interest are classified according to the preset classification scheme, the resulting classes are used as the basic categories of the point-of-interest information database, and the point-of-interest information belonging to each basic category is stored at the corresponding location in the database. The classification scheme can be set according to actual needs and is not limited here.
Here, a basic point of interest is a subcategory of points of interest, and a basic category is a top-level category. For example, one basic category in the point-of-interest information database is "food", and the basic points of interest under it include "breakfast", "fast food", "hot pot", "buffet", and "hotel".
S712: For each basic category, obtain by web crawling the point-of-interest information of all basic points of interest under that category in every administrative region of the country, giving the point-of-interest information of the basic category in each administrative region nationwide.
Specifically, for each basic category in the point-of-interest information database, a web crawler (Web Crawler) crawls every administrative region of the country in turn to obtain the information of all basic points of interest under that category in the region. Repeating this for every basic category yields the point-of-interest information of all basic categories in every administrative region of the country.
A web crawler of this kind is also called a whole-web crawler (Scalable Web Crawler). Its crawl targets expand from a set of seed URLs (Uniform Resource Locators) to the entire Web (World Wide Web), and it mainly collects data for portal-site search engines and large Web service providers. Such a crawler covers a huge range and number of pages, so it places high demands on crawl speed and storage space but relatively low demands on the order in which pages are crawled. Because so many pages need refreshing, it usually works in parallel. Its structure can be roughly divided into a page crawling module, a page analysis module, a link filtering module, a page database, a URL queue, and an initial URL set. To improve efficiency, a general-purpose web crawler adopts a crawl strategy; the common ones are the depth-first strategy and the breadth-first strategy.
The depth-first strategy follows links in order of increasing depth, visiting the next level of links each time until it can go no deeper. After finishing one crawl branch, the crawler returns to the previous link node and searches its other links. The crawl ends when all links have been traversed.
The breadth-first strategy crawls pages according to the depth of their content directory level: pages at shallower directory levels are crawled first, and once all pages at one level have been crawled, the crawler moves down to the next level. This strategy effectively controls the crawl depth, avoids the problem of a crawl never terminating when it meets an infinitely deep branch, and is easy to implement without storing a large number of intermediate nodes.
Preferably, the embodiment of the present application uses the breadth-first strategy.
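The breadth-first strategy can be sketched on an in-memory link graph, without any network access. The function `crawl_breadth_first`, the `get_links` callback, the `max_depth` limit, and the toy link table are all assumptions introduced for illustration; a real crawler would fetch and parse pages inside `get_links`.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl_breadth_first(seeds: Iterable[str],
                        get_links: Callable[[str], Iterable[str]],
                        max_depth: int) -> List[str]:
    """Visit URLs level by level, never going deeper than max_depth."""
    visited = set()
    queue = deque()
    for url in seeds:
        if url not in visited:
            visited.add(url)
            queue.append((url, 0))

    order = []
    while queue:
        url, depth = queue.popleft()  # FIFO order gives level-by-level traversal
        order.append(url)
        if depth == max_depth:
            continue  # do not expand links below the depth limit
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order


# Hypothetical link structure standing in for fetched pages.
links = {"seed": ["a", "b"], "a": ["c"], "b": ["c", "d"], "c": ["e"], "d": [], "e": []}
order = crawl_breadth_first(["seed"], lambda u: links.get(u, []), max_depth=2)
```

With the depth limit set to 2, the page "e" (three links away from the seed) is never visited, which shows how the strategy bounds the crawl depth.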
For each basic category, the process of crawling every administrative region of the country in turn with the web crawler, to obtain the information of all basic points of interest under that category in the region and thereby the point-of-interest information of the basic category in each administrative region, comprises steps A to E, detailed as follows:
Step A: Obtain the information of administrative regions at all levels nationwide and the latitude and longitude of each administrative region.
Specifically, the administrative region information of every city-level unit in the country is obtained first, then the information of the county-level administrative regions within each city-level region, and then the information of the districts, subdistricts, townships, and other offices within each county-level region.
The administrative region information includes, but is not limited to, the region name, the region code, and information on the superior and subordinate regions. For example, Table 1 shows the administrative region information obtained for region code 440300.
Table 1
Further, the latitude and longitude information of each administrative region is obtained.
Latitude and longitude together form a coordinate system known as the geographic coordinate system. It is a spherical coordinate system that uses a sphere in three-dimensional space to define locations on the Earth, and it can identify any position on the Earth.
Latitude and longitude coordinate systems commonly used in China include, but are not limited to, the WGS84 coordinate system (World Geodetic System 1984), the Beijing 54 coordinate system (BJZ54), and the Xi'an 80 coordinate system (XIAN80).
Preferably, the embodiment of the present application uses the WGS84 coordinate system.
Step B: For each administrative region K, split the region by latitude and longitude according to a preset split side length to obtain a list of n rectangles of equal size.
Specifically, the national list contains many administrative regions of different sizes. From the latitude and longitude range of a region, four extreme-point coordinates are obtained and used as the four vertices of a large rectangle, which is then divided according to the preset split side length into n rectangles.
It is worth noting that administrative regions differ in how developed they are, so points of interest are dense in some regions and sparse in others. The preset split side length can therefore be chosen per region: a smaller side length for regions with dense points of interest and a larger one for regions with sparse points of interest. This speeds up the later crawl and improves the efficiency of point-of-interest collection.
Further, the latitude and longitude of the four vertices of the large rectangle are converted into rectangular coordinates, and the rectangle is divided according to the preset rectangle length and width. For example, let the lower-left corner be (lat_1, lon_1), the upper-right corner be (lat_2, lon_2), and the split side length be len. The first rectangle then has lower-left corner (lat_1, lon_1) and upper-right corner (lat_1+len, lon_1+len); the second rectangle, adjacent to it in the latitude direction, has lower-left corner (lat_1+len, lon_1) and upper-right corner (lat_1+2×len, lon_1+len). The number of rectangles generated is:
(int((lat_2-lat_1)/len)+1)×(int((lon_2-lon_1)/len)+1)
where int is a function that truncates to an integer, for example int(1.334) = 1.
For example, in a specific embodiment, the latitude and longitude range obtained for Shenzhen is 113°46' to 114°37' east longitude and 22°27' to 22°52' north latitude, which converts to a lower-left corner of (22.45, 113.769444) and an upper-right corner of (22.86667, 114.619444). If the side length is set to 0.04 according to actual needs, then in the manner described above the first rectangle has lower-left corner (22.45, 113.769444) and upper-right corner (22.49, 113.809444), and the second rectangle has lower-left corner (22.49, 113.769444) and upper-right corner (22.53, 113.809444).
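The splitting of step B can be sketched as a simple grid generator. The function name `split_bounding_box`, the tuple layout `(lat_min, lon_min, lat_max, lon_max)`, and the traversal order (stepping along latitude first) are assumptions for illustration; the rectangle count follows the expression given in the text.

```python
from typing import List, Tuple

# (lat_min, lon_min, lat_max, lon_max); this layout is an assumption.
Rect = Tuple[float, float, float, float]


def split_bounding_box(lat_1: float, lon_1: float,
                       lat_2: float, lon_2: float,
                       side: float) -> List[Rect]:
    """Cover the box (lat_1, lon_1)-(lat_2, lon_2) with squares of the given side."""
    rows = int((lat_2 - lat_1) / side) + 1  # rectangles along latitude
    cols = int((lon_2 - lon_1) / side) + 1  # rectangles along longitude
    rects: List[Rect] = []
    for j in range(cols):
        for i in range(rows):
            lat_min = lat_1 + i * side
            lon_min = lon_1 + j * side
            rects.append((lat_min, lon_min, lat_min + side, lon_min + side))
    return rects


# Bounding box and side length taken from the Shenzhen example in the text.
rects = split_bounding_box(22.45, 113.769444, 22.86667, 114.619444, 0.04)
```

For the Shenzhen values this gives 11 × 22 = 242 rectangles; the last row and column extend slightly beyond the bounding box, which is why the count uses `+1` in each direction.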
Step C: For basic category J, generate the URL list of administrative region K from the rectangle list of that region.
Specifically, suppose the current basic category is J and the current administrative region is K. After the list of n rectangles for region K has been generated, the web crawler traverses the rectangle list and, for each rectangle, crawls the URLs that contain any basic point of interest under basic category J, producing the URL list.
For example, in a specific embodiment, to crawl Baidu Maps for the basic point of interest "中学" (secondary school) within the rectangle whose lower-left corner is (22.49, 113.769444) and whose upper-right corner is (22.53, 113.809444), the following code can be used:
url = ('http://api.map.baidu.com/place/v2/search?query=中学&bounds='
       + '22.49' + ',' + '113.769444' + ',' + '22.53' + ',' + '113.809444'
       + '&page_size=20&page_num=' + str(page_num)
       + '&output=json&ak=9s5GSYZsWbMaFU8Ps2V2VWvDlDlqGaaO')
Here, "page_size" is the preset number of items contained in each page, "page_num" is the page number, and "ak" (API console Key, AK) is the developer's Baidu Maps API console key.
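A less error-prone way to assemble such a request URL is to let the standard library encode the parameters. The sketch below only builds the string and sends no request; the endpoint and parameter names are those quoted in the text, while the function name `build_place_search_url` and the placeholder key `YOUR_AK` are assumptions.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "http://api.map.baidu.com/place/v2/search"


def build_place_search_url(query, bounds, page_num, ak, page_size=20):
    """Build a search URL for one rectangle and one result page."""
    lat_min, lon_min, lat_max, lon_max = bounds
    params = {
        "query": query,
        # Lower-left corner first, then upper-right corner.
        "bounds": f"{lat_min},{lon_min},{lat_max},{lon_max}",
        "page_size": page_size,
        "page_num": page_num,
        "output": "json",
        "ak": ak,  # placeholder; a real key must be supplied by the developer
    }
    # urlencode percent-encodes the Chinese query text and the commas.
    return f"{BASE_URL}?{urlencode(params)}"


# First rectangle of the Shenzhen example, first result page.
url = build_place_search_url("中学", (22.45, 113.769444, 22.49, 113.809444),
                             page_num=0, ak="YOUR_AK")
decoded = parse_qs(urlsplit(url).query)
```

Generating one URL per rectangle and per page number in this way yields the URL list described in step C.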
Step D: Parse the URL list to determine how the points of interest of basic category J are distributed in administrative region K, and obtain the point-of-interest information in that region that belongs to basic category J.
Specifically, the pages of the URL list obtained in step C are parsed to extract the point-of-interest information of the basic category contained at each URL, giving the point-of-interest information contained in each administrative region.
For example, in a specific embodiment, the URL list obtained contains 26 URLs, each URL contains 20 point-of-interest records, and one of these records is as follows:
Parsing this record from the URL gives the following result: the point-of-interest name is "昆明八中" (Kunming No. 8 Middle School), its address is "云南省昆明市五华区龙泉路628号" (No. 628 Longquan Road, Wuhua District, Kunming, Yunnan Province), its administrative region is "五华区" (Wuhua District), and its street ID is "35debf29e6063d3aa7da399b".
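The parsing in step D might look roughly like the sketch below. The JSON layout and the field names `results`, `name`, `address`, `area`, and `street_id` are assumptions chosen for illustration and are not confirmed by the text; the sample payload is hand-written from the values quoted above rather than captured from a real response.

```python
import json
from typing import Dict, List

# Hand-written sample; the field names are assumed, the values come from the text.
SAMPLE_RESPONSE = """
{"status": 0, "results": [
  {"name": "昆明八中",
   "address": "云南省昆明市五华区龙泉路628号",
   "area": "五华区",
   "street_id": "35debf29e6063d3aa7da399b"}
]}
"""


def parse_poi_records(payload: str) -> List[Dict[str, str]]:
    """Extract name, address, region, and street ID from each result entry."""
    data = json.loads(payload)
    records = []
    for item in data.get("results", []):
        records.append({
            "name": item.get("name", ""),
            "address": item.get("address", ""),
            "area": item.get("area", ""),
            "street_id": item.get("street_id", ""),
        })
    return records


records = parse_poi_records(SAMPLE_RESPONSE)
```

Missing fields default to empty strings so that one incomplete entry does not abort the parsing of a whole page.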
Step E: Store the obtained point-of-interest information at the corresponding location in the point-of-interest information database.
Specifically, the obtained point-of-interest information is stored at the location in the point-of-interest information database that corresponds to the basic category to which it belongs.
Taking the point-of-interest information obtained in step D as an example, the record whose point-of-interest name is "昆明八中" is stored under the basic point of interest "中学" (secondary school) of the basic category "学校" (school).
In the embodiment corresponding to FIG. 4, the preset basic points of interest are classified according to the preset classification scheme to obtain the basic categories of the point-of-interest information database. For each basic category, the point-of-interest information of all basic points of interest under that category in every administrative region of the country is then obtained by web crawling. In this way the point-of-interest information of every administrative region nationwide is collected, so that accurate and comprehensive point-of-interest information is available during recognition, which helps improve the accuracy of point-of-interest recognition.
On the basis of the embodiment corresponding to FIG. 3, a specific embodiment is used below to describe in detail how the supplementary corpus mentioned in step S72 is generated from the point-of-interest information database.
Referring to FIG. 5, FIG. 5 shows a specific implementation flow of step S72 provided by an embodiment of the present application, detailed as follows:
S721: Extract the point-of-interest information from the point-of-interest information database.
Specifically, the basic points of interest under each basic category, together with the point-of-interest information they contain, are extracted from the point-of-interest information database.
S722: Perform word segmentation on the point-of-interest information to obtain point-of-interest word segments.
Specifically, Chinese word segmentation is performed on each extracted piece of point-of-interest information to obtain the point-of-interest word segments of that information.
Chinese word segmentation means splitting a sequence of Chinese characters into individual words; that is, it is the process of recombining a continuous character sequence into a word sequence according to certain rules. Existing segmentation algorithms fall into three main classes: methods based on string matching, methods based on understanding, and methods based on statistics. Depending on whether segmentation is combined with part-of-speech tagging, they can also be divided into pure segmentation methods and integrated methods that combine segmentation and tagging.
Preferably, the segmentation algorithm used in the embodiment of the present invention is the understanding-based word segmentation method.
For example, in a specific embodiment, one piece of point-of-interest information obtained is "basic category: 美食 (food); basic point of interest: 快餐 (fast food); point-of-interest name: 盐津煲; point-of-interest address: 广东省深圳市罗湖区八卦二路盐津煲". Word segmentation yields "美食", "快餐", "广东省" (Guangdong Province), "深圳市" (Shenzhen), "罗湖区" (Luohu District), "八卦二路" (Bagua 2nd Road), and "盐津煲".
S723: Establish a mapping between the point-of-interest word segments and the corresponding point-of-interest information, and save the word segments, the point-of-interest information, and the mapping together in the supplementary corpus.
Specifically, after the point-of-interest information has been segmented, each resulting word segment is associated with that point-of-interest information to form a mapping, and the word segments, the point-of-interest information, and the mapping are saved together in the supplementary corpus, so that the corresponding point-of-interest information can be found when a point-of-interest word segment is recognized. In addition, placing both the word segments and the point-of-interest information in the supplementary corpus increases the number of times information related to points of interest appears in the training corpus.
Taking the point-of-interest word segments obtained in step S722 as an example, the point of interest whose information is "basic category: 美食; basic point of interest: 快餐; point-of-interest name: 盐津煲; point-of-interest address: 广东省深圳市罗湖区八卦二路盐津煲" has the word segment set {"美食", "快餐", "广东省", "深圳市", "罗湖区", "八卦二路", "盐津煲"}.
在图5对应的实施例中,通过提取兴趣点信息库中的兴趣点信息,并对兴趣点信息进行分词处理,得到兴趣点分词,进而建立兴趣点分词与对应的兴趣点信息之间的映射关系,并将兴趣点分词、兴趣点信息和映射关系对应保存到补充语料库中,使得补充语料库包含兴趣点信息、兴趣点分词和它们的映射关系,使得在后续兴趣点检测时,能够根据相应的兴趣点分词直接找到对应的兴趣点信息,从而提高兴趣点的识别效率。In the embodiment corresponding to FIG. 5, by extracting the point of interest information in the point of interest information database and performing word segmentation processing on the point of interest information, the point of interest segmentation is obtained, and then a mapping between the point of interest segmentation and the corresponding point of interest information is established. Relationship, and save the interest point segmentation, interest point information, and mapping relationship to the supplementary corpus, so that the supplementary corpus contains interest point information, interest point segmentation, and their mapping relationships, so that in the subsequent interest point detection, it can be based on the corresponding The POI segmentation directly finds the corresponding POI information, thereby improving the efficiency of POI recognition.
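The mapping built in step S723 can be illustrated with a minimal Python sketch. It assumes the word segments have already been produced by some segmenter; the function name `build_supplementary_corpus` and the dictionary layout are illustrative assumptions, not the structure used in the application.

```python
from collections import defaultdict


def build_supplementary_corpus(poi_records):
    """Build a word-segment -> point-of-interest-information mapping.

    poi_records: iterable of (poi_info, word_segments) pairs. The word
    segments are assumed to come from a separate word segmenter.
    """
    segment_to_info = defaultdict(list)
    for poi_info, word_segments in poi_records:
        for segment in word_segments:
            # Avoid storing the same record twice for a repeated segment.
            if poi_info not in segment_to_info[segment]:
                segment_to_info[segment].append(poi_info)
    return dict(segment_to_info)


# Example record taken from the description above (English rendering).
yanjin_bao = (
    "basic category: food; basic point of interest: fast food; "
    "name: Yanjin Bao; "
    "address: Yanjin Bao, Bagua 2nd Road, Luohu District, Shenzhen, "
    "Guangdong Province"
)
segments = ["food", "fast food", "Guangdong Province", "Shenzhen",
            "Luohu District", "Bagua 2nd Road", "Yanjin Bao"]

supplementary_corpus = build_supplementary_corpus([(yanjin_bao, segments)])
print(supplementary_corpus["Bagua 2nd Road"])
```

A lookup by any single word segment then returns every point-of-interest record that contains it, which is the direct retrieval the embodiment describes.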
On the basis of the embodiment corresponding to FIG. 3, after the point-of-interest information base is constructed as described in step S71, the point-of-interest information base may also be updated. The point-of-interest recognition method further includes:
If an update instruction is received, updating the point-of-interest information base in real time; or automatically updating the point-of-interest information base according to a preset condition.
Understandably, point-of-interest information changes over time. If the point-of-interest information base is not updated after some points of interest change, attempts to recognize those points of interest will fail or return incorrect information. The point-of-interest information base therefore needs to be updated.
Specifically, the embodiments of the present application provide two ways of updating the point-of-interest information base: updating according to a preset condition, and updating in real time when an update instruction sent by a user is received.
Updating according to a preset condition means that an automatic update procedure is triggered once the preset condition is met. The preset condition may be a preset update period, for example 7 days. It may also be the detection of a change in the URL list crawled in step S712, for example when the number of crawl results under the same crawling conditions changes from 16,000 to 17,600, at which point the point-of-interest information base is updated. The specific preset condition can be set flexibly according to actual needs and is not specifically limited here.
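The two example trigger conditions above can be sketched as a small Python check. This is only an illustration: the function name `should_auto_update` is hypothetical, and comparing crawl result counts is an assumed stand-in for detecting a change in the crawled URL list.

```python
from datetime import datetime, timedelta


def should_auto_update(last_update, now, previous_count, current_count,
                       period=timedelta(days=7)):
    """Return True when either example preset condition is met."""
    # Condition 1: the preset update period (7 days here) has elapsed.
    period_elapsed = now - last_update >= period
    # Condition 2: the crawl result count changed (assumed proxy for a
    # change in the crawled URL list).
    crawl_changed = current_count != previous_count
    return period_elapsed or crawl_changed


last = datetime(2018, 7, 1)
# Two days later, but the crawl result count went from 16000 to 17600.
print(should_auto_update(last, datetime(2018, 7, 3), 16000, 17600))
```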
It should be noted that the update process of the point-of-interest information base follows the description of steps C to E in S711; to avoid repetition, it is not described again here.
In the embodiments of the present application, the point-of-interest information base is updated in real time when an update instruction is received, or automatically when a preset condition is met, so that the point-of-interest information it contains stays accurate. Subsequent point-of-interest recognition can then draw on accurate and comprehensive point-of-interest information, which helps improve recognition accuracy.
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
Corresponding to the point-of-interest recognition method in the above method embodiments, FIG. 6 shows a point-of-interest recognition apparatus in one-to-one correspondence with that method. For ease of description, only the parts related to the embodiments of the present application are shown.
As shown in FIG. 6, the point-of-interest recognition apparatus includes a training corpus acquisition module 10, a training corpus analysis module 20, a voice information parsing module 30, an occurrence probability calculation module 40, a pronunciation sequence confirmation module 50, and a recognition result acquisition module 60. The functional modules are described in detail as follows:
The training corpus acquisition module 10 is configured to acquire a preset training corpus;
The training corpus analysis module 20 is configured to analyze the preset training corpus using an N-gram model to obtain word sequence data of the preset training corpus, where the word sequence data includes word sequences and the word sequence frequency of each word sequence;
The voice information parsing module 30 is configured to, if voice information to be recognized is received, parse the voice information to obtain M pronunciation sequences of the voice information, where M is a positive integer greater than 1;
The occurrence probability calculation module 40 is configured to calculate, for each pronunciation sequence and according to the word sequence data, the occurrence probability of that pronunciation sequence, thereby obtaining the occurrence probabilities of the M pronunciation sequences;
The pronunciation sequence confirmation module 50 is configured to select, from the occurrence probabilities of the M pronunciation sequences, the pronunciation sequence corresponding to an occurrence probability that reaches a preset probability threshold as the target pronunciation sequence;
The recognition result acquisition module 60 is configured to obtain, from the point-of-interest information base, the point-of-interest information corresponding to the target pronunciation sequence as the point-of-interest recognition result of the voice information to be recognized.
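The roles of modules 50 and 60 can be sketched in Python as a threshold filter followed by a lookup. The function names, the example probabilities, the threshold value, and the dictionary used as a point-of-interest information base are all hypothetical and chosen only for illustration.

```python
def select_target_sequences(probabilities, threshold):
    """Keep pronunciation sequences whose probability reaches the threshold."""
    return [seq for seq, p in probabilities.items() if p >= threshold]


def lookup_points_of_interest(target_sequences, poi_index):
    """Fetch the point-of-interest information for each target sequence."""
    return {seq: poi_index[seq] for seq in target_sequences if seq in poi_index}


# Hypothetical occurrence probabilities for M = 3 candidate sequences.
probabilities = {"Yanjin Bao": 0.62, "Yanjin Pao": 0.05, "Yan Jin Bao": 0.11}
poi_index = {"Yanjin Bao": "fast food, Bagua 2nd Road, Luohu District, Shenzhen"}

targets = select_target_sequences(probabilities, threshold=0.5)
print(lookup_points_of_interest(targets, poi_index))
```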
Further, the occurrence probability calculation module 40 includes:
A word segment sequence extraction unit 41, configured to obtain, for each pronunciation sequence, all word segments $a_1, a_2, \ldots, a_{n-1}, a_n$ in that pronunciation sequence, where $n$ is a positive integer greater than 1;
An occurrence probability calculation unit 42, configured to calculate, according to the word sequence data and using the following formula, the probability that the $n$-th word segment $a_n$ of the $n$ word segments appears after the word sequence $(a_1 a_2 \ldots a_{n-1})$, and to take that probability as the occurrence probability of the pronunciation sequence:
$$P(a_n \mid a_1 a_2 \ldots a_{n-1}) = \frac{C(a_1 a_2 \ldots a_{n-1} a_n)}{C(a_1 a_2 \ldots a_{n-1})}$$
where $P(a_n \mid a_1 a_2 \ldots a_{n-1})$ is the probability that the $n$-th word segment $a_n$ of the $n$ word segments appears after the word sequence $(a_1 a_2 \ldots a_{n-1})$, $C(a_1 a_2 \ldots a_{n-1} a_n)$ is the word sequence frequency of the word sequence $(a_1 a_2 \ldots a_{n-1} a_n)$, and $C(a_1 a_2 \ldots a_{n-1})$ is the word sequence frequency of the word sequence $(a_1 a_2 \ldots a_{n-1})$.
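As a rough illustration of how word sequence frequencies yield this conditional probability, the following Python sketch counts word sequences in a toy corpus and divides the two frequencies. The function names and the corpus are assumptions made for the example; returning 0.0 for an unseen history is also an assumed convention, since the application does not describe smoothing.

```python
from collections import Counter


def count_word_sequences(corpus, max_n):
    """Count every word sequence of length 1..max_n in a segmented corpus."""
    counts = Counter()
    for sentence in corpus:
        for n in range(1, max_n + 1):
            for i in range(len(sentence) - n + 1):
                counts[tuple(sentence[i:i + n])] += 1
    return counts


def occurrence_probability(words, counts):
    """P(a_n | a_1 ... a_{n-1}) = C(a_1 ... a_n) / C(a_1 ... a_{n-1})."""
    history = tuple(words[:-1])
    if counts[history] == 0:
        return 0.0  # assumed convention for an unseen history
    return counts[tuple(words)] / counts[history]


# Toy segmented corpus (illustrative only).
corpus = [
    ["Shenzhen", "Luohu District", "Yanjin Bao"],
    ["Shenzhen", "Luohu District", "Bagua 2nd Road"],
    ["Shenzhen", "Futian District"],
]
counts = count_word_sequences(corpus, max_n=3)
p = occurrence_probability(["Shenzhen", "Luohu District", "Yanjin Bao"], counts)
print(p)
```

Here the history ("Shenzhen", "Luohu District") occurs twice and the full sequence once, so the ratio of the two frequencies is one half.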
Further, the point-of-interest recognition apparatus also includes:
A point-of-interest information base construction unit 71, configured to construct the point-of-interest information base;
A supplementary corpus acquisition unit 72, configured to generate a supplementary corpus based on the point-of-interest information base;
A training corpus generation unit 73, configured to combine the supplementary corpus with a preset basic corpus to obtain the training corpus.
Further, the point-of-interest information base construction unit 71 includes:
A classification subunit 711, configured to classify preset basic points of interest according to a preset classification method to obtain the basic categories of the point-of-interest information base;
An information acquisition subunit 712, configured to obtain, for each basic category and by web crawling, the point-of-interest information of all basic points of interest belonging to that basic category in every administrative region of the country, thereby obtaining the point-of-interest information of that basic category in every administrative region of the country.
Further, the supplementary corpus acquisition unit 72 includes:
An information extraction subunit 721, configured to extract the point-of-interest information from the point-of-interest information base;
An information segmentation subunit 722, configured to perform word segmentation on the point-of-interest information to obtain point-of-interest word segments;
A corpus acquisition subunit 723, configured to establish a mapping relationship between the point-of-interest word segments and the corresponding point-of-interest information, and to save the word segments, the point-of-interest information, and the mapping relationship together in the supplementary corpus.
Further, the point-of-interest recognition apparatus also includes:
An information base update module 80, configured to update the point-of-interest information base in real time if an update instruction is received, or to update the point-of-interest information base automatically according to a preset condition.
For the process by which each module of the point-of-interest recognition apparatus provided in this embodiment implements its function, refer to the description of the foregoing method embodiments; details are not repeated here.
This embodiment provides one or more non-volatile readable storage media storing computer-readable instructions. When executed by one or more processors, the computer-readable instructions implement the point-of-interest recognition method of the above method embodiments, or implement the functions of the modules/units of the point-of-interest recognition apparatus of the above apparatus embodiments. To avoid repetition, details are not repeated here.
Understandably, the non-volatile readable storage medium may include: any entity or device capable of carrying the computer-readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), an electrical carrier signal, a telecommunication signal, and the like.
FIG. 7 is a schematic diagram of a terminal device provided by an embodiment of the present application. As shown in FIG. 7, the terminal device 90 of this embodiment includes a processor 91, a memory 92, and computer-readable instructions 93 stored in the memory 92 and executable on the processor 91, such as a point-of-interest recognition program. When the processor 91 executes the computer-readable instructions 93, the steps of the above point-of-interest recognition method embodiments are implemented, for example steps S1 to S6 shown in FIG. 1. Alternatively, when the processor 91 executes the computer-readable instructions 93, the functions of the modules/units of the above apparatus embodiments are implemented, for example the functions of modules 10 to 60 shown in FIG. 6.
By way of example, the computer-readable instructions 93 may be divided into one or more modules/units, which are stored in the memory 92 and executed by the processor 91 to carry out the present application. The one or more modules/units may be a series of computer-readable instruction segments capable of performing specific functions, and these instruction segments describe the execution of the computer-readable instructions 93 in the terminal device 90. For example, the computer-readable instructions 93 may be divided into a training corpus acquisition module, a training corpus analysis module, a voice information parsing module, an occurrence probability calculation module, a pronunciation sequence confirmation module, and a recognition result acquisition module. The specific functions of each module are as described in the apparatus embodiments and, to avoid repetition, are not described again here.
The terminal device 90 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. The terminal device 90 may include, but is not limited to, the processor 91 and the memory 92. Those skilled in the art will understand that FIG. 7 is only an example of the terminal device 90 and does not limit it; the terminal device 90 may include more or fewer components than shown, may combine certain components, or may use different components. For example, the terminal device 90 may also include input/output devices, network access devices, buses, and the like.
The processor 91 may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor.
The memory 92 may be an internal storage unit of the terminal device 90, such as a hard disk or internal memory of the terminal device 90. The memory 92 may also be an external storage device of the terminal device 90, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash card (Flash Card) fitted to the terminal device 90. Further, the memory 92 may include both an internal storage unit and an external storage device of the terminal device 90. The memory 92 is used to store the computer-readable instructions and the other programs and data required by the terminal device 90. The memory 92 may also be used to temporarily store data that has been output or is to be output.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the above division into functional units and modules is only an example. In practical applications, the above functions may be allocated to different functional units and modules as needed; that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above.
The above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced with equivalents. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all fall within the scope of protection of the present application.
Claims (20)
- A point-of-interest recognition method, characterized in that the point-of-interest recognition method comprises:
  obtaining a preset training corpus;
  analyzing the preset training corpus using an N-gram model to obtain word sequence data of the preset training corpus, wherein the word sequence data comprises word sequences and a word sequence frequency of each word sequence;
  if voice information to be recognized is received, parsing the voice information to be recognized to obtain M pronunciation sequences of the voice information to be recognized, wherein M is a positive integer greater than 1;
  for each pronunciation sequence, calculating an occurrence probability of the pronunciation sequence according to the word sequence data, thereby obtaining the occurrence probabilities of the M pronunciation sequences;
  selecting, from the occurrence probabilities of the M pronunciation sequences, the pronunciation sequence corresponding to an occurrence probability that reaches a preset probability threshold as a target pronunciation sequence;
  obtaining, from a point-of-interest information base, point-of-interest information corresponding to the target pronunciation sequence as a point-of-interest recognition result of the voice information to be recognized.
- The point-of-interest recognition method according to claim 1, characterized in that calculating, for each pronunciation sequence, the occurrence probability of the pronunciation sequence according to the word sequence data comprises:
  for each pronunciation sequence, obtaining all word segments $a_1, a_2, \ldots, a_{n-1}, a_n$ in the pronunciation sequence, wherein $n$ is a positive integer greater than 1;
  calculating, according to the word sequence data and using the following formula, the probability that the $n$-th word segment $a_n$ of the $n$ word segments appears after the word sequence $(a_1 a_2 \ldots a_{n-1})$, and taking the probability as the occurrence probability of the pronunciation sequence:
  $$P(a_n \mid a_1 a_2 \ldots a_{n-1}) = \frac{C(a_1 a_2 \ldots a_{n-1} a_n)}{C(a_1 a_2 \ldots a_{n-1})}$$
  wherein $P(a_n \mid a_1 a_2 \ldots a_{n-1})$ is the probability that the $n$-th word segment $a_n$ of the $n$ word segments appears after the word sequence $(a_1 a_2 \ldots a_{n-1})$, $C(a_1 a_2 \ldots a_{n-1} a_n)$ is the word sequence frequency of the word sequence $(a_1 a_2 \ldots a_{n-1} a_n)$, and $C(a_1 a_2 \ldots a_{n-1})$ is the word sequence frequency of the word sequence $(a_1 a_2 \ldots a_{n-1})$.
- The point-of-interest recognition method according to claim 1 or 2, characterized in that, before obtaining the preset training corpus, the point-of-interest recognition method further comprises:
  constructing a point-of-interest information base;
  generating a supplementary corpus based on the point-of-interest information base;
  combining the supplementary corpus with a preset basic corpus to obtain the training corpus.
- The point-of-interest recognition method according to claim 3, characterized in that constructing the point-of-interest information base comprises:
  classifying preset basic points of interest according to a preset classification method to obtain basic categories of the point-of-interest information base;
  for each basic category, obtaining, by web crawling, the point-of-interest information of all basic points of interest belonging to that basic category in every administrative region of the country, thereby obtaining the point-of-interest information of the basic category in every administrative region of the country.
- The point-of-interest recognition method according to claim 3, characterized in that generating the supplementary corpus based on the point-of-interest information base comprises:
  extracting the point-of-interest information from the point-of-interest information base;
  performing word segmentation on the point-of-interest information to obtain point-of-interest word segments;
  establishing a mapping relationship between the point-of-interest word segments and the corresponding point-of-interest information, and saving the point-of-interest word segments, the point-of-interest information, and the mapping relationship together in the supplementary corpus.
- The point-of-interest recognition method according to claim 3, characterized in that, after constructing the point-of-interest information base, the point-of-interest recognition method further comprises:
  if an update instruction is received, updating the point-of-interest information base in real time; or automatically updating the point-of-interest information base according to a preset condition.
- A point-of-interest recognition apparatus, characterized in that the point-of-interest recognition apparatus comprises:
  a training corpus acquisition module, configured to obtain a preset training corpus;
  a training corpus analysis module, configured to analyze the preset training corpus using an N-gram model to obtain word sequence data of the preset training corpus, wherein the word sequence data comprises word sequences and a word sequence frequency of each word sequence;
  a voice information parsing module, configured to, if voice information to be recognized is received, parse the voice information to be recognized to obtain M pronunciation sequences of the voice information to be recognized, wherein M is a positive integer greater than 1;
  an occurrence probability calculation module, configured to calculate, for each pronunciation sequence and according to the word sequence data, an occurrence probability of the pronunciation sequence, thereby obtaining the occurrence probabilities of the M pronunciation sequences;
  a pronunciation sequence confirmation module, configured to select, from the occurrence probabilities of the M pronunciation sequences, the pronunciation sequence corresponding to an occurrence probability that reaches a preset probability threshold as a target pronunciation sequence;
  a recognition result acquisition module, configured to obtain, from a point-of-interest information base, point-of-interest information corresponding to the target pronunciation sequence as a point-of-interest recognition result of the voice information to be recognized.
- The point-of-interest recognition apparatus according to claim 7, characterized in that the occurrence probability calculation module comprises:
  a word segment sequence extraction unit, configured to obtain, for each pronunciation sequence, all word segments $a_1, a_2, \ldots, a_{n-1}, a_n$ in the pronunciation sequence, wherein $n$ is a positive integer greater than 1;
  an occurrence probability calculation unit, configured to calculate, according to the word sequence data and using the following formula, the probability that the $n$-th word segment $a_n$ of the $n$ word segments appears after the word sequence $(a_1 a_2 \ldots a_{n-1})$, and to take the probability as the occurrence probability of the pronunciation sequence:
  $$P(a_n \mid a_1 a_2 \ldots a_{n-1}) = \frac{C(a_1 a_2 \ldots a_{n-1} a_n)}{C(a_1 a_2 \ldots a_{n-1})}$$
  wherein $P(a_n \mid a_1 a_2 \ldots a_{n-1})$ is the probability that the $n$-th word segment $a_n$ of the $n$ word segments appears after the word sequence $(a_1 a_2 \ldots a_{n-1})$, $C(a_1 a_2 \ldots a_{n-1} a_n)$ is the word sequence frequency of the word sequence $(a_1 a_2 \ldots a_{n-1} a_n)$, and $C(a_1 a_2 \ldots a_{n-1})$ is the word sequence frequency of the word sequence $(a_1 a_2 \ldots a_{n-1})$.
- The point-of-interest recognition apparatus according to claim 7 or 8, characterized in that the point-of-interest recognition apparatus further comprises:
  a point-of-interest information base construction unit, configured to construct a point-of-interest information base;
  a supplementary corpus acquisition unit, configured to generate a supplementary corpus based on the point-of-interest information base;
  a training corpus generation unit, configured to combine the supplementary corpus with a preset basic corpus to obtain the training corpus.
- The point-of-interest recognition apparatus according to claim 9, characterized in that the point-of-interest information base construction unit comprises:
  a classification subunit, configured to classify preset basic points of interest according to a preset classification method to obtain basic categories of the point-of-interest information base;
  an information acquisition subunit, configured to obtain, for each basic category and by web crawling, the point-of-interest information of all basic points of interest belonging to that basic category in every administrative region of the country, thereby obtaining the point-of-interest information of the basic category in every administrative region of the country.
- A terminal device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, characterized in that the processor, when executing the computer-readable instructions, implements the following steps:
  obtaining a preset training corpus;
  analyzing the preset training corpus using an N-gram model to obtain word sequence data of the preset training corpus, wherein the word sequence data comprises word sequences and a word sequence frequency of each word sequence;
  if voice information to be recognized is received, parsing the voice information to be recognized to obtain M pronunciation sequences of the voice information to be recognized, wherein M is a positive integer greater than 1;
  for each pronunciation sequence, calculating an occurrence probability of the pronunciation sequence according to the word sequence data, thereby obtaining the occurrence probabilities of the M pronunciation sequences;
  selecting, from the occurrence probabilities of the M pronunciation sequences, the pronunciation sequence corresponding to an occurrence probability that reaches a preset probability threshold as a target pronunciation sequence;
  obtaining, from a point-of-interest information base, point-of-interest information corresponding to the target pronunciation sequence as a point-of-interest recognition result of the voice information to be recognized.
- The terminal device according to claim 11, wherein calculating, for each pronunciation sequence, the occurrence probability of the pronunciation sequence according to the word sequence data comprises: for each pronunciation sequence, obtaining all segmented words $a_1, a_2, \ldots, a_{n-1}, a_n$ in the pronunciation sequence, where $n$ is a positive integer greater than 1; and, according to the word sequence data, using the following formula to calculate the probability that the $n$-th segmented word $a_n$ appears after the word sequence $(a_1 a_2 \ldots a_{n-1})$, and taking that probability as the occurrence probability of the pronunciation sequence: $P(a_n \mid a_1 a_2 \ldots a_{n-1}) = \dfrac{C(a_1 a_2 \ldots a_{n-1} a_n)}{C(a_1 a_2 \ldots a_{n-1})}$, where $P(a_n \mid a_1 a_2 \ldots a_{n-1})$ is the probability that the $n$-th of the $n$ segmented words, $a_n$, appears after the word sequence $(a_1 a_2 \ldots a_{n-1})$, $C(a_1 a_2 \ldots a_{n-1} a_n)$ is the word sequence frequency of the word sequence $(a_1 a_2 \ldots a_{n-1} a_n)$, and $C(a_1 a_2 \ldots a_{n-1})$ is the word sequence frequency of the word sequence $(a_1 a_2 \ldots a_{n-1})$.
- The terminal device according to claim 11 or 12, wherein, before obtaining the preset training corpus, the processor, when executing the computer-readable instructions, further implements the following steps: constructing a point of interest information database; generating a supplementary corpus based on the point of interest information database; and combining the supplementary corpus with a preset basic corpus to obtain the training corpus.
- The terminal device according to claim 13, wherein constructing the point of interest information database comprises: classifying preset basic points of interest according to a preset classification method to obtain the basic categories of the point of interest information database; and, for each basic category, obtaining by web crawling the point of interest information of all basic points of interest in that basic category in every administrative region of the country, thereby obtaining the point of interest information of that basic category in every administrative region of the country.
- The terminal device according to claim 13, wherein generating the supplementary corpus based on the point of interest information database comprises: extracting the point of interest information from the point of interest information database; performing word segmentation on the point of interest information to obtain point of interest segmented words; and establishing a mapping relationship between the point of interest segmented words and the corresponding point of interest information, and storing the point of interest segmented words, the point of interest information, and the mapping relationship correspondingly in the supplementary corpus.
- One or more non-volatile readable storage media storing computer-readable instructions, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps: obtaining a preset training corpus; analyzing the preset training corpus using an N-gram model to obtain word sequence data of the preset training corpus, wherein the word sequence data comprises word sequences and the word sequence frequency of each word sequence; if speech information to be recognized is received, parsing the speech information to be recognized to obtain M pronunciation sequences of the speech information to be recognized, where M is a positive integer greater than 1; for each pronunciation sequence, calculating the occurrence probability of the pronunciation sequence according to the word sequence data, thereby obtaining the occurrence probabilities of the M pronunciation sequences; selecting, from the occurrence probabilities of the M pronunciation sequences, the pronunciation sequence whose occurrence probability reaches a preset probability threshold as the target pronunciation sequence; and obtaining, from a point of interest information database, the point of interest information corresponding to the target pronunciation sequence as the point of interest recognition result of the speech information to be recognized.
- The non-volatile readable storage medium according to claim 16, wherein calculating, for each pronunciation sequence, the occurrence probability of the pronunciation sequence according to the word sequence data comprises: for each pronunciation sequence, obtaining all segmented words $a_1, a_2, \ldots, a_{n-1}, a_n$ in the pronunciation sequence, where $n$ is a positive integer greater than 1; and, according to the word sequence data, using the following formula to calculate the probability that the $n$-th segmented word $a_n$ appears after the word sequence $(a_1 a_2 \ldots a_{n-1})$, and taking that probability as the occurrence probability of the pronunciation sequence: $P(a_n \mid a_1 a_2 \ldots a_{n-1}) = \dfrac{C(a_1 a_2 \ldots a_{n-1} a_n)}{C(a_1 a_2 \ldots a_{n-1})}$, where $P(a_n \mid a_1 a_2 \ldots a_{n-1})$ is the probability that the $n$-th of the $n$ segmented words, $a_n$, appears after the word sequence $(a_1 a_2 \ldots a_{n-1})$, $C(a_1 a_2 \ldots a_{n-1} a_n)$ is the word sequence frequency of the word sequence $(a_1 a_2 \ldots a_{n-1} a_n)$, and $C(a_1 a_2 \ldots a_{n-1})$ is the word sequence frequency of the word sequence $(a_1 a_2 \ldots a_{n-1})$.
- The non-volatile readable storage medium according to claim 16 or 17, wherein, before obtaining the preset training corpus, the computer-readable instructions, when executed by one or more processors, cause the one or more processors to further perform the following steps: constructing a point of interest information database; generating a supplementary corpus based on the point of interest information database; and combining the supplementary corpus with a preset basic corpus to obtain the training corpus.
- The non-volatile readable storage medium according to claim 18, wherein constructing the point of interest information database comprises: classifying preset basic points of interest according to a preset classification method to obtain the basic categories of the point of interest information database; and, for each basic category, obtaining by web crawling the point of interest information of all basic points of interest in that basic category in every administrative region of the country, thereby obtaining the point of interest information of that basic category in every administrative region of the country.
- The non-volatile readable storage medium according to claim 18, wherein generating the supplementary corpus based on the point of interest information database comprises: extracting the point of interest information from the point of interest information database; performing word segmentation on the point of interest information to obtain point of interest segmented words; and establishing a mapping relationship between the point of interest segmented words and the corresponding point of interest information, and storing the point of interest segmented words, the point of interest information, and the mapping relationship correspondingly in the supplementary corpus.
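
The recognition steps recited in the claims above (counting word sequence frequencies, computing the occurrence probability of each candidate pronunciation sequence as a ratio of frequencies, applying the probability threshold, and looking up the point of interest) can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the function names, the toy corpus, the threshold value, and the dictionary standing in for the point of interest information database are all hypothetical, and returning 0 for an unseen prefix (no smoothing) is an assumption the claims do not address.

```python
from collections import Counter


def count_word_sequences(corpus, max_n):
    """Count every contiguous word sequence of length 1..max_n (the word sequence frequencies)."""
    counts = Counter()
    for sentence in corpus:
        for n in range(1, max_n + 1):
            for i in range(len(sentence) - n + 1):
                counts[tuple(sentence[i:i + n])] += 1
    return counts


def occurrence_probability(words, counts):
    """P(a_n | a_1 ... a_{n-1}) = C(a_1 ... a_n) / C(a_1 ... a_{n-1})."""
    prefix_count = counts[tuple(words[:-1])]
    if prefix_count == 0:
        return 0.0  # assumption: an unseen prefix gets probability 0 (no smoothing)
    return counts[tuple(words)] / prefix_count


def recognize(candidates, counts, poi_db, threshold):
    """Keep candidates whose probability reaches the threshold and map them to POI info."""
    results = []
    for words in candidates:
        p = occurrence_probability(words, counts)
        if p >= threshold and tuple(words) in poi_db:
            results.append((poi_db[tuple(words)], p))
    return results


# Hypothetical, already-segmented training corpus.
corpus = [
    ["beijing", "west", "station"],
    ["beijing", "west", "station"],
    ["beijing", "west", "road"],
    ["shanghai", "south", "station"],
]
counts = count_word_sequences(corpus, max_n=3)

# M = 2 hypothetical candidate pronunciation sequences for one utterance.
candidates = [["beijing", "west", "station"], ["beijing", "west", "road"]]

# Hypothetical stand-in for the point of interest information database.
poi_db = {("beijing", "west", "station"): "Beijing West Railway Station"}

matches = recognize(candidates, counts, poi_db, threshold=0.5)
```

In this toy corpus the prefix ("beijing", "west") occurs three times and is followed by "station" twice, so the first candidate scores 2/3 and passes the 0.5 threshold, while the second scores 1/3 and is discarded.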
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810529490.2 | 2018-05-29 | ||
CN201810529490.2A CN108831442A (en) | 2018-05-29 | 2018-05-29 | Point of interest recognition methods, device, terminal device and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019227581A1 (en) | 2019-12-05 |
Family
ID=64146126
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2018/094372 WO2019227581A1 (en) | 2018-05-29 | 2018-07-03 | Interest point recognition method, apparatus, terminal device, and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN108831442A (en) |
WO (1) | WO2019227581A1 (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109830226A (en) * | 2018-12-26 | 2019-05-31 | 出门问问信息科技有限公司 | A kind of phoneme synthesizing method, device, storage medium and electronic equipment |
CN111401355A (en) * | 2018-12-29 | 2020-07-10 | 北京奇虎科技有限公司 | A method and device for identifying POI data aggregation relationship |
CN109871534B (en) * | 2019-01-10 | 2020-03-24 | 北京海天瑞声科技股份有限公司 | Method, device and equipment for generating Chinese-English mixed corpus and storage medium |
CN110263248B (en) * | 2019-05-21 | 2023-11-28 | 平安科技(深圳)有限公司 | Information pushing method, device, storage medium and server |
CN110334321B (en) * | 2019-06-24 | 2023-03-31 | 天津城建大学 | City rail transit station area function identification method based on interest point data |
CN112988989B (en) * | 2019-12-18 | 2022-08-12 | 中国移动通信集团四川有限公司 | A place name address matching method and server |
CN111209363B (en) * | 2019-12-25 | 2024-02-09 | 华为技术有限公司 | Corpus data processing method, corpus data processing device, server and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103262156A (en) * | 2010-08-27 | 2013-08-21 | 思科技术公司 | Speech recognition language model |
CN105550169A (en) * | 2015-12-11 | 2016-05-04 | 北京奇虎科技有限公司 | Method and device for identifying point of interest names based on character length |
CN106503131A (en) * | 2016-10-19 | 2017-03-15 | 北京小米移动软件有限公司 | Obtain the method and device of interest information |
US9899021B1 (en) * | 2013-12-20 | 2018-02-20 | Amazon Technologies, Inc. | Stochastic modeling of user interactions with a detection system |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101271450B (en) * | 2007-03-19 | 2010-09-29 | 株式会社东芝 | Method and device for cutting language model |
CN103674012B (en) * | 2012-09-21 | 2017-09-29 | 高德软件有限公司 | Speech customization method and its device, audio recognition method and its device |
US9396723B2 (en) * | 2013-02-01 | 2016-07-19 | Tencent Technology (Shenzhen) Company Limited | Method and device for acoustic language model training |
CN103198828B (en) * | 2013-04-03 | 2015-09-23 | 中金数据系统有限公司 | The construction method of speech corpus and system |
CN107154260B (en) * | 2017-04-11 | 2020-06-16 | 北京儒博科技有限公司 | Domain-adaptive speech recognition method and device |
CN107204184B (en) * | 2017-05-10 | 2018-08-03 | 平安科技(深圳)有限公司 | Audio recognition method and system |
-
2018
- 2018-05-29 CN CN201810529490.2A patent/CN108831442A/en active Pending
- 2018-07-03 WO PCT/CN2018/094372 patent/WO2019227581A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN108831442A (en) | 2018-11-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2019227581A1 (en) | Interest point recognition method, apparatus, terminal device, and storage medium | |
CN106777274B (en) | A kind of Chinese tour field knowledge mapping construction method and system | |
US9558179B1 (en) | Training a probabilistic spelling checker from structured data | |
CN106649818B (en) | Application search intent identification method, device, application search method and server | |
CN104462126B (en) | A kind of entity link method and device | |
CN103218444B (en) | Based on semantic method of Tibetan language webpage text classification | |
WO2015149533A1 (en) | Method and device for word segmentation processing on basis of webpage content classification | |
CN103544266B (en) | A kind of method and device for searching for suggestion word generation | |
CN111488468B (en) | Geographic information knowledge point extraction method and device, storage medium and computer equipment | |
CN106126619A (en) | A kind of video retrieval method based on video content and system | |
CN112559658B (en) | A method and device for address matching | |
CN110298039B (en) | Event place identification method, system, equipment and computer readable storage medium | |
CN105550169A (en) | Method and device for identifying point of interest names based on character length | |
CN106599215A (en) | Question generation method and question generation system based on deep learning | |
CN116842168B (en) | Cross-domain problem processing method and device, electronic equipment and storage medium | |
CN105608075A (en) | Related knowledge point acquisition method and system | |
CN113535883A (en) | Business place entity linking method, system, electronic device and storage medium | |
CN105159885A (en) | Point-of-interest name identification method and device | |
Sagcan et al. | Toponym recognition in social media for estimating the location of events | |
CN111680122B (en) | Space data active recommendation method and device, storage medium and computer equipment | |
CN105138708A (en) | Method and device for identifying names of points of interest (POI) | |
EP3631737A1 (en) | Automated classification of network-accessible content | |
Mehta et al. | Natural language processing approach and geospatial clustering to explore the unexplored geotags using media | |
Chang et al. | Enhancing POI search on maps via online address extraction and associated information segmentation | |
CN105279249A (en) | A method and device for determining the confidence level of point-of-interest data in a website |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 18921270; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 12.03.2021) |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 18921270; Country of ref document: EP; Kind code of ref document: A1 |