CN101393544A - Chinese Address Semantic Analysis Method Oriented to Address Coding - Google Patents
Chinese Address Semantic Analysis Method Oriented to Address Coding
- Publication number
- CN101393544A (application CN200810156588A)
- Authority
- CN
- China
- Prior art keywords
- address
- numbers
- character
- chinese
- string
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Machine Translation (AREA)
Abstract
The invention discloses a Chinese address semantic analysis method oriented to address coding. The steps are as follows. Step 1: build an address feature-character library from sample data: a. establish the sample data; b. select feature characters; c. select auxiliary feature characters. The selected feature characters and auxiliary feature characters together form the feature-character library. Step 2: using the feature-character library and the address representation rules, convert a Chinese address into a string of digits. Step 3: build the address parsing rule library. Step 4: semantic analysis, comprising address representation, which converts the original address into its digit representation; address parsing, which splits the digit-represented address into address elements; and address restoration, which restores the digit-represented parsing result to the corresponding substrings of the original address.
Description
Technical field
The invention proposes a Chinese address semantic analysis method that does not rely on a gazetteer (place-name dictionary). It is applicable to address coding in geographic information systems (GIS) used in real estate management, land management, urban planning, public security, postal services, taxation, telecommunications, public health, directory-enquiry services such as 号码百事通, and similar fields.
Background art
In everyday production and life, the address is one of the most commonly used reference systems for describing a geographic location in natural language, and address descriptions are the most common means of recording spatial position in business systems of all kinds. Address coding technology can convert the large volume of address information that already exists in management information systems (MIS) into geographic coordinates usable by a geographic information system (GIS). A GIS can then integrate, store, retrieve, manipulate and analyse geographic data, link data scattered across departments through a spatial reference system, and provide decision support for land use, resource management, environmental monitoring, transportation, urban planning and so on, which greatly promotes the application of GIS technology.
Address coding is the process of semantically parsing address information described in natural language according to an address model and coding rules, matching it against a database, and associating it with the corresponding spatial coordinates and geographic code. Its basic principle is shown in Figure 1.
Address coding has to solve three key technical problems. ① Address semantic analysis: splitting an address described in natural language into address elements, each of which designates a specific geographic extent within some bounded area. For example, “南京市鼓楼区宁海路122号” (No. 122, Ninghai Road, Gulou District, Nanjing City) is parsed into four address elements: “南京市” (Nanjing City), “鼓楼区” (Gulou District), “宁海路” (Ninghai Road) and “122号” (No. 122). The elements of an address are ordered from the largest extent to the smallest, and a later element is meaningful only relative to the elements before it. ② Address model: the address model describes the rules by which address elements are composed in the various types of address. ③ Address matching: the process of matching a computer-parsed address against the standard addresses in a GIS, according to the established address model and coding rules, and returning geographic coordinates.
Since the 1970s the United States has been building a national geocoding system, Dual Independent Map Encoding (DIME), whose development was a milestone in the history of GIS technology. In the late 1980s the US Census Bureau developed DIME into the Topologically Integrated Geographic Encoding and Referencing (TIGER) system. Because the TIGER database has wide coverage, good accuracy, guaranteed updates and low cost, it has become the recognized address coding reference standard in the United States. Geocoding databases and geocoding software tools outside China have been commercialized, and many content standards and specifications for address data exist, such as the public draft of the FGDC address data content standard. Most GIS software from outside China includes an address coding function, for example MapMaker in MapInfo, Geocoding in ArcGIS and Geocodes Addresses in GeoMedia, and these offer high response speed and accuracy in practice.
Two factors explain the successful adoption of address coding outside China. First, from a linguistic point of view, words in address descriptions written in English and other Western languages are separated by spaces. Second, from a standardization point of view, the naming and expression of address data, as well as software development and application services, follow principles of standardization and normalization. Address semantic analysis in these systems therefore relies on simple character matching against the place names in a standard address database (collectively, the "dictionary matching method").
Specifically, the dictionary matching method uses the place-name data in a GIS as a gazetteer, applies string matching algorithms for lexical and syntactic analysis, and matches the address string against the place names in the dictionary in order to parse the address. It can only parse place names in the address string that are identical to dictionary entries. For example, if the dictionary contains the three place names “南京市” (Nanjing City), “鼓楼区” (Gulou District) and “宁海路” (Ninghai Road) but not “文苑路” (Wenyuan Road), the address “南京市鼓楼区宁海路122号” is parsed into the four address elements “南京市”, “鼓楼区”, “宁海路” and “122号”, whereas “南京市文苑路12号” is parsed only into “南京市” and “文苑路12号”. The effectiveness of the dictionary matching method therefore grows with the size and update rate of the dictionary, while its efficiency falls as the dictionary grows. However, because of the historical and cultural characteristics of the Chinese language and a serious lack of address coding standards, Chinese addresses are highly irregular: address naming follows few rules, formats are complex, one place may have several names, and names are added, deleted and changed rather freely.
The dictionary matching method clearly cannot meet the needs of large-scale data processing in terms of scope of application, maintenance and updating, accuracy or response speed. Semantic analysis of Chinese addresses has become the key problem that Chinese address coding technology now needs to solve (Huang Song. Research on Chinese Address Coding Technology [D]. Peking University: Master's thesis, 2005).
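The limitation described above can be illustrated with a minimal sketch of dictionary matching. It assumes a greedy longest-match scan, which is only one possible form of the method; the function name `dictionary_match` and the three-entry gazetteer are illustrative and do not come from the patent.

```python
def dictionary_match(address, gazetteer):
    """Greedy longest-match segmentation against a place-name dictionary.

    Characters that no dictionary entry covers are collected into one
    unsplit residue chunk, which is exactly where the method breaks down.
    """
    elements = []
    residue = ""
    max_len = max((len(name) for name in gazetteer), default=0)
    i = 0
    while i < len(address):
        match = None
        # Try the longest candidate first.
        for length in range(min(max_len, len(address) - i), 0, -1):
            candidate = address[i:i + length]
            if candidate in gazetteer:
                match = candidate
                break
        if match:
            if residue:
                elements.append(residue)
                residue = ""
            elements.append(match)
            i += len(match)
        else:
            residue += address[i]
            i += 1
    if residue:
        elements.append(residue)
    return elements


gazetteer = {"南京市", "鼓楼区", "宁海路"}  # hypothetical three-entry dictionary
print(dictionary_match("南京市鼓楼区宁海路122号", gazetteer))
print(dictionary_match("南京市文苑路12号", gazetteer))
```

With this dictionary the first address splits into four elements, while the second leaves “文苑路12号” as a single chunk because “文苑路” is not an entry.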
Since the 1980s, research on address coding in China has concentrated on standard address databases and address matching algorithms. For example, Beijing and Shanghai have successively issued a series of coding standards for urban roads, road intersections and the like (Zhu Jianwei, Wang Zemin. Principles of Geocoding and Its Localized Solutions [J]. Beijing Surveying and Mapping, 2004(2), 24-27). Address semantic analysis in existing systems uses the dictionary matching method throughout, for example "寻址神" from Beijing Changdi Computer Company, "Map Searcher" from Peking University Founder, the "Beijing Address Coding Database System and Standard Address Matching Engine" from Zhaoxi Technology, the "Customer Relationship Management System" from Beijing SuperMap, and EzGeoCoding from Shanhai Yihui.
Summary of the invention
The technical problem to be solved by the invention is to overcome the defects of the prior art and provide a Chinese address semantic analysis method oriented to address coding that parses addresses without relying on a gazetteer.
The technical procedure of the method is as follows:
Step 1: Build the address feature-character library from sample data
A Chinese address comprises four types of address element: administrative division, street, door/building number, and supplementary information. The elements are arranged from the largest geographic extent to the smallest, with supplementary information last, to form the Chinese address string. Some types of element may be absent from a given address string.
Administrative divisions follow the Codes for the Administrative Divisions of the People's Republic of China (GB 2260-1995) and are divided into four levels, ordered from largest to smallest for administrative areas at or above the village level (levels may be missing): the first level comprises provinces, autonomous regions, centrally administered municipalities and special administrative regions; the second level comprises cities, prefectures, autonomous prefectures, leagues, and the municipal districts and counties of the centrally administered municipalities; the third level comprises counties, municipal districts, county-level cities and banners; the fourth level comprises townships, towns and villages.
In general, one address contains the names of several administrative divisions at different levels. For example, “南京市鼓楼区宁海路122号” contains two such names, “南京市” (Nanjing City, second level) and “鼓楼区” (Gulou District, third level).
The street element is a road name and/or street name.
The door/building number element is a house number, building number, building name and/or room number.
Supplementary information is an institution name or a word expressing a spatial relation (east, west, south, north and so on) appended after the door/building number. For example, “滨江市场” (Binjiang Market) in “南京市鼓楼区江东北路301号滨江市场” is an institution name, and “西” (west) in “南京市江浦县永宁镇西葛街西” is a word expressing a spatial direction.
A Chinese address string can be split into several address elements of different types. Except for supplementary information, an address element is a combination of ordinary characters followed by a feature character, where:
The feature characters of administrative divisions are: 省 (province), 自治区 (autonomous region), 直辖市 (centrally administered municipality), 特别行政区 (special administrative region), 市 (city), 地区 (prefecture), 自治州 (autonomous prefecture), 盟 (league), 区 (district), 县 (county), 旗 (banner), 乡 (township), 镇 (town), 村 (village), 屯 (hamlet), 庄 (village), etc.;
The feature characters of streets are: 路 (road), 街道 (street), 街 (street), 大街 (main street), 大道 (avenue), 马路 (road), 里 (li), 弄 (lane), 胡同 (hutong), 巷 (alley), 条 (tiao), etc.;
The feature characters of door/building numbers are: 号 (number), 楼 (building), 宿舍 (dormitory), 斋 (zhai), 馆 (hall), 堂 (hall), etc.;
Building the address feature-character library comprises the following steps:
1. Establish the sample data: separate the individual address elements in the original address data to form the sample data.
Table 1. Example of the original data
Table 2. Example of the sample data
2. Select feature characters: a feature character marks the end of an address element and can be regarded as the unit of that element; in most cases an address can be divided fairly accurately into independent semantic units on the basis of feature characters alone. Selection process: for all address elements in the sample data, count separately the frequency of the final character and of the final two characters, and sort each list in descending order. The single characters that make up a cumulative frequency of 80% or more are selected as feature characters (called "single feature characters"). The two-character endings that make up a cumulative frequency of 80% or more, and whose final character is not a single feature character, are also selected as feature characters (called "compound feature characters").
3. Select auxiliary feature characters: Chinese addresses usually contain words expressing spatial relations, such as 东 (east), 南 (south), 西 (west) and 北 (north), which help determine where address elements should be split. These words are selected as auxiliary feature characters.
The selected feature characters and auxiliary feature characters together form the feature-character library.
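The selection in step 2 can be sketched as below. This is an illustration under one reading of the 80% criterion: endings are taken in descending order of frequency until their cumulative share reaches the threshold. The function names and the ten-element sample are hypothetical and are not taken from the patent's Tables 1 and 2.

```python
from collections import Counter


def select_by_cumulative_share(counts, threshold=0.8):
    """Take the most frequent items until their cumulative share reaches threshold."""
    total = sum(counts.values())
    selected, cumulative = [], 0
    for item, n in counts.most_common():
        if cumulative >= threshold * total:
            break
        selected.append(item)
        cumulative += n
    return selected


def build_feature_characters(elements, threshold=0.8):
    """Return (single feature characters, compound feature characters)."""
    # Frequency of the final character of every address element.
    single_counts = Counter(e[-1] for e in elements if e)
    singles = select_by_cumulative_share(single_counts, threshold)
    # Two-character endings count only when the final character
    # is not already a single feature character.
    double_counts = Counter(
        e[-2:] for e in elements if len(e) >= 2 and e[-1] not in singles
    )
    doubles = select_by_cumulative_share(double_counts, threshold)
    return singles, doubles


# Hypothetical sample of already-separated address elements.
sample = ["鼓楼区", "玄武区", "栖霞区", "白下区",
          "宁海路", "中山路", "应天路",
          "122号", "5号", "学生宿舍"]
print(build_feature_characters(sample))  # (['区', '路', '号'], ['宿舍'])
```

In this sample 区, 路 and 号 together cover 9 of 10 endings, so they become single feature characters, and 宿舍 is picked up as a compound feature character because 舍 was not selected on its own.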
Step 2: Using the feature-character library and the address representation rules, convert the Chinese address into a string of digits
To ease computer processing, the Chinese address string is converted into a digit representation in which 1 denotes a feature character, 2 an auxiliary feature character, 3 the second of two consecutive repeated feature characters, 0 an ordinary character, and 9 the terminator. Ordinary characters carry no information for formulating splitting rules, so each run of consecutive 0s is compressed into a single 0. For example, “江苏省六合县八百桥镇冶东村小林32号” is represented as “01010110212019”, and “建邺区应天路叶圩村村部” as “0101011309”.
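The representation rule can be sketched as follows. The feature and auxiliary sets below are small hypothetical stand-ins for the patent's library, which is not reproduced in this text, so the sketch is not expected to regenerate the two digit strings quoted above; `encode_address` is an illustrative name.

```python
import re


def encode_address(address, features, auxiliaries):
    """Map each character to a digit, append the terminator 9, compress 0-runs."""
    digits = []
    prev_char = None
    for ch in address:
        if ch in features:
            # 3 marks the second of two identical consecutive feature characters.
            digits.append("3" if ch == prev_char else "1")
        elif ch in auxiliaries:
            digits.append("2")
        else:
            digits.append("0")
        prev_char = ch
    encoded = "".join(digits) + "9"
    # Ordinary characters carry no splitting information: collapse runs of 0.
    return re.sub(r"0+", "0", encoded)


FEATURES = {"区", "巷", "村"}           # assumed single feature characters
AUXILIARIES = {"东", "南", "西", "北"}  # assumed auxiliary feature characters

print(encode_address("白下区南台巷", FEATURES, AUXILIARIES))  # 012019
print(encode_address("叶圩村村部", FEATURES, AUXILIARIES))    # 01309
```

Only single-character features are handled here; compound feature characters would need a lookahead of one character, which is left out to keep the sketch short.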
Step 3: Build the address parsing rule library
Once a Chinese address has been converted into a digit string, the string always obeys the following rules:
● "0" can only be followed by one of "1", "2", "9";
● "1" can only be followed by one of "0", "1", "2", "3", "9";
● "2" can only be followed by one of "0", "1", "2", "9";
● "3" can only be followed by one of "0", "1", "2", "9";
● The string can only start with one of "0", "1", "2", "3";
● The string can only end with "9".
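The six composition rules above translate directly into a small validity check. This is a sketch for illustration; the names `ALLOWED_NEXT` and `is_well_formed` are not from the patent.

```python
# Digits that may follow each digit, copied from the rules above.
ALLOWED_NEXT = {
    "0": set("129"),
    "1": set("01239"),
    "2": set("0129"),
    "3": set("0129"),
}


def is_well_formed(digits):
    """Check a digit string against the composition rules."""
    # Must start with 0, 1, 2 or 3 and end with the terminator 9.
    if not digits or digits[0] not in "0123" or digits[-1] != "9":
        return False
    for cur, nxt in zip(digits, digits[1:]):
        # A 9 anywhere before the last position is invalid.
        if cur == "9" or nxt not in ALLOWED_NEXT[cur]:
            return False
    return True


print(is_well_formed("012019"))  # True
print(is_well_formed("0019"))    # False: 0 may not follow 0 after compression
```

Such a check is useful mainly as a sanity test on the output of the representation step before any parsing rule is applied.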
Under these rules the possible addresses can be expressed as a tree in which each path represents one parsing rule. The first-level nodes of the tree are "0", "1", "2" and "3", and their descendant nodes are organized according to the rules above. Once a path reaches a certain length, however, the split point of an address element can be determined, and the path is not extended further. Each parsing rule must also state a specific split position, written as "f" followed by the position. Taking the addresses in the sample data, the frequency with which each rule applies is counted, and the parsing paths that make up a cumulative frequency of 95% or more are selected as parsing rules (see Table 3). The tree structure of the parsing rules is shown in Figure 3.
Table 3. Address parsing rules
In Figure 3, the rule "0120f2" means that the nodes on successive levels are "0", "1", "2" and "0". When an address string is scanned and a sequence matches "0120", the string is split after the second digit; "f2" denotes that split position. For example, “白下区南台巷” (Nantai Lane, Baixia District) is represented as "01201", which matches the rule "0120", so it is split after the leading "01" and parsed into “白下区” and “南台巷”. The rule "029f2" means that when part of the sequence in the address string matches "029", the string is split after the second digit. Because "9" marks the end of the digit string, this rule also means splitting at the end.
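The "pattern + f + position" notation can be read mechanically, as the following sketch shows. It uses only the two rules quoted in the text, "0120f2" and "029f2"; the helper names `parse_rule` and `apply_rule` are illustrative.

```python
def parse_rule(rule):
    """Split a rule such as '0120f2' into its pattern and split position."""
    pattern, _, position = rule.partition("f")
    return pattern, int(position)


def apply_rule(digits, rule):
    """Return (left, right) if digits starts with the rule pattern, else None."""
    pattern, position = parse_rule(rule)
    if digits.startswith(pattern):
        return digits[:position], digits[position:]
    return None


print(parse_rule("0120f2"))            # ('0120', 2)
print(apply_rule("012019", "0120f2"))  # ('01', '2019')
print(apply_rule("029", "029f2"))      # ('02', '9')
```

For “白下区南台巷”, the left part "01" corresponds to “白下区”, and the right part "2019" corresponds to “南台巷” followed by the terminator.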
Step 4: Semantic analysis
On the basis of the feature characters and parsing rules, the invention provides a Chinese address parsing algorithm (the "RBAI algorithm"). The algorithm has three parts: address representation, address parsing and address restoration. The parsing process is as follows:
Input: one original address, denoted Address_Before;
Output: the parsing result for that address, denoted Address_After.
(1) Address representation: convert the original address into its digit representation; the result is Numbers_Before.
Step 1: Set Numbers_Before to the empty string; let n be the length of the original address to be parsed.
Step 2: For i from 1 to n, repeat:
If the i-th character of the original address is a main feature character, set Numbers_Before[i] to 1;
If the i-th character is a main feature character that repeats the feature character immediately before it, set Numbers_Before[i] to 3;
If the i-th character of the original address is an auxiliary feature character, set Numbers_Before[i] to 2;
If the i-th character of the original address is an ordinary character, set Numbers_Before[i] to 0;
Set i to i+1;
End loop;
Step 3: Append 9 to the end of Numbers_Before; compress each run of consecutive 0s in Numbers_Before into a single 0.
(2) Address parsing: split Numbers_Before into address elements according to the parsing rules; the result is Numbers_After.
Step 4: Set Numbers_After to the empty string; let k be the length of Numbers_Before.
Step 5: For m from 1 to k, repeat:
If the leftmost m characters of Numbers_Before match a parsing rule, split Numbers_Before into a left and a right substring as the rule specifies; the left substring, Numbers_Left, is saved as one address element of the parsing result and is not split further; the right substring, Numbers_Right, continues to be split; Numbers_Before is redefined as Numbers_Right;
Set m to the length of Numbers_Left + 1;
Otherwise
Set m to m+1;
End loop;
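The splitting loop of step 5 can be sketched as below. Two points are assumptions rather than statements of the patent: after each split the prefix length restarts at 1 on the right-hand substring (a simplification of the update to m given above), and the rule set is hypothetical apart from "0120" and "029", which are quoted earlier. Table 3 is not reproduced in this text.

```python
def split_digits(digits, rules):
    """Split an encoded address into per-element digit pieces.

    rules maps a prefix pattern to the position after which to split.
    """
    pieces = []
    remaining = digits
    m = 1
    while m <= len(remaining):
        prefix = remaining[:m]
        if prefix in rules:
            cut = rules[prefix]
            if cut < 1:
                raise ValueError("split position must be at least 1")
            pieces.append(remaining[:cut])  # left part: one finished element
            remaining = remaining[cut:]     # right part: keep splitting
            m = 1                           # assumption: rescan from length 1
        else:
            m += 1
    tail = remaining.rstrip("9")            # drop the terminator
    if tail:
        pieces.append(tail)
    return pieces


# "0120" and "029" come from the text; "0101" and "019" are hypothetical.
RULES = {"0120": 2, "029": 2, "0101": 2, "019": 2}

print(split_digits("012019", RULES))   # ['01', '201']
print(split_digits("0101019", RULES))  # ['01', '01', '01']
```

Any digits left over when no further rule matches are kept as a final piece, so no part of the address is silently discarded.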
(3) Address restoration: Step 6: restore the digit-represented parsing result to the corresponding substrings of the original address; the result is Address_After.
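Restoration needs to know which original characters stand behind each digit, because runs of ordinary characters were compressed into a single 0. One way to keep that link, shown here as an assumption rather than as the patent's implementation, is to record a text span per digit during encoding. The names `encode_with_spans` and `restore` and the feature sets are illustrative.

```python
def encode_with_spans(address, features, auxiliaries):
    """Return (digits, spans), where spans[k] is the text behind digit k."""
    digits, spans = [], []
    prev = None
    for ch in address:
        if ch in features:
            code = "3" if ch == prev else "1"
        elif ch in auxiliaries:
            code = "2"
        else:
            code = "0"
        if code == "0" and digits and digits[-1] == "0":
            spans[-1] += ch  # extend the current run of ordinary characters
        else:
            digits.append(code)
            spans.append(ch)
        prev = ch
    return "".join(digits) + "9", spans


def restore(pieces, spans):
    """Map digit pieces back to the substrings of the original address."""
    elements, start = [], 0
    for piece in pieces:
        end = start + len(piece)
        elements.append("".join(spans[start:end]))
        start = end
    return elements


digits, spans = encode_with_spans("白下区南台巷", {"区", "巷"}, {"南"})
print(digits)                         # 012019
print(restore(["01", "201"], spans))  # ['白下区', '南台巷']
```

The pieces passed to `restore` are the digit substrings produced by the parsing step, with the terminator already removed.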
Compared with the address parsing techniques used in existing Chinese address coding, the invention has the following main advantages:
a. No dependence on a dictionary: dictionary construction and updating are avoided, and address element names that are not recorded in the GIS (or dictionary) can still be parsed;
b. No dependence on natural language processing techniques such as Chinese word segmentation;
c. High efficiency: because no dictionary-based string matching algorithm is used and each address is processed on its own, efficiency is significantly improved;
d. Broad applicability: the feature-character library and parsing rules can be updated quickly by updating the sample data to suit the application, while the parsing algorithm itself needs no change.
e. Simple to implement and easy to deploy: the feature-character library and parsing rules are highly reusable, the algorithm is simple, and it is easy to embed in application systems of all kinds.
Description of the drawings
Figure 1 is a schematic diagram of the basic principle and process of address coding;
Figure 2 is a schematic flow chart of the method of the invention;
Figure 3 shows the parsing rules based on the tree structure;
Figure 4 is a schematic diagram of address parsing (address: 六合县雄州镇朝天街108号);
Figure 5 is a schematic diagram of address parsing (address: 江苏省六合县八百镇金山村);
Figure 6 is a schematic diagram of address parsing (address: 六合县六城镇泰山村82号);
Figure 7 is a schematic diagram of address parsing (address: 六合区八百桥镇街道);
Figure 8 is a schematic diagram of address parsing (address: 六合区雄州镇健康巷1号-2);
Figure 9 is a schematic diagram of address parsing (address: 南京市玄武区明故宫4号);
Figure 10 is a schematic diagram of address parsing (address: 六合区雄州镇中心农贸市场);
Figure 11 is a schematic diagram of address parsing (address: 北门桥路5号302室);
Figure 12 is a schematic diagram of address parsing (address: 六合区程桥镇东大桥边);
Figure 13 is a schematic diagram of address parsing (address: 玄武区相府营14号104室);
Figure 14 is a flow chart of the Chinese address semantic analysis algorithm;
Figure 15 is a flow chart of the address representation algorithm;
Figure 16 is a flow chart of the address parsing algorithm.
Detailed description of the embodiments
The method of the invention is described in further detail below with reference to the drawings and examples.
The software system implementing the invention was developed on the Visual C# .NET 2003 platform and run on a PC with a 1.73 GHz CPU and 1 GB of memory. Address data for 194,719 enterprises in Nanjing, Jiangsu Province, served as the sample data. The address feature-character library was built from the sample data, and address semantic analysis was then performed using that library and the address representation rules.
Example 1: the Chinese address string 六合县雄州镇朝天街108号 (No. 108, Chaotian Street, Xiongzhou Town, Liuhe County), shown in Figure 4, is parsed as 六合县/雄州镇/朝天街/108号 ("/" marks a split position).
Example 2: the Chinese address string 江苏省六合县八百镇金山村 (Jinshan Village, Babai Town, Liuhe County, Jiangsu Province), shown in Figure 5, is parsed as 江苏省/六合县/八百镇/金山村.
Example 3: the Chinese address string 六合县六城镇泰山村82号 (No. 82, Taishan Village, Liucheng Town, Liuhe County), shown in Figure 6, is parsed as 六合县/六城镇/泰山村/82号.
Example 4: the Chinese address string 六合区八百桥镇街道 (Babaiqiao Town Street, Liuhe District), shown in Figure 7, is parsed as 六合区/八百桥镇街道.
Example 5: the Chinese address string 六合区雄州镇健康巷1号-2 (No. 1-2, Jiankang Lane, Xiongzhou Town, Liuhe District), shown in Figure 8, is parsed as 六合区/雄州镇/健康巷/1号/-2.
Example 6: the Chinese address string 南京市玄武区明故宫4号 (No. 4, Ming Palace, Xuanwu District, Nanjing City), shown in Figure 9, is parsed as 南京市/玄武区/明故宫/4号.
Example 7: the Chinese address string 六合区雄州镇中心农贸市场 (Central Farmers' Market, Xiongzhou Town, Liuhe District), shown in Figure 10, is parsed as 六合区/雄州镇/中心农贸市场.
Example 8: the Chinese address string 北门桥路5号302室 (Room 302, No. 5, Beimenqiao Road), shown in Figure 11, is parsed as 北门桥路/5号/302室.
Example 9: the Chinese address string 六合区程桥镇东大桥边 (beside the East Bridge, Chengqiao Town, Liuhe District), shown in Figure 12, is parsed as 六合区/程桥镇/东大桥/边.
Example 10: the Chinese address string 玄武区相府营14号104室 (Room 104, No. 14, Xiangfuying, Xuanwu District), shown in Figure 13, is parsed as 玄武区/相府营/14号/104室.
Claims (1)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CNA2008101565884A CN101393544A (en) | 2008-10-07 | 2008-10-07 | Chinese Address Semantic Analysis Method Oriented to Address Coding |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN101393544A true CN101393544A (en) | 2009-03-25 |
Family
ID=40493843
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CNA2008101565884A Pending CN101393544A (en) | 2008-10-07 | 2008-10-07 | Chinese Address Semantic Analysis Method Oriented to Address Coding |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN101393544A (en) |
Cited By (18)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101996247A (en) * | 2010-11-10 | 2011-03-30 | 百度在线网络技术(北京)有限公司 | Method and device for constructing address database |
| CN102024024A (en) * | 2010-11-10 | 2011-04-20 | 百度在线网络技术(北京)有限公司 | Method and device for constructing address database |
| CN102073724A (en) * | 2011-01-11 | 2011-05-25 | 深圳市络道科技有限公司 | System and method for automatically identifying Chinese address subscribers |
| CN101719128B (en) * | 2009-12-31 | 2012-05-23 | 浙江工业大学 | Fuzzy matching-based Chinese geo-code determination method |
| CN102880650A (en) * | 2012-08-27 | 2013-01-16 | 中国工商银行股份有限公司 | Data matching method and device |
| CN104933024A (en) * | 2015-05-12 | 2015-09-23 | 深圳市华傲数据技术有限公司 | Chinese address word segmentation and annotation method |
| CN104933023A (en) * | 2015-05-12 | 2015-09-23 | 深圳市华傲数据技术有限公司 | Chinese address word segmentation and annotation method |
| CN105069056A (en) * | 2015-07-24 | 2015-11-18 | 湖北文理学院 | Character string matching based method and system for analyzing address information of identification card |
| CN105447152A (en) * | 2015-11-30 | 2016-03-30 | 明博教育科技股份有限公司 | Basic education e-textbook coding method |
| CN106502978A (en) * | 2016-09-19 | 2017-03-15 | 浪潮软件股份有限公司 | Chinese address segmentation method and device |
| CN106502995A (en) * | 2016-11-30 | 2017-03-15 | 福建榕基软件股份有限公司 | Intelligent hierarchical information identification method and device |
| CN106528605A (en) * | 2016-09-27 | 2017-03-22 | 武汉工程大学 | A rule-based Chinese address resolution method |
| CN106682214A (en) * | 2016-12-30 | 2017-05-17 | 中国科学院深圳先进技术研究院 | Personal information base address coding method |
| CN108549656A (en) * | 2018-03-09 | 2018-09-18 | 北京百度网讯科技有限公司 | Sentence parsing method, device, computer equipment and readable medium |
| CN108733810A (en) * | 2018-05-21 | 2018-11-02 | 北京神州泰岳软件股份有限公司 | Address data matching method and device |
| CN111538796A (en) * | 2020-03-26 | 2020-08-14 | 中国平安人寿保险股份有限公司 | Address normalization processing method, device, equipment and storage medium |
| CN112181978A (en) * | 2020-08-19 | 2021-01-05 | 杭州数梦工场科技有限公司 | Address storage structure, address resolution method, device, medium and computer equipment |
| CN112417812A (en) * | 2020-11-26 | 2021-02-26 | 新智认知数据服务有限公司 | Address standardization method and system and electronic equipment |
Cited By (26)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101719128B (en) * | 2009-12-31 | 2012-05-23 | 浙江工业大学 | Fuzzy matching-based Chinese geo-code determination method |
| CN102024024B (en) * | 2010-11-10 | 2013-07-10 | 百度在线网络技术(北京)有限公司 | Method and device for constructing address database |
| CN102024024A (en) * | 2010-11-10 | 2011-04-20 | 百度在线网络技术(北京)有限公司 | Method and device for constructing address database |
| CN101996247B (en) * | 2010-11-10 | 2013-02-20 | 百度在线网络技术(北京)有限公司 | Method and device for constructing address database |
| CN101996247A (en) * | 2010-11-10 | 2011-03-30 | 百度在线网络技术(北京)有限公司 | Method and device for constructing address database |
| CN102073724A (en) * | 2011-01-11 | 2011-05-25 | 深圳市络道科技有限公司 | System and method for automatically identifying Chinese address subscribers |
| CN102880650A (en) * | 2012-08-27 | 2013-01-16 | 中国工商银行股份有限公司 | Data matching method and device |
| CN102880650B (en) * | 2012-08-27 | 2015-11-18 | 中国工商银行股份有限公司 | Data matching method and device |
| CN104933023B (en) * | 2015-05-12 | 2017-09-01 | 深圳市华傲数据技术有限公司 | Chinese address word segmentation and annotation method |
| CN104933024A (en) * | 2015-05-12 | 2015-09-23 | 深圳市华傲数据技术有限公司 | Chinese address word segmentation and annotation method |
| CN104933023A (en) * | 2015-05-12 | 2015-09-23 | 深圳市华傲数据技术有限公司 | Chinese address word segmentation and annotation method |
| CN104933024B (en) * | 2015-05-12 | 2017-09-01 | 深圳市华傲数据技术有限公司 | Chinese address word segmentation and annotation method |
| CN105069056A (en) * | 2015-07-24 | 2015-11-18 | 湖北文理学院 | Character string matching based method and system for analyzing address information of identification card |
| CN105447152A (en) * | 2015-11-30 | 2016-03-30 | 明博教育科技股份有限公司 | Basic education e-textbook coding method |
| CN106502978A (en) * | 2016-09-19 | 2017-03-15 | 浪潮软件股份有限公司 | Chinese address segmentation method and device |
| CN106528605A (en) * | 2016-09-27 | 2017-03-22 | 武汉工程大学 | A rule-based Chinese address resolution method |
| CN106502995A (en) * | 2016-11-30 | 2017-03-15 | 福建榕基软件股份有限公司 | Intelligent hierarchical information identification method and device |
| CN106502995B (en) * | 2016-11-30 | 2019-10-15 | 福建榕基软件股份有限公司 | Intelligent hierarchical information identification method and device |
| CN106682214A (en) * | 2016-12-30 | 2017-05-17 | 中国科学院深圳先进技术研究院 | Personal information base address coding method |
| CN108549656A (en) * | 2018-03-09 | 2018-09-18 | 北京百度网讯科技有限公司 | Sentence parsing method, device, computer equipment and readable medium |
| CN108733810A (en) * | 2018-05-21 | 2018-11-02 | 北京神州泰岳软件股份有限公司 | Address data matching method and device |
| CN108733810B (en) * | 2018-05-21 | 2021-02-05 | 鼎富智能科技有限公司 | Address data matching method and device |
| CN111538796A (en) * | 2020-03-26 | 2020-08-14 | 中国平安人寿保险股份有限公司 | Address normalization processing method, device, equipment and storage medium |
| CN112181978A (en) * | 2020-08-19 | 2021-01-05 | 杭州数梦工场科技有限公司 | Address storage structure, address resolution method, device, medium and computer equipment |
| CN112417812A (en) * | 2020-11-26 | 2021-02-26 | 新智认知数据服务有限公司 | Address standardization method and system and electronic equipment |
| CN112417812B (en) * | 2020-11-26 | 2024-05-17 | 新智认知数据服务有限公司 | Address standardization method and system and electronic equipment |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN101393544A (en) | | Chinese Address Semantic Analysis Method Oriented to Address Coding |
| CN107145577A (en) | | Address standardization method, device, storage medium and computer |
| CN100573506C (en) | | Spatio-temporal fusion method for dynamic traffic information expressed in natural language |
| CN112528174B (en) | | Address trimming and complementing method based on knowledge graph and multiple matching and application |
| CN108628811B (en) | | Address text matching method and device |
| CN105224622A (en) | | Method for extracting and standardizing place names and addresses from the Internet |
| US20030165254A1 (en) | | Adapting point geometry for storing address density |
| CN101882163A (en) | | A Geographic Assignment Method of Fuzzy Chinese Addresses Based on Matching Rules |
| CN109933797A (en) | | Geocoding method and system based on Jieba word segmentation and address thesaurus |
| WO2015027836A1 (en) | | Method and system for place name entity recognition |
| CN111324679B (en) | | Method, device and system for processing address information |
| WO2022095256A1 (en) | | Geocoding method and system, terminal and storage medium |
| CN101763574A (en) | | Historic building conservation technical information management system and method based on domain knowledge |
| CN103838825A (en) | | Global geographical name data integrating and encoding method |
| CN110765773A (en) | | Address data acquisition method and device |
| CN106021336A (en) | | A method for automatic administrative district division for mass address information |
| CN107368471A (en) | | Method for extracting place names and addresses from web page text |
| CN113536070A (en) | | Address resolution method, system, computer equipment and storage medium |
| CN110990520A (en) | | Address coding method and device, electronic equipment and storage medium |
| CN115185986A (en) | | Province and city address information matching method, device, computer equipment and storage medium |
| CN104102667A (en) | | POI (Point of Interest) information differentiation method and device |
| CN106649803A (en) | | Address matching method and system |
| CN108268445A (en) | | Method and device for processing address information |
| Moura et al. | | Reference data enhancement for geographic information retrieval using linked data |
| CN106682175A (en) | | Method and system for matching address |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | C06 | Publication | |
| | PB01 | Publication | |
| | C10 | Entry into substantive examination | |
| | SE01 | Entry into force of request for substantive examination | |
| | C12 | Rejection of a patent application after its publication | |
| | RJ01 | Rejection of invention patent application after publication | Open date: 20090325 |