CN101393544A

CN101393544A - Chinese Address Semantic Analysis Method Oriented to Address Coding

Info

Publication number: CN101393544A
Application number: CNA2008101565884A
Authority: CN
Inventors: 张雪英; 申琪君; 李伯秋; 陈文君
Original assignee: Nanjing Normal University
Current assignee: Nanjing Normal University
Priority date: 2008-10-07
Filing date: 2008-10-07
Publication date: 2009-03-25

Abstract

The invention discloses a Chinese address semantic analysis method oriented to address coding. The steps are as follows. Step 1: build an address feature-character library from sample data: (a) establish the sample data; (b) screen feature characters; (c) screen subsidiary feature characters. The selected feature characters and subsidiary feature characters together form the feature-character library. Step 2: using the feature-character library and the address representation rules, convert a Chinese address into a string of digits. Step 3: build the address parsing rule base. Step 4: perform semantic analysis, which comprises address representation (converting the original address into its digit representation), address parsing (splitting the digit representation into address elements), and address restoration (restoring the parsed digit representation to the corresponding substrings of the original address).

Description

Chinese Address Semantic Analysis Method Oriented to Address Coding

Technical Field

The present invention proposes a Chinese address semantic analysis method that does not depend on a gazetteer (place-name dictionary). It is suitable for address coding in geographic information systems (GIS) used in real estate management, land management, urban planning, public security, postal services, taxation, telecommunications, public health, directory-enquiry services, and similar fields.

Background

In everyday work and life, addresses are among the most commonly used reference systems for describing geographic locations in natural language, and address descriptions are the most common way of recording spatial location in business systems of all kinds. Address coding technology can convert the large volume of address information already held in management information systems (MIS) into geographic coordinates usable by a GIS. The GIS can then integrate, store, retrieve, manipulate, and analyze geographic data, link data scattered across departments through a common spatial reference system, and support decision-making in land use, resource management, environmental monitoring, transportation, urban planning, and other areas, which greatly promotes the application of GIS technology.

Address coding is the process of semantically parsing address information described in natural language according to an address model and coding rules, matching the result against a database, and thereby associating the address with its spatial coordinates and geographic code. The basic principle is shown in Figure 1.

Address coding has to solve three key technical problems. (1) Address semantic analysis: splitting an address described in natural language into address elements, each of which designates a specific geographic extent within some bounded area. For example, “南京市鼓楼区宁海路122号” (No. 122, Ninghai Road, Gulou District, Nanjing) is parsed into the four address elements “南京市” (Nanjing City), “鼓楼区” (Gulou District), “宁海路” (Ninghai Road), and “122号” (No. 122). The elements of an address are ordered from the largest extent to the smallest, and each element is meaningful only relative to the elements that precede it. (2) Address model: the address model describes the rules by which address elements are composed in the various types of address. (3) Address matching: the automated process of matching a semantically parsed address against the standard addresses in the GIS, according to the established address model and coding rules, and returning the geographic coordinates.

Beginning in the 1970s, the United States built a national geocoding system, Dual Independent Map Encoding (DIME), whose development was a milestone in the history of GIS technology. In the late 1980s the US Census Bureau developed DIME into the Topologically Integrated Geographic Encoding and Referencing (TIGER) system. Because the TIGER database has wide coverage, good accuracy, guaranteed updates, and low cost, it has become the recognized address coding reference standard in the United States. Geocoding databases and geocoding software tools outside China are now commercial products, and many content standards and specifications for address data exist, such as the public draft of the FGDC address data content standard. Most GIS software developed outside China includes an address coding function, for example MapInfo's MapMaker, ArcGIS's Geocoding, and GeoMedia's Geocodes Addresses, and these achieve high response speed and accuracy in practical use.

Two factors are key to the successful adoption of address coding technology outside China. First, from a linguistic point of view, words in addresses written in English and other Western languages are separated by spaces. Second, from a standardization point of view, the naming and expression of address data, as well as software development and application services, follow standardized and normalized conventions. As a result, address semantic analysis in these systems relies on simple character matching against the place names in a standard address database (collectively referred to as the "dictionary matching method").

Specifically, the dictionary matching method treats the place-name data in the GIS as a gazetteer, applies a string matching algorithm for lexical and syntactic analysis, and matches the address string against the place names in the dictionary in order to parse the address. The method can only parse place names in the address string that are identical to entries in the dictionary. For example, suppose the dictionary contains “南京市” (Nanjing City), “鼓楼区” (Gulou District), and “宁海路” (Ninghai Road) but not “文苑路” (Wenyuan Road). The address “南京市鼓楼区宁海路122号” can then be parsed into the four address elements “南京市”, “鼓楼区”, “宁海路”, and “122号”, whereas “南京市文苑路12号” is parsed only into “南京市” and “文苑路12号”. The effectiveness of the dictionary matching method therefore increases with the size and update frequency of the dictionary, while its efficiency decreases as the dictionary grows. Moreover, owing to the historical and cultural characteristics of the Chinese language and a severe lack of address coding standards, Chinese addresses are highly irregular: naming follows few regular patterns, formats are complex, one place may have several names, and names are added, deleted, and changed rather arbitrarily.

Clearly, the dictionary matching method cannot meet the needs of large-scale data processing in terms of scope of application, update and maintenance, accuracy, or response speed. Chinese address semantic analysis has become a key problem that Chinese address coding technology needs to solve (Huang Song. Research on Chinese Address Coding Technology [D]. Peking University: master's thesis, 2005).

Since the 1980s, research on address coding technology in China has focused on standard address databases and address matching algorithms. For example, Beijing and Shanghai have successively issued a series of coding standards for urban roads, road intersections, and the like (Zhu Jianwei, Wang Zemin. Geocoding Principles and Localization Solutions [J]. Beijing Surveying and Mapping, 2004(2), 24-27). Address semantic analysis in these systems uses the dictionary matching method throughout, as in "寻址神" (Addressing God) from Beijing Changdi Computer Company, "Map Searcher" from Peking University Founder, the "Beijing Address Coding Database System and Standard Address Matching Engine" from Zhaoxi Technology, the "Customer Relationship Management System" from Beijing SuperMap, and EzGeoCoding from Shanhai Yihui.

Summary of the Invention

The technical problem to be solved by the present invention is to overcome the defects of the prior art and to provide a Chinese address semantic analysis method oriented to address coding that does not rely on a gazetteer to parse addresses.

The technical procedure of the method is as follows:

Step 1: Build the address feature-character library from sample data

A Chinese address contains four types of address element: administrative division, street, door/building number, and supplementary information. The elements are arranged from the largest geographic extent to the smallest, with supplementary information last, to form the Chinese address string. Some types of address element may be missing from a given address string.

Following the "Codes for the Administrative Divisions of the People's Republic of China" (GB 2260-1995), administrative divisions are classified into four levels, and the divisions at village level and above are ordered from largest to smallest (some levels may be missing). The first level comprises provinces, autonomous regions, municipalities directly under the central government, and special administrative regions. The second level comprises cities, prefectures, autonomous prefectures, leagues, and the districts and counties belonging to municipalities directly under the central government. The third level comprises counties, municipal districts, county-level cities, and banners. The fourth level comprises townships, towns, and villages.

In general, one address contains the names of several administrative divisions of different levels. For example, “南京市鼓楼区宁海路122号” contains two such names: “南京市” (Nanjing City, second level) and “鼓楼区” (Gulou District, third level).

The street element is a road name and/or a street name.

The door/building number element is a house number, building number, building name, and/or room number.

Supplementary information is an institution name, or a word expressing a spatial relationship (east, west, south, north, and so on), appended after the door/building number. For example, “滨江市场” (Binjiang Market) in “南京市鼓楼区江东北路301号滨江市场” is an institution name, and “西” (west) in “南京市江浦县永宁镇西葛街西” is a word expressing a spatial direction.

A Chinese address string can be split into several address elements of different types. Except for supplementary information, each address element is a run of ordinary characters followed by a feature character, where:

the feature characters for administrative divisions are 省 (province), 自治区 (autonomous region), 直辖市 (municipality directly under the central government), 特别行政区 (special administrative region), 市 (city), 地区 (prefecture), 自治州 (autonomous prefecture), 盟 (league), 区 (district), 县 (county), 旗 (banner), 乡 (township), 镇 (town), 村 (village), 屯 (hamlet), 庄 (hamlet), etc.;

the feature characters for streets are 路 (road), 街道 (street), 街 (street), 大街 (main street), 大道 (avenue), 马路 (road), 里 (li), 弄 (lane), 胡同 (hutong), 巷 (alley), 条 (tiao), etc.;

the feature characters for door/building numbers are 号 (number), 楼 (building), 宿舍 (dormitory), 斋 (zhai), 馆 (hall), 堂 (hall), etc.

Building the address feature-character library involves the following steps:

1. Establish the sample data: separate out the individual address elements in the raw address data to form the sample data.

Table 1. Sample of raw data

Raw data
南京市玄武区后宰门西村87号 (No. 87, Houzaimen West Village, Xuanwu District, Nanjing)
六合区竹镇镇仕林路2号 (No. 2, Shilin Road, Zhuzhen Town, Liuhe District)
南京市浦口区桥林镇南二村山根组 (Shangen Group, Nan'er Village, Qiaolin Town, Pukou District, Nanjing)
南京市江浦县龙山乡龙南村 (Longnan Village, Longshan Township, Jiangpu County, Nanjing)
南京市雨花台区西云村3号 (No. 3, Xiyun Village, Yuhuatai District, Nanjing)
南京杭州环北市场负一楼037-054号 (No. 037-054, Basement Level 1, Hangzhou Huanbei Market, Nanjing)
秦淮区红花街道翁家营村翁家营153号 (No. 153, Wengjiaying, Wengjiaying Village, Honghua Subdistrict, Qinhuai District)

Table 2. Sample of the sample data

2. Screen feature characters: a feature character marks the end of an address element and can be regarded as the unit of that element. In most cases the feature characters alone allow an address to be divided fairly accurately into independent semantic units. The screening procedure is as follows. For all address elements in the sample data, count the frequency of the final character and, separately, of the final two characters, and sort each list in descending order. Single characters that together account for a cumulative frequency of 80% or more are selected as feature characters (called "single feature characters"). Two-character endings that together account for a cumulative frequency of 80% or more, and whose last character is not already a single feature character, are also selected as feature characters (called "compound feature characters").
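The screening of single feature characters can be sketched in Python as below. This is only one possible reading of the 80% criterion: final characters are taken in descending order of frequency until their cumulative share of all address elements reaches the threshold. The function name `screen_single_features` and the toy element list are illustrative assumptions, not part of the patent.

```python
from collections import Counter
from typing import Iterable, List


def screen_single_features(elements: Iterable[str], threshold: float = 0.8) -> List[str]:
    """Select final characters, most frequent first, until their
    cumulative share of all address elements reaches `threshold`."""
    # Count how often each character ends an address element.
    counts = Counter(e[-1] for e in elements if e)
    total = sum(counts.values())
    selected: List[str] = []
    cumulative = 0
    for ch, n in counts.most_common():  # descending frequency
        if total == 0 or cumulative / total >= threshold:
            break
        selected.append(ch)
        cumulative += n
    return selected


# Toy sample of already-separated address elements (hypothetical data).
sample = ["宁海路", "中山路", "北京路", "文苑路", "仕林路",
          "122号", "87号", "3号", "鼓楼区", "西云村"]
print(screen_single_features(sample))
```

With this toy sample, 路 ends five elements and 号 ends three, so these two characters alone reach the 80% threshold and the rarer endings are left out. Compound feature characters could be screened the same way on the final two characters, skipping endings whose last character is already selected.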

3. Screen subsidiary feature characters: Chinese addresses often contain words that express spatial relationships, such as 东 (east), 南 (south), 西 (west), and 北 (north). These can help determine where address elements should be split, and they are selected as subsidiary feature characters.

The selected feature characters and subsidiary feature characters together form the feature-character library.

Step 2: Using the feature-character library and the address representation rules, convert the Chinese address into a string of digits

To simplify computer processing, the Chinese address string is converted into a digit representation, in which 1 denotes a feature character, 2 denotes a subsidiary feature character, 3 denotes the second of two consecutive identical feature characters, 0 denotes an ordinary character, and 9 denotes the terminator. Ordinary characters carry no information for formulating splitting rules, so each run of consecutive 0s is compressed into a single 0. For example, “江苏省六合县八百桥镇冶东村小林32号” (No. 32, Xiaolin, Yedong Village, Babaiqiao Town, Liuhe County, Jiangsu Province) is represented as "01010110212019", and “建邺区应天路叶圩村村部” (Village Office, Yexu Village, Yingtian Road, Jianye District) is represented as "0101011309".
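The conversion can be sketched in Python as follows. The two character sets are small illustrative assumptions rather than the patent's actual feature-character library; 圩 is included as a feature character only because that assumption reproduces the second example above. Compound (two-character) feature characters are not handled in this sketch, and the function name `to_digit_string` is hypothetical.

```python
import re

# Illustrative subsets only (assumptions); the real library is derived
# from sample data as described in Step 1.
FEATURE_CHARS = set("省市区县镇乡村圩路街巷号")
SUBSIDIARY_CHARS = set("东南西北")


def to_digit_string(address: str) -> str:
    """Encode an address as digits: 1 feature, 2 subsidiary feature,
    3 repeated feature, 0 ordinary (runs compressed), 9 terminator."""
    digits = []
    for i, ch in enumerate(address):
        if ch in FEATURE_CHARS:
            # 3 marks the second of two consecutive identical feature characters.
            repeated = i > 0 and address[i - 1] == ch and digits[-1] == "1"
            digits.append("3" if repeated else "1")
        elif ch in SUBSIDIARY_CHARS:
            digits.append("2")
        else:
            digits.append("0")
    # Compress each run of ordinary characters, then append the terminator.
    return re.sub(r"0+", "0", "".join(digits)) + "9"


print(to_digit_string("建邺区应天路叶圩村村部"))    # 0101011309
print(to_digit_string("南京市鼓楼区宁海路122号"))  # 2010101019
```

Under these assumed sets, 南 in 南京市 is coded 2 because it is also a direction word, which may explain why many of the parsing rules begin with "2".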

Step 3: Build the address parsing rule base

Once a Chinese address has been converted into a digit string, its composition always obeys the following rules:

● "0" can only be followed by one of "1", "2", "9";

● "1" can only be followed by one of "0", "1", "2", "3", "9";

● "2" can only be followed by one of "0", "1", "2", "9";

● "3" can only be followed by one of "0", "1", "2", "9";

● the string can only begin with one of "0", "1", "2", "3";

● the string can only end with "9".
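The composition rules above amount to a small table of allowed successors, which can be checked mechanically. The Python sketch below is an illustration only; the names `ALLOWED_NEXT` and `is_valid_digit_string` are assumptions and do not appear in the patent.

```python
# Allowed successor digits, transcribed from the rules listed above.
ALLOWED_NEXT = {
    "0": set("129"),
    "1": set("01239"),
    "2": set("0129"),
    "3": set("0129"),
}


def is_valid_digit_string(digits: str) -> bool:
    """Return True if `digits` obeys the start, end, and successor rules."""
    if len(digits) < 2 or digits[0] not in "0123" or digits[-1] != "9":
        return False
    for cur, nxt in zip(digits, digits[1:]):
        # "9" has no allowed successor, so it may only appear last.
        if cur not in ALLOWED_NEXT or nxt not in ALLOWED_NEXT[cur]:
            return False
    return True


print(is_valid_digit_string("0101011309"))  # True
print(is_valid_digit_string("0019"))        # False: "0" cannot follow "0"
```

A check of this kind could serve as a sanity test on the output of the address representation step before any parsing rule is applied.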

Under these rules, addresses can be represented as a tree in which each path corresponds to one parsing rule. The first-level nodes of the tree are "0", "1", "2", and "3", and their descendants are organized according to the rules above. Once a path reaches a certain length, the split point of the address element can be determined, and the path is not extended further. Each parsing rule must also specify a concrete split position, written as "f" followed by the position. Using the addresses in the sample data, the frequency with which each rule is applied is counted, and the parsing paths that together account for a cumulative frequency of 95% or more are selected as parsing rules (Table 3). The tree structure of the parsing rules is shown in Figure 3.

Table 3. Address parsing rules

No.   Rule               No.   Rule
1.    010f2              93.   0202019f6
2.    019f2              94.   02110101f4
3.    2010f3             95.   011122101f3
4.    0110f3             96.   112f2
5.    09f1               97.   020210101f5
6.    0210f3             98.   309f2
7.    012f2              99.   02021019f7
8.    0120f2             100.  012101021f2
9.    0112f3             101.  2020109f2
10.   0119f3             102.  21119f4
11.   2110f3             103.  012101011f2
12.   21010f2            104.  202010101f2
13.   21019f2            105.  211111f3
14.   21210f4            106.  21112f3
15.   2012f3             107.  119f2
16.   2019f3             108.  3101f2
17.   20210f4            109.  202021f2
18.   01110f4            110.  012101010109f2
19.   10f2               111.  02221f1
20.   0219f3             112.  21211019f5
21.   02010f4            113.  11109f4
22.   201101f4           114.  2209f3
23.   219f2              115.  0111219f3
24.   0212f3             116.  0202109f6
25.   02210f4            117.  202020101f2
26.   012101019f2        118.  2120f2
27.   11101f3            119.  2013f3
28.   011110f2           120.  1211f1
29.   2210f3             121.  0121010101010f4
30.   2109f3             122.  02211f5
31.   02019f4            123.  210111f2
32.   2119f3             124.  2111101f2
33.   209f2              125.  020119f5
34.   2112f3             126.  020209f5
35.   013f2              127.  021119f3
36.   201102f4           128.  202120f4
37.   301f3              129.  020219f5
38.   2219f3             130.  20111019f3
39.   1210f3             131.  1212f3
40.   0209f3             132.  2021109f4
41.   2201f4             133.  129f2
42.   201109f3           134.  201121019f3
43.   2221f4             135.  2021101f2
44.   21021f2            136.  020210109f8
45.   01210101019f4      137.  02110109f3
46.   21219f4            138.  22119f4
47.   19f1               139.  0113f3
48.   2101101f2          140.  02111019f3
49.   21012f2            141.  2102019f2
50.   29f1               142.  213f2
51.   110f2              143.  20221f5
52.   0202010f6          144.  0221201f4
53.   20119f4            145.  212112f4
54.   01119f4            146.  229f2
55.   01112101f3         147.  021120f4
56.   02012f4            148.  212119f2
57.   20201019f2         149.  01112019f3
58.   1119f3             150.  0202210f6
59.   021109f3           151.  2011109f4
60.   20212101f2         152.  21020109f2
61.   211101f4           153.  011129f3
62.   022010f5           154.  021102f3
63.   2122f2             155.  3109f3
64.   2212f3             156.  021101101f4
65.   0211019f3          157.  022019f5
66.   21212f4            158.  2121109f4
67.   1112f3             159.  021110101f5
68.   0121010109f2       160.  021111f3
69.   22110f4            161.  02209f4
70.   20219f4            162.  210201019f6
71.   02119f4            163.  2102210f2
72.   11102f3            164.  01112010f4
73.   029f2              165.  11110f4
74.   2202f5             166.  20201109f5
75.   202010109f5        167.  211119f2
76.   011119f2           168.  3219f3
77.   120f1              169.  39f1
78.   3210f3             170.  0111211f2
79.   210119f2           171.  201119f3
80.   011111f3           172.  3021f4
81.   211109f5           173.  0202021f7
82.   2101109f4          174.  021110109f3
83.   02219f4            175.  0211109f3
84.   0201101f5          176.  20201011f2
85.   0201109f6          177.  20212109f2
86.   011112f2           178.  210201011f6
87.   0121010101019f2    179.  210209f2
88.   2011201f4          180.  0121010201010f4
89.   1219f3             181.  201111019f3
90.   12210f4            182.  202011019f2
91.   202019f5           183.  2101102f2
92.   20209f4            184.  3221f4

In Figure 3, the rule "0120f2" denotes a path whose nodes at successive levels are "0", "1", "2", and "0". When the address digit string is scanned and a sequence matches "0120", the string is split after the second digit; "f2" records this split position. For example, “白下区南台巷” (Nantai Lane, Baixia District) is represented as "01201", which matches the rule "0120"; it is split after the leading "01" and parsed into “白下区” and “南台巷”. The rule "029f2" means that when part of the digit string matches "029", the string is split after the second digit. Because "9" marks the end of the digit string, this rule amounts to a split at the end of the address.
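As a rough illustration of how a rule string such as "0120f2" might be read and applied, the Python sketch below separates a rule into its digit pattern and split position and applies it once to the left end of a digit string. The function names `parse_rule` and `split_once` are hypothetical; matching at the left end follows the wording of the parsing steps in Step 4.

```python
from typing import Optional, Tuple


def parse_rule(rule: str) -> Tuple[str, int]:
    """Split a rule like "0120f2" into its pattern "0120" and split position 2."""
    pattern, _, position = rule.partition("f")
    return pattern, int(position)


def split_once(digits: str, rule: str) -> Optional[Tuple[str, str]]:
    """Apply one rule to the left end of `digits`.

    Returns (left, right) when the rule's pattern matches, otherwise None.
    """
    pattern, position = parse_rule(rule)
    if not digits.startswith(pattern):
        return None
    return digits[:position], digits[position:]


print(split_once("01201", "0120f2"))  # ('01', '201')
print(split_once("029", "029f2"))     # ('02', '9')
```

The left part corresponds to one finished address element (“白下区” in the first example), and the right part is what remains to be parsed.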

Step 4: Semantic analysis

Building on the feature characters and parsing rules, the present invention provides a Chinese address parsing algorithm (the "RBAI algorithm" for short). The algorithm has three parts: address representation, address parsing, and address restoration. The procedure is as follows:

Input: one raw address record, denoted Address_Before;

Output: the parsing result for that raw address, denoted Address_After.

(1) Address representation: convert the original address into its digit representation; the result is Numbers_Before.

Step 1: Set Numbers_Before to the empty string; let n be the length of the original address currently being parsed.

Step 2: For i from 1 to n, repeat:

    if the i-th character of the original address is a main feature character that repeats the (i-1)-th character, Numbers_Before[i] is set to 3;

    otherwise, if the i-th character is a main feature character, Numbers_Before[i] is set to 1;

    if the i-th character is a subsidiary feature character, Numbers_Before[i] is set to 2;

    if the i-th character is an ordinary character, Numbers_Before[i] is set to 0;

    set i to i+1;

end loop.

Step 3: Append 9 to the end of Numbers_Before, and compress every run of consecutive 0s in Numbers_Before into a single 0.

(2) Address parsing: split Numbers_Before into address elements according to the parsing rules; the result is Numbers_After.

Step 4: Set Numbers_After to the empty string; let k be the length of Numbers_Before.

Step 5: For m from 1 to k, repeat:

    if the leftmost m digits of Numbers_Before match a parsing rule, split Numbers_Before into a left and a right substring as the rule specifies; the left substring, Numbers_Left, is saved as one address element of the parsing result and is not split further; the right substring, Numbers_Right, continues to be split; set Numbers_Before to Numbers_Right;

    set m to the length of Numbers_Left + 1;

    otherwise

    set m to m+1;

end loop.

(3) Address restoration. Step 6: restore the parsed digit representation to the corresponding substrings of the original address; the result is Address_After.
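The three parts can be combined in a short end-to-end Python sketch. It is one possible reading of the steps above, not the patented implementation: the feature sets are small illustrative assumptions, only seven rules from Table 3 are loaded, the shortest matching left-hand prefix is used, and each digit remembers the span of original characters it covers so that restoration is a simple slice. All function and variable names are hypothetical.

```python
from typing import Dict, List, Tuple

# Illustrative subsets only (assumptions), not the patent's full library.
FEATURE_CHARS = set("省市区县镇乡村路街巷号")
SUBSIDIARY_CHARS = set("东南西北")

# A few rules copied from Table 3, stored as pattern -> split position.
RULES: Dict[str, int] = {"010": 2, "019": 2, "09": 1, "0120": 2,
                         "2010": 3, "2019": 3, "029": 2}

Token = Tuple[str, int, int]  # (digit, start, end) span in the original address


def represent(address: str) -> List[Token]:
    """Address representation: digits plus the character span each one covers."""
    tokens: List[Token] = []
    for i, ch in enumerate(address):
        if ch in FEATURE_CHARS:
            repeated = bool(tokens) and tokens[-1][0] == "1" and address[i - 1] == ch
            digit = "3" if repeated else "1"
        elif ch in SUBSIDIARY_CHARS:
            digit = "2"
        else:
            digit = "0"
        if digit == "0" and tokens and tokens[-1][0] == "0":
            tokens[-1] = ("0", tokens[-1][1], i + 1)  # compress a run of 0s
        else:
            tokens.append((digit, i, i + 1))
    tokens.append(("9", len(address), len(address)))  # terminator
    return tokens


def parse_address(address: str) -> List[str]:
    """Address parsing and restoration: split by rules, return original substrings."""
    tokens = represent(address)
    elements: List[str] = []
    while tokens and tokens[0][0] != "9":
        digits = "".join(t[0] for t in tokens)
        split = None
        for m in range(1, len(digits) + 1):  # shortest matching prefix first
            if digits[:m] in RULES:
                split = RULES[digits[:m]]
                break
        if split is None:  # no rule applies: keep the remainder as one element
            split = len(tokens) - 1
        elements.append(address[tokens[0][1]:tokens[split - 1][2]])
        tokens = tokens[split:]
    return elements


print("/".join(parse_address("六合县雄州镇朝天街108号")))
print("/".join(parse_address("白下区南台巷")))
```

Keeping the character spans alongside the digits is a design choice made for this sketch; it avoids a separate pass to map the compressed digit string back to the original text.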

Compared with the address parsing techniques used in existing Chinese address coding, the present invention has the following main advantages:

a. It does not depend on a dictionary: dictionary construction and updating are avoided, and the names of address elements not recorded in the GIS (or dictionary) can still be parsed.

b. It does not depend on natural language processing techniques such as Chinese word segmentation.

c. It is efficient: because it operates on each individual address rather than using a dictionary-based string matching algorithm, efficiency is significantly improved.

d. It is widely applicable: the feature-character library and parsing rules can be updated quickly for a given application by updating the sample data, while the parsing algorithm itself needs no change.

e. It is simple to implement and easy to deploy: the feature-character library and parsing rules are highly reusable, the algorithm is simple, and it can readily be embedded in all kinds of application systems.

Brief Description of the Drawings

Figure 1 is a schematic diagram of the basic principle and process of address coding;

Figure 2 is a flow diagram of the method of the present invention;

Figure 3 shows the parsing rules organized as a tree structure;

Figure 4 is a schematic diagram of address parsing (address: 六合县雄州镇朝天街108号, No. 108, Chaotian Street, Xiongzhou Town, Liuhe County);

Figure 5 is a schematic diagram of address parsing (address: 江苏省六合县八百镇金山村, Jinshan Village, Babai Town, Liuhe County, Jiangsu Province);

Figure 6 is a schematic diagram of address parsing (address: 六合县六城镇泰山村82号, No. 82, Taishan Village, Liucheng Town, Liuhe County);

Figure 7 is a schematic diagram of address parsing (address: 六合区八百桥镇街道, Babaiqiao Town Street, Liuhe District);

Figure 8 is a schematic diagram of address parsing (address: 六合区雄州镇健康巷1号-2, No. 1-2, Jiankang Lane, Xiongzhou Town, Liuhe District);

Figure 9 is a schematic diagram of address parsing (address: 南京市玄武区明故宫4号, No. 4, Ming Palace, Xuanwu District, Nanjing);

Figure 10 is a schematic diagram of address parsing (address: 六合区雄州镇中心农贸市场, Central Farmers' Market, Xiongzhou Town, Liuhe District);

Figure 11 is a schematic diagram of address parsing (address: 北门桥路5号302室, Room 302, No. 5, Beimenqiao Road);

Figure 12 is a schematic diagram of address parsing (address: 六合区程桥镇东大桥边, beside the East Bridge, Chengqiao Town, Liuhe District);

Figure 13 is a schematic diagram of address parsing (address: 玄武区相府营14号104室, Room 104, No. 14, Xiangfuying, Xuanwu District);

Figure 14 is a flow chart of the Chinese address semantic analysis algorithm;

Figure 15 is a flow chart of the address representation algorithm;

Figure 16 is a flow chart of the address parsing algorithm.

Detailed description of embodiments

The method of the present invention is described in further detail below with reference to the accompanying drawings and embodiments.

A software system implementing the technique of the present invention was developed on the Visual C#.NET 2003 platform and run on a PC with a 1.73 GHz CPU and 1 GB of memory. Address data for 194,719 enterprises in Nanjing, Jiangsu Province served as the sample data: the address feature-character library was built from the sample data, and address semantic parsing was then performed using that library and the address representation rules.

Example 1: Chinese address string: 六合县雄州镇朝天街108号 (No. 108, Chaotian Street, Xiongzhou Town, Liuhe County), as shown in Figure 4, parsed as: 六合县/雄州镇/朝天街/108号 (Liuhe County/Xiongzhou Town/Chaotian Street/No. 108), where "/" marks a segmentation position.

Example 2: Chinese address string: 江苏省六合县八百镇金山村 (Jinshan Village, Babai Town, Liuhe County, Jiangsu Province), as shown in Figure 5, parsed as: 江苏省/六合县/八百镇/金山村 (Jiangsu Province/Liuhe County/Babai Town/Jinshan Village).

Example 3: Chinese address string: 六合县六城镇泰山村82号 (No. 82, Taishan Village, Liucheng Town, Liuhe County), as shown in Figure 6, parsed as: 六合县/六城镇/泰山村/82号 (Liuhe County/Liucheng Town/Taishan Village/No. 82).

Example 4: Chinese address string: 六合区八百桥镇街道 (Babaiqiao Town Street, Liuhe District), as shown in Figure 7, parsed as: 六合区/八百桥镇街道 (Liuhe District/Babaiqiao Town Street).

Example 5: Chinese address string: 六合区雄州镇健康巷1号-2 (No. 1-2, Jiankang Lane, Xiongzhou Town, Liuhe District), as shown in Figure 8, parsed as: 六合区/雄州镇/健康巷/1号/-2 (Liuhe District/Xiongzhou Town/Jiankang Lane/No. 1/-2).

Example 6: Chinese address string: 南京市玄武区明故宫4号 (No. 4, Ming Palace, Xuanwu District, Nanjing City), as shown in Figure 9, parsed as: 南京市/玄武区/明故宫/4号 (Nanjing City/Xuanwu District/Ming Palace/No. 4).

Example 7: Chinese address string: 六合区雄州镇中心农贸市场 (Central Farmers' Market, Xiongzhou Town, Liuhe District), as shown in Figure 10, parsed as: 六合区/雄州镇/中心农贸市场 (Liuhe District/Xiongzhou Town/Central Farmers' Market).

Example 8: Chinese address string: 北门桥路5号302室 (Room 302, No. 5, Beimenqiao Road), as shown in Figure 11, parsed as: 北门桥路/5号/302室 (Beimenqiao Road/No. 5/Room 302).

Example 9: Chinese address string: 六合区程桥镇东大桥边 (beside the East Bridge, Chengqiao Town, Liuhe District), as shown in Figure 12, parsed as: 六合区/程桥镇/东大桥/边 (Liuhe District/Chengqiao Town/East Bridge/beside).

Example 10: Chinese address string: 玄武区相府营14号104室 (Room 104, No. 14, Xiangfuying, Xuanwu District), as shown in Figure 13, parsed as: 玄武区/相府营/14号/104室 (Xuanwu District/Xiangfuying/No. 14/Room 104).

Claims

1. A Chinese address semantic parsing method oriented to address coding, comprising the following steps: Step 1: build an address feature-character library from sample data:

a. Establish sample data: separate each address element in the original address data to form the sample data;

b. Screen feature characters: count the frequencies of the last character and of the last two characters of every address element in the sample data, and sort each in descending order; select as feature characters the single characters that account for more than 80% of the cumulative frequency; select as feature characters the two-character strings that account for more than 80% of the cumulative frequency, provided that their last character is not itself a single feature character;

c. Screen auxiliary feature characters;

The selected feature characters and auxiliary feature characters constitute the feature-character library;
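
The single-character screening in step b can be sketched as follows. This is a minimal illustration, not the implementation described in the patent: it assumes that "accounting for more than 80% of the cumulative frequency" means taking trailing characters in descending order of frequency until their cumulative share reaches the threshold. The function name `screen_feature_chars` and the `threshold` parameter are hypothetical.

```python
from collections import Counter


def screen_feature_chars(elements, threshold=0.8):
    """Select the most frequent trailing characters of the address elements.

    Assumption: characters are taken in descending order of frequency until
    their cumulative share of all trailing characters reaches `threshold`.
    """
    counts = Counter(e[-1] for e in elements if e)
    total = sum(counts.values())
    selected, cumulative = [], 0
    for ch, n in counts.most_common():
        if cumulative >= threshold * total:
            break  # the characters already selected cover the threshold
        selected.append(ch)
        cumulative += n
    return selected


# Toy sample data (hypothetical): 5 towns, 3 villages, 2 "beside" elements.
sample = ["雄州镇"] * 5 + ["金山村"] * 3 + ["东大桥边"] * 2
print(screen_feature_chars(sample))
```

Two-character suffixes could be screened the same way by counting `e[-2:]` and discarding candidates whose last character is already a single feature character.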

Step 2: according to the feature-character library and the address representation rules, convert the Chinese address into a string of digits, in which 1 denotes a feature character, 2 denotes an auxiliary feature character, 3 denotes the latter character of two consecutive repeated feature characters, 0 denotes an ordinary character, and 9 denotes the end marker; consecutive 0 characters are compressed into a single 0;

Step 3: build the address parsing rule base

After a Chinese address has been converted into a digit string, its composition always obeys the following rules:

"0" can only be followed by one of "1", "2", "9";

"1" can only be followed by one of "0", "1", "2", "3", "9";

"2" can only be followed by one of "0", "1", "2", "9";

"3" can only be followed by one of "0", "1", "2", "9";

The string can only begin with one of "0", "1", "2", "3";

The string can only end with "9";
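
The six composition rules above amount to a table of allowed successors, so a digit string can be checked mechanically. The sketch below is only an illustration of that check; the names `ALLOWED_NEXT`, `ALLOWED_FIRST` and `is_valid_digit_string` are hypothetical and do not come from the patent.

```python
# Allowed successors for each digit, transcribed from the rules above.
ALLOWED_NEXT = {
    "0": set("129"),
    "1": set("01239"),
    "2": set("0129"),
    "3": set("0129"),
}
ALLOWED_FIRST = set("0123")


def is_valid_digit_string(s):
    """Return True if s obeys the start, end and successor rules."""
    if len(s) < 2 or s[0] not in ALLOWED_FIRST or s[-1] != "9":
        return False
    # "9" has no entry in ALLOWED_NEXT, so a 9 in the middle is rejected.
    return all(b in ALLOWED_NEXT.get(a, ()) for a, b in zip(s, s[1:]))


print(is_valid_digit_string("010101019"))  # alternating ordinary/feature runs
print(is_valid_digit_string("0019"))       # "0" followed by "0" is not allowed
```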

According to the above rules, the address representations are organized as a tree in which each path represents one parsing rule; the first-level nodes of the tree are "0", "1", "2" and "3", and their descendant nodes are organized according to the above rules; once a path reaches a certain length, the split point of an address element can be determined, and the path is not expanded further; in addition, each parsing rule specifies a concrete split position, written as "f + split position";

Step 4: semantic parsing

Input: the original address data, denoted Address_Before;

A. Address representation: convert the original address into its digit representation; the result is Numbers_Before;

a. Set Numbers_Before to the empty string; let n denote the length of the original address currently being parsed;

b. For i from 1 to n, repeat:

If the i-th character of the original address is a main feature character, Numbers_Before[i] is set to 1;

If the (i+1)-th character of the original address is a main feature character, Numbers_Before[i] is set to 3;

If the i-th character of the original address is an auxiliary feature character, Numbers_Before[i] is set to 2;

If the i-th character of the original address is an ordinary character, Numbers_Before[i] is set to 0;

Set i to i+1;

End loop;

c. Append 9 to the end of Numbers_Before;

d. Compress each run of consecutive 0s in Numbers_Before into a single 0;
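
The address representation sub-steps above can be sketched in a few lines. This is a hedged illustration under stated assumptions, not the patented implementation: the condition for digit 3 is ambiguous in the text, so the sketch assumes that 3 marks a feature character that directly follows a character coded 1, which is one reading of Step 2 and is consistent with the composition rules. The function name `encode_address` and the small character sets passed to it are hypothetical.

```python
import re


def encode_address(address, main_chars, aux_chars):
    """Convert an address into its digit string (Numbers_Before).

    Assumption: a feature character directly after a character coded 1
    is coded 3; every other feature character is coded 1.
    """
    digits = []
    for ch in address:
        if ch in main_chars:
            digits.append("3" if digits and digits[-1] == "1" else "1")
        elif ch in aux_chars:
            digits.append("2")
        else:
            digits.append("0")  # ordinary character
    digits.append("9")  # end marker
    # Compress each run of consecutive 0s into a single 0.
    return re.sub(r"0+", "0", "".join(digits))


# Hypothetical feature characters chosen for the address of Example 1.
print(encode_address("六合县雄州镇朝天街108号", set("县镇街号"), set()))
```

With these assumed character sets, the address of Example 1 is encoded as `010101019`.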

B. Address parsing: split Numbers_Before into address elements according to the parsing rules; the result is Numbers_After;

a. Set Numbers_After to the empty string; let k denote the length of Numbers_Before;

b. For m from 1 to k, repeat:

If the leftmost m characters of Numbers_Before match a parsing rule, split Numbers_Before into a left and a right substring according to that rule; the left substring, Numbers_Left, is saved as one address element of the parsing result and is not split further; the right substring, Numbers_Right, continues to be split; Numbers_Before is redefined as Numbers_Right;

Set m to the length of Numbers_Left plus 1;

Otherwise

Set m to m+1;

End loop;

C. Address restoration: restore the digit-form parsing result to the character string corresponding to the original address; the result is Address_After.
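
The parsing and restoration stages can be illustrated together with a simplified sketch. Everything specific here is an assumption made for illustration: the rule table `RULES` is a hypothetical stand-in for the rule base of Figure 3 (each entry maps a digit prefix to the number of leading digits that form one address element), only the digits 0, 1 and 9 are handled, and restoration is done by remembering which original characters each compressed digit stands for.

```python
from itertools import groupby

# Hypothetical rule base: digit prefix -> split position (digits in one element).
RULES = {"010": 2, "011": 2, "019": 2, "19": 1}


def tokenize(address, main_chars):
    """Return (digit, text) tokens; a run of ordinary characters is one '0' token."""
    coded = [("1" if ch in main_chars else "0", ch) for ch in address]
    tokens = []
    for digit, group in groupby(coded, key=lambda pair: pair[0]):
        chars = [ch for _, ch in group]
        if digit == "0":
            tokens.append(("0", "".join(chars)))
        else:
            tokens.extend((digit, ch) for ch in chars)
    return tokens


def split_address(address, main_chars, rules=RULES):
    """Split an address into elements and restore them as a '/'-joined string."""
    tokens = tokenize(address, main_chars)
    digits = "".join(d for d, _ in tokens) + "9"  # 9 is the end marker
    elements, start = [], 0
    while start < len(tokens):
        rest = digits[start:]
        cut = len(tokens) - start  # fallback: no rule matched, keep the rest whole
        for m in range(1, len(rest) + 1):  # grow the prefix, like the loop over m
            if rest[:m] in rules:
                cut = min(rules[rest[:m]], cut)
                break
        # Restoration: map the digits of this element back to the original text.
        elements.append("".join(text for _, text in tokens[start:start + cut]))
        start += cut
    return "/".join(elements)


print(split_address("六合县雄州镇朝天街108号", set("县镇街号")))
```

With the assumed rules and feature characters, the addresses of Examples 1 and 8 are split as 六合县/雄州镇/朝天街/108号 and 北门桥路/5号/302室. Addresses that need the longer, context-dependent rules of the actual rule base (for instance Example 10) are beyond this toy table.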