+

CN118862889A - A bidding data filling management method, device, equipment and medium - Google Patents

A bidding data filling management method, device, equipment and medium Download PDF

Info

Publication number
CN118862889A
CN118862889A CN202410872098.3A CN202410872098A CN118862889A CN 118862889 A CN118862889 A CN 118862889A CN 202410872098 A CN202410872098 A CN 202410872098A CN 118862889 A CN118862889 A CN 118862889A
Authority
CN
China
Prior art keywords
data
governance
preset
filling
threshold
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410872098.3A
Other languages
Chinese (zh)
Inventor
陈�峰
孙永超
陈义蒙
陈昕
傅玉鑫
杨启源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chaozhou Zhuoshu Big Data Industry Development Co Ltd
Original Assignee
Chaozhou Zhuoshu Big Data Industry Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chaozhou Zhuoshu Big Data Industry Development Co Ltd filed Critical Chaozhou Zhuoshu Big Data Industry Development Co Ltd
Priority to CN202410872098.3A priority Critical patent/CN118862889A/en
Publication of CN118862889A publication Critical patent/CN118862889A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

本说明书实施例公开了一种招标数据填充治理方法、装置、设备及介质,包括:根据预设字段对招标源数据进行数据填充,得到待评估招标填充数据;若待评估招标填充数据中预设字段的填充率小于第一阈值,确定待评估招标填充数据的未填充预设字段;将未填充预设字段输入预先训练的信息抽取模型,得到第一治理数据;若第一治理数据中预设字段的填充率不小于第一阈值,判断第一治理数据中预设字段的准确率是否小于第二阈值;若第一治理数据中预设字段的准确率小于第二阈值,对第一治理数据中的填充数据进行正则治理,得到第二治理数据;若第二治理数据中预设字段的准确率不小于第二阈值,得到符合条件的招标数据。

The embodiments of the present specification disclose a bidding data filling and management method, device, equipment and medium, including: filling the bidding source data according to preset fields to obtain the bidding filling data to be evaluated; if the filling rate of the preset fields in the bidding filling data to be evaluated is less than a first threshold, determining the unfilled preset fields of the bidding filling data to be evaluated; inputting the unfilled preset fields into a pre-trained information extraction model to obtain first management data; if the filling rate of the preset fields in the first management data is not less than the first threshold, judging whether the accuracy of the preset fields in the first management data is less than a second threshold; if the accuracy of the preset fields in the first management data is less than the second threshold, performing regular management on the filling data in the first management data to obtain second management data; if the accuracy of the preset fields in the second management data is not less than the second threshold, obtaining qualified bidding data.

Description

一种招标数据填充治理方法、装置、设备及介质A bidding data filling management method, device, equipment and medium

技术领域Technical Field

本说明书涉及计算机技术领域,尤其涉及一种招标数据填充治理方法、装置、设备及介质。The present invention relates to the field of computer technology, and in particular to a bidding data filling management method, device, equipment and medium.

背景技术Background Art

随着信息技术的飞速发展,招标公告信息的数量和格式都变得越来越复杂。这些信息通常包含大量的数据,如项目名称、招标方、投标截止日期、要求的文件和资质等。这些信息的格式可能因招标方而异,导致难以统一处理。With the rapid development of information technology, the amount and format of tender notice information have become increasingly complex. This information usually contains a large amount of data, such as project name, tenderer, bid deadline, required documents and qualifications, etc. The format of this information may vary from tenderer to tenderer, making it difficult to process uniformly.

为了更好地促进市场信息流动和高效使用,需要对这些招标公告信息进行高质量的数据治理。然而,传统的人工识别处理方法无法满足高效和高质量的要求。人工处理速度慢,容易出现错误,而且难以处理大量的数据。In order to better promote the flow and efficient use of market information, high-quality data governance is required for these tender notice information. However, traditional manual recognition and processing methods cannot meet the requirements of high efficiency and high quality. Manual processing is slow, prone to errors, and difficult to process large amounts of data.

因此,需要一种新的方法来处理这些招标公告信息,以提高数据治理的效率和质量。Therefore, a new approach is needed to process these tender announcement information to improve the efficiency and quality of data governance.

发明内容Summary of the invention

本说明书一个或多个实施例提供了一种招标数据填充治理方法、装置、设备及介质,用于解决背景技术提出的技术问题。One or more embodiments of this specification provide a bidding data filling management method, device, equipment and medium for solving the technical problems raised by the background technology.

本说明书一个或多个实施例采用下述技术方案:One or more embodiments of this specification adopt the following technical solutions:

本说明书一个或多个实施例提供的一种招标数据填充治理方法,所述方法包括:One or more embodiments of this specification provide a bidding data filling management method, the method comprising:

根据预设字段对招标源数据进行数据填充,得到待评估招标填充数据;Fill the tender source data according to the preset fields to obtain the tender filling data to be evaluated;

若所述待评估招标填充数据中所述预设字段的填充率小于第一阈值,确定所述待评估招标填充数据的未填充预设字段;If the filling rate of the preset field in the tender filling data to be evaluated is less than a first threshold, determining the unfilled preset field of the tender filling data to be evaluated;

将所述未填充预设字段输入预先训练的信息抽取模型,得到第一治理数据;Inputting the unfilled preset fields into a pre-trained information extraction model to obtain first governance data;

若所述第一治理数据中预设字段的填充率不小于第一阈值,判断所述第一治理数据中预设字段的准确率是否小于第二阈值;If the filling rate of the preset field in the first governance data is not less than the first threshold, determining whether the accuracy rate of the preset field in the first governance data is less than the second threshold;

若所述第一治理数据中预设字段的准确率小于所述第二阈值,对所述第一治理数据中的填充数据进行正则治理,得到第二治理数据;If the accuracy of the preset field in the first managed data is less than the second threshold, regularization management is performed on the padding data in the first managed data to obtain second managed data;

若所述第二治理数据中预设字段的准确率不小于所述第二阈值,得到符合条件的招标数据。If the accuracy of the preset field in the second governance data is not less than the second threshold, qualified bidding data is obtained.

需要说明的是,本说明书实施例通过上述内容,具有下述有益效果:It should be noted that the embodiments of this specification have the following beneficial effects through the above contents:

提高数据填充的效率和质量:通过信息抽取模型和正则治理,可以自动处理大量的数据,避免人工处理时可能出现的错误和低效率。Improve the efficiency and quality of data filling: Through information extraction models and regular governance, large amounts of data can be processed automatically, avoiding errors and inefficiencies that may occur during manual processing.

确保数据的准确性和完整性:通过对填充数据进行准确率的判断和正则治理,可以提高数据的准确性和完整性,确保数据的质量符合要求。Ensure the accuracy and completeness of data: By judging the accuracy of filled data and regularizing it, the accuracy and completeness of data can be improved to ensure that the quality of data meets the requirements.

促进市场信息流动和高效使用:通过对招标公告信息进行高质量的数据治理,能够更好地促进市场信息流动和高效使用,提高市场交易效率。Promote the flow and efficient use of market information: Through high-quality data governance of tender announcement information, it is possible to better promote the flow and efficient use of market information and improve market transaction efficiency.

降低人工成本和工作量:避免了传统的人工识别处理方法,减少了人工成本和工作量,提高了工作效率。Reduce labor costs and workload: Avoid traditional manual identification processing methods, reduce labor costs and workload, and improve work efficiency.

进一步的,所述将所述未填充预设字段输入预先训练的信息抽取模型,得到第一治理数据,包括:Furthermore, the step of inputting the unfilled preset fields into a pre-trained information extraction model to obtain the first governance data includes:

通过预先训练的信息抽取模型,对所述未填充预设字段进行数据抽取,得到所述未填充预设字段的待填充数据;Extracting data from the unfilled preset fields through a pre-trained information extraction model to obtain data to be filled in the unfilled preset fields;

根据所述待填充数据对所述未填充预设字段进行数据填充,得到第一治理数据。The unfilled preset fields are filled with data according to the data to be filled to obtain first governance data.

需要说明的是,本说明书实施例通过上述内容,具有下述有益效果:It should be noted that the embodiments of this specification have the following beneficial effects through the above contents:

提高数据填充的准确性:通过预先训练的信息抽取模型,可以更准确地抽取未填充预设字段的待填充数据,从而提高数据填充的准确性。Improve the accuracy of data filling: Through the pre-trained information extraction model, the data to be filled in the preset fields that are not filled can be extracted more accurately, thereby improving the accuracy of data filling.

减少人工干预:自动进行数据抽取和填充,减少了人工干预的需求,提高了工作效率。Reduce manual intervention: Automatic data extraction and filling reduces the need for manual intervention and improves work efficiency.

快速处理大量数据:可以快速处理大量的未填充预设字段,提高数据处理的效率。Quickly process large amounts of data: A large number of unpopulated preset fields can be quickly processed to improve the efficiency of data processing.

进一步的,所述待评估招标填充数据为HTML格式文档,通过预先训练的信息抽取模型,对所述未填充预设字段进行数据抽取,得到所述未填充预设字段的待填充数据,包括:Furthermore, the tender filling data to be evaluated is an HTML format document, and the unfilled preset fields are extracted by a pre-trained information extraction model to obtain the data to be filled in the unfilled preset fields, including:

将所述待评估招标填充数据输入所述信息抽取模型,通过指针网络对所述未填充预设字段所对应的信息进行片段抽取,以实现命名实体识别、关系抽取、事件抽取,以及属性情感抽取。The tender filling data to be evaluated is input into the information extraction model, and the information corresponding to the unfilled preset fields is extracted through a pointer network to achieve named entity recognition, relationship extraction, event extraction, and attribute sentiment extraction.

需要说明的是,本说明书实施例通过上述内容,具有下述有益效果:It should be noted that the embodiments of this specification have the following beneficial effects through the above contents:

提高数据抽取的准确性:使用预先训练的信息抽取模型和指针网络,能够更准确地抽取未填充预设字段所对应的信息。这些模型经过训练,可以学习到HTML格式文档中的语义和结构特征,从而更好地识别和提取所需的数据。Improve the accuracy of data extraction: Using pre-trained information extraction models and pointer networks, you can more accurately extract information corresponding to unfilled preset fields. These models are trained to learn the semantic and structural features of HTML formatted documents to better identify and extract the required data.

实现多任务数据抽取:通过命名实体识别、关系抽取、事件抽取和属性情感抽取等多种任务,可以获取更丰富和全面的信息。这有助于深入了解待评估招标填充数据中的各个方面,为后续的分析和决策提供更有力的支持。Implement multi-task data extraction: Through multiple tasks such as named entity recognition, relationship extraction, event extraction, and attribute sentiment extraction, richer and more comprehensive information can be obtained. This helps to gain a deeper understanding of all aspects of the tender filling data to be evaluated, and provides stronger support for subsequent analysis and decision-making.

适应HTML格式:由于待评估招标填充数据为HTML格式文档,这种方法能够直接处理这种常见的网页格式。无需进行额外的格式转换或预处理,减少了数据处理的复杂性和工作量。Adapt to HTML format: Since the data filled in the tender to be evaluated is in HTML format, this method can directly process this common web page format. No additional format conversion or preprocessing is required, which reduces the complexity and workload of data processing.

提高数据治理效率:自动的数据抽取过程能够显著提高数据治理的效率。相比手动抽取数据,模型可以快速处理大量的HTML文档,节省时间和人力成本。Improve data management efficiency: The automatic data extraction process can significantly improve the efficiency of data management. Compared with manual data extraction, the model can quickly process a large number of HTML documents, saving time and labor costs.

提升数据质量:准确抽取待填充数据有助于提高数据质量。高质量的数据对于准确的分析和决策至关重要,可以减少错误和不确定性。Improve data quality: Accurately extracting the data to be filled helps improve data quality. High-quality data is essential for accurate analysis and decision-making, and can reduce errors and uncertainties.

提供全面的信息视角:多种任务的抽取结果可以提供更全面的信息视角。例如,命名实体识别可以识别出重要的实体,关系抽取可以揭示它们之间的关系,事件抽取可以了解相关事件,属性情感抽取可以获取对事物的评价和情感倾向。Providing a comprehensive information perspective: The extraction results of multiple tasks can provide a more comprehensive information perspective. For example, named entity recognition can identify important entities, relationship extraction can reveal the relationship between them, event extraction can understand related events, and attribute sentiment extraction can obtain the evaluation and emotional tendency of things.

进一步的,所述通过预先训练的信息抽取模型,对所述未填充预设字段进行数据抽取,得到所述未填充预设字段的待填充数据前,所述方法还包括:Furthermore, before extracting data from the unfilled preset fields by using a pre-trained information extraction model to obtain the data to be filled in the unfilled preset fields, the method further includes:

通过Python的bs4模块中的BeautifulSoup类移除所述待评估招标填充数据中的script标签内容与style标签内容,并移除其他标签的class属性、id属性。The BeautifulSoup class in the bs4 module of Python is used to remove the script tag content and the style tag content in the tender filling data to be evaluated, and remove the class attributes and id attributes of other tags.

需要说明的是,本说明书实施例通过上述内容,具有下述有益效果:It should be noted that the embodiments of this specification have the following beneficial effects through the above contents:

提高数据清洁度:移除ˋscriptˋ标签内容和ˋstyleˋ标签内容可以减少无关信息的干扰,提高数据的清洁度和可读性。这些标签通常包含与页面布局和样式相关的信息,对于数据抽取和分析来说并非关键。Improve data cleanliness: Removing the content of the script and style tags can reduce the interference of irrelevant information and improve the cleanliness and readability of the data. These tags usually contain information related to page layout and style, which is not critical for data extraction and analysis.

减少数据冗余:移除其他标签的ˋclassˋ属性和ˋidˋ属性可以减少数据的冗余性。这些属性通常用于页面的样式设计和交互控制,对于数据本身的含义影响较小。Reduce data redundancy: Removing the class and id attributes of other tags can reduce data redundancy. These attributes are usually used for page style design and interactive control, and have little impact on the meaning of the data itself.

提升模型效率:清理和简化HTML数据可以提高信息抽取模型的效率和准确性。模型不需要处理不必要的标签属性,从而能够更专注于抽取有意义的内容。Improve model efficiency: Cleaning and simplifying HTML data can improve the efficiency and accuracy of information extraction models. The model does not need to process unnecessary tag attributes, so it can focus more on extracting meaningful content.

增强数据一致性:通过统一移除特定的标签内容和属性,有助于确保数据的一致性和标准化。这对于后续的数据分析和处理非常重要,可以减少因不同标签结构而产生的差异。Enhanced data consistency: By uniformly removing specific tag content and attributes, it helps ensure data consistency and standardization. This is very important for subsequent data analysis and processing, and can reduce differences caused by different tag structures.

便于数据解析:简化后的HTML数据更易于解析和处理。去除无关的标签和属性可以使数据更易于被其他工具和算法理解和操作。Easier data parsing: Simplified HTML data is easier to parse and process. Removing irrelevant tags and attributes can make the data easier to understand and operate by other tools and algorithms.

提高信息抽取准确性:减少干扰因素能够提高信息抽取模型对关键信息的提取能力,从而提高抽取结果的准确性。Improve the accuracy of information extraction: Reducing interference factors can improve the information extraction model's ability to extract key information, thereby improving the accuracy of the extraction results.

降低模型学习难度:清理数据可以降低信息抽取模型的学习难度。减少不必要的噪音和干扰可以使模型更容易学习到有意义的特征和模式。Reduce the difficulty of model learning: Cleaning data can reduce the difficulty of learning information extraction models. Reducing unnecessary noise and interference can make it easier for the model to learn meaningful features and patterns.

进一步的,所述对所述第一治理数据中的填充数据进行正则治理,得到第二治理数据,包括:Furthermore, performing regularization on the padding data in the first governance data to obtain second governance data includes:

通过正则表达式对所述第一治理数据中的填充数据进行地址拆分、金额提取、日期标准化、手机号码提取,以及预设符号的过滤,得到第二治理数据。The fill data in the first governance data is subjected to address splitting, amount extraction, date standardization, mobile phone number extraction, and preset symbol filtering by regular expressions to obtain second governance data.

需要说明的是,本说明书实施例通过上述内容,具有下述有益效果:It should be noted that the embodiments of this specification have the following beneficial effects through the above contents:

提高数据质量:通过正则治理,可以去除数据中的噪声、错误和不一致性,提高数据的准确性和可靠性。Improve data quality: Through regular governance, noise, errors and inconsistencies in data can be removed to improve data accuracy and reliability.

提取关键信息:正则表达式可以用于地址拆分、金额提取、日期标准化、手机号码提取等,帮助从数据中快速准确地提取出关键信息。Extract key information: Regular expressions can be used for address splitting, amount extraction, date standardization, mobile phone number extraction, etc., to help quickly and accurately extract key information from the data.

数据标准化:日期标准化和预设符号的过滤有助于将数据转换为统一的格式,使其更易于分析和比较。Data Standardization: Date standardization and filtering with preset symbols help convert data into a uniform format, making it easier to analyze and compare.

增强数据一致性:正则治理可以确保填充数据遵循一定的规则和模式,从而增强数据的一致性和可重复性。Enhanced data consistency: Regular governance can ensure that the populated data follows certain rules and patterns, thereby enhancing the consistency and repeatability of the data.

简化数据处理:处理后的第二治理数据更加规范和简洁,便于后续的数据分析、存储和传输。Simplified data processing: The processed second governance data is more standardized and concise, which is convenient for subsequent data analysis, storage and transmission.

提高数据可用性:通过提高数据质量和一致性,正则治理使得数据更有价值,能够更好地支持决策制定和业务流程。Improve data availability: By improving data quality and consistency, formal governance makes data more valuable and better supports decision making and business processes.

进一步的,若所述第二治理数据中预设字段的准确率小于所述第二阈值,所述方法还包括:Furthermore, if the accuracy of the preset field in the second governance data is less than the second threshold, the method further includes:

通过预设大模型对所述第二治理数据中的填充数据进行信息提取,得到第三治理数据;Extracting information from the filling data in the second governance data by using a preset large model to obtain third governance data;

若所述第三治理数据中预设字段的准确率不小于所述第二阈值,得到符合条件的招标数据。If the accuracy of the preset field in the third governance data is not less than the second threshold, qualified bidding data is obtained.

需要说明的是,本说明书实施例通过上述内容,具有下述有益效果:It should be noted that the embodiments of this specification have the following beneficial effects through the above contents:

提高数据治理的准确性:通过预设大模型进行信息提取,可以进一步提高数据治理的准确性。大模型通常具有更强大的语言理解和信息提取能力,能够更全面、准确地从填充数据中提取所需信息。Improve the accuracy of data governance: By presetting large models for information extraction, the accuracy of data governance can be further improved. Large models usually have more powerful language understanding and information extraction capabilities, and can extract the required information from the populated data more comprehensively and accurately.

减少误判和遗漏:在第二治理数据中预设字段准确率小于阈值的情况下,使用大模型进行进一步处理可以减少误判和遗漏的发生。大模型可以利用其学习到的模式和知识,更好地处理复杂和模糊的信息,从而提高数据治理的质量。Reduce misjudgments and omissions: When the accuracy of the preset field in the second governance data is less than the threshold, using a large model for further processing can reduce the occurrence of misjudgments and omissions. The large model can use its learned patterns and knowledge to better handle complex and ambiguous information, thereby improving the quality of data governance.

提供更符合条件的招标数据:通过对第二治理数据进行信息提取,得到的第三治理数据中预设字段的准确率不小于第二阈值,这意味着得到的数据更符合招标条件和要求。这有助于提高招标数据的质量和可靠性,为招标决策提供更有力的支持。Providing bidding data that is more in line with the conditions: By extracting information from the second governance data, the accuracy of the preset fields in the third governance data obtained is not less than the second threshold, which means that the obtained data is more in line with the bidding conditions and requirements. This helps to improve the quality and reliability of bidding data and provide stronger support for bidding decisions.

增强数据的可用性和利用价值:准确和符合要求的招标数据具有更高的可用性和利用价值。这些数据可以用于进一步的分析、比较和决策,帮助招标方更好地了解潜在供应商的情况,做出明智的选择。Enhance the availability and value of data: Accurate and compliant bidding data has higher availability and value. This data can be used for further analysis, comparison and decision-making, helping the tenderer to better understand the situation of potential suppliers and make wise choices.

提高招标流程的效率和效果:准确的招标数据可以减少在招标过程中因数据错误或不准确而导致的延误和问题。同时,符合条件的数据也能够更好地满足招标需求,提高招标流程的效果和效率。Improve the efficiency and effectiveness of the bidding process: Accurate bidding data can reduce delays and problems caused by wrong or inaccurate data during the bidding process. At the same time, qualified data can better meet bidding needs and improve the effectiveness and efficiency of the bidding process.

降低风险和成本:通过确保招标数据的准确性和符合要求,可以降低因错误数据而导致的风险和成本。例如,避免与不符合条件的供应商进行合作,减少潜在的法律风险和经济损失。Reduce risks and costs: By ensuring the accuracy and compliance of bidding data, the risks and costs caused by incorrect data can be reduced. For example, avoiding cooperation with unqualified suppliers can reduce potential legal risks and economic losses.

改进数据治理流程和方法:对数据治理过程中的不足进行识别和改进,有助于不断优化数据治理流程和方法。通过引入大模型等先进技术,可以提升数据治理的能力和水平,适应不断变化的业务需求和数据挑战。Improve data governance processes and methods: Identifying and improving deficiencies in the data governance process will help to continuously optimize data governance processes and methods. By introducing advanced technologies such as big models, the ability and level of data governance can be improved to adapt to changing business needs and data challenges.

进一步的,所述通过预设大模型对所述第二治理数据中的填充数据进行信息提取,得到第三治理数据,包括:Furthermore, extracting information from the fill data in the second governance data by using a preset large model to obtain third governance data includes:

定义所述大模型的角色,提供所述第二治理数据中信息提取的任务描述;Define the role of the large model and provide a task description for information extraction from the second governance data;

根据所述大模型依据所述第二治理数据中信息提取的任务描述,对所述第二治理数据中的填充数据进行信息提取,得到第三治理数据。According to the task description of extracting information from the second governance data based on the large model, information is extracted from the filling data in the second governance data to obtain third governance data.

需要说明的是,本说明书实施例通过上述内容,具有下述有益效果:It should be noted that the embodiments of this specification have the following beneficial effects through the above contents:

提高信息提取的准确性:通过明确大模型的角色和提供详细的任务描述,大模型可以更好地理解要提取的信息类型和上下文。这有助于提高提取结果的准确性,减少误判和遗漏。Improve the accuracy of information extraction: By clarifying the role of the big model and providing a detailed task description, the big model can better understand the type and context of information to be extracted. This helps improve the accuracy of the extraction results and reduce misjudgments and omissions.

降低人工干预:定义大模型的角色可以使信息提取过程更加自动化,减少对人工干预的需求。这可以提高工作效率,节省时间和人力资源。Reduced manual intervention: Defining the role of a large model can make the information extraction process more automated, reducing the need for manual intervention. This can improve work efficiency and save time and human resources.

提高数据质量:准确的信息提取有助于提高第三治理数据的质量。这可以使后续的数据分析、决策制定和业务流程更加可靠和有效。Improve data quality: Accurate information extraction helps improve the quality of third-party governance data. This can make subsequent data analysis, decision-making, and business processes more reliable and effective.

支持更好的决策制定:通过提供更准确和详细的信息,第三治理数据可以为决策制定提供更有力的支持。帮助决策者做出更明智的选择,并减少基于不准确或不完整信息做出决策的风险。Support better decision making: By providing more accurate and detailed information, third-party governance data can provide stronger support for decision making, helping decision makers make more informed choices and reduce the risk of making decisions based on inaccurate or incomplete information.

增强模型的可解释性:明确大模型的角色和任务描述有助于增强模型的可解释性。这可以使人们更好地理解模型如何做出决策和提取信息,增加对模型结果的信任和理解。Enhance model interpretability: Clarifying the role and task description of large models helps enhance model interpretability. This allows people to better understand how the model makes decisions and extracts information, increasing trust and understanding of the model results.

适应不同的数据需求:通过灵活定义大模型的角色和任务描述,可以适应不同的信息提取需求和数据特点。这使得该方法更具通用性和可扩展性。Adapt to different data requirements: By flexibly defining the roles and task descriptions of the large model, it can adapt to different information extraction requirements and data characteristics. This makes the method more general and scalable.

本说明书一个或多个实施例提供的一种招标数据填充治理装置,包括:One or more embodiments of this specification provide a bidding data filling management device, including:

数据填充单元,根据预设字段对招标源数据进行数据填充,得到待评估招标填充数据;A data filling unit fills the tender source data according to preset fields to obtain tender filling data to be evaluated;

未填充字段确定单元,若所述待评估招标填充数据中所述预设字段的填充率小于第一阈值,确定所述待评估招标填充数据的未填充预设字段;an unfilled field determining unit, which determines unfilled preset fields of the tender filling data to be evaluated if the filling rate of the preset fields in the tender filling data to be evaluated is less than a first threshold;

第一治理单元,将所述未填充预设字段输入预先训练的信息抽取模型,得到第一治理数据;A first management unit inputs the unfilled preset field into a pre-trained information extraction model to obtain first management data;

准确率判断单元,若所述第一治理数据中预设字段的填充率不小于第一阈值,判断所述第一治理数据中预设字段的准确率是否小于第二阈值;an accuracy judgment unit, which judges whether the accuracy of the preset field in the first governance data is less than a second threshold value if the filling rate of the preset field in the first governance data is not less than a first threshold value;

第二治理单元,若所述第一治理数据中预设字段的准确率小于所述第二阈值,对所述第一治理数据中的填充数据进行正则治理,得到第二治理数据;A second governance unit, if the accuracy of the preset field in the first governance data is less than the second threshold, performs regularization governance on the padding data in the first governance data to obtain second governance data;

招标数据确定单元,若所述第二治理数据中预设字段的准确率不小于所述第二阈值,得到符合条件的招标数据。The bidding data determining unit obtains bidding data that meets the conditions if the accuracy of the preset field in the second governance data is not less than the second threshold.

本说明书一个或多个实施例提供的一种招标数据填充治理设备,包括:One or more embodiments of this specification provide a bidding data filling management device, including:

至少一个处理器;以及,at least one processor; and,

与所述至少一个处理器通信连接的存储器;其中,a memory communicatively connected to the at least one processor; wherein,

所述存储器存储有可被所述至少一个处理器执行的指令,所述指令被所述至少一个处理器执行,以使所述至少一个处理器能够:The memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to:

根据预设字段对招标源数据进行数据填充,得到待评估招标填充数据;Fill the tender source data according to the preset fields to obtain the tender filling data to be evaluated;

若所述待评估招标填充数据中所述预设字段的填充率小于第一阈值,确定所述待评估招标填充数据的未填充预设字段;If the filling rate of the preset field in the tender filling data to be evaluated is less than a first threshold, determining the unfilled preset field of the tender filling data to be evaluated;

将所述未填充预设字段输入预先训练的信息抽取模型,得到第一治理数据;Inputting the unfilled preset fields into a pre-trained information extraction model to obtain first governance data;

若所述第一治理数据中预设字段的填充率不小于第一阈值,判断所述第一治理数据中预设字段的准确率是否小于第二阈值;If the filling rate of the preset field in the first governance data is not less than the first threshold, determining whether the accuracy rate of the preset field in the first governance data is less than the second threshold;

若所述第一治理数据中预设字段的准确率小于所述第二阈值,对所述第一治理数据中的填充数据进行正则治理,得到第二治理数据;If the accuracy of the preset field in the first managed data is less than the second threshold, regularization management is performed on the padding data in the first managed data to obtain second managed data;

若所述第二治理数据中预设字段的准确率不小于所述第二阈值,得到符合条件的招标数据。If the accuracy of the preset field in the second governance data is not less than the second threshold, qualified bidding data is obtained.

本说明书一个或多个实施例提供的一种非易失性计算机存储介质,存储有计算机可执行指令,所述计算机可执行指令被计算机执行时能够实现:One or more embodiments of this specification provide a non-volatile computer storage medium storing computer executable instructions, which can achieve the following when executed by a computer:

根据预设字段对招标源数据进行数据填充,得到待评估招标填充数据;Fill the tender source data according to the preset fields to obtain the tender filling data to be evaluated;

若所述待评估招标填充数据中所述预设字段的填充率小于第一阈值,确定所述待评估招标填充数据的未填充预设字段;If the filling rate of the preset field in the tender filling data to be evaluated is less than a first threshold, determining the unfilled preset field of the tender filling data to be evaluated;

将所述未填充预设字段输入预先训练的信息抽取模型,得到第一治理数据;Inputting the unfilled preset fields into a pre-trained information extraction model to obtain first governance data;

若所述第一治理数据中预设字段的填充率不小于第一阈值,判断所述第一治理数据中预设字段的准确率是否小于第二阈值;If the filling rate of the preset field in the first governance data is not less than the first threshold, determining whether the accuracy rate of the preset field in the first governance data is less than the second threshold;

若所述第一治理数据中预设字段的准确率小于所述第二阈值,对所述第一治理数据中的填充数据进行正则治理,得到第二治理数据;If the accuracy of the preset field in the first managed data is less than the second threshold, regularization management is performed on the padding data in the first managed data to obtain second managed data;

若所述第二治理数据中预设字段的准确率不小于所述第二阈值,得到符合条件的招标数据。If the accuracy of the preset field in the second governance data is not less than the second threshold, qualified bidding data is obtained.

本说明书实施例采用的上述至少一个技术方案能够达到以下有益效果:At least one of the above technical solutions adopted in the embodiments of this specification can achieve the following beneficial effects:

提高数据填充的效率和质量:通过信息抽取模型和正则治理,可以自动处理大量的数据,避免人工处理时可能出现的错误和低效率。Improve the efficiency and quality of data filling: Through information extraction models and regular governance, large amounts of data can be processed automatically, avoiding errors and inefficiencies that may occur during manual processing.

确保数据的准确性和完整性:通过对填充数据进行准确率的判断和正则治理,可以提高数据的准确性和完整性,确保数据的质量符合要求。Ensure the accuracy and completeness of data: By judging the accuracy of filled data and regularizing it, the accuracy and completeness of data can be improved to ensure that the quality of data meets the requirements.

促进市场信息流动和高效使用:通过对招标公告信息进行高质量的数据治理,能够更好地促进市场信息流动和高效使用,提高市场交易效率。Promote the flow and efficient use of market information: Through high-quality data governance of tender announcement information, it is possible to better promote the flow and efficient use of market information and improve market transaction efficiency.

降低人工成本和工作量:避免了传统的人工识别处理方法,减少了人工成本和工作量,提高了工作效率。Reduce labor costs and workload: Avoid traditional manual identification processing methods, reduce labor costs and workload, and improve work efficiency.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

为了更清楚地说明本说明书实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本说明书中记载的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动性的前提下,还可以根据这些附图获得其他的附图。在附图中:In order to more clearly illustrate the technical solutions in the embodiments of this specification or the prior art, the following briefly introduces the drawings required for use in the embodiments or the prior art description. Obviously, the drawings described below are only some embodiments recorded in this specification. For ordinary technicians in this field, other drawings can be obtained based on these drawings without creative labor. In the drawings:

图1为本说明书一个或多个实施例提供的一种招标数据填充治理方法的流程示意图;FIG1 is a flow chart of a bidding data filling management method provided by one or more embodiments of this specification;

图2为本说明书一个或多个实施例提供的结构化的治理流程示意图;FIG2 is a schematic diagram of a structured governance process provided by one or more embodiments of this specification;

图3为本说明书一个或多个实施例提供的一种招标数据填充治理装置的结构示意图;FIG3 is a schematic diagram of the structure of a bidding data filling and management device provided by one or more embodiments of this specification;

图4为本说明书一个或多个实施例提供的一种招标数据填充治理设备的结构示意图。FIG4 is a schematic diagram of the structure of a bidding data filling management device provided by one or more embodiments of this specification.

具体实施方式DETAILED DESCRIPTION

本说明书实施例提供一种招标数据填充治理方法、装置、设备及介质。The embodiments of this specification provide a bidding data filling management method, device, equipment and medium.

为了使本技术领域的人员更好地理解本说明书中的技术方案,下面将结合本说明书实施例中的附图,对本说明书实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本说明书一部分实施例,而不是全部的实施例。基于本说明书实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都应当属于本说明书保护的范围。In order to enable those skilled in the art to better understand the technical solutions in this specification, the technical solutions in the embodiments of this specification will be clearly and completely described below in conjunction with the drawings in the embodiments of this specification. Obviously, the described embodiments are only part of the embodiments of this specification, not all of the embodiments. Based on the embodiments of this specification, all other embodiments obtained by ordinary technicians in this field without creative work should fall within the scope of protection of this specification.

图1为本说明书一个或多个实施例提供的一种招标数据填充治理方法的流程示意图,该流程可以由招标数据填充治理系统执行。流程中的某些输入参数或者中间结果允许人工干预调节,以帮助提高准确性。Figure 1 is a flow chart of a bidding data filling management method provided by one or more embodiments of this specification, which can be executed by a bidding data filling management system. Certain input parameters or intermediate results in the process allow manual intervention and adjustment to help improve accuracy.

本说明书实施例的方法流程步骤如下:The method steps of the embodiment of this specification are as follows:

S102,根据预设字段对招标源数据进行数据填充,得到待评估招标填充数据。S102, filling the bidding source data according to the preset fields to obtain the bidding filling data to be evaluated.

在本说明书实施例中,关于上述的内容,可以通过下述具体实施方案:In the examples of this specification, the above contents can be implemented by the following specific implementation schemes:

确定预设字段:明确要填充的预设字段。这些字段可能是招标项目中特定的信息,如项目名称、招标编号、预算金额、招标截止日期等,具体的预设字段可以根据实际需求进行设定。Determine the preset fields: Specify the preset fields to be filled in. These fields may be specific information in the bidding project, such as project name, bidding number, budget amount, bidding deadline, etc. The specific preset fields can be set according to actual needs.

收集招标源数据:获取包含相关信息的招标源数据。这可以是从招标文件、电子表格、数据库或其他数据源中提取的数据。Collect tender source data: Obtain tender source data containing relevant information. This can be data extracted from tender documents, spreadsheets, databases or other data sources.

提取源数据中的字段值:根据预设字段,从招标源数据中提取相应字段的值。可以使用数据提取工具、编程语言或数据库查询语句来完成这一步骤。Extract field values from source data: Extract the values of corresponding fields from the bidding source data according to the preset fields. This step can be completed using data extraction tools, programming languages, or database query statements.

数据清洗和预处理:对提取到的值进行清洗和预处理,例如处理空值、无效值等情况。确保提取到的值是准确和可用的。Data cleaning and preprocessing: Clean and preprocess the extracted values, such as handling null values, invalid values, etc. Ensure that the extracted values are accurate and usable.

填充预设字段:将预处理后的值对应地填充到预留的预设字段中。可以使用数据插入语句、数据更新操作或其他适当的方法将值域填充到指定的字段中。Fill preset fields: Fill the preprocessed values into the reserved preset fields accordingly. You can use data insert statements, data update operations, or other appropriate methods to fill the value domain into the specified field.

S104,若所述待评估招标填充数据中所述预设字段的填充率小于第一阈值,确定所述待评估招标填充数据的未填充预设字段。S104: If the filling rate of the preset fields in the tender filling data to be evaluated is less than a first threshold, determine unfilled preset fields of the tender filling data to be evaluated.

在本说明书实施例中,关于上述的内容,可以通过下述具体实施方案:In the examples of this specification, the above contents can be implemented by the following specific implementation schemes:

确定第一阈值:根据实际情况,确定一个合理的第一阈值。这个阈值可以基于业务需求、数据质量要求或其他相关因素来设定,比如,第一阈值可以设定为70%。Determine the first threshold: Determine a reasonable first threshold based on actual conditions. This threshold can be set based on business needs, data quality requirements or other relevant factors. For example, the first threshold can be set to 70%.

字段填充率计算:对于每个预设字段,计算其在待评估招标填充数据中的填充率。填充率的计算可以通过统计该字段中有实际值的记录数量与总记录数量的比例来完成。Field fill rate calculation: For each preset field, calculate its fill rate in the tender fill data to be evaluated. The fill rate calculation can be completed by counting the ratio of the number of records with actual values in the field to the total number of records.

比较填充率与第一阈值:将每个预设字段的填充率与第一阈值进行比较。如果填充率小于第一阈值,则确定该预设字段为未填充预设字段。Comparing the filling rate with the first threshold: comparing the filling rate of each preset field with the first threshold. If the filling rate is less than the first threshold, determining that the preset field is an unfilled preset field.

未填充预设字段确定:未填充预设字段指的是在数据填充过程中,那些没有被实际数据填充的预设字段。当进行数据填充时,如果某些预设字段没有收到相应的数据值,就会形成未填充预设字段。这可能是因为数据来源中缺少相关信息,或者在填充过程中出现了错误或遗漏,后续的处理主要是针对“在填充过程中出现了错误或遗漏”这一情况。Unfilled preset fields determination: Unfilled preset fields refer to those preset fields that are not filled with actual data during the data filling process. When filling data, if some preset fields do not receive corresponding data values, unfilled preset fields will be formed. This may be due to the lack of relevant information in the data source, or errors or omissions in the filling process. Subsequent processing is mainly for the situation of "errors or omissions in the filling process".

S106,将所述未填充预设字段输入预先训练的信息抽取模型,得到第一治理数据。S106, inputting the unfilled preset fields into a pre-trained information extraction model to obtain first governance data.

在本说明书实施例中,可以先通过预先训练的信息抽取模型,对所述未填充预设字段进行数据抽取,得到所述未填充预设字段的待填充数据;再根据所述待填充数据对所述未填充预设字段进行数据填充,得到第一治理数据。In an embodiment of the present specification, data can be first extracted from the unfilled preset fields through a pre-trained information extraction model to obtain the data to be filled in the unfilled preset fields; then, data is filled in the unfilled preset fields according to the data to be filled to obtain the first governance data.

在本说明书实施例中,关于上述的内容,可以通过下述具体实施方案:In the examples of this specification, the above contents can be implemented by the following specific implementation schemes:

模型训练:Model training:

-选择适合的数据抽取任务的信息抽取模型,例如基于神经网络的模型。-Choose an information extraction model suitable for the data extraction task, such as a neural network-based model.

-使用已有的标注数据对模型进行训练,以提高模型的准确性和泛化能力。- Use existing labeled data to train the model to improve the accuracy and generalization ability of the model.

-调整模型的超参数,如学习率、层数、节点数等,以优化模型性能。- Adjust model hyperparameters such as learning rate, number of layers, number of nodes, etc. to optimize model performance.

数据抽取:Data extraction:

-将未填充预设字段输入到预先训练的信息抽取模型中。- Input the unpopulated preset fields into the pre-trained information extraction model.

-模型对待评估招标填充数据进行分析和抽取,得到未填充预设字段的待填充数据。-The model analyzes and extracts the data to be filled in the tender to be evaluated, and obtains the data to be filled in the preset fields that have not been filled.

数据填充:Data filling:

-将待填充数据与未填充预设字段进行匹配和关联。-Match and associate the data to be filled with the unfilled preset fields.

-使用合适的填充方法,将待填充数据填充到未填充预设字段中。-Use the appropriate filling method to fill the data to be filled into the unfilled preset fields.

-可以采用规则、机器学习算法等方式进行填充。-It can be filled using rules, machine learning algorithms, etc.

需要说明的是,本说明书实施例通过上述内容,具有下述有益效果:It should be noted that the embodiments of this specification have the following beneficial effects through the above contents:

提高数据填充的准确性:通过预先训练的信息抽取模型,可以更准确地抽取未填充预设字段的待填充数据,从而提高数据填充的准确性。Improve the accuracy of data filling: Through the pre-trained information extraction model, the data to be filled in the preset fields that are not filled can be extracted more accurately, thereby improving the accuracy of data filling.

减少人工干预:自动进行数据抽取和填充,减少了人工干预的需求,提高了工作效率。Reduce manual intervention: Automatic data extraction and filling reduces the need for manual intervention and improves work efficiency.

快速处理大量数据:可以快速处理大量的未填充预设字段,提高数据处理的效率。Quickly process large amounts of data: A large number of unpopulated preset fields can be quickly processed to improve the efficiency of data processing.

进一步的,所述待评估招标填充数据可以为HTML格式文档,通过预先训练的信息抽取模型,对所述未填充预设字段进行数据抽取,得到所述未填充预设字段的待填充数据时,可以将所述待评估招标填充数据输入所述信息抽取模型,通过指针网络对所述未填充预设字段所对应的信息进行片段抽取,以实现命名实体识别、关系抽取、事件抽取,以及属性情感抽取。Furthermore, the tender filling data to be evaluated can be a document in HTML format. Through a pre-trained information extraction model, data is extracted from the unfilled preset fields. When the data to be filled in the unfilled preset fields is obtained, the tender filling data to be evaluated can be input into the information extraction model, and the information corresponding to the unfilled preset fields can be fragmented through a pointer network to realize named entity recognition, relationship extraction, event extraction, and attribute sentiment extraction.

需要说明的是,关于上述内容,可以通过下述具体实施方案:It should be noted that the above content can be implemented through the following specific implementation plans:

一、目标1. Objectives

通过预先训练的信息抽取模型,对未填充预设字段的待评估招标填充数据进行数据抽取,实现命名实体识别、关系抽取、事件抽取以及属性情感抽取。Through the pre-trained information extraction model, data extraction is performed on the bidding data to be evaluated that has not filled in the preset fields, realizing named entity recognition, relationship extraction, event extraction and attribute sentiment extraction.

二、数据准备2. Data Preparation

1.需要评估招标的文档,格式为HTML。1. Documents required for tender evaluation, in HTML format.

2.提取包含待填充预设字段的文本数据。2. Extract text data containing preset fields to be filled.

三、模型训练3. Model Training

1.使用大量的HTML格式文本数据对信息抽取模型进行训练。1. Use a large amount of HTML formatted text data to train the information extraction model.

2.训练方法可采用监督学习、无监督学习或强化学习等。2. The training method can be supervised learning, unsupervised learning or reinforcement learning.

四、数据抽取4. Data Extraction

1.对待评估招标填充数据进行解析。1. Analyze the data filled in the bidding to be evaluated.

2.将解析后的文本输入信息抽取模型。2. Input the parsed text into the information extraction model.

3.通过指针网络对未填充预设字段所对应的信息进行片段抽取。3. Extract fragments of information corresponding to unfilled preset fields through the pointer network.

五、抽取类型5. Extraction Type

1.命名实体识别:识别出文本中的人名、地名等实体。1. Named entity recognition: Identify entities such as names of people and places in the text.

2.关系抽取:抽取文本中不同实体之间的关系。2. Relationship extraction: extract the relationship between different entities in the text.

3.事件抽取:提取文本中的事件信息。3. Event extraction: extract event information from text.

4.属性情感抽取:获取文本中关于属性的情感信息。4. Attribute sentiment extraction: obtain sentiment information about attributes in the text.

六、输出待填充数据6. Output the data to be filled

将抽取出的待填充数据进行整理和输出,以便后续进行其他操作。The extracted data to be filled will be sorted and output for subsequent other operations.

需要说明的是,本说明书实施例通过上述内容,具有下述有益效果:It should be noted that the embodiments of this specification have the following beneficial effects through the above contents:

提高数据抽取的准确性:使用预先训练的信息抽取模型和指针网络,能够更准确地抽取未填充预设字段所对应的信息。这些模型经过训练,可以学习到HTML格式文档中的语义和结构特征,从而更好地识别和提取所需的数据。Improve the accuracy of data extraction: Using pre-trained information extraction models and pointer networks, you can more accurately extract information corresponding to unfilled preset fields. These models are trained to learn the semantic and structural features of HTML formatted documents to better identify and extract the required data.

实现多任务数据抽取:通过命名实体识别、关系抽取、事件抽取和属性情感抽取等多种任务,可以获取更丰富和全面的信息。这有助于深入了解待评估招标填充数据中的各个方面,为后续的分析和决策提供更有力的支持。Implement multi-task data extraction: Through multiple tasks such as named entity recognition, relationship extraction, event extraction, and attribute sentiment extraction, richer and more comprehensive information can be obtained. This helps to gain a deeper understanding of all aspects of the tender filling data to be evaluated, and provides stronger support for subsequent analysis and decision-making.

适应HTML格式:由于待评估招标填充数据为HTML格式文档,这种方法能够直接处理这种常见的网页格式。无需进行额外的格式转换或预处理,减少了数据处理的复杂性和工作量。Adapt to HTML format: Since the data filled in the tender to be evaluated is in HTML format, this method can directly process this common web page format. No additional format conversion or preprocessing is required, which reduces the complexity and workload of data processing.

提高数据治理效率:自动的数据抽取过程能够显著提高数据治理的效率。相比手动抽取数据,模型可以快速处理大量的HTML文档,节省时间和人力成本。Improve data management efficiency: The automatic data extraction process can significantly improve the efficiency of data management. Compared with manual data extraction, the model can quickly process a large number of HTML documents, saving time and labor costs.

提升数据质量:准确抽取待填充数据有助于提高数据质量。高质量的数据对于准确的分析和决策至关重要,可以减少错误和不确定性。Improve data quality: Accurately extracting the data to be filled helps improve data quality. High-quality data is essential for accurate analysis and decision-making, and can reduce errors and uncertainties.

提供全面的信息视角:多种任务的抽取结果可以提供更全面的信息视角。例如,命名实体识别可以识别出重要的实体,关系抽取可以揭示它们之间的关系,事件抽取可以了解相关事件,属性情感抽取可以获取对事物的评价和情感倾向。Providing a comprehensive information perspective: The extraction results of multiple tasks can provide a more comprehensive information perspective. For example, named entity recognition can identify important entities, relationship extraction can reveal the relationship between them, event extraction can understand related events, and attribute sentiment extraction can obtain the evaluation and emotional tendency of things.

进一步的,所述通过预先训练的信息抽取模型,对所述未填充预设字段进行数据抽取,得到所述未填充预设字段的待填充数据前,可以通过Python的bs4模块中的BeautifulSoup类移除所述待评估招标填充数据中的script标签内容与style标签内容,并移除其他标签的class属性、id属性。Furthermore, before extracting data from the unfilled preset fields through the pre-trained information extraction model and obtaining the data to be filled in the unfilled preset fields, the script tag content and the style tag content in the tender filling data to be evaluated can be removed through the BeautifulSoup class in the bs4 module of Python, and the class attributes and id attributes of other tags can be removed.

需要说明的是,关于上述内容,可以通过下述具体实施方案:It should be noted that the above content can be implemented through the following specific implementation plans:

导入BeautifulSoup类:使用Python的ˋbs4ˋ模块中的BeautifulSoup类来解析和处理待评估招标填充数据。Import the BeautifulSoup class: Use the BeautifulSoup class in Python's bs4 module to parse and process the tender filling data to be evaluated.

读取数据:读取包含待评估招标填充数据的文件或字符串。Read Data: Reads a file or string containing the data populated by the tender to be evaluated.

解析数据:使用BeautifulSoup类的构造函数将数据解析为BeautifulSoup对象。Parse the data: Use the constructor of the BeautifulSoup class to parse the data into a BeautifulSoup object.

移除script标签内容:使用BeautifulSoup对象的方法找到所有的script标签,并使用ˋdecompose()ˋ方法移除它们。Remove script tag content: Use the BeautifulSoup object’s methods to find all script tags and remove them using the decompose() method.

移除style标签内容:使用类似的方法找到所有的style标签,并移除它们。Remove style tag content: Use a similar method to find all style tags and remove them.

移除其他标签的class属性和id属性:遍历所有其他标签,使用ˋdelˋ语句移除它们的class属性和id属性。Remove the class and id attributes of other tags: Traverse all other tags and use the del statement to remove their class and id attributes.

获取清理后的数据:将清理后的BeautifulSoup对象转换回字符串或文件,以便后续使用。Get the cleaned data: Convert the cleaned BeautifulSoup object back to a string or file for subsequent use.

需要说明的是,本说明书实施例通过上述内容,具有下述有益效果:It should be noted that the embodiments of this specification have the following beneficial effects through the above contents:

提高数据清洁度:移除ˋscriptˋ标签内容和ˋstyleˋ标签内容可以减少无关信息的干扰,提高数据的清洁度和可读性。这些标签通常包含与页面布局和样式相关的信息,对于数据抽取和分析来说并非关键。Improve data cleanliness: Removing the content of the script and style tags can reduce the interference of irrelevant information and improve the cleanliness and readability of the data. These tags usually contain information related to page layout and style, which is not critical for data extraction and analysis.

减少数据冗余:移除其他标签的ˋclassˋ属性和ˋidˋ属性可以减少数据的冗余性。这些属性通常用于页面的样式设计和交互控制,对于数据本身的含义影响较小。Reduce data redundancy: Removing the class and id attributes of other tags can reduce data redundancy. These attributes are usually used for page style design and interactive control, and have little impact on the meaning of the data itself.

提升模型效率:清理和简化HTML数据可以提高信息抽取模型的效率和准确性。模型不需要处理不必要的标签属性,从而能够更专注于抽取有意义的内容。Improve model efficiency: Cleaning and simplifying HTML data can improve the efficiency and accuracy of information extraction models. The model does not need to process unnecessary tag attributes, so it can focus more on extracting meaningful content.

增强数据一致性:通过统一移除特定的标签内容和属性,有助于确保数据的一致性和标准化。这对于后续的数据分析和处理非常重要,可以减少因不同标签结构而产生的差异。Enhanced data consistency: By uniformly removing specific tag content and attributes, it helps ensure data consistency and standardization. This is very important for subsequent data analysis and processing, and can reduce differences caused by different tag structures.

便于数据解析:简化后的HTML数据更易于解析和处理。去除无关的标签和属性可以使数据更易于被其他工具和算法理解和操作。Easier data parsing: Simplified HTML data is easier to parse and process. Removing irrelevant tags and attributes can make the data easier to understand and operate by other tools and algorithms.

提高信息抽取准确性:减少干扰因素能够提高信息抽取模型对关键信息的提取能力,从而提高抽取结果的准确性。Improve the accuracy of information extraction: Reducing interference factors can improve the information extraction model's ability to extract key information, thereby improving the accuracy of the extraction results.

降低模型学习难度:清理数据可以降低信息抽取模型的学习难度。减少不必要的噪音和干扰可以使模型更容易学习到有意义的特征和模式。Reduce the difficulty of model learning: Cleaning data can reduce the difficulty of learning information extraction models. Reducing unnecessary noise and interference can make it easier for the model to learn meaningful features and patterns.

S108,若所述第一治理数据中预设字段的填充率不小于第一阈值,判断所述第一治理数据中预设字段的准确率是否小于第二阈值。S108: If the filling rate of the preset field in the first governance data is not less than the first threshold, determine whether the accuracy of the preset field in the first governance data is less than the second threshold.

在本说明书实施例中,若第一治理数据中预设字段的填充率不小于第一阈值,则判断预设字段的准确率是否小于第二阈值,第二阈值可以根据实际情况进行设定,比如,第二阈值设定为80%。关于评估预设字段中数据的准确性的方法,可以通过与可靠的数据源进行比较或使用数据验证规则。确定准确率的计算方式,可以通过正确数据的数量与总数据数量的比例。In an embodiment of the present specification, if the fill rate of the preset field in the first governance data is not less than the first threshold, it is determined whether the accuracy of the preset field is less than the second threshold. The second threshold can be set according to the actual situation. For example, the second threshold is set to 80%. Regarding the method of evaluating the accuracy of the data in the preset field, it can be done by comparing with a reliable data source or using data verification rules. The calculation method of determining the accuracy rate can be determined by the ratio of the number of correct data to the total number of data.

S110,若所述第一治理数据中预设字段的准确率小于所述第二阈值,对所述第一治理数据中的填充数据进行正则治理,得到第二治理数据。S110: If the accuracy of the preset field in the first managed data is less than the second threshold, regularization management is performed on the padding data in the first managed data to obtain second managed data.

在本说明书实施例中,可以通过正则表达式对所述第一治理数据中的填充数据进行地址拆分、金额提取、日期标准化、手机号码提取,以及预设符号的过滤,得到第二治理数据。In the embodiment of the present specification, the fill-in data in the first governance data can be subjected to address splitting, amount extraction, date standardization, mobile phone number extraction, and preset symbol filtering through regular expressions to obtain the second governance data.

需要说明的是,关于上述内容,可以通过下述具体实施方案:It should be noted that the above content can be implemented through the following specific implementation plans:

使用正则表达式进行地址拆分:Use regular expressions to split addresses:

-定义地址的常见模式,例如包含省、市、区、街道等信息的模式。-Define common patterns for addresses, such as patterns that contain information such as province, city, district, street, etc.

-使用正则表达式匹配并提取地址中的各个部分。- Use regular expressions to match and extract individual parts of an address.

使用正则表达式进行金额提取:Use regular expressions to extract amounts:

-定义金额的常见格式,例如带有货币符号、小数点、千位分隔符等的格式。-Define common formats for amounts, such as formats with currency symbols, decimal points, thousands separators, etc.

-使用正则表达式匹配并提取金额字段中的数值部分。-Use regular expressions to match and extract the numeric portion of an amount field.

使用正则表达式进行日期标准化:Use regular expressions to normalize dates:

-定义日期的常见格式,例如年-月-日、月/日/年等。-Define common formats for dates, such as year-month-day, month/day/year, etc.

-使用正则表达式匹配并将日期转换为统一的格式,例如ISO格式。-Use regular expressions to match and convert dates to a uniform format, such as ISO format.

使用正则表达式进行手机号码提取:Use regular expressions to extract mobile phone numbers:

-定义手机号码的常见格式,例如11位数字。- Define common formats for mobile phone numbers, such as 11 digits.

-使用正则表达式匹配并提取手机号码字段。-Use regular expressions to match and extract the mobile number field.

使用正则表达式进行预设符号的过滤:Use regular expressions to filter preset symbols:

-定义需要过滤的符号列表。-Define the list of symbols that need to be filtered.

-使用正则表达式删除数据中的这些符号。- Use regular expressions to remove these symbols from your data.

最后,可以将上述处理后的结果组合成第二治理数据。Finally, the above processed results can be combined into the second governance data.

需要说明的是,本说明书实施例通过上述内容,具有下述有益效果:It should be noted that the embodiments of this specification have the following beneficial effects through the above contents:

提高数据质量:通过正则治理,可以去除数据中的噪声、错误和不一致性,提高数据的准确性和可靠性。Improve data quality: Through regular governance, noise, errors and inconsistencies in data can be removed to improve data accuracy and reliability.

提取关键信息:正则表达式可以用于地址拆分、金额提取、日期标准化、手机号码提取等,帮助从数据中快速准确地提取出关键信息。Extract key information: Regular expressions can be used for address splitting, amount extraction, date standardization, mobile phone number extraction, etc., to help quickly and accurately extract key information from the data.

数据标准化:日期标准化和预设符号的过滤有助于将数据转换为统一的格式,使其更易于分析和比较。Data Standardization: Date standardization and filtering with preset symbols help convert data into a uniform format, making it easier to analyze and compare.

增强数据一致性:正则治理可以确保填充数据遵循一定的规则和模式,从而增强数据的一致性和可重复性。Enhanced data consistency: Regular governance can ensure that the populated data follows certain rules and patterns, thereby enhancing the consistency and repeatability of the data.

简化数据处理:处理后的第二治理数据更加规范和简洁,便于后续的数据分析、存储和传输。Simplified data processing: The processed second governance data is more standardized and concise, which is convenient for subsequent data analysis, storage and transmission.

提高数据可用性:通过提高数据质量和一致性,正则治理使得数据更有价值,能够更好地支持决策制定和业务流程。Improve data availability: By improving data quality and consistency, formal governance makes data more valuable and better supports decision making and business processes.

S112,若所述第二治理数据中预设字段的准确率不小于所述第二阈值,得到符合条件的招标数据。S112: If the accuracy of the preset field in the second governance data is not less than the second threshold, qualified bidding data is obtained.

在本说明书实施例中,如果第二治理数据中预设字段的准确率不小于第二阈值,则认为该数据符合条件,即得到符合条件的招标数据。In the embodiment of the present specification, if the accuracy of the preset field in the second governance data is not less than the second threshold, the data is considered to meet the conditions, that is, qualified bidding data is obtained.

进一步的,若所述第二治理数据中预设字段的准确率小于所述第二阈值,可以通过预设大模型对所述第二治理数据中的填充数据进行信息提取,得到第三治理数据;若所述第三治理数据中预设字段的准确率不小于所述第二阈值,得到符合条件的招标数据。Furthermore, if the accuracy of the preset field in the second governance data is less than the second threshold, the preset large model can be used to extract information from the fill-in data in the second governance data to obtain third governance data; if the accuracy of the preset field in the third governance data is not less than the second threshold, qualified bidding data is obtained.

需要说明的是,关于上述内容,可以通过下述具体实施方案:It should be noted that the above content can be implemented through the following specific implementation plans:

利用预设大模型进行信息提取:如果第二治理数据中预设字段的准确率小于第二阈值,启用预设大模型。预设大模型使用机器学习或自然语言处理技术,对第二治理数据中的填充数据进行信息提取。提取的信息可能包括但不限于关键词、实体、关系等。Use the preset big model to extract information: If the accuracy of the preset field in the second governance data is less than the second threshold, the preset big model is enabled. The preset big model uses machine learning or natural language processing technology to extract information from the fill data in the second governance data. The extracted information may include but is not limited to keywords, entities, relationships, etc.

根据提取结果得到第三治理数据:将从预设大模型中提取的信息整合到第二治理数据中,形成第三治理数据。The third governance data is obtained according to the extraction result: the information extracted from the preset large model is integrated into the second governance data to form the third governance data.

检测第三治理数据中预设字段的准确率:对第三治理数据中预设字段的准确率进行再次检测,并将检测的准确率与第二阈值进行比较。Detecting the accuracy of the preset field in the third governance data: re-detecting the accuracy of the preset field in the third governance data, and comparing the detected accuracy with the second threshold.

确定符合条件的招标数据:若第三治理数据中预设字段的准确率不小于第二阈值,将这些数据标记为符合条件的招标数据。若第三治理数据中预设字段的准确率扔小于第二阈值,可以通过大模型重新进行数据治理,直到治理数据中预设字段的准确率不小于第二阈值为止。Determine qualified bidding data: If the accuracy of the preset fields in the third governance data is not less than the second threshold, mark these data as qualified bidding data. If the accuracy of the preset fields in the third governance data is still less than the second threshold, the data governance can be re-performed through the big model until the accuracy of the preset fields in the governance data is not less than the second threshold.

需要说明的是,本说明书实施例应用的的大模型,需要确保该大模型经过充分的训练和优化,以提高信息提取的准确性。It should be noted that the large model used in the embodiments of this specification needs to ensure that the large model has been fully trained and optimized to improve the accuracy of information extraction.

需要说明的是,本说明书实施例通过上述内容,具有下述有益效果:It should be noted that the embodiments of this specification have the following beneficial effects through the above contents:

提高数据治理的准确性:通过预设大模型进行信息提取,可以进一步提高数据治理的准确性。大模型通常具有更强大的语言理解和信息提取能力,能够更全面、准确地从填充数据中提取所需信息。Improve the accuracy of data governance: By presetting large models for information extraction, the accuracy of data governance can be further improved. Large models usually have more powerful language understanding and information extraction capabilities, and can extract the required information from the populated data more comprehensively and accurately.

减少误判和遗漏:在第二治理数据中预设字段准确率小于阈值的情况下,使用大模型进行进一步处理可以减少误判和遗漏的发生。大模型可以利用其学习到的模式和知识,更好地处理复杂和模糊的信息,从而提高数据治理的质量。Reduce misjudgments and omissions: When the accuracy of the preset field in the second governance data is less than the threshold, using a large model for further processing can reduce the occurrence of misjudgments and omissions. The large model can use its learned patterns and knowledge to better handle complex and ambiguous information, thereby improving the quality of data governance.

提供更符合条件的招标数据:通过对第二治理数据进行信息提取,得到的第三治理数据中预设字段的准确率不小于第二阈值,这意味着得到的数据更符合招标条件和要求。这有助于提高招标数据的质量和可靠性,为招标决策提供更有力的支持。Providing bidding data that is more in line with the conditions: By extracting information from the second governance data, the accuracy of the preset fields in the third governance data obtained is not less than the second threshold, which means that the obtained data is more in line with the bidding conditions and requirements. This helps to improve the quality and reliability of bidding data and provide stronger support for bidding decisions.

增强数据的可用性和利用价值:准确和符合要求的招标数据具有更高的可用性和利用价值。这些数据可以用于进一步的分析、比较和决策,帮助招标方更好地了解潜在供应商的情况,做出明智的选择。Enhance the availability and value of data: Accurate and compliant bidding data has higher availability and value. This data can be used for further analysis, comparison and decision-making, helping the tenderer to better understand the situation of potential suppliers and make wise choices.

提高招标流程的效率和效果:准确的招标数据可以减少在招标过程中因数据错误或不准确而导致的延误和问题。同时,符合条件的数据也能够更好地满足招标需求,提高招标流程的效果和效率。Improve the efficiency and effectiveness of the bidding process: Accurate bidding data can reduce delays and problems caused by wrong or inaccurate data during the bidding process. At the same time, qualified data can better meet bidding needs and improve the effectiveness and efficiency of the bidding process.

降低风险和成本:通过确保招标数据的准确性和符合要求,可以降低因错误数据而导致的风险和成本。例如,避免与不符合条件的供应商进行合作,减少潜在的法律风险和经济损失。Reduce risks and costs: By ensuring the accuracy and compliance of bidding data, the risks and costs caused by incorrect data can be reduced. For example, avoiding cooperation with unqualified suppliers can reduce potential legal risks and economic losses.

改进数据治理流程和方法:对数据治理过程中的不足进行识别和改进,有助于不断优化数据治理流程和方法。通过引入大模型等先进技术,可以提升数据治理的能力和水平,适应不断变化的业务需求和数据挑战。Improve data governance processes and methods: Identifying and improving deficiencies in the data governance process will help to continuously optimize data governance processes and methods. By introducing advanced technologies such as big models, the ability and level of data governance can be improved to adapt to changing business needs and data challenges.

进一步的,所述通过预设大模型对所述第二治理数据中的填充数据进行信息提取,得到第三治理数据时,可以先定义所述大模型的角色,提供所述第二治理数据中信息提取的任务描述;再根据所述大模型依据所述第二治理数据中信息提取的任务描述,对所述第二治理数据中的填充数据进行信息提取,得到第三治理数据。Furthermore, when extracting information from the fill-in data in the second governance data through the preset big model to obtain the third governance data, the role of the big model can be defined first, and a task description for extracting information from the second governance data can be provided; then, according to the big model and the task description for extracting information from the second governance data, information can be extracted from the fill-in data in the second governance data to obtain the third governance data.

需要说明的是,关于上述内容,可以通过下述具体实施方案:It should be noted that the above content can be implemented through the following specific implementation plans:

明确大模型的角色和任务描述:Clarify the roles and tasks of the big model:

角色定义:根据治理数据的特点和信息提取的需求,确定大模型在这个过程中的具体角色。例如,如果需要提取文本中的关键词,可以将大模型定义为关键词提取器。Role definition: Determine the specific role of the big model in this process based on the characteristics of the governance data and the needs of information extraction. For example, if you need to extract keywords from the text, you can define the big model as a keyword extractor.

任务描述:详细描述信息提取的任务,包括要提取的信息类型(如日期、人名、地名等)、提取规则或模式等。确保任务描述清晰明确,以便大模型能够理解并执行。Task description: Describe the information extraction task in detail, including the type of information to be extracted (such as dates, names of people, places, etc.), extraction rules or patterns, etc. Make sure the task description is clear so that the big model can understand and execute it.

使用大模型进行信息提取:Use large models for information extraction:

调用大模型并将第二治理数据作为输入。大模型根据任务描述对填充数据进行分析和处理,使用其内置的算法和模型来提取所需的信息。大模型生成提取结果,即第三治理数据。The big model is called and the second governance data is used as input. The big model analyzes and processes the populated data according to the task description, and uses its built-in algorithms and models to extract the required information. The big model generates the extraction results, which are the third governance data.

验证和评估提取结果:Verify and evaluate the extraction results:

对大模型生成的第三治理数据进行验证和评估。可以通过与参考数据对比或其他验证方法来确保提取的信息准确无误。根据评估结果,对大模型的参数或任务描述进行调整和优化,以提高提取效果。Verify and evaluate the third-party governance data generated by the large model. You can ensure that the extracted information is accurate by comparing it with reference data or other verification methods. Based on the evaluation results, adjust and optimize the parameters or task description of the large model to improve the extraction effect.

需要说明的是,本说明书实施例通过上述内容,具有下述有益效果:It should be noted that the embodiments of this specification have the following beneficial effects through the above contents:

提高信息提取的准确性:通过明确大模型的角色和提供详细的任务描述,大模型可以更好地理解要提取的信息类型和上下文。这有助于提高提取结果的准确性,减少误判和遗漏。Improve the accuracy of information extraction: By clarifying the role of the big model and providing a detailed task description, the big model can better understand the type and context of information to be extracted. This helps improve the accuracy of the extraction results and reduce misjudgments and omissions.

降低人工干预:定义大模型的角色可以使信息提取过程更加自动化,减少对人工干预的需求。这可以提高工作效率,节省时间和人力资源。Reduced manual intervention: Defining the role of a large model can make the information extraction process more automated, reducing the need for manual intervention. This can improve work efficiency and save time and human resources.

提高数据质量:准确的信息提取有助于提高第三治理数据的质量。这可以使后续的数据分析、决策制定和业务流程更加可靠和有效。Improve data quality: Accurate information extraction helps improve the quality of third-party governance data. This can make subsequent data analysis, decision-making, and business processes more reliable and effective.

支持更好的决策制定:通过提供更准确和详细的信息,第三治理数据可以为决策制定提供更有力的支持。帮助决策者做出更明智的选择,并减少基于不准确或不完整信息做出决策的风险。Support better decision making: By providing more accurate and detailed information, third-party governance data can provide stronger support for decision making, helping decision makers make more informed choices and reduce the risk of making decisions based on inaccurate or incomplete information.

增强模型的可解释性:明确大模型的角色和任务描述有助于增强模型的可解释性。这可以使人们更好地理解模型如何做出决策和提取信息,增加对模型结果的信任和理解。Enhance model interpretability: Clarifying the role and task description of large models helps enhance model interpretability. This allows people to better understand how the model makes decisions and extracts information, increasing trust and understanding of the model results.

适应不同的数据需求:通过灵活定义大模型的角色和任务描述,可以适应不同的信息提取需求和数据特点。这使得该方法更具通用性和可扩展性。Adapt to different data requirements: By flexibly defining the roles and task descriptions of the large model, it can adapt to different information extraction requirements and data characteristics. This makes the method more general and scalable.

需要说明的是,现在市面公开的招标采购公告信息存在信息量超大、格式不统一的问题,只有通过高质量的数据治理,提取出有价值的数据,才能更好促进市场信息流动和高效使用。It should be noted that the tender and procurement announcement information currently available on the market has the problem of excessive amount of information and inconsistent format. Only through high-quality data governance and the extraction of valuable data can we better promote the flow of market information and its efficient use.

而使用传统的人工识别处理,根本无法满足高效和高质量的要求。现在随着大模型技术的出现,使用人工智能辅助数据治理是一种非常有效和创新的解决办法。Traditional manual recognition and processing simply cannot meet the requirements of high efficiency and high quality. Now with the emergence of big model technology, using artificial intelligence to assist data governance is a very effective and innovative solution.

针对采集后的大量非标准招标采购公告文本数据,通过应用本说明书实施例的数据治理流程,输出需要的高质量数据集,以供各类场景应用提供准确的数据基础。For the large amount of non-standard bidding and procurement announcement text data collected, the data governance process of the embodiment of this specification is applied to output the required high-quality data set to provide an accurate data foundation for various scenario applications.

名词解释:Glossary:

(1)填充率:每一类公告信息有值的关键字段数量占全部字段数量的百分比,只有达到这个比率才能确保后续治理正常。基于大量数据统计,以70%为阈值。(1) Filling rate: The percentage of key fields with values in each type of announcement information to the total number of fields. Only when this ratio is reached can the subsequent governance be normal. Based on a large amount of data statistics, 70% is used as the threshold.

(2)准确率:每一类公告信息准确的关键字段数量占全部字段数量的百分比,只有达到这个比率才能入库存为数据资源。基于大量数据统计,以80%为最低阈值。(2) Accuracy: The percentage of accurate key fields in each type of announcement information to the total number of fields. Only when this ratio is reached can it be stored in the inventory as a data resource. Based on a large amount of data statistics, 80% is the minimum threshold.

需要说明的是,本说明书实施例的关键字段可以为上述提到的预设字段,关键字段为预先设定的字段,可以根据不同的场景自行设定。It should be noted that the key fields of the embodiments of this specification may be the preset fields mentioned above. The key fields are pre-set fields and can be set according to different scenarios.

准确率的计算可以通过一系列的指标比对得出。比如以下专用要求:The accuracy can be calculated by comparing a series of indicators. For example, the following special requirements:

金额:不能为负值、中标金额要小于等于预算金额、小于1或大于1000万的需要预警审核;公司名称:不能有特殊字符、不能有特殊词语;联系人:不能有特殊字符、是汉字、大于4的需要预警审核;联系方式:数字+特殊字符;时间与日期:标准格式。Amount: cannot be a negative value, the winning bid amount must be less than or equal to the budget amount, and a warning review is required if it is less than 1 or greater than 10 million; Company name: cannot contain special characters or special words; Contact person: cannot contain special characters, must be Chinese characters, and a warning review is required if it is greater than 4; Contact information: numbers + special characters; Time and date: standard format.

整体流程参见图2示出的结构化的治理流程示意图;整体流程分为5个部分,具体为:The overall process is shown in Figure 2, which shows a structured governance process diagram. The overall process is divided into five parts, specifically:

(1)信息原文:接收待治理的招标采购公告信息原文。(1) Original information: Receive the original information of the bidding and procurement announcement to be processed.

(2)任务控制:根据信息格式和业务的不同,调用不同的治理任务。(2) Task control: Different governance tasks are called according to different information formats and businesses.

(3)数据治理:具体数据治理过程和方法。(3) Data governance: specific data governance processes and methods.

(4)质量控制:判断治理后的质量效果,根据质量效果决定是否需要继续提升治理。(4) Quality control: Determine the quality effect after treatment and decide whether to continue to improve treatment based on the quality effect.

(5)数据资源:存储治理完成后的招标采购数据。(5) Data resources: storage of bidding and procurement data after governance is completed.

信息分类:Information classification:

公告类型:根据招标采购阶段目前分6类,包括采购意向、拟在建、招标预告、招标公告、招标结果、其他公告。其中招标预告包括采购预告、资格预审公告、预审结果、论证意见、需求公示;招标公告包括公开招标公告、邀请招标公告、询价公告、竞争性谈判公告、竞争性磋商公告、单一来源公告、竞价公告、更正公告;招标结果包括中标公告、成交公告、废标公告、流标公告、结果变更;其他公告包括合同公告、验收公告、违规公告及其他。Announcement type: It is currently divided into 6 categories according to the bidding and procurement stage, including procurement intention, planned construction, bidding notice, bidding announcement, bidding results, and other announcements. Among them, bidding notices include procurement notices, prequalification announcements, prequalification results, demonstration opinions, and demand announcements; bidding announcements include open bidding announcements, invitation bidding announcements, inquiry announcements, competitive negotiation announcements, competitive consultation announcements, single source announcements, bidding announcements, and correction announcements; bidding results include winning bid announcements, transaction announcements, bid cancellation announcements, bid failure announcements, and result changes; other announcements include contract announcements, acceptance announcements, violation announcements, and others.

关键信息:分项目信息、采购单位信息、供应商信息、代理机构信息、评审专家信息。其中项目信息包括20类字段,采购单位包括9类字段,供应商信息包括8类字段,代理机构包括7类字段,评审专家包括1类字段。Key information: project information, purchasing unit information, supplier information, agency information, and review expert information. Project information includes 20 fields, purchasing unit information includes 9 fields, supplier information includes 8 fields, agency information includes 7 fields, and review expert information includes 1 field.

整体概述:对接收到的源数据,进行数据质量统计,根据数据质量的不同,启动不同的数据治理任务。当源数据关键字段填充率大于70%的时候,执行正则治理-大模型治理-质量控制任务流程,当源数据关键字段填充率小于70%的时候,采用UIE模型治理-正则治理-大模型治理-质量控制任务流程,准确的产出高质量的数据集。Overall overview: Data quality statistics are performed on the received source data, and different data governance tasks are initiated according to the different data quality. When the filling rate of key fields of source data is greater than 70%, the regular governance-large model governance-quality control task process is executed. When the filling rate of key fields of source data is less than 70%, the UIE model governance-regular governance-large model governance-quality control task process is adopted to accurately produce high-quality data sets.

技术、流程点描述:Description of technology and process points:

(1)统计分析源数据关键字段填充率:按公告类型分类,定义公告类型字典:type_info={'公告类型1':'value1','公告类型2':'value2',,,},确定源数据类型下的全部字段,定义字典key_info={'关键字段1':'value1','关键字段2':'value2',,,},确定源数据关键字段,嵌套循环遍历type_info与info,累加计算出各公告类型关键字段的总数据量与全部字段总数据量,得出关键字段填充率。(1) Statistical analysis of the filling rate of key fields of source data: Classify by announcement type, define the announcement type dictionary: type_info = {'Announcement type 1':'value1', 'Announcement type 2':'value2',,,}, determine all fields under the source data type, define the dictionary key_info = {'Key field 1':'value1', 'Key field 2':'value2',,,}, determine the key fields of the source data, nested loops traverse type_info and info, and cumulatively calculate the total data volume of the key fields of each announcement type and the total data volume of all fields to obtain the key field filling rate.

(2)根据关键字段填充率的情况,通过BlockingScheduler().add_job()方法,自动启动预设的不同治理任务(2) Based on the fill rate of key fields, different preset governance tasks are automatically started through the BlockingScheduler().add_job() method.

数据治理:包括正则治理、信息抽取模型治理与通用大模型信息抽取治理。Data governance: including regularization governance, information extraction model governance and general large model information extraction governance.

正则治理:Regular governance:

整体概述:利用正则表达式,对源数据进行拆分,截取,匹配等操作。包括地址拆分、金额提取、日期标准化、手机号码提取、特殊符号等无用信息过滤等。Overall overview: Use regular expressions to split, intercept, match and other operations on source data. This includes address splitting, amount extraction, date standardization, mobile phone number extraction, and filtering of useless information such as special symbols.

关键技术、流程点描述:Description of key technologies and process points:

(1)地址拆分:利用详细地址,通过INSTR函数,找到省、市、区等关键字段的位置,再利用SUBSTR函数拆分详细地址,从而得到省、市、区等独立关键字段。(1) Address splitting: Use the detailed address and the INSTR function to find the location of key fields such as province, city, and district. Then use the SUBSTR function to split the detailed address to obtain independent key fields such as province, city, and district.

(2)金额提取:在一段文字表述中,包含着金额数字,利用LEAST和Locate函数,找到第一个数字的位置,适用GREATEST与CHAR_LENGTH函数找到最后一个数字的位置,最后通过SUBSTR函数拆分出文字中的纯数字,再根据文字中的金额单位信息,最终确定准确的金额信息。(2) Amount extraction: In a text expression that contains amount numbers, use the LEAST and Locate functions to find the position of the first number, use the GREATEST and CHAR_LENGTH functions to find the position of the last number, and finally use the SUBSTR function to separate the pure numbers in the text. Then, based on the amount unit information in the text, the accurate amount information is finally determined.

(3)手机号过滤:参考步骤(2),提取出纯数字信息,再结合LENGTH(phone_number)=11AND SUBSTRING(phone_number,1,2)IN('13','14','15','17','18')与REGEXP'^(13\d|14[57]|15[^4\D]|17[13678]|18\d)\d{8}$'治理出准确的手机号码。(3) Mobile phone number filtering: Refer to step (2) to extract pure digital information, and then combine LENGTH(phone_number)=11AND SUBSTRING(phone_number,1,2)IN('13','14','15','17','18') with REGEXP'^(13\d|14[57]|15[^4\D]|17[13678]|18\d)\d{8}$' to get the correct mobile phone number.

(4)特殊符号过滤:通过replace函数,剔除信息项中的特殊符号。(4) Special symbol filtering: Use the replace function to remove special symbols from information items.

关于信息抽取模型治理:About information extraction model governance:

整体概述:使用信息抽取(Universal Information Extraction,UIE)模型,对HTML文档中的特定信息进行提取,以结构化的形式返回信息提取结果。为了提升信息抽取模型的准确率设计HTML简化方法,对HTML原文进行简化,剔除绝大部分HTML语法结构,同时保留格式符号,在充分利用HTML原文中的格式信息的同时,通过缩短HTML文档长度,剔除无效信息,降低模型推理过程消耗的显存和时间,提升推理准确率。Overall overview: Use the Universal Information Extraction (UIE) model to extract specific information from HTML documents and return the information extraction results in a structured form. In order to improve the accuracy of the information extraction model, an HTML simplification method is designed to simplify the HTML original text, remove most of the HTML syntax structure, and retain the format symbols. While making full use of the format information in the HTML original text, by shortening the length of the HTML document and removing invalid information, the memory and time consumed by the model reasoning process are reduced, and the reasoning accuracy is improved.

关键技术、流程点描述:Description of key technologies and process points:

(1)HTML简化:使用Python的bs4模块中的BeautifulSoup类进行HTML格式简化,移除HTML文档中的script、style等标签内容,再移除其他标签的class、id等属性。具体移除的标签和属性名称可以根据实际需要进行设计。(1) HTML simplification: Use the BeautifulSoup class in Python's bs4 module to simplify the HTML format, remove the script, style and other tags in the HTML document, and then remove the class, id and other attributes of other tags. The specific tags and attribute names to be removed can be designed according to actual needs.

(2)UIE信息抽取模型:将简化后的HTML文档作为输入,传入UIE信息抽取模型中,进行招标采购公告中所包含的信息抽取,如采购人名称、地址、联系方式,代理机构名、地址、联系方式等。UIE信息抽取模型利用指针网络(Pointer Network)实现片段抽取(SpanExtraction),从而实现命名实体识别(NER)、关系抽取(RE)、事件抽取(EE)、属性情感抽取(ABSA)等多类任务的抽取。通过结构化schema指导器,实现不同的信息抽取任务和具体需要抽取的内容。(2) UIE information extraction model: The simplified HTML document is used as input and passed into the UIE information extraction model to extract the information contained in the bidding and procurement announcement, such as the name, address, and contact information of the purchaser, the name, address, and contact information of the agency, etc. The UIE information extraction model uses a pointer network to implement span extraction, thereby achieving the extraction of multiple tasks such as named entity recognition (NER), relation extraction (RE), event extraction (EE), and attribute sentiment extraction (ABSA). Through the structured schema director, different information extraction tasks and specific content to be extracted are implemented.

信息抽取(UIE)模型:这是一种用于从文本中提取信息的模型。它可以识别和提取HTML文档中的关键元素和数据。Information Extraction (UIE) model: This is a model used to extract information from text. It can identify and extract key elements and data from HTML documents.

特定信息:这指的是模型需要提取的具体数据或元素。这些特定的信息可能根据具体的应用场景而定。Specific information: This refers to the specific data or elements that the model needs to extract. This specific information may depend on the specific application scenario.

提取:模型通过分析HTML文档的结构和内容,识别并抽取特定的信息。Extraction: The model identifies and extracts specific information by analyzing the structure and content of HTML documents.

结构化形式:提取的信息将以一种特定的结构进行组织,例如可以是一个包含字段和值的字典或其他形式的数据结构。Structured form: The extracted information will be organized in a specific structure, such as a dictionary or other data structure containing fields and values.

返回信息提取结果:最终模型将把提取到的结构化信息返回给调用者或其他处理程序,以便后续的使用和分析。Return information extraction results: The final model will return the extracted structured information to the caller or other processing program for subsequent use and analysis.

关于通用大模型信息抽取治理:About general large model information extraction management:

整体概述:对于UIE模型解析不准确的字段,可以使用大模型进行信息抽取。借助通用大模型的语言理解能力和代码理解能力,通过提示词工程将招标采购公告信息提取的任务转化为大模型对话任务,使大模型以结构化的形式将HTML文档中的信息返回给用于。再通过少量的标注数据,利用LORA微调进一步增强大模型在招标采购公告信息提取任务上的推理准确率。与4.2相同,在使用大模型进行信息抽取前,先对HTML文档进行简化,提升推理效率和准确率。Overall overview: For fields that the UIE model does not parse accurately, a large model can be used for information extraction. With the help of the language understanding and code understanding capabilities of the general large model, the task of extracting information from the bidding and procurement announcement is converted into a large model dialogue task through the prompt word engineering, so that the large model returns the information in the HTML document to the user in a structured form. Then, through a small amount of annotated data, LORA fine-tuning is used to further enhance the reasoning accuracy of the large model in the task of extracting information from the bidding and procurement announcement. Similar to 4.2, before using the large model for information extraction, the HTML document is simplified to improve reasoning efficiency and accuracy.

关键技术、流程点描述:Description of key technologies and process points:

(1)HTML简化:同上述的HTML简化。(1) HTML simplification: Same as the HTML simplification mentioned above.

(2)通用大模型:使用已有开源通用大模型作为基础模型,如qwen,chatglm等,根据资源分配可以选择1.5b、7b、14b等参数规模。利用开源通用大模型的语言理解能力和代码理解能力,对招标采购公告信息进行抽取。(2) General large model: Use existing open source general large models as the basic model, such as qwen, chatglm, etc., and select parameter scales such as 1.5b, 7b, and 14b according to resource allocation. Use the language understanding and code understanding capabilities of the open source general large model to extract bidding and procurement announcement information.

(3)提示词工程:构造规范的提示词,定义你希望大模型扮演的角色和其能力范围,提供背景信息和上下文以指导大模型,明确指出你希望大模型执行的任务,并给出相关的问答示例。(3) Prompt word engineering: Construct standardized prompt words, define the role you want the big model to play and the scope of its capabilities, provide background information and context to guide the big model, clearly point out the tasks you want the big model to perform, and give relevant question and answer examples.

(4)LORA微调:构造少量已标注信息的样本(100-1000条),对通用大模型进行LORA微调,并将生成的LORA微调合并到原有大模型上,利用已有的标注问答样本进一步提升大模型标注的准确性和格式规范性。(4) LoRa fine-tuning: Construct a small amount of samples with labeled information (100-1000 items), perform LoRa fine-tuning on the general large model, and merge the generated LoRa fine-tuning into the original large model. Use the existing labeled question and answer samples to further improve the accuracy and format standardization of the large model annotation.

质量控制:Quality Control:

整体概述:当一批源数据经过一次完整的数据治理任务流程后,产生待确认的数据资源,对待确认的数据资源进行质量统计分析。当关键字段填充率大于70%且关键字段准确率大于80%的时候,输出数据资源。当关键字段填充率低于70%的时候,对UIE模型进行数据集标注,监督微调,提升UIE模型的信息提取效果,之后再次启动数据治理任务。当关键字段准确率小于80%的时候,对正则治理流程进行修正维护,提高正则解析正确率,且对大模型进行提示词修正,提升大模型在特定问题上的推理效果,之后再次启动数据治理任务。Overall overview: When a batch of source data goes through a complete data governance task process, data resources to be confirmed are generated, and quality statistical analysis is performed on the data resources to be confirmed. When the key field filling rate is greater than 70% and the key field accuracy is greater than 80%, the data resources are output. When the key field filling rate is lower than 70%, the UIE model is labeled with a data set, supervised and fine-tuned to improve the information extraction effect of the UIE model, and then the data governance task is restarted. When the key field accuracy is less than 80%, the regular governance process is corrected and maintained to improve the accuracy of regular parsing, and the prompt words of the large model are corrected to improve the reasoning effect of the large model on specific problems, and then the data governance task is restarted.

关键技术、流程点描述:Description of key technologies and process points:

(1)监督微调:利用标注好的数据集对通用大模型进行微调,使用LORA微调方法,在通用大模型的基础上并联一个网络,保持原始模型不变的情况下对模型推理能力进行提升,针对不同的标注任务可以并联不同的LORA微调,实现模型的轻量级微调转换。(1) Supervised fine-tuning: Use the labeled data set to fine-tune the general large model. Use the LoRa fine-tuning method to connect a network in parallel on the basis of the general large model to improve the model's reasoning ability while keeping the original model unchanged. Different LoRa fine-tuning methods can be connected in parallel for different labeling tasks to achieve lightweight fine-tuning conversion of the model.

(2)提示词修正:对于提示词再次进行优化,通过指令任务的分条说明,着重符等特殊符号的使用,将指令更清晰的传达给大模型,使用zero shot,few shot方法,将任务样例写入提示词中,提升大模型在特定问题上的推理效果。(2) Prompt word correction: The prompt words are optimized again. The instructions are conveyed to the big model more clearly through the segmented description of the instruction task and the use of special symbols such as emphasis marks. The zero shot and few shot methods are used to write task examples into the prompt words to improve the reasoning effect of the big model on specific problems.

通过本流程可以大幅提升大量数据的治理效率。经过使用本流程,面对千万级别数量,仅需要4位技术人员,通过模型微调和算法应用,可以实现在单卡算力机器2条公告信息/s的实时治理,治理质量在84%以上。This process can greatly improve the efficiency of managing large amounts of data. After using this process, only 4 technicians are needed to manage tens of millions of data. Through model fine-tuning and algorithm application, real-time management of 2 announcements/s can be achieved on a single-card computing machine, with a management quality of more than 84%.

图3为本说明书一个或多个实施例提供的一种招标数据填充治理装置的结构示意图,包括:数据填充单元302、未填充字段确定单元304、第一治理单元306、准确率判断单元308、第二治理单元310与招标数据确定单元312。Figure 3 is a structural diagram of a bidding data filling and management device provided by one or more embodiments of this specification, including: a data filling unit 302, an unfilled field determination unit 304, a first management unit 306, an accuracy judgment unit 308, a second management unit 310 and a bidding data determination unit 312.

数据填充单元302,根据预设字段对招标源数据进行数据填充,得到待评估招标填充数据;The data filling unit 302 fills the tender source data with data according to the preset fields to obtain tender filling data to be evaluated;

未填充字段确定单元304,若所述待评估招标填充数据中所述预设字段的填充率小于第一阈值,确定所述待评估招标填充数据的未填充预设字段;An unfilled field determining unit 304 is configured to determine unfilled preset fields of the tender filling data to be evaluated if the filling rate of the preset fields in the tender filling data to be evaluated is less than a first threshold;

第一治理单元306,将所述未填充预设字段输入预先训练的信息抽取模型,得到第一治理数据;The first management unit 306 inputs the unfilled preset field into a pre-trained information extraction model to obtain first management data;

准确率判断单元308,若所述第一治理数据中预设字段的填充率不小于第一阈值,判断所述第一治理数据中预设字段的准确率是否小于第二阈值;The accuracy judgment unit 308 judges whether the accuracy of the preset field in the first governance data is less than a second threshold if the filling rate of the preset field in the first governance data is not less than a first threshold;

第二治理单元310,若所述第一治理数据中预设字段的准确率小于所述第二阈值,对所述第一治理数据中的填充数据进行正则治理,得到第二治理数据;A second governance unit 310, if the accuracy of the preset field in the first governance data is less than the second threshold, performs regularization governance on the padding data in the first governance data to obtain second governance data;

招标数据确定单元312,若所述第二治理数据中预设字段的准确率不小于所述第二阈值,得到符合条件的招标数据。The bidding data determining unit 312 obtains bidding data that meets the conditions if the accuracy of the preset field in the second governance data is not less than the second threshold.

图4为本说明书一个或多个实施例提供的一种招标数据填充治理设备的结构示意图,包括:FIG4 is a schematic diagram of the structure of a bidding data filling management device provided by one or more embodiments of this specification, including:

至少一个处理器;以及,at least one processor; and,

与所述至少一个处理器通信连接的存储器;其中,a memory communicatively connected to the at least one processor; wherein,

所述存储器存储有可被所述至少一个处理器执行的指令,所述指令被所述至少一个处理器执行,以使所述至少一个处理器能够:The memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to:

根据预设字段对招标源数据进行数据填充,得到待评估招标填充数据;Fill the tender source data according to the preset fields to obtain the tender filling data to be evaluated;

若所述待评估招标填充数据中所述预设字段的填充率小于第一阈值,确定所述待评估招标填充数据的未填充预设字段;If the filling rate of the preset field in the tender filling data to be evaluated is less than a first threshold, determining the unfilled preset field of the tender filling data to be evaluated;

将所述未填充预设字段输入预先训练的信息抽取模型,得到第一治理数据;Inputting the unfilled preset fields into a pre-trained information extraction model to obtain first governance data;

若所述第一治理数据中预设字段的填充率不小于第一阈值,判断所述第一治理数据中预设字段的准确率是否小于第二阈值;If the filling rate of the preset field in the first governance data is not less than the first threshold, determining whether the accuracy rate of the preset field in the first governance data is less than the second threshold;

若所述第一治理数据中预设字段的准确率小于所述第二阈值,对所述第一治理数据中的填充数据进行正则治理,得到第二治理数据;If the accuracy of the preset field in the first managed data is less than the second threshold, regularization management is performed on the padding data in the first managed data to obtain second managed data;

若所述第二治理数据中预设字段的准确率不小于所述第二阈值,得到符合条件的招标数据。If the accuracy of the preset field in the second governance data is not less than the second threshold, qualified bidding data is obtained.

本说明书一个或多个实施例提供的一种非易失性计算机存储介质,存储有计算机可执行指令,所述计算机可执行指令被计算机执行时能够实现:One or more embodiments of this specification provide a non-volatile computer storage medium storing computer executable instructions, which can achieve the following when executed by a computer:

根据预设字段对招标源数据进行数据填充,得到待评估招标填充数据;Fill the tender source data according to the preset fields to obtain the tender filling data to be evaluated;

若所述待评估招标填充数据中所述预设字段的填充率小于第一阈值,确定所述待评估招标填充数据的未填充预设字段;If the filling rate of the preset field in the tender filling data to be evaluated is less than a first threshold, determining the unfilled preset field of the tender filling data to be evaluated;

将所述未填充预设字段输入预先训练的信息抽取模型,得到第一治理数据;Inputting the unfilled preset fields into a pre-trained information extraction model to obtain first governance data;

若所述第一治理数据中预设字段的填充率不小于第一阈值,判断所述第一治理数据中预设字段的准确率是否小于第二阈值;If the filling rate of the preset field in the first governance data is not less than the first threshold, determining whether the accuracy rate of the preset field in the first governance data is less than the second threshold;

若所述第一治理数据中预设字段的准确率小于所述第二阈值,对所述第一治理数据中的填充数据进行正则治理,得到第二治理数据;If the accuracy of the preset field in the first managed data is less than the second threshold, regularization management is performed on the padding data in the first managed data to obtain second managed data;

若所述第二治理数据中预设字段的准确率不小于所述第二阈值,得到符合条件的招标数据。If the accuracy of the preset field in the second governance data is not less than the second threshold, qualified bidding data is obtained.

本说明书中的各个实施例均采用递进的方式描述,各个实施例之间相同相似的部分互相参见即可,每个实施例重点说明的都是与其他实施例的不同之处。尤其,对于装置、设备、非易失性计算机存储介质实施例而言,由于其基本相似于方法实施例,所以描述的比较简单,相关之处参见方法实施例的部分说明即可。Each embodiment in this specification is described in a progressive manner, and the same or similar parts between the embodiments can be referred to each other, and each embodiment focuses on the differences from other embodiments. In particular, for the device, equipment, and non-volatile computer storage medium embodiments, since they are basically similar to the method embodiments, the description is relatively simple, and the relevant parts can be referred to the partial description of the method embodiment.

上述对本说明书特定实施例进行了描述。其它实施例在所附权利要求书的范围内。在一些情况下,在权利要求书中记载的动作或步骤可以按照不同于实施例中的顺序来执行并且仍然可以实现期望的结果。另外,在附图中描绘的过程不一定要求示出的特定顺序或者连续顺序才能实现期望的结果。在某些实施方式中,多任务处理和并行处理也是可以的或者可能是有利的。The above is a description of a specific embodiment of the specification. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recorded in the claims can be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the specific order or continuous order shown to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

以上所述仅为本说明书的一个或多个实施例而已,并不用于限制本说明书。对于本领域技术人员来说,本说明书的一个或多个实施例可以有各种更改和变化。凡在本说明书的一个或多个实施例的精神和原理之内所作的任何修改、等同替换、改进等,均应包含在本说明书的权利要求范围之内。The above description is only one or more embodiments of this specification and is not intended to limit this specification. For those skilled in the art, one or more embodiments of this specification may have various changes and variations. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of one or more embodiments of this specification shall be included in the scope of the claims of this specification.

Claims (10)

1.一种招标数据填充治理方法,其特征在于,包括:1. A bidding data filling management method, characterized by comprising: 根据预设字段对招标源数据进行数据填充,得到待评估招标填充数据;Fill the tender source data according to the preset fields to obtain the tender filling data to be evaluated; 若所述待评估招标填充数据中所述预设字段的填充率小于第一阈值,确定所述待评估招标填充数据的未填充预设字段;If the filling rate of the preset field in the tender filling data to be evaluated is less than a first threshold, determining the unfilled preset field of the tender filling data to be evaluated; 将所述未填充预设字段输入预先训练的信息抽取模型,得到第一治理数据;Inputting the unfilled preset fields into a pre-trained information extraction model to obtain first governance data; 若所述第一治理数据中预设字段的填充率不小于第一阈值,判断所述第一治理数据中预设字段的准确率是否小于第二阈值;If the filling rate of the preset field in the first governance data is not less than the first threshold, determining whether the accuracy rate of the preset field in the first governance data is less than the second threshold; 若所述第一治理数据中预设字段的准确率小于所述第二阈值,对所述第一治理数据中的填充数据进行正则治理,得到第二治理数据;If the accuracy of the preset field in the first managed data is less than the second threshold, regularization management is performed on the padding data in the first managed data to obtain second managed data; 若所述第二治理数据中预设字段的准确率不小于所述第二阈值,得到符合条件的招标数据。If the accuracy of the preset field in the second governance data is not less than the second threshold, qualified bidding data is obtained. 2.根据权利要求1所述的方法,其特征在于,所述将所述未填充预设字段输入预先训练的信息抽取模型,得到第一治理数据,包括:2. The method according to claim 1, characterized in that the step of inputting the unfilled preset fields into a pre-trained information extraction model to obtain the first governance data comprises: 通过预先训练的信息抽取模型,对所述未填充预设字段进行数据抽取,得到所述未填充预设字段的待填充数据;Extracting data from the unfilled preset fields through a pre-trained information extraction model to obtain data to be filled in the unfilled preset fields; 根据所述待填充数据对所述未填充预设字段进行数据填充,得到第一治理数据。The unfilled preset fields are filled with data according to the data to be filled to obtain first governance data. 3.根据权利要求2所述的方法,其特征在于,所述待评估招标填充数据为HTML格式文档,通过预先训练的信息抽取模型,对所述未填充预设字段进行数据抽取,得到所述未填充预设字段的待填充数据,包括:3. The method according to claim 2 is characterized in that the tender filling data to be evaluated is an HTML format document, and the unfilled preset fields are extracted by a pre-trained information extraction model to obtain the data to be filled in the unfilled preset fields, including: 将所述待评估招标填充数据输入所述信息抽取模型,通过指针网络对所述未填充预设字段所对应的信息进行片段抽取,以实现命名实体识别、关系抽取、事件抽取,以及属性情感抽取。The tender filling data to be evaluated is input into the information extraction model, and the information corresponding to the unfilled preset fields is extracted through a pointer network to achieve named entity recognition, relationship extraction, event extraction, and attribute sentiment extraction. 4.根据权利要求3所述的方法,其特征在于,所述通过预先训练的信息抽取模型,对所述未填充预设字段进行数据抽取,得到所述未填充预设字段的待填充数据前,所述方法还包括:4. The method according to claim 3, characterized in that before extracting data from the unfilled preset fields through the pre-trained information extraction model to obtain the data to be filled in the unfilled preset fields, the method further comprises: 通过Python的bs4模块中的BeautifulSoup类移除所述待评估招标填充数据中的script标签内容与style标签内容,并移除其他标签的class属性、id属性。The BeautifulSoup class in the bs4 module of Python is used to remove the script tag content and the style tag content in the tender filling data to be evaluated, and remove the class attributes and id attributes of other tags. 5.根据权利要求1所述的方法,其特征在于,所述对所述第一治理数据中的填充数据进行正则治理,得到第二治理数据,包括:5. The method according to claim 1, wherein the performing regularization on the padding data in the first governance data to obtain the second governance data comprises: 通过正则表达式对所述第一治理数据中的填充数据进行地址拆分、金额提取、日期标准化、手机号码提取,以及预设符号的过滤,得到第二治理数据。The fill data in the first governance data is subjected to address splitting, amount extraction, date standardization, mobile phone number extraction, and preset symbol filtering by regular expressions to obtain second governance data. 6.根据权利要求1所述的方法,其特征在于,若所述第二治理数据中预设字段的准确率小于所述第二阈值,所述方法还包括:6. The method according to claim 1, characterized in that if the accuracy of the preset field in the second governance data is less than the second threshold, the method further comprises: 通过预设大模型对所述第二治理数据中的填充数据进行信息提取,得到第三治理数据;Extracting information from the filling data in the second governance data by using a preset large model to obtain third governance data; 若所述第三治理数据中预设字段的准确率不小于所述第二阈值,得到符合条件的招标数据。If the accuracy of the preset field in the third governance data is not less than the second threshold, qualified bidding data is obtained. 7.根据权利要求6所述的方法,其特征在于,所述通过预设大模型对所述第二治理数据中的填充数据进行信息提取,得到第三治理数据,包括:7. The method according to claim 6, characterized in that the step of extracting information from the fill data in the second governance data by using a preset large model to obtain the third governance data comprises: 定义所述大模型的角色,提供所述第二治理数据中信息提取的任务描述;Define the role of the large model and provide a task description for information extraction from the second governance data; 根据所述大模型依据所述第二治理数据中信息提取的任务描述,对所述第二治理数据中的填充数据进行信息提取,得到第三治理数据。According to the task description of extracting information from the second governance data based on the large model, information is extracted from the filling data in the second governance data to obtain third governance data. 8.一种招标数据填充治理装置,其特征在于,包括:8. A bidding data filling management device, characterized by comprising: 数据填充单元,根据预设字段对招标源数据进行数据填充,得到待评估招标填充数据;A data filling unit fills the tender source data according to preset fields to obtain tender filling data to be evaluated; 未填充字段确定单元,若所述待评估招标填充数据中所述预设字段的填充率小于第一阈值,确定所述待评估招标填充数据的未填充预设字段;an unfilled field determining unit, which determines unfilled preset fields of the tender filling data to be evaluated if the filling rate of the preset fields in the tender filling data to be evaluated is less than a first threshold; 第一治理单元,将所述未填充预设字段输入预先训练的信息抽取模型,得到第一治理数据;A first management unit inputs the unfilled preset field into a pre-trained information extraction model to obtain first management data; 准确率判断单元,若所述第一治理数据中预设字段的填充率不小于第一阈值,判断所述第一治理数据中预设字段的准确率是否小于第二阈值;an accuracy judgment unit, which judges whether the accuracy of the preset field in the first governance data is less than a second threshold value if the filling rate of the preset field in the first governance data is not less than a first threshold value; 第二治理单元,若所述第一治理数据中预设字段的准确率小于所述第二阈值,对所述第一治理数据中的填充数据进行正则治理,得到第二治理数据;A second governance unit, if the accuracy of the preset field in the first governance data is less than the second threshold, performs regularization governance on the padding data in the first governance data to obtain second governance data; 招标数据确定单元,若所述第二治理数据中预设字段的准确率不小于所述第二阈值,得到符合条件的招标数据。The bidding data determining unit obtains bidding data that meets the conditions if the accuracy of the preset field in the second governance data is not less than the second threshold. 9.一种招标数据填充治理设备,其特征在于,包括:9. A bidding data filling management device, characterized by comprising: 至少一个处理器;以及,at least one processor; and, 与所述至少一个处理器通信连接的存储器;其中,a memory communicatively connected to the at least one processor; wherein, 所述存储器存储有可被所述至少一个处理器执行的指令,所述指令被所述至少一个处理器执行,以使所述至少一个处理器能够:The memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to: 根据预设字段对招标源数据进行数据填充,得到待评估招标填充数据;Fill the tender source data according to the preset fields to obtain the tender filling data to be evaluated; 若所述待评估招标填充数据中所述预设字段的填充率小于第一阈值,确定所述待评估招标填充数据的未填充预设字段;If the filling rate of the preset field in the tender filling data to be evaluated is less than a first threshold, determining the unfilled preset field of the tender filling data to be evaluated; 将所述未填充预设字段输入预先训练的信息抽取模型,得到第一治理数据;Inputting the unfilled preset fields into a pre-trained information extraction model to obtain first governance data; 若所述第一治理数据中预设字段的填充率不小于第一阈值,判断所述第一治理数据中预设字段的准确率是否小于第二阈值;If the filling rate of the preset field in the first governance data is not less than the first threshold, determining whether the accuracy rate of the preset field in the first governance data is less than the second threshold; 若所述第一治理数据中预设字段的准确率小于所述第二阈值,对所述第一治理数据中的填充数据进行正则治理,得到第二治理数据;If the accuracy of the preset field in the first managed data is less than the second threshold, regularization management is performed on the padding data in the first managed data to obtain second managed data; 若所述第二治理数据中预设字段的准确率不小于所述第二阈值,得到符合条件的招标数据。If the accuracy of the preset field in the second governance data is not less than the second threshold, qualified bidding data is obtained. 10.一种非易失性计算机存储介质,其特征在于,存储有计算机可执行指令,所述计算机可执行指令被计算机执行时能够实现:10. A non-volatile computer storage medium, characterized in that it stores computer executable instructions, which when executed by a computer can achieve: 根据预设字段对招标源数据进行数据填充,得到待评估招标填充数据;Fill the tender source data according to the preset fields to obtain the tender filling data to be evaluated; 若所述待评估招标填充数据中所述预设字段的填充率小于第一阈值,确定所述待评估招标填充数据的未填充预设字段;If the filling rate of the preset field in the tender filling data to be evaluated is less than a first threshold, determining the unfilled preset field of the tender filling data to be evaluated; 将所述未填充预设字段输入预先训练的信息抽取模型,得到第一治理数据;Inputting the unfilled preset fields into a pre-trained information extraction model to obtain first governance data; 若所述第一治理数据中预设字段的填充率不小于第一阈值,判断所述第一治理数据中预设字段的准确率是否小于第二阈值;If the filling rate of the preset field in the first governance data is not less than the first threshold, determining whether the accuracy rate of the preset field in the first governance data is less than the second threshold; 若所述第一治理数据中预设字段的准确率小于所述第二阈值,对所述第一治理数据中的填充数据进行正则治理,得到第二治理数据;If the accuracy of the preset field in the first managed data is less than the second threshold, regularization management is performed on the padding data in the first managed data to obtain second managed data; 若所述第二治理数据中预设字段的准确率不小于所述第二阈值,得到符合条件的招标数据。If the accuracy of the preset field in the second governance data is not less than the second threshold, qualified bidding data is obtained.
CN202410872098.3A 2024-07-01 2024-07-01 A bidding data filling management method, device, equipment and medium Pending CN118862889A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410872098.3A CN118862889A (en) 2024-07-01 2024-07-01 A bidding data filling management method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410872098.3A CN118862889A (en) 2024-07-01 2024-07-01 A bidding data filling management method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN118862889A true CN118862889A (en) 2024-10-29

Family

ID=93178516

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410872098.3A Pending CN118862889A (en) 2024-07-01 2024-07-01 A bidding data filling management method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN118862889A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119579009A (en) * 2025-01-27 2025-03-07 中通服创立信息科技有限责任公司 A large-model-based intelligent bid evaluation method, system and related products
CN119917756A (en) * 2025-04-03 2025-05-02 上海冰鉴信息科技有限公司 A method for extracting bidding data information using a generative large model

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119579009A (en) * 2025-01-27 2025-03-07 中通服创立信息科技有限责任公司 A large-model-based intelligent bid evaluation method, system and related products
CN119917756A (en) * 2025-04-03 2025-05-02 上海冰鉴信息科技有限公司 A method for extracting bidding data information using a generative large model

Similar Documents

Publication Publication Date Title
CN108256074B (en) Verification processing method and device, electronic equipment and storage medium
CN108153729B (en) A Knowledge Extraction Method Oriented to the Financial Field
CN118862889A (en) A bidding data filling management method, device, equipment and medium
US20050182736A1 (en) Method and apparatus for determining contract attributes based on language patterns
CN111324631A (en) Method for automatically generating sql statement by human natural language of query data
CN111274817A (en) An intelligent software cost measurement method based on natural language processing technology
CN114490571A (en) Modeling method, server and storage medium
CN119940322B (en) A method and system for generating rational drug use reports combined with artificial intelligence
CN119149742A (en) Contract evaluation method, system, equipment and medium based on large model
CN120124612A (en) A method for automatic generation of audit reports based on natural language processing
CN119151343B (en) A scientific and technological achievement evaluation and management system
CN120045580A (en) Financial apportionment detail query recommendation method, equipment and medium based on natural language
CN119722337A (en) A data tracing analysis system and method based on big data
CN118585892B (en) Personnel file data classification method and system based on artificial intelligence
CN118627471B (en) Automatic government affair data labeling method and system based on dependency attention diagram convolution
CN119129609A (en) Intelligent consulting method and consulting platform combined with demand semantic analysis
CN119672750A (en) A method and system for extracting key parameter information from PDF drawings
CN119273436A (en) An intelligent bid evaluation method based on big data
CN118747901A (en) Power transmission and transformation engineering technology and technical and economic big data integration method based on AI semantic recognition
CN118569804A (en) Short message text processing method and short message text receiving method
CN117875706A (en) An AI-based digital management method for grading process
CN116450717B (en) Data integration method and information management system for cross-service modules
CN119917464B (en) File processing method and device
CN120523923B (en) News information processing method, device, electronic equipment and storage medium
CN120672203A (en) Patent value assessment method and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
点击 这是indexloc提供的php浏览器服务,不要输入任何密码和下载