
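Process documents with Layout Parser

Layout Parser extracts document content elements such as text, tables, and lists, and then creates context-aware chunks that generative AI and discovery applications can use to retrieve information.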

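Layout Parser features

Parse document layouts. You can input HTML or PDF files to Layout Parser to identify content elements such as text blocks, tables, and lists, as well as structural elements such as titles and headings. These elements define the organization and hierarchy of a document, which gives information retrieval and discovery more context to work with.
Chunk documents. Layout Parser can break documents into chunks that retain contextual information about the layout hierarchy of the original document. Answer-generating LLMs can use these chunks to improve relevance and reduce computational load.

Taking a document's layout into account during chunking improves semantic coherence and reduces noise in the content used for retrieval and LLM generation. All text in a chunk comes from the same layout entity, such as a heading, subheading, or list.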
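To make the idea concrete, here is a toy sketch of layout-aware chunking. It is not Layout Parser's actual algorithm: the `Block` and `Chunk` types and the `chunk_by_layout` function are hypothetical names invented for this illustration, and token-based size limits are left out entirely. It only shows how each chunk can stay within one layout entity while carrying its nearest headings as context, capped at two levels to mirror the limit this page describes for ancestor headings.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """A simplified layout element (hypothetical type for this sketch)."""
    kind: str        # "heading", "paragraph", or "list"
    text: str
    level: int = 0   # heading depth: 1 is the top level; 0 for non-headings


@dataclass
class Chunk:
    """One chunk: the text of a single layout entity plus heading context."""
    headings: List[str]
    content: str


def chunk_by_layout(
    blocks: List[Block], include_ancestor_headings: bool = True
) -> List[Chunk]:
    chunks: List[Chunk] = []
    heading_stack: List[Tuple[int, str]] = []  # (level, text), outermost first
    for block in blocks:
        if block.kind == "heading":
            # A new heading closes every heading at the same or a deeper level.
            while heading_stack and heading_stack[-1][0] >= block.level:
                heading_stack.pop()
            heading_stack.append((block.level, block.text))
            continue
        # Keep at most the two nearest ancestor headings as context.
        ancestors = (
            [text for _, text in heading_stack[-2:]]
            if include_ancestor_headings
            else []
        )
        # Each chunk holds text from exactly one layout entity.
        chunks.append(Chunk(headings=ancestors, content=block.text))
    return chunks


example_blocks = [
    Block("heading", "Guide", 1),
    Block("heading", "Setup", 2),
    Block("heading", "Linux", 3),
    Block("paragraph", "Install the package."),
    Block("heading", "Usage", 2),
    Block("list", "- run\n- stop"),
]
for chunk in chunk_by_layout(example_blocks):
    print(chunk.headings, repr(chunk.content))
```

Because headings are tracked on a stack, a paragraph under "Linux" keeps "Setup" and "Linux" as context, while the list under "Usage" drops them once that branch of the hierarchy is closed.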

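Limitations

The following limitations apply:

Online processing:
- Maximum input file size of 20 MB for all file types
- Maximum of 15 pages per PDF file
Batch processing:
- Maximum single file size of 40 MB for PDF files
- Maximum of 500 pages per PDF file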
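A pre-flight check along these lines can catch oversized inputs before a request is sent. This is a sketch only: `check_limits` is a hypothetical helper, the page count must be supplied by the caller (for example, from a PDF library), and treating 1 MB as 1,048,576 bytes is an assumption, since the limits above do not say which definition applies.

```python
from typing import List, Optional

MB = 1024 * 1024  # Assumption: binary megabytes; the documented limits do not specify.

# Limits as listed above. The batch size limit is stated for PDF files only.
LIMITS = {
    "online": {"max_bytes": 20 * MB, "max_pdf_pages": 15},
    "batch": {"max_bytes": 40 * MB, "max_pdf_pages": 500},
}


def check_limits(
    mode: str,
    mime_type: str,
    size_bytes: int,
    pdf_pages: Optional[int] = None,
) -> List[str]:
    """Return a list of human-readable limit violations (empty if none)."""
    limits = LIMITS[mode]  # Raises KeyError for an unknown mode.
    is_pdf = mime_type == "application/pdf"
    problems: List[str] = []

    # Online size limit covers all file types; batch size limit covers PDFs.
    if (mode == "online" or is_pdf) and size_bytes > limits["max_bytes"]:
        problems.append(
            f"file is {size_bytes} bytes; {mode} limit is {limits['max_bytes']}"
        )
    # Page limits are only stated for PDF files.
    if is_pdf and pdf_pages is not None and pdf_pages > limits["max_pdf_pages"]:
        problems.append(
            f"PDF has {pdf_pages} pages; {mode} limit is {limits['max_pdf_pages']}"
        )
    return problems


print(check_limits("online", "application/pdf", 5 * MB, pdf_pages=16))
print(check_limits("batch", "application/pdf", 5 * MB, pdf_pages=16))
```

A 16-page PDF fails the online check but passes the batch check, which is the usual reason to switch to batch processing.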

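Layout detection per file type

The following table lists the elements that Layout Parser can detect for each document file type.

File type	Detected elements	Limitations
HTML	paragraphs, tables, lists, titles, headings, page headers, page footers	Parsing relies heavily on HTML tags, so CSS-based formatting might not be captured.
PDF	paragraphs, tables, titles, headings, page headers, page footers	Tables that span multiple pages might be split into two tables.
DOCX (Preview)	paragraphs, tables that span multiple pages, lists, titles, heading elements	Nested tables aren't supported.
PPTX (Preview)	paragraphs, tables, lists, titles, heading elements	For headings to be identified accurately, they should be marked as headings in the PowerPoint file. Nested tables and hidden slides aren't supported.
XLSX/XLSM (Preview)	tables in Excel spreadsheets, with support for `INT`, `FLOAT`, and `STRING` values	Detecting multiple tables isn't supported. Hidden sheets, rows, or columns might also affect detection.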

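Before you begin

To turn on Layout Parser, follow these steps:

Create a Layout Parser by following the instructions in Creating and managing processors.

The processor type name is LAYOUT_PARSER_PROCESSOR.
Enable Layout Parser by following the instructions in Enable a processor.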

Send an online process request with Layout Parser

Input documents to Layout Parser to parse and chunk.

Follow the instructions for online processing requests in Send a processing request.

Configure the fields in ProcessOptions.layoutConfig in ProcessDocumentRequest.

REST

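Before using any of the request data, make the following replacements:

LOCATION: your processor's location, for example:
- us - United States
- eu - European Union
PROJECT_ID: your Google Cloud project ID.
PROCESSOR_ID: the ID of your custom processor.
MIME_TYPE: Layout Parser supports application/pdf and text/html.
DOCUMENT: the content to split into chunks. Layout Parser accepts raw PDF or HTML documents, or parsed documents that Layout Parser has already output.
CHUNK_SIZE: Optional. The chunk size, in tokens, to use when splitting documents.
INCLUDE_ANCESTOR_HEADINGS: Optional. Boolean. Whether to include ancestor headings when splitting documents.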

HTTP method and URL:

POST https://LOCATION-documentai.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/processors/PROCESSOR_ID:process

Request JSON body:

// Sample for inputting raw documents such as PDF or HTML
{
  "rawDocument": {
    "mimeType": "MIME_TYPE",
    "content": "DOCUMENT"
  },
  "processOptions": {
    "layoutConfig": {
      "chunkingConfig": {
        "chunkSize": "CHUNK_SIZE",
        "includeAncestorHeadings": "INCLUDE_ANCESTOR_HEADINGS"
      }
    }
  }
}
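The request body can also be assembled programmatically. The following is a minimal sketch, assuming the file is sent inline: `build_process_request` is a hypothetical helper name, and the file bytes are base64-encoded because JSON carries the `rawDocument.content` bytes field as a base64 string. Optional chunking fields are left out of the body when they are not set.

```python
import base64
import json
from typing import Any, Dict, Optional


def build_process_request(
    content: bytes,
    mime_type: str,
    chunk_size: Optional[int] = None,
    include_ancestor_headings: Optional[bool] = None,
) -> Dict[str, Any]:
    """Build a request body shaped like the JSON sample above."""
    body: Dict[str, Any] = {
        "rawDocument": {
            "mimeType": mime_type,
            # Bytes fields are represented as base64 text in JSON.
            "content": base64.b64encode(content).decode("ascii"),
        }
    }
    chunking: Dict[str, Any] = {}
    if chunk_size is not None:
        chunking["chunkSize"] = chunk_size
    if include_ancestor_headings is not None:
        chunking["includeAncestorHeadings"] = include_ancestor_headings
    if chunking:
        body["processOptions"] = {"layoutConfig": {"chunkingConfig": chunking}}
    return body


# Example with an inline HTML document; replace with the bytes of a real file.
request_body = build_process_request(
    b"<html><body><h1>Title</h1><p>Text</p></body></html>",
    "text/html",
    chunk_size=500,
    include_ancestor_headings=True,
)
print(json.dumps(request_body, indent=2))
```

Writing the output of `json.dumps` to a file named request.json gives a body that can be sent with the commands that follow.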

To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION-documentai.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/processors/PROCESSOR_ID:process"

PowerShell

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION-documentai.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/processors/PROCESSOR_ID:process" | Select-Object -Expand Content

The response includes the processed document, with layout and chunking information in Document.documentLayout and Document.chunkedDocument.
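Once the JSON response is loaded into a dictionary, the two fields can be read as in the following sketch. The helper names `walk_text_blocks` and `chunk_contents` are hypothetical, the sample response is hand-written for illustration rather than real service output, and only nested text blocks are followed; table and list blocks are skipped to keep the example short.

```python
from typing import Any, Dict, Iterator, List, Tuple


def walk_text_blocks(
    blocks: List[Dict[str, Any]], depth: int = 0
) -> Iterator[Tuple[int, str, str]]:
    """Yield (depth, type, text) for each text block, parents before children."""
    for block in blocks:
        text_block = block.get("textBlock")
        if text_block is None:
            # Table and list blocks are skipped in this sketch.
            continue
        yield depth, text_block.get("type", ""), text_block.get("text", "")
        yield from walk_text_blocks(text_block.get("blocks", []), depth + 1)


def chunk_contents(response: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (chunkId, content) pairs from Document.chunkedDocument."""
    chunks = (
        response.get("document", {}).get("chunkedDocument", {}).get("chunks", [])
    )
    return [(c.get("chunkId", ""), c.get("content", "")) for c in chunks]


# Hypothetical, hand-written response used only to exercise the helpers.
sample_response = {
    "document": {
        "documentLayout": {
            "blocks": [
                {
                    "blockId": "1",
                    "textBlock": {
                        "text": "Guide",
                        "type": "heading-1",
                        "blocks": [
                            {
                                "blockId": "2",
                                "textBlock": {
                                    "text": "Intro text.",
                                    "type": "paragraph",
                                },
                            }
                        ],
                    },
                }
            ]
        },
        "chunkedDocument": {
            "chunks": [
                {
                    "chunkId": "c1",
                    "sourceBlockIds": ["1", "2"],
                    "content": "# Guide\n\nIntro text.",
                }
            ]
        },
    }
}

layout_blocks = sample_response["document"]["documentLayout"]["blocks"]
for depth, block_type, text in walk_text_blocks(layout_blocks):
    print("  " * depth + f"[{block_type}] {text}")
for chunk_id, content in chunk_contents(sample_response):
    print(chunk_id, repr(content))
```

The recursive walk preserves the layout hierarchy, so indentation in the printed output reflects how deeply each block is nested under its headings.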

Python

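For more information, see the Document AI Python API reference documentation.

To authenticate to Document AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.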


from typing import Optional

from google.api_core.client_options import ClientOptions
from google.cloud import documentai

# TODO(developer): Uncomment these variables before running the sample.
# project_id = "YOUR_PROJECT_ID"
# location = "YOUR_PROCESSOR_LOCATION" # Format is "us" or "eu"
# processor_id = "YOUR_PROCESSOR_ID" # Create processor before running sample
# processor_version = "rc" # Refer to https://cloud.google.com/document-ai/docs/manage-processor-versions for more information
# file_path = "/path/to/local/pdf"
# mime_type = "application/pdf" # Refer to https://cloud.google.com/document-ai/docs/file-types for supported file types


def process_document_layout_sample(
    project_id: str,
    location: str,
    processor_id: str,
    processor_version: str,
    file_path: str,
    mime_type: str,
) -> documentai.Document:
    process_options = documentai.ProcessOptions(
        layout_config=documentai.ProcessOptions.LayoutConfig(
            chunking_config=documentai.ProcessOptions.LayoutConfig.ChunkingConfig(
                chunk_size=1000,
                include_ancestor_headings=True,
            )
        )
    )

    document = process_document(
        project_id,
        location,
        processor_id,
        processor_version,
        file_path,
        mime_type,
        process_options=process_options,
    )

    print("Document Layout Blocks")
    for block in document.document_layout.blocks:
        print(block)

    print("Document Chunks")
    for chunk in document.chunked_document.chunks:
        print(chunk)

    return document



def process_document(
    project_id: str,
    location: str,
    processor_id: str,
    processor_version: str,
    file_path: str,
    mime_type: str,
    process_options: Optional[documentai.ProcessOptions] = None,
) -> documentai.Document:
    # You must set the `api_endpoint` if you use a location other than "us".
    client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(
            api_endpoint=f"{location}-documentai.googleapis.com"
        )
    )

    # The full resource name of the processor version, e.g.:
    # `projects/{project_id}/locations/{location}/processors/{processor_id}/processorVersions/{processor_version_id}`
    # You must create a processor before running this sample.
    name = client.processor_version_path(
        project_id, location, processor_id, processor_version
    )

    # Read the file into memory
    with open(file_path, "rb") as document_file:
        file_content = document_file.read()

    # Configure the process request
    request = documentai.ProcessRequest(
        name=name,
        raw_document=documentai.RawDocument(content=file_content, mime_type=mime_type),
        # Carries the Layout Parser options, such as the chunking configuration
        process_options=process_options,
    )

    result = client.process_document(request=request)

    # For a full list of `Document` object attributes, reference this page:
    # https://cloud.google.com/document-ai/docs/reference/rest/v1/Document
    return result.document

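Batch process documents with Layout Parser

Use the following procedure to parse and chunk multiple documents in a single request.

Input documents to Layout Parser to parse and chunk.

Follow the instructions for batch processing requests in Send a processing request.

Configure the fields in ProcessOptions.layoutConfig when making a batchProcess request.
Input
The following example JSON configures ProcessOptions.layoutConfig.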
```
"processOptions": {
  "layoutConfig": {
    "chunkingConfig": {
      "chunkSize": "CHUNK_SIZE",
      "includeAncestorHeadings": "INCLUDE_ANCESTOR_HEADINGS_BOOLEAN"
    }
  }
}
```
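Replace the following:
- CHUNK_SIZE: the maximum chunk size, in number of tokens, to use when splitting documents.
- INCLUDE_ANCESTOR_HEADINGS_BOOLEAN: whether to include ancestor headings when splitting documents. Ancestor headings are the parent headings of subheadings in the original document. They give a chunk additional context about its position in the original document. A chunk can include up to two levels of headings.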
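For context, the fragment above sits inside a complete batchProcess request body. The following sketch shows one way to assemble such a body, assuming the input documents are read from a Cloud Storage prefix and the results are written to another Cloud Storage location. `build_batch_request` is a hypothetical helper, and the bucket paths are placeholders rather than real resources.

```python
import json
from typing import Any, Dict


def build_batch_request(
    gcs_input_prefix: str,
    gcs_output_uri: str,
    chunk_size: int,
    include_ancestor_headings: bool,
) -> Dict[str, Any]:
    """Build a batchProcess request body with the chunking configuration."""
    return {
        # Process every supported file under this Cloud Storage prefix.
        "inputDocuments": {"gcsPrefix": {"gcsUriPrefix": gcs_input_prefix}},
        # Write the parsed and chunked results under this Cloud Storage URI.
        "documentOutputConfig": {"gcsOutputConfig": {"gcsUri": gcs_output_uri}},
        "processOptions": {
            "layoutConfig": {
                "chunkingConfig": {
                    "chunkSize": chunk_size,
                    "includeAncestorHeadings": include_ancestor_headings,
                }
            }
        },
    }


# Placeholder bucket paths; substitute your own.
batch_body = build_batch_request(
    "gs://example-input-bucket/docs/",
    "gs://example-output-bucket/parsed/",
    chunk_size=500,
    include_ancestor_headings=True,
)
print(json.dumps(batch_body, indent=2))
```

Passing the chunk size as an integer and the heading flag as a boolean keeps the serialized JSON typed the way the replacement descriptions above read, instead of sending quoted strings.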

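What's next

Review the processors list.
Create a custom classifier.
Use Enterprise Document OCR to detect and extract text.
See Send a batch process documents request to learn how to handle responses.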
