aborroy/alf-tengine-pii


Alfresco Transform Engine – PII Redaction

This project provides a custom Transform Engine for Alfresco that automatically detects and redacts Personally Identifiable Information (PII) in PDF documents.
It integrates Microsoft Presidio for entity detection and redaction, producing a sanitized PDF version that can be safely shared, archived or previewed.

Features

  • Detects common PII entities such as names, phone numbers, emails, IP addresses, and credit card numbers.
  • Redacts sensitive content in PDF documents, producing a new PDF as output.
  • Configurable detection and redaction options through pii_engine_config.json.
  • Runs as an Alfresco Transform Engine and can be combined with other T-Engines in a deployment.
  • Packaged with Docker for easy build and deployment.

Requirements

Run

  • Docker and Docker Compose or Kubernetes

Develop

  • Java 17+
  • Maven 3.9+
  • Docker and optionally Docker Compose
  • Python 3

Build

To build the JAR:

mvn clean package -DskipTests

To build the Docker image:

docker build -t alfresco-tengine-pii . --load

Usage

Run the service using Docker Compose:

docker compose up

This will start the PII Redaction Transform Engine, exposing it for use by Alfresco Content Services.

Testing with the HTML Interface

After starting the service, open the test application at http://localhost:8090. Use the following input values:

  • file: Upload a PDF file containing sensitive data
  • sourceMimetype: application/pdf
  • targetMimetype: application/pdf

You may also use the following transformation options:

  • entities: PERSON
  • label: REDACTED
  • score-threshold: 0.6

The meaning of these options is explained in the Configuration section.

Click the Transform button to process the PDF file.
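The same request can be prepared without the browser form. The following is a minimal Python sketch, assuming the engine exposes the usual Alfresco T-Engine `POST /transform` multipart endpoint on the same port and reads transform options as plain form fields. The helper name `build_transform_request` and the `input.pdf` file name are hypothetical.

```python
import uuid
import urllib.request


def build_transform_request(pdf_bytes, options, base_url="http://localhost:8090"):
    """Build (but do not send) a multipart/form-data POST for the T-Engine.

    Assumption: the engine exposes the usual T-Engine `/transform` endpoint
    and accepts transform options as extra form fields.
    """
    boundary = uuid.uuid4().hex
    fields = {
        "sourceMimetype": "application/pdf",
        "targetMimetype": "application/pdf",
        **options,
    }
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    # The PDF itself goes in the "file" part, as in the HTML form.
    parts.append(
        (
            f'--{boundary}\r\nContent-Disposition: form-data; name="file"; filename="input.pdf"\r\n'
            "Content-Type: application/pdf\r\n\r\n"
        ).encode()
        + pdf_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return urllib.request.Request(
        url=f"{base_url}/transform",
        data=b"".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


request = build_transform_request(
    b"%PDF-1.4 placeholder",  # placeholder bytes; read a real PDF from disk instead
    {"entities": "PERSON", "label": "REDACTED", "score-threshold": "0.6"},
)
# To actually send it (requires the service to be running):
# with urllib.request.urlopen(request) as response:
#     open("redacted.pdf", "wb").write(response.read())
```

Building the request separately from sending it makes it easy to inspect the form fields before pointing the call at a running engine.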

Configuration

Redaction behavior is controlled by src/main/resources/pii_engine_config.json.

The following options are available when invoking the transformation (sample values shown):

{
  "entities": ["PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS"],
  "scoreThreshold": 0.6,
  "label": "PII"
}
  • entities: List of PII types to detect (see Presidio supported entities).
  • scoreThreshold: Confidence threshold (between 0.0 and 1.0). Higher values keep only higher-confidence detections, so fewer entities are redacted.
  • label: Text label used to replace redacted entities in the PDF.
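To illustrate how the three options interact, here is a small Python sketch that filters a list of detections by entity type and score and then substitutes the label. It is not the engine's implementation: the `Detection` class, the sample scores, and the assumption that a score equal to the threshold is kept are all illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """Hypothetical stand-in for one analyzer result."""
    entity_type: str
    text: str
    score: float


def select_detections(detections, entities, score_threshold):
    """Keep detections of a requested type scoring at or above the threshold."""
    wanted = set(entities)
    return [
        d for d in detections
        if d.entity_type in wanted and d.score >= score_threshold
    ]


# Illustrative data only; real scores come from Presidio.
found = [
    Detection("PERSON", "John Doe", 0.85),
    Detection("EMAIL_ADDRESS", "john.doe@example.com", 1.0),
    Detection("PERSON", "Transform", 0.4),          # below the threshold
    Detection("IP_ADDRESS", "192.168.1.100", 0.95),  # type not requested
]

kept = select_detections(found, ["PERSON", "EMAIL_ADDRESS"], 0.6)

redacted = "Contact John Doe at john.doe@example.com"
for d in kept:
    redacted = redacted.replace(d.text, "PII")  # "PII" plays the role of `label`
print(redacted)  # Contact PII at PII
```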

Language Support

By default, this project is configured for English using Presidio’s built-in models.

To support additional languages:

  1. Install and configure the appropriate spaCy language model. For example, to add Spanish:

    pip install spacy
    python -m spacy download es_core_news_md
  2. Update the Presidio Analyzer configuration to use the desired model.

  3. Modify pii_engine_config.json if you want entity definitions aligned with the target language.

See the Presidio multilingual documentation for more details.
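As a rough illustration of step 2, the sketch below builds the NLP configuration dictionary in the shape that Presidio's `NlpEngineProvider` accepts. How this project actually wires up the analyzer is not described here, so treat the helper `build_nlp_configuration` as hypothetical and the choice of `en_core_web_lg` for English as an assumption.

```python
def build_nlp_configuration(models):
    """Return a Presidio-style NLP engine configuration for spaCy models.

    `models` maps a language code to a spaCy model name.
    """
    return {
        "nlp_engine_name": "spacy",
        "models": [
            {"lang_code": lang, "model_name": name}
            for lang, name in models.items()
        ],
    }


config = build_nlp_configuration({
    "en": "en_core_web_lg",   # assumption: the English model in use
    "es": "es_core_news_md",  # the Spanish model installed in step 1
})
supported_languages = [m["lang_code"] for m in config["models"]]

# With presidio-analyzer installed, the configuration would be used like this:
# from presidio_analyzer import AnalyzerEngine
# from presidio_analyzer.nlp_engine import NlpEngineProvider
# nlp_engine = NlpEngineProvider(nlp_configuration=config).create_engine()
# analyzer = AnalyzerEngine(nlp_engine=nlp_engine,
#                           supported_languages=supported_languages)
# results = analyzer.analyze(text="Me llamo Juan", language="es")
```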

Extracting PII Metadata (PDF > alfresco-metadata-extract)

In addition to producing a redacted PDF, this Transform Engine can also extract PII metadata from PDF documents.
This operation exposes detected entities, counts, confidence scores, and the actual values as structured metadata, which can be mapped to Alfresco properties.

Testing via HTML interface

Use the same test app at http://localhost:8090 with these values:

  • file: Upload a PDF file with sensitive data
  • sourceMimetype: application/pdf
  • targetMimetype: alfresco-metadata-extract

Optional transform options:

  • entities: Comma-separated list of entities (e.g. PERSON,EMAIL_ADDRESS)
  • score-threshold: Minimum confidence (e.g. 0.6)
  • language: NLP language model to use (default en)

The output is JSON similar to the following:

{
  "{http://example.com/model/pii/1.0}countEmail": 1,
  "{http://example.com/model/pii/1.0}countPerson": 1,
  "{http://example.com/model/pii/1.0}entities": "CREDIT_CARD,DATE_TIME,EMAIL_ADDRESS,IBAN_CODE,IP_ADDRESS,PERSON,PHONE_NUMBER",
  "{http://example.com/model/pii/1.0}hasPII": true,
  "{http://example.com/model/pii/1.0}scoreAvg": 0.9286,
  "{http://example.com/model/pii/1.0}scoreMax": 1,
  "{http://example.com/model/pii/1.0}values": "+1 (217) 555-1234,192.168.1.100,2025-09-18,4111 1111 1111 1111,DE89 3704 0044 0532 0130 00,John Doe,john.doe@example.com"
}

Mapping into Alfresco

These flat keys are resolved through the standard PIIMetadataExtractor_metadata_extract.properties mapping file in the T-Engine, e.g.:

namespace.prefix.pii=http://example.com/model/pii/1.0

pii.hasPII=pii:hasPII
pii.entities=pii:entities
pii.counts.PERSON=pii:countPerson
pii.counts.EMAIL_ADDRESS=pii:countEmail
pii.values=pii:values
pii.values.PERSON=pii:valuesPerson
pii.values.EMAIL_ADDRESS=pii:valuesEmail
pii.scoreMax=pii:scoreMax
pii.scoreAvg=pii:scoreAvg

Ensure your Alfresco content model defines the corresponding pii:* properties, or temporarily map them to existing cm: properties for testing.
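To make the relationship between the mapping file and the JSON keys concrete, the following Python sketch parses a subset of the mapping and expands each prefixed name into the `{namespace}localName` form seen in the output. It is a simplified reader for illustration, not the parser Alfresco uses, and `resolve_mapping` is a hypothetical helper.

```python
MAPPING = """\
namespace.prefix.pii=http://example.com/model/pii/1.0
pii.hasPII=pii:hasPII
pii.counts.PERSON=pii:countPerson
pii.scoreMax=pii:scoreMax
"""


def resolve_mapping(properties_text):
    """Map extractor keys to fully qualified Alfresco property names."""
    prefixes, targets = {}, {}
    for line in properties_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        if key.startswith("namespace.prefix."):
            prefixes[key[len("namespace.prefix."):]] = value
        else:
            targets[key] = value
    resolved = {}
    for key, prefixed in targets.items():
        prefix, _, local_name = prefixed.partition(":")
        resolved[key] = f"{{{prefixes[prefix]}}}{local_name}"
    return resolved


qualified = resolve_mapping(MAPPING)
print(qualified["pii.counts.PERSON"])
# {http://example.com/model/pii/1.0}countPerson
```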

Deployment

Deploying with Alfresco Community 25.x

Ensure your compose.yaml file includes the following configuration:

services:
  alfresco:
    environment:
      JAVA_OPTS: >-
        -DlocalTransform.core-aio.url=http://transform-core-aio:8090/
        -DlocalTransform.pdf-pii.url=http://transform-pii:8090/

  transform-core-aio:
    image: alfresco/alfresco-transform-core-aio:${TRANSFORM_ENGINE_TAG}

  transform-pii:
    image: alfresco-tengine-pii

Key Configuration Updates:

  • Add localTransform.pdf-pii.url to the Alfresco service (http://transform-pii:8090/ by default).
  • Define the transform-pii service using the custom-built image.
  • Ensure you have built the Docker image (alfresco-tengine-pii) before running Docker Compose.

Deploying with Alfresco Enterprise 25.x

Ensure your compose.yaml file includes the following configuration:

services:
  alfresco:
    environment:
      JAVA_OPTS: >-
        -Dtransform.service.enabled=true
        -Dtransform.service.url=http://transform-router:8095
        -Dsfs.url=http://shared-file-store:8099/

  transform-router:
    image: quay.io/alfresco/alfresco-transform-router:5.2.0
    environment:
      CORE_AIO_URL: "http://transform-core-aio:8090"
      TRANSFORMER_URL_PDF_PII: "http://transform-pii:8090"
      TRANSFORMER_QUEUE_PDF_PII: "pii-engine-queue"

  transform-pii:
    image: alfresco-tengine-pii
    environment:
      ACTIVEMQ_URL: "nio://activemq:61616"
      FILE_STORE_URL: >-
        http://shared-file-store:8099/alfresco/api/-default-/private/sfs/versions/1/file

Key Configuration Updates:

  • Register the PII transformer with transform-router:
    • URL: http://transform-pii:8090/ (default).
    • Queue Name: pii-engine-queue (defined in application-default.yaml).
  • Define the transform-pii service and link it to ActiveMQ and Shared File Store services.
  • Ensure you have built the Docker image (alfresco-tengine-pii) before running Docker Compose.
