This project provides a custom Transform Engine for Alfresco that automatically detects and redacts Personally Identifiable Information (PII) in PDF documents.
It integrates Microsoft Presidio for entity detection and redaction, producing a sanitized PDF version that can be safely shared, archived or previewed.
- Detects common PII entities such as names, phone numbers, emails, IP addresses, and credit card numbers.
- Redacts sensitive content in PDF documents, producing a new PDF as output.
- Configurable detection and redaction options through pii_engine_config.json.
- Runs as an Alfresco Transform Engine and can be combined with other T-Engines in a deployment.
- Packaged with Docker for easy build and deployment.
Run
- Docker and Docker Compose or Kubernetes
Develop
- Java 17+
- Maven 3.9+
- Docker and optionally Docker Compose
- Python 3
To build the JAR:
mvn clean package -DskipTests
To build the Docker image:
docker build -t alfresco-tengine-pii . --load
Run the service using Docker Compose:
docker compose up
This will start the PII Redaction Transform Engine, exposing it for use by Alfresco Content Services.
After starting the service, open the test application at http://localhost:8090. Use the following input values:
- file: Upload a PDF file containing sensitive data
- sourceMimetype: application/pdf
- targetMimetype: application/pdf
You may also use the following transformation options:
- entities: PERSON
- label: REDACTED
- score-threshold: 0.6
The meaning of these options is explained below.
Click the Transform button to process the PDF file.
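The same transformation can also be scripted instead of using the test page. The following is a minimal sketch, assuming the engine exposes the standard Alfresco T-Engine POST /transform endpoint, with the PDF sent as a multipart part named file and the options passed as request parameters. The endpoint path, the helper name build_transform_request, and the file names are assumptions for illustration and are not taken from this project.

```python
import uuid
from urllib.parse import urlencode
from urllib.request import Request


def build_transform_request(base_url, pdf_bytes, filename, options=None):
    """Build (but do not send) a multipart POST for a PDF-to-PDF redaction."""
    params = {
        "sourceMimetype": "application/pdf",
        "targetMimetype": "application/pdf",
    }
    params.update(options or {})
    # Assumption: the standard T-Engine /transform endpoint is available.
    url = base_url.rstrip("/") + "/transform?" + urlencode(params)

    # Hand-built multipart body with a single part named "file".
    boundary = uuid.uuid4().hex
    disposition = 'Content-Disposition: form-data; name="file"; filename="%s"\r\n' % filename
    body = b"".join([
        ("--" + boundary + "\r\n").encode("ascii"),
        disposition.encode("utf-8"),
        b"Content-Type: application/pdf\r\n\r\n",
        pdf_bytes,
        ("\r\n--" + boundary + "--\r\n").encode("ascii"),
    ])
    headers = {"Content-Type": "multipart/form-data; boundary=" + boundary}
    return Request(url, data=body, headers=headers, method="POST")


# Usage (requires the engine to be running on localhost:8090):
#   from urllib.request import urlopen
#   with open("sample.pdf", "rb") as f:
#       req = build_transform_request(
#           "http://localhost:8090", f.read(), "sample.pdf",
#           {"entities": "PERSON", "label": "REDACTED", "score-threshold": "0.6"})
#   with urlopen(req) as resp, open("redacted.pdf", "wb") as out:
#       out.write(resp.read())
```

Building the request separately from sending it keeps the sketch easy to inspect without a running engine.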
Redaction behavior is controlled by src/main/resources/pii_engine_config.json.
The following options are available when invoking the transformation (sample values shown):
{
  "entities": ["PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS"],
  "scoreThreshold": 0.6,
  "label": "PII"
}
- entities: List of PII types to detect (see Presidio supported entities).
- scoreThreshold: Confidence threshold (between 0.0 and 1.0). Higher values mean stricter detection.
- label: Text label used to replace redacted entities in the PDF.
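To illustrate how the three options interact, here is a minimal sketch that works on plain text rather than on a PDF. It is not the engine's implementation: the Detection class and the redact function are hypothetical, and the assumption is that a detection is kept when its type is listed in entities and its score is at least the threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detected entity: its type, character span, and confidence score."""
    entity_type: str
    start: int
    end: int
    score: float


def redact(text, detections, entities, score_threshold, label):
    """Replace every kept detection with the label."""
    kept = [
        d for d in detections
        if d.entity_type in entities and d.score >= score_threshold
    ]
    # Replace from right to left so earlier offsets remain valid.
    for d in sorted(kept, key=lambda d: d.start, reverse=True):
        text = text[:d.start] + label + text[d.end:]
    return text


sample_text = "Call John Doe at 555-1234 or mail john@example.com"


def _span(entity_type, value, score):
    """Locate a value in the sample text and wrap it as a Detection."""
    start = sample_text.index(value)
    return Detection(entity_type, start, start + len(value), score)


# Hand-written detections with made-up scores, standing in for analyzer output.
sample_detections = [
    _span("PERSON", "John Doe", 0.85),
    _span("PHONE_NUMBER", "555-1234", 0.4),
    _span("EMAIL_ADDRESS", "john@example.com", 1.0),
]

print(redact(sample_text, sample_detections,
             {"PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS"}, 0.6, "PII"))
# prints: Call PII at 555-1234 or mail PII
```

With the threshold at 0.6, the low-confidence phone number is left untouched; lowering the threshold to 0.3 would redact it as well.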
By default, this project is configured for English using Presidio’s built-in models.
To support additional languages:
- Install and configure the appropriate spaCy language model. For example, to add Spanish:
  pip install spacy
  python -m spacy download es_core_news_md
- Update the Presidio Analyzer configuration to use the desired model.
- Modify pii_engine_config.json if you want entity definitions aligned with the target language.
See the Presidio multilingual documentation for more details.
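As a rough illustration of the second step, the sketch below builds a Presidio NLP engine configuration for English and Spanish and passes it to an AnalyzerEngine. It assumes the presidio-analyzer package and both spaCy models are installed; the English model name en_core_web_lg and the helper names are assumptions, and how this project actually wires the analyzer into the T-Engine may differ.

```python
def build_nlp_configuration(models):
    """Build a Presidio NLP engine configuration from {lang_code: spaCy model}."""
    return {
        "nlp_engine_name": "spacy",
        "models": [
            {"lang_code": lang, "model_name": model}
            for lang, model in models.items()
        ],
    }


def create_analyzer(models):
    """Create an AnalyzerEngine for the given languages (requires presidio-analyzer)."""
    # Imported lazily so the configuration helper works without Presidio installed.
    from presidio_analyzer import AnalyzerEngine
    from presidio_analyzer.nlp_engine import NlpEngineProvider

    provider = NlpEngineProvider(nlp_configuration=build_nlp_configuration(models))
    return AnalyzerEngine(
        nlp_engine=provider.create_engine(),
        supported_languages=list(models),
    )


# Assumption: en_core_web_lg is the English model in use.
MODELS = {"en": "en_core_web_lg", "es": "es_core_news_md"}

# Usage (requires Presidio and both spaCy models):
#   analyzer = create_analyzer(MODELS)
#   results = analyzer.analyze(text="Me llamo Juan Pérez", language="es")
```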
In addition to producing a redacted PDF, this Transform Engine can also extract PII metadata from PDF documents.
This operation exposes detected entities, counts, confidence scores, and the actual values as structured metadata, which can be mapped to Alfresco properties.
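The kind of summary described here can be sketched as follows. This is an illustrative aggregation only, not the engine's code: the summarize function, the tuple layout of a detection, and the four-decimal rounding of the average are assumptions.

```python
from collections import Counter


def summarize(detections):
    """Aggregate (entity_type, value, score) tuples into a flat summary."""
    counts = Counter(entity_type for entity_type, _, _ in detections)
    scores = [score for _, _, score in detections]
    return {
        "hasPII": bool(detections),
        # Sorted, comma-separated entity types and distinct values.
        "entities": ",".join(sorted(counts)),
        "counts": dict(counts),
        "values": ",".join(sorted({value for _, value, _ in detections})),
        "scoreMax": max(scores, default=0.0),
        "scoreAvg": round(sum(scores) / len(scores), 4) if scores else 0.0,
    }


# Hand-written detections with made-up scores.
sample = [
    ("PERSON", "John Doe", 0.85),
    ("EMAIL_ADDRESS", "john.doe@example.com", 1.0),
]
summary = summarize(sample)
print(summary["entities"])  # prints: EMAIL_ADDRESS,PERSON
```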
Use the same test app at http://localhost:8090 with these values:
- file: Upload a PDF file with sensitive data
- sourceMimetype: application/pdf
- targetMimetype: alfresco-metadata-extract
Optional transform options:
- entities: Comma-separated list of entities (e.g. PERSON,EMAIL_ADDRESS)
- score-threshold: Minimum confidence (e.g. 0.6)
- language: NLP language model to use (default en)
The output will be JSON like:
{
  "{http://example.com/model/pii/1.0}countEmail": 1,
  "{http://example.com/model/pii/1.0}countPerson": 1,
  "{http://example.com/model/pii/1.0}entities": "CREDIT_CARD,DATE_TIME,EMAIL_ADDRESS,IBAN_CODE,IP_ADDRESS,PERSON,PHONE_NUMBER",
  "{http://example.com/model/pii/1.0}hasPII": true,
  "{http://example.com/model/pii/1.0}scoreAvg": 0.9286,
  "{http://example.com/model/pii/1.0}scoreMax": 1,
  "{http://example.com/model/pii/1.0}values": "+1 (217) 555-1234,192.168.1.100,2025-09-18,4111 1111 1111 1111,DE89 3704 0044 0532 0130 00,John Doe,john.doe@example.com"
}
These flat keys are resolved through the standard PIIMetadataExtractor_metadata_extract.properties mapping file in the T-Engine, e.g.:
namespace.prefix.pii=http://example.com/model/pii/1.0
pii.hasPII=pii:hasPII
pii.entities=pii:entities
pii.counts.PERSON=pii:countPerson
pii.counts.EMAIL_ADDRESS=pii:countEmail
pii.values=pii:values
pii.values.PERSON=pii:valuesPerson
pii.values.EMAIL_ADDRESS=pii:valuesEmail
pii.scoreMax=pii:scoreMax
pii.scoreAvg=pii:scoreAvg
Ensure your Alfresco content model defines the corresponding pii:* properties, or temporarily map them to existing cm: properties for testing.
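To make the relationship between the mapping file and the fully qualified keys in the JSON output concrete, the sketch below parses a few mapping lines and expands each prefixed property name with its namespace URI. It is a simplified illustration, not the framework's resolver: the function names are hypothetical and each key is assumed to map to a single property.

```python
SAMPLE_MAPPING = """\
namespace.prefix.pii=http://example.com/model/pii/1.0
pii.hasPII=pii:hasPII
pii.counts.PERSON=pii:countPerson
pii.counts.EMAIL_ADDRESS=pii:countEmail
pii.scoreMax=pii:scoreMax
"""

NAMESPACE_KEY = "namespace.prefix."


def parse_mapping(properties_text):
    """Split mapping lines into namespace prefixes and key-to-property mappings."""
    namespaces, mappings = {}, {}
    for line in properties_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first "=" only; namespace URIs contain ":" and "/".
        key, _, value = line.partition("=")
        if key.startswith(NAMESPACE_KEY):
            namespaces[key[len(NAMESPACE_KEY):]] = value
        else:
            mappings[key] = value
    return namespaces, mappings


def qualified_name(prefixed, namespaces):
    """Expand a prefixed name such as pii:hasPII to {namespace-uri}hasPII."""
    prefix, _, local_name = prefixed.partition(":")
    return "{" + namespaces[prefix] + "}" + local_name


namespaces, mappings = parse_mapping(SAMPLE_MAPPING)
print(qualified_name(mappings["pii.counts.PERSON"], namespaces))
# prints: {http://example.com/model/pii/1.0}countPerson
```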
To call the engine directly from the repository as a local transform, ensure your compose.yaml file includes the following configuration:
services:
  alfresco:
    environment:
      JAVA_OPTS: >-
        -DlocalTransform.core-aio.url=http://transform-core-aio:8090/
        -DlocalTransform.pdf-pii.url=http://transform-pii:8090/
  transform-core-aio:
    image: alfresco/alfresco-transform-core-aio:${TRANSFORM_ENGINE_TAG}
  transform-pii:
    image: alfresco-tengine-pii
Key Configuration Updates:
- Add localTransform.pdf-pii.url to the Alfresco service (http://transform-pii:8090/ by default).
- Define the transform-pii service using the custom-built image.
- Ensure you have built the Docker image (alfresco-tengine-pii) before running Docker Compose.
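To route requests through the Transform Service (transform-router, ActiveMQ and Shared File Store), ensure your compose.yaml file includes the following configuration: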
services:
  alfresco:
    environment:
      JAVA_OPTS: >-
        -Dtransform.service.enabled=true
        -Dtransform.service.url=http://transform-router:8095
        -Dsfs.url=http://shared-file-store:8099/
  transform-router:
    image: quay.io/alfresco/alfresco-transform-router:5.2.0
    environment:
      CORE_AIO_URL: "http://transform-core-aio:8090"
      TRANSFORMER_URL_PDF_PII: "http://transform-pii:8090"
      TRANSFORMER_QUEUE_PDF_PII: "pii-engine-queue"
  transform-pii:
    image: alfresco-tengine-pii
    environment:
      ACTIVEMQ_URL: "nio://activemq:61616"
      FILE_STORE_URL: >-
        http://shared-file-store:8099/alfresco/api/-default-/private/sfs/versions/1/file
Key Configuration Updates:
- Register the PII transformer with transform-router:
  - URL: http://transform-pii:8090/ (default).
  - Queue Name: pii-engine-queue (defined in application-default.yaml).
- Define the transform-pii service and link it to ActiveMQ and Shared File Store services.
- Ensure you have built the Docker image (alfresco-tengine-pii) before running Docker Compose.