这是indexloc提供的服务,不要输入任何密码
Skip to content

Conversation

@homanp
Copy link
Collaborator

@homanp homanp commented Oct 16, 2025

Description

This pull request introduces custom entity redaction to the Superagent CLI and SDKs, allowing users to specify which types of PII to redact using natural language descriptions. The CLI now supports a --entities flag, and both the TypeScript and Python SDKs accept an entities parameter for flexible, context-aware redaction. Documentation and tests have been updated to reflect and validate these new features.

Custom Entity Redaction Feature

  • Added support for specifying custom entities to redact using natural language in both TypeScript and Python SDKs via an entities parameter in the redact() method. The CLI now accepts a --entities flag for the same purpose. [1] [2] [3] [4] [5] [6]
  • Updated the request body sent to the redaction API to include the entities list if provided, enabling AI-powered interpretation of custom entity descriptions. [1] [2]

Documentation Updates

  • Expanded CLI, TypeScript, and Python SDK documentation with sections and examples for custom entity redaction, including usage of the new flag/parameter and sample entity descriptions. [1] [2] [3] [4] [5] [6]

API and Type Safety Improvements

  • Introduced a RedactOptions interface to the TypeScript SDK for better type safety and extended the Client interface to support the new options.
  • Added support for a new reasoning field in API responses for both SDKs, with backward compatibility for the previous reasoning_content field. [1] [2] [3] [4]

Testing Enhancements

  • Added new tests in both Python and TypeScript SDKs to verify that custom entity redaction works as expected, including cases for single and multiple entities. [1] [2]

Version Bumps

  • Bumped CLI, TypeScript SDK, and Python SDK versions to reflect the new feature releases. [1] [2] [3]

Change references: [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21]

Related Issue

Fixes #1053

Checklist

  • I tested my changes
  • I reviewed my own code

@homanp homanp self-assigned this Oct 16, 2025
@vercel
Copy link

vercel bot commented Oct 16, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
docs Ready Ready Preview Comment Oct 16, 2025 7:15am

@homanp homanp changed the title Feat/blacklist redaction option feat: blacklist redaction option Oct 16, 2025
@claude
Copy link

claude bot commented Oct 16, 2025

Pull Request Review: Custom Entity Redaction Feature

Overall Assessment

This is a well-structured PR that adds valuable functionality for custom entity redaction using natural language descriptions. The implementation is consistent across CLI, TypeScript SDK, and Python SDK, with good documentation and test coverage.

Recommendation: ✅ Approve with minor suggestions for improvement


Code Quality & Best Practices

✅ Strengths

  1. Consistent API Design: The entities parameter is implemented consistently across all three surfaces (CLI, TypeScript, Python)
  2. Backward Compatibility: Good handling of both reasoning and reasoning_content fields for API compatibility
  3. Type Safety: Added RedactOptions interface in TypeScript for better type safety
  4. Documentation: Comprehensive documentation updates with clear examples
  5. Version Management: Proper version bumps across all packages

⚠️ Suggestions for Improvement

  1. CLI Argument Parsing (cli/src/commands/redact.ts:21-37)

    • The current implementation assumes --entities value is always at entitiesFlagIndex + 1, but doesn't validate it's not another flag
    • Consider edge case: superagent redact --entities --url-whitelist ... would incorrectly treat --url-whitelist as the entities value

    Suggested fix:

    if (entitiesFlagIndex !== -1) {
      const entitiesValue = args[entitiesFlagIndex + 1];
      if (!entitiesValue || entitiesValue.startsWith('--')) {
        console.error('❌ ERROR: --entities requires a comma-separated list of entity types');
        process.exit(1);
      }
      // ... rest of logic
    }
  2. Empty Entities Array Handling

    • TypeScript SDK checks entities.length > 0 before adding to request (line 299)
    • Python SDK doesn't have this check - it will send an empty array if entities=[]

    Suggested alignment in Python (client.py:213):

    if entities and len(entities) > 0:
        request_body["entities"] = entities
  3. Error Handling for Invalid Entities

    • No validation that entities are non-empty strings
    • CLI splits on comma but doesn't validate the resulting entities

    Consider adding:

    // In CLI after split
    entities = entitiesValue.split(',')
      .map(entity => entity.trim())
      .filter(entity => entity.length > 0);
    
    if (entities.length === 0) {
      console.error('❌ ERROR: --entities must contain at least one valid entity type');
      process.exit(1);
    }

Potential Bugs & Issues

🔴 Medium Priority

  1. Test Assertions May Be Fragile (sdk/python/tests/test_guard.py:212, sdk/typescript/tests/guard.test.ts:180)

    • Tests assume SSN 123-45-6789 will NOT be redacted when only requesting email/phone redaction
    • This relies on the API's interpretation and may be brittle if the model's behavior changes
    • Recommendation: Add a comment explaining this assumption or use mock responses for more deterministic tests
  2. No Validation of API Response Format

    • Both SDKs assume the API response has the expected structure
    • If the API changes or returns an error in a different format, could throw unclear errors
    • Suggestion: Add basic validation after result = response.json() to ensure required fields exist

Security Concerns

✅ Good Security Practices

  1. No Direct PII Storage: The feature enables better PII detection, which is security-positive
  2. Defensive Coding: Good use of optional chaining and fallbacks for API response fields
  3. API Key Handling: Proper use of environment variables and headers

⚠️ Minor Considerations

  1. Entity Descriptions as Attack Vector

    • Malicious users could potentially craft entity descriptions to manipulate redaction behavior
    • Example: entities=["nothing", "everything except secrets"]
    • Recommendation: Consider documenting best practices or adding warnings about entity description specificity
  2. URL Whitelist Before Entities

    • In Python SDK (client.py:248-254), URL whitelist is applied AFTER getting API response
    • If entities include "URLs", the API might redact them before local whitelist is applied
    • Recommendation: Document the interaction between entities and url_whitelist, or clarify order of operations

Performance Considerations

✅ Efficient Implementation

  1. Minimal Overhead: Only adds a small array to the request body
  2. No Additional API Calls: Single request handles custom entities
  3. Optional Parameter: No performance impact when not used

💡 Optimization Opportunities

  1. Entity Deduplication

    • If user passes duplicate entities, they're sent to API as-is
    • Minor optimization: deduplicate before sending
    # In Python SDK
    if entities:
        request_body["entities"] = list(set(entities))  # Remove duplicates
  2. Caching Potential

    • Identical text + entities combinations could benefit from caching
    • Consider documenting this as a future enhancement for high-volume use cases

Test Coverage

✅ Good Coverage

  1. Happy Path Tests: Both SDKs test single and multiple entities
  2. Integration Tests: Tests hit actual API endpoints (based on API_BASE_URL usage)
  3. Backward Compatibility: Existing tests still pass (implicitly validates backward compat)

📈 Opportunities for Additional Tests

  1. Edge Cases:

    • Empty string in entities array: entities=["", "email"]
    • Very long entity descriptions
    • Special characters in entity descriptions
    • Interaction between entities and url_whitelist
  2. Error Scenarios:

    • API returns error when entities are invalid
    • Network timeout with entities parameter
    • Malformed API response
  3. CLI Tests:

    • No test file visible for CLI command
    • Recommendation: Add CLI integration tests for --entities flag

Example test cases to add:

// TypeScript
it("redact with empty entities array sends no entities", async () => {
  const result = await client.redact(text, { entities: [] });
  // Verify request body doesn't include entities field
});

it("redact with entities and urlWhitelist works together", async () => {
  const result = await client.redact(
    "Email: john@example.com, URL: https://example.com",
    { entities: ["email"], urlWhitelist: ["https://example.com"] }
  );
  // Verify email redacted but URL preserved
});

Documentation Quality

✅ Excellent Documentation

  1. Clear Examples: All three platforms have usage examples
  2. Natural Language Descriptions: Good examples of entity descriptions
  3. Combined Usage: Shows using entities with url_whitelist together

💡 Minor Enhancements

  1. Add Performance Note: Document that more specific entity descriptions may yield better results
  2. Add Limitations: Mention any known limitations or edge cases
  3. Add Examples of What NOT to Do: Show antipatterns (e.g., too vague descriptions like "data")

Changelog Review

The CHANGELOG.md is well-structured and comprehensive. However:

  1. Missing Breaking Changes Section: If the API response format changed (reasoning vs reasoning_content), should note if this is breaking
  2. Consider Semantic Versioning: From 0.0.11 → 0.0.12 suggests patch, but this is a new feature (minor bump: 0.1.0?)

Summary & Recommendations

Category Rating Notes
Code Quality ⭐⭐⭐⭐ Clean, consistent, well-structured
Security ⭐⭐⭐⭐ No major concerns, minor considerations documented
Performance ⭐⭐⭐⭐⭐ Minimal overhead, efficient implementation
Test Coverage ⭐⭐⭐ Good coverage, room for edge case tests
Documentation ⭐⭐⭐⭐⭐ Comprehensive and clear

Action Items (Priority Order)

  1. High: Fix CLI argument parsing to validate flag values
  2. Medium: Align empty array handling between TypeScript and Python
  3. Medium: Add entity description validation/sanitization
  4. Low: Add edge case tests
  5. Low: Document entity + URL whitelist interaction

Conclusion

This is a solid feature addition that follows the project's conventions and maintains consistency across platforms. The implementation is secure and performant. With the minor improvements suggested above, this PR will be excellent.

Great work! 🎉


Review generated on 2025-10-16

@homanp homanp merged commit 4f9c0d3 into main Oct 16, 2025
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: Add support for passing blacklist to redaction methods

2 participants