这是indexloc提供的服务,不要输入任何密码
Skip to content

Conversation

@homanp
Copy link
Collaborator

@homanp homanp commented Oct 22, 2025

Description

This pull request introduces a breaking change to the Superagent SDKs and CLI, focusing on a cleaner, more intuitive API for security analysis and redaction, and adding support for PDF file analysis. The most significant updates are to the method signatures for guard() and redact(), which now accept either a string or a file as the first parameter. The CLI and documentation are updated accordingly to reflect these changes and to support analyzing PDF files for security threats.

SDK and API Changes:

  • Breaking change: The guard() and redact() methods in both Python and TypeScript SDKs now accept a single input parameter, which can be either a string or a file (e.g., a PDF). This replaces the previous pattern of passing both a string and a file separately, simplifying the API and making it more intuitive.
  • Added PDF support: The guard() method now supports analyzing PDF files for security threats, extracting and analyzing the text content of the PDF and returning a JSON analysis.

CLI Improvements:

  • New --file flag for guard: The CLI now supports a --file flag for the guard command, enabling users to analyze PDF files for security threats directly from the command line. The implementation reads the file, creates a Blob, and passes it to the SDK. [1] [2] [3] [4]
  • CLI and docs updated: Documentation and help messages are updated to reflect the new usage patterns and PDF support. [1] [2] [3]

Documentation Updates:

  • SDK docs rewritten: The Python SDK documentation is updated to show the new method signatures, union types, and usage examples for both text and PDF inputs. TypeTable components are used for clearer type information. Examples now demonstrate passing file objects directly and handling PDF analysis results. [1] [2] [3] [4] [5]
  • Changelog and package versions bumped: Changelog documents the breaking changes and additions; package versions are incremented and dependencies updated. [1] [2] [3] [4] [5]

Summary of Most Important Changes:

API & SDK Breaking Changes:

  • Refactored guard() and redact() methods in Python and TypeScript SDKs to accept a single input parameter (string or file), simplifying the interface and enabling PDF support.

New Features:

  • Added PDF file analysis support to the guard() method in SDKs and CLI, extracting and analyzing text from PDFs for security threats. [1] [2] [3]

CLI Improvements:

  • Introduced --file flag to the guard command, allowing users to analyze PDF files from the command line. [1] [2] [3]
  • Updated CLI documentation and examples to show new usage patterns for PDF file analysis. [1] [2] [3]

Documentation Enhancements:

  • Rewrote SDK documentation with new method signatures, union types, and usage examples for both text and PDF inputs, using TypeTable components for clarity. [1] [2] [3] [4] [5]

Maintenance:

  • Updated changelog, bumped package versions, and updated dependencies to reflect the new release. [1] [2] [3] [4] [5]

Related Issue

Fixes #1061

Checklist

  • I tested my changes
  • I reviewed my own code

@homanp homanp self-assigned this Oct 22, 2025
@vercel
Copy link

vercel bot commented Oct 22, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
docs Ready Ready Preview Comment Oct 22, 2025 0:31am

@claude
Copy link

claude bot commented Oct 22, 2025

Pull Request Review

This PR introduces PDF support to the Guard SDK and refactors the API to accept a unified input parameter. Overall, this is a well-executed feature addition with good documentation.

Code Quality and Best Practices

Strengths:

  • Consistent API Design: The refactoring creates a cleaner, more intuitive API
  • Good Type Safety: Proper use of TypedDict, Union types, and dataclasses in Python; strong TypeScript interfaces
  • Error Handling: Comprehensive error handling with custom GuardError class
  • Documentation: Excellent documentation updates with clear examples and migration guides
  • Code Reusability: Good use of helper functions

Recommendations:

  1. File Type Detection (Python SDK, client.py:168-169, 243-244): The logic could incorrectly classify objects. Consider using AND instead of OR in the hasattr check.
  2. Empty Text Field: Both SDKs send text="" when a file is provided. Consider if omitting would be clearer.

Potential Bugs and Issues

Critical:

  1. CLI Guard Command (guard.ts:16): Help example shows passing both --file AND text, but implementation only uses one. Fix the help text.
  2. Missing File Validation: Neither SDK validates files are PDFs before sending.

Moderate:
3. URL Whitelist Only Applied to Text (client.py:332): Silently ignored for file inputs.
4. Inconsistent Prompt Variable: When using --file, prompt is created but never used.

Performance Considerations

Good: Async/await used correctly, proper timeout handling
Suggestions: Add file size limits, improve memory management in redact.ts:149-150, consider longer timeout for PDFs

Security Concerns

Good: Proper auth, input validation, no credential leakage, defensive security tool
Recommendations: Validate file paths, sanitize output filenames, validate file magic numbers

Test Coverage - CRITICAL ISSUE

Missing Tests:

  1. No tests for PDF file handling - THE MAIN FEATURE!
  2. No tests for file type detection logic
  3. No tests for error cases
  4. No CLI tests for --file flag

Summary

Overall: Solid feature addition with good code quality.

MUST FIX:

  • Inconsistent CLI help text
  • Add test coverage for PDF file handling

SHOULD FIX:

  • Improve file type detection logic
  • Add file size/type validation

Recommendation: Request changes for test coverage and CLI documentation before merging.

Action Items

  • Fix CLI help text for --file flag usage
  • Add integration tests for PDF file handling in both SDKs
  • Improve file type detection logic in Python SDK
  • Add file validation (size, type checking)
  • Consider adding CLI tests
  • Fix PR title typo: suppor to support

Great work on this feature! The API design improvement makes the SDK much more intuitive.

@claude
Copy link

claude bot commented Oct 22, 2025

Pull Request Review: PDF Support for Guard SDK + Refactoring

Summary

This PR introduces a breaking change that simplifies the API by accepting a unified input parameter (string or file) for both guard() and redact() methods, while adding PDF file analysis support. Overall, this is a well-executed refactoring with comprehensive documentation updates.

✅ Code Quality & Best Practices

Strengths

  1. API Design Improvements - The new unified input parameter is a significant improvement
  2. Type safety well-maintained with union types
  3. Clean separation between file and text handling logic
  4. Excellent changelog with clear migration examples

Areas for Improvement

  1. Code Duplication in CLI Commands - File reading logic is duplicated between guard.ts and redact.ts
  2. Python File Type Detection could produce false positives
  3. Inconsistent CLI help text showing unused parameters

🐛 Critical Issues

1. URL Whitelist Only Applied Client-Side (sdk/python/src/superagent_ai/client.py:332)
The URL whitelist is applied client-side using regex. The API receives unfiltered text and processes all URLs. Token usage is not optimized, and the feature doesn't work for file inputs.

2. Missing Prompt Parameter in Guard File Mode (cli/src/commands/guard.ts:76-98)
When --file is provided, the prompt parameter is ignored, creating user confusion.

3. Path Traversal Risk in CLI (cli/src/commands/redact.ts:141-144)
Output paths not validated, could write to unexpected locations.

Medium Priority Issues

  1. Inconsistent Error Exit Codes
  2. Missing File Validation - assumes all files are PDFs
  3. Typo in PR Title: suppor should be support

⚡ Performance Considerations

Good: HTTP client reuse, streaming support, timeout configuration

Recommendations:

  1. Add file size limits before uploading
  2. Default 10s timeout may be too short for large PDF processing

🔒 Security Concerns

Good: Proper authorization headers, no credential leakage, input validation

Issues:

  1. Path traversal risk in CLI output paths
  2. Missing content-type validation
  3. Regex DoS risk in URL pattern
  4. No rate limiting / retry logic for HTTP 429

🧪 Test Coverage - CRITICAL ⚠️

Critical Missing Tests

NO FILE INPUT TESTS - Neither Python nor TypeScript tests include PDF file upload scenarios. THIS IS THE MAIN FEATURE BEING ADDED!

Also missing:

  • Redact PDF format tests
  • Error handling tests (invalid keys, network failures, timeouts)
  • CLI tests (zero coverage)
  • Edge case tests (empty files, large files, corrupted PDFs)

🎯 Verdict: ✅ Approve with Recommendations

Well-designed refactoring with excellent documentation. However, critical test coverage gaps must be addressed.

Must Fix Before Merge:

  1. ❗ Add file input tests - main feature has NO test coverage
  2. ❗ Fix URL whitelist implementation or document limitation
  3. ❗ Fix CLI file path security
  4. ❗ Clarify prompt parameter behavior with --file

Should Fix:

  1. Add file size validation
  2. Standardize exit codes
  3. Add file type validation
  4. Add error handling tests

Nice to Have:

  1. Extract duplicated file reading logic
  2. Improve file type detection
  3. Add ReDoS protection
  4. Add rate limiting / retry logic

Documentation:

  • Document PDF file size limits
  • Document URL whitelist is client-side only
  • Add migration guide
  • Document CLI exit codes

Test Plan

Before merging, verify:

  • Upload PDF to guard() - returns analysis
  • Upload PDF to redact() with format=pdf - returns PDF bytes
  • Pass empty string - error handling works
  • Test CLI --file flag
  • Test URL whitelist with file inputs

Great work on the refactoring! Please address test coverage and security issues before merging.

@claude
Copy link

claude bot commented Oct 22, 2025

Code Review: PDF Support for Guard SDK + Minor Refactoring

Summary

This PR introduces a significant breaking change to simplify the API by accepting a single input parameter (string OR file) instead of separate text and file parameters, and adds PDF support to the guard() method. Overall, the changes are well-implemented with good documentation updates.


✅ Code Quality & Best Practices

Strengths

  • Clean API design: The refactored union type approach (input: str | File) is more intuitive than the previous pattern
  • Comprehensive documentation: Excellent updates to SDK docs with TypeTable components showing type information
  • Consistent implementation: Both Python and TypeScript SDKs follow the same patterns
  • Good error handling: Proper validation and error messages throughout
  • Changelog maintained: Clear documentation of breaking changes with migration examples

Areas for Improvement

  1. Type Detection Logic Inconsistency (Python SDK, client.py:168)

    is_file = hasattr(input, 'read') or not isinstance(input, str)

    Issue: This logic is too permissive. not isinstance(input, str) will match ANY non-string type (integers, lists, etc.), not just files.

    Recommendation:

    is_file = hasattr(input, 'read') or (not isinstance(input, str) and hasattr(input, 'name'))
  2. TypeScript File Detection Complexity (index.ts:220-225)

    const isFile = typeof input !== 'string' && (
      (typeof File !== 'undefined' && input instanceof File) ||
      (typeof Blob !== 'undefined' && input instanceof Blob) ||
      (input && typeof input === 'object' && 'stream' in input)
    );

    Issue: Checking for 'stream' in input is fragile and could match unintended objects.

    Recommendation: Add more specific checks or use a type guard function for clarity.


🐛 Potential Bugs

Critical

  1. CLI Guard Command - Prompt Ignored When File Provided (cli/src/commands/guard.ts:98)

    const input = file || prompt;

    Issue: When --file is provided, the prompt argument is completely ignored. However, line 16 shows:

    console.log('  superagent guard --file document.pdf "Analyze this document"');

    This suggests the prompt should still be used for context, but it's discarded.

    Impact: User confusion - the example shows passing a prompt with --file, but it will be silently ignored.

    Recommendation: Either:

    • Remove the prompt from the help example if it's not used
    • Pass the prompt as context to the API if supported
    • Show a warning if both are provided
  2. URL Whitelist Applied to File Input (client.py:332-333, index.ts:444-446)
    Both SDKs skip URL whitelist for file inputs, but the CLI doesn't prevent users from trying:

    superagent redact --file doc.pdf --url-whitelist https://example.com

    This will silently be ignored.

    Recommendation: Add validation in CLI to warn users that --url-whitelist is incompatible with --file.

Minor

  1. File Path Validation Missing (cli/src/commands/guard.ts:34)

    const fileBuffer = readFileSync(filePath);

    Issue: No check if filePath is actually a file (not a directory) before attempting to read.

    Recommendation: Add existsSync() and statSync().isFile() checks with clear error messages.

  2. Hardcoded PDF Content Type (guard.ts:35, redact.ts:50)

    file = new Blob([fileBuffer], { type: 'application/pdf' });

    Issue: Always sets content type to PDF, but no validation that the file is actually a PDF.

    Recommendation: Either detect MIME type from file extension/content, or validate it's actually a PDF.


⚡ Performance Considerations

Good Practices

  • Using readFileSync in CLI is appropriate for single-file operations
  • Proper cleanup with timeout handling in TypeScript SDK

Potential Issues

  1. Memory Usage for Large PDFs
    Both CLIs load entire files into memory:

    const fileBuffer = readFileSync(filePath); // Entire file in memory

    Recommendation: Consider streaming for large files or add file size limit validation with helpful error messages.

  2. No Timeout Configuration in CLI
    The SDK clients have configurable timeouts, but the CLI uses defaults. Large PDF files might timeout.

    Recommendation: Add optional --timeout flag for CLI commands.


🔒 Security Concerns

Good Security Practices

  • ✅ API key validation before making requests
  • ✅ Proper authorization headers
  • ✅ Input validation for empty strings
  • ✅ Error messages don't leak sensitive information

Concerns

  1. Path Traversal Risk (guard.ts:31-34, redact.ts:46-49)

    const filePath = args[fileFlagIndex + 1];
    const fileBuffer = readFileSync(filePath);

    Issue: User-provided file paths are read without validation. No protection against path traversal attacks like:

    superagent guard --file ../../../etc/passwd

    Severity: Medium (CLI is typically user-owned, but still a concern for automation)

    Recommendation:

    • Resolve and validate paths to ensure they're within expected directories
    • Consider adding a --allow-any-path flag with clear warnings for automation use cases
    import { resolve, normalize } from 'path';
    const resolvedPath = resolve(normalize(filePath));
    // Add validation logic here
  2. No File Extension Validation
    The code accepts any file but always treats it as a PDF. Malicious users could provide executable files or other formats.

    Recommendation: Validate file extensions against an allowlist (e.g., .pdf) before processing.

  3. API Key in Environment Variable
    While standard practice, consider documenting secure key management best practices in the security section of the docs.


🧪 Test Coverage

Excellent Coverage ✅

  • Both Python and TypeScript have comprehensive tests for:
    • Basic guard pass/block scenarios
    • Callback invocations (onPass/onBlock)
    • Redact with URL whitelist
    • Redact with entities
    • Multiple whitelist entries
    • Edge cases with JSON input

Missing Test Coverage ❌

  1. No PDF File Tests
    Critical: This PR adds PDF support, but there are NO tests for:

    • guard() with PDF files
    • redact() with PDF files
    • Format parameter with PDFs
    • CLI --file flag functionality

    Recommendation: Add integration tests with sample PDFs:

    async def test_guard_with_pdf_file():
        with open("test_data/sample.pdf", "rb") as pdf_file:
            result = await client.guard(pdf_file)
            assert result.rejected is not None
            assert result.reasoning is not None
  2. No CLI Tests
    The CLI commands are completely untested.

    Recommendation: Add CLI integration tests using subprocess or similar.

  3. No Error Handling Tests
    Missing tests for:

    • Invalid file paths
    • Malformed API responses
    • Network errors
    • Timeout scenarios
    • Invalid API keys
  4. No Type Validation Tests
    What happens if you pass an integer, null, or other invalid type as input?


📋 Additional Recommendations

Breaking Change Communication

  1. Consider adding a deprecation notice in v0.0.13 before making the breaking change, if possible
  2. Add migration guide link to the changelog
  3. Consider semantic versioning bump to 1.0.0 since this is a breaking change

Documentation

  1. Python SDK Doc (python-sdk.mdx:208): Consider adding a troubleshooting section for common file handling errors
  2. Add examples of error handling in both SDK docs
  3. Document maximum file size limits (if any)

Code Organization

  1. Consider extracting file validation logic into a shared utility function
  2. The applyUrlWhitelist function could be shared between Python and TypeScript implementations

Type Safety

  1. TypeScript: Consider using type guards for file detection
    function isFileInput(input: unknown): input is File | Blob {
      return (typeof File !== 'undefined' && input instanceof File) ||
             (typeof Blob !== 'undefined' && input instanceof Blob);
    }

🎯 Priority Action Items

Before Merge (High Priority)

  1. ✅ Add PDF file tests for both SDKs
  2. ✅ Fix prompt being ignored in CLI guard command (or update docs)
  3. ✅ Add path traversal protection in CLI
  4. ✅ Improve file type detection logic in Python SDK

Should Address (Medium Priority)

  1. Add file size validation/limits
  2. Add file extension validation
  3. Add CLI integration tests
  4. Add error handling tests

Nice to Have (Low Priority)

  1. Add timeout configuration to CLI
  2. Extract shared validation utilities
  3. Add TypeScript type guards
  4. Document security best practices

🏆 Overall Assessment

Code Quality: 7/10

  • Well-structured, clean API design
  • Good documentation
  • Some edge cases and validation gaps

Test Coverage: 5/10

  • Excellent coverage for existing text features
  • Critical gap: No tests for the main feature (PDF support)

Security: 6/10

  • Good authentication handling
  • Path traversal risk in CLI needs addressing
  • File validation missing

Recommendation: Request changes - Address the missing PDF tests and path traversal concerns before merging. The breaking change is well-documented and the API design is solid, but the lack of tests for the primary new feature is concerning.


Great work on the API redesign! The union type approach is much cleaner than the previous implementation. Once the tests are added and security concerns are addressed, this will be a solid improvement to the SDK. 🚀

@homanp homanp merged commit 5ed46fd into main Oct 22, 2025
5 checks passed
@claude
Copy link

claude bot commented Oct 22, 2025

Pull Request Review - PR 1062

Summary

This PR introduces a breaking change to simplify the API for guard() and redact() methods by accepting a single input parameter (either string or file), and adds PDF support to the guard() method. Overall, this is a well-structured refactoring with comprehensive documentation updates.


Strengths

API Design

  • Improved ergonomics: The new unified input parameter simplifies the API significantly
  • Consistent pattern: Both guard() and redact() now follow the same pattern
  • Clear documentation: Excellent changelog with migration examples

Code Quality

  • Type safety: Good use of union types with proper runtime type detection
  • Error handling: Proper validation and error messages
  • Cross-platform compatibility: Handles both browser and Node.js environments well

Issues and Concerns

1. Critical: Missing Test Coverage for PDF File Guard Analysis

Severity: High

The main feature addition (PDF support for guard()) has NO test coverage. Neither sdk/typescript/tests/guard.test.ts nor sdk/python/tests/test_guard.py include tests for the new PDF file guard functionality.

Recommendation: Add tests for PDF file guard analysis before merging.

2. Security: Type Detection Relies on Duck Typing

Severity: Medium

Python SDK (client.py:168): is_file = hasattr(input, read) or not isinstance(input, str)
This is too permissive. Any object with a read attribute will be treated as a file.

TypeScript SDK (index.ts:220-225): The check for stream in input is also too broad.

Recommendation: Use more specific type checking to prevent runtime errors.

3. Bug: URL Whitelist Applied to Wrong Input

Severity: Medium

In both SDKs, the applyUrlWhitelist function is called AFTER the text has been redacted by the API. This logic is confusing and potentially buggy.

4. Performance: No File Size Validation

Severity: Low

Neither SDK validates the size of uploaded files. Large PDFs could cause timeouts or memory issues.

5. CLI: Hardcoded PDF MIME Type

Severity: Low

In cli/src/commands/guard.ts:35, MIME type is hardcoded with no validation.

6. Documentation: Typo in PR Title

Severity: Trivial

PR title has typos: suppor should be support, skd should be sdk


Recommendations Summary

Must Fix (Before Merge)

  1. Add test coverage for PDF file guard functionality
  2. Fix URL whitelist logic
  3. Improve type detection

Should Fix

  1. Add file size validation
  2. Document or remove empty text field requirement
  3. Add CLI file type validation
  4. Fix typo in PR title

Overall Assessment

  • Code Quality: 7/10 - Good structure but missing critical tests
  • Security: 7/10 - No major vulnerabilities but type detection could be more robust
  • Documentation: 9/10 - Excellent and comprehensive
  • Test Coverage: 4/10 - Missing tests for the main feature

Recommendation: Request changes. The PR is well-structured and the API improvements are excellent, but the missing test coverage for the main feature (PDF guard analysis) is a blocker.

Great work on the API simplification and comprehensive documentation! Once the test coverage is added and the issues above are addressed, this will be a solid addition to the codebase.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: Add support for file upload to guard sdks and cli

2 participants