chrisbraddock commented Jun 28, 2025

Note: this was vibed. I put enough effort into it that it's working locally for me, and that's as much as I can do at the moment.

I'm submitting it in case it's useful to you, but don't treat it as merge ready.

On the plus side, Claude Code collaborating with Gemini on actual design (via screenshots of running code from Playwright MCP) is working out pretty nicely at the moment.


This commit introduces image analysis capabilities to `consult7`, enabling it to process and send image files to compatible multimodal models, initially targeting Google Gemini.

Key Features & Changes:

*   **Multimodal Content Handling:**
    *   Added an `--include-images` command-line flag to enable the processing of image files.
    *   `file_processor.py` now differentiates between text and image files based on common extensions (PNG, JPG, GIF, WebP, BMP, SVG).
    *   Image files are read as bytes and base64-encoded.
    *   `format_content` now structures image parts as `{"inline_data": {"mime_type": ..., "data": ...}}` to comply with Google Gemini API expectations.
    *   File size calculations account for base64 encoding overhead.
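
A minimal sketch of the image handling described in the bullets above. The helper names (`is_image_file`, `image_to_part`, `encoded_size`) are illustrative, not the actual functions in `file_processor.py`:

```python
import base64
import mimetypes
from pathlib import Path

# Extensions treated as image files rather than text (per the list above).
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".bmp", ".svg"}


def is_image_file(path: Path) -> bool:
    """Return True if the file should be sent as an image part."""
    return path.suffix.lower() in IMAGE_EXTENSIONS


def image_to_part(path: Path) -> dict:
    """Read an image, base64-encode it, and wrap it in the Gemini inline_data shape."""
    raw = path.read_bytes()
    mime_type = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    return {
        "inline_data": {
            "mime_type": mime_type,
            "data": base64.b64encode(raw).decode("ascii"),
        }
    }


def encoded_size(path: Path) -> int:
    """Approximate payload size after base64 encoding (4 output bytes per 3 input bytes)."""
    return 4 * ((path.stat().st_size + 2) // 3)
```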

*   **Provider-Specific Logic:**
    *   Introduced a `supports_images` flag in the `model_info` dictionary (set in `consultation.py`'s `get_model_context_info`) to determine if a model/provider combination can handle multimodal input.
    *   `consultation_impl` uses this flag along with the `--include-images` CLI flag to decide whether to send structured multimodal `content_parts` or a concatenated text string to the provider.
    *   `GoogleProvider` (`providers/google.py`) was updated to:
        *   Accept `List[Dict[str, Any]]` (multimodal parts) as input.
        *   Correctly assemble the `contents` list for the Gemini API, including properly formatted `inline_data` parts for images.
        *   Use `config=` instead of `generation_config=` in the `generate_content` API call.
        *   Estimate image token costs (currently a fixed 258 tokens per image based on Gemini documentation).
    *   Text-only providers (OpenAI, OpenRouter) continue to receive concatenated text strings. Warnings are logged if image processing is attempted with them.
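
A sketch of how `GoogleProvider` might assemble those parts for the Gemini call, assuming the newer `google-genai` SDK (which takes `config=` rather than `generation_config=`). The function name, model name, and the assumption that the SDK accepts REST-style part dicts are mine, not necessarily the PR's exact code:

```python
from typing import Any, Dict, List

from google import genai
from google.genai import types


def consult_gemini(api_key: str, content_parts: List[Dict[str, Any]], question: str,
                   model: str = "gemini-2.0-flash") -> str:
    """Send pre-formatted text/inline_data parts plus the question to Gemini."""
    client = genai.Client(api_key=api_key)

    # content_parts is the mixed list built by format_content: text parts as
    # {"text": ...} and image parts as {"inline_data": {"mime_type": ..., "data": ...}}.
    contents = [{"role": "user", "parts": [*content_parts, {"text": question}]}]

    response = client.models.generate_content(
        model=model,
        contents=contents,
        config=types.GenerateContentConfig(temperature=0.2),  # config=, not generation_config=
    )
    return response.text
```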

*   **Bug Fixes & Robustness:**
    *   Resolved issues with MCP tool parameter passing (`consultation_impl` argument mismatches) by consistently using keyword arguments for optional and server-provided parameters in `server.py`.
    *   Worked around an MCP tool registration issue by temporarily simplifying `list_tools` in `server.py`; incremental restoration of `list_tools` is planned but was deferred once the core vision functionality was confirmed working.
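
For context, the parameter-passing fix boils down to the tool handler always calling the implementation with keyword arguments; the handler and parameter names below are hypothetical, not `consultation_impl`'s real signature:

```python
# Hypothetical MCP tool handler in server.py; parameter names are illustrative only.
async def handle_consultation(arguments: dict) -> str:
    return await consultation_impl(
        path=arguments["path"],
        pattern=arguments["pattern"],
        query=arguments["query"],
        # Optional and server-provided values go by keyword, so adding or
        # reordering parameters cannot silently shift positional arguments.
        include_images=arguments.get("include_images", False),
    )
```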

*   **Token Handling & Utilities:**
    *   Added `estimate_image_tokens` to `token_utils.py`.
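
A rough sketch of what `estimate_image_tokens` could look like under the fixed-cost assumption above (the actual signature in `token_utils.py` may differ):

```python
# Gemini's documentation charges a flat 258 tokens per image (ignoring tiling
# of very large images), so the estimate is simply a multiple of the count.
GEMINI_TOKENS_PER_IMAGE = 258


def estimate_image_tokens(image_count: int) -> int:
    """Rough token estimate for a batch of images sent to Gemini."""
    return image_count * GEMINI_TOKENS_PER_IMAGE
```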

*   **Documentation:**
    *   `README.md` updated to include the `--include-images` flag, image analysis capabilities for Gemini, supported formats, token usage, and example use cases.

Together, these changes let `consult7` use Gemini's vision capabilities for image analysis alongside text and code, while remaining compatible with the existing text-only providers.