Phase 17: Multimodal image understanding via analyze_image tool

Dual-model approach (C): Qwen3-8B handles conversation, Qwen2.5-VL-7B analyzes images on demand via the analyze_image LangChain tool.

- services/model/mlx_vision_model.py: MlxVisionModel (mlx-vlm wrapper, lazy load)
- services/agent/tools.py: make_vision_tool(vision_model, image_path)
- agent_service.py: stream_response(image_path=None), dynamic tool binding via config["image_path"] (thread-safe per-request rebinding)
- container.py: vision_model Singleton provider
- config.py: vision_enabled, vision_model_id, vision_max_tokens
- api.py: image_base64 in ChatRequest, decode to temp file, cleanup after stream

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-02 13:52:10 +09:00
parent bdb6fd83c4
commit 68f741af72
8 changed files with 355 additions and 18 deletions
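
Reviewer note: the commit message describes three cooperating pieces (a lazily loaded vision model, a per-request tool factory, and temp-file cleanup in the API layer). The sketch below shows one way these could fit together, using only the standard library. It is not the code from this commit: LazyVisionModel, handle_request, and fake_loader are hypothetical names, the body of make_vision_tool is assumed (only its signature appears in the message), and the real MlxVisionModel presumably wraps mlx-vlm rather than an injected loader.

```python
import base64
import os
import tempfile
from typing import Callable, Optional, Tuple

# Assumed backend shape: (image_path, question, max_tokens) -> answer text.
Backend = Callable[[str, str, int], str]


class LazyVisionModel:
    """Hypothetical stand-in for MlxVisionModel: loads weights on first use."""

    def __init__(self, model_id: str, max_tokens: int,
                 loader: Callable[[str], Backend]) -> None:
        self.model_id = model_id
        self.max_tokens = max_tokens
        self._loader = loader
        self._backend: Optional[Backend] = None
        self.load_count = 0  # exposed only so the lazy load is observable

    def analyze(self, image_path: str, question: str) -> str:
        # Load on the first call; a threaded server would also need a lock here.
        if self._backend is None:
            self._backend = self._loader(self.model_id)
            self.load_count += 1
        return self._backend(image_path, question, self.max_tokens)


def make_vision_tool(vision_model: LazyVisionModel,
                     image_path: str) -> Callable[[str], str]:
    """Build a tool bound to one request's image (assumed implementation)."""

    def analyze_image(question: str) -> str:
        # image_path is captured per call, so concurrent requests do not share it.
        return vision_model.analyze(image_path, question)

    return analyze_image


def handle_request(vision_model: LazyVisionModel, image_base64: str,
                   question: str) -> Tuple[str, str]:
    """Decode the upload to a temp file, run the tool, then always clean up."""
    fd, path = tempfile.mkstemp(suffix=".png")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(base64.b64decode(image_base64))
        tool = make_vision_tool(vision_model, path)
        return tool(question), path
    finally:
        os.remove(path)


def fake_loader(model_id: str) -> Backend:
    """Hypothetical loader that reports the file size instead of running a VLM."""

    def backend(image_path: str, question: str, max_tokens: int) -> str:
        size = os.path.getsize(image_path)
        return f"{model_id}: {size} bytes, q={question!r}"[:max_tokens]

    return backend
```

Building a fresh closure per request keeps the image path out of shared state, which is one plausible reading of the "thread-safe per-request rebinding" line; the actual commit binds the tool through config["image_path"] inside AgentService.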
@@ -19,6 +19,7 @@ from services.rag.ingestion_service import IngestionService
 from services.rag.rerank_service import RerankService
 from services.rag.retriever_service import RetrieverService
 from services.agent.agent_service import AgentService
+from services.model.mlx_vision_model import MlxVisionModel


 class Container(containers.DeclarativeContainer):
@@ -130,6 +131,13 @@ class Container(containers.DeclarativeContainer):
        sparse_embeddings=sparse_embeddings,
    )

+    # Phase 17 — Vision Model (lazy load)
+    vision_model = providers.Singleton(
+        MlxVisionModel,
+        model_id=providers.Callable(lambda c: c.vision_model_id, config),
+        max_tokens=providers.Callable(lambda c: c.vision_max_tokens, config),
+    )
+
    # Phase 3 — LangGraph Agent
    agent_service = providers.Singleton(
        AgentService,