Phase 17: Multimodal image understanding via analyze_image tool
Dual-model approach (C): Qwen3-8B handles conversation, and Qwen2.5-VL-7B analyzes images on demand via the analyze_image LangChain tool.

- services/model/mlx_vision_model.py: MlxVisionModel (mlx-vlm wrapper, lazy load)
- services/agent/tools.py: make_vision_tool(vision_model, image_path)
- agent_service.py: stream_response(image_path=None), dynamic tool binding via config["image_path"] (thread-safe per-request rebinding)
- container.py: vision_model Singleton provider
- config.py: vision_enabled, vision_model_id, vision_max_tokens
- api.py: image_base64 in ChatRequest, decode to temp file, cleanup after stream

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
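The api.py step (decode image_base64 to a temp file, stream, then clean up) could look roughly like the sketch below. It is an illustration, not the code from this commit. The helpers `save_image_to_temp`, `fake_stream_response`, and `chat_stream` are hypothetical names, and the `.png` suffix is an assumption. The real endpoint would call `stream_response(image_path=...)` from agent_service.py instead of the stand-in.

```python
import base64
import os
import tempfile
from typing import Iterator, Optional


def save_image_to_temp(image_base64: str, suffix: str = ".png") -> str:
    """Decode a base64 payload, write it to a temp file, and return the path."""
    data = base64.b64decode(image_base64, validate=True)
    fd, path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path


def fake_stream_response(message: str, image_path: Optional[str] = None) -> Iterator[str]:
    """Hypothetical stand-in for agent_service.stream_response."""
    yield f"echo: {message}"
    if image_path is not None:
        yield f"image bytes: {os.path.getsize(image_path)}"


def chat_stream(message: str, image_base64: Optional[str] = None) -> Iterator[str]:
    """Stream a reply; the temp image lives only as long as the stream."""
    image_path = save_image_to_temp(image_base64) if image_base64 else None
    try:
        yield from fake_stream_response(message, image_path=image_path)
    finally:
        # Runs when the stream is exhausted, closed early, or raises.
        if image_path is not None and os.path.exists(image_path):
            os.remove(image_path)
```

Putting the cleanup in a `finally` inside the generator ties the file's lifetime to the stream itself, so a client that disconnects mid-stream still triggers removal once the generator is closed.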
@@ -59,6 +59,11 @@ class Config(BaseSettings):
     whisper_model_size: str = "small"
     tts_voice: str = "Yuna"  # Korean voice for the macOS `say` command
 
+    # Vision (Phase 17)
+    vision_enabled: bool = False
+    vision_model_id: str = "mlx-community/Qwen2.5-VL-7B-Instruct-4bit"
+    vision_max_tokens: int = 512
+
     system_prompt: str = """모든 사고 과정(thinking)과 답변은 반드시 한국어로만 작성하세요. 영어 사용 절대 금지.
 
 당신의 이름은 '율봇'입니다. 친절하고 따뜻한 한국어 상담 도우미입니다.
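As a rough illustration of how the three new settings might gate a lazily loaded vision model and a per-request tool, consider the sketch below. It is an assumption-laden outline, not the commit's implementation. `VisionSettings`, `LazyVisionModel`, `build_tools`, and the `loader` callable are hypothetical. `make_vision_tool` borrows its name from the commit message, but here it returns a plain closure rather than a LangChain tool, and no mlx-vlm call is made.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical model signature: (image_path, prompt, max_tokens) -> text
ModelFn = Callable[[str, str, int], str]


@dataclass(frozen=True)
class VisionSettings:
    # Mirrors the three fields added to Config in the hunk above.
    vision_enabled: bool = False
    vision_model_id: str = "mlx-community/Qwen2.5-VL-7B-Instruct-4bit"
    vision_max_tokens: int = 512


class LazyVisionModel:
    """Defers the expensive model load until the first analyze() call."""

    def __init__(self, settings: VisionSettings, loader: Callable[[str], ModelFn]):
        self._settings = settings
        self._loader = loader  # assumption: maps a model id to a callable model
        self._model: Optional[ModelFn] = None
        self.load_count = 0

    def analyze(self, image_path: str, prompt: str) -> str:
        if self._model is None:
            self._model = self._loader(self._settings.vision_model_id)
            self.load_count += 1
        return self._model(image_path, prompt, self._settings.vision_max_tokens)


def make_vision_tool(vision_model: LazyVisionModel, image_path: str) -> Callable[[str], str]:
    """Return a one-argument callable bound to this request's image."""
    def analyze_image(question: str) -> str:
        return vision_model.analyze(image_path, question)
    return analyze_image


def build_tools(settings: VisionSettings, vision_model: LazyVisionModel,
                image_path: Optional[str]) -> List[Callable[[str], str]]:
    """Expose the vision tool only when enabled and an image is attached."""
    if not settings.vision_enabled or image_path is None:
        return []
    return [make_vision_tool(vision_model, image_path)]
```

Because each request gets its own closure over `image_path`, concurrent requests never share mutable tool state, while the single `LazyVisionModel` instance pays the load cost at most once.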