Agentic vision-language repair guidance — an R&D capability build

What this is

An internal R&D capability build that demonstrates how we approach applied vision-language work end to end. The user uploads a photo of a broken household or automotive item; the system identifies the make and model, retrieves real repair information from the open web, diagnoses the issue conversationally, and presents the fix as a visualization adapted to the user’s actual unit.

This is a concept-validation R&D system — not a shipped product, and not a delivered client engagement.

Why we built it

Applied vision-language reasoning is one of the most over-promised areas in AI right now. We wanted to validate, on a problem we cared about, that an agent can handle the long tail (items the system has never seen before) by orchestrating retrieval and reasoning per query rather than depending on a hand-authored library of canned guides.

What it does

Identifies the make and model of an item from a single user photo, including small badges and worn product features.
Retrieves real repair information from the open web, preferring manufacturer-authored documentation, established repair communities, and high-karma forums, and down-weighting AI-generated content as unreliable.
Diagnoses the issue through a short conversational loop with the user.
Produces visual guidance overlaid on the user’s actual photo, escalating to generative 3D only for problems where motion or internal disassembly requires it.
Carries citations back to source for any safety-relevant guidance — no invented safety claims.
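
To make the source preference and the citation rule above concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the system's actual implementation: the category names, the weights, the penalty for suspected AI-generated pages, and the helpers `rank_sources` and `drop_uncited_safety_steps` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical source categories and weights. The numbers are illustrative
# assumptions, not values taken from the system described above.
SOURCE_WEIGHTS = {
    "manufacturer": 1.0,
    "repair_community": 0.8,
    "forum": 0.6,
    "unknown": 0.3,
}
# Assumed multiplier applied to pages flagged as likely AI-generated.
AI_GENERATED_PENALTY = 0.25


@dataclass(frozen=True)
class Source:
    url: str
    kind: str                  # expected to be one of the SOURCE_WEIGHTS keys
    relevance: float           # 0..1, assumed to come from the retrieval step
    ai_generated: bool = False


@dataclass(frozen=True)
class Step:
    text: str
    safety_relevant: bool = False
    citation: Optional[str] = None  # URL of the source backing this step


def score(source: Source) -> float:
    """Combine retrieval relevance with a trust weight for the source kind."""
    weight = SOURCE_WEIGHTS.get(source.kind, SOURCE_WEIGHTS["unknown"])
    if source.ai_generated:
        weight *= AI_GENERATED_PENALTY
    return source.relevance * weight


def rank_sources(sources: list[Source]) -> list[Source]:
    """Return sources ordered from most to least trusted and relevant."""
    return sorted(sources, key=score, reverse=True)


def drop_uncited_safety_steps(steps: list[Step]) -> list[Step]:
    """Keep a safety-relevant step only if it carries a citation."""
    return [s for s in steps if not s.safety_relevant or s.citation]
```

With these assumed weights, a manufacturer manual with relevance 0.7 outranks an unclassified blog post with relevance 0.9, and a safety-relevant step without a citation is dropped before anything is shown to the user. Down-weighting rather than excluding suspected AI-generated pages is a design choice in this sketch: they can still surface when nothing better exists.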

Built with

Layer	Stack
Vision identification	Leading commercial vision-language model
Orchestration & dialogue	Leading frontier LLM
Web retrieval	Commercial web-retrieval APIs
Document parsing	Layout-aware document parsers
Depth & segmentation	Open-source depth and segmentation models
Generative 3D (escalation)	Leading generative-3D models
3D rendering	Browser-native 3D rendering
Backend	Python + FastAPI
Frontend	Next.js + React

What this means for you

If you have a workflow where a user provides a photo (or an image, scan, or video frame) and you need an AI agent to identify, diagnose, and produce visual guidance — field service, claims adjusting, equipment maintenance, retail returns triage, anything in that family — we can build it. Vision-language reasoning is a capability we have validated end to end.

Want to discuss applied vision-language work in your domain? Contact us.