What this is
An internal R&D capability build that demonstrates how we approach applied vision-language work end to end. The user uploads a photo of a broken household or automotive item; the system identifies the make and model, retrieves real repair information from the open web, diagnoses the issue conversationally, and presents the fix as a visualization adapted to the user’s actual unit.
This is a concept-validation R&D system, not a shipped product, and not a delivered client engagement.
Why we built it
Applied vision-language reasoning is one of the most over-promised areas in AI right now. We wanted to validate, on a problem we cared about, that an agent could handle the long tail (items the system has never seen before) by orchestrating retrieval and reasoning per query rather than depending on a hand-authored library of canned guides.
What it does
- Identifies the make and model of an item from a single user photo, including small badges and worn product features.
- Retrieves real repair information from the open web, preferring manufacturer-authored documentation, established repair communities, and high-karma forums, and down-weighting AI-generated content as unreliable.
- Diagnoses the issue through a short conversational loop with the user.
- Produces visual guidance overlaid on the user’s actual photo, escalating to generative 3D only for problems where motion or internal disassembly requires it.
- Carries citations back to the source for any safety-relevant guidance, with no invented safety claims.
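
As a rough illustration of three of the behaviors above (source preference, the citation requirement for safety guidance, and escalation to 3D), here is a minimal Python sketch. It is not the system's actual code: the tier names, weight values, data classes, and function names are all hypothetical, chosen only to make the rules concrete.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical source tiers and weights; the values are illustrative assumptions.
SOURCE_WEIGHTS = {
    "manufacturer": 1.0,
    "repair_community": 0.8,
    "forum": 0.6,
    "ai_generated": 0.1,
}
DEFAULT_WEIGHT = 0.3  # assumed weight for sources of unknown type


@dataclass
class RetrievedDoc:
    url: str
    source_type: str
    relevance: float  # assumed 0..1 score from the retrieval step


@dataclass
class GuidanceStep:
    text: str
    safety_relevant: bool
    citation_url: Optional[str] = None


def rank_documents(docs: List[RetrievedDoc]) -> List[RetrievedDoc]:
    """Order documents by relevance scaled by the source-tier weight."""
    def score(doc: RetrievedDoc) -> float:
        return doc.relevance * SOURCE_WEIGHTS.get(doc.source_type, DEFAULT_WEIGHT)
    return sorted(docs, key=score, reverse=True)


def drop_uncited_safety_steps(steps: List[GuidanceStep]) -> List[GuidanceStep]:
    """Keep a safety-relevant step only if it carries a citation."""
    return [s for s in steps if not s.safety_relevant or s.citation_url]


def choose_visualization(needs_motion: bool, needs_internal_disassembly: bool) -> str:
    """Default to a photo overlay; escalate to generative 3D only when needed."""
    if needs_motion or needs_internal_disassembly:
        return "generative_3d"
    return "photo_overlay"
```

With these assumed weights, a highly relevant AI-generated page still ranks below a moderately relevant manufacturer manual, and a safety step without a source is dropped rather than shown.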
Built with
| Layer | Stack |
|---|---|
| Vision identification | Leading commercial vision-language model |
| Orchestration & dialogue | Leading frontier LLM |
| Web retrieval | Commercial web-retrieval APIs |
| Document parsing | Layout-aware document parsers |
| Depth & segmentation | Open-source depth and segmentation models |
| Generative 3D (escalation) | Leading generative-3D models |
| 3D rendering | Browser-native 3D rendering |
| Backend | Python + FastAPI |
| Frontend | Next.js + React |
What this means for you
If you have a workflow where a user provides a photo, scan, or video frame and you need an AI agent to identify, diagnose, and produce visual guidance (field service, claims adjusting, equipment maintenance, retail returns triage, anything in that family), we can build it. Vision-language reasoning is a capability we have validated end to end.
Want to discuss applied vision-language work in your domain? Contact us.