## What this is
An internal R&D capability build that demonstrates how we approach applied vision-language work end to end. The user uploads a photo of a broken household or automotive item; the system identifies the make and model, retrieves real repair information from the open web, diagnoses the issue conversationally, and presents the fix as a visualization adapted to the user’s actual unit.
This is a concept-validation R&D system — not a shipped product, and not a delivered client engagement.
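The end-to-end flow described above (identify → retrieve → diagnose → visualize) can be sketched as a simple orchestration pipeline. This is an illustrative sketch only: every function name and placeholder return value below is hypothetical, standing in for the real model and API calls.

```python
from dataclasses import dataclass

@dataclass
class Identification:
    make: str
    model: str

# Hypothetical stage functions; in the real system each wraps a model
# or API call (vision identification, web retrieval, LLM dialogue,
# photo-overlay rendering). Bodies here are placeholders.
def identify(photo: bytes) -> Identification:
    return Identification(make="AcmeCorp", model="X-100")  # placeholder

def retrieve_repair_docs(ident: Identification) -> list[str]:
    return [f"service manual for {ident.make} {ident.model}"]  # placeholder

def diagnose(ident: Identification, docs: list[str], chat: list[str]) -> str:
    return "worn drive belt"  # placeholder diagnosis

def render_guidance(photo: bytes, diagnosis: str) -> dict:
    # Overlay guidance on the user's own photo; escalate to
    # generative 3D only when the fix calls for it.
    return {"diagnosis": diagnosis, "overlay": "annotated_photo.png"}

def handle_upload(photo: bytes, chat: list[str]) -> dict:
    ident = identify(photo)
    docs = retrieve_repair_docs(ident)
    return render_guidance(photo, diagnose(ident, docs, chat))
```

The point of the sketch is the shape of the orchestration: each query runs the full identify–retrieve–diagnose–render loop rather than looking up a canned guide.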
## Why we built it
Applied vision-language reasoning is one of the most over-promised areas in AI right now. We wanted to validate, on a problem we cared about, that the agent could handle the long tail — items the system has never seen before — by orchestrating retrieval and reasoning per query, rather than depending on a hand-authored library of canned guides.
## What it does
- Identifies the make and model of an item from a single user photo, including small badges and worn or partially legible markings.
- Retrieves real repair information from the open web — preferring manufacturer-authored documentation, established repair communities, and high-karma forums; down-weighting AI-generated content as unreliable.
- Diagnoses the issue through a short conversational loop with the user.
- Produces visual guidance overlaid on the user’s actual photo, escalating to generative 3D only when motion or internal disassembly requires it.
- Carries citations back to source for any safety-relevant guidance — no invented safety claims.
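The retrieval preference above (manufacturer docs first, established communities next, AI-generated content down-weighted) can be expressed as a simple trust-weighted ranking. The weights, source-type labels, and URLs below are illustrative assumptions, not the system's actual values.

```python
from dataclasses import dataclass

# Hypothetical trust weights mirroring the stated retrieval policy:
# manufacturer-authored docs preferred, AI-generated content down-weighted.
SOURCE_WEIGHTS = {
    "manufacturer": 1.0,
    "repair_community": 0.8,
    "forum": 0.6,
    "ai_generated": 0.1,
}

@dataclass
class RepairDoc:
    url: str
    source_type: str
    relevance: float  # 0..1 raw relevance from the retrieval API

def rank_sources(docs: list[RepairDoc]) -> list[RepairDoc]:
    """Order retrieved documents by relevance scaled by source trust."""
    return sorted(
        docs,
        key=lambda d: d.relevance * SOURCE_WEIGHTS.get(d.source_type, 0.3),
        reverse=True,
    )

docs = [
    RepairDoc("https://example.com/forum-thread", "forum", 0.9),
    RepairDoc("https://example.com/service-manual", "manufacturer", 0.7),
    RepairDoc("https://example.com/ai-blog", "ai_generated", 0.95),
]
# Manufacturer doc ranks first despite lower raw relevance:
# 0.7 * 1.0 = 0.70  >  0.9 * 0.6 = 0.54  >  0.95 * 0.1 = 0.095
best = rank_sources(docs)[0]
```

A multiplicative weight keeps highly relevant community content in play while ensuring that, at equal relevance, authoritative sources always win.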
## Built with
| Layer | Stack |
|---|---|
| Vision identification | Leading commercial vision-language model |
| Orchestration & dialogue | Leading frontier LLM |
| Web retrieval | Commercial web-retrieval APIs |
| Document parsing | Layout-aware document parsers |
| Depth & segmentation | Open-source depth and segmentation models |
| Generative 3D (escalation) | Leading generative-3D models |
| 3D rendering | Browser-native 3D rendering |
| Backend | Python + FastAPI |
| Frontend | Next.js + React |
## What this means for you
If you have a workflow where a user provides a photo, scan, or video frame and you need an AI agent to identify, diagnose, and produce visual guidance — field service, claims adjusting, equipment maintenance, retail returns triage, anything in that family — we can build it. Vision-language reasoning is a capability we have validated end to end.
Want to discuss applied vision-language work in your domain? Contact us.