What this is
A complete document-digitization platform that we designed, built, and own outright as part of our internal capability portfolio. It is not a delivered client engagement — it is a reference implementation of how we approach end-to-end document AI work, and it is available to be tailored, deployed, and handed to organizations whose workflows need it.
We share the architecture and the working system in qualifying conversations under NDA. You can also see a live demo on request.
Why we built it
Records institutions, archives, healthcare-records operations, and any organization sitting on irreplaceable paper or scanned-image collections face the same problems: source documents vary in age, quality, binding, and format; line-of-business systems often live behind firewalls; budgets are real; and the long tail of materials includes both printed and handwritten content that defeats most off-the-shelf OCR.
We wanted a platform that could:
- Be operated by domain staff without specialized AV or imaging expertise.
- Handle the breadth of materials a real archive holds — not just the easy cases.
- Produce searchable, exportable digital records suitable for both long-term preservation and modern access.
- Run on commodity hardware affordable at multiple sites — including under air-gapped or restricted-connectivity conditions.
What it does
- A modern web capture frontend with camera capture, auto-capture mode, batch scanning, and per-document review — runnable on standard tablets.
- Image processing tuned for archival source material — page detection, deskewing, dewarping, despeckling, lighting correction, color handling — applied before OCR.
- Operator-selectable OCR for both printed and handwritten content, plus an enhanced layout-aware mode for tables and multi-column material.
- Records are indexed for both full-text and semantic search, so users find content by exact phrase and by meaning.
- Multi-format export — TIFF, JPEG2000, PDF/A-2b for long-term preservation; searchable PDF, markdown, and DOCX for modern access.
- CPU-first by default; GPU only required for the enhanced layout-aware mode. Deployable on premises and air-gapped where the environment requires it.
Built with
| Layer | Stack |
|---|---|
| Backend | Python, FastAPI |
| Image processing | Open-source computer-vision models (page detection, deskew, dewarp, despeckle) |
| OCR | Open-source OCR for printed text + transformer-based model for handwriting |
| Search | PostgreSQL with vector + full-text extensions, open-source embedding model |
| Object storage | S3-compatible |
| Frontend | Vite + React + TypeScript + Tailwind CSS (PWA support) |
| Reproducibility | Docker Compose (clean clone → running stack) |
| Hardware target | Commodity CPU server + tablet + stand + LED lighting |
What this means for you
If your organization needs to digitize a corpus of historical, archival, or operational documents at quality and cost that scale — and you do not want to hand the work, the data, or the long-term ownership to a closed platform — we can:
- Tailor and deploy this platform to your sources, workflow, and security posture.
- Combine printed and handwriting OCR so the long tail of your collection isn’t left behind.
- Deliver hybrid full-text + semantic search so retrieval works the way users actually look for things.
- Specify a commodity-hardware reference kit that scales from one site to many.
- Deploy on-premises or air-gapped when your environment requires it. Modern AI does not need a hyperscaler.
Want to see a live demo or walk through the architecture under NDA? Contact us.