How Machines Learn from Documents

Xtracta’s research programme focuses on how systems can learn to extract data from documents they’ve never seen before, without templates or manual rules. The engine builds its understanding from real-world production data, not training sets in a lab.

Understanding Language in Context

The research team works on how the engine interprets language within the specific context of a document. A “total” on a receipt means something different from a “total” on a contract. The engine needs to know the difference, and it does.
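As a rough illustration of context-dependent interpretation, the toy sketch below picks a meaning for "total" by counting which set of cue words overlaps most with the rest of the document. The cue lists, the field names, and the `interpret_total` function are all hypothetical and chosen for this example only; they are not a description of how the Xtracta engine works.

```python
# Toy illustration only: the cue words and field names below are assumptions,
# not part of any real product.
CONTEXT_CUES = {
    "amount_paid": {"receipt", "cash", "change", "paid"},
    "contract_value": {"agreement", "party", "term", "clause"},
}


def interpret_total(document_text: str) -> str:
    """Pick the meaning of 'total' whose cue words overlap most with the text."""
    words = set(document_text.lower().split())
    scores = {
        meaning: len(cues & words) for meaning, cues in CONTEXT_CUES.items()
    }
    best = max(scores, key=scores.get)
    # With no supporting context at all, refuse to guess.
    return best if scores[best] > 0 else "unknown"
```

A real system would learn such associations from data rather than from a hand-written list; the sketch only shows why the same word can resolve to different fields.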

Reading Structure, Not Just Text

Documents communicate through layout, not just words: tables, columns, headers, indentation, and spacing all carry meaning. The research programme focuses on how the engine reads these structural signals to understand what data means and where it belongs.
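To make the idea of a structural signal concrete, here is a minimal sketch that uses position alone to pair a label with its value: it looks for the nearest token on the same row, to the right of the label. The `Token` type, the `value_right_of` helper, and the coordinate convention are assumptions made for this example, not the engine's actual representation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    text: str
    x: float  # left edge of the token on the page (hypothetical units)
    y: float  # vertical position of the token on the page


def value_right_of(
    label: str, tokens: List[Token], row_tolerance: float = 2.0
) -> Optional[str]:
    """Return the text of the nearest token on the label's row, to its right."""
    anchor = next((t for t in tokens if t.text.lower() == label.lower()), None)
    if anchor is None:
        return None
    same_row = [
        t
        for t in tokens
        if t is not anchor
        and abs(t.y - anchor.y) <= row_tolerance  # roughly the same line
        and t.x > anchor.x  # positioned to the right of the label
    ]
    if not same_row:
        return None
    return min(same_row, key=lambda t: t.x).text
```

Given tokens for a "Subtotal" row and a "Total" row, the lookup for "total" returns the amount aligned with that row, even though both rows contain similar-looking numbers.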

Continuous Improvement at Scale

The research team works on how to make the engine improve continuously from production data across the entire global network. Not just for one customer, but for all of them. Patterns learned from one document type improve accuracy across others.

How the Engine Learns

Xtracta operates as a cloud service. Every document that has ever been processed across the entire global network contributes to a continuously evolving knowledge base. The engine mines this data to find patterns, correlations, and structures that improve how it extracts information.

This works in two parts. First, core data mining looks across the entire network, finds relevant links and structures, and generates an organically changing pool of extraction approaches. Second, the processing servers that capture data in real time for each customer are continuously updated with these learnings.
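The two-part arrangement described above can be sketched in a few lines. In this heavily simplified model, a central miner aggregates observations from the whole network, and a processing server serves requests from whatever snapshot it last synced. The class names, the label-to-field mapping, and the majority-vote rule are all assumptions made for illustration; the real pool of extraction approaches is certainly richer than a lookup table.

```python
from collections import Counter, defaultdict


class CentralMiner:
    """Part one: aggregates observations from across the network."""

    def __init__(self) -> None:
        self._counts = defaultdict(Counter)

    def observe(self, label: str, field: str) -> None:
        # Record that this label was seen to denote this field.
        self._counts[label.lower()][field] += 1

    def snapshot(self) -> dict:
        # For each label, keep the field it has most often been seen to mean.
        return {
            label: counts.most_common(1)[0][0]
            for label, counts in self._counts.items()
        }


class ProcessingServer:
    """Part two: serves a customer from the most recently synced snapshot."""

    def __init__(self) -> None:
        self._mapping = {}

    def sync(self, miner: CentralMiner) -> None:
        self._mapping = miner.snapshot()

    def extract_field(self, label: str):
        # Returns None for labels the server has not yet learned about.
        return self._mapping.get(label.lower())
```

Before a sync the server knows nothing about a new label; after it, the server benefits from observations that originated with other customers, which is the effect the section describes.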

The result: an engine that gets smarter every day, without anyone having to tell it what to do. A document processed in one country improves accuracy for a different customer in another. The more the network grows, the better it gets for everyone.

The Research Team

Our research team is a diverse group of specialists from around the world. Several hold PhDs. Many have published in international journals. They bring together the open thinking of academic research with the discipline of building products that work in production, every day, at global scale.

How the Technology Has Evolved

Research begins. The founding question: can a system learn to read documents the way a human does, without being told what to look for?


First production engine launches. The core approach works: learn from documents, not from templates.


Advanced extraction capabilities go live, delivering near-perfect results on complex documents from day one.


New models for complex line-item extraction. The system is re-architected onto scalable, container-based infrastructure.


Major investment in straight-through processing. New data transformation and validation capabilities reduce the need for human verification.


Research begins on deep learning transformer approaches. The goal: dramatically improved out-of-the-box extraction across all document types.


First production-ready deep learning transformer models rolled out for the most common document types.


The engine keeps evolving. The research team is working on the next generation of document understanding. The principle stays the same: set it up, let it learn, forget about it.