Building a Production Vision RAG System with ColPali and Light-ColPali

We took ColPali (vision-language embeddings for documents) and Light-ColPali (token merging via hierarchical clustering) and built the production infrastructure around them. The system uses PostgreSQL + pgvector as a unified store, a lease-based job queue for resilient ingestion, and a two-stage retrieval pipeline that retrieves at patch granularity but ranks at page level.
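To make "retrieves at patch granularity but ranks at page level" concrete, here is a minimal in-memory sketch. It rests on several assumptions not stated above: stage 1 is a per-query-token nearest-patch lookup (the step an approximate pgvector search would presumably serve in production), and stage 2 scores each candidate page with the MaxSim late-interaction sum that ColPali uses. The function names, the toy 2-dimensional vectors, and the `k` parameter are all hypothetical.

```python
from collections import defaultdict


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve_candidate_pages(query_vecs, patches, k=2):
    """Stage 1 (assumed): for each query token, take the k most similar
    patches and collect the pages they belong to."""
    candidates = set()
    for q in query_vecs:
        top = sorted(patches, key=lambda p: dot(q, p[1]), reverse=True)[:k]
        candidates.update(page_id for page_id, _ in top)
    return candidates


def rank_pages(query_vecs, patches, candidates):
    """Stage 2 (assumed): score each candidate page with MaxSim, i.e. for
    every query token take its best-matching patch on the page, then sum."""
    by_page = defaultdict(list)
    for page_id, vec in patches:
        if page_id in candidates:
            by_page[page_id].append(vec)
    scores = {
        page_id: sum(max(dot(q, v) for v in vecs) for q in query_vecs)
        for page_id, vecs in by_page.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data: (page_id, patch_embedding) pairs standing in for stored rows.
patches = [
    ("page-1", [1.0, 0.0]),
    ("page-1", [0.0, 1.0]),
    ("page-2", [0.9, 0.1]),
    ("page-2", [0.8, 0.2]),
    ("page-3", [-1.0, 0.0]),
]
query_vecs = [[1.0, 0.0], [0.0, 1.0]]  # one vector per query token

candidates = retrieve_candidate_pages(query_vecs, patches, k=2)
ranking = rank_pages(query_vecs, patches, candidates)
print(ranking)
```

In this toy run, page-3 never reaches stage 2 because none of its patches is among the nearest neighbours of any query token, and page-1 ranks above page-2 because it has a strong patch match for both query tokens rather than just one.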

The key insight: text extraction is lossy. For documents with complex layouts, charts, and tables, embedding the rendered page as an image preserves the visual structure that text extraction discards, which is exactly where text-based RAG falls short.

  • March 8, 2026