AI Data Infrastructure
This page tracks databases, search, OCR/document conversion, and data workflows around AI systems.
Sources in this batch
- Hugging Face describes running OCR on 30,000 papers using Codex, open OCR models, and Hugging Face Jobs.
- MotherDuck discusses Claude and the future of data.
- ClickHouse argues AI is redrawing the database market.
- SQLite FTS5 provides a lightweight full-text search substrate.
- Microsoft’s MarkItDown converts files and office documents to Markdown.
Research interest
The surprising angle is that AI changes what “data infrastructure” is for. Beyond querying structured data, it now covers turning messy documents into agent-readable corpora, maintaining local searchable memory, and letting models operate directly over databases and text indexes. For this wiki, SQLite FTS5 and MarkItDown are practical primitives for building agent-native research infrastructure.
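To make the "local searchable memory" idea concrete, here is a minimal sketch of an FTS5 index using Python's standard `sqlite3` module. It assumes the SQLite build bundled with Python was compiled with FTS5, which is common but not guaranteed. The table name `notes`, the column names, the helper functions, and the sample documents are all hypothetical and chosen only for illustration.

```python
import sqlite3

# Hypothetical sample corpus: (source, body) pairs. In practice the body text
# might come from a document converter; here it is hard-coded.
SAMPLE_DOCS = [
    ("markitdown.md", "MarkItDown converts office documents to Markdown."),
    ("fts5.md", "SQLite FTS5 provides a lightweight search substrate."),
    ("ocr.md", "Open OCR models turn scanned papers into text."),
]


def build_index(docs):
    """Create an in-memory FTS5 index from (source, body) pairs."""
    conn = sqlite3.connect(":memory:")
    # 'source' is stored but not tokenized, so it works as a provenance label.
    conn.execute("CREATE VIRTUAL TABLE notes USING fts5(source UNINDEXED, body)")
    conn.executemany("INSERT INTO notes (source, body) VALUES (?, ?)", docs)
    conn.commit()
    return conn


def search(conn, query, limit=5):
    """Return (source, snippet) rows for an FTS5 query, best match first."""
    rows = conn.execute(
        "SELECT source, snippet(notes, 1, '[', ']', '...', 8) "
        "FROM notes WHERE notes MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = build_index(SAMPLE_DOCS)
    for source, snippet in search(conn, "markdown"):
        print(source, snippet)
```

Storing the source alongside the text keeps every hit traceable to the file it came from, which is one simple way an agent could cite what it retrieved. Note that FTS5 query strings have their own syntax, so terms containing punctuation such as hyphens generally need to be quoted inside the query.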
Open questions:
- Which database interfaces are most natural for agents: SQL, semantic search, tools, or generated code?
- How should document conversion preserve provenance and layout-critical meaning?
- Can local FTS plus curated wiki pages outperform heavier RAG for research workflows?