AI Data Infrastructure

This page tracks databases, search, OCR/document conversion, and data workflows around AI systems.

Sources in this batch

  • Hugging Face describes running OCR on 30,000 papers using Codex, open OCR models, and Hugging Face Jobs.
  • MotherDuck discusses Claude and the future of data.
  • ClickHouse argues AI is redrawing the database market.
  • SQLite FTS5 provides a lightweight full-text search substrate.
  • Microsoft’s MarkItDown converts files and office documents to Markdown.

Research interest

The surprising angle is that AI changes what “data infrastructure” is for. It is no longer only about querying structured data; it also covers turning messy documents into agent-readable corpora, maintaining local searchable memory, and letting models operate directly over databases and text indexes. For this wiki, two of the sources are practical primitives for building agent-native research infrastructure: MarkItDown for converting files to Markdown, and SQLite FTS5 for searching the resulting text locally.
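As a rough illustration of the "local searchable memory" idea, here is a minimal sketch of an FTS5 index built with Python's standard sqlite3 module. It assumes the SQLite library bundled with your Python build was compiled with FTS5 enabled, which is common but not guaranteed. The table name `notes`, the columns `title`, `body`, and `source`, the helper functions, and the sample rows are all hypothetical and chosen only for this example.

```python
import sqlite3

# Hypothetical sample rows: (title, body, source path). Not real wiki data.
DOCS = [
    ("OCR pipeline",
     "Converted scanned papers to Markdown with an open OCR model.",
     "papers/ocr.md"),
    ("Database notes",
     "Agents can issue SQL directly against analytical databases.",
     "notes/db.md"),
]


def build_index(docs):
    """Create an in-memory FTS5 table and load (title, body, source) rows."""
    conn = sqlite3.connect(":memory:")
    # `source` is stored but UNINDEXED, so provenance travels with each row
    # without being matched by search terms.
    conn.execute(
        "CREATE VIRTUAL TABLE notes USING fts5(title, body, source UNINDEXED)"
    )
    conn.executemany(
        "INSERT INTO notes (title, body, source) VALUES (?, ?, ?)", docs
    )
    conn.commit()
    return conn


def search(conn, query, limit=5):
    """Return (title, source, snippet) tuples, best match first."""
    rows = conn.execute(
        "SELECT title, source, snippet(notes, 1, '[', ']', '...', 8) "
        "FROM notes WHERE notes MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = build_index(DOCS)
    for title, source, snippet in search(conn, "sql"):
        print(title, source, snippet)
```

The query string is passed to FTS5 as-is, so it follows FTS5 query syntax; input containing punctuation such as hyphens or quotes may need escaping before it is handed to `search`.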

Open questions:

  • Which database interfaces are most natural for agents: SQL, semantic search, tools, or generated code?
  • How should document conversion preserve provenance and layout-critical meaning?
  • Can local full-text search (FTS) plus curated wiki pages outperform a heavier retrieval-augmented generation (RAG) pipeline for research workflows?