Document Ingestion

From Upload to Answers in Minutes

Our ingestion pipeline automatically parses, chunks, and indexes your documents so your team can search by meaning from day one.

The ingestion pipeline

Five stages from raw file to searchable knowledge. Fully automated, fully transparent.

Upload

Drag and drop files, bulk upload via API, or point our web crawler at any URL. We handle the rest.
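For the API path, here is a minimal Python sketch of what a bulk-upload request might look like. The `/v1/documents` endpoint and JSON fields are illustrative assumptions, not the documented schema; check the API reference for the real one.

```python
import json
import os
import urllib.request

# Hypothetical upload endpoint and payload -- assumptions for illustration.
API_KEY = os.environ.get("API_KEY", "demo-key")

req = urllib.request.Request(
    "https://api.corpusfabric.com/v1/documents",
    data=json.dumps({"filename": "handbook.pdf", "workspace": "docs"}).encode(),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    method="POST",
)
# urllib.request.urlopen(req) would send it; omitted here.
print(req.get_method(), req.full_url)
```

The same request works from any HTTP client; authentication is a bearer token in the `Authorization` header.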

Parse

Our parser extracts text, tables, headers, and metadata from every supported format. OCR for scanned documents is included.
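To make the extraction step concrete, here is a toy sketch that pulls visible text and headers out of HTML using Python's standard-library parser. The production parser handles much more (tables, metadata, OCR); this only shows the shape of the idea.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, tracking headers separately."""
    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.headers = []
        self._in_header = False

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self._in_header = True

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3"):
            self._in_header = False

    def handle_data(self, data):
        stripped = data.strip()
        if not stripped:
            return
        self.text_parts.append(stripped)
        if self._in_header:
            self.headers.append(stripped)

parser = TextExtractor()
parser.feed("<h1>Refund Policy</h1><p>Refunds are issued within 30 days.</p>")
print(parser.headers)
print(parser.text_parts)
```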

Chunk

Documents are intelligently split into semantic chunks that preserve context. No arbitrary page breaks or lost meaning.
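The idea behind context-preserving chunking can be sketched in a few lines. This toy version splits on paragraph boundaries and carries one trailing paragraph into the next chunk so context survives the split; the real chunker is more sophisticated.

```python
def chunk_text(text, max_chars=200, overlap=1):
    """Split on paragraph boundaries; carry `overlap` trailing
    paragraphs into the next chunk so context is preserved."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        candidate = "\n\n".join(current + [para])
        if current and len(candidate) > max_chars:
            chunks.append("\n\n".join(current))
            current = current[-overlap:]  # overlap for continuity
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks

text = (
    "First paragraph about setup.\n\n"
    "Second paragraph about usage.\n\n"
    "Third paragraph about caveats."
)
chunks = chunk_text(text, max_chars=60)
print(chunks)
```

Note the middle paragraph appears at the end of one chunk and the start of the next: that overlap is what keeps a sentence's context intact.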

Embed

Each chunk is converted to a high-dimensional vector embedding that captures its meaning, not just keywords.
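As a stand-in for a real embedding model, here is a toy function that hashes tokens into a fixed-length vector. It captures the shape of the interface (text in, fixed-length vector out, compared by cosine similarity), not the quality of a trained model.

```python
import hashlib
import math

def embed(text, dims=64):
    """Toy embedding: hash each token into a bucket of a fixed-size
    vector, then normalize. Real systems use a trained model."""
    vec = [0.0] * dims
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dims] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity of two unit-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))

v = embed("how do refunds work")
similar = cosine(embed("refund policy"), embed("our refund policy rules"))
print(len(v), similar)
```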

Search

Your documents are now searchable by meaning. Ask any question and get cited answers in milliseconds.
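A sketch of how ranked, cited retrieval fits together, using simple token overlap as a stand-in for vector similarity; the document names and corpus here are made up for illustration.

```python
def score(query, chunk):
    """Stand-in for vector similarity: token-set overlap (Jaccard).
    In production, query and chunk embeddings are compared instead."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / len(q | c)

def search(query, corpus):
    """Return chunks ranked by similarity, each with its source citation."""
    ranked = sorted(
        ((score(query, text), doc, text) for doc, text in corpus),
        reverse=True,
    )
    return [(doc, text) for s, doc, text in ranked if s > 0]

corpus = [
    ("handbook.pdf", "Refunds are issued within 30 days of purchase."),
    ("roadmap.docx", "The Q3 roadmap focuses on mobile support."),
]
results = search("when are refunds issued", corpus)
print(results)
```

Each result carries the document it came from, which is what lets answers be cited back to a source.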

Supported formats

We handle the formats your team actually uses. No conversion required.

PDF

Native and scanned

DOCX

Word documents

XLSX

Spreadsheets

PPTX

Presentations

CSV

Structured data

HTML

Web pages

TXT

Plain text

MD

Markdown

Web Crawler

Index your website too

Point our web crawler at any URL and we’ll automatically index the content. Perfect for knowledge bases, documentation sites, and public-facing web content.

  • Crawl any public or authenticated website
  • Follow links to a configurable depth
  • Respect robots.txt and rate limits
  • Schedule recurring crawls for fresh content
  • Extract clean text from complex HTML layouts
  • Handle JavaScript-rendered pages
  • Automatic deduplication of content
  • Sitemap-aware for complete coverage
# Crawl a website via API
curl -X POST https://api.corpusfabric.com/v1/crawl \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "url": "https://docs.example.com",
    "depth": 3,
    "workspace": "docs"
  }'

Or use the dashboard UI — no code required.

Upload your first document in 60 seconds

No configuration, no training, no waiting. Drop your files and start asking questions immediately.