Document Ingestion

From Upload to Answers in Minutes

Our ingestion pipeline automatically parses, chunks, and indexes your documents so your team can search by meaning from day one.

The ingestion pipeline

Five stages from raw file to searchable knowledge. Fully automated, fully transparent.

Upload

Drag and drop files, bulk upload via API, or point our web crawler at any URL. We handle the rest.
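For the API path, here is a minimal Python sketch of what a bulk-upload request might look like. The `/v1/documents` endpoint and JSON fields are illustrative assumptions, not the documented schema; check the API reference for the real one.

```python
import json
import os
import urllib.request

# Hypothetical upload endpoint and payload -- assumptions for illustration.
API_KEY = os.environ.get("API_KEY", "demo-key")

req = urllib.request.Request(
    "https://api.corpusfabric.com/v1/documents",
    data=json.dumps({"filename": "handbook.pdf", "workspace": "docs"}).encode(),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    method="POST",
)
# urllib.request.urlopen(req) would send it; omitted here.
print(req.get_method(), req.full_url)
```

The same request works from any HTTP client; authentication is a bearer token in the `Authorization` header.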

Parse

Our parser extracts text, tables, headers, and metadata from every supported format. OCR for scanned documents is included.
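To make the extraction step concrete, here is a toy sketch that pulls visible text and headers out of HTML using Python's standard-library parser. The production parser handles much more (tables, metadata, OCR); this only shows the shape of the idea.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, tracking headers separately."""
    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.headers = []
        self._in_header = False

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self._in_header = True

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3"):
            self._in_header = False

    def handle_data(self, data):
        stripped = data.strip()
        if not stripped:
            return
        self.text_parts.append(stripped)
        if self._in_header:
            self.headers.append(stripped)

parser = TextExtractor()
parser.feed("<h1>Refund Policy</h1><p>Refunds are issued within 30 days.</p>")
print(parser.headers)
print(parser.text_parts)
```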

Chunk

Documents are intelligently split into semantic chunks that preserve context. No arbitrary page breaks or lost meaning.
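The idea behind context-preserving chunking can be sketched in a few lines. This toy version splits on paragraph boundaries and carries one trailing paragraph into the next chunk so context survives the split; the real chunker is more sophisticated.

```python
def chunk_text(text, max_chars=200, overlap=1):
    """Split on paragraph boundaries; carry `overlap` trailing
    paragraphs into the next chunk so context is preserved."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        candidate = "\n\n".join(current + [para])
        if current and len(candidate) > max_chars:
            chunks.append("\n\n".join(current))
            current = current[-overlap:]  # overlap for continuity
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks

text = (
    "First paragraph about setup.\n\n"
    "Second paragraph about usage.\n\n"
    "Third paragraph about caveats."
)
chunks = chunk_text(text, max_chars=60)
print(chunks)
```

Note the middle paragraph appears at the end of one chunk and the start of the next: that overlap is what keeps a sentence's context intact.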

Embed

Each chunk is converted to a high-dimensional vector embedding that captures its meaning, not just keywords.
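As a stand-in for a real embedding model, here is a toy function that hashes tokens into a fixed-length vector. It captures the shape of the interface (text in, fixed-length vector out, compared by cosine similarity), not the quality of a trained model.

```python
import hashlib
import math

def embed(text, dims=64):
    """Toy embedding: hash each token into a bucket of a fixed-size
    vector, then normalize. Real systems use a trained model."""
    vec = [0.0] * dims
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dims] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity of two unit-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))

v = embed("how do refunds work")
similar = cosine(embed("refund policy"), embed("our refund policy rules"))
print(len(v), similar)
```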

Search

Your documents are now searchable by meaning. Ask any question and get cited answers in milliseconds.
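A sketch of how ranked, cited retrieval fits together, using simple token overlap as a stand-in for vector similarity; the document names and corpus here are made up for illustration.

```python
def score(query, chunk):
    """Stand-in for vector similarity: token-set overlap (Jaccard).
    In production, query and chunk embeddings are compared instead."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / len(q | c)

def search(query, corpus):
    """Return chunks ranked by similarity, each with its source citation."""
    ranked = sorted(
        ((score(query, text), doc, text) for doc, text in corpus),
        reverse=True,
    )
    return [(doc, text) for s, doc, text in ranked if s > 0]

corpus = [
    ("handbook.pdf", "Refunds are issued within 30 days of purchase."),
    ("roadmap.docx", "The Q3 roadmap focuses on mobile support."),
]
results = search("when are refunds issued", corpus)
print(results)
```

Each result carries the document it came from, which is what lets answers be cited back to a source.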

Supported formats

We handle the formats your team actually uses. No conversion required.

PDF

Native and scanned

DOCX

Word documents

XLSX

Spreadsheets

PPTX

Presentations

CSV

Structured data

HTML

Web pages

TXT

Plain text

MD

Markdown

Web Crawler

Index your website too

Point our web crawler at any URL and we’ll automatically index the content. Perfect for knowledge bases, documentation sites, and public-facing web content.

  • Crawl any public or authenticated website
  • Follow links to a configurable depth
  • Respect robots.txt and rate limits
  • Schedule recurring crawls for fresh content
  • Extract clean text from complex HTML layouts
  • Handle JavaScript-rendered pages
  • Automatic deduplication of content
  • Sitemap-aware for complete coverage
# Crawl a website via API
curl -X POST https://api.corpusfabric.com/v1/crawl \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "url": "https://docs.example.com",
    "depth": 3,
    "workspace": "docs"
  }'

Or use the dashboard UI — no code required.

Upload your first document in 60 seconds

No configuration, no training, no waiting. Drop your files and start asking questions immediately.