
Stress Test Seeding Performance Report

Summary

This report documents performance metrics from incrementally seeding the database to stress test the LangChain migration (Feature 014). Stages 1-3 have been run; Stage 4 is planned, and its figures are estimates.

System Specifications

  • RAM: 120 GB
  • Storage: 916 GB NVMe SSD (588 GB available before test)
  • Database: PostgreSQL 15 (Docker container)
  • Partition: /dev/nvme1n1p2

Test Stages

Stage 1: Small Test (1K docs, ~50K chunks)

| Metric | Value |
| --- | --- |
| Documents | 1,000 |
| Chunks | 50,326 |
| Mega doc chunks | 10,000 |
| DB Size | 429 MB |
| Time | 20.0 seconds |
| Rate | ~2,511 chunks/s |
| Size per chunk | ~8.5 KB |

Stage 2: Medium Test (10K docs, ~460K chunks)

| Metric | Value |
| --- | --- |
| Documents | 10,000 |
| Chunks | 459,342 |
| Mega doc chunks | 50,000 |
| DB Size | 3.75 GB |
| Time | 177.4 seconds (~3 min) |
| Rate | ~2,589 chunks/s |
| Size per chunk | ~8.35 KB |

Stage 3: Large Test (100K docs, ~4.2M chunks) - COMPLETED

| Metric | Value |
| --- | --- |
| Documents | 100,000 |
| Chunks | 4,202,829 |
| Mega doc chunks | 100,000 |
| Vector stores | 10 |
| Specialists | 20 |
| DB Size | 34 GB |
| Time | 1,747.3 seconds (~29 min) |
| Rate | ~2,405 chunks/s |
| Size per chunk | ~8.1 KB |
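
The Rate rows in the three stage tables above are chunks divided by elapsed seconds. A minimal TypeScript sketch that rederives them from the reported figures follows; the `StageResult` type and `chunksPerSecond` helper are illustrative names, not part of the seeding script.

```typescript
// Illustrative helper, not part of seed-stress-test-data.ts.
interface StageResult {
  name: string;
  chunks: number;
  seconds: number;
}

// Figures copied from the Stage 1-3 tables in this report.
const stages: StageResult[] = [
  { name: "Stage 1", chunks: 50326, seconds: 20.0 },
  { name: "Stage 2", chunks: 459342, seconds: 177.4 },
  { name: "Stage 3", chunks: 4202829, seconds: 1747.3 },
];

// Throughput is total chunks divided by wall-clock seconds.
function chunksPerSecond(stage: StageResult): number {
  return stage.chunks / stage.seconds;
}

for (const stage of stages) {
  console.log(`${stage.name}: ${Math.round(chunksPerSecond(stage))} chunks/s`);
}
```

Stages 2 and 3 reproduce the reported rates. Stage 1 comes out a few chunks per second above the reported ~2,511, presumably because the 20.0-second figure is rounded.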

Stage 4: Full Test (1M docs, ~25.5M chunks) - PLANNED

| Metric | Estimated Value |
| --- | --- |
| Documents | 1,000,000 |
| Expected chunks | ~25,500,000 |
| Mega doc chunks | 500,000 |
| Vector stores | 10 |
| Specialists | 20 |
| Estimated DB size | ~210-250 GB |
| Estimated time | ~3 hours |
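
The Stage 4 size estimate can be sanity-checked by scaling the per-chunk sizes measured in Stages 1-3 and comparing the result with the 588 GB that was free before testing. The sketch below assumes decimal units (1 KB = 1,000 bytes, 1 GB = 10^9 bytes); the function and constant names are illustrative.

```typescript
// Assumption: decimal units (1 KB = 1,000 bytes, 1 GB = 1e9 bytes).
const BYTES_PER_KB = 1000;
const BYTES_PER_GB = 1e9;

// Scale a measured per-chunk size up to a target chunk count.
function estimateDbSizeGb(chunks: number, kbPerChunk: number): number {
  return (chunks * kbPerChunk * BYTES_PER_KB) / BYTES_PER_GB;
}

const expectedChunks = 25500000; // Stage 4 target from the table above
const availableGb = 588;         // free space before the test (System Specifications)
const plannedUpperBoundGb = 250; // upper end of the report's size estimate

const lowGb = estimateDbSizeGb(expectedChunks, 8.1);  // Stage 3 per-chunk size
const highGb = estimateDbSizeGb(expectedChunks, 8.5); // Stage 1 per-chunk size
const headroomGb = availableGb - plannedUpperBoundGb;

console.log(`Scaled estimate: ${lowGb.toFixed(0)}-${highGb.toFixed(0)} GB`);
console.log(`Headroom at ${plannedUpperBoundGb} GB: ${headroomGb} GB`);
```

Straight-line scaling lands a little above 200 GB, so the ~210-250 GB range includes some margin, and even the 250 GB upper bound leaves 338 GB free on the partition.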

Performance Metrics

Chunk Insert Rate

  • Average: ~2,400-2,600 chunks/second across Stages 1-3
  • Consistent: Rate varied by less than 8% as the table grew from 50K to 4.2M chunks
  • Batch size: 2,000 chunks per INSERT
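
For illustration, a multi-row INSERT of the kind a 2,000-chunk batch implies can be built as below. This is a sketch under assumptions, not the seeding script's actual code: the `chunks` table, its column names, and both helper functions are hypothetical. One real constraint worth checking is that PostgreSQL accepts at most 65,535 bind parameters per statement, so rows × columns must stay under that limit.

```typescript
// Hypothetical helpers; the table and column names used below are placeholders.
const MAX_BIND_PARAMS = 65535; // PostgreSQL limit on bind parameters per statement

// Build "INSERT ... VALUES ($1, $2), ($3, $4), ..." for rowCount rows.
function buildBatchInsert(table: string, columns: string[], rowCount: number): string {
  const paramCount = rowCount * columns.length;
  if (paramCount > MAX_BIND_PARAMS) {
    throw new Error(`${paramCount} parameters exceeds the ${MAX_BIND_PARAMS} limit`);
  }
  const rows: string[] = [];
  for (let row = 0; row < rowCount; row++) {
    const placeholders = columns.map((_, col) => `$${row * columns.length + col + 1}`);
    rows.push(`(${placeholders.join(", ")})`);
  }
  return `INSERT INTO ${table} (${columns.join(", ")}) VALUES ${rows.join(", ")}`;
}

// Split items into consecutive batches of at most `size` elements.
function splitIntoBatches<T>(items: T[], size: number): T[][] {
  const result: T[][] = [];
  for (let start = 0; start < items.length; start += size) {
    result.push(items.slice(start, start + size));
  }
  return result;
}

const columns = ["document_id", "chunk_index", "content", "embedding"];
const sql = buildBatchInsert("chunks", columns, 2000);
console.log(sql.length, "characters,", 2000 * columns.length, "parameters");
```

With four columns, a 2,000-row batch uses 8,000 parameters, well under the limit. The flattened row values would be passed as the parameter array to whichever PostgreSQL client is in use.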

Storage Calculations

  • Per chunk (avg): ~8.35 KB
  • Embedding (1536 × 4 bytes): 6.1 KB
  • Text content: ~500 bytes
  • Metadata + indexes: ~1.7 KB
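
The breakdown above can be checked with a few lines of arithmetic. The sketch assumes decimal kilobytes (1 KB = 1,000 bytes) and treats the text and metadata figures as the approximations given in this report; the constant names are illustrative.

```typescript
// Assumption: 1 KB = 1,000 bytes, matching the 6.1 KB embedding figure.
const EMBEDDING_DIMENSIONS = 1536;
const BYTES_PER_COMPONENT = 4;

const embeddingBytes = EMBEDDING_DIMENSIONS * BYTES_PER_COMPONENT; // 6,144 bytes
const textBytes = 500;              // approximate, from this report
const metadataAndIndexBytes = 1700; // approximate, from this report

const totalBytes = embeddingBytes + textBytes + metadataAndIndexBytes;
const embeddingShare = embeddingBytes / totalBytes;

console.log(`Embedding: ${(embeddingBytes / 1000).toFixed(1)} KB`);
console.log(`Total per chunk: ${(totalBytes / 1000).toFixed(2)} KB`);
console.log(`Embedding share: ${(embeddingShare * 100).toFixed(0)}%`);
```

The components sum to about 8.3 KB, in line with the ~8.35 KB average, and the embedding accounts for roughly three quarters of each chunk's footprint.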

Time Estimates

| Chunks | Estimated Time |
| --- | --- |
| 50K | 20 seconds |
| 500K | 3.5 minutes |
| 4M | 28 minutes |
| 25M | ~3 hours |
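
These estimates are the chunk count divided by the sustained insert rate. A small sketch, assuming the conservative end of the measured rate (2,400 chunks/s); `estimateSeconds` and `formatDuration` are illustrative helpers.

```typescript
// Seconds needed to insert `chunks` at a sustained rate (default 2,400 chunks/s).
function estimateSeconds(chunks: number, chunksPerSecond: number = 2400): number {
  return chunks / chunksPerSecond;
}

// Render a duration in the largest sensible unit.
function formatDuration(seconds: number): string {
  if (seconds < 60) return `${Math.round(seconds)} s`;
  if (seconds < 3600) return `${(seconds / 60).toFixed(1)} min`;
  return `${(seconds / 3600).toFixed(1)} h`;
}

const chunkCounts = [50000, 500000, 4000000, 25000000];
for (const chunks of chunkCounts) {
  console.log(`${chunks} chunks: ${formatDuration(estimateSeconds(chunks))}`);
}
```

Passing a different rate as the second argument shows how sensitive the Stage 4 estimate is to throughput.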

Chunk Distribution (Variable Chunks Mode)

The --variable-chunks flag creates a realistic distribution of document sizes:

| Category | Chunks per doc | % of docs | Example |
| --- | --- | --- | --- |
| Small | 1-5 | 50% | Notes, single pages |
| Medium | 6-30 | 25% | Reports, articles |
| Large | 31-100 | 15% | Manuals, guides |
| Very Large | 101-300 | 8% | Comprehensive PDFs |
| Huge | 301-1000 | 1.9% | Technical specs |
| Extreme | 1001+ | 0.1% | Large datasets |
| Mega doc | Configurable | 1 doc | 500K chunks for stress testing |
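
One way to produce a distribution like this is weighted sampling over the categories followed by a uniform draw inside the chosen range. The sketch below is an assumption about how such a generator could work, not the implementation behind --variable-chunks. The `CATEGORIES` list mirrors the percentages above, the 2,000-chunk upper bound for Extreme is a placeholder (the table only says 1001+), and the mega doc is omitted because it is a single, separately configured document.

```typescript
interface Category {
  name: string;
  min: number;    // inclusive lower bound on chunks per document
  max: number;    // inclusive upper bound on chunks per document
  weight: number; // percentage of documents in this category
}

const CATEGORIES: Category[] = [
  { name: "Small", min: 1, max: 5, weight: 50 },
  { name: "Medium", min: 6, max: 30, weight: 25 },
  { name: "Large", min: 31, max: 100, weight: 15 },
  { name: "Very Large", min: 101, max: 300, weight: 8 },
  { name: "Huge", min: 301, max: 1000, weight: 1.9 },
  { name: "Extreme", min: 1001, max: 2000, weight: 0.1 }, // max is a placeholder
];

// Pick a category with probability proportional to its weight.
// `rng` must return a number in [0, 1); it is injectable for testing.
function pickCategory(rng: () => number = Math.random): Category {
  const totalWeight = CATEGORIES.reduce((sum, c) => sum + c.weight, 0);
  let threshold = rng() * totalWeight;
  for (const category of CATEGORIES) {
    threshold -= category.weight;
    if (threshold < 0) return category;
  }
  return CATEGORIES[CATEGORIES.length - 1]; // guard against rounding error
}

// Draw a chunk count uniformly from the chosen category's range.
function sampleChunkCount(rng: () => number = Math.random): number {
  const { min, max } = pickCategory(rng);
  return min + Math.floor(rng() * (max - min + 1));
}

console.log(Array.from({ length: 10 }, () => sampleChunkCount()));
```

A uniform draw within each range is only one choice; the measured average of roughly 40 chunks per document in Stages 1-3 suggests the real generator may weight the lower end of each range more heavily.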

Commands Reference

# Clean existing stress test data
pnpm ts-node src/scripts/seed-stress-test-data.ts --clean

# Stage 1: Small test
pnpm ts-node src/scripts/seed-stress-test-data.ts \
  --documents=1000 \
  --chunk-sampling=100 \
  --variable-chunks \
  --mega-doc-chunks=10000 \
  --vector-stores=5 \
  --specialists=10

# Stage 2: Medium test
pnpm ts-node src/scripts/seed-stress-test-data.ts \
  --documents=10000 \
  --chunk-sampling=100 \
  --variable-chunks \
  --mega-doc-chunks=50000 \
  --vector-stores=5 \
  --specialists=10

# Stage 3: Large test
pnpm ts-node src/scripts/seed-stress-test-data.ts \
  --documents=100000 \
  --chunk-sampling=100 \
  --variable-chunks \
  --mega-doc-chunks=100000 \
  --vector-stores=10 \
  --specialists=20 \
  --batch-size=2000

# Stage 4: Full test (500K mega doc)
pnpm ts-node src/scripts/seed-stress-test-data.ts \
  --documents=1000000 \
  --chunk-sampling=100 \
  --variable-chunks \
  --mega-doc-chunks=500000 \
  --max-chunks=1000 \
  --vector-stores=10 \
  --specialists=20 \
  --batch-size=2000

Disk Space Requirements

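| Stage | Chunks | DB Size | Cumulative | Time |
| --- | --- | --- | --- | --- |
| Base | 0 | 23 MB | 23 MB | - |
| Stage 1 | 50K | 429 MB | 429 MB | 20s |
| Stage 2 | 460K | 3.8 GB | 3.8 GB | 3 min |
| Stage 3 | 4.2M | 34 GB | 34 GB | 29 min |
| Stage 4 | 25.5M | ~210 GB | ~210 GB | ~3 hrs (est.) |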

Notes

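  1. Storage is efficient: Each chunk with its embedding takes ~8.35 KB, of which ~6.1 KB is the embedding itself
  2. Rate is consistent: INSERT throughput did not degrade as the table grew from 50K to 4.2M chunks
  3. Variable chunks: The --variable-chunks flag produces a realistic spread of document sizes, including edge cases for testing
  4. Mega document: A single document with up to 500K chunks (Stage 4) tests extreme scenarios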

Report generated during stress test seeding for LangChain Migration (Feature 014)