
Manual Testing Guide: Feature 014 - LangChain Migration

This document provides manual test cases for validating the LangChain migration. Run them from the Mobile App and the Admin Specialist Test screen.

Prerequisites

1. Database Setup

Run the stress test seeding script to create test data:

cd packages/backend

# Option 1: Light test (10k documents, 100k chunks) - Recommended for initial testing
pnpm ts-node src/scripts/seed-stress-test-data.ts --documents=10000 --chunk-sampling=100

# Option 2: Medium test (100k documents, 100k chunks) - Good for performance testing
pnpm ts-node src/scripts/seed-stress-test-data.ts --documents=100000 --chunk-sampling=10

# Option 3: Full scale test (1M documents, 3M chunks) - Production simulation
# Requires ~20GB storage for embeddings, takes 30-60 minutes to seed
pnpm ts-node src/scripts/seed-stress-test-data.ts --documents=1000000 --chunk-sampling=100 --chunks-per-doc=3 --vector-stores=10 --specialists=20

# Clean up when done
pnpm ts-node src/scripts/seed-stress-test-data.ts --clean

2. Verify Seed Data

After seeding, verify the data was created:

-- Check specialists
SELECT name, "organizationId" FROM specialists WHERE "organizationId" = 'stress-test-org';

-- Check vector stores and chunk counts
SELECT vs.name, COUNT(fc.id) as chunk_count
FROM vector_store vs
LEFT JOIN file_metadata fm ON fm.vector_store_id = vs.id
LEFT JOIN file_chunks fc ON fc.file_id = fm.id
WHERE vs.organization_id = 'stress-test-org'
GROUP BY vs.id, vs.name;

-- Check MCP plugins
SELECT name, "displayName", "pluginType" FROM mcp_plugins WHERE name LIKE '%-stress';

3. Start Services

# Terminal 1: Start backend
cd packages/backend && pnpm run dev

# Terminal 2: Start admin
cd packages/admin && pnpm run dev

# Terminal 3 (optional): Start mobile
cd packages/mobile-app && pnpm run start

Test Categories

  1. Telemetry Tests
  2. Specialist Handover Tests
  3. Vector Store Query Tests
  4. Tool Call Tests

1. Telemetry Tests

Test 1.1: Basic Telemetry Capture

Objective: Verify telemetry is captured for simple queries

Steps:

  1. Open Admin Dashboard
  2. Navigate to a specialist test page
  3. Send a simple query: "What is the refund policy?"
  4. Check the Telemetry tab/section

Expected Results:

  • [ ] Trace ID is generated and displayed
  • [ ] TRIGGER node shows the original query
  • [ ] SPECIALIST node shows which specialist was selected
  • [ ] Query timestamp is recorded
  • [ ] Response time is captured

Verification Query:

SELECT * FROM routing_telemetry
WHERE "organizationId" = 'stress-test-org'
ORDER BY "createdAt" DESC LIMIT 5;

Test 1.2: Telemetry with Tool Execution

Objective: Verify telemetry captures tool calls

Steps:

  1. Navigate to Technical Support specialist (has tools)
  2. Send query: "What is 25 * 47?"
  3. Check telemetry trace

Expected Results:

  • [ ] ACTION node appears for tool call
  • [ ] Tool name (calculate) is recorded
  • [ ] Tool input parameters are logged
  • [ ] Tool output is captured
  • [ ] Execution duration is measured

Test 1.3: Telemetry Under Load

Objective: Verify telemetry performance with many concurrent requests

Steps:

  1. Open 3-5 browser tabs with specialist test pages
  2. Send queries simultaneously from all tabs
  3. Check each trace is independent

Expected Results:

  • [ ] Each request gets unique trace ID
  • [ ] No cross-contamination between traces
  • [ ] Latency remains under 2 seconds
  • [ ] All traces are queryable
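One way to check trace independence is to copy the trace ID shown in each tab into a list and look for repeats. The helper below is a minimal sketch; `findDuplicateTraceIds` is a hypothetical name used for illustration and is not part of the codebase.

```typescript
// Hypothetical helper: returns every trace ID that appears more than once.
function findDuplicateTraceIds(traceIds: string[]): string[] {
  const seen = new Set<string>();
  const duplicates = new Set<string>();
  for (const id of traceIds) {
    if (seen.has(id)) {
      duplicates.add(id);
    }
    seen.add(id);
  }
  return Array.from(duplicates);
}

// Example with made-up IDs collected from three tabs.
const collectedTraceIds = ["trace-a", "trace-b", "trace-a"];
const repeatedTraceIds = findDuplicateTraceIds(collectedTraceIds);
console.log(repeatedTraceIds.length === 0 ? "all unique" : repeatedTraceIds);
```

An empty result means every request received its own trace ID.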

Test 1.4: Telemetry Data Sanitization

Objective: Verify sensitive data is not stored in telemetry

Steps:

  1. Send query containing email: "My email is test@example.com, what's my balance?"
  2. Check telemetry data

Expected Results:

  • [ ] Email addresses are redacted or masked
  • [ ] No PII stored in plain text
  • [ ] Query is truncated if too long (>500 chars)
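To make the expected results concrete, here is a minimal sketch of the kind of sanitization this test looks for. The function name, the redaction placeholder, and the email pattern are assumptions for illustration; the real telemetry sanitizer may mask values differently.

```typescript
// Threshold taken from the expected results above.
const MAX_QUERY_LENGTH = 500;

// Deliberately simple email pattern (assumption); good enough for a smoke check.
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

// Hypothetical sanitizer: redact emails first, then truncate long queries.
function sanitizeQueryForTelemetry(query: string): string {
  const redacted = query.replace(EMAIL_PATTERN, "[REDACTED_EMAIL]");
  return redacted.length > MAX_QUERY_LENGTH
    ? redacted.slice(0, MAX_QUERY_LENGTH)
    : redacted;
}

const sanitizedQuery = sanitizeQueryForTelemetry(
  "My email is test@example.com, what's my balance?"
);
console.log(sanitizedQuery);
// prints: My email is [REDACTED_EMAIL], what's my balance?
```

Redacting before truncating avoids leaving half of an address at the cut-off point.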

2. Specialist Handover Tests

Test 2.1: Simple Handover

Objective: Verify handover works when query matches different specialist

Steps:

  1. Start conversation with Billing Specialist
  2. Send: "I need help with billing" (should stay with billing)
  3. Then send: "Actually, I have a technical error with the API"
  4. Observe handover to Technical Support

Expected Results:

  • [ ] Handover message displayed in user's language
  • [ ] Technical Support specialist now active
  • [ ] Original conversation context preserved
  • [ ] New specialist responds appropriately

Test 2.2: Explicit Handover Request

Objective: Verify LLM correctly uses handover tool

Steps:

  1. Chat with Technical Support
  2. Send: "I don't need technical help anymore, I want to talk about my invoice"
  3. Observe handover

Expected Results:

  • [ ] LLM calls handover_to_specialist tool
  • [ ] Handover reason is correctly extracted
  • [ ] Billing Specialist selected
  • [ ] Handover message localized correctly (EN/PT-BR)

Test 2.3: Handover Chain Prevention

Objective: Verify circular handovers are prevented

Steps:

  1. Start with Billing Specialist
  2. Ask billing questions
  3. Request handover to Technical Support
  4. Immediately ask: "I want to go back to billing"
  5. Observe behavior

Expected Results:

  • [ ] System prevents immediate bounce-back
  • [ ] Previous specialists are tracked
  • [ ] User can still access billing after some interaction
  • [ ] No infinite loop occurs
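The behavior described above can be pictured as a small state check. This is a sketch under stated assumptions: the state shape, the function names, and the two-turn cooldown are hypothetical, since the guide only says the user can return "after some interaction".

```typescript
interface HandoverState {
  currentSpecialistId: string;
  previousSpecialistIds: string[]; // most recent last
  turnsWithCurrent: number; // user turns handled since the last handover
}

// Assumed cooldown; not a documented value.
const MIN_TURNS_BEFORE_RETURN = 2;

// A handover back to a previous specialist is allowed only after the cooldown.
function canHandOver(state: HandoverState, targetId: string): boolean {
  if (targetId === state.currentSpecialistId) {
    return false;
  }
  const isReturn = state.previousSpecialistIds.indexOf(targetId) !== -1;
  return !isReturn || state.turnsWithCurrent >= MIN_TURNS_BEFORE_RETURN;
}

// Records the outgoing specialist and resets the turn counter.
function applyHandover(state: HandoverState, targetId: string): HandoverState {
  return {
    currentSpecialistId: targetId,
    previousSpecialistIds: [...state.previousSpecialistIds, state.currentSpecialistId],
    turnsWithCurrent: 0,
  };
}

const startWithBilling: HandoverState = {
  currentSpecialistId: "billing",
  previousSpecialistIds: [],
  turnsWithCurrent: 3,
};
const nowWithTechnical = applyHandover(startWithBilling, "technical-support");
console.log(canHandOver(nowWithTechnical, "billing")); // immediate bounce-back is blocked
console.log(canHandOver({ ...nowWithTechnical, turnsWithCurrent: 2 }, "billing")); // allowed after the cooldown
```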

Test 2.4: Handover with No Suitable Specialist

Objective: Verify graceful handling when no specialist matches

Steps:

  1. Start conversation
  2. Ask about something completely unrelated: "Tell me about quantum physics"
  3. Observe response

Expected Results:

  • [ ] Handover fails gracefully
  • [ ] General assistant provides response
  • [ ] Error message displayed (localized)
  • [ ] Conversation continues normally

Test 2.5: Handover Telemetry

Objective: Verify handover events are captured in telemetry

Steps:

  1. Perform a successful handover (Test 2.1)
  2. Check telemetry data

Expected Results:

  • [ ] Handover action recorded in trace
  • [ ] Source specialist ID logged
  • [ ] Target specialist ID logged
  • [ ] Handover reason captured
  • [ ] handoverOccurred: true in routing_telemetry

3. Vector Store Query Tests

Test 3.1: Basic Vector Store Query

Objective: Verify vector store queries return relevant results

Steps:

  1. Navigate to Billing Specialist (linked to Billing Documentation)
  2. Ask: "How do I get a refund?"
  3. Check response includes knowledge base content

Expected Results:

  • [ ] Response mentions 30-day refund policy
  • [ ] Content from Billing Documentation used
  • [ ] Semantic similarity score > 0.7
  • [ ] Response is coherent and relevant

Test 3.2: Multi-Vector Store Query

Objective: Verify queries across multiple vector stores

Steps:

  1. Use a specialist linked to multiple knowledge bases
  2. Ask a question spanning multiple topics
  3. Check that the response includes content from each linked knowledge base

Expected Results:

  • [ ] Results from multiple vector stores
  • [ ] Re-ranking applied correctly
  • [ ] Most relevant chunks prioritized
  • [ ] No duplicate content
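The "no duplicate content" expectation can be illustrated with a small merge step. This is a sketch only: the `StoreChunk` shape and the normalization rule (trimmed, lower-cased text as the duplicate key) are assumptions, and the actual retrieval code may identify duplicates differently.

```typescript
interface StoreChunk {
  storeId: string;
  content: string;
  score: number;
}

// Merges per-store results, keeping the highest-scoring copy of each text.
function mergeAndDeduplicate(resultsPerStore: StoreChunk[][]): StoreChunk[] {
  const bestByContent = new Map<string, StoreChunk>();
  for (const results of resultsPerStore) {
    for (const chunk of results) {
      const key = chunk.content.trim().toLowerCase();
      const existing = bestByContent.get(key);
      if (!existing || chunk.score > existing.score) {
        bestByContent.set(key, chunk);
      }
    }
  }
  return Array.from(bestByContent.values()).sort((a, b) => b.score - a.score);
}

// Made-up sample data: the same sentence appears in two stores.
const mergedChunks = mergeAndDeduplicate([
  [{ storeId: "billing-docs", content: "Refunds are available.", score: 0.82 }],
  [
    { storeId: "support-docs", content: "refunds are available. ", score: 0.91 },
    { storeId: "support-docs", content: "Restart the API client.", score: 0.75 },
  ],
]);
console.log(mergedChunks.map((c) => `${c.storeId}: ${c.score}`));
```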

Test 3.3: Query Performance Under Load

Objective: Verify vector queries perform well with large datasets

Prerequisite: Seed the database with --documents=100000 or higher

Steps:

  1. Measure query time for simple query
  2. Send 10 sequential queries
  3. Record average response time

Expected Results:

  • [ ] Embedding generation latency < 500ms
  • [ ] Vector search < 100ms (PostgreSQL pgvector)
  • [ ] Total response time < 3 seconds
  • [ ] No timeout errors
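After recording the ten response times by hand (for example from the browser network tab), a short helper can compute the average and check the 3-second budget. This is a sketch; `summarizeLatencies` and the sample durations are hypothetical.

```typescript
// Budget taken from "Total response time < 3 seconds" above.
const TOTAL_RESPONSE_BUDGET_MS = 3000;

interface LatencySummary {
  averageMs: number;
  maxMs: number;
  withinBudget: boolean;
}

// Summarizes recorded durations; the slowest query decides the budget check.
function summarizeLatencies(
  durationsMs: number[],
  budgetMs: number = TOTAL_RESPONSE_BUDGET_MS
): LatencySummary {
  if (durationsMs.length === 0) {
    throw new Error("No measurements recorded");
  }
  const total = durationsMs.reduce((sum, d) => sum + d, 0);
  const maxMs = Math.max(...durationsMs);
  return {
    averageMs: total / durationsMs.length,
    maxMs,
    withinBudget: maxMs < budgetMs,
  };
}

// Made-up measurements in milliseconds.
const latencySummary = summarizeLatencies([1200, 1500, 900, 2400]);
console.log(latencySummary);
```

The check uses the maximum rather than the average so that a single slow query is not hidden by fast ones.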

Test 3.4: Min Score Filtering

Objective: Verify low-relevance results are filtered

Steps:

  1. Ask a question unrelated to any knowledge base
  2. Example: "What's the weather like today?"
  3. Check that irrelevant chunks are not included

Expected Results:

  • [ ] No chunks returned with score < 0.7
  • [ ] Response generated without hallucinating content
  • [ ] System handles no-match gracefully
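The filtering rule being tested amounts to a threshold on the similarity score. A minimal sketch follows, assuming each result carries a numeric `score` field; the function name is hypothetical.

```typescript
// Threshold taken from the expected results above.
const MIN_SCORE = 0.7;

// Keeps only results whose score meets the threshold.
function filterByMinScore<T extends { score: number }>(
  chunks: T[],
  minScore: number = MIN_SCORE
): T[] {
  return chunks.filter((chunk) => chunk.score >= minScore);
}

// Made-up scores for an off-topic question: nothing should survive.
const weatherResults = filterByMinScore([
  { id: "chunk-1", score: 0.41 },
  { id: "chunk-2", score: 0.55 },
]);
console.log(weatherResults.length); // 0: the LLM gets no knowledge base context
```

An empty result is the "no-match" case; the response should then be generated without retrieved context rather than from weak matches.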

Test 3.5: Re-Ranking Algorithm

Objective: Verify multi-factor re-ranking works correctly

Steps:

  1. Ask question with multiple relevant chunks
  2. Check ordering of results
  3. Verify that recent chunks and chunks from early in a document are ranked appropriately

Expected Results:

  • [ ] Semantic score weighted 65%
  • [ ] Vector store order weighted 15%
  • [ ] Chunk position weighted 10%
  • [ ] Recency weighted 10%
  • [ ] Final ranking is sensible
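The weights above can be read as a weighted sum. The sketch below assumes each factor has already been normalized to the range 0 to 1, which the guide does not specify; the field names and sample values are hypothetical.

```typescript
interface RankableChunk {
  id: string;
  semanticScore: number; // similarity in [0, 1]
  storeOrderScore: number; // 1 for the first linked vector store, lower for later ones
  positionScore: number; // 1 for the first chunk of a document, lower for later ones
  recencyScore: number; // 1 for the newest document, lower for older ones
}

// Weights taken from the expected results above; they sum to 1.
const WEIGHTS = { semantic: 0.65, storeOrder: 0.15, position: 0.1, recency: 0.1 };

function finalScore(chunk: RankableChunk): number {
  return (
    WEIGHTS.semantic * chunk.semanticScore +
    WEIGHTS.storeOrder * chunk.storeOrderScore +
    WEIGHTS.position * chunk.positionScore +
    WEIGHTS.recency * chunk.recencyScore
  );
}

// Sorts a copy, highest final score first.
function rerank(chunks: RankableChunk[]): RankableChunk[] {
  return [...chunks].sort((a, b) => finalScore(b) - finalScore(a));
}

const chunkA: RankableChunk = {
  id: "A", semanticScore: 0.8, storeOrderScore: 1, positionScore: 1, recencyScore: 1,
}; // 0.52 + 0.15 + 0.10 + 0.10, about 0.87
const chunkB: RankableChunk = {
  id: "B", semanticScore: 0.9, storeOrderScore: 0.5, positionScore: 0.2, recencyScore: 0,
}; // 0.585 + 0.075 + 0.02 + 0, about 0.68

console.log(rerank([chunkB, chunkA]).map((c) => c.id));
```

In this example chunk A outranks chunk B despite a lower semantic score, which is the kind of ordering change to look for when checking results.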

4. Tool Call Tests

Test 4.1: REST API Tool - GET Request

Objective: Verify REST GET tool execution

Steps:

  1. Navigate to Technical Support (has Weather API)
  2. Ask: "What's the weather in London?"
  3. Observe tool execution

Expected Results:

  • [ ] Tool get_current_weather called
  • [ ] GET request sent to API
  • [ ] Response parsed correctly
  • [ ] Weather info displayed to user

Test 4.2: REST API Tool - POST Request

Objective: Verify REST POST tool execution

Steps:

  1. Navigate to specialist with Task Manager plugin
  2. Ask: "Create a task called 'Test task from LangChain'"
  3. Observe tool execution

Expected Results:

  • [ ] Tool create_task called
  • [ ] POST request sent with correct body
  • [ ] Response confirms creation
  • [ ] Task info returned to user

Test 4.3: Builtin Tool Execution

Objective: Verify builtin (stdio) tools work

Steps:

  1. Navigate to specialist with Calculator plugin
  2. Ask: "Calculate 123 * 456 + 789"
  3. Observe tool execution

Expected Results:

  • [ ] Tool calculate called
  • [ ] Expression parsed correctly
  • [ ] Result: 56877 returned
  • [ ] User sees the calculation result

Test 4.4: Multiple Tool Calls in Sequence

Objective: Verify multiple tools can be called in one conversation

Steps:

  1. Ask: "What's 25 * 4?"
  2. Wait for response
  3. Ask: "Now multiply that result by 2"
  4. Observe both tool calls

Expected Results:

  • [ ] First tool returns 100
  • [ ] Second tool returns 200
  • [ ] Context maintained between calls
  • [ ] Both results displayed correctly

Test 4.5: Tool Error Handling

Objective: Verify tool errors are handled gracefully

Steps:

  1. Ask: "Get weather for 'invalid_city_xyz_123'"
  2. Observe error handling

Expected Results:

  • [ ] API error captured
  • [ ] Error message in ToolMessage
  • [ ] LLM provides helpful response
  • [ ] Conversation continues normally

Test 4.6: Tool Iteration Limit

Objective: Verify tool loop doesn't run infinitely

Steps:

  1. Craft a query that might cause repeated tool calls
  2. Example: "Keep calculating until you reach infinity"
  3. Observe iteration limit behavior

Expected Results:

  • [ ] Maximum 10 iterations (default)
  • [ ] iterationLimitReached: true in response
  • [ ] User receives partial response
  • [ ] No infinite loop
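The iteration limit can be pictured as a bounded loop around the model and tool calls. The sketch below is synchronous and uses stand-in callbacks to stay self-contained; the type names, the `toolCallsMade` field, and the choice of partial response are assumptions, not the actual agent implementation.

```typescript
interface ModelStep {
  toolCall?: { name: string; input: string };
  text?: string;
}

interface LoopResult {
  response: string;
  toolCallsMade: number;
  iterationLimitReached: boolean;
}

// Default taken from the expected results above.
const DEFAULT_MAX_ITERATIONS = 10;

function runToolLoop(
  nextStep: (lastToolOutput: string | undefined) => ModelStep,
  runTool: (name: string, input: string) => string,
  maxIterations: number = DEFAULT_MAX_ITERATIONS
): LoopResult {
  let lastOutput: string | undefined;
  for (let i = 0; i < maxIterations; i++) {
    const step = nextStep(lastOutput);
    if (!step.toolCall) {
      // The model produced a final answer, so the loop ends normally.
      return { response: step.text ?? "", toolCallsMade: i, iterationLimitReached: false };
    }
    lastOutput = runTool(step.toolCall.name, step.toolCall.input);
  }
  // Limit hit: return the last tool output as a partial response.
  return { response: lastOutput ?? "", toolCallsMade: maxIterations, iterationLimitReached: true };
}

// Stand-in model that never stops asking for the calculator.
const endlessRun = runToolLoop(
  () => ({ toolCall: { name: "calculate", input: "1 + 1" } }),
  () => "2"
);
console.log(endlessRun.iterationLimitReached, endlessRun.toolCallsMade);
```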

Test 4.7: OAuth Required Tool

Objective: Verify OAuth flow is triggered when needed

Prerequisite: Have a plugin requiring OAuth configured

Steps:

  1. Try to use a tool requiring OAuth
  2. Observe OAuth prompt

Expected Results:

  • [ ] oauthRequired event triggered
  • [ ] Provider info returned
  • [ ] Auth URL provided
  • [ ] Required scopes listed

Test 4.8: Tool Telemetry

Objective: Verify tool executions appear in telemetry

Steps:

  1. Execute any tool (Test 4.1)
  2. Check telemetry trace

Expected Results:

  • [ ] ACTION node for tool execution
  • [ ] Tool name recorded
  • [ ] Input parameters logged
  • [ ] Output captured
  • [ ] Duration measured

Performance Benchmarks

Baseline Metrics (with 100k chunks)

| Operation | Expected | Acceptable | Critical |
|---|---|---|---|
| Simple query (no tools) | < 1s | < 2s | > 5s |
| Query with vector search | < 2s | < 3s | > 6s |
| Single tool call | < 3s | < 5s | > 10s |
| Multiple tool calls (3) | < 8s | < 12s | > 20s |
| Handover | < 2s | < 4s | > 8s |
| Telemetry write | < 50ms | < 100ms | > 500ms |
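When logging measurements against these baselines, a small lookup can assign a rating to each one. This is a sketch: the operation keys and the "degraded" label for values between the Acceptable and Critical columns are assumptions, since the baselines leave that band unnamed.

```typescript
type Rating = "expected" | "acceptable" | "degraded" | "critical";

interface Threshold {
  expectedMs: number;
  acceptableMs: number;
  criticalMs: number;
}

// Values copied from the baseline metrics, converted to milliseconds.
const BASELINES: Record<string, Threshold> = {
  simpleQuery: { expectedMs: 1000, acceptableMs: 2000, criticalMs: 5000 },
  vectorSearchQuery: { expectedMs: 2000, acceptableMs: 3000, criticalMs: 6000 },
  singleToolCall: { expectedMs: 3000, acceptableMs: 5000, criticalMs: 10000 },
  multipleToolCalls: { expectedMs: 8000, acceptableMs: 12000, criticalMs: 20000 },
  handover: { expectedMs: 2000, acceptableMs: 4000, criticalMs: 8000 },
  telemetryWrite: { expectedMs: 50, acceptableMs: 100, criticalMs: 500 },
};

function rateLatency(operation: string, measuredMs: number): Rating {
  const threshold = BASELINES[operation];
  if (!threshold) {
    throw new Error(`Unknown operation: ${operation}`);
  }
  if (measuredMs < threshold.expectedMs) return "expected";
  if (measuredMs < threshold.acceptableMs) return "acceptable";
  if (measuredMs > threshold.criticalMs) return "critical";
  return "degraded"; // between Acceptable and Critical (assumed label)
}

console.log(rateLatency("handover", 1500)); // expected
console.log(rateLatency("telemetryWrite", 600)); // critical
```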

Load Testing Checklist

  • [ ] 10 concurrent users: Response times < 3s
  • [ ] 50 concurrent users: Response times < 5s
  • [ ] 100 concurrent users: Response times < 10s
  • [ ] Memory usage stable over 30 minutes
  • [ ] No connection pool exhaustion
  • [ ] Telemetry data complete for all requests

Test Execution Log

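| Test ID | Date | Tester | Result | Notes |
|---|---|---|---|---|
| 1.1 | | | [ ] Pass / [ ] Fail | |
| 1.2 | | | [ ] Pass / [ ] Fail | |
| 1.3 | | | [ ] Pass / [ ] Fail | |
| 1.4 | | | [ ] Pass / [ ] Fail | |
| 2.1 | | | [ ] Pass / [ ] Fail | |
| 2.2 | | | [ ] Pass / [ ] Fail | |
| 2.3 | | | [ ] Pass / [ ] Fail | |
| 2.4 | | | [ ] Pass / [ ] Fail | |
| 2.5 | | | [ ] Pass / [ ] Fail | |
| 3.1 | | | [ ] Pass / [ ] Fail | |
| 3.2 | | | [ ] Pass / [ ] Fail | |
| 3.3 | | | [ ] Pass / [ ] Fail | |
| 3.4 | | | [ ] Pass / [ ] Fail | |
| 3.5 | | | [ ] Pass / [ ] Fail | |
| 4.1 | | | [ ] Pass / [ ] Fail | |
| 4.2 | | | [ ] Pass / [ ] Fail | |
| 4.3 | | | [ ] Pass / [ ] Fail | |
| 4.4 | | | [ ] Pass / [ ] Fail | |
| 4.5 | | | [ ] Pass / [ ] Fail | |
| 4.6 | | | [ ] Pass / [ ] Fail | |
| 4.7 | | | [ ] Pass / [ ] Fail | |
| 4.8 | | | [ ] Pass / [ ] Fail | |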

Troubleshooting

Common Issues

  1. "No enabled LLM integration found"
     • Run: pnpm ts-node src/scripts/seed-llm.ts
     • Enable an LLM integration in admin

  2. "Vector store not found"
     • Check organization ID matches
     • Verify seed script completed successfully

  3. "Tool not found"
     • Check plugin is installed for organization
     • Verify specialist has plugin associated

  4. Slow query times
     • Check PostgreSQL pgvector indexes
     • Monitor connection pool usage
     • Check embedding service availability

  5. Handover not triggering
     • Verify specialists have distinct routing metadata
     • Check routing embeddings are generated
     • Review router service logs

Useful Debug Commands

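# Check backend logs (shell)
tail -f packages/backend/logs/*.log

-- Monitor PostgreSQL queries (run in psql; requires the pg_stat_statements extension)
-- On PostgreSQL 13 and later, the column is named mean_exec_time instead of mean_time
SELECT query, calls, mean_time FROM pg_stat_statements ORDER BY mean_time DESC LIMIT 10;

-- Check vector store indexes (run in psql)
SELECT indexname, pg_size_pretty(pg_relation_size(indexname::regclass))
FROM pg_indexes WHERE tablename = 'file_chunks';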

Sign-off

| Role | Name | Date | Signature |
|---|---|---|---|
| Developer | | | |
| QA | | | |
| Product Owner | | | |