RunFullBenchmark
Comprehensive [[performance]] [[benchmark]] suite that validates all Maenifold performance claims including [[graph]] traversal ([[GRPH-009]] CTE vs N+1), [[search]] performance, [[sync]] timing, and complex traversal bottlenecks. Provides empirical validation of system performance characteristics under real workloads.
Use Cases
- Performance Validation: Verify Maenifold meets documented performance targets
- System Health Checks: Assess overall system performance before/after changes
- Regression Detection: Compare performance across versions or configurations
- Capacity Planning: Understand performance at different dataset scales
- Optimization Verification: Confirm performance improvements from code changes
- Production Readiness: Validate performance before deployment
- Debugging Performance Issues: Identify which subsystems are slow
Key Features
- Comprehensive Coverage: Tests [[graph-traversal]], [[search]], [[sync]], and complex operations
- Real Workloads: Uses actual memory files and knowledge graph data
- Statistical Rigor: Multiple iterations with averaged results
- Claim Validation: Compares results against documented performance targets
- System Health Report: Database size, concept counts, memory statistics
- Configurable: Control iterations, test file limits, and expensive tests
- Production-Safe: Read-only operations, doesn’t modify system state
Parameters
| Parameter | Type | Required | Description | Example |
|---|---|---|---|---|
| iterations | int | No | Number of test iterations per benchmark (default: 5, more = more accurate) | 10 |
| maxTestFiles | int | No | Maximum test files to use for benchmarks (default: 1000, limits scope) | 500 |
| includeDeepTraversal | bool | No | Include expensive deep traversal tests (default: true, may timeout) | false |
Usage Examples
Standard Benchmark Run
{
"iterations": 5
}
Runs the complete benchmark suite with 5 iterations per test, balancing accuracy and runtime (~2-5 minutes).
Quick Health Check
{
"iterations": 3,
"includeDeepTraversal": false
}
Faster benchmark skipping expensive deep traversal tests - useful for quick validation (~1-2 minutes).
High-Accuracy Benchmark
{
"iterations": 10,
"maxTestFiles": 2000
}
More iterations and larger dataset for statistically robust results (~5-10 minutes).
Small Dataset Testing
{
"iterations": 5,
"maxTestFiles": 500
}
Limited test file scope for smaller knowledge bases or faster execution.
Benchmark Suite Components
1. Graph Traversal Benchmark (GRPH-009)
Tests: CTE (Common Table Expression) vs N+1 query patterns
Claims: CTE recursive queries should outperform naive N+1 approaches
Measures: Query execution time for concept relationship traversal
Validates: [[graph-database]] optimization effectiveness
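The difference between the two query patterns is easiest to see in code. The sketch below, in Python with sqlite3, contrasts a single recursive CTE with the naive one-query-per-node approach; the `concept_links(source, target)` table and column names are assumptions for illustration, not the actual Maenifold schema.

```python
import sqlite3

def traverse_cte(conn, root, max_depth=3):
    """Single recursive CTE: the whole traversal runs in one round trip."""
    rows = conn.execute("""
        WITH RECURSIVE reachable(id, depth) AS (
            SELECT ?, 0
            UNION
            SELECT cl.target, r.depth + 1
            FROM concept_links cl
            JOIN reachable r ON cl.source = r.id
            WHERE r.depth < ?
        )
        SELECT DISTINCT id FROM reachable
    """, (root, max_depth)).fetchall()
    return {r[0] for r in rows}

def traverse_n_plus_one(conn, root, max_depth=3):
    """Naive N+1 pattern: one query per node at each level of the traversal."""
    seen, frontier = {root}, [root]
    for _ in range(max_depth):
        next_frontier = []
        for node in frontier:  # a separate round trip for every node
            for (target,) in conn.execute(
                    "SELECT target FROM concept_links WHERE source = ?", (node,)):
                if target not in seen:
                    seen.add(target)
                    next_frontier.append(target)
        frontier = next_frontier
    return seen
```

The CTE version pushes the entire traversal into one query, which is where the measured speedup over the per-node round trips comes from.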
2. Search Performance Benchmark
Tests: Three search modes - Hybrid, Semantic, Full-text
Claims:
- Hybrid: ~33ms average
- Semantic: ~116ms average
- Full-text: ~47ms average
Test Queries:
- “machine learning algorithms”
- “performance optimization”
- “graph database”
- “vector embeddings”
- “concept relationships”
Validates: [[vector-search]] and [[full-text-search]] performance
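For context, the measurement approach amounts to timing each mode over the five test queries and averaging the results. The sketch below assumes a hypothetical `run_search(mode, query)` callable standing in for the actual search invocation (for example, via SearchMemories); it is not the benchmark's real code.

```python
import time
import statistics

TEST_QUERIES = [
    "machine learning algorithms",
    "performance optimization",
    "graph database",
    "vector embeddings",
    "concept relationships",
]

def benchmark_mode(run_search, mode, iterations=5):
    """Average latency in ms for one search mode over iterations x queries."""
    timings_ms = []
    for _ in range(iterations):
        for query in TEST_QUERIES:
            start = time.perf_counter()
            run_search(mode, query)  # hypothetical stand-in for the real search call
            timings_ms.append((time.perf_counter() - start) * 1000)
    return statistics.mean(timings_ms)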
3. Sync Performance Benchmark
Tests: Knowledge graph synchronization speed
Claims: ~27 seconds for 2,500 files
Measures: Time to scan files, extract [[WikiLinks]], generate [[embeddings]], update database
Validates: [[sync]] operation scalability
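As a rough rule of thumb, the claim works out to about 93 files per second. Assuming sync time scales roughly linearly with file count (an assumption, not a documented guarantee), you can estimate how long a differently sized corpus might take:

```python
# Claimed rate: ~27 seconds for 2,500 files.
CLAIMED_FILES, CLAIMED_SECONDS = 2500, 27
FILES_PER_SECOND = CLAIMED_FILES / CLAIMED_SECONDS   # ~92.6 files/s

def estimate_sync_seconds(file_count):
    """Linear extrapolation from the claimed rate (rough estimate only)."""
    return file_count / FILES_PER_SECOND

print(round(estimate_sync_seconds(10_000)))  # ~108 seconds
```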
4. Complex Traversal Benchmark (Optional)
Tests: Deep [[graph]] traversal with multiple hops
Claims: Handles deep concept relationships efficiently
Measures: Multi-level BuildContext operations
Validates: Performance under complex query patterns
Note: ⚠️ Can timeout with large graphs - use includeDeepTraversal=false to skip
Output Structure
MAENIFOLD PERFORMANCE BENCHMARK SUITE
=====================================
Iterations: 5, Max Test Files: 1000
Started: 2024-10-24 18:45:00
Search Performance Benchmark
Claims: Hybrid 33ms, Semantic 116ms, Full-text 47ms
Test dataset: 2,458 files
Results (25 iterations each):
Hybrid Average: 34.2ms (claim: 33ms) ✓
Semantic Average: 118.7ms (claim: 116ms) ✓
Full-text Average: 45.3ms (claim: 47ms) ✓
Sync Performance Benchmark
Claim: 27s for 2,500 files
Test dataset: 2,458 files
Results (3 iterations):
Sync Average: 28.3s (claim: 27s) ✓
Files per second: 86.8
Graph Traversal Results
Graph Traversal Benchmark (GRPH-009)
CTE vs N+1 Query Performance
CTE Average: 12.4ms
N+1 Average: 247.8ms
Performance Improvement: 19.9x faster ✓
System Health Report
System Health Report
Database Size: 145.2 MB
Concept Count: 12,847
Memory Files: 2,458
Total WikiLinks: 47,392
Average Concepts per File: 19.3
Completed: 2024-10-24 18:52:15
Total Duration: 7m 15s
✓ Passing Criteria
Results within ±20% of claimed performance indicate system health:
- Hybrid search: 26-40ms (claim: 33ms)
- Semantic search: 93-140ms (claim: 116ms)
- Full-text search: 38-56ms (claim: 47ms)
- Sync: 22-32s for 2,500 files (claim: 27s)
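A quick way to apply these bands to your own numbers is sketched below; the claim values are the ones documented on this page, and the 20%/50% thresholds mirror the pass/warn/fail criteria described in this section.

```python
# Claimed search latencies (ms) from this page.
CLAIMS_MS = {"hybrid": 33, "semantic": 116, "fulltext": 47}

def classify(name, measured_ms):
    """Classify a measured latency against the documented tolerance bands."""
    ratio = measured_ms / CLAIMS_MS[name]
    if ratio <= 1.20:   # within +20% of the claim: healthy
        return "pass"
    if ratio <= 1.50:   # 20-50% slower: investigate
        return "warn"
    return "fail"       # >50% slower: serious issue

print(classify("hybrid", 34.2))    # pass
print(classify("semantic", 150))   # warn (~29% slower)
print(classify("fulltext", 80))    # fail (~70% slower)
```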
⚠️ Warning Signs
Results 20-50% slower than the claims warrant investigation:
- Database fragmentation (run VACUUM)
- Insufficient memory/CPU resources
- Disk I/O bottlenecks
- Large file size bloat
🚨 Failure Indicators
Results >50% slower indicate serious issues:
- Corrupted indices (rebuild database)
- System resource exhaustion
- Incorrect configuration
- Database locking contention
Common Patterns
Pre-Release Validation
# Before releasing new version
RunFullBenchmark iterations=10 includeDeepTraversal=true
# Verify all benchmarks pass
# Document any performance changes in CHANGELOG
Post-Optimization Verification
# Before optimization
RunFullBenchmark iterations=5 > before.txt
# Apply optimization changes
# After optimization
RunFullBenchmark iterations=5 > after.txt
# Compare results
diff before.txt after.txt
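A plain `diff` works, but if you want the relative change per benchmark, a small parser is enough. The sketch below assumes the "<Name> Average: <value>ms" line format shown in Output Structure above; the file names are simply whatever you redirected the two runs to.

```python
import re
import sys

# Matches lines like "Hybrid Average: 34.2ms ..." or "Sync Average: 28.3s ..."
LINE = re.compile(r"^\s*(.+?) Average:\s*([\d.]+)\s*(ms|s)\b")

def parse(path):
    results = {}
    with open(path) as f:
        for line in f:
            m = LINE.match(line)
            if m:
                name, value, unit = m.group(1), float(m.group(2)), m.group(3)
                results[name] = value * (1000 if unit == "s" else 1)  # normalize to ms
    return results

before, after = parse(sys.argv[1]), parse(sys.argv[2])
for name in sorted(before.keys() & after.keys()):
    change = (after[name] - before[name]) / before[name] * 100
    print(f"{name}: {before[name]:.1f}ms -> {after[name]:.1f}ms ({change:+.1f}%)")
```

Run it as `python compare_benchmarks.py before.txt after.txt` (the script name is arbitrary).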
Continuous Integration
# In CI pipeline
RunFullBenchmark iterations=3 maxTestFiles=500 includeDeepTraversal=false
# Fast validation that catches major regressions
# Full benchmark in nightly builds
Performance Debugging
# Identify slow subsystem
RunFullBenchmark iterations=10
# Review which benchmark fails or performs poorly
# Focus optimization efforts on that subsystem
Related Tools
- Sync: Tested by sync performance benchmark - rebuild [[graph]] if needed
- SearchMemories: Tested by search performance benchmark - three search modes validated
- BuildContext: Tested by graph traversal benchmark - CTE optimization validated
- MemoryStatus: Compare with system health report for consistency checks
- GetConfig: Review configuration settings that affect performance
Troubleshooting
Error: “Insufficient test data (X files). Need at least 100 files for meaningful benchmarks”
Cause: Not enough memory files to produce statistically valid results
Solution:
- Add more test data (use WriteMemory to create sample memories; a generation sketch follows this list)
- Or accept that benchmarks on small datasets aren’t representative
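If you just need enough fixture data to clear the 100-file minimum, a sketch like the one below can generate simple markdown memories containing [[WikiLinks]]. The target directory is an assumption; point it at wherever your memory files actually live, or create the memories through WriteMemory instead.

```python
from pathlib import Path

MEMORY_DIR = Path("./memories/benchmark-samples")   # hypothetical location
MEMORY_DIR.mkdir(parents=True, exist_ok=True)

CONCEPTS = ["graph-database", "vector-search", "performance", "sync", "embeddings"]

for i in range(120):  # comfortably above the 100-file minimum
    # Vary how many WikiLinks each file contains so the graph is not uniform.
    links = " ".join(f"[[{c}]]" for c in CONCEPTS[: (i % len(CONCEPTS)) + 1])
    (MEMORY_DIR / f"sample-{i:03}.md").write_text(
        f"# Sample memory {i}\n\nBenchmark fixture referencing {links}.\n"
    )
```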
Benchmark Timeout (Deep Traversal)
Cause: Complex traversal tests can take >5 minutes with large graphs
Solution: Run with includeDeepTraversal=false to skip expensive tests
Results Much Slower Than Claims
Cause: Database fragmentation, resource constraints, or system issues
Solution:
- Run Sync to rebuild indices
- Run SQLite VACUUM to defragment the database (a sketch follows this list)
- Check system resources (CPU, memory, disk I/O)
- Review Config settings for performance tuning
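VACUUM can be run directly against the SQLite file. The sketch below uses Python's sqlite3 module; the database path is an assumption, so substitute the actual Maenifold database location and make sure nothing else is writing to it while it runs.

```python
import sqlite3

DB_PATH = "path/to/maenifold.db"   # hypothetical path; use the real database file

# Autocommit mode: VACUUM cannot run inside an open transaction.
conn = sqlite3.connect(DB_PATH, isolation_level=None)
conn.execute("VACUUM")             # rebuilds the file and reclaims free pages
conn.close()
```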
Results Highly Variable Between Runs
Cause: System resource contention or insufficient iterations
Solution:
- Increase iterations parameter (10-20 for stable results)
- Close other applications during benchmark
- Run on dedicated system without background load
“BENCHMARK FAILED” Error
Cause: Exception during benchmark execution
Solution: Check error message details - may indicate database corruption, missing files, or configuration issues
System Requirements for Valid Benchmarks
Minimum Dataset Size
- 100+ files: Required for search benchmarks
- 1,000+ files: Recommended for realistic results
- 2,500+ files: Matches performance claims benchmark conditions
Hardware Considerations
- RAM: 4GB+ available (vector operations are memory-intensive)
- Storage: SSD recommended (HDD will be much slower)
- CPU: Multi-core benefits sync and search operations
Database Health
- Run Sync before benchmarking to ensure indices are current
- Consider VACUUM if database heavily modified
- Verify no database locks or contention
Interpreting Results
When Results Match Claims
System is performing as designed - all optimizations working correctly.
When Search Slower Than Expected
- Check vector extension loaded correctly
- Verify [[ONNX]] model loaded (first semantic search has ~400ms overhead)
- Review database indices with EXPLAIN QUERY PLAN
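A minimal way to check a query's plan is sketched below; the database path and the `concept_links` table are assumptions standing in for the real schema. A healthy plan shows `SEARCH ... USING INDEX` rather than a full-table `SCAN`.

```python
import sqlite3

conn = sqlite3.connect("path/to/maenifold.db")  # hypothetical path
for row in conn.execute(
        "EXPLAIN QUERY PLAN SELECT target FROM concept_links WHERE source = ?",
        ("performance",)):
    print(row)  # look for SEARCH ... USING INDEX in the detail column
conn.close()
```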
When Sync Slower Than Expected
- Large files slow sync (Ma Protocol recommends <250 lines)
- Many [[WikiLinks]] per file increases processing
- Vector embedding generation is CPU-intensive
When Graph Traversal Slow
- Deep traversal (depth >3) becomes expensive
- Highly connected concepts create large result sets
- CTE vs N+1 improvement validates optimization working
Benchmark Best Practices
DO
- Run multiple iterations (5-10) for stable results
- Use consistent dataset between runs for comparison
- Document system state (CPU load, other processes)
- Run Sync before benchmarking to ensure fresh indices
DON’T
- Run benchmarks with active writes (skews results)
- Compare results across different dataset sizes
- Rely on single iteration (too much variance)
- Ignore system health report warnings
Example Benchmark Session
Step 1: Prepare System
{
"tool": "Sync"
}
Ensure database indices are current.
Step 2: Run Benchmark
{
"iterations": 10,
"maxTestFiles": 1000,
"includeDeepTraversal": true
}
Step 3: Review Results
Check each benchmark against claims:
- ✓ Search: All modes within expected range
- ✓ Sync: 28.3s for 2,458 files (expected ~27s for 2,500)
- ✓ Graph: CTE 19.9x faster than N+1
- ✓ System Health: No warnings
Step 4: Document
Save benchmark output for historical comparison:
RunFullBenchmark > benchmark-v1.2.3.txt
Ma Protocol Compliance
RunFullBenchmark follows Maenifold’s Ma Protocol principles:
- Real Testing: Uses actual memory files and live database, no mocks
- Transparent Metrics: Reports exact timings and compares to documented claims
- No Magic: Standard Stopwatch timing with clear measurement points
- Single Responsibility: Measures performance, doesn’t modify system
- Statistical Rigor: Multiple iterations with averages, not single-shot results
- Production-Safe: Read-only operations, safe to run on live systems
This tool embodies Ma Protocol’s commitment to measurement over assumption - every performance claim is validated empirically with real data, ensuring the system performs as documented.