Uses new Rust 2024

0.1.0 Feb 11, 2026

MIT license

APR Model QA Playbook

Property-Based Model Qualification Testing for HuggingFace Models


Philosophy

This framework synthesizes two complementary quality paradigms:

Toyota Production System (TPS)

"Stop the line. Fix it now. Never pass a defect to the next process." — Taiichi Ohno

Principle        Application
Jidoka           Execution halts on first P0 failure
Poka-Yoke        Schema validation prevents malformed playbooks
Genchi Genbutsu  All metrics from actual inference
Heijunka         Load-balanced parallel execution
Kaizen           Continuous refinement via mutation testing
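
The Jidoka row can be illustrated with a minimal sketch. The `Priority`, `Check`, and `run_with_jidoka` names below are hypothetical and are not taken from the crate; the sketch only assumes that checks carry a priority and that a failing P0 check stops all further evaluation.

```rust
// Hypothetical types for illustration only; not the crate's actual API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Priority {
    P0,
    P1,
}

#[derive(Debug, Clone)]
struct Check {
    name: &'static str,
    priority: Priority,
    passed: bool,
}

/// Evaluates checks in order and stops at the first failing P0 check.
/// Returns the names of the checks that were evaluated and, if the line
/// was stopped, the name of the check that stopped it.
fn run_with_jidoka(checks: &[Check]) -> (Vec<&'static str>, Option<&'static str>) {
    let mut evaluated = Vec::new();
    for check in checks {
        evaluated.push(check.name);
        if !check.passed && check.priority == Priority::P0 {
            return (evaluated, Some(check.name));
        }
    }
    (evaluated, None)
}

fn main() {
    let checks = [
        Check { name: "load_model", priority: Priority::P0, passed: true },
        Check { name: "latency_budget", priority: Priority::P1, passed: false },
        Check { name: "basic_inference", priority: Priority::P0, passed: false },
        Check { name: "chat_template", priority: Priority::P1, passed: true },
    ];
    let (evaluated, halted_on) = run_with_jidoka(&checks);
    println!("evaluated: {:?}", evaluated);
    println!("halted on: {:?}", halted_on);
}
```

In this sketch a failing P1 check is recorded but does not stop the run; only a failing P0 check does, so `chat_template` is never evaluated.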

Popperian Falsificationism

"The criterion of the scientific status of a theory is its falsifiability." — Karl Popper

We don't test to pass—we test to fail. No amount of passing tests proves correctness, but a single failure proves a defect.

Outcome       Meaning
Corroborated  Hypothesis survived refutation attempt
Falsified     Hypothesis refuted by evidence
Timeout       Execution exceeded time limit
Crashed       Process terminated abnormally
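
As a rough sketch, the four outcomes map naturally onto a Rust enum. The `RawRun` struct and the `classify` function are hypothetical, and the rule that a missing or non-zero exit code counts as a crash is an assumption made for illustration, not something the README states.

```rust
use std::time::Duration;

/// The four outcomes from the table above.
#[derive(Debug, PartialEq, Eq)]
enum Outcome {
    Corroborated,
    Falsified,
    Timeout,
    Crashed,
}

/// Hypothetical raw result of one scenario run.
struct RawRun {
    /// `None` models a process killed by a signal (assumption).
    exit_code: Option<i32>,
    elapsed: Duration,
    oracle_passed: bool,
}

/// Classifies a run: the time limit is checked first, then abnormal exit,
/// then the oracle verdict.
fn classify(run: &RawRun, limit: Duration) -> Outcome {
    if run.elapsed > limit {
        return Outcome::Timeout;
    }
    match run.exit_code {
        Some(0) if run.oracle_passed => Outcome::Corroborated,
        Some(0) => Outcome::Falsified,
        _ => Outcome::Crashed,
    }
}

fn main() {
    let limit = Duration::from_secs(30);
    let run = RawRun {
        exit_code: Some(0),
        elapsed: Duration::from_secs(2),
        oracle_passed: false,
    };
    println!("{:?}", classify(&run, limit));
}
```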

Features

  • Property-based testing via proptest for comprehensive scenario generation
  • Parallel execution with Rayon worker pools
  • Gateway checks (G0-G4) that zero the score on critical failures
  • Model Qualification Score (MQS) 0-1000 with grade mapping
  • JUnit XML and HTML reports for CI/CD integration
  • Playbook YAML format with JSON Schema validation
  • 1.8M+ test assertions across all model/format/backend combinations
  • 217 falsification gates across conversion, inference, patterns, and security domains

New in v2.0.0

Feature                       Description
Two-Tier Certification        MVP (≤10 min, Grade B) and Full (≤1 hr, Grade A+) tiers
Tier-Aware Scoring            score_from_tier(), status_from_tier(), grade_from_tier()
Certify CLI Command           apr-qa certify --family qwen-coder --tier mvp
Rosetta Differential Testing  Tensor layout mismatch, token comparison, fingerprint, stats validation
Profile CI Mode               Performance assertions for CI/CD (--assert-throughput, --assert-p99)
Trace Payload Mode            Real forward pass with NaN/Inf and garbage output detection
Bug Pattern Detection         12 cross-project patterns from aprender/realizar analysis
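
The NaN/Inf detection mentioned for Trace Payload Mode could look roughly like the sketch below. The function name `first_non_finite` and the idea of scanning a flat slice of `f32` logits are assumptions for illustration; the actual trace format is not described here.

```rust
/// Returns the index of the first NaN or infinite value, if any.
/// Hypothetical helper; the real trace payload layout may differ.
fn first_non_finite(values: &[f32]) -> Option<usize> {
    values.iter().position(|v| !v.is_finite())
}

fn main() {
    let logits = [0.12_f32, -3.4, f32::NAN, 7.0];
    match first_non_finite(&logits) {
        Some(i) => println!("non-finite value at index {}", i),
        None => println!("all values finite"),
    }
}
```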

Model Certifications

Certification Summary (updated: 2026-03-02 10:15 UTC)

Status Count
Certified 95/95
Provisional 0/95
Blocked 0/95
Pending 0/95

Priority Family: Qwen Coder (see Certified Testing Spec)

Model Family Size Status MQS Grade G1-4 Prov GGUF CPU GGUF GPU APR CPU APR GPU ST CPU ST GPU
bloom-560m bloom 560M certified 1000 A+ - - - - - -
bloomz-560m bloom 560M certified 1000 A+ - - - - - -
deepseek-coder-1.3b-instruct deepseek-coder 1.3B certified 1000 A+ - - - - - -
deepseek-coder-6.7b-instruct deepseek-coder 6.7B certified 1000 A+ - - - - - -
deepseek-coder-7b-instruct deepseek-coder 7B certified 1000 A+ - - - - - -
deepseek-coder-33b-instruct deepseek-coder 33B certified 1000 A+ - - - - - -
DeepSeek-Coder-V2-Lite-Instruct deepseek-coder-v2 16B certified 1000 A+ - - - - - -
DeepSeek-R1-Distill-Qwen-1.5B deepseek-r1 1.5B certified 1000 A - - - - - -
DeepSeek-R1-Distill-Qwen-7B deepseek-r1 7B certified 1000 A+ - - - - - -
DeepSeek-R1-Distill-Llama-8B deepseek-r1 8B certified 1000 A+ - - - - - -
DeepSeek-R1-Distill-Qwen-14B deepseek-r1 14B certified 1000 A+ - - - - - -
DeepSeek-R1-Distill-Qwen-32B deepseek-r1 32B certified 1000 A+ - - - - - -
DeepSeek-R1-Distill-Llama-70B deepseek-r1 70B certified 1000 A+ - - - - - -
dolphin-2.6-mistral-7b dolphin 7B certified 1000 A+ - - - - - -
Dolphin3.0-Llama3.1-8B dolphin 8B certified 1000 A+ - - - - - -
falcon-7b-instruct falcon 7B certified 1000 A+ - - - - - -
falcon-40b falcon 40B certified 1000 A+ - - - - - -
Falcon-H1-Tiny-90M-Instruct falcon-h1 90M certified 1000 A+ - - - - - -
Falcon-H1-0.5B-Instruct falcon-h1 0.5B certified 1000 A+ - - - - - -
tiny_starcoder_py gpt-bigcode 164M certified 1000 A+ - - - - - -
gpt-neo-125m gpt-neo 125M certified 1000 A+ - - - - - -
pythia-410m-deduped gpt-neox 410M certified 1000 A+ - - - - - -
pythia-160m gpt-neox 160M certified 1000 A+ - - - - - -
pythia-70m gpt-neox 70M certified 1000 A+ - - - - - -
distilgpt2 gpt2 82M certified 1000 A+ - - - - - -
gpt2 gpt2 124M certified 1000 A+ - - - - - -
gpt2-large gpt2 774M certified 1000 A+ - - - - - -
gpt2-medium gpt2 355M certified 1000 A+ - - - - - -
granite-3.1-2b-instruct granite 2B certified 1000 A+ - - - - - -
granite-3.1-8b-instruct granite 8B certified 1000 A+ - - - - - -
granite-3b-code-instruct-128k granite-code 3B certified 1000 A+ - - - - - -
Hermes-3-Llama-3.1-8B hermes 8B certified 1000 A+ - - - - - -
internlm2_5-7b-chat internlm 7B certified 1000 A+ - - - - - -
internlm2_5-20b-chat internlm 20B certified 1000 A+ - - - - - -
mamba-130m-hf mamba 130M certified 1000 A+ - - - - - -
mamba2-130m-hf mamba2 130M certified 1000 A+ - - - - - -
Mistral-7B-Instruct-v0.3 mistral 7B certified 1000 A+ - - - - - -
Mistral-Nemo-Instruct-2407 mistral 12B certified 1000 A+ - - - - - -
Mistral-Small-24B-Instruct-2501 mistral 24B certified 1000 A+ - - - - - -
Codestral-22B-v0.1 mistral-code 22B certified 1000 A+ - - - - - -
Llama-3.1-Nemotron-Nano-4B-v1.1 nemotron 4B certified 1000 A+ - - - - - -
Llama-3.1-Nemotron-70B-Instruct-HF nemotron 70B certified 1000 A+ - - - - - -
OLMo-2-1124-7B-Instruct olmo 7B certified 1000 A+ - - - - - -
OLMo-2-1124-13B-Instruct olmo 13B certified 1000 A+ - - - - - -
openchat-3.5-0106 openchat 7B certified 1000 A+ - - - - - -
OpenHermes-2.5-Mistral-7B openhermes 7B certified 1000 A+ - - - - - -
galactica-125m opt 125M certified 1000 A+ - - - - - -
phi-1_5 phi 1.5B certified 1000 A+ - - - - - -
Phi-3-mini-4k-instruct phi 3.8B certified 1000 A+ - - - - - -
Phi-3.5-mini-instruct phi 3.8B certified 1000 A+ - - - - - -
Phi-3-small-8k-instruct phi 7B certified 1000 A+ - - - - - -
Phi-3-medium-4k-instruct phi 14B certified 1000 A+ - - - - - -
Phi-4-mini-instruct phi4 3.8B certified 1000 A+ - - - - - -
Qwen2.5-0.5B-Instruct qwen 0.5B certified 1000 A+ - - - - - -
Qwen2.5-1.5B-Instruct qwen 1.5B certified 1000 A - - - - - -
Qwen2.5-3B-Instruct qwen 3B certified 964 A - - - - - -
Qwen2.5-7B-Instruct qwen 7B certified 900 B - - - - - -
Qwen2.5-14B-Instruct qwen 14B certified 1000 A+ - - - - - -
Qwen2.5-32B-Instruct qwen 32B certified 1000 A+ - - - - - -
QwQ-32B qwen 32B certified 1000 A+ - - - - - -
Qwen2.5-72B-Instruct qwen 72B certified 1000 A+ - - - - - -
Qwen2.5-Coder-0.5B-Instruct qwen-coder 0.5B certified 1000 A+ - - - - - -
Qwen2.5-Coder-1.5B-Instruct qwen-coder 1.5B certified 1000 A+ - - - - - -
Qwen2.5-Coder-3B-Instruct qwen-coder 3B certified 1000 A+ - - - - - -
Qwen2.5-Coder-7B-Instruct qwen-coder 7B certified 1000 A+ - - - - - -
Qwen2.5-Coder-14B-Instruct qwen-coder 14B certified 1000 A+ - - - - - -
Qwen2.5-Coder-32B-Instruct qwen-coder 32B certified 1000 A+ - - - - - -
Qwen2-0.5B-Instruct qwen2 0.5B certified 1000 A+ - - - - - -
Qwen3-0.6B qwen3 0.6B certified 1000 A+ - - - - - -
Qwen3-1.7B qwen3 1.7B certified 964 A - - - - - -
Qwen3-4B qwen3 4B certified 1000 A+ - - - - - -
Qwen3-8B qwen3 8B certified 1000 A+ - - - - - -
Qwen3-14B qwen3 14B certified 1000 A+ - - - - - -
Qwen3-32B qwen3 32B certified 1000 A+ - - - - - -
Qwen3-Coder-30B-A3B-Instruct qwen3-coder-moe 30B certified 1000 A+ - - - - - -
Qwen3-30B-A3B qwen3-moe 30B certified 1000 A+ - - - - - -
Qwen3-Coder-Next qwen3-next 3B certified 1000 A+ - - - - - -
SmolLM2-135M-Instruct smollm 135M certified 925 B - - - - - -
SmolLM2-360M-Instruct smollm 360M certified 925 B - - - - - -
SmolLM-135M smollm 135M certified 1000 A+ - - - - - -
SmolLM-360M smollm 360M certified 1000 A+ - - - - - -
SmolLM2-1.7B-Instruct smollm 1.7B certified 925 B - - - - - -
SmolLM2-135M smollm2 135M certified 1000 A+ - - - - - -
SmolLM2-360M smollm2 360M certified 1000 A+ - - - - - -
stablelm-2-zephyr-1_6b stablelm 1.6B certified 1000 A+ - - - - - -
stablelm-zephyr-3b stablelm 3B certified 1000 A+ - - - - - -
starcoder2-3b starcoder2 3B certified 1000 A+ - - - - - -
starcoder2-7b starcoder2 7B certified 1000 A+ - - - - - -
starcoder2-15b starcoder2 15B certified 1000 A+ - - - - - -
TinyLlama-1.1B-Chat-v1.0 tinyllama 1.1B certified 1000 A - - - - - -
WizardCoder-33B-V1.1 wizardcoder 33B certified 1000 A+ - - - - - -
Yi-1.5-6B-Chat yi 6B certified 1000 A+ - - - - - -
Yi-1.5-9B-Chat yi 9B certified 1000 A+ - - - - - -
Yi-1.5-34B-Chat yi 34B certified 1000 A+ - - - - - -
zephyr-7b-beta zephyr 7B certified 1000 A+ - - - - - -

Quick Start

# Build all crates
make build

# Run all tests
make test

# Generate coverage report
make coverage

# Certify models (recommended)
cargo run --bin apr-qa -- certify --family qwen-coder --tier mvp

# Run a specific playbook
cargo run --bin apr-qa -- run playbooks/models/qwen2.5-coder-1.5b-mvp.playbook.yaml

Certification Tiers

Tier       Time        Description                                               Pass → Grade / Status
Dim-Smoke  <30s        Dimension-only via kernel equivalence (SafeTensors, CPU)  Kernel-proven dev check
Smoke      ~1-2 min    Sanity check (minimal matrix)                             Dev feedback only
MVP        ~5-10 min   All formats × backends × modalities (18 combos)           ≥90% → B / PROVISIONAL
Quick      ~10-30 min  Dev iteration with broader coverage                       Dev feedback
Standard   ~1-2 hr     CI/CD gate                                                CI gate
Deep       ~8-24 hr    Production qualification (full matrix)                    ≥95% → A+ / CERTIFIED

# Dimensional smoke (fastest; requires kernel proof via MVP on representative model)
cargo run --bin apr-qa -- certify --kernel-class A --tier dim-smoke

# Smoke check
cargo run --bin apr-qa -- certify --family qwen-coder --tier smoke

# MVP certification (quick surface coverage)
cargo run --bin apr-qa -- certify --family qwen-coder --tier mvp

# Deep certification (production qualification)
cargo run --bin apr-qa -- certify --family qwen-coder --tier deep

Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                          APR-MODEL-QA-PLAYBOOK                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐                  │
│  │ apr-qa-gen   │    │ apr-qa-runner│    │apr-qa-report │                  │
│  │              │───▶│              │───▶│              │                  │
│  │ • proptest   │    │ • parallel   │    │ • MQS score  │                  │
│  │ • scenarios  │    │ • execution  │    │ • JUnit XML  │                  │
│  │ • oracles    │    │ • evidence   │    │ • HTML/MD    │                  │
│  │ • kernels    │    │              │    │              │                  │
│  │ • bootstrap  │    │              │    │              │                  │
│  └──────────────┘    └──────────────┘    └──────────────┘                  │
│         │                    │                    │                          │
│         └────────────────────┼────────────────────┘                          │
│                              ▼                                               │
│  ┌──────────────┐    ┌──────────────┐                                       │
│  │apr-qa-certify│    │ apr-qa-cli   │                                       │
│  │              │◀───│              │                                       │
│  │ • tier score │    │ • certify    │                                       │
│  │ • README sync│    │ • run/report │                                       │
│  │ • CSV export │    │ • Jidoka sigs│                                       │
│  └──────────────┘    └──────────────┘                                       │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Crate Structure

Crate           Purpose
apr-qa-gen      Scenario generation with proptest, oracle definitions, kernel profiles, playbook bootstrapping
apr-qa-runner   Playbook execution, differential testing, bug patterns
apr-qa-report   MQS scoring, JUnit/HTML report generation
apr-qa-certify  Two-tier certification, README sync, tier-aware scoring
apr-qa-cli      Command-line interface

Key Modules (apr-qa-runner)

Module              Purpose
executor.rs         Scenario execution engine
parallel.rs         Rayon-based parallel execution with Jidoka enforcement
playbook.rs         YAML playbook parsing and validation
conversion.rs       Format conversion testing with bug classification
differential.rs     Rosetta diff-tensors, compare-inference, profile CI
patterns.rs         Cross-project bug pattern detection (12 patterns)
contract.rs         Generic contract validation
family_contract.rs  Family YAML alignment checks
layout_contract.rs  LAYOUT-002 row-major tensor validation
integrity.rs        config.json and model integrity (G0 gateway)
provenance.rs       Git/file provenance tracking
evidence.rs         Evidence collection and serialization
oracle.rs           Oracle execution layer
command.rs          Process execution wrapper
diagnostics.rs      Debugging and diagnostic output
process.rs          Jidoka process lifecycle management

Test Matrix

The framework tests models across multiple dimensions:

Dimension     Options
Modality      run, chat, serve
Backend       cpu, gpu
Format        safetensors (ground truth), apr, gguf
Quantization  q4_k_m, q5_k_m, q8_0, f16, f32

Ground Truth: SafeTensors is the source of truth for model weights (native HuggingFace format). APR is our optimized native format. GGUF is a supported third-party format.

With 100 scenarios per combination across 100 HuggingFace models:

  • 3 modalities × 2 backends × 3 formats × 100 models × 100 scenarios = 180,000 tests
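
The size of the matrix can be checked by enumerating it. The sketch below takes the modality, backend, and format values from the table above; the `combinations` and `total_tests` helpers are hypothetical names, and quantization is left out because the formula above does not include it.

```rust
// Dimension values as listed in the test matrix table.
const MODALITIES: [&str; 3] = ["run", "chat", "serve"];
const BACKENDS: [&str; 2] = ["cpu", "gpu"];
const FORMATS: [&str; 3] = ["safetensors", "apr", "gguf"];

/// Enumerates every (modality, backend, format) combination.
fn combinations() -> Vec<(&'static str, &'static str, &'static str)> {
    let mut out = Vec::new();
    for &modality in MODALITIES.iter() {
        for &backend in BACKENDS.iter() {
            for &format in FORMATS.iter() {
                out.push((modality, backend, format));
            }
        }
    }
    out
}

/// Total test count for a given number of models and scenarios per combination.
fn total_tests(models: usize, scenarios_per_combo: usize) -> usize {
    combinations().len() * models * scenarios_per_combo
}

fn main() {
    println!("combinations per model: {}", combinations().len());
    println!("total tests: {}", total_tests(100, 100));
}
```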

MQS Scoring

The Model Qualification Score (MQS) ranges from 0 to 1000.

Gateway Checks (G0-G4)

Any gateway failure zeros the entire score:

Gateway  Check                                Failure Impact
G0       config.json matches tensor metadata  MQS = 0
G1       Model loads successfully             MQS = 0
G2       Basic inference works                MQS = 0
G3       No crashes or panics                 MQS = 0
G4       Output is not garbage                MQS = 0
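
The zeroing rule can be sketched in a few lines. The `apply_gateways` function and its `(name, passed)` input shape are hypothetical; clamping the raw score to 1000 is an assumption based on the stated 0-1000 range.

```rust
/// Applies the gateway rule: if any gateway failed, the score is zero;
/// otherwise the raw score is kept (clamped to the 0-1000 range).
/// Each entry is (gateway name, passed). Hypothetical helper.
fn apply_gateways(raw_score: u32, gateways: &[(&str, bool)]) -> u32 {
    if gateways.iter().all(|(_, passed)| *passed) {
        raw_score.min(1000)
    } else {
        0
    }
}

fn main() {
    let gateways = [
        ("G0", true),
        ("G1", true),
        ("G2", true),
        ("G3", false), // a crash or panic was observed
        ("G4", true),
    ];
    println!("MQS = {}", apply_gateways(987, &gateways));
}
```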

Tier-Aware Scoring

The scoring system uses tier-aware functions:

Tier  Pass Threshold  Score on Pass  Grade  Status
MVP   ≥90%            800            B      PROVISIONAL
Full  ≥95%            950+           A+     CERTIFIED
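
A minimal sketch of the table above follows. It does not reproduce the crate's own tier-aware functions, whose signatures are not shown in this README; `Tier`, `TierResult`, and `tier_result` are hypothetical. Two further assumptions: the Full tier returns 950 as the floor of "950+", and a run below the threshold returns `None` because the table does not say what score it receives.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tier {
    Mvp,
    Full,
}

#[derive(Debug, PartialEq, Eq)]
struct TierResult {
    score: u32,
    grade: &'static str,
    status: &'static str,
}

/// Maps a pass count to the tier's score, grade, and status.
/// Returns `None` when the pass threshold is not met or `total` is zero.
fn tier_result(tier: Tier, passed: u32, total: u32) -> Option<TierResult> {
    if total == 0 {
        return None;
    }
    let threshold_pct: u64 = match tier {
        Tier::Mvp => 90,
        Tier::Full => 95,
    };
    // Integer comparison avoids floating-point rounding at the boundary.
    if u64::from(passed) * 100 < u64::from(total) * threshold_pct {
        return None;
    }
    Some(match tier {
        Tier::Mvp => TierResult { score: 800, grade: "B", status: "PROVISIONAL" },
        // 950 is the floor of the "950+" range (assumption).
        Tier::Full => TierResult { score: 950, grade: "A+", status: "CERTIFIED" },
    })
}

fn main() {
    println!("{:?}", tier_result(Tier::Mvp, 17, 18));
    println!("{:?}", tier_result(Tier::Full, 17, 18));
}
```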

Grade Mapping

Score     Grade  Status
950-1000  A+     CERTIFIED
900-949   A      CERTIFIED
850-899   B+     CERTIFIED
800-849   B      PROVISIONAL
700-799   C      PROVISIONAL
0-699     F      BLOCKED
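
The grade mapping translates directly into a range match. The function name `grade_for` is hypothetical; the ranges, grades, and statuses are taken from the table above, and scores above 1000 are treated as invalid.

```rust
/// Maps an MQS score to (grade, status) per the grade mapping table.
/// Returns `None` for scores outside 0-1000.
fn grade_for(score: u32) -> Option<(&'static str, &'static str)> {
    match score {
        950..=1000 => Some(("A+", "CERTIFIED")),
        900..=949 => Some(("A", "CERTIFIED")),
        850..=899 => Some(("B+", "CERTIFIED")),
        800..=849 => Some(("B", "PROVISIONAL")),
        700..=799 => Some(("C", "PROVISIONAL")),
        0..=699 => Some(("F", "BLOCKED")),
        _ => None,
    }
}

fn main() {
    for score in [1000, 925, 800, 0].iter() {
        println!("{} -> {:?}", score, grade_for(*score));
    }
}
```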

Playbook Format

version: "1.0"
model:
  id: "Qwen/Qwen2.5-Coder-1.5B"
  revision: "main"

test_matrix:
  modalities: [run, chat]
  backends: [cpu, gpu]
  formats: [safetensors, apr, gguf]  # safetensors is ground truth

scenarios:
  - name: "arithmetic_basic"
    prompt: "What is 2 + 2?"
    oracle: arithmetic
    expected: 4

  - name: "code_generation"
    prompt: "Write a Python function to reverse a string"
    oracle: code_syntax
    language: python

# Differential Testing (v1.3.0)
differential_tests:
  tensor_diff:
    enabled: true
    filter: "embed,lm_head"
    gates: ["F-ROSETTA-DIFF-001"]
  inference_compare:
    enabled: true
    prompt: "What is 2+2?"
    tolerance: 1e-5

# Profile CI Assertions (v1.3.0)
profile_ci:
  enabled: true
  assertions:
    min_throughput: 10.0  # tok/s
    max_p99_ms: 500       # ms

# Trace Payload (v1.3.0)
trace_payload:
  enabled: true
  gates: ["F-TRACE-PAYLOAD-001", "F-TRACE-PAYLOAD-002"]

Project Structure

apr-model-qa-playbook/
├── crates/
│   ├── apr-qa-gen/        # Scenario generation + oracles + kernel profiles + bootstrapper
│   ├── apr-qa-runner/     # Playbook execution (Rayon parallel, 16 modules)
│   ├── apr-qa-report/     # MQS scoring + JUnit/HTML/Markdown reports
│   ├── apr-qa-certify/    # Tier-aware scoring, README sync, CSV export
│   └── apr-qa-cli/        # CLI binary (14 subcommands)
├── certifications/        # Model certification evidence (39 models)
│   └── <model>/evidence.json
├── playbooks/
│   ├── models/            # Per-model playbooks (117 YAML files)
│   ├── templates/         # Reusable templates (smoke, mvp, quick, standard, deep)
│   ├── verify/            # Ticket verification
│   └── spec/              # Executable specifications
├── book/                  # mdBook documentation
├── scripts/               # Validation and golden output generation
└── docs/
    ├── certifications/    # models.csv certification database (95 models)
    ├── specifications/    # Full specification (10 docs)
    ├── tickets/           # Ticket analysis (GH-190, GH-191)
    ├── five-whys/         # Root cause analysis
    ├── workflows/         # Certification workflow guides
    └── troubleshooting/   # Debugging guides

Installation

Install the CLI from source:

cargo install --path crates/apr-qa-cli

Or build the entire workspace:

cargo build --release --workspace

Usage

Run model qualification against a playbook:

# Run a single model playbook
apr-qa run playbooks/models/qwen-coder-0.5b.yaml

# Certify a model family (MVP tier, ≤10 min)
apr-qa certify --family qwen-coder --tier mvp

# Generate HTML report
apr-qa report --format html --output report.html

See apr-qa --help for the full list of commands and options.

Contributing

Contributions are welcome. Please follow these steps:

  1. Fork the repository
  2. Make changes on your fork
  3. Run make check (fmt + lint + test) before submitting
  4. Open a pull request with a clear description of the change

All pull requests must pass CI quality gates (clippy, tests, coverage ≥ 95%).

Development

# Run tests with coverage
make coverage

# Verify PMAT compliance (>= 95%)
make coverage-check

# Lint with clippy
make lint

# Full check (fmt + lint + test)
make check

License

MIT License - see LICENSE for details.


Built with Rust • Powered by proptest • Inspired by Toyota & Popper
