
Censgate Redact


High-performance PII detection and anonymization engine

A production-ready, Rust-based solution designed as a drop-in replacement for Microsoft Presidio.

Quick Start · Documentation · Examples · Contributing


Features

  • High Performance — 10-100x faster than Python-based solutions with sub-millisecond inference
  • Memory Safe — Rust's borrow checker eliminates entire classes of security vulnerabilities
  • Production Ready — 36 pattern-based entity types with validation, plus transformer-based NER
  • Multi-Platform — Native server, WebAssembly (WASM), and CLI support
  • ML-Powered — Full ONNX Runtime integration for transformer models (BERT, RoBERTa, DistilBERT)
  • Lightweight — ~20-50MB memory footprint vs ~300MB for Presidio
  • Extensible — Plugin architecture for custom recognizers and anonymization strategies

Quick Start

Install the CLI

cargo install redact-cli
redact --version

Analyze Text for PII

redact analyze "Contact John Doe at john@example.com or call (555) 123-4567"

Output:

Detected 2 PII entities:

  EmailAddress at 21..37 (score: 0.80): john@example.com
  PhoneNumber at 46..60 (score: 0.70): (555) 123-4567

Processing time: 2ms

Anonymize PII

# Replace with placeholders (default)
redact anonymize "My SSN is 123-45-6789"
# Output: My SSN is [US_SSN]

# Mask sensitive data
redact anonymize --strategy mask "Email: john@example.com"
# Output: Email: jo**@****le.com

# Hash for consistent pseudonymization
redact anonymize --strategy hash "Card: 4532-1234-5678-9010"
# Output: Card: [CREDIT_CARD_a1b2c3d4]

Process Files

# Analyze a file
redact analyze -i sensitive_data.txt

# Pipe from stdin
cat document.txt | redact anonymize --strategy mask

# Output as JSON
redact analyze --format json "test@example.com" > results.json

Filter by Entity Type

redact analyze --entities EmailAddress --entities UsSsn \
  "Email: test@example.com, SSN: 123-45-6789, Phone: (555) 123-4567"
# Only detects EmailAddress and UsSsn, ignores PhoneNumber

Installation

cargo install redact-cli

From Source

git clone https://github.com/censgate/redact.git
cd redact
cargo build --release
cargo test --workspace

Using Docker

Multi-architecture images are available for linux/amd64 and linux/arm64:

docker pull ghcr.io/censgate/redact:latest
docker run -p 8080:8080 ghcr.io/censgate/redact:latest

The image uses a minimal distroless base (~37MB) optimized for ARM64 (AWS Graviton, Apple Silicon) and AMD64.

Full image (pattern + ONNX NER)

To enable all entities including ONNX NER (PERSON, ORGANIZATION, LOCATION, DATE_TIME), use the full image. It is published on every release to GHCR with tags full, X.Y.Z-full, etc.:

docker pull ghcr.io/censgate/redact:full
docker run -p 8080:8080 ghcr.io/censgate/redact:full

To build locally instead:

docker build -f Dockerfile.ner -t ghcr.io/censgate/redact:full .
docker run -p 8080:8080 ghcr.io/censgate/redact:full

The full image bakes in a pre-exported NER model (dslim/bert-base-NER) and sets NER_MODEL_PATH=/app/model/model.onnx, so NER is enabled at startup. To enable NER with the default image, mount a directory containing model.onnx and tokenizer.json and set:

docker run -p 8080:8080 -v /path/to/model:/app/model -e NER_MODEL_PATH=/app/model/model.onnx ghcr.io/censgate/redact:latest

Rust Version

This project requires Rust 1.93.0. Use mise, asdf, or rustup for toolchain management:

# Using Mise (recommended)
mise install rust@1.93.0

# Using ASDF
asdf install rust 1.93.0

# Using rustup
rustup install 1.93.0
rustup default 1.93.0

Library Usage

Add to your Cargo.toml:

[dependencies]
redact-core = "0.1"
redact-ner = "0.1"  # Optional: for ML-based NER

Basic Pattern Detection

use redact_core::{AnalyzerEngine, AnonymizerConfig, AnonymizationStrategy};

fn main() -> anyhow::Result<()> {
    let engine = AnalyzerEngine::new();

    // Analyze text
    let text = "Contact John Doe at john@example.com or call (555) 123-4567";
    let result = engine.analyze(text, None)?;

    println!("Found {} PII entities", result.detected_entities.len());
    for entity in &result.detected_entities {
        println!(
            "  {:?}: {} (score: {:.2})",
            entity.entity_type,
            entity.text.as_deref().unwrap_or_default(),
            entity.score
        );
    }

    // Anonymize
    let config = AnonymizerConfig {
        strategy: AnonymizationStrategy::Replace,
        ..Default::default()
    };
    let anonymized = engine.anonymize(text, None, &config)?;
    println!("\nAnonymized: {}", anonymized.text);

    Ok(())
}

ML-Powered NER

For detecting contextual entities like person names, organizations, and locations:

use redact_core::AnalyzerEngine;
use redact_ner::{NerRecognizer, NerConfig};
use std::sync::Arc;

fn main() -> anyhow::Result<()> {
    // Configure NER with ONNX model
    let ner_config = NerConfig {
        model_path: "models/bert-base-ner/model.onnx".to_string(),
        tokenizer_path: Some("models/bert-base-ner/tokenizer.json".to_string()),
        min_confidence: 0.7,
        ..Default::default()
    };

    let ner = NerRecognizer::from_config(ner_config)?;

    // Add NER to analyzer
    let mut engine = AnalyzerEngine::new();
    engine.recognizer_registry_mut().add_recognizer(Arc::new(ner));

    // Detect both pattern-based and contextual entities
    let text = "John Doe works at Acme Corp. Email: john@acme.com";
    let result = engine.analyze(text, None)?;

    for entity in &result.detected_entities {
        println!("{:?}: {}", entity.entity_type, entity.text.as_deref().unwrap_or_default());
    }
    // Output: PERSON: John Doe, ORGANIZATION: Acme Corp, EMAIL: john@acme.com

    Ok(())
}

REST API

Start the Server

cargo run --release --bin redact-api
# Server listening on http://0.0.0.0:8080

Analyze Endpoint

curl -X POST http://localhost:8080/api/v1/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Email john@example.com, SSN 123-45-6789",
    "language": "en"
  }'

Response:

{
  "results": [
    {
      "entity_type": "EMAIL_ADDRESS",
      "start": 6,
      "end": 22,
      "score": 0.8,
      "text": "john@example.com",
      "recognizer_name": "PatternRecognizer"
    },
    {
      "entity_type": "US_SSN",
      "start": 28,
      "end": 39,
      "score": 0.9,
      "text": "123-45-6789",
      "recognizer_name": "PatternRecognizer"
    }
  ],
  "metadata": {
    "recognizers_used": 1,
    "processing_time_ms": 2,
    "language": "en"
  }
}

Anonymize Endpoint

curl -X POST http://localhost:8080/api/v1/anonymize \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Contact John at john@example.com",
    "config": {
      "strategy": "mask",
      "mask_char": "*",
      "mask_start_chars": 2,
      "mask_end_chars": 4
    }
  }'

Supported Entity Types

Pattern-Based (36 types)

Category Entity Types
Contact EMAIL_ADDRESS, PHONE_NUMBER, IP_ADDRESS, URL, DOMAIN_NAME
Financial CREDIT_CARD, IBAN_CODE, US_BANK_NUMBER
US US_SSN, US_DRIVER_LICENSE, US_PASSPORT, US_ZIP_CODE
UK UK_NHS, UK_NINO, UK_POSTCODE, UK_PHONE_NUMBER, UK_MOBILE_NUMBER, UK_SORT_CODE, UK_DRIVER_LICENSE, UK_PASSPORT_NUMBER, UK_COMPANY_NUMBER
Healthcare MEDICAL_LICENSE, MEDICAL_RECORD_NUMBER
Crypto CRYPTO_WALLET, BTC_ADDRESS, ETH_ADDRESS
Technical GUID, MAC_ADDRESS, MD5_HASH, SHA1_HASH, SHA256_HASH
Generic PASSPORT_NUMBER, AGE, ISBN, PO_BOX, DATE_TIME

Pattern-based detection includes validation (Luhn for credit cards, mod-11 for NHS, IBAN checksums) to reduce false positives.
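To illustrate what this checksum validation buys over a bare regex match, here is a minimal Luhn check. This is an independent sketch of the algorithm, not the crate's internal implementation:

```rust
/// Luhn checksum: from the rightmost digit, double every second digit
/// (subtracting 9 if the result exceeds 9) and require the total to be
/// divisible by 10. Digit strings that pass a credit-card regex but fail
/// this check can be rejected as false positives.
fn luhn_valid(number: &str) -> bool {
    // to_digit both filters out separators like '-' or spaces and converts.
    let digits: Vec<u32> = number.chars().filter_map(|c| c.to_digit(10)).collect();
    if digits.len() < 2 {
        return false;
    }
    let sum: u32 = digits
        .iter()
        .rev()
        .enumerate()
        .map(|(i, &d)| {
            if i % 2 == 1 {
                let doubled = d * 2;
                if doubled > 9 { doubled - 9 } else { doubled }
            } else {
                d
            }
        })
        .sum();
    sum % 10 == 0
}

fn main() {
    // A well-known Luhn-valid test number.
    println!("{}", luhn_valid("4532015112830366")); // prints "true"
}
```

A 16-digit string with one digit altered fails the check, which is why validated recognizers can score pattern matches more aggressively.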

NER-Based (ML-Powered)

Entity Type Description
PERSON Person names (e.g., "John Doe", "Marie Curie")
ORGANIZATION Organization names (e.g., "Acme Corp", "Microsoft")
LOCATION Location names (e.g., "New York", "London")
DATE_TIME Date/time expressions in context

Requires ONNX model. See ML-Powered NER section.

Anonymization Strategies

Strategy Description Example
Replace Simple placeholder [EMAIL_ADDRESS]
Mask Partial masking jo**@****le.com
Hash Irreversible hashing [EMAIL_ADDRESS_a1b2c3d4]
Encrypt Reversible encryption <TOKEN_uuid>

use redact_core::anonymizers::{AnonymizerConfig, AnonymizationStrategy};

let config = AnonymizerConfig {
    strategy: AnonymizationStrategy::Mask,
    mask_char: '*',
    mask_start_chars: 2,
    mask_end_chars: 4,
    ..Default::default()
};
// "john@example.com" → "jo**@****le.com"
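The Hash strategy's stable-token behavior (the same input always yields the same placeholder, so repeated values stay linkable after anonymization) can be sketched with the standard library's hasher. This is illustrative only; the crate's actual Hash strategy may use a different, cryptographic hash:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Same input -> same token within a process, so "john@example.com"
/// maps to one stable pseudonym wherever it appears, without
/// revealing the original value.
fn pseudonym(entity_type: &str, value: &str) -> String {
    let mut h = DefaultHasher::new();
    value.hash(&mut h);
    // Truncate to 8 hex chars, mirroring the [CREDIT_CARD_a1b2c3d4] style above.
    format!("[{}_{:08x}]", entity_type, h.finish() as u32)
}

fn main() {
    let a = pseudonym("EMAIL_ADDRESS", "john@example.com");
    let b = pseudonym("EMAIL_ADDRESS", "john@example.com");
    assert_eq!(a, b); // deterministic within a process
    println!("{}", a);
}
```

Note that DefaultHasher output is not guaranteed stable across Rust versions, which is one reason a real pseudonymization pipeline would prefer a keyed cryptographic hash.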

ML-Powered NER

Redact includes full ONNX Runtime integration for transformer-based Named Entity Recognition.

Setup

1. Export a HuggingFace model to ONNX:

pip install transformers optimum[exporters]
python scripts/export_ner_model.py \
    --model dslim/bert-base-NER \
    --output models/bert-base-ner

2. Use in your code:

use redact_ner::{NerRecognizer, NerConfig};
use redact_core::AnalyzerEngine;
use std::sync::Arc;

let config = NerConfig {
    model_path: "models/bert-base-ner/model.onnx".to_string(),
    tokenizer_path: Some("models/bert-base-ner/tokenizer.json".to_string()),
    min_confidence: 0.7,
    ..Default::default()
};

let ner = NerRecognizer::from_config(config)?;
let mut engine = AnalyzerEngine::new();
engine.recognizer_registry_mut().add_recognizer(Arc::new(ner));

Model Directory Structure

The export script creates a directory with the following files:

models/bert-base-ner/
├── model.onnx          # ONNX model file (REQUIRED)
├── tokenizer.json      # HuggingFace tokenizer (REQUIRED)
├── config.json         # Model config with label mappings
├── special_tokens_map.json
└── tokenizer_config.json

Required files for inference:

  • model.onnx - The ONNX-exported transformer model
  • tokenizer.json - HuggingFace fast tokenizer (must be in same directory as model, or specify via tokenizer_path)

Recommended Models

Model Size Use Case
dslim/bert-base-NER ~420MB Best accuracy/size balance (default)
dbmdz/bert-large-cased-finetuned-conll03-english ~1.2GB Highest accuracy
Davlan/distilbert-base-multilingual-cased-ner-hrl ~500MB Multilingual support
elastic/distilbert-base-cased-finetuned-conll03-english ~250MB Smaller/faster

All models must be trained on CoNLL-2003 or a similar NER dataset using the BIO tagging scheme (B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC labels).
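In the BIO scheme, a B- tag opens an entity and consecutive I- tags of the same type extend it. A decoder then collapses per-token tags into entity spans. The following is a simplified sketch of that idea (not the crate's actual decoder, which also maps token indices back to character offsets):

```rust
/// Collapse per-token BIO labels into (entity_type, start, end) token spans.
/// Simplified: a stray I- tag that does not continue an open span of the
/// same type is treated like O.
fn bio_to_spans(labels: &[&str]) -> Vec<(String, usize, usize)> {
    let mut spans = Vec::new();
    let mut current: Option<(String, usize)> = None; // (type, start token index)
    for (i, label) in labels.iter().enumerate() {
        match label.split_once('-') {
            Some(("B", ty)) => {
                // A B- tag always starts a new span, closing any open one.
                if let Some((t, s)) = current.take() {
                    spans.push((t, s, i));
                }
                current = Some((ty.to_string(), i));
            }
            Some(("I", ty)) if matches!(&current, Some((t, _)) if t.as_str() == ty) => {
                // Continues the open span; nothing to do.
            }
            _ => {
                // "O" or a non-continuing I- tag closes the open span.
                if let Some((t, s)) = current.take() {
                    spans.push((t, s, i));
                }
            }
        }
    }
    if let Some((t, s)) = current {
        spans.push((t, s, labels.len()));
    }
    spans
}

fn main() {
    // Tokens: ["John", "Doe", "works", "at", "Acme", "Corp"]
    let labels = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG"];
    println!("{:?}", bio_to_spans(&labels));
    // [("PER", 0, 2), ("ORG", 4, 6)]
}
```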

NER Performance

  • Inference: ~2-10ms per text (depending on model and text length)
  • Memory: ~50-200MB (depending on model)
  • Startup: ~100-500ms model load time
  • Concurrency: Thread-safe via mutex-wrapped sessions

Performance

Benchmark Results (2026-01-31)

Measured using oha with both services running in Docker containers.

Metric Redact (Rust) Presidio (Python) Speedup
p50 Latency 0.20 ms 6.96 ms 34x
p99 Latency 0.96 ms 22.50 ms 23x
Throughput 16,223 req/s 171 req/s 95x

Test payload: Contact john.doe@example.com or call (555) 123-4567. SSN: 123-45-6789.

Run Benchmarks

# REST API comparison vs Presidio (requires Docker + oha)
./scripts/benchmark-comparison.sh

# Criterion micro-benchmarks (Redact internals)
cargo bench --package redact-core

See docs/benchmarks/ for methodology and detailed results.

Project Structure

redact/
├── crates/
│   ├── redact-core/     # Core detection & anonymization engine
│   ├── redact-ner/      # ONNX NER integration
│   ├── redact-api/      # REST API service (Axum)
│   ├── redact-cli/      # Command-line tool
│   └── redact-wasm/     # WebAssembly bindings
├── patterns/            # PII detection patterns (GDPR, HIPAA, CCPA)
├── scripts/             # Utility scripts (model export)
├── examples/            # Usage examples
└── docs/                # Documentation

Testing

# Run all tests
cargo test --workspace

# Run with output
cargo test --workspace -- --nocapture

# Run benchmarks
cargo bench --package redact-core

# Run NER E2E tests (requires ONNX model)
cargo test --package redact-ner --test ner_e2e -- --ignored

# Run specific test suites
cargo test --package redact-core --test pattern_coverage
cargo test --package redact-core --test error_scenarios
cargo test --package redact-core --test concurrent_operations

See TEST_COVERAGE.md for detailed coverage report.

Documentation

Roadmap

v0.6.0 (Current)

  • Complete Rust rewrite (replacing Go v0.1.0-v0.4.1)
  • 36 pattern-based entity types with checksum validation
  • Full ONNX NER integration (PERSON, ORGANIZATION, LOCATION)
  • 4 anonymization strategies (replace, mask, hash, encrypt)
  • REST API service
  • CLI tool
  • Multi-arch Docker images (AMD64/ARM64)
  • Full Docker image with embedded NER model (ghcr.io/censgate/redact:full)
  • Comprehensive test suite (~75% coverage)
  • Entity overlap resolution with specificity scoring

v0.7.0 (Planned)

  • WebAssembly (WASM) browser support
  • Publish crates to crates.io
  • Enhanced documentation
  • Streaming API for large texts

v0.8.0 (Future)

  • Mobile FFI bindings (Swift/Kotlin)
  • GPU acceleration for NER
  • Multi-language support expansion

Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

# Fork and clone
git clone https://github.com/YOUR_USERNAME/redact.git
cd redact

# Create a feature branch
git checkout -b feature/my-new-feature

# Make changes and test
cargo test --workspace
cargo clippy --all-targets --all-features
cargo fmt --all

# Commit and push
git commit -m "feat: add amazing feature"
git push origin feature/my-new-feature

License

Censgate Redact is licensed under the Business Source License 1.1 (BUSL-1.1).

Additional Use Grant: You may use Censgate Redact in production to process up to 100,000 redacted records per month per legal entity, free of charge. Beyond this threshold, a commercial license is required. Contact support@censgate.com for commercial licensing.

Change Date: On 1 March 2030 (or four years after each version's release, whichever comes first), each version of Censgate Redact automatically converts to the GNU General Public License v3.0 or later.

See the LICENSE file for the complete license terms.

Mixed Licensing

Unless explicitly stated otherwise in a subdirectory's own LICENSE file, all code in this repository is licensed under BUSL-1.1. Specific subdirectories (e.g., sdk/ or examples/) may contain their own LICENSE files with different open source licenses (such as MIT or Apache-2.0) to facilitate integration.

Copyright (c) 2026 Censgate LLC

Acknowledgments

Support


Star us on GitHub if you find this project useful!
