PolyDup CLI

Command-line interface for PolyDup, the cross-language duplicate code detector.

Installation

From Source

cd crates/polydup-cli
cargo build --release

# Binary will be at: target/release/polydup

System-wide Installation

cargo install --path crates/polydup-cli

# Or from the workspace root:
cargo install --path .

Usage

Basic Scan

polydup ./src

Scan Multiple Paths

polydup ./src ./lib ./tests

Adjust Detection Parameters

# Set minimum block size (default: 50 tokens)
polydup ./src --threshold 30

# Set similarity threshold (default: 0.85 = 85%)
polydup ./src --similarity 0.9

# Combine both
polydup ./src --threshold 30 --similarity 0.9
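To build intuition for what the similarity threshold compares, here is an illustrative sketch (not PolyDup's actual metric, which may differ): a Jaccard similarity over multisets of normalized tokens, where two code blocks count as duplicates when the score meets the threshold.

```python
# Illustrative only: token-level Jaccard similarity between two
# normalized token sequences. PolyDup's real metric may differ.
from collections import Counter

def jaccard(a, b):
    """Ratio of shared tokens to total distinct token occurrences."""
    ca, cb = Counter(a), Counter(b)
    inter = sum((ca & cb).values())  # min count per token
    union = sum((ca | cb).values())  # max count per token
    return inter / union if union else 1.0

# Two nearly identical snippets after normalization:
s = jaccard(["fn", "$$ID", "(", ")", "{"],
            ["fn", "$$ID", "(", ")", "}"])
# s = 4/6, so this pair would fall below the default 0.85 threshold
```

Raising `--similarity` toward 1.0 demands near-identical token sequences; lowering it admits looser matches.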

Exclude Files (e.g., Tests)

By default, PolyDup excludes common test file patterns:

  • **/*.test.{ts,js,tsx,jsx}
  • **/*.spec.{ts,js,tsx,jsx}
  • **/__tests__/**
  • **/*.test.py

To use custom exclusions (replaces defaults):

# Exclude specific patterns
polydup ./src --exclude "**/*.generated.ts" --exclude "**/*.mock.js"

# Exclude multiple patterns
polydup ./src -e "**/*.test.ts" -e "**/*.spec.js" -e "**/fixtures/**"

# No exclusions (scan everything including tests)
polydup ./src --exclude ""

Output Formats

Text output (default):

polydup ./src

Output:

Scan Results
═══════════════════════════════════════════════════════════
Files scanned:      4
Functions analyzed: 45
Duplicates found:   0

No duplicates found!

JSON output (for scripting):

polydup ./src --format json

Output:

{
  "files_scanned": 4,
  "functions_analyzed": 45,
  "duplicates": [],
  "stats": {
    "total_lines": 0,
    "total_tokens": 3665,
    "unique_hashes": 2666,
    "duration_ms": 8
  }
}
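The JSON format is convenient to post-process in scripts. A minimal sketch, assuming the field names shown above (`files_scanned`, `duplicates`, `stats`):

```python
# Sketch: consume a PolyDup JSON report in a script, e.g. as a CI gate.
import json

report = json.loads("""
{
  "files_scanned": 4,
  "functions_analyzed": 45,
  "duplicates": [],
  "stats": {
    "total_lines": 0,
    "total_tokens": 3665,
    "unique_hashes": 2666,
    "duration_ms": 8
  }
}
""")

dupes = report["duplicates"]
print(f"{report['files_scanned']} files, {len(dupes)} duplicate(s)")
# A CI gate could fail the build when any duplicates are present:
ok = len(dupes) == 0
```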

Verbose Mode

Show additional performance metrics:

polydup ./src --verbose

Output includes:

  • Total tokens processed
  • Number of unique hashes
  • Scan duration

Command-Line Options

polydup [OPTIONS] <PATHS>...

Arguments:
  <PATHS>...  Paths to scan (files or directories)

Options:
  -f, --format <FORMAT>
          Output format [default: text] [possible values: text, json]

  -t, --threshold <MIN_BLOCK_SIZE>
          Minimum code block size in tokens [default: 50]

  -s, --similarity <SIMILARITY>
          Similarity threshold (0.0-1.0) [default: 0.85]

  -v, --verbose
          Show verbose output

  -h, --help
          Print help

  -V, --version
          Print version

Managing False Positives

PolyDup provides an ignore system to suppress false positives while keeping them documented.

Adding Ignore Entries

Add a duplicate to the ignore list:

# Add by ID (from scan output)
polydup ignore add abc123def --files "src/utils.rs:10-30,src/helpers.rs:45-65" --reason "Intentional code reuse"

# Interactive mode
polydup ignore add
# You'll be prompted for files and reason

Listing Ignored Duplicates

# List all ignored duplicates
polydup ignore list

# Verbose output (shows file paths)
polydup ignore list --verbose

# JSON output for scripting
polydup ignore list --format json

Example output:

Ignored Duplicates (2)

1. abc123def456
   Reason: Boilerplate initialization code
   Added by: alice
   Added at: 2025-12-26 10:30:15 UTC
   Files: 2 file(s)

2. xyz789abc123
   Reason: Required by framework convention
   Added by: bob
   Added at: 2025-12-26 11:45:30 UTC
   Files: 3 file(s)

Removing Ignore Entries

# Remove by ID
polydup ignore remove abc123def456

Ignore File Format

Ignored duplicates are stored in .polydup-ignore (TOML format):

version = 1

[[ignores]]
id = "abc123def456"
reason = "Intentional code reuse"
added_by = "alice"
added_at = "2025-12-26T10:30:15Z"

[[ignores.files]]
file = "src/utils.rs"
start_line = 10
end_line = 30

[[ignores.files]]
file = "src/helpers.rs"
start_line = 45
end_line = 65

Tip: Commit .polydup-ignore to version control to share ignore decisions with your team!

Git-Diff Mode (PR Review)

Scan only files changed in a git diff range, which is ideal for PR checks:

# Scan files changed in current branch vs main
polydup scan . --git-diff origin/main..HEAD

# Scan files changed in last commit
polydup scan . --git-diff HEAD~1..HEAD

# Scan with custom similarity threshold
polydup scan . --git-diff main..feature-branch --similarity 0.9

How It Works

  1. Fast: only files in your diff are scanned in depth (10-100x faster on large repos)
  2. Smart: the whole codebase is still indexed, but only duplicates involving changed files are reported
  3. Accurate: catches cases where changed code duplicates unchanged code
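The reporting step can be sketched as a simple filter (assumed logic, simplified to file pairs): keep only duplicate records where at least one side is a changed file.

```python
# Sketch of git-diff filtering: scan results are filtered so that
# only duplicates touching a changed file are reported.

changed = {"src/handler.rs", "src/utils.rs"}  # files in the diff range

# (file_a, file_b) pairs stand in for full duplicate records here:
duplicates = [
    ("src/handler.rs", "src/legacy.rs"),  # changed vs unchanged: kept
    ("src/old_a.rs", "src/old_b.rs"),     # neither changed: dropped
]

relevant = [d for d in duplicates if changed & set(d)]
```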

Example Output:

 Git-Diff Mode: Only scanning files changed in origin/main..HEAD
  Git diff filter: Added 3 file(s) -> Modified/Renamed 2 file(s)
  Changed files (2):
     src/handler.rs
     src/utils.rs

  Git-diff filter: 4 duplicate(s) involve changed files

Combined with Ignore Rules and Directives

Git-diff mode works seamlessly with ignore management:

# PR check with directives
polydup scan . --git-diff origin/main..HEAD --enable-directives

# PR check with ignore rules loaded from .polydup-ignore
polydup scan . --git-diff HEAD~1..HEAD --verbose

CI/CD Example:

# .github/workflows/pr-check.yml
- name: Check for duplicates in PR
  run: polydup scan . --git-diff origin/${{ github.base_ref }}..HEAD
  # Only fails if new duplicates introduced in this PR

Benefits:

  • ✅ Focuses review on relevant changes
  • ✅ Respects existing ignore rules
  • ✅ Works with inline directives
  • ✅ No baseline files to manage

Exit Codes

  • 0: No duplicates found
  • 1: Duplicates found (or error occurred)

This allows usage in CI/CD pipelines:

#!/bin/bash
if ! polydup ./src --threshold 100; then
    echo "❌ Duplicates detected!"
    exit 1
fi
echo "No duplicates!"

Examples

CI/CD Integration

GitHub Actions:

name: Check Duplicates

on: [push, pull_request]

jobs:
  check-dupes:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Install Rust
        uses: actions-rs/toolchain@v1
        with:
          toolchain: stable

      - name: Install PolyDup
        run: cargo install --path crates/polydup-cli

      - name: Check for duplicates
        run: |
          polydup ./src --threshold 50 --similarity 0.85 --format json > duplicates.json

      - name: Upload results
        uses: actions/upload-artifact@v3
        if: failure()
        with:
          name: duplicate-report
          path: duplicates.json

Pre-commit Hook

#!/bin/bash
# .git/hooks/pre-commit

echo "Checking for duplicate code..."
if ! polydup ./src --threshold 100 --similarity 0.9; then
    echo "❌ Large code duplicates detected!"
    echo "Review the duplicates above and consider refactoring."
    exit 1
fi

Makefile Integration

.PHONY: check-dupes
check-dupes:
	@echo "Scanning for duplicates..."
	@polydup ./src ./lib --threshold 50 --similarity 0.85

.PHONY: dupes-json
dupes-json:
	@polydup ./src --format json > duplicates.json
	@echo "Report saved to duplicates.json"

Shell Script for Multiple Projects

#!/bin/bash
# scan-all-projects.sh

projects=(
    "project1/src"
    "project2/lib"
    "project3/backend"
)

for project in "${projects[@]}"; do
    echo "Scanning $project..."
    polydup "$project" --format json > "${project//\//-}-report.json"
done

echo "All scans complete!"

Performance Tuning

Fast Scan (Lower Accuracy)

# Large block size = fewer comparisons = faster
polydup ./src --threshold 100 --similarity 0.7

Thorough Scan (Higher Accuracy)

# Small block size = more comparisons = slower but catches smaller duplicates
polydup ./src --threshold 20 --similarity 0.95

Recommended settings:

  Use Case            Threshold   Similarity
  Quick check         100         0.85
  Standard scan       50          0.85
  Thorough analysis   30          0.90
  Refactoring prep    20          0.95

Troubleshooting

No Duplicates Found (But You Expected Some)

  • Lower the threshold: Try --threshold 20 to catch smaller duplicates
  • Lower similarity: Try --similarity 0.7 for looser matching
  • Check file types: Only Rust, Python, and JavaScript/TypeScript are supported

Too Many False Positives

  • Raise the threshold: Try --threshold 100 to only catch large duplicates
  • Raise similarity: Try --similarity 0.95 for stricter matching

Slow Performance

  • Increase threshold: Larger blocks = fewer comparisons
  • Scan fewer files: Be more specific with paths
  • Use a release build: cargo build --release (cargo install already builds in release mode)

Supported Languages

  • Rust: .rs files
  • Python: .py files
  • JavaScript/TypeScript: .js, .jsx, .ts, .tsx files

More languages coming soon!

Algorithm

PolyDup uses:

  1. Tree-sitter for AST-based parsing
  2. Token normalization (identifiers → $$ID, strings → $$STR, numbers → $$NUM)
  3. Rabin-Karp rolling hash with window size 50
  4. Parallel processing via Rayon for multi-core performance

See architecture-research.md for details.
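The normalization and fingerprinting steps above can be sketched as follows. This is an illustrative Python sketch, not PolyDup's Rust implementation; the base, modulus, and token representation are assumptions for demonstration.

```python
# Illustrative sketch of steps 2-3: normalize tokens, then fingerprint
# fixed-size windows with a Rabin-Karp rolling hash.

BASE = 257           # hash base (assumed for illustration)
MOD = (1 << 61) - 1  # large prime modulus (assumed)

def normalize(tokens):
    """Map identifiers, strings, and numbers to placeholder classes."""
    classes = {"ident": "$$ID", "string": "$$STR", "number": "$$NUM"}
    return [classes.get(kind, text) for kind, text in tokens]

def rolling_hashes(tokens, window=50):
    """Return a Rabin-Karp hash for every `window`-token slice."""
    vals = [hash(t) & 0xFFFFFFFF for t in tokens]
    if len(vals) < window:
        return []
    top = pow(BASE, window - 1, MOD)
    h, hashes = 0, []
    for i, v in enumerate(vals):
        h = (h * BASE + v) % MOD            # extend window rightward
        if i >= window - 1:
            hashes.append(h)
            h = (h - vals[i - window + 1] * top) % MOD  # drop left token
    return hashes
```

Because normalization erases identifier names, two functions that differ only in naming produce identical window hashes, which is what lets the detector match renamed copies across files.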

License

MIT OR Apache-2.0
