#web-archive #parser #archive #web

bin+lib rust_warc

A high performance and easy to use Web Archive (WARC) file reader

3 stable releases

1.2.0 Feb 6, 2026
1.1.0 Feb 17, 2020
1.0.0 May 13, 2019

#1475 in Parser implementations

763 downloads per month

MIT license

13KB
236 lines

A high performance Web Archive (WARC) file parser

The WarcReader iterates over WarcRecords from any BufRead input.
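To make the record-by-record iteration concrete, here is a minimal, stdlib-only sketch of reading one WARC record from a BufRead. It is not the crate's implementation: `SketchRecord` and `read_record` are hypothetical names used only for illustration. It assumes the standard record layout (a version line, header fields, an empty line, `Content-Length` bytes of content, then two CRLF pairs).

```rust
use std::collections::HashMap;
use std::io::{self, BufRead, Cursor};

/// Hypothetical record type for this sketch (not the crate's WarcRecord).
#[derive(Debug)]
struct SketchRecord {
    version: String,
    headers: HashMap<String, String>,
    content: Vec<u8>,
}

/// Read one record from any BufRead; returns Ok(None) at end of input.
fn read_record<R: BufRead>(input: &mut R) -> io::Result<Option<SketchRecord>> {
    let invalid = |msg: &str| io::Error::new(io::ErrorKind::InvalidData, msg.to_string());

    // Version line, e.g. "WARC/1.0".
    let mut line = String::new();
    if input.read_line(&mut line)? == 0 {
        return Ok(None);
    }
    let version = line.trim_end().to_string();
    if !version.starts_with("WARC/") {
        return Err(invalid("missing WARC version line"));
    }

    // Header fields until the first empty line; names are lowercased for lookup.
    let mut headers = HashMap::new();
    loop {
        line.clear();
        if input.read_line(&mut line)? == 0 {
            return Err(invalid("unexpected end of input in header"));
        }
        let trimmed = line.trim_end();
        if trimmed.is_empty() {
            break;
        }
        let (name, value) = trimmed
            .split_once(':')
            .ok_or_else(|| invalid("header line without ':'"))?;
        headers.insert(name.trim().to_ascii_lowercase(), value.trim().to_string());
    }

    // The content length comes from the Content-Length header.
    let len = headers
        .get("content-length")
        .ok_or_else(|| invalid("missing Content-Length"))?
        .parse::<usize>()
        .map_err(|_| invalid("invalid Content-Length"))?;
    let mut content = vec![0u8; len];
    input.read_exact(&mut content)?;

    // Each record ends with two CRLF pairs.
    let mut trailer = [0u8; 4];
    input.read_exact(&mut trailer)?;
    if &trailer != b"\r\n\r\n" {
        return Err(invalid("missing record trailer"));
    }

    Ok(Some(SketchRecord { version, headers, content }))
}

fn main() -> io::Result<()> {
    let body = "hello";
    let raw = format!(
        "WARC/1.0\r\nWARC-Type: response\r\nContent-Length: {}\r\n\r\n{}\r\n\r\n",
        body.len(),
        body
    );
    // A Cursor over bytes is a BufRead, just like a locked stdin or a BufReader.
    let mut input = Cursor::new(raw.into_bytes());
    while let Some(record) = read_record(&mut input)? {
        println!(
            "{} {:?} {} bytes",
            record.version,
            record.headers.get("warc-type"),
            record.content.len()
        );
    }
    Ok(())
}
```

Because the function only asks for BufRead, the same loop works unchanged over a file wrapped in std::io::BufReader or over an in-memory buffer.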

Performance should be quite good: roughly 500 MiB/s on a single CPU core.
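A throughput figure like this can be checked with a simple timing harness. The sketch below is an assumption about how one might measure it, not the author's benchmark: `mib_per_sec` and `drain` are hypothetical helpers, and the loop merely drains an in-memory buffer where a real measurement would iterate over parsed records.

```rust
use std::io::{self, BufRead, Cursor};
use std::time::{Duration, Instant};

/// Convert a byte count and elapsed time into MiB/s (1 MiB = 2^20 bytes).
fn mib_per_sec(bytes: u64, elapsed: Duration) -> f64 {
    bytes as f64 / (1u64 << 20) as f64 / elapsed.as_secs_f64()
}

/// Drain a BufRead and return how many bytes passed through it.
/// In a real measurement, the parsing work would replace this loop.
fn drain<R: BufRead>(mut input: R) -> io::Result<u64> {
    let mut total = 0u64;
    loop {
        let n = input.fill_buf()?.len();
        if n == 0 {
            return Ok(total);
        }
        input.consume(n);
        total += n as u64;
    }
}

fn main() -> io::Result<()> {
    let data = vec![0u8; 64 << 20]; // 64 MiB of placeholder input
    let start = Instant::now();
    let bytes = drain(Cursor::new(data))?;
    let elapsed = start.elapsed();
    println!(
        "{} bytes in {:?}: {:.0} MiB/s",
        bytes,
        elapsed,
        mib_per_sec(bytes, elapsed)
    );
    Ok(())
}
```

Results depend heavily on the input source and build profile, so any comparison with the figure above should use a release build and uncompressed input.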

Usage

use rust_warc::WarcReader;

// we're taking input from stdin here, but any BufRead will do
let stdin = std::io::stdin();
let handle = stdin.lock();

let warc = WarcReader::new(handle);

let mut response_counter = 0;
for item in warc {
    let record = item.expect("IO/malformed error");

    // header names are case insensitive
    if record.header.get(&"WARC-Type".into()) == Some(&"response".into()) {
        response_counter += 1;
    }
}

println!("# response records: {}", response_counter);
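The case-insensitive header lookup in the example can be illustrated with a small key type whose equality and hash both ignore ASCII case. This is a stdlib-only sketch under that assumption; `HeaderName` is a hypothetical type for illustration and is not taken from the crate.

```rust
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical header-name key that compares and hashes without regard to ASCII case.
#[derive(Debug, Clone)]
struct HeaderName(String);

impl From<&str> for HeaderName {
    fn from(s: &str) -> Self {
        HeaderName(s.to_string())
    }
}

impl PartialEq for HeaderName {
    fn eq(&self, other: &Self) -> bool {
        self.0.eq_ignore_ascii_case(&other.0)
    }
}

impl Eq for HeaderName {}

impl Hash for HeaderName {
    // Hash the lowercased bytes so that equal keys always hash identically.
    fn hash<H: Hasher>(&self, state: &mut H) {
        for b in self.0.bytes() {
            state.write_u8(b.to_ascii_lowercase());
        }
    }
}

fn main() {
    let mut header: HashMap<HeaderName, String> = HashMap::new();
    header.insert(HeaderName::from("WARC-Type"), "response".to_string());

    // Both spellings find the same entry.
    for name in ["warc-type", "WARC-TYPE"] {
        println!("{} -> {:?}", name, header.get(&HeaderName::from(name)));
    }
}
```

Keeping Eq and Hash consistent is the important design point: a HashMap only finds a key if both agree that two spellings are the same.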

Rust-Warc

A high performance and easy to use Web Archive (WARC) file reader

use rust_warc::WarcReader;

use std::io;

fn main() {
    // we're taking input from stdin here, but any BufRead will do
    let stdin = io::stdin();
    let handle = stdin.lock();

    let warc = WarcReader::new(handle);

    let mut response_counter = 0;
    let mut response_size = 0;

    for item in warc {
        let record = item.unwrap(); // could be IO/malformed error

        // header names are case insensitive
        if record.header.get(&"WARC-Type".into()) == Some(&"response".into()) {
            response_counter += 1;
            response_size += record.content.len();
        }
    }

    println!("response records: {}", response_counter);
    println!("response size: {} MiB", response_size >> 20);
}

No runtime deps