7 unstable releases (3 breaking)
Version | Released |
---|---|
0.4.0 | Jun 24, 2022 |
0.3.0 | Jun 23, 2022 |
0.2.0 | Jun 23, 2022 |
0.1.3 | Jun 21, 2022 |
warc-parquet
🗄️ A utility for converting WARC to Parquet.
📦 Install
The binary may be installed via cargo:
$ cargo install warc-parquet
To use the crate in your project, add the following to your Cargo.toml file:
[dependencies]
warc-parquet = "0.4"
🤸 Usage
The Binary
Once installed, the warc-parquet utility can be used to transform WARC into Parquet:
$ wget --warc-file example 'https://example.com'
$ cat example.warc.gz | warc-parquet --gzipped > example.snappy.parquet
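The pipeline above passes --gzipped because wget wrote a gzip-compressed WARC. When it is unclear whether a given file is compressed, checking for the gzip magic bytes (0x1f 0x8b) is one way to decide. The following is a minimal Python sketch of that check; the helper names are hypothetical, and it assumes that omitting --gzipped is the right invocation for uncompressed input, which the text above does not confirm.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def build_command(path):
    """Build the argv for warc-parquet (hypothetical helper).

    Assumption: the tool accepts uncompressed WARC on stdin when
    --gzipped is omitted.
    """
    if is_gzipped(path):
        return ["warc-parquet", "--gzipped"]
    return ["warc-parquet"]
```

Since the utility reads the WARC from standard input and writes Parquet to standard output, the resulting argv would be run with the file attached to stdin and the output redirected, as in the shell example.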
The Crate
Refer to the docs for more details about how to use the Reader within your own programs.
DuckDB
There are many ways to consume Parquet once you have it. However, a natural fit might be DuckDB:
$ duckdb
v0.3.3 fe9ba8003
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D select type, id from 'example.snappy.parquet';
┌──────────┬─────────────────────────────────────────────────┐
│ type     │ id                                              │
├──────────┼─────────────────────────────────────────────────┤
│ warcinfo │ <urn:uuid:A8063499-7675-4D8D-A736-A1D7DAE84C84> │
│ request │ <urn:uuid:3EB20966-D74F-4949-AACB-23DB3A0733A7> │
│ response │ <urn:uuid:8B92CADC-F770-45BE-8B72-E13A61CD6D1C> │
│ metadata │ <urn:uuid:4C0E9E17-E21B-49E0-859A-D1016FBDE636> │
│ resource │ <urn:uuid:14F502A5-3BDE-4D0B-8A43-95F4BB8398C6> │
│ resource │ <urn:uuid:6B6D6ADD-52FF-4760-AA00-FB9E739CABBE> │
└──────────┴─────────────────────────────────────────────────┘
D describe select * from 'example.snappy.parquet';
┌─────────────────────────┬─────────────┬──────┬─────┬─────────┬───────┐
│ column_name             │ column_type │ null │ key │ default │ extra │
├─────────────────────────┼─────────────┼──────┼─────┼─────────┼───────┤
│ id                      │ VARCHAR     │ YES  │     │         │       │
│ content_length          │ UINTEGER    │ YES  │     │         │       │
│ date                    │ TIMESTAMP   │ YES  │     │         │       │
│ type                    │ VARCHAR     │ YES  │     │         │       │
│ content_type            │ VARCHAR     │ YES  │     │         │       │
│ concurrent_to           │ VARCHAR     │ YES  │     │         │       │
│ block_digest            │ VARCHAR     │ YES  │     │         │       │
│ payload_digest          │ VARCHAR     │ YES  │     │         │       │
│ ip_address              │ VARCHAR     │ YES  │     │         │       │
│ refers_to               │ VARCHAR     │ YES  │     │         │       │
│ target_uri              │ VARCHAR     │ YES  │     │         │       │
│ truncated               │ VARCHAR     │ YES  │     │         │       │
│ warc_info_id            │ VARCHAR     │ YES  │     │         │       │
│ filename                │ VARCHAR     │ YES  │     │         │       │
│ profile                 │ VARCHAR     │ YES  │     │         │       │
│ identified_payload_type │ VARCHAR     │ YES  │     │         │       │
│ segment_number          │ UINTEGER    │ YES  │     │         │       │
│ segment_origin_id       │ VARCHAR     │ YES  │     │         │       │
│ segment_total_length    │ UINTEGER    │ YES  │     │         │       │
│ body                    │ BLOB        │ YES  │     │         │       │
└─────────────────────────┴─────────────┴──────┴─────┴─────────┴───────┘
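The column names in the describe output line up with the named fields defined by the WARC format (WARC-Type, WARC-Record-ID, and so on), with body holding the record block. The following Python sketch illustrates that correspondence by parsing a single WARC header block into a row keyed by column name. The mapping is an assumption inferred from the names, not taken from the crate's source; the function name, the sample Content-Length value, and the choice to keep every value as a raw string are illustrative only.

```python
# Assumed mapping from WARC named fields to the Parquet column names
# shown above. Inferred from the names; not taken from the crate itself.
FIELD_TO_COLUMN = {
    "WARC-Record-ID": "id",
    "Content-Length": "content_length",
    "WARC-Date": "date",
    "WARC-Type": "type",
    "Content-Type": "content_type",
    "WARC-Concurrent-To": "concurrent_to",
    "WARC-Block-Digest": "block_digest",
    "WARC-Payload-Digest": "payload_digest",
    "WARC-IP-Address": "ip_address",
    "WARC-Refers-To": "refers_to",
    "WARC-Target-URI": "target_uri",
    "WARC-Truncated": "truncated",
    "WARC-Warcinfo-ID": "warc_info_id",
    "WARC-Filename": "filename",
    "WARC-Profile": "profile",
    "WARC-Identified-Payload-Type": "identified_payload_type",
    "WARC-Segment-Number": "segment_number",
    "WARC-Segment-Origin-ID": "segment_origin_id",
    "WARC-Segment-Total-Length": "segment_total_length",
}


def parse_warc_header(block):
    """Parse one WARC header block into {column_name: raw string value}."""
    # Field names are matched case-insensitively.
    lookup = {name.lower(): col for name, col in FIELD_TO_COLUMN.items()}
    row = {}
    # The first line is the version marker (e.g. "WARC/1.0"); skip it.
    for line in block.strip().splitlines()[1:]:
        name, sep, value = line.partition(":")
        column = lookup.get(name.strip().lower())
        if sep and column:
            row[column] = value.strip()
    return row


# Illustrative header; the record ID is the response row from the query above.
sample = """WARC/1.0
WARC-Type: response
WARC-Record-ID: <urn:uuid:8B92CADC-F770-45BE-8B72-E13A61CD6D1C>
WARC-Target-URI: https://example.com/
Content-Length: 1256
"""
row = parse_warc_header(sample)
```

Fields that are absent from a record are simply missing from the row, which is consistent with every column being nullable in the schema above.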