1 unstable release

0.1.0 Jan 15, 2023

#369 in Caching

MIT/Apache

28KB
423 lines

data_downloader

This crate provides a simple way to download files. In particular this crate aims to make it easy to download and cache files that do not change over time, for example reference image files, ML models, example audio files or common password lists.

Roadmap

  • Test concurrency
  • Add an optional expected_size to DownloadRequests
    • If the download is bigger than that, fail
    • If it is None, there is no upper limit
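
As a rough illustration of the expected_size roadmap item, the following sketch shows one way a size limit could be enforced while reading a response body. The helper read_with_limit and its signature are hypothetical and not part of data_downloader; the sketch uses only the standard library.

```rust
use std::io::{self, Read};

/// Reads `reader` to the end, failing if more than `expected_size` bytes arrive.
/// `None` means no upper limit, mirroring the roadmap item.
/// Hypothetical helper: not part of the data_downloader API.
pub fn read_with_limit<R: Read>(reader: R, expected_size: Option<u64>) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    match expected_size {
        None => {
            let mut reader = reader;
            reader.read_to_end(&mut buf)?;
        }
        Some(limit) => {
            // Allow one byte past the limit so an oversized body is detectable.
            let mut limited = reader.take(limit.saturating_add(1));
            limited.read_to_end(&mut buf)?;
            if buf.len() as u64 > limit {
                return Err(io::Error::new(
                    io::ErrorKind::InvalidData,
                    "download is bigger than expected_size",
                ));
            }
        }
    }
    Ok(buf)
}
```

Reading at most one byte beyond the limit keeps memory use bounded even if the server sends far more data than expected.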

lib.rs:

Downloading a file

As an example, to download the plaintext version of RFC 2068, construct a DownloadRequest with the URL and SHA-256 checksum and then call the [get] function.

If you know that the file was already downloaded you can use get_cached.

use data_downloader::{get, get_cached, DownloadRequest};

// Define where to get the file from
let rfc_link = &DownloadRequest {
    url: "https://www.rfc-editor.org/rfc/rfc2068.txt",
    name: "rfc2068.txt",
    sha256_hash: &hex_literal::hex!(
        "D6C4E471389F2D309AB1F90881576542C742F95B115336A346447D052E0477CF"
    ),
};

// Get the binary contents of the file
let rfc: Vec<u8> = get(rfc_link)?;

// Convert the file to a String
let as_text = String::from_utf8(rfc)?;
assert!(as_text.contains("The Hypertext Transfer Protocol (HTTP) is an application-level"));
assert!(as_text.contains("protocol for distributed, collaborative, hypermedia information"));
assert!(as_text.contains("systems."));

// Get the binary contents of the file directly from disk
let rfc: Vec<u8> = get_cached(rfc_link)?;

get_path can be used to get a PathBuf to the file. Note that get_path does not download the file so you have to call [get] first.

One of the design goals of this crate is to verify the integrity of the downloaded files, so the SHA-256 checksum of every download is checked. If a file is loaded from the cache on disk, its SHA-256 checksum is verified as well. For get_path, however, the checksum is not verified: even if it were, you would still be exposed to a time-of-check to time-of-use (TOCTOU) race, because the file could change between the check and your later use of the path.
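
The difference between returning verified bytes and returning a path can be sketched as follows. This is a minimal illustration, not the crate's implementation: load_verified is a hypothetical helper, and the hash function is passed in as a parameter (standing in for SHA-256) so that the sketch needs only the standard library.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Reads the file once and verifies the checksum of exactly those bytes.
/// `hash` stands in for SHA-256; it is a parameter so the sketch needs only `std`.
/// Hypothetical helper: not part of the data_downloader API.
pub fn load_verified(
    path: &Path,
    expected: &[u8],
    hash: impl Fn(&[u8]) -> Vec<u8>,
) -> io::Result<Vec<u8>> {
    let bytes = fs::read(path)?;
    if hash(&bytes).as_slice() != expected {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "checksum mismatch"));
    }
    // The caller receives the verified bytes, not a path that could change later.
    Ok(bytes)
}
```

Because the check and the use operate on the same in-memory buffer, nothing can change in between. A function that only hands out a path cannot give that guarantee, since the file may be replaced after the check.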

The [get], get_cached and get_path functions use a default directory to cache the downloads, which allows multiple applications to share their cached downloads. If you need more configurability, you can use Downloader and set the storage directory manually using Downloader::new_with_dir. The default storage directory is a platform-specific cache directory, or a platform-specific temporary directory if the cache directory is not available.
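
The fallback from cache directory to temporary directory might look roughly like this. The function default_storage_dir and the data_downloader subdirectory name are assumptions made for illustration; in the real crate the cache directory would presumably come from the dirs crate (for example dirs::cache_dir()), which is replaced by a parameter here to keep the sketch free of dependencies.

```rust
use std::env;
use std::path::PathBuf;

/// Picks the storage directory: the platform cache directory if known,
/// otherwise the platform temporary directory.
/// `cache_dir` would come from something like `dirs::cache_dir()`;
/// the `data_downloader` subdirectory name is an assumption.
pub fn default_storage_dir(cache_dir: Option<PathBuf>) -> PathBuf {
    cache_dir
        .unwrap_or_else(env::temp_dir)
        .join("data_downloader")
}
```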

Included DownloadRequests

The files module contains some predefined DownloadRequests for your convenience.

Pitfalls

When you change a DownloadRequest manually, the SHA-256 sum has to be changed too. Otherwise the result can be a DownloadRequest that looks as if it downloads a specific file but actually returns something else. For example, here the DownloadRequest from above was changed, but only the url was adapted. Since neither the name nor the sha256_hash was updated to the correct value, this will return rfc2068.txt from the cache. This is a user error: the developer has to ensure that they specify the correct SHA-256 checksum for a DownloadRequest.

&DownloadRequest {
    url: "https://www.rfc-editor.org/rfc/rfc7168.txt",
    name: "rfc2068.txt",
    sha256_hash: &hex_literal::hex!(
        "D6C4E471389F2D309AB1F90881576542C742F95B115336A346447D052E0477CF"
    ),
};
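
To see why the edited request silently returns the old file, consider this toy model of a cache that is looked up by name and checksum and only consults the URL on a miss. Request and ToyCache are hypothetical, simplified types for illustration; they do not reflect the crate's real on-disk layout or API.

```rust
use std::collections::HashMap;

/// Simplified stand-in for `DownloadRequest` with owned fields.
pub struct Request {
    pub url: String,
    pub name: String,
    pub checksum: Vec<u8>,
}

/// Toy in-memory cache keyed by file name; a model of the behaviour
/// described above, not the crate's real implementation.
#[derive(Default)]
pub struct ToyCache {
    // name -> (checksum, contents)
    files: HashMap<String, (Vec<u8>, Vec<u8>)>,
}

impl ToyCache {
    pub fn insert(&mut self, name: &str, checksum: &[u8], contents: &[u8]) {
        self.files
            .insert(name.to_string(), (checksum.to_vec(), contents.to_vec()));
    }

    /// Returns the cached contents if name and checksum match.
    /// The URL is only consulted on a cache miss, so a wrong URL goes unnoticed.
    pub fn get(&self, req: &Request, download: impl FnOnce(&str) -> Vec<u8>) -> Vec<u8> {
        match self.files.get(&req.name) {
            Some((sum, contents)) if *sum == req.checksum => contents.clone(),
            _ => download(&req.url),
        }
    }
}
```

In this model, a request whose url points at RFC 7168 but whose name and checksum still describe rfc2068.txt hits the cache entry for RFC 2068, and the download closure is never called.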

Status of this crate

This is an early release. As such breaking changes are expected at some point. There are also some implementation limitations including but not limited to:

  • The downloading is rather primitive. Failed downloads are simply retried once and no continuation of interrupted downloads is implemented.
  • The default timeouts of reqwest are used. As such large downloads on slow connections can fail.
  • Only one URL is used per DownloadRequest; it is not currently possible to specify multiple possible locations for a file.
  • Only single files are supported, no unpacking of zips is supported.
  • The crate uses blocking IO.
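
The "retried once" behaviour from the first limitation can be sketched generically as below. retry_once is a hypothetical helper written for illustration; it is not taken from the crate's source and only shows the general shape of such a policy.

```rust
/// Runs `attempt`; if it fails, runs it exactly one more time and
/// returns the result of that second try, whatever it is.
/// Hypothetical helper: not part of the data_downloader API.
pub fn retry_once<T, E>(mut attempt: impl FnMut() -> Result<T, E>) -> Result<T, E> {
    match attempt() {
        Ok(value) => Ok(value),
        // The first error is discarded; only the second outcome is reported.
        Err(_first_error) => attempt(),
    }
}
```

A more robust downloader would typically add a delay between attempts and resume partial transfers, which is exactly what the limitation above says is missing.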

Contributions to improve this are welcome.

Nevertheless this crate should be suitable for simple use cases.

Dependencies

This crate uses the following dependencies:

  • dirs to find platform specific temporary and cache directories
    • Implementing this manually would only cause incompatibilities
  • reqwest to issue HTTP requests
    • An HTTP library is definitely required to allow this crate to download files. reqwest is widely used in the Rust community; it is, however, a rather big dependency because it is very fully featured. It might be worth investigating smaller HTTP client libraries in the future.
  • sha2 to hash files
    • To ensure the integrity of the files, a collision-resistant cryptographic hash function is required. SHA-256 is generally considered the standard for such a use case. The sha2 crate by the RustCrypto organisation is the de facto standard implementation of SHA-2 for Rust.
  • hex-literal to conveniently specify the SHA-256 sums
    • Technically this dependency could be removed if we specified the SHA-256 sums in the predefined DownloadRequests directly as &[u8] slice literals. However, the library is maintained by the RustCrypto organisation and as such can be regarded as trustworthy.
  • thiserror to conveniently derive Error
    • This library is also very widely used and is maintained by David Tolnay, a highly regarded member of the Rust community. Once data_downloader has sufficiently matured, it might be a good idea to stop using thiserror and instead write the generated implementations directly in the code. This would potentially reduce build times. This has low priority, however, especially while the [enum@crate::Error] type is still changing frequently.
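
For a sense of what dropping thiserror would involve, the sketch below writes out by hand the trait implementations that the derive would otherwise generate. The DownloadError enum and its variants are hypothetical and chosen only for illustration; they are not the real variants of the crate's Error type.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical error type; the variants are illustrative,
/// not the real `data_downloader::Error`.
#[derive(Debug)]
pub enum DownloadError {
    Io(std::io::Error),
    ChecksumMismatch,
}

// Hand-written equivalent of the `#[error("...")]` attributes.
impl fmt::Display for DownloadError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DownloadError::Io(e) => write!(f, "I/O error: {}", e),
            DownloadError::ChecksumMismatch => write!(f, "SHA-256 checksum mismatch"),
        }
    }
}

// Hand-written equivalent of the `source()` wiring that `#[from]` implies.
impl Error for DownloadError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            DownloadError::Io(e) => Some(e),
            DownloadError::ChecksumMismatch => None,
        }
    }
}

// Hand-written equivalent of the `From` impl that `#[from]` generates.
impl From<std::io::Error> for DownloadError {
    fn from(e: std::io::Error) -> Self {
        DownloadError::Io(e)
    }
}
```

The amount of boilerplate grows with every variant, which is one reason to postpone this step while the error type is still changing.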
