
fetch-data

Fetch data files from a URL, but only if needed. Verify contents via SHA256. Some Python Pooch compatibility.

5 releases

0.2.0 Jul 29, 2024
0.1.6 Oct 20, 2022
0.1.5 Oct 11, 2022
0.1.4 Jun 29, 2022
0.1.3 Jun 29, 2022

Used in 4 crates (3 directly)

MIT/Apache

33KB
296 lines


Fetch-Data checks a local data directory and then downloads only the files that are missing. It always verifies both local and downloaded files via a SHA256 hash.

Fetch-Data makes it easy to download large and small sample files. For example, here we download a genomics file from GitHub (if it has not already been downloaded). We then print the size of the now local file.

use fetch_data::sample_file;

let path = sample_file("small.fam")?;
println!("{}", std::fs::metadata(path)?.len()); // Prints 85

# Ok::<(), anyhow::Error>(())
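
The check-then-download behavior described above can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: `fetch_if_needed` and `toy_hash` are hypothetical names, the download is passed in as a closure instead of going through ureq, and `toy_hash` is a non-cryptographic stand-in for SHA256. Treating a hash mismatch as an error is also an assumption of this sketch.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Toy stand-in for SHA256 (NOT cryptographic): FNV-1a, 64-bit, hex-encoded.
fn toy_hash(bytes: &[u8]) -> String {
    let mut h: u64 = 0xcbf29ce484222325;
    for b in bytes {
        h ^= u64::from(*b);
        h = h.wrapping_mul(0x100000001b3);
    }
    format!("{h:016x}")
}

/// Hypothetical helper: download only if `path` is missing, then verify the hash.
/// Returns Ok(true) if a download happened and Ok(false) if the local copy was reused.
fn fetch_if_needed<D>(path: &Path, expected_hash: &str, download: D) -> io::Result<bool>
where
    D: FnOnce() -> io::Result<Vec<u8>>,
{
    let downloaded = if path.exists() {
        false
    } else {
        // Write the bytes exactly as received (a binary copy).
        let bytes = download()?;
        if let Some(parent) = path.parent() {
            fs::create_dir_all(parent)?;
        }
        fs::write(path, bytes)?;
        true
    };

    // Verify the file whether it was already present or just downloaded.
    let actual = toy_hash(&fs::read(path)?);
    if actual != expected_hash {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("hash mismatch for {}", path.display()),
        ));
    }
    Ok(downloaded)
}
```

In this sketch a second call for the same file skips the download closure entirely but still re-checks the hash of the local copy.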

Features

  • Thread-safe -- allowing it to be used with Rust's multithreaded testing framework.
  • Inspired by Python's popular Pooch and our PySnpTools filecache module.
  • Avoids async runtimes such as Tokio (by using ureq to download files via blocking I/O).

Suggested Usage

You can set up FetchData in many ways. Here are the steps -- followed by sample code -- for one setup.

  • Create a registry.txt file containing a whitespace-delimited list of files and their hashes. (This is the same format as Pooch. See section Registry Creation for tips on creating this file.)

  • As shown below, create a global static FetchData instance that reads your registry.txt file. Give it:

    • the URL root from which to download the files
    • the name of an environment variable that specifies the local data directory in which to store the files
    • a qualifier, organization, and application -- used to create a local data directory when the environment variable is not set. See ProjectDirs in the directories crate for details.
  • As shown below, define a public sample_file function that takes a file name and returns a Result containing the path to the downloaded file.

use fetch_data::{ctor, FetchData, FetchDataError};
use std::path::{Path, PathBuf};

#[ctor]
static STATIC_FETCH_DATA: FetchData = FetchData::new(
    include_str!("../registry.txt"),
    "https://raw.githubusercontent.com/CarlKCarlK/fetch-data/main/tests/data/",
    "BAR_APP_DATA_DIR", // env_key
    "com",              // qualifier
    "Foo Corp",         // organization
    "Bar App",          // application
);

/// Download a data file.
pub fn sample_file<P: AsRef<Path>>(path: P) -> Result<PathBuf, Box<FetchDataError>> {
    STATIC_FETCH_DATA.fetch_file(path)
}

You can now use your sample_file function to download your files as needed.
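
To make the role of the environment variable concrete, here is a minimal sketch of how a data directory could be chosen. It is not the crate's code: `resolve_data_dir` and `resolve_data_dir_from` are hypothetical names, and the `fallback` argument stands in for the platform-specific directory that the crate derives from the qualifier, organization, and application. Treating an empty variable as unset is a choice made by this sketch only.

```rust
use std::env;
use std::ffi::OsString;
use std::path::PathBuf;

/// Hypothetical helper: pick the directory from an already-read environment value.
/// An empty value is treated as "not set" (an assumption of this sketch).
fn resolve_data_dir_from(env_value: Option<OsString>, fallback: PathBuf) -> PathBuf {
    match env_value {
        Some(dir) if !dir.is_empty() => PathBuf::from(dir),
        _ => fallback,
    }
}

/// Hypothetical helper: read `env_key` and fall back when it is not set.
fn resolve_data_dir(env_key: &str, fallback: PathBuf) -> PathBuf {
    resolve_data_dir_from(env::var_os(env_key), fallback)
}
```

With this sketch, calling `resolve_data_dir("BAR_APP_DATA_DIR", fallback)` returns the variable's value when it is set, so a test machine or CI job can redirect downloads without code changes.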

Registry Creation

You can create your registry.txt file in many ways. Here are the steps -- followed by sample code -- for one way to create it.

  • Upload your data files to the Internet.
    • For example, Fetch-Data keeps its sample data files in tests/data, so they are uploaded to the tests/data folder of its GitHub repository. In GitHub, by looking at the raw view of a data file, we see the root URL for these files. In Cargo.toml, we keep these data files out of our crate via exclude = ["tests/data/*"].
  • As shown below, write code that
    • Creates a FetchData instance without registry contents.
    • Lists the files in your data directory.
    • Calls the gen_registry_contents method on your list of files. This method will download the files, compute their hashes, and create a string of file names and hashes.
  • Print this string, then manually paste it into a file called registry.txt.

use fetch_data::{FetchData, dir_to_file_list};

let fetch_data = FetchData::new(
    "", // registry_contents ignored
    "https://raw.githubusercontent.com/CarlKCarlK/fetch-data/main/tests/data/",
    "BAR_APP_DATA_DIR", // env_key
    "com",              // qualifier
    "Foo Corp",         // organization
    "Bar App",          // application
);
let file_list = dir_to_file_list("tests/data")?;
let registry_contents = fetch_data.gen_registry_contents(file_list)?;
println!("{registry_contents}");

# use fetch_data::FetchDataError; // '#' needed for doctest
# Ok::<(), Box<FetchDataError>>(())
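
The resulting registry text is a whitespace-delimited list of file names and hashes. As a rough illustration of that format, the sketch below parses such text into a map. `parse_registry` is a hypothetical helper, not part of Fetch-Data, and it assumes that tokens simply alternate between file name and hash, that file names contain no whitespace, and that each hash is a 64-digit hexadecimal SHA256.

```rust
use std::collections::HashMap;

/// Hypothetical helper: parse "name hash name hash ..." into a map from name to hash.
fn parse_registry(contents: &str) -> Result<HashMap<String, String>, String> {
    let mut map = HashMap::new();
    let mut tokens = contents.split_whitespace();

    // Tokens are consumed in pairs: a file name followed by its hash.
    while let Some(name) = tokens.next() {
        let hash = tokens
            .next()
            .ok_or_else(|| format!("no hash for file '{name}'"))?;

        // A SHA256 digest written in hex has exactly 64 hex digits.
        if hash.len() != 64 || !hash.chars().all(|c| c.is_ascii_hexdigit()) {
            return Err(format!("'{hash}' is not a 64-digit hex hash for '{name}'"));
        }
        map.insert(name.to_string(), hash.to_ascii_lowercase());
    }
    Ok(map)
}
```

A parser like this makes it easy to sanity-check a hand-edited registry.txt, for example to catch a file name that was pasted without its hash.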

Notes

  • Feature requests and contributions are welcome.

  • Don't use our example sample_file function. Define your own sample_file that knows where to find your data files.

  • The FetchData instance need not be global and static. See FetchData::new for an example of a non-global instance.

  • Additional methods on the FetchData instance can fetch multiple files and can give the path to the local data directory.

  • You need not use a registry.txt file and FetchData instance. You can instead use the stand-alone function fetch to retrieve a single file with known URL, hash, and local path.

  • Additional stand-alone functions can download files and hash files.

  • Fetch-Data always does binary downloads to maintain consistent line endings across OSs.

  • The Bed-Reader genomics crate uses Fetch-Data.

  • To make FetchData work well as a static global, FetchData::new never fails. Instead, FetchData stores any error and returns it when the first call to fetch_file, etc., is made.

  • Debugging this crate under Windows can cause an "Oops! The debug adapter has terminated abnormally" exception. This appears to be an LLVM, Windows, NVIDIA(?) problem triggered via ureq.

  • This crate follows Nine Rules for Elegant Rust Library APIs from Towards Data Science.
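
The note above about FetchData::new never failing describes a deferred-error pattern. The sketch below shows the general idea with a hypothetical `LazyRegistry` type. It is an illustration under assumptions and does not reflect how FetchData is written internally: the constructor stores either the parsed entries or an error message, and the error only surfaces when the value is first used.

```rust
/// Hypothetical type: its constructor never fails, which suits a global static.
struct LazyRegistry {
    // Either the parsed (file name, hash) pairs or the error found during construction.
    state: Result<Vec<(String, String)>, String>,
}

impl LazyRegistry {
    /// Never returns an error: a parse problem is stored instead.
    fn new(contents: &str) -> Self {
        Self {
            state: parse_pairs(contents),
        }
    }

    /// Any stored construction error is reported here, on use.
    fn hash_for(&self, file: &str) -> Result<&str, String> {
        let entries = self.state.as_ref().map_err(|e| e.clone())?;
        entries
            .iter()
            .find(|(name, _)| name.as_str() == file)
            .map(|(_, hash)| hash.as_str())
            .ok_or_else(|| format!("'{file}' is not in the registry"))
    }
}

/// Split whitespace-delimited text into (file name, hash) pairs.
fn parse_pairs(contents: &str) -> Result<Vec<(String, String)>, String> {
    let tokens: Vec<&str> = contents.split_whitespace().collect();
    if tokens.len() % 2 != 0 {
        return Err("registry has a file name without a hash".to_string());
    }
    Ok(tokens
        .chunks(2)
        .map(|pair| (pair[0].to_string(), pair[1].to_string()))
        .collect())
}
```

The trade-off is that a malformed registry is reported at the first lookup instead of at start-up, in exchange for a constructor that can be called where returning a Result is inconvenient.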
