5 releases
0.2.0 | Jul 29, 2024 |
---|---|
0.1.6 | Oct 20, 2022 |
0.1.5 | Oct 11, 2022 |
0.1.4 | Jun 29, 2022 |
0.1.3 | Jun 29, 2022 |
fetch-data
Fetch data files from a URL, but only if needed. Verify contents via SHA256. Some Python Pooch compatibility.
Fetch-Data checks a local data directory and then downloads needed files. It always verifies the local files and downloaded files via a hash.
Fetch-Data makes it easy to download large and small sample files. For example, here we download a genomics file from GitHub (if it has not already been downloaded). We then print the size of the now local file.
use fetch_data::sample_file;
let path = sample_file("small.fam")?;
println!("{}", std::fs::metadata(path)?.len()); // Prints 85
# Ok::<(), anyhow::Error>(())
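The check-then-download behavior described above can be sketched with the standard library alone. This is not the crate's implementation: `ensure_file`, `Outcome`, and the injected `hash_file` and `download` closures are hypothetical names used only for illustration, and treating a mismatched local file as an error (rather than re-downloading it) is an assumption.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// What `ensure_file` had to do to make the file available.
#[derive(Debug, PartialEq)]
pub enum Outcome {
    /// A local copy already existed.
    AlreadyPresent,
    /// The file was missing, so it was downloaded.
    Downloaded,
}

/// Make sure `path` exists locally and matches `expected_hash`.
///
/// `hash_file` and `download` are hypothetical stand-ins for a real SHA256
/// routine and a real HTTP client; they are injected so that this sketch
/// needs nothing beyond the standard library.
pub fn ensure_file<H, D>(
    path: &Path,
    expected_hash: &str,
    hash_file: H,
    download: D,
) -> io::Result<Outcome>
where
    H: Fn(&Path) -> io::Result<String>,
    D: FnOnce(&Path) -> io::Result<()>,
{
    let outcome = if path.exists() {
        Outcome::AlreadyPresent
    } else {
        // Create the data directory on first use, then download into it.
        if let Some(parent) = path.parent() {
            fs::create_dir_all(parent)?;
        }
        download(path)?;
        Outcome::Downloaded
    };

    // Verify in both cases: a cached file is never trusted blindly.
    let actual = hash_file(path)?;
    if actual.eq_ignore_ascii_case(expected_hash) {
        Ok(outcome)
    } else {
        // Assumption of this sketch: a mismatch is reported, not repaired.
        let message = format!(
            "hash mismatch for {}: expected {expected_hash}, got {actual}",
            path.display()
        );
        Err(io::Error::new(io::ErrorKind::InvalidData, message))
    }
}
```

Passing the hash and download steps as closures keeps the decision logic separate from the network and hashing code, which also makes it easy to exercise in tests without Internet access.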
Features
- Thread-safe -- allowing it to be used with Rust's multithreaded testing framework.
- Inspired by Python's popular Pooch and our PySnpTools filecache module.
- Avoids runtimes such as Tokio (by using ureq to download files via blocking I/O).
Suggested Usage
You can set up FetchData in many ways. Here are the steps, followed by sample code, for one setup.

- Create a registry.txt file containing a whitespace-delimited list of files and their hashes. (This is the same format as Pooch. See section Registry Creation for tips on creating this file.)
- As shown below, create a global static FetchData instance that reads your registry.txt file. Give it:
  - the URL root from which to download the files
  - an environment variable that names the local data directory in which to store the files
  - a qualifier, organization, and application. These are used to create a local data directory when the environment variable is not set. See ProjectDirs in the directories crate for details.
- As shown below, define a public sample_file function that takes a file name and returns a Result containing the path to the downloaded file.

use fetch_data::{ctor, FetchData, FetchDataError};
use std::path::{Path, PathBuf};

#[ctor]
static STATIC_FETCH_DATA: FetchData = FetchData::new(
    include_str!("../registry.txt"),
    "https://raw.githubusercontent.com/CarlKCarlK/fetch-data/main/tests/data/",
    "BAR_APP_DATA_DIR", // env_key
    "com",              // qualifier
    "Foo Corp",         // organization
    "Bar App",          // application
);

/// Download a data file.
pub fn sample_file<P: AsRef<Path>>(path: P) -> Result<PathBuf, Box<FetchDataError>> {
    STATIC_FETCH_DATA.fetch_file(path)
}

You can now use your sample_file function to download your files as needed.
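The choice of local data directory described in the steps above (environment variable first, per-user directory otherwise) can be sketched as follows. This is an illustration rather than the crate's code: `choose_data_dir` and `resolve_data_dir` are hypothetical names, `fallback` stands in for the directory derived from the qualifier, organization, and application, and treating an empty variable as unset is an assumption.

```rust
use std::env;
use std::ffi::OsString;
use std::path::PathBuf;

/// Decide which directory to use, given the value of the environment
/// variable (if any) and a fallback directory.
///
/// Assumption of this sketch: an empty value counts as "not set".
pub fn choose_data_dir(env_value: Option<OsString>, fallback: PathBuf) -> PathBuf {
    match env_value {
        Some(value) if !value.is_empty() => PathBuf::from(value),
        _ => fallback,
    }
}

/// Look up `env_key` in the process environment and apply `choose_data_dir`.
///
/// `fallback` is a hypothetical stand-in for the per-user directory built
/// from the qualifier, organization, and application.
pub fn resolve_data_dir(env_key: &str, fallback: PathBuf) -> PathBuf {
    choose_data_dir(env::var_os(env_key), fallback)
}
```

With the sample code above, setting BAR_APP_DATA_DIR would redirect where files are stored without recompiling, which can be handy on CI machines with a shared cache directory.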
Registry Creation
You can create your registry.txt file in many ways. Here are the steps, followed by sample code, for one way to create it.

- Upload your data files to the Internet.
  - For example, Fetch-Data puts its sample data files in tests/data, so they upload to this GitHub folder. In GitHub, by looking at the raw view of a data file, we see the root URL for these files. In Cargo.toml, we keep these data files out of our crate via exclude = ["tests/data/*"]
- As shown below, write code that
  - Creates a FetchData instance without registry contents.
  - Lists the files in your data directory.
  - Calls the gen_registry_contents method on your list of files. This method will download the files, compute their hashes, and create a string of file names and hashes.
- Print this string, then manually paste it into a file called registry.txt.

use fetch_data::{FetchData, dir_to_file_list};
let fetch_data = FetchData::new(
    "", // registry_contents ignored
    "https://raw.githubusercontent.com/CarlKCarlK/fetch-data/main/tests/data/",
    "BAR_APP_DATA_DIR", // env_key
    "com",              // qualifier
    "Foo Corp",         // organization
    "Bar App",          // application
);
let file_list = dir_to_file_list("tests/data")?;
let registry_contents = fetch_data.gen_registry_contents(file_list)?;
println!("{registry_contents}");
# use fetch_data::FetchDataError; // '#' needed for doctest
# Ok::<(), Box<FetchDataError>>(())
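To make the registry format concrete, here is a minimal sketch of reading whitespace-delimited name/hash lines back into a map. It is not the crate's parser: `parse_registry` is a hypothetical helper, and skipping blank lines, lower-casing hashes, and rejecting lines that do not have exactly two fields are all assumptions of this sketch.

```rust
use std::collections::HashMap;

/// Parse Pooch-style registry text: one `file_name hash` pair per line,
/// separated by any amount of whitespace.
///
/// Assumptions of this sketch (not confirmed by the crate): blank lines are
/// skipped, and any other line without exactly two fields is an error.
pub fn parse_registry(contents: &str) -> Result<HashMap<String, String>, String> {
    let mut registry = HashMap::new();
    for (index, line) in contents.lines().enumerate() {
        let fields: Vec<&str> = line.split_whitespace().collect();
        match fields.as_slice() {
            // Blank (or whitespace-only) line.
            [] => continue,
            [name, hash] => {
                // Store hashes in lower case so later comparisons are simple.
                registry.insert((*name).to_owned(), hash.to_ascii_lowercase());
            }
            _ => {
                return Err(format!(
                    "line {}: expected `file_name hash`",
                    index + 1
                ))
            }
        }
    }
    Ok(registry)
}
```

A registry line would then look like `small.fam` followed by whitespace and the file's SHA256 hash in hexadecimal.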
Notes
- Feature requests and contributions are welcome.
- Don't use our sample sample_file. Define your own sample_file that knows where to find your data files.
- The FetchData instance need not be global and static. See FetchData::new for an example of a non-global instance.
- Additional methods on the FetchData instance can fetch multiple files and can give the path to the local data directory.
- You need not use a registry.txt file and FetchData instance. You can instead use the stand-alone function fetch to retrieve a single file with known URL, hash, and local path.
- Additional stand-alone functions can download files and hash files.
- Fetch-Data always does binary downloads to maintain consistent line endings across OSs.
- The Bed-Reader genomics crate uses Fetch-Data.
- To make FetchData work well as a static global, FetchData::new never fails. Instead, FetchData stores any error and returns it when the first call to fetch_file, etc., is made.
- Debugging this crate under Windows can cause an "Oops! The debug adapter has terminated abnormally" exception. This is some kind of LLVM, Windows, NVIDIA(?) problem via ureq.
- This crate follows Nine Rules for Elegant Rust Library APIs from Towards Data Science.
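
The "constructor never fails" note above describes a general pattern that can be sketched independently of this crate. The `LazyFetcher` type below is hypothetical and much simpler than FetchData; it only illustrates storing a construction error and surfacing it on first use.

```rust
use std::path::PathBuf;

/// A hypothetical stand-in for a fetcher whose constructor cannot fail.
pub struct LazyFetcher {
    // Either the validated data directory or the error found while
    // validating it.
    state: Result<PathBuf, String>,
}

impl LazyFetcher {
    /// Infallible constructor: a bad argument is recorded, not returned.
    pub fn new(data_dir: &str) -> Self {
        let state = if data_dir.trim().is_empty() {
            Err("data directory must not be empty".to_string())
        } else {
            Ok(PathBuf::from(data_dir))
        };
        LazyFetcher { state }
    }

    /// Any error stored by `new` surfaces here, on use.
    pub fn local_path(&self, file_name: &str) -> Result<PathBuf, String> {
        match &self.state {
            Ok(dir) => Ok(dir.join(file_name)),
            Err(message) => Err(message.clone()),
        }
    }
}
```

Because `new` returns the value itself rather than a Result, it can initialize a static without any error handling at the definition site; callers still see the problem through the Result of the methods they call.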