1 unstable release: 0.1.0 (Jan 15, 2023)
data_downloader
This crate provides a simple way to download files. In particular this crate aims to make it easy to download and cache files that do not change over time, for example reference image files, ML models, example audio files or common password lists.
Roadmap
- Test concurrency
- Add an expected_size (an optional size) to DownloadRequest
  - If the download is bigger than that, fail
  - If it is None, apply no upper limit
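The planned size limit could behave roughly as follows. This is only a sketch of the roadmap item: the function name check_size, the u64 type and the String error are assumptions, not part of the crate.

```rust
/// Hypothetical check for the planned `expected_size` field (not in the crate yet).
/// `None` means no upper limit; `Some(max)` rejects any download larger than `max`.
fn check_size(expected_size: Option<u64>, downloaded: u64) -> Result<(), String> {
    match expected_size {
        Some(max) if downloaded > max => Err(format!(
            "download is {downloaded} bytes, expected at most {max}"
        )),
        _ => Ok(()),
    }
}
```

A download of exactly the expected size passes; only a strictly larger one fails.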
Downloading a file
As an example: to download the plaintext version of RFC 2068 you construct a DownloadRequest with the URL and SHA-256 checksum and then use the get function.
If you know that the file was already downloaded you can use get_cached.
use data_downloader::{get, get_cached, DownloadRequest};

// Define where to get the file from
let rfc_link = &DownloadRequest {
    url: "https://www.rfc-editor.org/rfc/rfc2068.txt",
    name: "rfc2068.txt",
    sha256_hash: &hex_literal::hex!(
        "D6C4E471389F2D309AB1F90881576542C742F95B115336A346447D052E0477CF"
    ),
};

// Get the binary contents of the file
let rfc: Vec<u8> = get(rfc_link)?;

// Convert the file to a String
let as_text = String::from_utf8(rfc)?;
assert!(as_text.contains("The Hypertext Transfer Protocol (HTTP) is an application-level"));
assert!(as_text.contains("protocol for distributed, collaborative, hypermedia information"));
assert!(as_text.contains("systems."));

// Get the binary contents of the file directly from disk
let rfc: Vec<u8> = get_cached(rfc_link)?;
get_path can be used to get a PathBuf to the file. Note that get_path does not download the file, so you have to call get first.
One of the design goals of this crate is to verify the integrity of the
downloaded files, so the SHA-256 checksum of every download is checked.
If a file is loaded from the cache on disk, its SHA-256 checksum is also
verified. For get_path, however, the checksum is not verified, because even
if it were, you would still be exposed to a TOCTOU (time-of-check to
time-of-use) vulnerability: the file can change between the check and your use of the path.
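The race can be illustrated with a small standard-library sketch. Everything here is hypothetical: get_verified_path is not a crate function, and a plain byte comparison stands in for the SHA-256 check.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Stand-in for a checksum comparison; a real check would hash with SHA-256.
fn verify(path: &Path, expected: &[u8]) -> io::Result<bool> {
    Ok(fs::read(path)? == expected)
}

/// Hypothetical "verified" getter: checks the file, then hands out only the path.
fn get_verified_path(path: &Path, expected: &[u8]) -> io::Result<Option<PathBuf>> {
    Ok(verify(path, expected)?.then(|| path.to_path_buf()))
}

/// Returns (valid at time of check, valid at time of use).
fn toctou_demo(dir: &Path) -> io::Result<(bool, bool)> {
    let file = dir.join(format!("toctou_demo_{}.txt", std::process::id()));
    fs::write(&file, b"trusted")?;
    // Time of check: the contents match, so a path is returned.
    let checked = get_verified_path(&file, b"trusted")?;
    // Something else modifies the file before the caller opens the path.
    fs::write(&file, b"tampered")?;
    // Time of use: the earlier check says nothing about the current contents.
    let valid_at_use = fs::read(&file)? == b"trusted";
    fs::remove_file(&file)?;
    Ok((checked.is_some(), valid_at_use))
}
```

The check succeeds and the later read still sees different contents, which is why verifying inside a path-returning function would give a false sense of safety.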
The get, get_cached and get_path functions use a default
directory to cache the downloads; this allows multiple applications to share
their cached downloads. If you need more configurability you can use
Downloader and set the storage directory manually using
Downloader::new_with_dir. The default storage directory is a platform
specific cache directory, or a platform specific temporary directory if the
cache directory is not available.
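The fallback rule described above can be sketched with the standard library alone. The function name and the "data_downloader" subdirectory are assumptions; the cache_dir argument stands for whatever a lookup such as dirs::cache_dir() returns.

```rust
use std::path::PathBuf;

/// Sketch of the fallback: prefer the platform cache directory, otherwise
/// fall back to the platform temporary directory.
/// The "data_downloader" subdirectory name is an assumption for illustration.
fn default_storage_dir(cache_dir: Option<PathBuf>) -> PathBuf {
    cache_dir
        .unwrap_or_else(std::env::temp_dir)
        .join("data_downloader")
}
```

Taking the cache directory as a parameter keeps the sketch free of external crates and makes both branches easy to exercise.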
Included DownloadRequests
The files module contains some predefined DownloadRequests for your convenience.
Pitfalls
When manually changing a DownloadRequest, the SHA-256 sum needs to be changed
too. If this is not done, the result can be a DownloadRequest that looks as if
it downloads a specific file but instead returns something else. For example,
here the above DownloadRequest was changed but only the url was adapted. Since
neither the name nor the sha256_hash is set to the correct value, this will
return rfc2068.txt from the cache. This is a user error, as the developer
has to ensure that they specify the correct SHA-256 checksum for a
DownloadRequest.
&DownloadRequest {
    url: "https://www.rfc-editor.org/rfc/rfc7168.txt",
    name: "rfc2068.txt",
    sha256_hash: &hex_literal::hex!(
        "D6C4E471389F2D309AB1F90881576542C742F95B115336A346447D052E0477CF"
    ),
};
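A toy model may make the pitfall easier to see. It is not the crate's implementation: Request, Cache and the shortened hash label are hypothetical, and the only behaviour modelled is the one described above, namely that a cached file matching name and hash is returned without the URL being consulted.

```rust
use std::collections::HashMap;

/// Simplified stand-in for the crate's request type (hash shortened to a label).
struct Request<'a> {
    #[allow(dead_code)] // never read on a cache hit, which is the whole point
    url: &'a str,
    name: &'a str,
    hash: &'a str,
}

/// Toy cache mapping file name -> (hash, contents).
struct Cache(HashMap<String, (String, Vec<u8>)>);

impl Cache {
    /// Returns the cached bytes if name and hash match; the URL plays no role.
    fn get(&self, req: &Request) -> Option<&[u8]> {
        let (hash, bytes) = self.0.get(req.name)?;
        (hash.as_str() == req.hash).then(|| bytes.as_slice())
    }
}
```

With rfc2068.txt already cached, a request that points at the RFC 7168 URL but keeps the old name and hash still yields the RFC 2068 contents.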
Status of this crate
This is an early release. As such breaking changes are expected at some point. There are also some implementation limitations including but not limited to:
- The downloading is rather primitive. Failed downloads are simply retried once and no continuation of interrupted downloads is implemented.
- The default timeouts of reqwest are used. As such, large downloads on slow connections can fail.
- Only one URL is used per DownloadRequest; it's not currently possible to specify multiple possible locations for a file.
- Only single files are supported; unpacking of zips is not supported.
- The crate uses blocking IO.
Contributions to improve this are welcome.
Nevertheless this crate should be suitable for simple use cases.
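As a rough illustration of the "retried once" behaviour mentioned in the list above, a generic helper could look like the sketch below. The name retry_once is an assumption and the crate's actual retry code may differ.

```rust
/// Runs `attempt`; if it fails, runs it exactly one more time and returns
/// that second result, whatever it is. No backoff and no resuming.
fn retry_once<T, E>(mut attempt: impl FnMut() -> Result<T, E>) -> Result<T, E> {
    attempt().or_else(|_| attempt())
}
```

The first error is discarded, so a caller only ever sees the outcome of the last attempt.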
Dependencies
This crate uses the following dependencies:
- dirs to find platform specific temporary and cache directories
  - Implementing this manually would only cause incompatibilities.
- reqwest to issue HTTP requests
  - An HTTP library is definitely required to allow this crate to download files. reqwest is widely used in the Rust community; it is however a rather big dependency as it is very fully featured. It might be worth investigating smaller HTTP client libraries in the future.
- sha2 to hash files
  - To ensure the integrity of the files a collision resistant cryptographic hash function is required. SHA-256 is generally considered the standard for such a use case. The sha2 crate by the RustCrypto organisation is the de facto standard implementation of SHA-2 for Rust.
- hex-literal to conveniently specify the SHA-256 sums
  - Technically this dependency could be removed if we specified the SHA-256 in the predefined DownloadRequest directly as &[u8] slice literals. However, the library is maintained by the RustCrypto organisation and as such can be regarded as trustworthy.
- thiserror to conveniently derive Error
  - This library is also very widely used and maintained by David Tolnay, a highly regarded member of the Rust community. Once data_downloader has sufficiently matured it might be a good idea to stop using thiserror and instead directly use the generated implementations in the code. This would potentially reduce build times. This has however low priority, especially while the Error type is still changing frequently.
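To show what the hex-literal item above means by slice literals, here is the checksum from the RFC 2068 example written out by hand, together with a small runtime decoder that cross-checks it. The constant and function names are assumptions made for this sketch.

```rust
/// SHA-256 of rfc2068.txt from the example request, as a plain byte array.
const RFC2068_SHA256: [u8; 32] = [
    0xD6, 0xC4, 0xE4, 0x71, 0x38, 0x9F, 0x2D, 0x30,
    0x9A, 0xB1, 0xF9, 0x08, 0x81, 0x57, 0x65, 0x42,
    0xC7, 0x42, 0xF9, 0x5B, 0x11, 0x53, 0x36, 0xA3,
    0x46, 0x44, 0x7D, 0x05, 0x2E, 0x04, 0x77, 0xCF,
];

/// Runtime hex decoder, used here only to cross-check the literal above.
/// Returns `None` for odd lengths or non-hex characters.
fn decode_hex(s: &str) -> Option<Vec<u8>> {
    if s.len() % 2 != 0 || !s.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    (0..s.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&s[i..i + 2], 16).ok())
        .collect()
}
```

The hand-written array avoids the dependency but is far easier to mistype than a single hex string, which is the convenience the macro provides.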