Data processing // Lib.rs

Data processing

lib.rs goes well beyond displaying crates.io data as-is. Many crates have incomplete metadata, e.g. lack categories or keywords that would help find the crate. Sometimes the metadata specified by crate authors is incorrect (e.g. the purpose of the parsing category is often misunderstood, or repository links of forked crates still point to the upstream repo instead of the fork, etc.). Download numbers counted by crates.io don't have any throttling or anti-spam measures, so they're biased by automated downloads from web crawlers and uncached CI builds.

To make search work better and crate pages more useful, lib.rs combines data from crates.io with data from github.com, docs.rs, rustsec.org, rustaceans.org, cargo-crev repositories, the cargo-vet registry, and its own datasets and analysis. This means that the combined data does not come only from crate authors. It should be understood as lib.rs's interpretation, which is not necessarily what the crate authors intended.

lib.rs often uses heuristics to complete and fix data. Most of the data quality issues are reported in the maintainer dashboard.

If a crate is missing keywords, they are scraped from the crate's description, readme, source code, and GitHub metadata. This may sometimes pick words that aren't the most relevant. Crate authors can prevent this by filling in the keywords field in Cargo.toml. Keywords are normalized to kebab-case, and some spelling variations and close synonyms are canonicalized.
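The normalization step can be illustrated with a minimal sketch. The function names and the synonym table below are hypothetical, chosen only for illustration; the actual rules and synonym list used by lib.rs are not described here.

```rust
/// Convert a free-form keyword to kebab-case (hypothetical helper).
fn normalize_keyword(raw: &str) -> String {
    let mut out = String::new();
    let mut prev_lower = false;
    for ch in raw.trim().chars() {
        if ch.is_alphanumeric() {
            // Split camelCase boundaries: "noStd" becomes "no-std".
            if ch.is_uppercase() && prev_lower {
                out.push('-');
            }
            out.extend(ch.to_lowercase());
            prev_lower = ch.is_lowercase();
        } else {
            // Any run of separators (space, '_', '.', ...) becomes one hyphen.
            if !out.is_empty() && !out.ends_with('-') {
                out.push('-');
            }
            prev_lower = false;
        }
    }
    // Drop a trailing hyphen left by trailing punctuation.
    while out.ends_with('-') {
        out.pop();
    }
    out
}

/// Map spelling variations and close synonyms to one canonical keyword.
fn canonicalize_keyword(raw: &str) -> String {
    // Assumed sample entries, not the real lib.rs table.
    const SYNONYMS: &[(&str, &str)] = &[
        ("serialisation", "serialization"),
        ("web-assembly", "wasm"),
        ("webassembly", "wasm"),
    ];
    let kebab = normalize_keyword(raw);
    for (variant, canonical) in SYNONYMS {
        if kebab == *variant {
            return (*canonical).to_string();
        }
    }
    kebab
}
```

With these assumed rules, "Web_Assembly" and "webassembly" both end up as the same keyword, so crates using either spelling can be found by one search term.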
lib.rs has slightly different categories than crates.io. Some categories have been merged, because individually they had too few crates, or were often confused with each other. Some categories have been broadened to make a place for crates that did not have a dedicated category.
- Removed api-bindings and external-ffi-bindings. The crates are in topic-specific categories instead (classified by what they are for, not what they are).
- Merged localization into internationalization.
- Merged drones and UAVs into robotics.
- Merged game-engines into game-development.
- Merged graphics into images.
- Merged freebsd and linux into unix-apis.
- Merged no-alloc into no-std.
- Merged multimedia::encoding into other multimedia subcategories.
- Expanded compilers to be programming languages in general.
- Expanded computer-vision into machine learning.
- Expanded macos-apis to include iOS and other Apple platforms.
- Expanded neuroscience into biology.
If a crate lacks categories that lib.rs uses, or has categories that are commonly confused (e.g. parsing vs parser-implementations), then lib.rs will try to deduce the categorization using fuzzy logic based on the crate's keywords, features, dependencies, and categories of similar crates. Sometimes categories are overridden manually if the heuristics don't pick the right ones. In most cases authors can prevent this by filling in the categories and keywords fields in Cargo.toml. Issues with categories are flagged in the maintainer dashboard.
lib.rs uses GitHub's contributor insights to show the top contributors to the project. Because crates.io deprecated the authors metadata field, the names listed in it are not shown, unless they appear to be names of teams or organizations. Individual authors are shown if they can be matched to a GitHub account, in which case lib.rs uses the name from GitHub. The data is cached to stay within the GitHub API quota, so it might take a while for it to update.
If a crate is missing a README on crates.io, lib.rs will search its repository for a README and/or show doc comments from src/lib.rs.
Links in the readme that point to crates.io crate pages are rewritten to equivalent lib.rs pages. Images are proxied through an image resizing service. Relative links in the README depend on the location of the file in its repository at a specific commit. This context is sometimes lost in published crates, so relative links may be broken. There is partial support for Rustdoc-specific Markdown syntax for documentation links.
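As a rough illustration of the link rewriting, the sketch below maps a crates.io crate page URL to a lib.rs URL. The function name is hypothetical, and the sketch assumes that only links of the form https://crates.io/crates/NAME are rewritten to https://lib.rs/crates/NAME; the real rewriting rules may cover more cases.

```rust
/// Rewrite a crates.io crate page URL to a lib.rs page URL.
/// Returns None for links that are not crates.io crate pages.
fn rewrite_crates_io_link(url: &str) -> Option<String> {
    let rest = url.strip_prefix("https://crates.io/crates/")?;
    // Keep only the crate name; drop version paths, queries, and fragments.
    let name = rest
        .split(|c: char| matches!(c, '/' | '?' | '#'))
        .next()?;
    let valid = !name.is_empty()
        && name
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_');
    if valid {
        Some(format!("https://lib.rs/crates/{}", name))
    } else {
        None
    }
}
```

Links that return None would be left untouched in the rendered readme.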
If a crate does not specify a repository URL, lib.rs will check if the crate owner has a repository for the crate on github.com.
If a crate is flagged as unmaintained by rustsec.org or cargo-crev reviewers, the crate page will show an unmaintained badge.
cargo-vet diff reviews are displayed as full reviews if the safe-to-deploy diffs from the same source can be added together to cover all versions of the crate.
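One plausible reading of "added together" is that the diff reviews must form an unbroken chain between versions. The sketch below checks that with a breadth-first search over (from, to) pairs. The function name, the plain-string version representation, and the choice of starting version are all assumptions made for illustration, not a description of how lib.rs or cargo-vet implement it.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Check whether diff reviews chain together from `start` to `target`.
/// Each delta is a (from_version, to_version) pair from a single source.
fn diffs_reach<'a>(deltas: &[(&'a str, &'a str)], start: &'a str, target: &str) -> bool {
    // Index the deltas by the version they start from.
    let mut next: HashMap<&str, Vec<&str>> = HashMap::new();
    for &(from, to) in deltas {
        next.entry(from).or_default().push(to);
    }
    // Breadth-first search over versions reachable through reviewed diffs.
    let mut seen = HashSet::new();
    let mut queue = VecDeque::new();
    seen.insert(start);
    queue.push_back(start);
    while let Some(version) = queue.pop_front() {
        if version == target {
            return true;
        }
        if let Some(tos) = next.get(version) {
            for &to in tos {
                if seen.insert(to) {
                    queue.push_back(to);
                }
            }
        }
    }
    false
}
```

A gap in the chain (for example, diffs for 1.0.0 to 1.1.0 and 1.2.0 to 1.3.0, but nothing for 1.1.0 to 1.2.0) means the later versions are not covered under this reading.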
If a crate's repository is archived on GitHub, the crate page will reflect that.
Relationships between multiple crates are deduced based on the directory structure of their repository (if they share a monorepo), naming schemes (e.g. a _derive suffix), and shared owners on crates.io.
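The naming-scheme part of this heuristic might look like the sketch below. Only the _derive suffix comes from the text above; the "-derive" variant and the function name are assumptions, and a real check would presumably also confirm the guess using the shared owners or repository.

```rust
/// Guess the name of the main crate that a helper crate belongs to,
/// based only on its name.
fn likely_parent_crate(name: &str) -> Option<&str> {
    // "_derive" is from the text; "-derive" is an assumed spelling variant.
    const SUFFIXES: &[&str] = &["_derive", "-derive"];
    SUFFIXES
        .iter()
        .find_map(|suffix| name.strip_suffix(*suffix))
        // A crate named just "_derive" has no parent name left over.
        .filter(|base| !base.is_empty())
}
```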
Some data is based on text in the crate's description and readme, e.g. whether the crate is reserved, deprecated, or internal.
Whether a crate is labelled as a "dev" or "build" dependency is based on how often it's used in [dev-dependencies] and [build-dependencies] compared to the regular dependencies section.
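A minimal sketch of such a comparison is shown below. The simple-majority threshold and the names are assumptions; the text does not say how much more frequent the dev or build usage has to be.

```rust
#[derive(Debug, PartialEq)]
enum DepLabel {
    Normal,
    Dev,
    Build,
}

/// Label a crate by the section in which other crates most often list it.
/// The counts are numbers of dependent crates per Cargo.toml section.
fn label_by_usage(normal: u64, dev: u64, build: u64) -> DepLabel {
    let total = normal + dev + build;
    // Assumed threshold: strictly more than half of all uses.
    if total > 0 && dev * 2 > total {
        DepLabel::Dev
    } else if total > 0 && build * 2 > total {
        DepLabel::Build
    } else {
        DepLabel::Normal
    }
}
```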
lib.rs has a manually curated list of deprecated/obsolete crates (e.g. tokio v0.1 or futures-preview). Libraries may also be marked as deprecated if they have lost the majority of their users (they were removed from the crates that used them, or most crates that used them became unmaintained).
If a crate is missing the rust-version field, the information about the minimum supported Rust version is estimated based on successes and failures of cargo check of the crate and its dependencies.
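The estimation can be sketched as follows, assuming the cargo check results are available as (minor version, success) pairs, e.g. (63, true) for a successful check on Rust 1.63. The function name, the data shape, and the rule of taking the oldest passing version that is newer than every failure are assumptions for illustration.

```rust
/// Estimate the minimum supported Rust version from cargo check results.
/// Each entry is (rust_minor_version, check_succeeded).
fn estimate_msrv(results: &[(u32, bool)]) -> Option<u32> {
    // The newest compiler version on which the check failed, if any.
    let newest_failure = results.iter().filter(|r| !r.1).map(|r| r.0).max();
    // The oldest passing version that is newer than every failure.
    results
        .iter()
        .filter(|r| r.1)
        .map(|r| r.0)
        .filter(|v| newest_failure.map_or(true, |f| *v > f))
        .min()
}
```

Ignoring passes that are older than a failure is a conservative choice: such a pass may be an artifact of a different dependency resolution rather than real compatibility.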
Download numbers are filtered to remove the noise floor, suspicious outliers, and known incidents of spam/manipulation.
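The actual filtering method and thresholds are not documented here. As a purely illustrative sketch, the code below caps daily counts at a multiple of the median (to dampen outliers) and then subtracts a fixed noise floor. The function name and both parameters are hypothetical.

```rust
/// Dampen outliers and subtract a noise floor from daily download counts.
/// `max_ratio` caps each day at `median * max_ratio`; both parameters are
/// assumed tuning knobs, not values used by lib.rs.
fn filter_downloads(daily: &[u64], noise_floor: u64, max_ratio: u64) -> Vec<u64> {
    let mut sorted = daily.to_vec();
    sorted.sort_unstable();
    // Upper median; 0 for an empty series.
    let median = sorted.get(sorted.len() / 2).copied().unwrap_or(0);
    let cap = median.saturating_mul(max_ratio);
    daily
        .iter()
        .map(|&d| d.min(cap).saturating_sub(noise_floor))
        .collect()
}
```

Using the median rather than the mean keeps a single spike from raising its own cap.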
no-std support is inferred based on the presence of std/no-std Cargo features or attributes in src/lib.rs.
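A simplified version of this inference is sketched below. The function name is hypothetical, and the text match on crate-level attributes is an assumption; a more robust check would parse the source instead of scanning lines.

```rust
/// Guess whether a crate supports no_std from its Cargo feature names
/// and the text of src/lib.rs.
fn looks_no_std(features: &[&str], lib_rs_source: &str) -> bool {
    // A "std" feature usually implies that the crate can build without it.
    let has_feature = features
        .iter()
        .any(|f| matches!(*f, "std" | "no-std" | "no_std"));
    // Matches crate-level attributes such as `#![no_std]` or
    // `#![cfg_attr(not(feature = "std"), no_std)]`.
    let has_attr = lib_rs_source
        .lines()
        .map(str::trim)
        .any(|line| line.starts_with("#![") && line.contains("no_std"));
    has_feature || has_attr
}
```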
Bans for name squatting are applied manually.

The list of sources and algorithms is likely to be expanded in the future. See also logic for ranking and outdated dependencies.