18 releases (6 major breaking)

6.0.0 Jun 17, 2023
5.1.0 Jun 17, 2023
4.0.2 May 30, 2023
3.0.1 May 23, 2023
0.2.0 May 16, 2023

Apache-2.0

WhichLicense detection

This is a library to facilitate the detection of licenses in source code.

Usage

License Detection

Gaoya detection

let mut gaoya = GaoyaDetection {
    index: MinHashIndex::new(num_bands, band_width, 0.5),
    min_hasher: MinHasher32::new(num_bands * band_width),
    shingle_text_size,
    normalization_fn: DEFAULT_NORMALIZATION_FN,
};
gaoya.load_from_file("licenses");
// OR: 
// for l in load_licenses_from_folder("./licenses/RAW"){
//     gaoya.add_plain(&l.name, &strip_spdx_heading(&l.text));
// }

Fuzzyhash-rs Detection

let mut fuzzy = FuzzyDetection {
    licenses: vec![],
    min_confidence: 50,
    exit_on_exact_match: false,
    normalization_fn: DEFAULT_NORMALIZATION_FN,
};
fuzzy.load_from_file("licenses");
// OR: 
// for l in load_licenses_from_folder("./licenses/RAW"){
//     fuzzy.add_plain(&l.name, &strip_spdx_heading(&l.text));
// }

Normalization function

The normalization function normalizes the license text before the algorithm processes it. This lets the algorithm compare the wording of the license rather than its formatting, which improves accuracy (higher confidence).
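To illustrate the idea, here is a minimal sketch of what a normalization function could look like. It is not the crate's `DEFAULT_NORMALIZATION_FN`. The name `normalize_license_text` and the `fn(&str) -> String` shape are assumptions made for this example.

```rust
/// Hypothetical normalizer, written for illustration only.
/// It lowercases the text, turns every non-alphanumeric character into a
/// space, and collapses runs of whitespace into a single space.
fn normalize_license_text(text: &str) -> String {
    // Case differences carry no meaning for license matching.
    let lowered = text.to_lowercase();

    // Punctuation and line breaks become plain spaces.
    let cleaned: String = lowered
        .chars()
        .map(|c| if c.is_alphanumeric() { c } else { ' ' })
        .collect();

    // Collapse whitespace so that differently wrapped copies compare equal.
    cleaned.split_whitespace().collect::<Vec<_>>().join(" ")
}
```

With this sketch, two copies of the same license that differ only in line wrapping, indentation, or punctuation normalize to the same string, so a similarity algorithm sees them as identical. Whether such a function can be assigned to `normalization_fn` directly depends on the type the crate expects, which is not shown here.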

Pipeline System

The pipeline system automatically improves license detection results by processing the input further when, for example, the confidence is too low. A pipeline executes its segments one at a time on the running license and checks the result against the algorithm after each segment. The pipeline stops as soon as the confidence of the top (highest-confidence) license is above the desired confidence.

The steps are as follows:

  1. The pipeline is created with the given segments.
  2. An initial sample is fetched from the algorithm directly without executing any pipeline segment.
  3. The system checks if the confidence of the top (highest confidence) license is above the desired confidence.
    • If it is, the pipeline stops running and returns the results.
    • If it is not, the pipeline continues to step 4.
  4. The next segment is executed on the running license (starts at the first segment).
  5. The system checks if the confidence of the top (highest confidence) license is above the desired confidence.
    • If it is, the pipeline stops running and returns the results.
    • If it is not, the pipeline moves back to step 4 and runs the next segment.

Batched segments allow you to run multiple segments one after the other without checking against (i.e., testing) the algorithm after each segment. The algorithm will be tested after all batched segments have executed.
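The steps and the batching behaviour described above can be sketched roughly as follows. This is not the crate's implementation: `Step`, `apply`, `run_pipeline`, and the `score` closure are hypothetical names for this example, the segments only match plain text (no regex), and the sketch stops when the confidence is greater than or equal to the desired value, which is an assumption.

```rust
/// Hypothetical stand-in for the crate's `Segment`, plain-text matching only.
enum Step {
    Remove(String),
    Replace(String, String),
    Batch(Vec<Step>),
}

/// Applies one step to the running license text.
/// A batch applies all of its inner steps in order, without any check in between.
fn apply(step: &Step, text: &str) -> String {
    match step {
        Step::Remove(needle) => text.replace(needle.as_str(), ""),
        Step::Replace(needle, with) => text.replace(needle.as_str(), with.as_str()),
        Step::Batch(steps) => steps
            .iter()
            .fold(text.to_string(), |acc, s| apply(s, &acc)),
    }
}

/// Runs the steps until the confidence reaches `desired` or no steps remain.
/// `score` stands in for the detection algorithm and returns the confidence
/// of the top license. Returns the final text, its confidence, and how many
/// times the algorithm was checked.
fn run_pipeline<F>(steps: &[Step], input: &str, desired: f32, score: F) -> (String, f32, usize)
where
    F: Fn(&str) -> f32,
{
    let mut running = input.to_string();
    // Step 2: initial sample, taken before any segment is executed.
    let mut confidence = score(&running);
    let mut checks = 1;

    for step in steps {
        // Steps 3 and 5: stop once the confidence is high enough.
        // Assumption: "above" is treated as greater than or equal here.
        if confidence >= desired {
            break;
        }
        // Step 4: execute the next segment on the running license.
        running = apply(step, &running);
        confidence = score(&running);
        checks += 1;
    }

    (running, confidence, checks)
}
```

In this sketch a `Step::Batch` counts as a single step, so the algorithm is checked once after the whole batch instead of once per inner step. That is the saving batching offers when several removals only make sense together.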

Example

let pipeline = Pipeline::new(vec![
    Segment::Remove(Using::Regex(Regex::new(r"...").unwrap())),
    Segment::Remove(Using::Text("...".to_string())),
    Segment::Replace(Using::Text("...".to_string()), "***".to_string()),
    Segment::Batch(vec![
        Segment::Remove(Using::Regex(Regex::new(r"...").unwrap())),
        Segment::Remove(Using::Regex(Regex::new(r"...").unwrap())),
    ]),
]);

let results = pipeline.run(&algorithm, "<your_incoming_license>", 100.0);

Attributions

ScanCode License data

The initial database was generated from the license data of the ScanCode toolkit. If you choose not to use the ScanCode license database, you do not need to include the copyright notice below in your project. If you do use the ScanCode license database, you must include this copyright notice in your project.

Copyright (c) nexB Inc. and others. All rights reserved.
ScanCode is a trademark of nexB Inc.
SPDX-License-Identifier: CC-BY-4.0
See https://creativecommons.org/licenses/by/4.0/legalcode for the license text.
See https://github.com/nexB/scancode-toolkit for support or download.
See https://aboutcode.org for more information about nexB OSS projects.
