
nightly bin+lib scrapelect

Interpreter for scrapelect, a CSS-inspired web scraping DSL

5 releases

0.3.2 Aug 7, 2024
0.3.1 Aug 7, 2024
0.3.0 Aug 6, 2024
0.2.1 Aug 4, 2024
0.2.0 Aug 4, 2024

#1457 in Web programming

Apache-2.0 OR MIT

190KB
4.5K SLoC

JavaScript 2.5K SLoC // 0.2% comments
Rust 2.5K SLoC // 0.0% comments

scrapelect

scrapelect is a web scraping language inspired by CSS that turns a web page into structured JSON data. Select elements with CSS selectors, apply filters to extract and modify the data you want from a web page, and get the output in a structured, machine-readable, interoperable format.

installation

Install the Rust toolchain. Using cargo, run:

$ cargo install scrapelect

to install the scrapelect interpreter.
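
Before writing programs, it can help to confirm that the interpreter is actually reachable. The following is a minimal Python sketch of such a preflight check, not something scrapelect ships: `find_tool` is a hypothetical helper name, and the fallback assumes a Unix-like system where cargo places binaries in its default location, `~/.cargo/bin` (or `$CARGO_HOME/bin` when that variable is set).

```python
import os
import shutil
from pathlib import Path
from typing import Optional


def find_tool(name: str) -> Optional[str]:
    """Return the path to an executable called `name`, or None if not found."""
    # First look on PATH, the same way a shell would.
    found = shutil.which(name)
    if found:
        return found
    # Fall back to cargo's default bin directory (assumption: no custom
    # install root was passed to `cargo install`).
    cargo_home = Path(os.environ.get("CARGO_HOME", Path.home() / ".cargo"))
    candidate = cargo_home / "bin" / name
    if candidate.is_file() and os.access(candidate, os.X_OK):
        return str(candidate)
    return None


path = find_tool("scrapelect")
if path is None:
    print("scrapelect not found; run `cargo install scrapelect`")
else:
    print(f"scrapelect found at {path}")
```

If the binary is found only through the fallback, adding cargo's bin directory to `PATH` lets the `scrapelect` command be run directly.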

usage

Write a scrapelect program in a .scrp file. The language is documented in the scrapelect book.

A quick example, title.scrp, retrieves the title of a Wikipedia article:

title: .mw-page-title-main {
  content: $element | text();
};

Run the program, passing the URL of the web page to scrape:

$ scrapelect title.scrp "https://en.wikipedia.org/wiki/Cat"

It will output:

{
  "title": {
    "content": "Cat"
  }
}
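
Because the result is plain JSON, it can be consumed from other tools. The following is a minimal Python sketch of that idea, under stated assumptions: `extract_title` and `run_scrapelect` are hypothetical helper names, and `run_scrapelect` assumes the interpreter writes its JSON to standard output. The sample string is the output shown above, so the parsing step can be tried without scrapelect installed.

```python
import json
import subprocess

# Sample output copied from the title.scrp example above.
SAMPLE_OUTPUT = """
{
  "title": {
    "content": "Cat"
  }
}
"""


def extract_title(output: str) -> str:
    """Parse scrapelect's JSON output and return the scraped title text."""
    data = json.loads(output)
    return data["title"]["content"]


def run_scrapelect(program: str, url: str) -> str:
    """Run the scrapelect CLI and return what it printed.

    Assumption: the JSON result is written to standard output.
    """
    result = subprocess.run(
        ["scrapelect", program, url],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout


print(extract_title(SAMPLE_OUTPUT))  # prints "Cat"
```

With scrapelect installed, `extract_title(run_scrapelect("title.scrp", "https://en.wikipedia.org/wiki/Cat"))` would chain the two steps.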

documentation

  • The scrapelect book contains documentation on language concepts and how to write a scrapelect program.
  • Additionally, documentation for scrapelect's built-in filters is located at docs.rs.
  • Developer-level documentation is also at docs.rs, but it is currently incomplete.

community

  • GitHub issues and discussions are great places to report bugs, request features, and get help using scrapelect.
  • Also, consider submitting a pull request to contribute to the code or documentation.
  • See the contributing chapter of the scrapelect book for more information on contributing to scrapelect.

license

scrapelect is available under the MIT or Apache 2.0 license, at your option. Copies of these licenses are included as LICENSE-MIT and LICENSE-APACHE in the root directory.

scrapelect: scrape + select, also -lect

Dependencies

~11–23MB
~354K SLoC