5 releases

0.3.2	Aug 7, 2024
0.3.1	Aug 7, 2024
0.3.0	Aug 6, 2024
0.2.1	Aug 4, 2024
0.2.0	Aug 4, 2024

Apache-2.0 OR MIT

190KB
4.5K SLoC

scrapelect

scrapelect is a web scraping language inspired by CSS that turns a web page into structured JSON data. Select elements with CSS selectors, apply filters to extract and modify the data you want from a web page, and get the output in a structured, machine-readable, interoperable format.

installation

Install the Rust toolchain. Using cargo, run:

$ cargo install scrapelect

to install the scrapelect interpreter.

usage

Write a scrapelect program in a .scrp file. Documentation for the language is in the scrapelect book.

A quick example, title.scrp, retrieves the title of a Wikipedia article:

title: .mw-page-title-main {
  content: $element | text();
};

Run scrapelect with the program file and the URL of the web page to scrape:

$ scrapelect title.scrp "https://en.wikipedia.org/wiki/Cat"

It will output:

{
  "title": {
    "content": "Cat"
  }
}
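
Because the output is plain JSON, other tools can consume it directly. The following is a minimal sketch in Python, assuming the scrapelect binary is on your PATH and writes its JSON result to standard output. The helper names run_scrapelect and extract_title are hypothetical and are not part of scrapelect itself.

```python
import json
import subprocess


def run_scrapelect(program_path: str, url: str) -> dict:
    """Run the scrapelect CLI on a .scrp program and parse its output.

    Assumption: `scrapelect` is on PATH and prints JSON to stdout.
    """
    result = subprocess.run(
        ["scrapelect", program_path, url],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return json.loads(result.stdout)


def extract_title(data: dict) -> str:
    """Return the `content` value nested under `title`, the shape title.scrp produces."""
    return data["title"]["content"]


# Example usage (requires scrapelect to be installed and network access):
#   page = run_scrapelect("title.scrp", "https://en.wikipedia.org/wiki/Cat")
#   print(extract_title(page))
```

Keeping the subprocess call separate from the field lookup means the lookup can be checked against saved output without running the scraper again.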

documentation

The scrapelect book contains documentation on language concepts and how to write a scrapelect program.
Additionally, documentation for scrapelect's built-in filters is located at docs.rs.
Developer-level documentation is also on docs.rs, but it is currently incomplete.

community

GitHub issues and discussions are great places to report bugs, request features, and get help using scrapelect.
Also, consider submitting a pull request to contribute to the code or documentation.
See the contributing chapter of the scrapelect book for more information on contributing to scrapelect.

license

scrapelect is available under either the MIT license or the Apache License 2.0, at your option. Copies of these licenses are included as LICENSE-MIT and LICENSE-APACHE in the root directory.

scrapelect: scrape + select, also -lect
