#scrape #cli #search

app scwape

A tool to scrape the web via CSS selectors

3 releases

0.1.5 Jul 30, 2021
0.1.4 Jul 30, 2021

#35 in #scrape

MIT and GPL-3.0-only

17KB
270 lines

Scwape

A command line tool in Rust to scrape data from websites via CSS selectors.

Install

cargo install scwape

Usage

# syntax: scwape <url> -s "#css-selector"
# get all elements with an href from wikipedia's home page
scwape "https://www.wikipedia.org/" -s '[href]'
# get mdn's list of css selectors
scwape "https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors" -s '[href*="/en-US/docs/Web/CSS/"]'
# syntax: scwape <file> -s "#css-selector" -f "format specifier\n"
# get the headers in mdn's list of css selectors. When specifying a format, appending a newline is typically desirable. 
curl "https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors" > selector.html
scwape selector.html -s "h2" -f "\id: \text\n"

The default format is \text\n, and extra format specifiers are ignored.

scwape <url_or_file> -s "#selector1" -s ".selector2" -f -f "format1\n" -f "format2\n" -f "format3\n"

is equivalent to

scwape <url_or_file> -s "#selector1" -s ".selector2" -f "format1\n" -f "format2\n"

The blank -f and the extra "format3\n are ignored.

The possible format specifiers are

  • Id (\id, the element id)
  • Name (\name, the element name)
  • Classes (\classes, the classes for the element)
  • Text (\text, the combined text of child nodes for the element)
  • Html (\html, the html of the element)
  • Attrs (\attrs, the attributes of the element)

\id will be replaced, but \\id, \\\id, and so on will not. Likewise for the other format specifiers.

The disparate -d option exists to allow for printing out each selector independently. The default behavior is to print the matching elements for all selectors in the order they appear. The disparate option would instead print the elements for the first selector, then the second, then the third and so on.

Completions

Fish and Bash shell completions are available on github and are generated upon cargo build. To generate your own, select the appropiate shell in build.rs, then run cargo build. The shell completion will be available in the completions directory. The list of available shells are in clap's documentation.

Dependencies

~11–24MB
~384K SLoC