#html-xml #grep #json #cat #tool

app web-grep

A Grep Tool for HTML or XML

5 releases

0.1.4 Feb 21, 2021
0.1.3 Jan 22, 2021
0.1.2 Jan 18, 2021
0.1.1 Jan 18, 2021
0.1.0 Jan 18, 2021

#1628 in Text processing

MIT license

9KB
124 lines

web-grep

What this?

Grep for HTML or XML.

$ echo '<a>Hello</a>' | web-grep '<a>{}</a>'
Hello
$ echo '<a>Hello</a>' | web-grep '<a>{html}</a>' --json
{"html":"Hello"}
# List up all <p>-innerHTML
$ cat << EOM | web-grep '<p>{}</p>'
<body>
  <p>hello</p>
  <div>
    <p>world</p>
  </div>
</body>
EOM
hello
world
# filtering with attributes
$ cat << EOM | web-grep '<p class=here>{}</p>'
<body>
  <p class="not-here">hello</p>
  <div>
    <p class="here">world</p>
  </div>
</body>
EOM
world
# Place-holder {} can be attribute
$ cat << EOM | web-grep '<p class={}>world</p>'
<body>
  <p class="not-here">hello</p>
  <div>
    <p class="here">world</p>
  </div>
</body>
EOM
here

How this?

This is just a CLI for an awesome library, tanakh/easy-scraper.

Installation

  1. Install cargo
    • Recommended Way: Install rustup
  2. Then,
    • cargo install web-grep

Usage

$ web-grep <QUERY> [INPUT]

The QUERY is a HTML (XML) Pattern.

Patterns are valid HTML structures which has placeholders for innerHTMLs or attributes. web-grep has various placeholders for cases.

Placeholders

Anonymous Palceholder {}

If you need exact one placeholer in the pattern, use {}.

<p>{}</p>
<p class="here">
    <q>{}</q>
</p>

web-grep outputs all texts matching for {}.

$ echo "<p>1</p><p>2</p><p>3</p>" | web-grep "<p>{}</p>"
1
2
3

Numbered Placeholders {n}

<a href="{1}">{2}</a>

web-grep outputs matched texts for {1}, {2}... in order, separated by \t.

$ echo '<a href=hoge>fuga</a>' | web-grep "<a href={2}>{1}</a>"
fuga	hoge

The delimiter can be specified with -F.

$ echo '<a href=hoge>fuga</a>' | web-grep "<a href={2}>{1}</a>" -F ' '
fuga hoge

Named Placeholders {xxx}

<a href="{href}">{innerHTML}</a>

The output can be formatted as JSON with --json.

$ echo '<a href=hoge>fuga</a>' | web-grep "<a href={href}>{html}</a>" --json
{"href":"hoge","html":"fuga"}

Dependencies

~7–14MB
~166K SLoC