5 releases
Version | Released |
---|---|
0.1.4 | Feb 21, 2021 |
0.1.3 | Jan 22, 2021 |
0.1.2 | Jan 18, 2021 |
0.1.1 | Jan 18, 2021 |
0.1.0 | Jan 18, 2021 |
#1628 in Text processing
9KB
124 lines
web-grep
What is this?
Grep for HTML or XML.
$ echo '<a>Hello</a>' | web-grep '<a>{}</a>'
Hello
$ echo '<a>Hello</a>' | web-grep '<a>{html}</a>' --json
{"html":"Hello"}
# List the innerHTML of every <p>
$ cat << EOM | web-grep '<p>{}</p>'
<body>
<p>hello</p>
<div>
<p>world</p>
</div>
</body>
EOM
hello
world
# Filter by attribute
$ cat << EOM | web-grep '<p class=here>{}</p>'
<body>
<p class="not-here">hello</p>
<div>
<p class="here">world</p>
</div>
</body>
EOM
world
# The placeholder {} can also capture an attribute value
$ cat << EOM | web-grep '<p class={}>world</p>'
<body>
<p class="not-here">hello</p>
<div>
<p class="here">world</p>
</div>
</body>
EOM
here
How does it work?
This is just a CLI for an awesome library, tanakh/easy-scraper.
Installation
- Install cargo
- Recommended way: install rustup
- Then run:
cargo install web-grep
Usage
$ web-grep <QUERY> [INPUT]
The QUERY is an HTML (XML) pattern.
A pattern is a valid HTML structure that contains placeholders for innerHTML or attribute values.
web-grep supports several kinds of placeholders for different cases.
Placeholders
Anonymous Placeholder {}
If you need exactly one placeholder in the pattern, use {}.
<p>{}</p>
<p class="here">
<q>{}</q>
</p>
web-grep outputs every text that matches {}.
$ echo "<p>1</p><p>2</p><p>3</p>" | web-grep "<p>{}</p>"
1
2
3
Numbered Placeholders {n}
<a href="{1}">{2}</a>
web-grep outputs the matched texts for {1}, {2}, ... in numeric order, separated by \t.
$ echo '<a href=hoge>fuga</a>' | web-grep "<a href={2}>{1}</a>"
fuga hoge
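Because the fields are tab-separated by default, the output can be fed to standard column tools. The sketch below does not call web-grep itself: `sample_output` is a hypothetical stand-in that prints the line from the example above, and `nth_field` is an illustrative helper, not part of web-grep.

```shell
# Stand-in for the output of the example above: two fields joined by a tab.
sample_output() {
  printf 'fuga\thoge\n'
}

# Print the n-th tab-separated field of every input line.
# (Hypothetical helper; cut splits on tabs by default.)
nth_field() {
  cut -f "$1"
}

# Keep only the value captured by {2}.
sample_output | nth_field 2
```

In real use, the web-grep pipeline from the example would take the place of `sample_output`.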
The delimiter can be specified with -F.
$ echo '<a href=hoge>fuga</a>' | web-grep "<a href={2}>{1}</a>" -F ' '
fuga hoge
Named Placeholders {xxx}
<a href="{href}">{innerHTML}</a>
The output can be formatted as JSON with --json.
$ echo '<a href=hoge>fuga</a>' | web-grep "<a href={href}>{html}</a>" --json
{"href":"hoge","html":"fuga"}
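JSON output is convenient when a later step needs values by name rather than by position. The sketch below is one possible way to consume it, under two assumptions: that each match is printed as one JSON object per line (the single-match example above is consistent with this but does not confirm it), and that python3 is available. `sample_json` and `json_field` are hypothetical names introduced for illustration.

```shell
# Stand-in for the output of the --json example above.
sample_json() {
  printf '%s\n' '{"href":"hoge","html":"fuga"}'
}

# Print the value of one key from each JSON line on stdin.
# (Hypothetical helper; relies on python3's standard json module.)
json_field() {
  python3 -c 'import json, sys
key = sys.argv[1]
for line in sys.stdin:
    line = line.strip()
    if line:
        print(json.loads(line)[key])' "$1"
}

# Extract the captured href value.
sample_json | json_field href
```

A dedicated JSON tool would work just as well here; the helper only shows that named placeholders map directly to JSON keys.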
Dependencies
~7–14MB
~166K SLoC