
visdom

An HTML document parsing and manipulation library that uses APIs similar to jQuery; easy to use for web scraping and for obfuscating HTML

29 releases

new 0.3.1 Mar 7, 2021
0.3.0 Mar 7, 2021
0.2.8 Mar 4, 2021
0.2.3 Feb 24, 2021
0.0.8 Jan 29, 2021


MIT license


Visdom


A server-side HTML document parsing and manipulation library written in Rust. It uses APIs similar to jQuery, leaves out the parts that only work in a browser (e.g. rendering and event-related methods), and uses snake_case names instead of the camelCase names used in JavaScript.

It is not only helpful for web scraping; it also provides useful APIs for operating on text nodes, so you can mix your HTML with dirty HTML segments to keep web scrapers away.

Usage

Chinese API documentation    CHANGELOG    Live Demo

main.rs

use visdom::Vis;
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
  let html = r##"
    <!DOCTYPE html>
    <html>
      <head>
        <meta charset="utf-8" />
      </head>
      <body>
        <nav id="header">
          <ul>
            <li>Hello,</li>
            <li>Vis</li>
            <li>Dom</li>
          </ul>
        </nav>
      </body>
    </html>
  "##;
  // load the HTML into an `Elements` collection
  let nodes = Vis::load(html)?;
  // `text()` concatenates the text of every matched `li`
  let lis_text = nodes.find("#header li").text();
  println!("{}", lis_text);
  // prints "Hello,VisDom"
  Ok(())
}

Vis

Static method: load(html: &str) -> Result<Elements, Box<dyn Error>>

Load the `html` string into an `Elements` collection.

Static method: load_catch(html: &str, handle: Box<dyn Fn(Box<dyn Error>)>) -> Elements

Load the `html` string into an `Elements` collection, and use `handle` to deal with errors such as HTML parse errors or invalid selectors. This is useful if you don't want the process to panic on errors.
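
A handle of type `Box<dyn Fn(Box<dyn Error>)>` cannot mutate captured variables directly, so collecting the reported errors needs interior mutability. The sketch below uses only the standard library to show that pattern. `parse_numbers_catch` and `collect_with_errors` are hypothetical stand-ins written for illustration and are not part of visdom; the sketch also assumes that a `*_catch` style function calls the handle once per error and then carries on.

```rust
use std::cell::RefCell;
use std::error::Error;
use std::rc::Rc;

// Hypothetical stand-in for a `*_catch` style loader (NOT a visdom function):
// instead of returning a `Result`, it reports each failure to `handle`.
fn parse_numbers_catch(input: &str, handle: Box<dyn Fn(Box<dyn Error>)>) -> Vec<i32> {
    let mut numbers = Vec::new();
    for token in input.split_whitespace() {
        match token.parse::<i32>() {
            Ok(n) => numbers.push(n),
            // The concrete error is boxed into `Box<dyn Error>` for the handle.
            Err(e) => handle(Box::new(e)),
        }
    }
    numbers
}

// Hypothetical helper: runs the loader and returns the parsed values together
// with the error messages that the handle collected.
fn collect_with_errors(input: &str) -> (Vec<i32>, Vec<String>) {
    // `Fn` closures cannot mutate their captures, so the error list is shared
    // through `Rc<RefCell<..>>` and mutated via `borrow_mut`.
    let errors = Rc::new(RefCell::new(Vec::new()));
    let sink = Rc::clone(&errors);
    let handle: Box<dyn Fn(Box<dyn Error>)> =
        Box::new(move |e| sink.borrow_mut().push(e.to_string()));
    let numbers = parse_numbers_catch(input, handle);
    let messages = errors.borrow().clone();
    (numbers, messages)
}

fn main() {
    let (numbers, errors) = collect_with_errors("1 two 3");
    println!("parsed: {:?}, errors: {:?}", numbers, errors);
}
```

The same `Rc<RefCell<..>>` capture should work for any handle with this signature, because the boxed closure has to own everything it captures.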

Static method: load_options(html: &str, options: html::ParseOptions) -> Result<Elements, Box<dyn Error>>

This method allows you to define the parse options used when parsing the `html` string into a document tree. The `load` method is just an alias of this method with the most compatible parse options.

// `load` and `load_catch` use the parse options below;
// for more about `ParseOptions`, see the documentation of the `rphtml` library.
ParseOptions{
  auto_fix_unclosed_tag: true,
  auto_fix_unexpected_endtag: true,
  auto_fix_unescaped_lt: true,
  allow_self_closing: true,
}

Static method: load_options_catch(html: &str, options: html::ParseOptions, handle: Box<dyn Fn(Box<dyn Error>)>) -> Elements

Combines `load_options` and `load_catch`: it takes a parse options parameter, so you can define how errors are resolved when parsing the HTML, and it passes any errors to `handle`.

Static method: dom(ele: &BoxDynElement) -> Elements

Wrap the `ele` node in a single-node `Elements` collection. This copies `ele`; you don't need it if you only want to call the methods of the `BoxDynElement` itself.

e.g.:

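// continuing from the code above
let lis = nodes.find("#header li");
let texts = lis.map(|_index, ele| {
  // wrap the borrowed element so that the `Elements` methods are available
  let ele = Vis::dom(ele);
  String::from(ele.text())
});
// now `texts` is a `Vec<String>`: ["Hello,", "Vis", "Dom"]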

API

The following APIs are inherited from the mesdoc library.

Trait methods

| Instance | Trait | Inherit | Document |
| :-- | :-- | :-- | :-- |
| `BoxDynNode` | `INodeTrait` | None | INodeTrait Document |
| `BoxDynElement` | `IElementTrait` | `INodeTrait` | IElementTrait Document |
| `BoxDynText` | `ITextTrait` | `INodeTrait` | ITextTrait Document |
| `Box<dyn IDocumentTrait>` | `IDocumentTrait` | None | IDocumentTrait Document |

Collections APIs

| Collections | Document |
| :-- | :-- |
| `Elements` | Elements Document |
| `Texts` | Texts Document |

Selector Operation

| Selector API | Description | Remarks |
| :-- | :-- | :-- |
| The caller `Self` is an `Elements`; returns `Elements` | | All of these APIs behave the same as in the jQuery library. |
| `find(selector: &str)` | Get the descendants of each element in `Self`, filtered by the selector. | |
| `filter(selector: &str)` | Reduce `Self` to the elements that match the selector. | |
| `filter_by(handle: \|index: usize, ele: &BoxDynElement\| -> bool)` | Reduce `Self` to the elements that pass the handle function test. | |
| `filter_in(elements: &Elements)` | Reduce `Self` to the elements that are also in `elements`. | |
| `not(selector: &str)` | Remove the elements that match the selector from `Self`. | |
| `not_by(handle: \|index: usize, ele: &BoxDynElement\| -> bool)` | Remove the elements that pass the handle function test from `Self`. | |
| `not_in(elements: &Elements)` | Remove the elements that are also in `elements` from `Self`. | |
| `is(selector: &str)` | Check whether at least one element in `Self` matches the selector. | |
| `is_by(handle: \|index: usize, ele: &BoxDynElement\| -> bool)` | Check whether the handle function returns `true` for at least one element in `Self`. | |
| `is_in(elements: &Elements)` | Check whether at least one element in `Self` is also in `elements`. | |
| `is_all(selector: &str)` | Check whether every element in `Self` matches the selector. | |
| `is_all_by(handle: \|index: usize, ele: &BoxDynElement\| -> bool)` | Check whether the handle function returns `true` for every element in `Self`. | |
| `is_all_in(elements: &Elements)` | Check whether every element in `Self` is also in `elements`. | |
| `has(selector: &str)` | Reduce `Self` to the elements that have a descendant matching the selector. | |
| `has_in(elements: &Elements)` | Reduce `Self` to the elements that have a descendant in `elements`. | |
| `children(selector: &str)` | Get the children of each element in `Self`; when the selector is not empty, the result is filtered by the selector. | |
| `parent(selector: &str)` | Get the parent of each element in `Self`; when the selector is not empty, the result is filtered by the selector. | |
| `parents(selector: &str)` | Get the ancestors of each element in `Self`; when the selector is not empty, the result is filtered by the selector. | |
| `parents_until(selector: &str, filter: &str, contains: bool)` | Get the ancestors of each element in `Self`, up to the ancestor that matches the selector. When `contains` is `true`, the matched ancestor is included; otherwise it is excluded. When `filter` is not empty, the result is filtered by the `filter` selector. | |
| `closest(selector: &str)` | Get the first matching element for each element in `Self`, traversing from the element itself up through its ancestors. | |
| `siblings(selector: &str)` | Get the siblings of each element in `Self`; when the selector is not empty, the result is filtered by the selector. | |
| `next(selector: &str)` | Get the next sibling of each element in `Self`; when the selector is not empty, the result is filtered by the selector. | |
| `next_all(selector: &str)` | Get all following siblings of each element in `Self`; when the selector is not empty, the result is filtered by the selector. | |
| `next_until(selector: &str, filter: &str, contains: bool)` | Get all following siblings of each element in `Self`, up to the sibling that matches the selector. When `contains` is `true`, the matched sibling is included; otherwise it is excluded. When `filter` is not empty, the result is filtered by the `filter` selector. | |
| `prev(selector: &str)` | Get the previous sibling of each element in `Self`; when the selector is not empty, the result is filtered by the selector. | |
| `prev_all(selector: &str)` | Get all preceding siblings of each element in `Self`; when the selector is not empty, the result is filtered by the selector. | |
| `prev_until(selector: &str, filter: &str, contains: bool)` | Get all preceding siblings of each element in `Self`, up to the previous sibling that matches the selector. When `contains` is `true`, the matched sibling is included; otherwise it is excluded. When `filter` is not empty, the result is filtered by the `filter` selector. | |
| `eq(index: usize)` | Get the element at the specified index. | |
| `first()` | Get the first element of the set; equivalent to `eq(0)`. | |
| `last()` | Get the last element of the set; equivalent to `eq(len - 1)`. | |
| `slice<T: RangeBounds>(range: T)` | Get a subset specified by a range of indices, e.g. `slice(..3)` matches the first three elements. | |
| `add(eles: Elements)` | Get the concatenation of `Self` and `eles` as a new element set. It takes ownership of the parameter `eles` but does not affect `Self`. | |
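
To make the range argument of `slice` more concrete, here is a minimal standard-library sketch of how a `RangeBounds<usize>` value can be resolved against a collection. `slice_by_range` is a hypothetical helper operating on a plain slice, not the library's implementation, and clamping out-of-range bounds to the collection length is an assumption of this sketch.

```rust
use std::ops::{Bound, RangeBounds};

// Hypothetical helper (NOT part of visdom): resolve any range syntax
// (`..3`, `1..=2`, `2..`) into a copy of the selected items.
fn slice_by_range<T: Clone, R: RangeBounds<usize>>(items: &[T], range: R) -> Vec<T> {
    // Translate the start bound into an inclusive start index.
    let start = match range.start_bound() {
        Bound::Included(&s) => s,
        Bound::Excluded(&s) => s.saturating_add(1),
        Bound::Unbounded => 0,
    };
    // Translate the end bound into an exclusive end index.
    let end = match range.end_bound() {
        Bound::Included(&e) => e.saturating_add(1),
        Bound::Excluded(&e) => e,
        Bound::Unbounded => items.len(),
    };
    // Assumption of this sketch: clamp instead of panicking when the range
    // reaches past the end of the collection.
    let end = end.min(items.len());
    let start = start.min(end);
    items[start..end].to_vec()
}

fn main() {
    let tags = ["li", "p", "a", "span"];
    // `..3` selects the first three items, as in the `slice(..3)` example.
    println!("{:?}", slice_by_range(&tags, ..3));
    println!("{:?}", slice_by_range(&tags, 1..=2));
}
```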

Helpers

| Helper API | Description | Remarks |
| :-- | :-- | :-- |
| `length()` | Get the number of elements in `Self`. | |
| `is_empty()` | Check whether `Self` has no elements, i.e. `length() == 0`. | |
| `for_each(handle: \|index: usize, ele: &mut BoxDynElement\| -> bool)` | Iterate over the elements in `Self`; iteration stops when the handle returns `false`. | You can also use `each` if you prefer a shorter name. |
| `map<T>(\|index: usize, ele: &BoxDynElement\| -> T) -> Vec<T>` | Get a collection of values by calling the handle function on each element in `Self`. | |
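
The early-exit rule of `for_each` (stop as soon as the handle returns `false`) can be sketched over a plain slice with the standard library alone. `for_each_until` is a hypothetical name used for illustration and is not a visdom function; it returns the number of visited items so the stopping point is visible.

```rust
// Hypothetical helper (NOT part of visdom): call `handle` with the index and
// a mutable reference to each item, stopping when it returns `false`.
fn for_each_until<T, F>(items: &mut [T], mut handle: F) -> usize
where
    F: FnMut(usize, &mut T) -> bool,
{
    let mut visited = 0;
    for (index, item) in items.iter_mut().enumerate() {
        visited += 1;
        if !handle(index, item) {
            // The item that returned `false` was still visited.
            break;
        }
    }
    visited
}

fn main() {
    let mut words = vec![String::from("hello"), String::from("vis"), String::from("dom")];
    // Upper-case items in place, and ask to stop after the second one.
    let visited = for_each_until(&mut words, |index, word| {
        word.make_ascii_uppercase();
        index < 1
    });
    println!("visited {} items: {:?}", visited, words);
}
```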

Supported Selectors

| Selectors | Description | Remarks |
| :-- | :-- | :-- |
| `*` | MDN Universal Selectors | |
| `#id` | MDN Id Selector | |
| `.class` | MDN Class Selector | |
| `p` | MDN Type Selectors | |
| `[attr]` | MDN Attribute Selectors | |
| `[attr=value]` | See the above. | |
| `[attr*=value]` | See the above. | |
| `[attr\|=value]` | See the above. | |
| `[attr~=value]` | See the above. | |
| `[attr^=value]` | See the above. | |
| `[attr$=value]` | See the above. | |
| `[attr!=value]` | Supported by jQuery | Matches an element that has the attribute `attr` with a value not equal to `value`. |
| `span > a` | MDN Child Combinator | Matches an `a` element whose parent is a `span`. |
| `span a` | MDN Descendant Combinator | |
| `span + a` | MDN Adjacent Sibling Combinator | |
| `span ~ a` | MDN Generic Sibling Combinator | |
| `span,a` | MDN Selector list | |
| `span.a` | Adjoining Selectors | Matches an element whose tag type is `span` and that also has the class `.a`. |
| `:empty` | MDN :empty | Pseudo Selectors |
| `:first-child` | MDN :first-child | |
| `:last-child` | MDN :last-child | |
| `:only-child` | MDN :only-child | |
| `:nth-child(nth)` | MDN :nth-child() | `nth` supports the keywords `odd` and `even`. |
| `:nth-last-child(nth)` | MDN :nth-last-child() | |
| `:first-of-type` | MDN :first-of-type | |
| `:last-of-type` | MDN :last-of-type | |
| `:only-of-type` | MDN :only-of-type | |
| `:nth-of-type(nth)` | MDN :nth-of-type() | |
| `:nth-last-of-type(nth)` | MDN :nth-last-of-type() | |
| `:not(selector)` | MDN :not() | |
| `:contains(content)` | Matches an element whose `text()` contains `content`. | |
| `:header` | All title tags; alias of `h1,h2,h3,h4,h5,h6`. | |
| `:input` | All form input tags; alias of `input,select,textarea,button`. | |
| `:submit` | Form submit buttons; alias of `input[type="submit"],button[type="submit"]`. | |
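
The attribute operators are easy to mix up, so here is a small standard-library sketch of what each one compares. `attr_matches` is a hypothetical function written for illustration and is not how visdom matches selectors internally. It follows the CSS definitions for `=`, `*=`, `^=`, `$=`, `~=` and `|=`, and for `!=` it follows the description in the table above (the attribute must exist and differ).

```rust
// Hypothetical helper (NOT part of visdom): `actual` is the attribute value
// if the attribute is present, `op` is the operator text ("" means `[attr]`).
fn attr_matches(actual: Option<&str>, op: &str, expected: &str) -> bool {
    let value = match actual {
        Some(v) => v,
        // Every form listed here requires the attribute to be present.
        None => return false,
    };
    match op {
        "" => true,
        "=" => value == expected,
        "!=" => value != expected,
        // Substring, prefix and suffix tests never match an empty `expected`.
        "*=" => !expected.is_empty() && value.contains(expected),
        "^=" => !expected.is_empty() && value.starts_with(expected),
        "$=" => !expected.is_empty() && value.ends_with(expected),
        // One of the whitespace-separated words equals `expected`.
        "~=" => !expected.is_empty() && value.split_whitespace().any(|w| w == expected),
        // Exactly `expected`, or `expected` immediately followed by a hyphen.
        "|=" => {
            value == expected
                || value
                    .strip_prefix(expected)
                    .map_or(false, |rest| rest.starts_with('-'))
        }
        _ => false,
    }
}

fn main() {
    println!("{}", attr_matches(Some("en-US"), "|=", "en"));
    println!("{}", attr_matches(Some("btn btn-primary"), "~=", "btn"));
    println!("{}", attr_matches(Some("btn-primary"), "~=", "btn"));
}
```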

Attribute Operation

| Attribute API | Description | Remarks |
| :-- | :-- | :-- |
| `attr(attr_name: &str) -> Option<IAttrValue>` | Get the attribute with the key `attr_name`. | The return value is an `Option` of the enum `IAttrValue`; `IAttrValue` has the methods `is_true()`, `is_str(&str)` and `to_list()`. |
| `set_attr(attr_name: &str, value: Option<&str>)` | Set the attribute with the key `attr_name`. The value is an `Option<&str>`; when the value is `None`, the attribute doesn't have a string value and is a boolean attribute set to `true`. | |
| `remove_attr(attr_name: &str)` | Remove the attribute with the key `attr_name`. | |
| `has_class(class_name: &str) -> bool` | Check whether the class list of `Self` contains `class_name`; multiple classes can be separated by whitespace. | |
| `add_class(class_name: &str)` | Add classes to the class list of `Self`; multiple classes can be separated by whitespace. | |
| `remove_class(class_name: &str)` | Remove classes from the class list of `Self`; multiple classes can be separated by whitespace. | |
| `toggle_class(class_name: &str)` | Toggle classes in the class list of `Self`; multiple classes can be separated by whitespace. | |
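
As an illustration of the "multiple classes separated by whitespace" rule, the sketch below toggles several class names on a plain `Vec<String>` using only the standard library. The free function `toggle_class` here is a hypothetical stand-in that shares a name with the method in the table; it is not the library's implementation.

```rust
// Hypothetical stand-in (NOT the visdom method): toggle each
// whitespace-separated name in `class_name` on the given class list.
fn toggle_class(class_list: &mut Vec<String>, class_name: &str) {
    for name in class_name.split_whitespace() {
        match class_list.iter().position(|c| c.as_str() == name) {
            // Already present: toggling removes it.
            Some(pos) => {
                class_list.remove(pos);
            }
            // Missing: toggling adds it at the end.
            None => class_list.push(name.to_string()),
        }
    }
}

fn main() {
    let mut classes = vec![String::from("nav"), String::from("active")];
    // One call toggles two classes: `active` is removed, `hidden` is added.
    toggle_class(&mut classes, "active hidden");
    println!("{:?}", classes);
}
```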

Content Operation

| Content API | Description | Remarks |
| :-- | :-- | :-- |
| `text() -> &str` | Get the text of each element in `Self`; HTML entities are decoded automatically. | |
| `set_text(content: &str)` | Set the text of `Self`; HTML entities in `content` are encoded automatically. | |
| `html()` | Get the inner HTML of the first element in `Self`. | |
| `set_html(content: &str)` | Set the HTML of each element in `Self` to `content`. | |
| `outer_html()` | Get the outer HTML of the first element in `Self`. | |
| `texts(limit_depth: u32) -> Texts` | Get the text nodes of each element in `Self`. If `limit_depth` is `0`, all descendant text nodes are returned; if it is `1`, only the child text nodes are returned. | Unlike `Elements`, `Texts` does not have the methods of `IElementTrait`, but it has the `append_text` and `prepend_text` methods from implementing `ITextTrait`. |
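
The `limit_depth` rule can be pictured with a toy tree. The sketch below is standard-library Rust with a hypothetical `Node` type, and it does not use visdom. It assumes that values above `1` extend the same rule one level at a time, which the table does not state.

```rust
// Hypothetical toy tree (NOT a visdom type).
enum Node {
    Text(String),
    Element(Vec<Node>),
}

// Collect text nodes below `children`; `depth` is 1 for direct children.
fn collect_texts(children: &[Node], limit_depth: u32, depth: u32, out: &mut Vec<String>) {
    for child in children {
        match child {
            Node::Text(content) => out.push(content.clone()),
            Node::Element(nested) => {
                // 0 means "no limit"; otherwise stop descending at the limit.
                if limit_depth == 0 || depth < limit_depth {
                    collect_texts(nested, limit_depth, depth + 1, out);
                }
            }
        }
    }
}

// Hypothetical stand-in for the depth rule described in the table.
fn texts(children: &[Node], limit_depth: u32) -> Vec<String> {
    let mut out = Vec::new();
    collect_texts(children, limit_depth, 1, &mut out);
    out
}

// Children of `<p>Hello <b>Vis<i>Dom</i></b>!</p>`.
fn sample_tree() -> Vec<Node> {
    vec![
        Node::Text(String::from("Hello ")),
        Node::Element(vec![
            Node::Text(String::from("Vis")),
            Node::Element(vec![Node::Text(String::from("Dom"))]),
        ]),
        Node::Text(String::from("!")),
    ]
}

fn main() {
    let tree = sample_tree();
    println!("all descendants: {:?}", texts(&tree, 0));
    println!("children only:   {:?}", texts(&tree, 1));
}
```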

DOM Operation

| DOM Insertion and Removal API | Description | Remarks |
| :-- | :-- | :-- |
| `append(elements: &Elements)` | Append all elements into `Self`, after the last child. | |
| `append_to(elements: &mut Elements)` | The same as the above, but with the caller and the parameter target exchanged. | |
| `prepend(elements: &mut Elements)` | Prepend all elements into `Self`, before the first child. | |
| `prepend_to(elements: &mut Elements)` | The same as the above, but with the caller and the parameter target exchanged. | |
| `insert_after(elements: &mut Elements)` | Insert all elements after `Self`. | |
| `after(elements: &mut Elements)` | The same as the above, but with the caller and the parameter target exchanged. | |
| `insert_before(elements: &mut Elements)` | Insert all elements before `Self`. | |
| `before(elements: &mut Elements)` | The same as the above, but with the caller and the parameter target exchanged. | |
| `remove()` | Remove `Self`. It takes ownership of `Self`, so you can't use it again. | |
| `empty()` | Remove all children of each element in `Self`. | |

Example

let html = r##"
  <div class="second-child"></div>
  <div id="container">
    <div class="first-child"></div>
  </div>
"##;
let root = Vis::load(html)?;
let mut container = root.find("#container");
let mut second_child = root.find(".second-child");
// append the `second-child` element to the `container`
container.append(&mut second_child);
// the HTML now becomes:
/*
<div id="container">
  <div class="first-child"></div>
  <div class="second-child"></div>
</div>
*/
// create a new element with `Vis::load`
let mut third_child = Vis::load(r##"<div class="third-child"></div>"##)?;
container.append(&mut third_child);
// the HTML now becomes:
/*
<div id="container">
  <div class="first-child"></div>
  <div class="second-child"></div>
  <div class="third-child"></div>
</div>
*/

Dependencies

Questions, Suggestions & Bugs?

Feel free to open an issue if you have any questions, bug reports, or suggestions.

License

MIT License.
