37 releases
Version | Released |
---|---|
0.8.0 | Apr 3, 2023 |
0.8.0-alpha.5 | Feb 23, 2023 |
0.8.0-alpha.1 | Jan 31, 2023 |
0.7.4 | Dec 4, 2022 |
0.3.0-alpha.3 | Nov 12, 2020 |
parsoid-rs
The parsoid crate is a wrapper around Parsoid HTML that provides convenient accessors for processing and extraction.
See the full documentation (docs for main).
Testing
Use the build_corpus example to download the first 500 featured articles on the English Wikipedia to create a test corpus.
The featured_articles example will iterate through those downloaded examples to test the parsing code, clean roundtripping, etc.
License
parsoid-rs is (C) 2020-2021 Kunal Mehta, released under the GPL v3 or any later version; see COPYING for details.
lib.rs:
This crate is inspired by mwparserfromhell and parsoid-jsapi, and is built on top of Kuchiki (朽木).
Quick starts
Fetch HTML and extract the value of a template parameter:
    use parsoid::prelude::*;
    use parsoid::Result;

    #[tokio::main]
    async fn main() -> Result<()> {
        let client = ParsoidClient::new("https://en.wikipedia.org/api/rest_v1", "parsoid-rs demo")?;
        // Fetch the page's Parsoid HTML and convert it into a mutable Wikicode.
        let code = client.get("Taylor_Swift").await?.into_mutable();
        for template in code.filter_templates()? {
            if template.name() == "Template:Infobox person" {
                let birth_name = template.param("birth_name").unwrap();
                assert_eq!(birth_name, "Taylor Alison Swift");
            }
        }
        Ok(())
    }
Add a link to a page and convert it to wikitext:
    use parsoid::prelude::*;
    use parsoid::Result;

    #[tokio::main]
    async fn main() -> Result<()> {
        let client = ParsoidClient::new("https://en.wikipedia.org/api/rest_v1", "parsoid-rs demo")?;
        let code = client.get("Wikipedia:Sandbox").await?.into_mutable();
        // Build a link to Special:Random with custom link text.
        let link = WikiLink::new(
            "./Special:Random",
            &Wikicode::new_text("Visit a random page"),
        );
        code.append(&link);
        // Convert the modified HTML back into wikitext.
        let wikitext = client.transform_to_wikitext(&code).await?;
        assert!(wikitext.ends_with("[[Special:Random|Visit a random page]]"));
        Ok(())
    }
This crate provides no functionality for actually saving a page; for that, you'll need to use something like mwbot.
Architecture
Conceptually, this crate provides wiki-related types on top of an HTML processing library. There are three primary constructs to be aware of: Wikicode, Wikinode, and Template.
Wikicode represents a container of an entire wiki page, equivalent to a <html> or <body> node. It provides some convenience functions like filter_links() to easily operate on and mutate a specific Wikinode. (For convenience, Wikicode is also a Wikinode.)
    use parsoid::prelude::*;
    use parsoid::Result;

    #[tokio::main]
    async fn main() -> Result<()> {
        let client = ParsoidClient::new("https://en.wikipedia.org/api/rest_v1", "parsoid-rs demo")?;
        let code = client.get("Taylor Swift").await?.into_mutable();
        // Visit every wikilink on the page and match on its target.
        for link in code.filter_links() {
            if link.target() == "You Belong with Me" {
                // ...do something
            }
        }
        Ok(())
    }
Filter functions are only provided for common types as an optimization, but it's straightforward to implement an equivalent for other types:
    use parsoid::prelude::*;
    use parsoid::Result;

    #[tokio::main]
    async fn main() -> Result<()> {
        let client = ParsoidClient::new("https://en.wikipedia.org/api/rest_v1", "parsoid-rs demo")?;
        let code = client.get("Taylor Swift").await?.into_mutable();
        // Walk all descendant nodes and keep only the HTML entities.
        let entities: Vec<HtmlEntity> = code
            .descendants()
            .filter_map(|node| node.as_html_entity())
            .collect();
        Ok(())
    }
Wikinode is an enum representing all of the different types of Wikinodes, mostly to enable functions that accept/return various types of nodes.
A Wikinode provides convenience functions for working with specific types of MediaWiki constructs. For example, the WikiLink type wraps around a node of <a rel="mw:WikiLink" href="...">...</a>. It provides functions for accessing or mutating the href attribute. To access the link text you would need to use .children() and modify or append to those nodes.
Standard mutators like .append() and .insert_after() are part of the WikinodeIterator trait, which is automatically imported in the prelude.
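The enum-plus-accessor pattern described above can be sketched without the crate itself. This is a minimal, std-only illustration: the Node enum, its variants, as_link_target() and filter_link_targets() are hypothetical stand-ins, not the real Wikinode API, and they only mirror the shape of the as_html_entity() call used with filter_map earlier.

```rust
// Hypothetical, simplified node enum; the real Wikinode has many more variants.
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Link { target: String },
    Comment(String),
    Text(String),
}

impl Node {
    // Returns the link target if this node is a link, otherwise None.
    // This plays the role of an `as_*` accessor on the enum.
    fn as_link_target(&self) -> Option<&str> {
        match self {
            Node::Link { target } => Some(target.as_str()),
            _ => None,
        }
    }
}

// A filter function is then just `filter_map` over the nodes.
fn filter_link_targets(nodes: &[Node]) -> Vec<&str> {
    nodes.iter().filter_map(Node::as_link_target).collect()
}

fn main() {
    let nodes = vec![
        Node::Text("See ".to_string()),
        Node::Link { target: "You Belong with Me".to_string() },
        Node::Comment("todo".to_string()),
    ];
    assert_eq!(filter_link_targets(&nodes), vec!["You Belong with Me"]);
}
```

Because each accessor returns an Option, callers can filter for any node type with the same one-line filter_map, which is presumably why dedicated filter functions are only needed for the common cases.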
The following nodes have been implemented so far:
BehaviorSwitch: __TOC__, {{DISPLAYTITLE:}}
Category: [[Category:Foo]]
Comment: <!-- ... -->
ExtLink: [https://example.org Text]
Heading: == Some text ==
HtmlEntity: &nbsp;
IncludeOnly: <includeonly>foo</includeonly>
InterwikiLink: [[:en:Foo]]
LanguageLink: [[en:Foo]]
Nowiki: <nowiki>[[foo]]</nowiki>
Redirect: #REDIRECT [[Foo]]
Section: contains a Heading and its contents
WikiLink: [[Foo|bar]]
Generic: any node that we don't have a more specific type for
Each Wikinode is effectively a wrapper around Rc<Node>, making it cheap to clone.
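To illustrate why an Rc-based wrapper is cheap to clone, here is a minimal std-only sketch. The Node and Handle types are hypothetical stand-ins (not the kuchiki or parsoid types), and the RefCell is an assumption made so the example can show a mutation through one handle being visible through another.

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Hypothetical stand-in for an HTML node; not the real kuchiki type.
#[derive(Debug)]
struct Node {
    text: RefCell<String>,
}

// Hypothetical handle: cloning copies the pointer and bumps a
// reference count; the node itself is not duplicated.
#[derive(Clone, Debug)]
struct Handle(Rc<Node>);

impl Handle {
    fn new(text: &str) -> Self {
        Handle(Rc::new(Node { text: RefCell::new(text.to_string()) }))
    }

    // Mutates the shared node in place.
    fn set_text(&self, text: &str) {
        *self.0.text.borrow_mut() = text.to_string();
    }

    fn text(&self) -> String {
        self.0.text.borrow().clone()
    }

    // Number of handles currently pointing at the same node.
    fn handle_count(&self) -> usize {
        Rc::strong_count(&self.0)
    }
}

fn main() {
    let a = Handle::new("before");
    let b = a.clone(); // cheap: no deep copy of the node
    b.set_text("after");
    // Both handles refer to the same underlying node.
    assert_eq!(a.text(), "after");
    assert_eq!(a.handle_count(), 2);
}
```

One consequence of this design is that a clone is an alias rather than a copy: changes made through any handle affect the same node.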
Templates
Unlike Wikinodes, Templates do not have a 1:1 mapping with an HTML node; it's possible to have multiple templates in one node. The main way to get Template instances is to call Wikicode::filter_templates().
See the Template documentation for more details and examples.
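The lack of a 1:1 mapping can be pictured with a small std-only model. This is only a sketch under assumptions: the Part and Node types and template_names() are hypothetical, and the idea that a node carries an ordered list of parts (templates interleaved with text) is a simplification, not the crate's actual data structures.

```rust
// Hypothetical model: one HTML node may carry several transclusion parts.
#[derive(Debug, Clone, PartialEq)]
enum Part {
    Template { name: String },
    Text(String),
}

// Hypothetical node holding the parts that produced it.
struct Node {
    parts: Vec<Part>,
}

// Collect template names across nodes. The result can be longer than
// `nodes`, because a single node can contribute several templates.
fn template_names(nodes: &[Node]) -> Vec<&str> {
    nodes
        .iter()
        .flat_map(|node| node.parts.iter())
        .filter_map(|part| match part {
            Part::Template { name } => Some(name.as_str()),
            Part::Text(_) => None,
        })
        .collect()
}

fn main() {
    // One node, two templates separated by plain text.
    let nodes = vec![Node {
        parts: vec![
            Part::Template { name: "Template:Foo".to_string() },
            Part::Text(" ".to_string()),
            Part::Template { name: "Template:Bar".to_string() },
        ],
    }];
    assert_eq!(template_names(&nodes), vec!["Template:Foo", "Template:Bar"]);
}
```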
noinclude and onlyinclude
Similar to Templates, <noinclude> and <onlyinclude> do not have a 1:1 mapping with a single HTML node, as they may span multiple nodes. The main way to get NoInclude or OnlyInclude instances is to call filter_noinclude() and filter_onlyinclude() respectively.
See the module-level documentation for more details and examples.
Safety
This library is implemented using only safe Rust and should not panic. However, the HTML is expected to meet some level of well-formedness. For example, if a node has rel="mw:WikiLink", it is assumed to be an <a> element. The library is not designed to be fully defensive against arbitrary HTML and should only be used with HTML from Parsoid itself or HTML mutated by this or another similar library (contributions to improve this are welcome!).
Additionally, Wikicode does not implement Send, which means it cannot be safely sent across threads. This is a limitation of the underlying kuchiki library being used.
An ImmutableWikicode is provided as a workaround: it is Send and contains all the same information Wikicode does, but is immutable. Switching between the two is straightforward by using into_immutable() and into_mutable() or by using the standard From and Into traits.
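The mutable/immutable split can be sketched with std-only types. MutableDoc, ImmutableDoc and length_on_worker() below are hypothetical names chosen for illustration, and holding the HTML as a plain String in the immutable form is an assumption; the sketch only shows the general pattern of converting to a Send type before crossing a thread boundary and converting back on the other side.

```rust
use std::rc::Rc;
use std::thread;

// Hypothetical mutable document: the Rc inside makes it !Send.
struct MutableDoc {
    html: Rc<String>,
}

// Hypothetical immutable snapshot: plain owned data, so it is Send.
struct ImmutableDoc {
    html: String,
}

impl From<MutableDoc> for ImmutableDoc {
    fn from(doc: MutableDoc) -> Self {
        ImmutableDoc { html: (*doc.html).clone() }
    }
}

impl From<ImmutableDoc> for MutableDoc {
    fn from(doc: ImmutableDoc) -> Self {
        MutableDoc { html: Rc::new(doc.html) }
    }
}

// Convert to the Send form, move it into a worker thread, and convert
// back to the mutable form there before using it.
fn length_on_worker(doc: MutableDoc) -> usize {
    let snapshot: ImmutableDoc = doc.into();
    thread::spawn(move || {
        let doc: MutableDoc = snapshot.into();
        doc.html.len()
    })
    .join()
    .expect("worker thread panicked")
}

fn main() {
    let doc = MutableDoc { html: Rc::new("<p>hi</p>".to_string()) };
    assert_eq!(length_on_worker(doc), 9);
}
```

Moving a MutableDoc directly into thread::spawn would be rejected by the compiler, since Rc is not Send; only the snapshot can cross the boundary.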
Contributing
parsoid is a part of the mwbot-rs project. We're always looking for new contributors; please reach out if you're interested!