#crawler #user-agent #article #board #online #community #ptt

bin+lib ptt-crawler

A crawler for the web version of PTT, the largest online community in Taiwan

1 unstable release

0.1.0 Aug 16, 2020

#19 in #community

MIT license

145KB
1.5K SLoC

ptt-crawler (ptc)

A crawler for the web version of PTT, the largest online community in Taiwan.

Yet another PTT crawler, but written in Rust. It can be used directly as a binary or as a crate.

Features

  • Single executable with no runtime dependencies
  • Cross-platform support
  • Crawl a single article or multiple articles in one board
  • Anti-anti-crawler measures: random user agent and proxy server support

Getting started

Installation

The binary name for ptt-crawler is ptc. Currently, no precompiled binary is available, so you need Rust 1.40 or higher and cargo to build ptt-crawler from source.

From crates.io

> cargo install ptt-crawler

From source

> git clone https://github.com/cwouyang/ptt-crawler.git
> cd ptt-crawler
> cargo build --release

How to use

  • Crawl a specific article
> ptc url https://www.ptt.cc/bbs/Gossiping/M.1597463395.A.478.html

Use the -u and -p flags to specify the user agent and proxy used during crawling

> ptc -u "user/agent/string" -p "https://some.proxy" url https://www.ptt.cc/bbs/Gossiping/M.1597463395.A.478.html

# Pass "random" to use a randomly generated user agent
> ptc -u "random" url https://www.ptt.cc/bbs/Gossiping/M.1597463395.A.478.html
  • Crawl the articles of a board within a page range
# From page 100 (https://www.ptt.cc/bbs/Gossiping/index100.html) to 200 (https://www.ptt.cc/bbs/Gossiping/index200.html)
> ptc board Gossiping -r 100 200

# From page 1 to the latest page
> ptc board Gossiping

Use the -l (--list) flag to list the supported boards

> ptc board Gossiping --list
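
The URLs in the examples above follow a regular pattern: a board name followed by either an article ID or an indexN page. The sketch below illustrates that pattern only. It is not taken from the ptt-crawler source, and the helper names split_article_url and board_page_urls are hypothetical; it assumes the URL layout shown in the commands above.

```rust
// Illustrative only: these helpers are NOT part of the ptt-crawler API.
// They assume the URL layout shown in the usage examples.
const BASE: &str = "https://www.ptt.cc/bbs/";

/// Splits an article URL into (board, article id), or returns None
/// if the URL does not match the assumed layout.
fn split_article_url(url: &str) -> Option<(&str, &str)> {
    if !url.starts_with(BASE) || !url.ends_with(".html") {
        return None;
    }
    // Drop the base prefix and the ".html" suffix.
    let rest = &url[BASE.len()..url.len() - ".html".len()];
    let mut parts = rest.splitn(2, '/');
    let board = parts.next()?;
    let id = parts.next()?;
    if board.is_empty() || id.is_empty() {
        return None;
    }
    Some((board, id))
}

/// Builds the index page URLs of a board for an inclusive page range.
fn board_page_urls(board: &str, first: u32, last: u32) -> Vec<String> {
    (first..=last)
        .map(|page| format!("{}{}/index{}.html", BASE, board, page))
        .collect()
}

fn main() {
    let article = "https://www.ptt.cc/bbs/Gossiping/M.1597463395.A.478.html";
    if let Some((board, id)) = split_article_url(article) {
        println!("board = {}, article id = {}", board, id);
    }

    // Same range as `ptc board Gossiping -r 100 200`.
    let pages = board_page_urls("Gossiping", 100, 200);
    println!("{} index pages, starting at {}", pages.len(), pages[0]);
}
```

Because the range is inclusive on both ends, pages 100 to 200 cover 101 index pages. Note that this simple split does not distinguish article pages from index pages.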

Use as a crate

Add ptt-crawler as a dependency in your Cargo.toml file

[dependencies]
ptt-crawler = "0.1"

See the documentation for usage.

Run unit tests

> cargo test --all

Contributing

If you'd like to contribute, please fork the repository and use a feature branch. Pull requests are warmly welcomed.

Before submitting a pull request, make sure

Versioning

We use SemVer for versioning. For the versions available, see the tags on this repository.

License

Copyright (c) 2020 cwouyang.

This project is licensed under the terms of the MIT License. See the LICENSE file for details.

Dependencies

~26–39MB
~667K SLoC