Jay Taylor's notes
GitHub - pirate/readability-extractor: Wrapper around mozilla/readability to keep archivebox free from nodejs
Original source (github.com)
Clipped on: 2021-01-19
README.md
Readability-Extractor
This is a tiny JS wrapper library around Mozilla's article-text extraction tool https://github.com/mozilla/readability.
It's designed to be used as an ArchiveBox archive method.
Install
npm install -g 'git+https://github.com/pirate/readability-extractor'
# which is equivalent to this:
curl https://raw.githubusercontent.com/pirate/readability-extractor/master/readability-extractor > /usr/local/bin/readability-extractor
chmod +x /usr/local/bin/readability-extractor
Usage
readability-extractor some_article.html > some_article.json
{
  "title": "Title autodetected from article html",
  "byline": "Autodetected author...",
  "excerpt": "Autodetected short description",
  "dir": "ltr",
  "length": 1337,
  "content": "<div id=\"readability-page-1\" class=\"page\">abc...</div>",
  "textContent": "abc..."
}
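As a minimal sketch of consuming this output from Python (an assumption for illustration; the README itself only shows the shell invocation), the JSON can be parsed with the standard library. The helper name `summarize_article` and the sample string are hypothetical; the field names are the ones shown in the example output.

```python
import json


def summarize_article(raw_json: str) -> dict:
    """Parse readability-extractor's JSON output and keep a few fields."""
    article = json.loads(raw_json)
    return {
        "title": article.get("title"),
        "byline": article.get("byline"),
        "length": article.get("length"),
        # First 80 characters of the plain-text body, used as a preview.
        "preview": (article.get("textContent") or "")[:80],
    }


# Hypothetical sample mirroring the fields in the example output.
sample = '{"title": "Example", "byline": "A. Author", "length": 1337, "textContent": "abc..."}'
summary = summarize_article(sample)
print(summary["title"], summary["length"])
```

In practice the input would come from the file written by the command, for example the contents of some_article.json, rather than from an inline string.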
ArchiveBox Integration
# You usually don't need to run these commands.
# Readability is on by default, and ArchiveBox automatically finds any
# installed version in your $PATH.
# However, if you want to turn readability on explicitly
# and/or specify a manual path to the binary, you can do this:
archivebox config --set SAVE_READABILITY=True
archivebox config --set READABILITY_BINARY="$(which readability-extractor)"
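The `$(which readability-extractor)` lookup can also be sketched in Python with the standard library. This is only an illustration of a `$PATH` search, not ArchiveBox's actual detection code; the helper name `find_readability_binary` is hypothetical.

```python
import shutil
from typing import Optional


def find_readability_binary(name: str = "readability-extractor") -> Optional[str]:
    """Return the path of the binary if it is found on $PATH, else None."""
    return shutil.which(name)


path = find_readability_binary()
if path is None:
    print("readability-extractor not found on $PATH")
else:
    # Print the config command from the README with the resolved path filled in.
    print(f'archivebox config --set READABILITY_BINARY="{path}"')
```

`shutil.which` returns `None` when nothing matches, so the missing-binary case can be handled explicitly instead of passing an empty string to the config command.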