Jay Taylor's notes


GitHub - pirate/readability-extractor: Wrapper around mozilla/readability to keep archivebox free from nodejs

Original source (github.com)
Tags: javascript html tools firefox readability-extractor github.com
Clipped on: 2021-01-19

Readability-Extractor

This is a tiny JS wrapper library around Mozilla's article-text extraction tool https://github.com/mozilla/readability.

It's designed to be used as an ArchiveBox archive method.

Install

npm install -g 'git+https://github.com/pirate/readability-extractor'

# which is equivalent to this:
curl https://raw.githubusercontent.com/pirate/readability-extractor/master/readability-extractor > /usr/local/bin/readability-extractor
chmod +x /usr/local/bin/readability-extractor

Usage

readability-extractor some_article.html > some_article.json

# some_article.json then contains:
{
    "title": "Title autodetected from article HTML",
    "byline": "Autodetected author...",
    "excerpt": "Autodetected short description",
    "dir": "ltr",
    "length": 1337,
    "content": "<div id=\"readability-page-1\" class=\"page\">abc...</div>",
    "textContent": "abc..."
}
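
The JSON produced by the command above can be consumed from any language. What follows is a minimal Node.js sketch, assuming the output carries the fields shown in the sample. The helper name `summarizeArticle` and its word-count logic are illustrative only and are not part of readability-extractor.

```javascript
// Hypothetical helper (not part of readability-extractor): condense the
// JSON document that the tool emits for one article into a short summary.
function summarizeArticle(jsonText) {
  const article = JSON.parse(jsonText); // throws SyntaxError on malformed JSON
  const text = (article.textContent || "").trim();
  // Count whitespace-separated words in the plain-text body.
  const words = text === "" ? 0 : text.split(/\s+/).length;
  return {
    title: article.title || "(untitled)",
    byline: article.byline || null,
    direction: article.dir || null,
    words,
  };
}
```

One way to use it would be to read some_article.json with `fs.readFileSync("some_article.json", "utf8")` and pass the resulting string to `summarizeArticle`.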

ArchiveBox Integration

# You usually don't have to run these commands.
# Readability is on by default, and ArchiveBox will automatically
# find any installed version in your $PATH.

# However, if you explicitly want to turn readability on
# and/or specify a manual path to the binary, you can do this:
archivebox config --set SAVE_READABILITY=True
archivebox config --set READABILITY_BINARY="$(which readability-extractor)"
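
For readers curious what "finding a binary in $PATH" amounts to, here is a rough Node.js sketch of such a lookup, similar in spirit to `which`. It is an assumption-laden illustration, not ArchiveBox's actual detection code, and the function name `findOnPath` is hypothetical.

```javascript
const fs = require("fs");
const path = require("path");

// Illustrative PATH lookup (NOT ArchiveBox's implementation): return the
// first executable regular file called `name` in the PATH-style string
// `pathVar`, or null if none is found.
function findOnPath(name, pathVar = process.env.PATH || "") {
  for (const dir of pathVar.split(path.delimiter)) {
    if (!dir) continue; // skip empty PATH entries
    const candidate = path.join(dir, name);
    try {
      fs.accessSync(candidate, fs.constants.X_OK); // throws if missing or not executable
      if (fs.statSync(candidate).isFile()) return candidate;
    } catch (err) {
      // Not usable in this directory; keep looking.
    }
  }
  return null;
}
```

Calling `findOnPath("readability-extractor")` would return a path comparable to what `$(which readability-extractor)` yields in the command above, or `null` when nothing is installed.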
