GitHub - ArchiveBox/ArchiveBox: 🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more... - Jay Taylor's notes


Original source (github.com)

Tags: web-archiving archive self-hosted github.com

Clipped on: 2023-07-26


Open-source self-hosted web archiving.

▶️ Quickstart | Demo | GitHub | Documentation | Info & Motivation | Community | Roadmap

"Your own personal internet archive" (website archiving / web crawler)
curl -sSL 'https://get.archivebox.io' | sh

ArchiveBox is a powerful, self-hosted internet archiving solution to collect, save, and view sites you want to preserve offline.

You can set it up as a command-line tool, web app, and desktop app (alpha), on Linux, macOS, and Windows (WSL/Docker).

You can feed it URLs one at a time, or schedule regular imports from browser bookmarks or history, feeds like RSS, bookmark services like Pocket/Pinboard, and more. See input formats for a full list.

It saves snapshots of the URLs you feed it in several formats: HTML, PDF, PNG screenshots, WARC, and more out-of-the-box, with a wide variety of content extracted and preserved automatically (article text, audio/video, git repos, etc.). See output formats for a full list.

The goal is to sleep soundly knowing the part of the internet you care about will be automatically preserved in durable, easily accessible formats for decades after it goes down.


Get ArchiveBox with docker / apt / brew / pip3 / nix / etc. (see Quickstart below).

# Get ArchiveBox with Docker or Docker Compose (recommended)
docker run -v "$PWD/data":/data -it archivebox/archivebox:dev init --setup

# Or install with your preferred package manager (see Quickstart below for apt, brew, and more)
pip3 install archivebox

# Or use the optional auto setup script to install it
curl -sSL 'https://get.archivebox.io' | sh

Example usage: adding links to archive.

archivebox add 'https://example.com'                                   # add URLs one at a time
archivebox add < ~/Downloads/bookmarks.json                            # or pipe in URLs in any text-based format
archivebox schedule --every=day --depth=1 https://example.com/rss.xml  # or auto-import URLs regularly on a schedule
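
The "any text-based format" input above can be illustrated with a small pre-processing step. The following is a minimal sketch, not ArchiveBox's own parser: the function name `extract_urls` and the regular expression are assumptions made for illustration. It pulls http(s) URLs out of arbitrary text and returns them de-duplicated in first-seen order.

```python
import re

# Hypothetical, deliberately simple pattern for http(s) URLs.
# This is an illustrative assumption, not the parser ArchiveBox uses.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def extract_urls(text):
    """Return the unique http(s) URLs found in text, in first-seen order."""
    seen = set()
    urls = []
    for match in URL_PATTERN.findall(text):
        # Drop trailing punctuation that commonly follows a URL in prose.
        url = match.rstrip(".,;)")
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


if __name__ == "__main__":
    sample = "Read https://example.com/post, then https://example.com/rss.xml."
    print("\n".join(extract_urls(sample)))
```

Writing the result to a file, one URL per line, gives plain text that could then be fed to `archivebox add` with the same `<` redirect shown above.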

Example usage: viewing the archived content.

archivebox server 0.0.0.0:8000            # use the interactive web UI
archivebox list 'https://example.com'     # use the CLI commands (--help for more)
ls ./archive/*/index.json                 # or browse directly via the filesystem
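
Browsing via the filesystem can also be scripted. The sketch below walks the same `./archive/*/index.json` layout shown above and lists each snapshot. The helper name `list_snapshots` is hypothetical, and the field names `url`, `title`, and `timestamp` are assumptions about the index file contents, so check them against your own data folder before relying on them.

```python
import json
from pathlib import Path


def list_snapshots(archive_dir):
    """Yield (timestamp, url, title) for each readable */index.json under archive_dir."""
    for index_path in sorted(Path(archive_dir).glob("*/index.json")):
        try:
            data = json.loads(index_path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # skip unreadable or malformed index files
        if not isinstance(data, dict):
            continue  # skip index files that are not JSON objects
        # Assumed field names; fall back to the folder name for the timestamp.
        yield (
            data.get("timestamp", index_path.parent.name),
            data.get("url"),
            data.get("title"),
        )


if __name__ == "__main__":
    for timestamp, url, title in list_snapshots("./archive"):
        print(timestamp, url, title)
```

Missing fields come back as `None` rather than raising, so a partially written snapshot does not stop the listing.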


DEMO: `https://demo.archivebox.io`
Usage | Configuration | Caveats

(Image from WTF is Link Rot?...)

The balance between the permanent and the ephemeral nature of content on the internet is part of what makes it beautiful. I don't think everything should be preserved in an automated fashion, making all content permanent and never removable, but I do think people should be able to decide for themselves and effectively archive specific content that they care about.

Because modern websites are complicated and often rely on dynamic content, ArchiveBox archives sites in several formats beyond what public archiving services like Archive.org/Archive.is save. Using multiple methods and the market-dominant browser (Chrome/Chromium) to execute JS helps ensure that even the most complex, finicky websites are saved in at least a few high-quality, long-term data formats.

Community-maintained indexes of archiving tools and institutions.
Web Archiving Software
Open source tools and projects in the internet archiving space.

Reading List
Articles, posts, and blogs relevant to ArchiveBox and web archiving in general.

Communities
A collection of the most active internet archiving communities and initiatives.

Check out the ArchiveBox Roadmap and Changelog

Learn why archiving the internet is important by reading the "On the Importance of Web Archiving" blog post.

Reach out to me for questions and comments via @ArchiveBoxApp or @theSquashSH on Twitter

Need help building a custom archiving solution?

✨ Hire the team that helps build ArchiveBox to work on your project. (@MonadicalSAS)

(They also do general software consulting across many industries.)

This project is maintained mostly in my spare time with help from generous contributors and Monadical (✨ hire them for dev work!).

Sponsor this project on GitHub

✨ Have spare CPU/disk/bandwidth and want to help the world?
Check out our Good Karma Kit...


archivebox.io

Topics

python rss backups firefox pinboard youtube-dl chromium self-hosted wget pocket browser-bookmarks warc web-archiving wayback-machine digipres singlefile headless-browser bookmark-archiver internet-archiving archivebox


License

MIT license


Releases: 24

Latest: v0.6.2 (Apr 10, 2021): >10x performance gain, new Admin UI & CLI features, and more

Sponsor this project

pirate Nick Sweeting

patreon.com/theSquashSH

https://twitter.com/ArchiveBoxApp

https://paypal.me/NicholasSweeting

https://www.blockchain.com/eth/address/0x5D4c34D4a121Fe08d1dDB7969F07550f2dB9f471

https://www.blockchain.com/btc/address/1HuxXriPE2Bbnag3jJrqa3bkNHrs297dYH



Languages: Python 50.3%, HTML 44.8%, Shell 3.3%, Other 1.6%
