Jay Taylor's notes

back to listing index

GitHub - ArchiveBox/ArchiveBox: 🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

[web search]
Original source (github.com)
Tags: web-archiving archive self-hosted github.com
Clipped on: 2023-07-26

Skip to content
Image (Asset 1/205) alt=Open-source self-hosted web archiving.

▶️ Quickstart | Demo | GitHub | Documentation | Info & Motivation | Community | Roadmap

"Your own personal internet archive" (网站存档 / 爬虫)
curl -sSL 'https://get.archivebox.io' | sh

Image (Asset 2/205) alt=

ArchiveBox is a powerful, self-hosted internet archiving solution to collect, save, and view sites you want to preserve offline.

You can set it up as a command-line tool, web app, and desktop app (alpha), on Linux, macOS, and Windows (WSL/Docker).

You can feed it URLs one at a time, or schedule regular imports from browser bookmarks or history, feeds like RSS, bookmark services like Pocket/Pinboard, and more. See input formats for a full list.

It saves snapshots of the URLs you feed it in several formats: HTML, PDF, PNG screenshots, WARC, and more out-of-the-box, with a wide variety of content extracted and preserved automatically (article text, audio/video, git repos, etc.). See output formats for a full list.

The goal is to sleep soundly knowing the part of the internet you care about will be automatically preserved in durable, easily accessible formats for decades after it goes down.



Image (Asset 3/205) alt=
Demo | Screenshots | Usage
. . . . . . . . . . . . . . . . . . . . . . . . . . . .


  Get ArchiveBox with docker / apt / brew / pip3 / nix / etc. (see Quickstart below).

# Get ArchiveBox with Docker or Docker Compose (recommended)
docker run -v $PWD/data:/data -it archivebox/archivebox:dev init --setup

# Or install with your preferred package manager (see Quickstart below for apt, brew, and more)
pip3 install archivebox

# Or use the optional auto setup script to install it
curl -sSL 'https://get.archivebox.io' | sh

Example usage: adding links to archive.

archivebox add 'https://example.com'                                   # add URLs one at a time
archivebox add < ~/Downloads/bookmarks.json                            # or pipe in URLs in any text-based format
archivebox schedule --every=day --depth=1 https://example.com/rss.xml  # or auto-import URLs regularly on a schedule

Example usage: viewing the archived content.

archivebox server 0.0.0.0:8000            # use the interactive web UI
archivebox list 'https://example.com'     # use the CLI commands (--help for more)
ls ./archive/*/index.json                 # or browse directly via the filesystem


Image (Asset 4/205) alt=

Image (Asset 5/205) alt=


Image (Asset 6/205) alt=

Image (Asset 7/205) alt=

Image (Asset 8/205) alt=

Image (Asset 9/205) alt=

Image (Asset 10/205) alt=

Image (Asset 11/205) alt=
Image (Asset 12/205) alt=
. . . . . . . . . . . . . . . . . . . . . . . . . . . .

DEMO: https://demo.archivebox.io
Usage | Configuration | Caveats


Image (Asset 13/205) alt=

Image (Asset 14/205) alt=

Image (Asset 15/205) alt=

Image (Asset 16/205) alt=

Image (Asset 17/205) alt=

Image (Asset 18/205) alt=

Image (Asset 19/205) alt=
Image (Asset 20/205) alt=

Image (Asset 21/205) alt= Image (Asset 22/205) alt= Image (Asset 23/205) alt=

Image (Asset 24/205) alt= Image from WTF is Link Rot?...

The balance between the permanence and ephemeral nature of content on the internet is part of what makes it beautiful. I don't think everything should be preserved in an automated fashion--making all content permanent and never removable, but I do think people should be able to decide for themselves and effectively archive specific content that they care about.

Because modern websites are complicated and often rely on dynamic content, ArchiveBox archives the sites in several different formats beyond what public archiving services like Archive.org/Archive.is save. Using multiple methods and the market-dominant browser to execute JS ensures we can save even the most complex, finicky websites in at least a few high-quality, long-term data formats.

Image (Asset 25/205) alt= Image (Asset 26/205) alt= Community-maintained indexes of archiving tools and institutions.
  • Web Archiving Software
    Open source tools and projects in the internet archiving space.
  • Reading List
    Articles, posts, and blogs relevant to ArchiveBox and web archiving in general.
  • Communities
    A collection of the most active internet archiving communities and initiatives.
  • Check out the ArchiveBox Roadmap and Changelog
  • Learn why archiving the internet is important by reading the "On the Importance of Web Archiving" blog post.
  • Reach out to me for questions and comments via @ArchiveBoxApp or @theSquashSH on Twitter

  • Need help building a custom archiving solution?

    Hire the team that helps build Archivebox to work on your project. (@MonadicalSAS)

    (They also do general software consulting across many industries)



    Image (Asset 27/205) alt=
    Image (Asset 28/205) alt=

    Image (Asset 29/205) alt=


    Image (Asset 30/205) alt= This project is maintained mostly in my spare time with the help from generous contributors and Monadical ( hire them for dev work!).


    Sponsor this project on GitHub

    Image (Asset 31/205) alt=

    Image (Asset 32/205) alt=
    Have spare CPU/disk/bandwidth and want to help the world?
    Check out our Good Karma Kit...

    Footer

    © 2023 GitHub, Inc.