WARC - Just Solve the File Format Problem
MIME Type(s): application/warc, application/warc-fields
WARC is the successor to the ARC (Internet Archive) format. It is standardized as ISO 28500:2009, Information and documentation -- WARC file format, and was developed under the auspices of the International Internet Preservation Consortium. WARC was designed as an extension to ARC, in part to provide better capabilities for managing Web archives over the long term by allowing the capture of more metadata about the circumstances of archiving.
WARC files are often compressed using gzip, resulting in a .warc.gz extension.
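As a rough illustration of what a .warc.gz file contains, the sketch below builds two tiny WARC records in memory, compresses each as its own gzip member (a common convention, assumed here rather than taken from this page), and reads them back using only the Python standard library. The helper names `build_record` and `read_records`, the record ID, the date, and the target URI are all made up for the example; real files carry more fields and should be handled with a dedicated WARC library.

```python
import gzip
import io


def build_record(fields, payload):
    """Serialize one record: version line, named fields, blank line, payload."""
    head = ["WARC/1.0"]
    head += [f"{name}: {value}" for name, value in fields.items()]
    head.append(f"Content-Length: {len(payload)}")
    header_bytes = ("\r\n".join(head) + "\r\n\r\n").encode("utf-8")
    # Each record is followed by two CRLFs.
    return header_bytes + payload + b"\r\n\r\n"


def read_records(stream):
    """Yield (headers, payload) pairs from an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines between records
        headers = {"version": line.strip().decode("utf-8")}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


# Placeholder field values, for illustration only.
record = build_record(
    {
        "WARC-Type": "resource",
        "WARC-Record-ID": "<urn:uuid:00000000-0000-0000-0000-000000000000>",
        "WARC-Date": "2014-01-01T00:00:00Z",
        "WARC-Target-URI": "http://example.com/",
        "Content-Type": "text/plain",
    },
    b"hello",
)

# Write two records, each compressed as a separate gzip member.
buffer = io.BytesIO()
for rec in (record, record):
    buffer.write(gzip.compress(rec))
buffer.seek(0)

# The gzip module reads concatenated members as one continuous stream.
with gzip.GzipFile(fileobj=buffer) as gz:
    records = list(read_records(gz))

for headers, payload in records:
    print(headers["WARC-Type"], headers["WARC-Target-URI"], payload)
```

To read a file on disk instead of the in-memory buffer, `gzip.open(path, "rb")` can be passed to `read_records` in the same way.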
- Draft of ISO-DIS 28500, as circulated for ISO ballot and approval.
- WARC, Web ARChive file format, from Library of Congress resource on Sustainability of Digital Formats
- Working drafts for WARC specification
- The WARC Format v. 1.0
- WARC Specifications
- Test WARC Files: warc.gz file from Internet Archive.
- WARC Tools (in Python)
- Some history on the Python tools is available on the COPTR wiki.
- warcat: Tool and library for handling Web ARChive (WARC) files.
- WARCreate (for Google Chrome)
- warcbase platform
- WARC Input and Output Formats for Hadoop
Other links and references
- The WARC File Format (ISO 28500) - Information, Maintenance, Drafts
- Slide show on WARC
- The WARC Ecosystem (Archive Team)
- Web Archive Analysis Workshop
- Warcbase Wiki
- Discussion of WARC format 1.1, under development
- Harvesting the Twitter Streaming API to WARC files
- The great WARC adventure: Using SIPS, AIPS, and DIPS to document SLAAPs
- WARC Work
- WARC MIME Media Type (as of now unregistered, but a suggested value exists)