Jay Taylor's notes

GitHub - gtoubassi/femtozip: FemtoZip is a "shared dictionary" compression library optimized for small documents that may not compress well with traditional tools such as gzip

Original source (github.com)

Tags: compression tools library femtozip github.com

Clipped on: 2020-05-20

gtoubassi / femtozip

FemtoZip

FemtoZip is a "shared dictionary" compression library optimized for small documents that may not compress well with traditional tools such as gzip. It targets situations where a very large number of small documents (tens to thousands of bytes each) share similar characteristics but do not compress effectively on their own.

How can I tell if my data will work with femtozip?

If gzipping 1000 of your documents concatenated together in a single file achieves a much better compression rate than gzipping the documents individually, then your data is likely tailor-made for FemtoZip.
Get your documents onto the file system as discrete files, and run a test using the fzip command line tool as shown in the Tutorial.
If you have a Lucene search index and you want to see how much FemtoZip can compress your stored fields, try the IndexAnalyzer.
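
The first check above (gzipping documents together versus individually) can be sketched in a few lines of Python using only the standard library. This is a rough illustration, not part of FemtoZip: the function name `compare_gzip_sizes` and the generated sample records are hypothetical, and in practice you would load your own documents instead.

```python
import gzip
import random


def compare_gzip_sizes(docs):
    """Return (total size when each doc is gzipped alone, size when gzipped together)."""
    individual = sum(len(gzip.compress(d)) for d in docs)
    combined = len(gzip.compress(b"".join(docs)))
    return individual, combined


# Hypothetical sample data: 1000 small JSON-like records that share keys.
# Replace this with your own documents read from disk or a database.
rng = random.Random(0)
docs = [
    ('{"user_id": %d, "name": "user%d", "email": "user%d@example.com", "active": true}'
     % (i, i, rng.randrange(10**6))).encode("utf-8")
    for i in range(1000)
]

individual, combined = compare_gzip_sizes(docs)
print("gzipped individually:", individual, "bytes")
print("gzipped together:    ", combined, "bytes")
print("together / individual: %.2f" % (combined / individual))
```

A ratio well below 1.0 suggests that most of the redundancy lies across documents rather than within them, which is the case the README describes as a good fit.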

Examples where FemtoZip is likely to outperform gzip:

Small objects serialized and stored in a database or in an in-memory DHT such as memcached, using a PHP, JSON, or XML serialization format. Keys and tags are repeated across documents but may not be repeated within a document. For example, at one large-scale consumer website, memcached user objects (serialized with PHP) were compressed to 29% of their gzipped size (8.3% of their original size).
URLs, for example those stored in a Lucene search index. URLs often start with "http://www." and have common substrings like ".com/", ".html", and "?page=". Again, this structure is repeated across documents but not within a document. For example, in a large-scale search engine, URLs in Lucene were compressed to 60% of their gzipped size (20% of their original size).
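
To make the shared-dictionary idea concrete, here is a minimal sketch using the preset-dictionary feature of Python's `zlib` module. It does not use FemtoZip's API or algorithm; zlib is only a stand-in for the general technique. The dictionary contents, the helper names `compress_with_dict` and `decompress_with_dict`, and the sample URLs are all assumptions made for illustration.

```python
import zlib

# Hypothetical hand-written dictionary of substrings the URLs have in common.
# zlib favors data near the end of the dictionary, so the most common
# substring ("http://www.") goes last.
SHARED_DICT = b"?page=.html.com/https://www.http://www."


def compress_with_dict(data, zdict):
    """Deflate `data`, letting it reference substrings of the shared dictionary."""
    c = zlib.compressobj(level=9, zdict=zdict)
    return c.compress(data) + c.flush()


def decompress_with_dict(blob, zdict):
    """Inflate `blob`; the same dictionary must be supplied on this side too."""
    d = zlib.decompressobj(zdict=zdict)
    return d.decompress(blob) + d.flush()


urls = [
    b"http://www.example.com/index.html",
    b"http://www.example.org/news.html?page=2",
]

for url in urls:
    plain = zlib.compress(url, 9)                     # no shared dictionary
    shared = compress_with_dict(url, SHARED_DICT)     # with shared dictionary
    assert decompress_with_dict(shared, SHARED_DICT) == url
    print(url.decode(), "| raw:", len(url),
          "| zlib:", len(plain), "| zlib + dict:", len(shared))
```

The dictionary is stored once and shared by the compressor and decompressor, so each small document can refer to substrings such as "http://www." without carrying them itself.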

Other Uses

FemtoZip can also be used for building SDCH dictionaries.

Learn More

To learn more and get your hands dirty, check out the FemtoZip wiki at https://github.com/gtoubassi/femtozip/wiki
