Glossary Term

bzip2

Introduction and Overview
- bzip2 is a free and open-source file compression program.
- It uses the Burrows-Wheeler algorithm.
- It compresses single files only; it is not a file archiver.
- It relies on external utilities for tasks such as handling multiple files, encryption, and archive splitting.
- Julian Seward made the initial release in 1996.

Compression Techniques
- Stacks several layers of compression techniques: run-length encoding (RLE), Burrows-Wheeler transform (BWT), move-to-front transform (MTF), and Huffman coding.
- Compresses data in blocks of between 100 and 900 kB.
- The BWT converts frequently recurring character sequences into strings of identical letters.
- Performance is asymmetric: decompression is faster than compression.

Maintainers and Modifications
- The project has had multiple maintainers since the initial release.
- Micah Snyder has been the maintainer since June 2021.
- Variants such as pbzip2 add multi-threading to improve compression speed.
- Compressed blocks can be decompressed independently, which makes the format suitable for big data applications on cluster computing frameworks such as Hadoop and Apache Spark.

History and Implementation
- First public release by Julian Seward in July 1996.
- Version 1.0 was released in late 2000.
- Federico Mena accepted maintainership in June 2019, after a nine-year hiatus in updates.
- Micah Snyder became the maintainer in June 2021.
- The project continues to be developed.
- Compression applies RLE, BWT, MTF, and Huffman coding in a fixed order; decompression applies the inverse steps in reverse order.
- The RLE step replaces each sequence of consecutive duplicate symbols with the symbol and a repeat length.
- The Burrows-Wheeler transform is at the core of bzip2.
- The move-to-front transform and the RLE steps tune the output for the patterns found in natural data.

File Format, Efficiency, and Limitations
- No formal specification for bzip2 exists.
- A .bz2 stream consists of a 4-byte header, compressed blocks, and an end-of-stream marker carrying a 32-bit CRC.
- Compressed blocks are bit-aligned, and no padding is inserted between them.
- bzip2 compresses most files more effectively than the LZW and Deflate compression algorithms.
- LZMA is generally more space-efficient than bzip2, but compresses more slowly.
- Huffman coding is used with carefully selected codes.
- Each block carries a bitmap showing which symbols are used inside it.
- Limitations: a single 900 kB bzip2 block can hold only a bounded length of plaintext, and a compressed file cannot store multiple files.
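
To make the BWT and MTF stages described under History and Implementation more concrete, here is a minimal Python sketch. It is a naive illustration, not bzip2's actual code: the function names (`bwt`, `inverse_bwt`, `mtf_encode`, `mtf_decode`) are hypothetical, the rotation sort is far slower than a production implementation, and the RLE and Huffman stages are left out. The last lines use Python's standard `bz2` module to round-trip some data; the check that the 4-byte header begins with the ASCII bytes `BZh` followed by the level digit reflects commonly documented behaviour of the format and is not stated in the text above.

```python
import bz2


def bwt(data: bytes) -> tuple[bytes, int]:
    """Naive Burrows-Wheeler transform of non-empty input.

    Sorts every rotation of the input and returns the last column of the
    sorted matrix plus the row index of the original string.
    """
    n = len(data)
    # Sort rotation start offsets by the rotation they denote (demo only).
    order = sorted(range(n), key=lambda i: data[i:] + data[:i])
    last_column = bytes(data[(i - 1) % n] for i in order)
    return last_column, order.index(0)


def inverse_bwt(last_column: bytes, origin: int) -> bytes:
    """Invert bwt() using the last column and the original row index."""
    n = len(last_column)
    # A stable sort of positions by byte value links each row to the row
    # holding the rotation that starts one position later.
    nxt = sorted(range(n), key=lambda i: last_column[i])
    out = bytearray()
    idx = origin
    for _ in range(n):
        idx = nxt[idx]
        out.append(last_column[idx])
    return bytes(out)


def mtf_encode(data: bytes) -> list[int]:
    """Move-to-front: emit each byte's current list position, then move it to the front."""
    alphabet = list(range(256))
    out = []
    for b in data:
        i = alphabet.index(b)
        out.append(i)
        alphabet.insert(0, alphabet.pop(i))
    return out


def mtf_decode(indices: list[int]) -> bytes:
    """Inverse of mtf_encode()."""
    alphabet = list(range(256))
    out = bytearray()
    for i in indices:
        b = alphabet.pop(i)
        out.append(b)
        alphabet.insert(0, b)
    return bytes(out)


if __name__ == "__main__":
    text = b"banana"
    last, origin = bwt(text)
    print(last, origin)        # equal letters end up next to each other
    print(mtf_encode(last))    # repeated letters become zeros after MTF
    assert inverse_bwt(last, origin) == text

    # Round trip through the real compressor via the standard library.
    packed = bz2.compress(b"example " * 100, compresslevel=9)
    assert packed.startswith(b"BZh9")
    assert bz2.decompress(packed) == b"example " * 100
```

The sketch shows why the stages are chained: the BWT groups identical letters together, and MTF then turns those runs into small numbers (mostly zeros), which the later RLE and Huffman stages can encode compactly.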