Inflate To Deflate

Data is large, and hard drive space is a concern. Real-time effective compression is necessary. Highly evolved compression archivers for text, image, and video trade compression ratio for space and time complexity. Sophisticated context mixing algorithms like PAQ and variants[1] can compress a 1 GB text file into a 116 MB archive. Compression is an AI problem. It is about learning and representing an underlying process that generates the data. The representation is an encoding issue, while the process learning involves techniques ranging from shape coding to pattern segmentation, morphology to neural networks, decision trees to optimization, and matrix factorization to wavelet analysis.

This weekend, I explored a topological approach for data compression. It involves finding an algorithm (decompressor) that yields a tiling topologically equivalent to a boolean matrix. For any file, we get its byte stream. Considering the byte stream as an alias of the original file, we reshape it as a boolean matrix. The dimensions are irrelevant. Depending on the original file size, this matrix will be large (~ billions of rows and columns for a 10 GB file) and dense. To compress such a large binary matrix, we find a [combinatorial] rectangle packing algorithm. Such an algorithm can be found using the chain embedding technique.

The intuition is that the byte stream reshaped as a boolean matrix of dimensions M × N can tell us about its sponginess. An algorithm that generates tiling rules for a boolean matrix satisfying the topological properties is the decompressor, and its embedding into a chain complex becomes the compressed file.

I opted for a new topological approach to compress large boolean matrices because existing methods don't scale well. Transforming it into a packing algorithm discovery problem allows efficient encoding by the decompressor, offering speed without losing information density. Initial experiments show a 25-38 compression ratio on some synthetic images. My focus is on finding the algorithm before fine-tuning for performance, so I've been adjusting inputs for easier discovery.