The Center for Education and Research in Information Assurance and Security (CERIAS)

Differentially Private Compression Algorithms

Principal Investigator: Jeremiah Blocki

Data compression is a fundamental tool used throughout modern computing systems to reduce storage requirements and communication bandwidth. Widely deployed systems such as ZIP archives, web content compression, and network protocols rely on compression algorithms to exploit redundancy in data before storing or transmitting files. In many settings, compression is combined with encryption in a Compress-Then-Encrypt pipeline: the data is first compressed to reduce its size and then encrypted to protect its contents. While encryption protects the underlying plaintext from direct inspection, it does not hide the length of the encrypted message, which can itself leak sensitive information about the original data.

This leakage arises because compression ratios depend on the structure and redundancy of the input data. If two candidate plaintexts compress to significantly different lengths, an attacker observing the ciphertext length may be able to infer information about the plaintext even though the ciphertext contents are encrypted. Several real-world attacks exploit this phenomenon. For example, the CRIME and BREACH attacks leveraged compression length leakage in web protocols to recover sensitive information from encrypted network traffic. Despite these risks, compress-then-encrypt pipelines remain widely used because compression substantially reduces bandwidth and storage costs.
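This length side channel is easy to reproduce with an off-the-shelf DEFLATE implementation. The sketch below (payloads are illustrative, not taken from the actual attacks) compresses two equal-length plaintexts with Python's zlib; the plaintext that repeats the secret compresses to fewer bytes, which is exactly the signal that CRIME- and BREACH-style attacks measure.

```python
import zlib

# A secret value, and two equal-length attacker-influenced prefixes:
# one repeats the secret (a "correct guess"), one does not.
secret = b"token=SECRET123"
guess_right = b"GET /?q=token=SECRET123 " + secret
guess_wrong = b"GET /?q=token=QWJZKPLVR " + secret
assert len(guess_right) == len(guess_wrong)

len_right = len(zlib.compress(guess_right))
len_wrong = len(zlib.compress(guess_wrong))

# The correct guess compresses better because DEFLATE's LZ77 stage
# replaces the repeated 15-byte substring with a short back-reference,
# so the ciphertext length alone distinguishes the two plaintexts.
print("correct guess:", len_right, "bytes; wrong guess:", len_wrong, "bytes")
```

Even though both plaintexts have identical length, their compressed (and hence encrypted) lengths differ, letting an observer test guesses about the secret one compression at a time.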

This project investigates how to mitigate compression-length leakage by developing differentially private compression schemes. Differential privacy has emerged as a rigorous mathematical framework for limiting the amount of information that can be inferred about sensitive data from an algorithm’s output. Informally, a randomized algorithm is differentially private if its output distribution does not change significantly when a single element of the input dataset is modified. Applying this idea to compression, the goal is to ensure that the length of the compressed output does not reveal much information about any individual character or symbol in the input file.

Our work introduces a general Differentially Private Compress-Then-Encrypt framework that adds a carefully calibrated amount of random padding to the compressed output before encryption. The amount of padding depends on the global sensitivity of the compression algorithm—defined as the maximum change in the compressed length that can occur when a single character of the input string is modified. If the compression algorithm has low sensitivity, then only a small amount of padding is needed to achieve strong differential privacy while preserving most of the benefits of compression.
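The shape of the pipeline can be sketched as follows. The function name dp_compress, the choice of one-sided geometric padding with parameter exp(-epsilon/sensitivity), and the 4-byte length header are assumptions made for this illustration; the exact padding distribution and its privacy analysis are given in the paper, not reproduced here, and the final encryption step is only indicated in a comment.

```python
import math
import secrets
import zlib

def dp_compress(data: bytes, sensitivity: int, epsilon: float) -> bytes:
    """Compress, then append random padding calibrated to the compressor's
    global sensitivity (sketch only; not the paper's exact mechanism)."""
    compressed = zlib.compress(data)

    # One-sided geometric padding: P(pad = k) = (1 - p) * p**k with
    # p = exp(-epsilon / sensitivity), so a one-symbol input change
    # (which shifts the length by at most `sensitivity`) shifts the
    # padded-length distribution by at most a factor related to e**epsilon.
    rng = secrets.SystemRandom()
    p = math.exp(-epsilon / sensitivity)
    pad = 0
    while rng.random() < p:
        pad += 1

    # Prepend the true compressed length so the receiver can strip the
    # padding; in the real pipeline, header + body would then be
    # encrypted with an authenticated encryption scheme.
    header = len(compressed).to_bytes(4, "big")
    return header + compressed + bytes(pad)
```

A lower-sensitivity compressor lets epsilon/sensitivity stay large, which makes p small and keeps the expected padding overhead p/(1-p) modest.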

A central challenge is therefore understanding the sensitivity of real-world compression algorithms. Some compression schemes are extremely sensitive to small input changes. Prior work showed that certain dictionary-based compression algorithms can produce drastically different compressed outputs even when the input strings differ by only a single symbol. If such sensitivity is unavoidable, achieving differential privacy would require large amounts of padding, eliminating the benefits of compression.

Initial Results

Our initial results, published at TCC 2025, establish a theoretical foundation for differentially private compression and provide the first detailed analysis of the sensitivity of a widely deployed compression algorithm. At the heart of these results is the Differentially Private Compress-Then-Encrypt framework described above: the padding it adds is calibrated to the global sensitivity of the underlying compression algorithm, so bounding that sensitivity directly bounds the overhead of the privacy guarantee.

A key technical component of this work is a new analysis of the LZ77 compression algorithm, one of the most widely used compression techniques in practice and a core component of widely deployed formats such as DEFLATE. We establish nearly tight bounds on the global sensitivity of LZ77 compression length. In particular, for an input string of length n and sliding window size W, we show that the sensitivity is at most O(W^{2/3}\log n). When the window size equals the input length (W=n), we prove a matching lower bound of Ω(n^{2/3}\log^{1/3} n), which is tight up to a sublogarithmic factor. These bounds demonstrate that while LZ77 is sensitive to local input changes, its sensitivity grows sublinearly with the input length, enabling differential privacy guarantees with a moderate amount of additional padding.
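The sensitivity phenomenon can be explored empirically with a toy model. The sketch below uses hypothetical helper functions and counts phrases in a greedy LZ77 parse rather than DEFLATE output bits, so it only approximates the length measure analyzed in the paper; it flips each character of a string and records the largest change in parse size.

```python
def lz77_phrase_count(s: str, window: int) -> int:
    """Greedy LZ77 parse over a sliding window of size `window`:
    at each position, emit the longest match found in the preceding
    window (or a single literal) and count the phrases emitted.
    Toy model: real encoders measure output bits, not phrases."""
    i, phrases = 0, 0
    while i < len(s):
        best = 1
        for j in range(max(0, i - window), i):
            k = 0
            # Matches may overlap the current position, as in LZ77.
            while i + k < len(s) and s[j + k] == s[i + k]:
                k += 1
            best = max(best, k)
        i += best
        phrases += 1
    return phrases

def empirical_sensitivity(s: str, window: int) -> int:
    """Largest change in phrase count over all single-character
    substitutions of `s` (alphabet {a, b} for this toy example)."""
    base = lz77_phrase_count(s, window)
    worst = 0
    for i in range(len(s)):
        for c in "ab":
            if c != s[i]:
                t = s[:i] + c + s[i + 1:]
                worst = max(worst, abs(lz77_phrase_count(t, window) - base))
    return worst
```

Running empirical_sensitivity on highly repetitive strings shows that a single substitution can break long matches into several shorter phrases, while the theoretical bounds above show this effect grows only sublinearly in the input length.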

In addition to these bounds, we investigate whether high sensitivity is inherent to effective compression. We show that idealized compression schemes, such as those based on Kolmogorov complexity, can simultaneously achieve optimal compression rates and low sensitivity. This result suggests that strong compression performance and low sensitivity are not fundamentally incompatible, motivating future work on designing compression algorithms that are both efficient and privacy-preserving. Together, these results provide the first rigorous analysis connecting data compression, differential privacy, and cryptographic security, and they establish a foundation for building practical compression pipelines that limit information leakage through compressed output length.


Personnel

Students: Seunghoon Lee, Brayan Sebastian Yepes-Garcia

Representative Publications

  • Differentially Private Compression and the Sensitivity of LZ77. Blocki, J., Lee, S., and Garcia, B. TCC 2025.