Testing and detecting software upgrade failures in data-intensive distributed systems
Principal Investigator: Yongle Zhang
In the current big data era, Internet services are often built on top of data-intensive distributed systems, such as distributed storage systems and distributed computation frameworks. These systems must undergo software upgrades as vendors add new features, improve performance, and deploy patches. With the rise of continuous deployment in industry, distributed system software upgrades could reach thousands of deployments in a single day at a major Internet company.
Unfortunately, distributed systems can experience upgrade failures: failures that occur during a software upgrade. These failures often have large-scale impact because the upgrade is performed on the entire system. In production, they are typically mitigated with canary deployment, which slowly rolls out updates from a small scale to the entire cluster and downgrades if a failure is encountered. However, canary deployment easily takes hours, creating a dilemma between safe and fast upgrades. In addition, many upgrade failures have persistent impact and cannot be easily resolved by downgrading.
Despite the severe consequences of upgrade failures and the challenges faced by production mitigation techniques, no existing testing or program analysis techniques focus on systematically testing and analyzing the distributed system upgrade procedure. This work proposes to develop such techniques, optimized to detect upgrade failures at an early stage, by exploring the effectiveness of properties unique to the distributed system software upgrade procedure.
Representative Publications
- UpFuzz: Detecting Data Format Incompatibility Bugs during Distributed Storage System Upgrade. Ke Han, Sruthi P C, Yayu Wang, Yaoxu Song, Bishal Basak Papan, Junwen Yang, Pedro Fonseca, Yongle Zhang. To appear in the 23rd USENIX Symposium on Networked Systems Design and Implementation (NSDI '26).

