Statistical Tools for Analyzing the Security of User Chosen Passwords
Principal Investigator: Jeremiah Blocki
Passwords remain one of the most widely deployed authentication mechanisms, yet they are fundamentally vulnerable because users frequently choose low-entropy secrets that can be guessed by an attacker. When authentication databases are breached, attackers can attempt to recover passwords using large-scale offline guessing attacks. In such attacks, the adversary repeatedly tests candidate passwords in decreasing order of likelihood until the correct password is found. The effectiveness of these attacks depends heavily on the distribution of user-selected passwords, which is highly skewed: a relatively small set of common passwords accounts for a significant fraction of real-world user choices.
Understanding the security of password authentication systems therefore requires accurate models of how many passwords an attacker can expect to crack after making a given number of guesses. This relationship is typically captured by the guessing curve, which describes the fraction of accounts compromised as a function of the number of guesses an attacker performs. Reliable estimates of this curve are essential for evaluating password policies, selecting parameters for password hashing algorithms, and estimating the real-world security of deployed authentication systems.
However, accurately estimating password security presents a fundamental statistical challenge. In practice, we typically observe a breached password dataset consisting of independent samples from the underlying password distribution, but the distribution itself is unknown. Many prior approaches estimate attacker success by fitting a password-cracking model to the observed data and measuring the performance of that model. This approach can provide useful insights, but it implicitly assumes that the defender’s model of password guessing is as good as the attacker’s. Following Kerckhoffs’ principle, we should instead assume that the attacker may know the true password distribution or possess a better guessing strategy than the defender. This raises a key question: given only IID samples from an unknown password distribution, what can we rigorously infer about the guessing curve of an (optimal) attacker? Addressing this question requires new statistical tools that allow us to bound attacker success even when the underlying password distribution is unknown.
Our research addresses this challenge by developing statistical techniques for rigorously analyzing password security from sampled password datasets. In two papers published at IEEE Symposium on Security and Privacy (S&P) 2023, we introduced new tools for estimating and bounding password guessing success. In particular, our work develops methods to derive provable bounds on the guessing curve of an ideal attacker based only on observed password samples. These results provide stronger guarantees than traditional empirical analyses because they account for uncertainty about the underlying password distribution rather than relying solely on fitted password-cracking models.
In a complementary line of work with Wenjie Bai and Peiyuan Liu, we introduced a technique called Confident Monte Carlo, which enables efficient and statistically rigorous evaluation of password-cracking models. This method allows us to compute high-confidence upper bounds on the guessing curve of an attacker using a specific password-cracking strategy, without requiring exhaustive evaluation of the model over extremely large candidate sets. In addition, Confident Monte Carlo can be used to compute confidence intervals for the guessing number of individual passwords under a given model. These estimates are particularly useful for applications such as password strength meters, where it is important to provide users with meaningful security feedback while accounting for statistical uncertainty. Together, these tools provide a more principled framework for evaluating password security and understanding the effectiveness of both attackers and defenses in password-based authentication systems.
Ultimately, these statistical tools complement our research on password hashing and memory-hard functions by enabling more accurate evaluation of password security in practice. By combining improved defenses against offline attacks with rigorous methods for estimating attacker success, this research aims to provide a more complete framework for understanding and improving the security of password-based authentication systems.
Artifacts: To encourage reproducibility and facilitate follow-up research, we have also developed open-source artifacts implementing the statistical techniques introduced in our work. Recently, an undergraduate student in our group helped refactor and clean up the codebases associated with both projects, improving documentation and usability so that other researchers can easily apply these tools in their own studies. The implementation of the Confident Monte Carlo framework is publicly available at
https://github.com/ConfidentMonteCarlo/ConfidentMonteCarlo, and the implementation supporting our work on bounding password guessing curves is available at
https://github.com/ArgonAttack/PasswordGuessingCurves.
Personnel
Students: Wenjie Bai Peiyuan Liu Leo Yo-Gui
Representative Publications
Blocki, J. and Peiyuan, L. Towards a Rigorous Statistical Analysis of Empirical Password Datasets. IEEE Symposium on Security and Privacy (S&P 2023).
Peiyuan, L., Blocki, J. and Bai, W. Confident Monte Carlo: Rigorous Analysis of Guessing Curves for Probabilistic Password Models. IEEE Symposium on Security and Privacy (S&P 2023)

