The Center for Education and Research in Information Assurance and Security (CERIAS)

Phishing Email Detection Through Machine Learning and Word Error Correction

Principal Investigator: Quamar Niyaz

Phishing is one of the most prevalent and effective fraudulent activities on the Internet. Numerous machine learning (ML)-based models have been developed to detect phishing emails using publicly available datasets (e.g., Nazario, Millersmile). Emails in these datasets often contain poor grammar or incorrect word usage, which ML models tend to learn as key distinguishing features. With the advent of large language models (LLMs), the grammatical quality and structure of phishing emails have improved significantly, making them appear more legitimate. As a result, traditional ML models that rely on grammatical cues may become less effective at identifying phishing emails. To address this challenge, we explore the following research question: Can an ML-based phishing detection model, enhanced with word correction and splitting techniques, effectively identify phishing emails? To investigate this, we develop a phishing detection system that integrates misspelled-word correction and combined-word splitting into the data preprocessing stage. The system leverages state-of-the-art natural language processing (NLP) techniques to enhance detection accuracy. Additionally, to improve model robustness, we use datasets from diverse sources and time periods for training and deployment.
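
The two preprocessing steps named above, misspelled-word correction and combined-word splitting, can be illustrated with a minimal Python sketch. It is not the project's implementation: the toy vocabulary, the function names (split_combined, correct_word, preprocess), the similarity cutoff, and the use of the standard-library difflib module in place of a dedicated NLP spell checker are all assumptions made for illustration.

```python
import re
from difflib import get_close_matches

# Hypothetical toy vocabulary; a real system would use a large dictionary.
VOCAB = {"please", "verify", "your", "account", "password",
         "click", "here", "to", "update", "bank", "now"}


def split_combined(token, vocab=VOCAB):
    """Split a run-together token into vocabulary words.

    Dynamic programming over prefixes: best[i] holds the shortest list of
    vocabulary words that exactly covers token[:i], or None if none exists.
    Returns None when the whole token cannot be segmented.
    """
    n = len(token)
    best = [None] * (n + 1)
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and token[j:i] in vocab:
                candidate = best[j] + [token[j:i]]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[n]


def correct_word(token, vocab=VOCAB, cutoff=0.8):
    """Replace a token with its closest vocabulary word, if similar enough."""
    matches = get_close_matches(token, sorted(vocab), n=1, cutoff=cutoff)
    return matches[0] if matches else token


def preprocess(text):
    """Lowercase, tokenize, then split combined words or correct misspellings."""
    tokens = []
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in VOCAB:
            tokens.append(token)
            continue
        parts = split_combined(token)
        if parts:
            tokens.extend(parts)
        else:
            tokens.append(correct_word(token))
    return tokens


print(preprocess("Please verfy your acount, clickhere to updatepassword"))
```

In this sketch, "clickhere" and "updatepassword" are split into their component words, while "verfy" and "acount" are mapped to "verify" and "account". The normalized tokens would then be passed to whatever feature extraction and classifier the detection model uses.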

Representative Publications

  • Deeksha Kulal, Leul Shiferaw and Quamar Niyaz, "Phishing Email Detection Through Machine Learning and Word Error Correction," 2025 17th International Conference on COMmunication Systems and NETworks (COMSNETS), Bengaluru, India, 2025, pp. 1299-1304, doi: 10.1109/COMSNETS63942.2025.10885558.

Keywords: Cybersecurity, Natural Language Processing, Phishing Detection