Phishing has long been an efficient and easy tool for trickery and deception on the Internet, costing users billions of dollars every year. While solutions such as blacklisting in Internet browsers have been effective to some degree, their reliance on exact matching of a URL against blacklist entries makes them easy for attackers to evade and reduces their ability to detect and stop such attacks. In this paper, we observe that several simple modifications, such as changing the TLDs of existing blacklist entries, can reveal new sources of maliciousness. In particular, we propose five heuristics based on our observations of real blacklists and evaluate them extensively using real-time blacklist feeds. Our approach led to the discovery of more than 18,000 new malign URLs, validated through content matching, from a set of 6,000 original blacklist entries. Based on the success of these heuristics, we translate the mechanisms underlying them into an approximate matching algorithm that significantly increases the resiliency of blacklists. We propose a system, \system, that breaks a URL into multiple components, matches each component against regular expressions or hash maps built from the original set of blacklist entries, and assigns a normalized score used to determine whether the URL is malign. We evaluate the efficacy of \system using real blacklist URLs from PhishTank and SpamScatter and benign URLs from Yahoo and DMOZ. We find that \system has a false positive rate of 3% and a false negative rate of 5%. The false positive rate can be lowered arbitrarily by adjusting the threshold.
Keywords: phishing attack, blacklist, security
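The component-matching scheme summarized in the abstract, splitting a URL, checking each component against indices built from blacklist entries, and thresholding a normalized score, can be sketched as follows. This is a minimal illustration only: the component set, weights, TLD-stripping rule, and threshold are assumptions for exposition, not the paper's actual parameters.

```python
import re
from urllib.parse import urlparse

# Hypothetical per-component weights and decision threshold.
WEIGHTS = {"host": 0.5, "path": 0.3, "query": 0.2}
THRESHOLD = 0.6

def build_index(blacklist_urls):
    """Build per-component hash sets from blacklist entries."""
    index = {"host": set(), "path": set(), "query": set()}
    for url in blacklist_urls:
        p = urlparse(url)
        # Strip the TLD so lookalike domains on other TLDs still match
        # (one of the simple modifications the heuristics target).
        host = re.sub(r"\.[a-z]+$", "", p.netloc.lower())
        index["host"].add(host)
        if p.path and p.path != "/":
            index["path"].add(p.path.lower())
        if p.query:
            index["query"].add(p.query.lower())
    return index

def score(url, index):
    """Return a normalized maliciousness score in [0, 1]."""
    p = urlparse(url)
    host = re.sub(r"\.[a-z]+$", "", p.netloc.lower())
    s = 0.0
    if host in index["host"]:
        s += WEIGHTS["host"]
    if p.path.lower() in index["path"]:
        s += WEIGHTS["path"]
    if p.query.lower() in index["query"]:
        s += WEIGHTS["query"]
    return s

blacklist = ["http://paypal-login.com/verify.php?id=1"]
idx = build_index(blacklist)
# The same phishing page re-registered under a different TLD evades an
# exact-match blacklist, but still scores above the threshold here.
print(score("http://paypal-login.net/verify.php?id=1", idx))  # → 1.0
```

Raising the threshold trades false negatives for false positives, which is the knob the abstract refers to when noting the false positive rate can be lowered arbitrarily.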