TopCat: Data Mining for Topic Identification in a Text Corpus
Author
Christopher Clifton
Tech report number
CERIAS TR 2001-91
Entry type
conference
Abstract
TopCat (Topic Categories) is a technique for identifying topics that recur in articles in a text corpus. Natural language processing techniques are used to identify key entities in individual articles, allowing us to represent an article as a set of items. This allows us to view the problem in a database/data mining context: identifying related groups of items. This paper presents a novel method for identifying related items based on “traditional” data mining techniques. Frequent itemsets are generated from the groups of items, followed by clusters formed with a hypergraph partitioning scheme. We present an evaluation against a manually categorized “ground truth” news corpus showing this technique is effective in identifying “topics” in collections of news articles.
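The frequent-itemset step mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes each article has already been reduced to a set of entity names, uses brute-force enumeration rather than an optimized miner such as Apriori, and omits the hypergraph partitioning step entirely. The function name `frequent_itemsets`, its parameters, and the sample articles are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(articles, min_support, max_size=3):
    """Count itemsets of up to max_size items and keep those that occur
    in at least min_support articles (brute-force, for illustration only)."""
    counts = Counter()
    for items in articles:
        unique = sorted(set(items))  # one vote per article, stable ordering
        for size in range(1, max_size + 1):
            for combo in combinations(unique, size):
                counts[frozenset(combo)] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Hypothetical entity sets, standing in for the output of the NLP step.
articles = [
    {"NATO", "Kosovo", "Belgrade"},
    {"NATO", "Kosovo"},
    {"Kosovo", "Belgrade", "NATO"},
    {"Microsoft", "Justice Department"},
    {"Microsoft", "Justice Department", "Bill Gates"},
]

result = frequent_itemsets(articles, min_support=2)
for itemset, n in sorted(result.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(n, sorted(itemset))
```

In this toy data the itemsets fall into two groups of co-occurring entities, which is the kind of structure a later clustering step would be expected to merge into topics.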
Date
1999 – 09
Key alpha
Clifton
Note
3rd European Conference on Principles and Practice of Knowledge Discovery in Databases
September 15-18, 1999 in Prague, Czech Republic
Lecture Notes in Artificial Intelligence 1704, Springer-Verlag (Draft Available)
Publication Date
2001-09-01

