TopCat: data mining for topic identification in a text corpus
Author
Christopher Clifton
Tech report number
CERIAS TR 2004-90
Entry type
article
Abstract
TopCat (topic categories) is a technique for identifying topics that recur in articles in a text corpus. Natural language processing techniques are used to identify key entities in individual articles, allowing us to represent an article as a set of items. This lets us view the problem in a database/data mining context: identifying related groups of items. We present a novel method for identifying related items based on traditional data mining techniques. Frequent itemsets are generated from the groups of items, and these itemsets are then clustered using a hypergraph partitioning scheme. We present an evaluation against a manually categorized ground-truth news corpus; it shows that this technique is effective in identifying topics in collections of news articles.
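As a rough illustration of the frequent-itemset step the abstract describes, the sketch below treats each article as a set of named entities and counts which entity sets co-occur in at least a minimum number of articles. It is not the paper's implementation: the function name `frequent_itemsets`, the `min_support` and `max_size` parameters, and the toy articles are all assumptions made for this example, and the brute-force enumeration stands in for a proper frequent-itemset miner. The hypergraph partitioning step is not shown.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(articles, min_support, max_size=3):
    """Return entity sets that appear together in at least min_support articles.

    Hypothetical helper: brute-force enumeration of all subsets up to
    max_size, which is only practical for small per-article entity sets.
    """
    counts = Counter()
    for entities in articles:
        items = sorted(set(entities))  # canonical order so tuples are comparable
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Invented toy corpus: each article is reduced to its set of key entities.
articles = [
    {"Clinton", "Congress", "budget"},
    {"Clinton", "Congress", "veto"},
    {"NASA", "Mars", "rover"},
    {"NASA", "Mars", "launch"},
    {"Clinton", "budget"},
]

result = frequent_itemsets(articles, min_support=2, max_size=2)
for itemset, support in sorted(result.items()):
    print(itemset, support)
```

In this toy corpus the frequent pairs fall into two groups, one around "Clinton" and one around "NASA" and "Mars"; grouping such itemsets into topics is the role the abstract assigns to hypergraph partitioning.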
Date
2004-08
Address
Los Alamitos, CA
Journal
Transactions on Knowledge and Data Engineering
Key alpha
Clifton
Number
8
Pages
949-964
Publisher
IEEE Computer Society Press
Volume
16
Publication Date
2004-08-01
Language
English

