Automatic identification of classified documents
Judy Hochberg - Computer Research and Applications Group (CIC-3) at Los Alamos National Laboratory
Feb 25, 2000
Abstract
How can one automatically identify classified documents? This is a vital question for the Department of Energy (DOE), which is reviewing millions of classified documents for possible declassification, and for Los Alamos National Laboratory (LANL), which is checking its unclassified computing storage systems for the presence of classified documents.
The DOE, having already developed an expert rule system for automatic document classification, provided LANL with a small set of documents with which to explore a statistical classifier as an alternative. We represented documents as vectors of character trigram frequencies, used a chi-square statistic to select the most discriminating trigrams, and trained a linear classifier to distinguish classified from unclassified documents. Results ranged from 60% to 87% accuracy, depending on the training set size and other variables.
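The statistical pipeline described above (trigram frequency vectors, chi-square feature selection, a linear classifier) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the abstract does not say which linear classifier was trained or how the chi-square statistic was computed, so this version scores each trigram on a presence/absence 2x2 table and uses a simple perceptron as a stand-in. The four toy texts, the function names, and the choice of k=5 are hypothetical and have nothing to do with the actual DOE document set.

```python
from collections import Counter


def trigram_freqs(text):
    """Map each overlapping character trigram to its relative frequency."""
    text = text.lower()
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    total = sum(counts.values())
    return {gram: n / total for gram, n in counts.items()}


def chi_square(gram, docs, labels):
    """Chi-square for the 2x2 table of trigram presence vs. class label."""
    a = b = c = d = 0
    for freqs, label in zip(docs, labels):
        present = gram in freqs
        if present and label == 1:
            a += 1      # present, positive class
        elif present:
            b += 1      # present, negative class
        elif label == 1:
            c += 1      # absent, positive class
        else:
            d += 1      # absent, negative class
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def select_trigrams(docs, labels, k):
    """Keep the k trigrams with the highest chi-square (ties broken by name)."""
    vocab = set().union(*docs)
    return sorted(vocab, key=lambda g: (-chi_square(g, docs, labels), g))[:k]


def vectorize(freqs, selected):
    """Project a document onto the selected trigram features."""
    return [freqs.get(g, 0.0) for g in selected]


def predict(w, b, x):
    """Linear decision rule: 1 if w.x + b > 0, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


def train_perceptron(X, y, max_epochs=100):
    """Perceptron stand-in for the unspecified linear classifier."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, label in zip(X, y):
            err = label - predict(w, b, x)
            if err:
                w = [wi + err * xi for wi, xi in zip(w, x)]
                b += err
                mistakes += 1
        if mistakes == 0:   # every training document classified correctly
            break
    return w, b


# Hypothetical toy data: label 1 stands in for "classified", 0 for "unclassified".
texts = ["secret plan alpha", "secret code beta",
         "lunch menu today", "public talk notes"]
labels = [1, 1, 0, 0]

docs = [trigram_freqs(t) for t in texts]
selected = select_trigrams(docs, labels, k=5)
X = [vectorize(d, selected) for d in docs]
weights, bias = train_perceptron(X, labels)


def classify(text):
    """Classify new text with the trained toy model."""
    return predict(weights, bias, vectorize(trigram_freqs(text), selected))


print(selected)
print([classify(t) for t in texts])
```

On data this small the selected trigrams are simply the ones shared by both positive examples, so the sketch says nothing about the 60% to 87% accuracy range reported for the real documents; it only shows how the three stages fit together.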
In contrast, the LANL effort started "from scratch" and needed to be moved rapidly into large-scale production. We implemented an expert system tailored to the classified documents of most concern to LANL. The talk will discuss the practical issues that arose in canvassing large numbers of files in a variety of formats, and the security issues involved in the sampling, analysis, and notification processes.
About the Speaker
Judy Hochberg is a staff scientist at Los Alamos National Laboratory. She received a B.A. in linguistics from Harvard and a Ph.D. in linguistics from Stanford. Before joining the Laboratory in 1989, she was a post-doctoral researcher at the University of Chicago, then a visiting Assistant Professor at Northwestern University. She has published in journals including Computers and Security, IEEE Transactions on Pattern Analysis and Machine Intelligence, and Language. She has been an R&D 100 award winner and a national finalist in the Johns Hopkins National Search for Computing to Assist Persons with Disabilities. Judy is interested in all manifestations of human language, including document analysis -- text and images -- and speech.
The views, opinions and assumptions expressed in these videos are those of the presenter and do not necessarily reflect the official policy or position of CERIAS or Purdue University. All content included in these videos is the property of Purdue University, the presenter and/or the presenter’s organization, and is protected by U.S. and international copyright laws. The collection, arrangement and assembly of all content in these videos and on the hosting website are the exclusive property of Purdue University. You may not copy, reproduce, distribute, publish, display, perform, modify, create derivative works, transmit, or in any other way exploit any part of the copyrighted material without permission from CERIAS, Purdue University.