The Center for Education and Research in Information Assurance and Security (CERIAS)

The Center for Education and Research in
Information Assurance and Security (CERIAS)

Reliable Identification of Significant Sets of Episodes in Event Sequences


Download PDF Document


Robert Gwadera

Tech report number

CERIAS TR 2005-82

Entry type



In this thesis we present a solution to the problem of identification of significant sets of episodes in event sequences. In order to determine the significance of an episode in a monitored event sequence, we compare its observed frequency to its frequency in a reference sequence. The reference sequence in our work is represented by a variable-length Markov model of generating symbols in the reference sequence. An episode is significant if the probability that it would have a given frequency by chance, in the reference sequence, is very small. In order to identify significant episodes we first show how to select the sliding window size to ensure that a discovered episode is meaningful and then we show how to compute a lower threshold for under-represented and an upper threshold for overrepresented significant episodes. The frequency of occurrence alone is not enough to determine significance, i.e., an infrequent episode can be more significant than a frequent one, and the significance depends on the structure of the episode and on probabilistic characteristics of the reference and monitored event streams. As an extension, we propose a novel method for providing approximate answers, with probabilistic guarantees, to a class of ad hoc sliding window queries referencing past data in data streams. The queries in that class compute the frequency of past windows that satisfy given join conditions among tuples in a window comprising multiple streams. To represent the join conditions consisting of intra-stream and inter-stream constraints between tuples in the window we introduce a concept of a 2D-episode.




2005 – 12

Key alpha



Purdue University

Publication Date



1. Introduction 2. Mining stream data 3. Notation 4. Mining frequent episodes 5. Finding occurrences of episodes 6. Mining significant episodes 7. Identification of significant sets of episodes 8. Analysis of the significance thresholds 9. Variable-length Markov model 10. Analysis of the probability of existence of an episode 11. Experimental results 12. 2D-episodes 13. Summary



BibTex-formatted data

To refer to this entry, you may select and copy the text below and paste it into your BibTex document. Note that the text may not contain all macros that BibTex supports.