SCII: Sequence Classification based on Interesting Itemsets

1. INTRODUCTION

 

Sequence classification is an important task in data mining. We address the problem of sequence classification using rules composed of interesting itemsets found in a dataset of labelled sequences and accompanying class labels. We measure the interestingness of an itemset in a given class of sequences by combining the cohesion and the support of the itemset. We use the discovered itemsets to generate confident classification rules, and present two different ways of building a classifier. The first classifier (SCII_CBA) is based on the CBA (Classification Based on Associations) method, but uses a new ranking strategy for the generated rules, achieving better results. The second classifier (SCII_MATCH) ranks the rules by first measuring their value specific to the new data object.
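As a rough guide to the measure mentioned above (the precise definitions are given in the paper; the formulas below are only an informal sketch), the interestingness of an itemset X in a class c combines how often X occurs with how close together its items occur:

\[
I(X,c) \,=\, F(X,c)\cdot C(X,c),
\qquad
F(X,c) \,=\, \frac{|\{\,s \in D_c : X \subseteq s\,\}|}{|D_c|},
\qquad
C(X,c) \,=\, \frac{|X|}{\overline{W}(X,c)},
\]

where \(D_c\) denotes the sequences with class label \(c\) and \(\overline{W}(X,c)\) denotes the average length of the shortest windows containing all items of X, taken over the sequences in \(D_c\) that contain X.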

PDF

Itemset Based Sequence Classification by Cheng Zhou, Boris Cule and Bart Goethals. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2013), Springer, 2013.

 

 

2. DOWNLOADING THE SOFTWARE

 

Download SCII runnable JAR file

2.1. Compiling

SCII has been implemented in Java using the Java SE Development Kit (JDK), version 1.7. Note that a different version of the JDK may influence the run time.
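For reference, the sources can be compiled from the command line with a standard JDK. For example, on a Unix-like system, and assuming the source files have been unpacked into a folder named src (the folder name here is only illustrative, the layout of the archive may differ):

mkdir -p bin
javac -d bin $(find src -name "*.java")

Alternatively, the project can simply be imported into an IDE such as Eclipse (see section 4.4).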

2.2. Documentation

The documentation for SCII can be generated using Javadoc. All important methods have been provided with the necessary comments.
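For example, assuming the same illustrative src folder layout as above, the documentation can be generated with:

javadoc -d doc $(find src -name "*.java")

The generated HTML pages will then appear in the doc folder.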

 

3. DOWNLOADING THE DATASETS

 

Download datasets

3.1. Protein dataset

The original protein data was obtained from PhosphoELM. This dataset consists of different combinations of amino acids for each kind of protein. We chose two of the biggest protein groups, PKA_group and SRC, to form the Protein dataset. After downloading the data from PhosphoELM, you need to run a jar file (included in datasets.zip) to obtain the Protein dataset used in our experiments. We treat each combination of amino acids as a sequence and consider each amino acid as an item. Each sequence is labelled with the protein group it belongs to. This dataset consists of 666 sequences containing 20 different items.

3.2. Reuters dataset

The Reuters-21578 dataset, consisting of news stories, was assembled and indexed with categories by Reuters Ltd. personnel. We consider the words appearing in the texts as items and treat each paragraph as a sequence. We formed the Reuters1 dataset using the two biggest classes in the Reuters-21578 dataset, acq (1596 paragraphs) and earn (2840 paragraphs). Reuters2 consists of the four biggest classes in Reuters-21578, acq, earn, crude (253 paragraphs) and trade (251 paragraphs), and is therefore an imbalanced dataset. Reuters3 is a balanced dataset obtained from Reuters2 by keeping only the first 253 paragraphs in the top two classes. Reuters1 consists of 4436 sequences composed of 11947 different items, Reuters2 of 4940 sequences containing 13532 distinct items, and Reuters3 of 1010 sequences composed of 6380 different items.

3.3. Format of a dataset

Given the dataset: a b a b c b a

The data file will be: a 1 b 2 a 1 b 2 c 3 b 1 a 2

Within a sequence, each line is an event consisting of an item and a time stamp. There is a blank line after each sequence. All sequences in a data file belong to the same class, so there will be n data files if there are n classes of data.
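For illustration, the following minimal Java sketch reads a data file in this format; it is not part of the released code, and the class and method names are hypothetical. It treats every whitespace-separated (item, time stamp) pair as an event and a blank line as the end of a sequence:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Minimal illustration of reading an SCII-style data file (hypothetical helper). */
public class SequenceReader {

    /** One (item, time stamp) event. */
    public static class Event {
        final String item;
        final int time;
        Event(String item, int time) { this.item = item; this.time = time; }
        @Override public String toString() { return item + "@" + time; }
    }

    /** Reads a file in which events are "item timestamp" pairs and a blank
     *  line marks the end of a sequence. */
    public static List<List<Event>> read(String path) throws IOException {
        List<List<Event>> sequences = new ArrayList<>();
        List<Event> current = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.trim().isEmpty()) {                 // blank line: sequence ends
                    if (!current.isEmpty()) {
                        sequences.add(current);
                        current = new ArrayList<>();
                    }
                    continue;
                }
                String[] tokens = line.trim().split("\\s+");
                for (int i = 0; i + 1 < tokens.length; i += 2) {   // consume (item, time) pairs
                    current.add(new Event(tokens[i], Integer.parseInt(tokens[i + 1])));
                }
            }
        }
        if (!current.isEmpty()) sequences.add(current);      // last sequence may lack a trailing blank line
        return sequences;
    }

    public static void main(String[] args) throws IOException {
        for (List<Event> s : read(args[0])) System.out.println(s);
    }
}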

 

4. RUNNING THE SOFTWARE

 

4.1 Running the .jar file

You can run the .jar file with the command (parameters explained below):

java -jar SCII.jar NumClasses DataFiles RulesFile MinInts MinSize MaxSize MinSups TransFile MinConf ChooseMethod
Optionally, the user can allocate more memory, which might be necessary for processing larger datasets. This is done by adding the JVM option -Xmx2G:
java -jar -Xmx2G SCII.jar NumClasses DataFiles RulesFile MinInts MinSize MaxSize MinSups TransFile MinConf ChooseMethod

 

4.2 Parameter description

 

The parameters are as follows:

NumClasses: the number of classes, i.e. the number of input data files
DataFiles: the full names of the data files, separated by a comma (",")
RulesFile: the file in which the sorted output rules are stored
MinInts: the minimum interestingness thresholds for the different classes, separated by a comma (",")
MinSize: limits the output to interesting itemsets with a size greater than or equal to MinSize
MaxSize: limits the output to interesting itemsets with a size smaller than or equal to MaxSize
MinSups: the minimum support thresholds for the different classes, separated by a comma (",")
TransFile: the file in which the transactions derived from the sequences are stored
MinConf: the minimum confidence threshold
ChooseMethod: the classifier to use; available values are "M" (SCII_MATCH), "C" (SCII_CBA) and "MC" (both classifiers)
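For example, with NumClasses set to 2, a MinInts value of 0.05,0.02 and a MinSups value of 0.1,0.05 would apply an interestingness threshold of 0.05 and a support threshold of 0.1 to the first class, and 0.02 and 0.05 to the second class, assuming the thresholds are matched to the data files in the order they are listed (the numeric values here are purely illustrative).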

4.3 Example run command

As an example, you can run the .jar file for two classes of data, acq and earn, as follows (-Xmx2G is optional):

java -jar -Xmx2G SCII.jar 2 acq.txt,earn.txt classifier 0.05,0.05 1 3 0.1,0.1 T.txt 0.6 MC

 

Note that each file listed in the command (the .jar file and the other files) has to either be in the current working directory or be specified with its full path. For example, on the Windows operating system,

java -jar G:\SCII.jar 2 G:\acq.txt,G:\earn.txt G:\classifier 0.05,0.05 1 3 0.1,0.1 G:\T.txt 0.6 MC
Both data files need to be fully specified. If the path for either one (or both) is omitted, the algorithm will search for the file in the current working directory of the terminal/console. For example, G:\acq.txt,earn.txt will look for acq.txt on the G:\ drive, while it will look for earn.txt in the local folder.
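On Linux or Mac OS, a fully specified invocation looks analogous (the paths below are only placeholders):

java -jar /home/user/SCII.jar 2 /home/user/acq.txt,/home/user/earn.txt /home/user/classifier 0.05,0.05 1 3 0.1,0.1 /home/user/T.txt 0.6 MC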

 

The output consists of the console output, the files storing the output rules for each fold of the 10-fold cross-validation, and the file storing the transactions derived from the sequences. An actual example of the output can be found in output.zip.

4.4 Running the source code on Eclipse

After building the project in Eclipse by importing the source code, you can run it by adding the parameters (e.g. "2 G:\acq.txt,G:\earn.txt G:\classifier 0.05,0.05 1 3 0.1,0.1 G:\T.txt 0.6 MC" on the Windows operating system) to the program arguments in "Run Configurations". Here, you can also add "-Xmx2G" to the VM arguments if you want to allocate extra memory. You will get the same output as described in section 4.3.

 

5. EXPERIMENTS REPLICATION

 

5.1 Downloading all the needed files

To replicate the experiments from the paper, you will need to download AllFiles.zip, which contains all the necessary files. There are three runnable .jar files (SCII.jar, SCII_loopReuters1.jar and SCII_loopProtein.jar), one batch file (runner.bat) for the Windows operating system, one shell script (runner.sh) for Unix-based operating systems (e.g. Linux and Mac) and six data files needed to run the experiments. You also need to add the two data files obtained for the Protein dataset as described in section 3.1.

5.2 Running the experiments on the Windows operating system

If you are using the Windows operating system, you can double-click the "runner.bat" file to run all the experiments. Note that, by default, all files in AllFiles.zip are assumed to be in C:\AllFiles. To change this and other options, please edit the "runner.bat" file. You may also need to adapt the batch file to increase the memory allocation or to change which experiments are run.
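For instance, if the files were unpacked to D:\MyFiles instead of C:\AllFiles, an SCII.jar invocation inside runner.bat would have to be changed to something along these lines (the path and parameter values shown are only placeholders; the actual values in the batch file may differ):

java -jar -Xmx2G D:\MyFiles\SCII.jar 2 D:\MyFiles\acq.txt,D:\MyFiles\earn.txt D:\MyFiles\classifier 0.05,0.05 1 3 0.1,0.1 D:\MyFiles\T.txt 0.6 MC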

5.3 Running the experiments on Linux, Unix or Mac OS

If you are using Linux, Unix or Mac OS, you can use the command

sh runner.sh
in the terminal/console to run the experiments. As noted in section 5.2, it might be necessary to tweak the file according to your preferences.

 

5.4 Additional remarks

1. Implementations of the CBA and CMAR classifiers can be found in the LUCS-KDD Software Library. These algorithms can be run on the corresponding output files.

2. Implementations of the BayesFM and CBS classifiers can be found in BayesFM_CBS.zip. Both implementations use the SPADE algorithm (implemented by our colleague Antonio Gomariz Peñalver) to mine sequential patterns. You can run both of them in the same way as described above. The zip file also contains two batch files for the Windows operating system and two script files for Unix-based operating systems, which can be used to replicate the experiments.

Note that BayesFM and CBS do not use an interestingness threshold or a confidence threshold, so the values of these parameters are irrelevant, as can be seen from the values used in the two batch/shell script files.

3. We have fixed some bugs in the implementation of our methods and uploaded more efficient versions of the software (CBS, BayesFM and SCII). As a result, there are small differences between the results reported in the paper and the results obtained now. If any new problems arise, please do not hesitate to contact us.

4. We are currently working on an extension of the paper; the new version of the implementations can be found in SCIP.zip.

Thanks.