Carti Rank

Finding Subspace Clusters using Ranked Neighborhoods

This web page is for the article "Finding Subspace Clusters using Ranked Neighborhoods" by Emin Aksehirli, Siegfried Nijssen, Matthijs van Leeuwen, and Bart Goethals which is published in Proceedings of Data Mining Workshop (ICDMW), 2015 IEEE International Conference on. The article can be found here.

Abstract

Clustering high dimensional datasets is challenging due to the curse of dimensionality. One approach to address this challenge is to search for subspace clusters, i.e., clusters present in subsets of attributes. Recently the cartification algorithm was proposed to find such subspace clusters. The distinguishing feature of this algorithm is that it operates on a neighborhood database, in which for every object only the identities of the k closest objects are stored. Cartification was shown to produce better results than other state-of-the-art subspace clustering algorithms; however, which clusters it detects was also found to depend heavily on the setting of the parameters. In other words, it is not robust to input parameters.

In this paper, we propose a new approach called ranked cartification that produces more robust results than ordinary cartification. We develop a transformation that creates ranked matrices instead of neighborhood databases; we identify clusters in these ranked matrices.
We demonstrate that this method is more robust than cartification in terms of cluster detection.

Application

Please find the pre-compiled and packaged implementation here and its source code here.

Carti-Rank is packaged as a runnable .jar file (carti-rank.jar). You can run the application
on command line with the commands,

java -jar carti-rank.jar data-file theta minlen

Input

Carti-Rank accepts parameters as command line arguments in the specified order.

  • data-file: Path to the multi dimensional datafile. Please find the
    properties of the data file below.
  • theta: Parameter for the tile mining.
  • minsup: Minimum length for the cluster.

About the data file:

  • it includes space separated real values,
  • each row represents an instance,
  • each column represents a feature/dimension,
  • does not include a(ny) header row(s),
  • all instances have the same number of features,
  • there are no missing values,
  • the real values formatted in the USA locale (use . as decimal separator)

Output

Carti-Rank outputs the found clusters to the standard output. Each line of output
represents a subspace cluster. Output format:

Size of the cluster - Dimensions of the cluster - Objects of the cluster

For example, 10 - 1 2 6 - 0 1 2 3 4 5 6 7 8 9 means a cluster is
detected at 1st, 2nd and 6th subspaces and it has '10' objects, i.e.,
0 1 2 3 4 5 6 7 8 9

Example

java -jar cart-rank.jar ns25.mime 25 20

Datasets

All the datasets we use in the paper are here.

AttachmentSize
Implementation2.11 MB
Datasets746.07 KB