Efficient Cluster Detection by Ordered Neighborhoods

Abstract

Detecting cluster structures seems to be a simple task, i.e. separating similar from dissimilar objects. However, given today's complex data, (dis-)similarity measures and traditional clustering algorithms are not reliable in separating clusters from each other. For example, when too many dimensions are considered simultaneously, objects become unique and (dis-)similarity does not provide meaningful information to detect clusters anymore. While the (dis-)similarity measures might be meaningful for individual dimensions, algorithms fail to combine this information for cluster detection. In particular, it is the severe issue of a combinatorial search space that results in inefficient algorithms.

In this paper we propose a cluster detection method based on the ordered neighborhoods. We study the properties of such local neighborhood information for the separability of clusters. By considering ordered neighborhoods in each dimension individually, we derive a property that allows us to detect clustered objects that show high similarity in linear time. Our algorithm exploits the ordered neighborhoods in order to find both the similar objects and the dimensions in which these objects show high similarity. Evaluation results show that our method is scalable with both database size and dimensionality and enhances cluster detection w.r.t. state-of-the-art clustering techniques.

Paper Supplement & Presentation

This work is presented at DEXA DaWaK 2015 and published by Springer. You can find an early draft of the paper and its supplementary here.
Slides that are presented at the conference are
here (The presentation is in Open Document Format, you can use LibreOffice to open.)

Application

Please find the pre-compiled and packaged implementation here and source code here.



CLON is packaged as a runnable .jar file (clon.jar). You can run the application
on command line with the commands,

java -jar clon.jar data-file k minlen

A neighborhood visualizer is also provided in the package which can be run with
the following command:

java -jar gui.jar

Neighborhood visualizer is developed further and published as a separate article. It can be found here.

GUI of VINeM

Input

CLON accepts parameters as command line arguments in the specified order.

  • data-file: Path to the multi dimensional datafile. Please find the
    properties of the data file below.
  • k: Parameter for the k nearest neighbors.
  • minsup: Minimum length for the cluster.

Neighborhood visualizer does not need any runtime parameters. The data-file
to visualize can be selected in the GUI. Properties of the data file are given below.

About the data file:

  • it includes space separated real values,
  • each row represents an instance,
  • each column represents a feature/dimension,
  • does not include a(ny) header row(s),
  • all instances have the same number of features,
  • there are no missing values,
  • the real values formatted in the USA locale (use . as decimal separator)

Output

CLON outputs the found clusters to the standard output. Each line of output
represents a subspace cluster. Output format:

Size of the cluster - Dimensions of the cluster - Objects of the cluster

For example, 10 - 1 2 6 - 0 1 2 3 4 5 6 7 8 9 means a cluster is
detected at 1st, 2nd and 6th subspaces and it has '10' objects, i.e.,
0 1 2 3 4 5 6 7 8 9

Example

java -jar clon.jar example.mime 170 90

Datasets

All the datasets we use in the paper are here.

References:

AttachmentSize
Efficient Cluster Detection by Ordered Neighborhoods and Supplementary Materials344.97 KB
Presentation at DaWaK 20151.81 MB