SparseKmeansFeatureRanking.jl

Sparse K-means via Feature Ranking

This software package contains an efficient multi-threaded implementation of sparse K-means via feature ranking (SKFR), proposed by Zhang, Lange, and Xu (2020). The code is based on the authors' original GitHub repository; the authors have kindly agreed to let this derivative of their code be redistributed in this repository under the MIT License.

Installation

This package requires Julia v1.6 or later, which can be obtained from https://julialang.org/downloads/ or by building Julia from the sources in the https://github.com/JuliaLang/julia repository.

The package can be installed by running the following code:

using Pkg
pkg"add https://github.com/kose-y/SparseKmeansFeatureRanking.jl"

To run the examples below, the following packages are also required:

pkg"add Random Clustering SnpArrays StatsBase"

The examples in this document were run under the following environment:
versioninfo()
Julia Version 1.7.1
Commit ac5cc99908 (2021-12-22 19:35 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.5.0)
  CPU: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
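
This package is multi-threaded; to take advantage of that, launch Julia with multiple threads and verify the thread count:

# Start Julia with multiple threads, e.g. `julia --threads 4`, or set the
# JULIA_NUM_THREADS environment variable before launching. Then check:
Threads.nthreads()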

Basic Usage

First, let us initialize the random number generator for reproducibility. We use MersenneTwister so that we obtain the same result regardless of the number of threads.

Note: since Julia 1.7, the stream of the default random number generator depends on the number of tasks launched, so seeding it alone does not guarantee reproducibility across thread counts.

using Random, SparseKmeansFeatureRanking
rng = MersenneTwister(7542)
MersenneTwister(7542)

Then, we generate random data with 300 samples and 100 features. To impose a cluster structure, we add 1.0 to the first 33 features of samples 101:200 and 2.0 to the same features of samples 201:300.

(features, cases) = (100, 300);
(classes, sparsity)  = (3, 33);
X = randn(features, cases);
(m, n) = (div(features, 3), 2 * div(features, 3));
(r, s) = (div(cases, 3) + 1, 2 * div(cases, 3));
X[1:m, r:s] .+= 1.0;
X[1:m, (s + 1):end] .+= 2.0;
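
To double-check the planted structure, we can compare the block means of the informative features; they should be roughly 0.0, 1.0, and 2.0 (Statistics is a standard library):

using Statistics
# Mean of the first m (= 33) features within each planted block of samples.
[mean(X[1:m, cols]) for cols in (1:r-1, r:s, s+1:cases)]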

ImputedMatrix is the basic data structure for the SKFR algorithm with k-POD imputation. It can be created with the function get_imputed_matrix(). The second argument (here 3) is the number of clusters, and the optional keyword argument rng specifies the random number generator to use.

IM = SparseKmeansFeatureRanking.get_imputed_matrix(collect(transpose(X)), 3; rng=rng);

The function sparsekmeans1() selects the sparsity (here, 33) most informative features globally.

(classout1, center1, selectedvec1, WSSval1, TSSval1) = SparseKmeansFeatureRanking.sparsekmeans1(IM, sparsity);

classout1 contains the cluster labels, center1 the cluster centers, and selectedvec1 the indices of the selected informative features. WSSval1 is the within-cluster sum of squares, and TSSval1 the total sum of squares.
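
As a quick sanity check, we can inspect the outputs; the sizes noted in the comments are what this example should produce:

length(classout1)    # 300 cluster labels, one per sample
size(center1)        # the cluster centers
length(selectedvec1) # the 33 selected feature indices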

using Clustering
randindex(classout1, [repeat([1], 100); repeat([2], 100); repeat([3], 100)])[1]

Checking the adjusted Rand index gives the value 1.0, indicating perfect clustering. As expected, the selected features are exactly the first 33:

all(sort(selectedvec1) .== 1:33)

The function sparsekmeans2() selects the sparsity most informative features for each cluster.

IM = SparseKmeansFeatureRanking.get_imputed_matrix(collect(transpose(X)), 3; rng=rng)
(classout2, center2, selectedvec2, WSSval2, TSSval2) = SparseKmeansFeatureRanking.sparsekmeans2(IM, sparsity);
randindex(classout2, [repeat([1], 100); repeat([2], 100); repeat([3], 100)])[1]

The selected features do not necessarily match 1:33 for every cluster, as seen below.

selectedvec2
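
For instance, one can count how many of the truly informative features 1:33 appear among each cluster's selection. The snippet below assumes selectedvec2 is an integer array with one column per cluster; adjust the indexing if the return layout differs:

# Overlap of each cluster's selected features with the planted features 1:33
# (assumes a matrix layout for selectedvec2).
[length(intersect(selectedvec2[:, k], 1:33)) for k in 1:classes]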

Matrices with missing entries

We can apply the SKFR algorithm to a dataset with missing values, denoted by NaNs. Below, we set 10% of the entries of the data matrix to NaN.

using StatsBase
missingix = sample(1:features*cases, Int(features*cases*0.1), replace=false)
X_missing = deepcopy(X)
X_missing[CartesianIndices(X_missing)[missingix]] .= NaN;
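
A quick check confirms the intended missingness rate:

count(isnan, X_missing) / length(X_missing)  # should be 0.1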

Then, we run the SKFR functions just as above. In each iteration, each missing value is imputed with the corresponding coordinate of its current cluster center, following the k-POD imputation method.

IM = SparseKmeansFeatureRanking.get_imputed_matrix(collect(transpose(X_missing)), 3; rng=rng);
(classout1, center1, selectedvec1, WSSval1, TSSval1) = SparseKmeansFeatureRanking.sparsekmeans1(IM, sparsity);
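
To illustrate the idea (a minimal sketch, not the package's internal code; kpod_impute is a hypothetical helper): each NaN coordinate of a sample is replaced by the corresponding coordinate of its assigned cluster's center.

# Sketch of k-POD-style imputation for a single sample.
function kpod_impute(x::AbstractVector, center::AbstractVector)
    # Replace each missing coordinate with the cluster center's value.
    [isnan(xj) ? cj : xj for (xj, cj) in zip(x, center)]
end

kpod_impute([1.0, NaN, 3.0], [0.0, 2.5, 0.0])  # returns [1.0, 2.5, 3.0]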

If we check the adjusted Rand index, we see that the clustering result is slightly noisier, as one might expect.

randindex(classout1, [repeat([1], 100); repeat([2], 100); repeat([3], 100)])[1]

The set of selected features is still the same as before; its union with 1:33 still has exactly 33 elements.

length(union(selectedvec1, 1:33))

SnpArray Usage

The SKFR algorithm can also be applied to PLINK 1 BED-formatted genotype data through SnpArrays.jl. This can be considered a case of unsupervised selection of ancestry-informative markers (AIMs).

using SnpArrays
X = SnpArray(SnpArrays.datadir("EUR_subset.bed"))
nclusters = 3;
IM = SparseKmeansFeatureRanking.get_imputed_matrix(X, nclusters; rng=rng)
(classout, center, selectedvec, WSSval, TSSval) = SparseKmeansFeatureRanking.sparsekmeans1(IM, 1000);
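
As a quick follow-up, the sizes of the inferred clusters can be inspected with countmap from StatsBase:

using StatsBase
countmap(classout)  # number of samples assigned to each cluster label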