Application of Deep Clustering Algorithms

Tutorial at CIKM, Birmingham 2023

Abstract

Deep clustering algorithms have gained popularity for clustering complex, large-scale data sets, but getting started is difficult because of the numerous decisions regarding architecture, optimizer, and other hyperparameters. Theoretical foundations must be known to obtain meaningful results. At the same time, ease of use is necessary for adoption by a broader audience. We therefore require a unified framework that allows for easy execution in diverse settings. While such frameworks exist for established clustering methods like k-Means and DBSCAN, deep clustering algorithms lack a standard structure, resulting in significant programming overhead. This complicates empirical evaluations, which are essential in both scientific and practical applications. We address this problem by providing a theoretical background on deep clustering as well as practical implementation techniques, and a unified structure with predefined neural networks. For the latter, we use the Python package ClustPy. The aim is to share best practices and facilitate community participation in deep clustering research.
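The programming overhead mentioned above becomes apparent as soon as one writes such a pipeline by hand: even a minimal version requires choosing an architecture, an optimizer, loss weights, and a training schedule. The sketch below, in plain PyTorch, illustrates the joint objective that most deep clustering methods share, namely a reconstruction loss combined with a clustering loss on the embedding. The layer sizes, the cluster_weight, and the simplified center-based clustering loss are illustrative assumptions, not settings recommended in the tutorial.

```python
import torch
from torch import nn

class Autoencoder(nn.Module):
    """A small feedforward autoencoder; layer sizes are illustrative."""
    def __init__(self, dim_in: int, dim_embed: int = 10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim_in, 256), nn.ReLU(),
                                     nn.Linear(256, dim_embed))
        self.decoder = nn.Sequential(nn.Linear(dim_embed, 256), nn.ReLU(),
                                     nn.Linear(256, dim_in))

    def forward(self, x):
        z = self.encoder(x)
        return z, self.decoder(z)

def joint_step(ae, centers, x, optimizer, cluster_weight=0.1):
    """One joint update: reconstruction loss plus distance to the closest center.

    This mirrors the center-based objectives of DCN/DKM only in spirit; the
    exact assignment and center-update rules of those papers differ in detail.
    """
    z, x_hat = ae(x)
    rec_loss = nn.functional.mse_loss(x_hat, x)
    dists = torch.cdist(z, centers)                       # (batch, k) pairwise distances
    cluster_loss = dists.min(dim=1).values.pow(2).mean()  # squared distance to the assigned center
    loss = rec_loss + cluster_weight * cluster_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Methods such as DCN and DKM differ mainly in how assignments and centers are updated, while DEC and IDEC replace the center-distance term with a soft-assignment objective (sketched after the outline below).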

Outline

This tutorial serves as an introduction to the topic of deep clustering, combining theoretical concepts with their application using our open-source Python package ClustPy. The following topics will be covered:

  • Introduction to Clustering (20 Min.)
  • Introduction to Deep Clustering (30 Min.)
  • Application of Deep Clustering Algorithms (90 Min. - excl. the break)
    • Deep Clustering Network (DCN)
    • Deep k-Means (DKM)
    • Coffee Break (30 Min.)
    • Deep Embedded Clustering (DEC) - see the loss sketch after this outline
    • Improved Deep Embedded Clustering (IDEC)
    • Domain Knowledge and Augmentation Invariances
  • Recent Approaches (20 Min.)
  • Outlook (20 Min.)
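As a preview of the DEC and IDEC parts listed above, the sketch below computes DEC's Student's-t soft assignments and the KL divergence to the sharpened target distribution. The function name and the per-call refresh of the target are illustrative simplifications; in the original method, the target distribution is only recomputed periodically, and IDEC additionally keeps the autoencoder's reconstruction loss in the objective.

```python
import torch

def dec_clustering_loss(z, centers, alpha: float = 1.0):
    """KL(P || Q) between soft assignments Q and the sharpened target P (DEC/IDEC-style)."""
    dists = torch.cdist(z, centers).pow(2)               # squared embedding-space distances (n, k)
    q = (1.0 + dists / alpha).pow(-(alpha + 1.0) / 2.0)  # Student's t kernel
    q = q / q.sum(dim=1, keepdim=True)                   # soft assignment q_ij
    p = q.pow(2) / q.sum(dim=0)                           # emphasize confident assignments ...
    p = (p / p.sum(dim=1, keepdim=True)).detach()         # ... normalize; target is held fixed
    return (p * (p.log() - q.log())).sum(dim=1).mean()    # KL divergence, averaged over the batch
```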

Downloads

The materials used in the tutorial can be downloaded here:

ClustPy

The open-source package ClustPy provides a simple way to perform deep clustering in Python. It includes multiple autoencoder architectures, deep clustering algorithms, and evaluation methods. In addition, methods for loading commonly used datasets are provided.
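A minimal usage sketch of this sklearn-style interface follows. The class name DEC and the fit/labels_ pattern match ClustPy's documented conventions, but constructor arguments and defaults can differ between versions, so treat the call as an assumption and consult the package documentation; scikit-learn's digits dataset is used only as a small stand-in for the loaders that ClustPy itself provides.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import normalized_mutual_info_score
from clustpy.deep import DEC  # DCN, DKM, IDEC, ... follow the same pattern

# Small stand-in dataset; ClustPy also ships loaders for common benchmarks.
X, y = load_digits(return_X_y=True)

# n_clusters is the only argument passed here; a predefined feedforward
# autoencoder is assumed to be created internally if none is supplied.
dec = DEC(n_clusters=10)
dec.fit(X)  # pretrains the autoencoder, then optimizes the clustering objective

print("NMI:", normalized_mutual_info_score(y, dec.labels_))
```

The other algorithms covered in the tutorial follow the same fit/labels_ pattern in ClustPy, which is exactly the unified structure motivated in the abstract.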

Tutors

The following tutors will present the tutorial at CIKM:

  • Collin Leiber
    Collin Leiber is a PhD student at Ludwig-Maximilians-Universität München, Germany, and a member of the Munich Center for Machine Learning (MCML). He received his MSc degree in Computer Science with a specialization in data analytics from Ludwig-Maximilians-Universität München in 2019. His current research interests are mainly focused on data mining. In particular, he is working on the question of how to determine an appropriate number of clusters in complex environments, such as deep clustering and alternative clustering. For this purpose, he researches statistics-based methods, density-based methods, and those based on information theory. Furthermore, he is the main developer of the open-source ClustPy package.
  • Lukas Miklautz
    Lukas Miklautz is a PhD student in the Data Mining and Machine Learning research group at the Faculty of Computer Science, University of Vienna, Austria, and a member of the UniVie Doctoral School Computer Science. His research focuses on combining representation learning and clustering (Deep Clustering). Deep clustering methods use unsupervised or self-supervised learning algorithms with clustering objectives to improve clustering performance. Deep clustering is applicable to many clustering paradigms, such as non-redundant, consensus, or subspace clustering. For this purpose, he researches both clustering and deep learning methods. Furthermore, he is a contributor to the open-source ClustPy package.
  • Claudia Plant
    Claudia Plant is a full professor and leader of the research group Data Mining and Machine Learning at the Faculty of Computer Science, University of Vienna, Austria. Her research focuses on new methods for exploratory data mining, mostly on clustering and representation learning. Many approaches relate unsupervised learning to data compression, i.e., the better the discovered patterns compress the data, the more information has been learned. Other methods rely on finding statistically independent patterns or multiple non-redundant solutions, on ensemble learning, or on nature-inspired concepts such as synchronization. Together with her group, she has contributed numerous clustering methods based on deep learning, spectral methods, and matrix factorization. She also develops application-oriented approaches to data mining in the context of interdisciplinary projects with experts from neuroscience, bio-medicine, particle physics, social sciences, and archaeology.
  • Christian Böhm
    Christian Böhm is a professor of Computer Science at the University of Vienna, Austria. He received his PhD degree in 1998 from Ludwig-Maximilians-Universität München, Germany. His research interests cover clustering and representation learning, with a particular focus on high-performance aspects such as index structures, parallelization, vectorization, and cache-efficient memory access. He has more than 150 publications at venues including KDD, ICDM, and SIGMOD.