Clustering Algorithm Fundamentals and an Implementation in Python

    The unsupervised process of creating groups of data containing similar elements

    Photo by ian dooley on Unsplash

    Clustering is a method that can help machine learning engineers understand unlabeled data by creating meaningful groups or clusters. This often reveals patterns in data, which can be a useful first step in machine learning. Since the data you are working with is unlabeled, clustering is an unsupervised machine learning task.

    Data is categorized into groups based on similarity, quantified by a metric known as the similarity measure, which tells us how alike the objects in the dataset are. The similarity measure is computed from the feature data of the objects in the dataset. Each resulting cluster is assigned a cluster ID, which turns out to be a powerful byproduct of clustering: it allows large datasets to be simplified by condensing the entire feature set of an object into a single cluster ID.
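To make the idea of a similarity measure concrete, here is a minimal sketch (not from the original article) using Euclidean distance over feature vectors; the household feature vectors below are hypothetical:

```python
import math

def euclidean_distance(a, b):
    # A common similarity measure: smaller distance means more similar objects.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical feature vectors: [household_size, income_in_thousands]
u1 = [2, 40]
u2 = [2, 45]
u3 = [6, 120]

# u1 and u2 are much closer to each other than either is to u3,
# so a clustering algorithm would tend to group them together.
print(euclidean_distance(u1, u2))
print(euclidean_distance(u1, u3))
```

In practice, the choice of similarity measure (Euclidean, cosine, Manhattan, and so on) is itself a modeling decision that strongly affects the clusters you get.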

    A simple real-life example of this principle is collecting data about household size and household income to create clusters of users such as small family high spenders, small family low spenders, large family high spenders, and large family low spenders.

    Today, clustering is used for a wide variety of use cases in the industry. Some of these include the grouping of search results, analysis of social networks, and market segmentation. Clustering is also used in image segmentation, anomaly detection, and in medical imaging.

    Expanding on the advantage of cluster IDs mentioned above, clustering can be used to group objects by different features. For example, stars can be grouped by their brightness or music by their genres.

    In organizations like Google, clustering is used for:

    • Generalization: when objects in a cluster have missing feature data, the missing values can be inferred from other objects in the same cluster.
    • Data compression: feature data can be replaced entirely by the cluster ID. This saves storage and simplifies the feature space, which can make ML model training simpler and faster.
    • Privacy preservation: associating data with cluster IDs rather than with specific users makes it harder to link data back to individuals, helping preserve user privacy.

    The result of cluster analysis. Source: hellisp, Public domain, via Wikimedia Commons

    Now that we understand the concepts of clustering, let us look at some common clustering algorithms. For an exhaustive list, you can refer to this paper.

    Hierarchical Clustering

    This approach is suited for hierarchical data and creates a tree of clusters. The standard algorithm is too slow for most datasets: it has a time complexity of O(n³) and a memory requirement of Ω(n²). The runtime can be reduced at the cost of higher memory usage, but that overhead makes the approach impractical for most large datasets.

    Hierarchical clustering and interactive dendrogram visualization in Orange data mining suite. Blaž Zupan (Orange Data Mining), CC BY-SA 4.0, via Wikimedia Commons
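A small sketch of hierarchical (agglomerative) clustering using scikit-learn, with two well-separated hypothetical blobs of points; the data below is made up for illustration:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two tight, well-separated groups of 2D points (hypothetical data)
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])

# Ward linkage merges clusters bottom-up, building the cluster tree
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X)
print(labels)  # the first three and last three points get different labels
```

Cutting the tree at a different level (a different `n_clusters`) yields a coarser or finer partition without re-running the merge process conceptually, which is the main appeal of the hierarchical view.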

    Distribution-based Clustering

    In these algorithms, data is assumed to be generated from a mixture of underlying distributions, and clusters are defined as groups of objects that most likely belong to the same distribution. One disadvantage is that distribution-based clustering is prone to overfitting, so constraints must be placed on model complexity.

    An example is shown in the image below, where data is clustered into three Gaussian distributions. The darker colors are closer to the centers of the distributions, and the bands show how the probability of membership falls off: as the distance to a center increases, the likelihood that a point belongs to that cluster decreases.

    If you do not know what type of distribution your data follows, this class of algorithms might not be the best choice.

    Gaussian distribution-based clustering. By Chire — Own work, CC BY-SA 3.0, via Wikimedia Commons
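A sketch of distribution-based clustering with scikit-learn's `GaussianMixture`, fit on two hypothetical Gaussian blobs; note that unlike hard-assignment methods, it also gives soft membership probabilities:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two hypothetical Gaussian blobs, centered at 0 and 6
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(6, 1, (50, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)          # hard assignment per point
probs = gmm.predict_proba(X)     # soft membership probabilities per point
print(labels[:3], probs[0])
```

The `predict_proba` output is the direct analogue of the probability bands in the figure above: points near a distribution's center have membership probability close to 1, which decays with distance.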

    Density-based Clustering

    These algorithms create clusters by connecting areas that contain high densities of objects. Dense areas must be connectable, and by design, outliers are not assigned to any cluster. A disadvantage is that density-based algorithms struggle with high-dimensional data and with data whose density varies across the dataset.

    Cluster analysis with DBSCAN algorithm on a density-based data set. Chire, CC BY-SA 3.0, via Wikimedia Commons
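A sketch of density-based clustering with scikit-learn's DBSCAN on hypothetical data, showing its characteristic behavior of labeling isolated points as noise:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated point (hypothetical data)
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.0, 0.2],
              [5.0, 5.0], [5.1, 4.9], [5.0, 5.2],
              [20.0, 20.0]])

# eps is the neighborhood radius; min_samples the density threshold
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)  # the two dense groups form clusters; the isolated point gets label -1 (noise)
```

The noise label `-1` is what makes DBSCAN attractive for anomaly detection, one of the use cases mentioned earlier.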

    Centroid-based Clustering

    This form of clustering groups data into non-hierarchical partitions. While these types of algorithms are efficient, they are sensitive to initial conditions and to outliers. The most commonly used centroid-based algorithm is known as k-means, where k is a hyperparameter defining the number of clusters.

    K-means offers advantages such as scalability to large datasets, ease of implementation, and adaptability to new data. On the other hand, the value of k must be chosen manually, which takes some effort, and the centroids can be dragged by outliers. It is therefore worth considering removing outliers before clustering.

    Given a set of n data points, the objective of the k-means algorithm is to partition them into k groups of similar points. To do this, we first choose a number k and initialize k cluster centers (centroids), for example by picking k points at random. We then repeat two steps: assign each data point to its nearest centroid, and recompute each centroid as the mean of the points assigned to it. This loop continues until the assignments, and therefore the centroids, no longer change. At that point, we know that our clusters are stable.
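The k-means loop described above can be sketched in plain Python (a minimal illustration, not a production implementation; real libraries add smarter initialization such as k-means++):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    random.seed(seed)
    # Initialize: pick k distinct data points as the starting centroids
    centroids = random.sample(points, k)
    clusters = []
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k),
                      key=lambda i: sum((a - b) ** 2
                                        for a, b in zip(p, centroids[i])))
            clusters[idx].append(p)
        # Update step: each centroid becomes the mean of its cluster
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(c) / len(cluster)
                                           for c in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        # Stop once the centroids (and hence the assignments) no longer change
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```

On this toy data the loop converges quickly to one centroid per blob; with unlucky initialization on harder data, k-means can converge to a poor local optimum, which is why libraries run multiple restarts.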

    Now, let us implement one of the algorithms discussed above and visualize the resulting clusters. For this, we will use the k-means algorithm and scikit-learn. The code was inspired by, and contains code from, the demo of K-Means clustering on the handwritten digits data provided in the scikit-learn examples.

    Implementing k-means. Contains code from scikit-learn examples (BSD License)
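The embedded code listing did not survive republication; the following is a condensed sketch in the spirit of scikit-learn's handwritten-digits k-means demo (the plotting details here are my own, not the original listing):

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
# Reduce the 64-dimensional digit images to 2D so the clusters can be plotted
reduced = PCA(n_components=2).fit_transform(digits.data)

# One cluster per digit class (0-9)
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
labels = kmeans.fit_predict(reduced)

plt.scatter(reduced[:, 0], reduced[:, 1], c=labels, s=10, cmap="tab10")
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            marker="x", color="black", label="centroids")
plt.legend()
plt.title("K-means clusters of the digits dataset (PCA-reduced)")
plt.savefig("kmeans_digits.png")
```

Note that the cluster IDs produced here are arbitrary; they group visually similar digits but do not automatically correspond to the digit values 0 through 9.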

    The resulting plot produced by the code is shown in the image below.

    In addition to the k-means algorithm, the scikit-learn library offers several other algorithms that can be used depending on the data you have. Some of these include:

    • Affinity Propagation
    • Agglomerative Clustering
    • BIRCH
    • DBSCAN
    • K-Means
    • Mini-Batch K-Means
    • OPTICS
    • Spectral Clustering
    • Mixture of Gaussians

    Keep in mind that there is no fixed algorithm that will offer the best results. You will have to run controlled experiments to identify the most suitable algorithm for the dataset you are working with.

    If you want to dive deeper into the algorithms provided, the scikit-learn clustering API is a good place to start.

    In this article, we looked at clustering, its uses, and some commonly used types of clustering algorithms. We also had a look at their advantages and disadvantages, and where some algorithms shine compared to others. We finished by looking at a coded example of how k-means clustering can be done. I hope you found this information useful. Feel free to let me know your thoughts and questions in the comment section below.

