Definition: K-Means Clustering is an unsupervised machine learning algorithm that divides the observations into “K” number of clusters by grouping the data points together. K is the number of centroids that we need in a dataset. Every cluster has a centroid representing the center of a cluster. The algorithm allocates each data point to the nearest cluster and keeps the cluster as small as possible.
Idea: The core idea is to minimize the sum of distances between the points and their respective cluster centroid.
Example: Classifying the unlabeled data of income of people as high, average, or low.
Context: The Iris ower data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems. It is sometimes called Anderson’s Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris owers of three related species. The data set consists of 50 samples from each of the three species of Iris (Iris Setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. The dataset contains a set of 150 records under 5 attributes – Petal Length, Petal Width, Sepal Length, Sepal width, and Class(Species).
Software Used: Python 3
Goal: In this notebook, we will use K-means Clustering for cluster analysis. We will use the Iris dataset to predict the optimum number of clusters and classify the ower as Versicolor, setosa, or virginica.
Application: The given analysis can be used to identify the pattern and for prediction using new data.
Author:
Deepika Saini
BSc Hons. Statistics
Shaheed Rajguru College of Applied Sciences for Women
University of Delhi
K – mean clustering is excellent and widely used .