Big Data Analytics: A Tutorial of Some Clustering Techniques
Abstract
Data clustering, or unsupervised classification, is one of the main research areas in data mining. Partitioning clustering divides n objects into k clusters. Many clustering algorithms use hard (crisp) partitioning, in which each object is assigned to exactly one cluster. The most widely used hard partitioning algorithm is K-means, together with its variations and extensions such as K-Medoid. Other algorithms use overlapping techniques, in which an object may belong to one or more clusters. Overlapping partitioning algorithms include the commonly used Fuzzy K-means and its variations. More recent algorithms reviewed in this paper are Overlapping K-Means (OKM), Weighted OKM (WOKM), the Overlapping Partitioning Cluster (OPC), and the Multi-Cluster Overlapping K-means Extension (MCOKE). This tutorial focuses on the above-mentioned partitioning algorithms. We hope this paper will benefit students, educational institutions, and anyone else curious to learn and understand the k-means clustering algorithm.
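To make the distinction between hard and overlapping partitioning concrete, the following is a minimal sketch in plain Python. It is not taken from the paper. The `kmeans` function follows the standard assign-then-update iteration with randomly chosen initial centroids (an assumption; other seeding strategies exist). The `overlapping_memberships` function is a toy distance-threshold rule invented here purely for illustration; it is not OKM, WOKM, OPC, or MCOKE, and the `radius` parameter is hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid_of(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def kmeans(points, k, max_iter=100, seed=0):
    """Hard partitioning: every point receives exactly one cluster label."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: random initial centroids
    labels = None
    for _ in range(max_iter):
        # Assignment step: label each point with its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # no label changed, so the loop has converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = centroid_of(members)
    return labels, centroids


def overlapping_memberships(point, centroids, radius):
    """Toy overlap rule (illustrative only, not an algorithm from the paper):
    a point joins its nearest cluster plus every cluster whose centroid
    lies within `radius` of the point."""
    nearest = min(
        range(len(centroids)),
        key=lambda j: squared_distance(point, centroids[j]),
    )
    within = {
        j for j, c in enumerate(centroids)
        if squared_distance(point, c) <= radius ** 2
    }
    return {nearest} | within


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, k=2)
print(labels)  # one label per point (hard partition)
print(overlapping_memberships((5.0, 5.0), centroids, radius=8.0))
```

In the hard partition each point carries a single label, whereas the toy overlap rule returns a set of cluster indices, so a point lying between two centroids can belong to both clusters.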
This work is licensed under a Creative Commons Attribution 4.0 International License.