Ravinder Ahuja
Department of Computer Science, Jaypee Institute of Information Technology, Noida, India
Email: [email protected]

Chopra
Department of Computer Science, Jaypee Institute of Information Technology, Noida, India
Email: [email protected]

Department of Computer Science, Jaypee Institute of Information Technology, Noida, India
Email: [email protected]

Sharma
Department of Computer Science, Jaypee Institute of Information Technology, Noida, India
Email: [email protected]

Abstract- Machine learning is a scientific discipline concerned with the design and development of algorithms that allow computers to evolve behaviours based on empirical data, in the form of a database. Its major focus is to automatically learn to recognize complex patterns and make intelligent decisions based on data. In this paper we group the data into multiple clusters on the basis of their similarities and dissimilarities. Clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. It is a method of unsupervised learning and a common technique for statistical data analysis used in many fields. This paper evaluates the performance of staff members regarding their level of teaching by considering 15 factors. It computes the performance level by collecting feedback from every student and gives the appraisal result as a score out of 30 points for every staff member, which helps in assessing annual increments and other promotions. We divide the staff members into three groups: Group1, Group2 and Group3. Group1 has scores between 25 and 30, Group2 between 20 and 25, and Group3 between 15 and 20. These groups are divided on the basis of Points, which is the average of all 15 characteristics.

Index Terms- Clustering, Fuzzy Grouping, Unsupervised Algorithms, Similarities.

RELATED WORK

There have been many approaches, but in this paper we use five algorithms.
All five algorithms are part of unsupervised learning, and all of them work upon Points, which is the average of 15 characteristics: Regularity, Presentation, Syllabus Coverage, Discussion, Availability, Curriculum, Punctuality, Create Interest, Coverage, Critical Thinking, Testing Student, Evaluation, Time Utilization, Subject Knowledge and Subject Depth. The five clustering algorithms used are: a) K-means clustering, b) fuzzy C-means clustering, c) self-organizing map, d) agglomerative clustering and e) hierarchical K-means. Here, a resource allocation network has been trained with the sample data set. At the end of training, the resource allocation network has learned the complex data set for the respective function. The network has been tested with patterns from a test set and has been found to produce near-perfect classification results. The complexity of K-means clustering is O(n^2), and it works upon the Euclidean distance. The main findings are that fuzzy C-means clustering is better than the self-organizing map and hierarchical K-means clustering, and that hierarchical K-means clustering is better than agglomerative and K-means clustering. Using the fuzzy C-means algorithm, the training samples are clustered; inappropriate data are detected, moved to another data set and used differently in the classification phase.
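The Points-based grouping used throughout the paper can be sketched in code. A minimal Python sketch (the 25-30, 20-25 and 15-20 cut-offs come from the abstract; the sample scores are invented for illustration):

```python
def points(scores):
    """Average the 15 characteristic scores into a single Points value."""
    assert len(scores) == 15, "one score per characteristic is expected"
    return sum(scores) / len(scores)

def group(p):
    """Assign the paper's three groups by Points range."""
    if 25 <= p <= 30:
        return "Group1"
    if 20 <= p < 25:
        return "Group2"
    if 15 <= p < 20:
        return "Group3"
    return None  # outside the ranges considered in the paper

# Hypothetical staff member: 15 scores, one per characteristic.
scores = [28, 26, 27, 25, 24, 26, 28, 27, 25, 26, 27, 26, 25, 28, 27]
p = points(scores)
print(round(p, 2), group(p))
```

The function names and sample values are illustrative only; the paper computes the same average over student feedback collected for each staff member.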
The college computes the staff appraisal points by considering the following features.

A.1 Regularity in engaging classes: a) Very regular b) Regular c) Not regular
A.2 Presentation of lecture: a) Highly effective b) Effective c) Not effective
A.3 Coverage of syllabus: a) 95% and above b) 85-95% c) Less than 85%
A.4 Opportunity for questions and discussion in the class: a) Highly encouraging b) Encouraging c) Discouraging
A.5 Availability of teacher for consultation beyond class hours: a) Mostly b) Occasionally c) Never
A.6 Organization of course activities: a) Excellent b) Good c) Poor
A.7 Punctuality: a) Punctual b) Fairly punctual c) Not punctual
A.8 Teacher attempts to create interest in the subject: a) Always b) Occasionally c) Never
A.9 Pace of coverage of syllabus: a) Normal b) Fast/Slow c) Too fast/Too slow
A.10 Encourages critical thinking: a) Always b) Occasionally c) Never
A.11 Tests and other evaluations reflect the course content: a) Always b) Occasionally c) Never
A.12 Quality of evaluation: a) Good b) Fair c) Poor
A.13 Utilization of class time: a) Very effective b) Effective c) Not effective
A.14 Subject knowledge of the teacher: a) Excellent b) Good c) Not satisfactory
A.15 Depth of subject taught: a) More than adequate b) Adequate c) Inadequate

INTRODUCTION

Cluster analysis divides data into meaningful groups (clusters) which share common characteristics, i.e. objects in the same cluster are more similar to each other than to those in other clusters. It is the study of automatically finding classes. Web pages, especially the news articles that flood the internet, have to be grouped. Clustering these different groups is a step towards automation, which is required in many fields, including web search engines, web robots and data analysis. Any new web page goes through numerous phases, including data acquisition, pre-processing, feature extraction, classification and post-processing, before entering the database.
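Returning to the appraisal features listed above (A.1-A.15): before the responses can be averaged into a score, the three-level options must be encoded numerically. The paper does not state its encoding; a minimal Python sketch assuming a hypothetical 3/2/1 mapping:

```python
# Hypothetical encoding (not stated in the paper): best option -> 3,
# middle option -> 2, worst option -> 1.
ENCODING = {"a": 3, "b": 2, "c": 1}

def encode_feedback(answers):
    """Convert one student's a/b/c answers into numeric scores."""
    return [ENCODING[a] for a in answers]

# One student's answers to the 15 questions A.1-A.15 (invented data).
answers = ["a", "a", "b", "a", "c", "b", "a", "a",
           "b", "a", "a", "b", "a", "a", "b"]
print(encode_feedback(answers))
```

Any monotone mapping would do; the clustering algorithms only need the responses on a comparable numeric scale.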
Cluster analysis can be regarded as a form of classification in that it creates a labelling of objects with class labels; however, it derives these labels only from the data. Data mining functionalities include characterization and discrimination, mining frequent patterns, association, correlation, classification and prediction, cluster analysis, outlier analysis and evolution analysis. Clustering is a subjective method: the solution is not unique and it strongly depends upon the analyst's choices. Clustering always provides groups or clusters, even if there is no predefined structure. While applying cluster analysis we assume that the groups exist, but this assumption may be false. The outcome of clustering should never be generalized.

R TOOL

R is a free software environment for statistical computing and graphics. It provides a huge class of statistical techniques, e.g. classical statistical tests, linear and nonlinear modelling, classification, time-series analysis and clustering, along with various graphical functions. It has become an important tool for fields such as data science; domains ranging from biology to marketing make use of this tool to make decisions.

CLUSTERING ALGORITHMS

a) K-means clustering: K-means is a simple unsupervised learning method belonging to the hard clustering techniques. The K-means algorithm divides the data into two or more clusters. In K-means, the Euclidean distance is used to calculate the distance between each data item and the cluster centres. Each cluster has its own centre, and data items are allotted to the cluster whose centre is at minimum distance. It is an iterative procedure that minimizes the sum of distances from each data item to its cluster centre, over all clusters. The main idea is to choose K, the number of clusters, and randomly assign a centroid to each cluster, every centroid being different from the others.
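The iterative procedure just described can be sketched directly. A minimal one-dimensional K-means in Python (clustering Points-like values, with K = 3 mirroring the paper's three groups; this is an illustrative sketch with invented data, not the authors' R code):

```python
def kmeans(data, k, iters=100):
    """Plain K-means on 1-D data with squared Euclidean distance."""
    # K-means normally starts from random centroids; here we take evenly
    # spaced points from the data so the example is reproducible.
    centroids = data[::max(1, len(data) // k)][:k]
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: (x - centroids[j]) ** 2)
            clusters[nearest].append(x)
        # Update step: each centroid moves to the mean of its cluster.
        new = [sum(c) / len(c) if c else centroids[j]
               for j, c in enumerate(clusters)]
        if new == centroids:  # no centroid moved: converged
            break
        centroids = new
    return centroids, clusters

# Hypothetical Points values for nine staff members, three per group.
data = [28.0, 29.0, 27.5, 22.0, 21.5, 23.0, 16.0, 17.5, 18.0]
centroids, clusters = kmeans(data, 3)
print(sorted(round(c, 2) for c in centroids))  # one centroid per group
```

With random initialization, as in the standard algorithm, the result can land in a poor local optimum, which is one reason the paper's K-means accuracy can lag the other methods.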
Each data item is assigned to the cluster whose centroid is at minimum distance from it. The average value of each cluster is then calculated and assigned as its new centroid. The K-means algorithm operates as follows:

1. Initialize the cluster centroids C.
2. For each iteration: recalculate the distance from each data item to the centroids (C1, C2, ..., Ck) and find the closest centroid Cmin; move the item from its current cluster Ck into the new cluster Cmin and recalculate the centroids of Ck and Cmin.
3. Repeat step 2 until either the maximum iteration limit is reached or an iteration passes in which no cluster assignments change.

We consider all 15 points and the final score and plot the graph using the R tool.

The confusion matrix for K-means clustering:

       ONE   TWO   THREE
1.      51    26      82
2.      26     6       5
3.      65    36      33

b) Fuzzy C-means clustering: Fuzzy clustering is an unsupervised method for the analysis of data. In many situations, fuzzy clustering is more natural than hard clustering: objects on the boundaries between several classes are not forced to belong fully to one class, but are assigned membership degrees between 0 and 1 indicating their partial membership. The fuzzy C-means (FCM) algorithm is the most widely used. The FCM model is the optimization problem

J_m(U, V; X) = \sum_{i=1}^{N} \sum_{j=1}^{K} u_{ij}^m \|x_i - v_j\|_A^2    (1)

where X = {x_i, i = 1, ..., N} \subset R^q is the data set, N is the number of data items, K is the number of clusters, m is the degree of fuzziness, u_{ij} is the degree of membership, v_j is the centre of cluster j, and \|x_i - v_j\|_A is the distance between v_j and the object x_i. Consider

M_{fcn} = \{ U \in R^{N \times K} : 0 \le u_{ij} \le 1, \forall i, \forall j; \sum_i \sum_j u_{ij} > 0 \},    (2)

M_{fc} = \{ U \in M_{fcn} : \sum_{j=1}^{K} u_{ij} = 1, \forall i \in \{1, ..., N\}; \sum_{i=1}^{N} u_{ij} > 0, \forall j \in \{1, ..., K\} \}.    (3)

Theorem: if D_{ijA} = \|x_i - v_j\|_A > 0 for all i, j, m > 1, and the data set X contains at least K different patterns, then (U, V) \in M_{fc} \times R^{K \times q} and J_m can be minimized only if

u_{ij} = \left( \sum_{s=1}^{K} \left( \frac{\|x_i - v_j\|_A^2}{\|x_i - v_s\|_A^2} \right)^{1/(m-1)} \right)^{-1}, \quad i \in \{1, ..., N\}, \; j \in \{1, ..., K\},    (4)

v_j = \frac{\sum_{i=1}^{N} u_{ij}^m x_i}{\sum_{i=1}^{N} u_{ij}^m}, \quad \forall j \in \{1, ..., K\}.    (5)

We consider all 15 characteristics to plot the multidimensional graph for C-means clustering.

The confusion matrix for fuzzy C-means clustering:

       ONE   TWO   THREE
1.     139     0       0
2.       3    68       0
3.       0     0     120

c) Self-organizing map: A self-organizing map (SOM), also known as a self-organizing feature map (SOFM), is an artificial neural network (ANN) trained using unsupervised learning to produce a low-dimensional (typically two-dimensional) discrete representation of the input space of the training samples, called a map; it is therefore a method of dimensionality reduction. Because of its competitive learning, the self-organizing map differs from other artificial neural networks.

Algorithm:
1. Initially, randomize the node weights in the map.
2. Randomly choose an input vector.
3. Traverse every node in the map. Use the Euclidean distance formula to find the similarity between the input vector and each node's weight vector, and track the node whose distance is smallest (this node is the best matching unit, BMU).
4. Update the weight vectors of the nodes in the neighbourhood of the BMU (including the BMU itself) by moving them closer to the input vector.
5. Increase the iteration counter s and repeat from step 2 while s is below the iteration limit.

A graph of mean distance at various iterations can be plotted.

The confusion matrix for the self-organizing map:

       ONE   TWO   THREE
1.     140     3       0
2.       2    65       0
3.       0     0     120

d) Agglomerative clustering: Agglomerative clustering is a bottom-up clustering method in which all clusters have sub-clusters, which in turn have their own sub-clusters, and so on. Species taxonomy is one example. Gene expression data might also exhibit this type of hierarchical quality (e.g. neurotransmitter gene families).
It generally starts with every object in its own cluster, and then in successive iterations agglomerates the closest pair of clusters satisfying a similarity criterion (not necessarily all criteria), until all of the data is in one cluster. The hierarchy within the final cluster has the following characteristics: clusters generated in early stages are nested in those generated in later stages, and clusters of different sizes in the tree can be valuable for discovery.

The algorithm of agglomerative hierarchical clustering is:
1. Prepare the data.
2. Compute the (dis)similarity between every pair of objects in the data set.
3. Use a linkage function to group objects into a small hierarchical cluster tree, based on the distance information generated in step 2; clusters that are in close proximity are linked by the linkage function.
4. Find where to cut the hierarchical tree into clusters, creating a partition of the data.

Dendrogram and graph of agglomerative clustering: (figures)

The confusion matrix of agglomerative clustering:

       ONE   TWO   THREE
1.     140     4       0
2.       2    64       0
3.       0     0     120

e) Hierarchical K-means clustering: We use hierarchical K-means clustering because it accelerates clustering and feature-vector construction and lookup.

Algorithm: the algorithm is summarized as follows:
1. Compute the hierarchical clustering and cut the tree into k clusters.
2. Find the centre (i.e. the mean) of each cluster.
3. Compute K-means using this set of cluster centres as the initial cluster centres.

Dendrogram and cluster plot for hierarchical K-means: (figures)

The confusion matrix for hierarchical K-means clustering:

       ONE   TWO   THREE
1.     140     3       0
2.       2    65       0
3.       0     0     120

COMPARISON

Comparison of accuracy and error of all the algorithms:

S.No.  Clustering                 Accuracy (%)  Error (%)
1.     K-means clustering         27.2727       72.7273
2.     Fuzzy C-means clustering   99.0909        0.9091
3.     Self-organizing map        98.4848        1.5152
4.     Agglomerative clustering   98.1818        1.8182
5.     Hierarchical K-means       98.4848        1.5152

Performance curve of accuracy and histogram of error: (figures)

CONCLUSION

We have presented a comparative study
of K-means, fuzzy C-means, hierarchical K-means, self-organizing map and agglomerative clustering using the Siddaganga Institute of Information Technology teachers' performance database. K-means is fast, but choosing the appropriate k value is challenging. In fuzzy C-means each point is assigned a membership degree, so it does not belong to only one group, whereas in K-means each point belongs to exactly one group. It was shown experimentally that fuzzy C-means is the best algorithm in terms of accuracy (i.e. it has the least error). This analysis can be further improved by testing the algorithms on larger data sets, keeping in mind both time and accuracy.
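The reported accuracies can be cross-checked from the confusion matrices: accuracy is the diagonal sum (correctly clustered items) divided by the total number of observations, which is 330 here. A minimal Python sketch using the fuzzy C-means matrix reported above:

```python
def accuracy(cm):
    """Accuracy from a confusion matrix: diagonal sum over total, in percent."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return 100.0 * correct / total

# Fuzzy C-means confusion matrix from the paper (rows 1-3, columns ONE-THREE).
fcm = [[139, 0, 0],
       [3, 68, 0],
       [0, 0, 120]]
print(round(accuracy(fcm), 4))  # matches the reported 99.0909
```

The same computation on the K-means matrix reproduces its reported 27.2727% accuracy, confirming the comparison table.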