Effectiveness measurement of spectral clustering algorithm

After the Kuwil method was developed for applying the spectral clustering algorithm (S.C.A), a way to verify its results was needed, because in many cases the nature of the data is not compatible with the algorithm, and when the data have more than three dimensions (3D) the results cannot be displayed on a monitor. I therefore propose two techniques. The first measures the strength and effectiveness of S.C.A through a set of comparative relations that assess the effectiveness of applying the algorithm, the strength of every cluster, and the effectiveness of data correlation inside every cluster. The second is an analysis of variance (ANOVA) for S.C.A that operates on the variance of distances instead of the variance of values. I applied both methods to calculate the strength and effectiveness of S.C.A; they showed good results and can therefore add reliability to the outputs of the algorithm. Using these relations and ANOVA for S.C.A helps us measure how receptive the data are to applying the algorithm by the Kuwil method, so the outputs become more reliable, which should help to spread the use of this algorithm among researchers, analysts and other users.


Introduction
Applying S.C.A may face some difficulties, depending on the nature of the data under study. When data have three or more dimensions, they cannot be represented graphically on monitors with current technology, while some forms of data do not fit the S.C.A application according to the algorithm's definition. Therefore, I suggest testing the clustering outputs to be sure that they are reliable for decision making: by making limited use of analysis of variance (ANOVA) for S.C.A instead of a graphical representation of the data, and by using some comparative relations to test the strength of every cluster, the effectiveness of applying the algorithm, and the data correlation inside every cluster. A great deal of research on spectral clustering has examined the number of clusters. Indeed, it shows that the multiplicity of eigenvalue 1 equals the number of clusters (this was followed to some extent by Polito and Perona in [1]). In [2], it is shown that if certain conditions hold, spectral clustering minimizes the multi-way normalized cut. A generalization of the two-way normalized cut criterion [3], random walks [4], graph cuts and normalized cuts [3], and matrix perturbation theory [5], together with work simplifying the theory while improving the algorithms [3][4][5][6][7], show that significant theoretical progress has also been made. Yu and Shi [8] proposed rotating the normalized eigenvectors to obtain an optimal segmentation. Parallel spectral clustering is treated in [9], and a more detailed account of spectral clustering can be found in [10]. All the previous studies concern the algorithm itself and improvements to its application that give more accurate results.
The latest studies focus on performance improvement, in particular the speed of implementation through parallelism, but to the best of my knowledge there is no published empirical study that attempts to test the significance of effectiveness measurement through comparative relations or ANOVA.

Overview of Potential Fields
Two types of techniques were used to measure performance and to determine whether the algorithm can be applied to multiple types of data.

Comparative Relations
According to the S.C.A concepts, the main properties we must examine in the data are the distances among points and their coherence, in order to determine which data are similar and which are not. The proposed method therefore depends fundamentally on the relations among the dataset points, from which we derived five laws (relations): the first two provide a general measurement of the implementation, while the remaining three apply to each cluster individually.
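The five relations themselves are built on pairwise distances within and between clusters. As a hedged illustration (the function names and the choice of Euclidean distance are assumptions of this sketch, not the paper's exact formulas), the underlying distance machinery might look like this:

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distance matrix for an (n, d) array of points."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def intra_cluster_distances(points):
    """Distances among all point pairs inside one cluster (upper triangle)."""
    d = pairwise_distances(points)
    iu = np.triu_indices(len(points), k=1)
    return d[iu]

def inter_cluster_min_distance(points_a, points_b):
    """Smallest distance between any point of cluster A and any point of B."""
    diff = points_a[:, None, :] - points_b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).min()
```

Quantities such as A.F, M.F, ESF and ISF would then be defined as ratios over these intra- and inter-cluster distances.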

ANOVA
ANOVA has several uses: it can determine whether survey or experimental results are significant, help decide whether to reject the null hypothesis or accept the alternative hypothesis, or, put simply, compare groups to see whether there is a difference between them [11].
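As a reminder of the mechanics, one-way ANOVA compares the variance between groups with the variance within them via an F statistic. A minimal NumPy sketch (illustrative only; in practice a library routine such as MATLAB's anova1 would be used):

```python
import numpy as np

def one_way_anova_f(groups):
    """F statistic of one-way ANOVA.

    Ratio of between-group variance to within-group variance;
    `groups` is a list of 1-D NumPy arrays, one per group.
    """
    all_vals = np.concatenate(groups)
    grand_mean = all_vals.mean()
    k, n = len(groups), len(all_vals)
    # Sum of squares between groups, weighted by group size
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Sum of squared deviations within each group
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

A large F suggests the groups differ more between themselves than within themselves, which is the question the test answers.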

Comparative Relations
To clarify the proposal for comparative relations, consider Figure 1, a graph from experiment 1. The idea of measuring effectiveness is illustrated by examining the relationships among all data points inside cluster 5, and then the relationship between this cluster and the others; the procedure is repeated with each cluster separately. Note that the horizontal axis scale is larger than the vertical one, which distorts the visual perspective.

ANOVA
According to my review of statistical methods, ANOVA is the principal tool for finding a relation among many groups of data; I therefore adopted it for comparing the results of the algorithm's implementation, using one-way ANOVA. Our use is limited to displaying the graph of each cluster, clarifying the information contained in each cluster and then comparing the clusters in terms of distances, in keeping with the basic idea of S.C.A. The values whose variance is analysed are the distances among the points of each cluster. Figure 2 shows the general form of the graphic produced by ANOVA in MATLAB, in which the following can be identified:

Figure 2. Graph of ANOVA in MATLAB

- Max: the largest distance between two points within the cluster.
- Min: the smallest distance between two points within the cluster; a value of 0 indicates at least two coincident points.
- Range: the difference between the max and min distances.
- Median: the value in the middle of the data.
- 25%-ile (Q1): the 25th percentile.
- 75%-ile (Q3): the 75th percentile.
- Upper Confidence Limit (UCL): the UCL of the median.
- Lower Confidence Limit (LCL): the LCL of the median. We are not interested in the UCL and LCL, because they relate to hypothesis testing.
- IQR: the interquartile range, a measure of dispersion that overcomes the defects of the range by excluding the outlier values on both sides; it is computed from the first and third quartiles: IQR = Q3 - Q1.
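These per-cluster statistics can be computed directly from the intra-cluster distances. A minimal sketch (the function name and the use of linear-interpolation percentiles are assumptions of this illustration; MATLAB's boxplot conventions may differ slightly):

```python
import numpy as np

def distance_summary(points):
    """Boxplot-style summary of pairwise distances within one cluster."""
    diff = points[:, None, :] - points[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    # Keep each pair once (upper triangle, excluding the zero diagonal)
    dists = d[np.triu_indices(len(points), k=1)]
    q1, med, q3 = np.percentile(dists, [25, 50, 75])
    return {
        "min": dists.min(), "max": dists.max(),
        "range": dists.max() - dists.min(),
        "median": med, "Q1": q1, "Q3": q3, "IQR": q3 - q1,
    }
```

A small IQR indicates cohesive, convergent points inside the cluster; a large one indicates a weak internal relation, as used in the experiments below.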

Experiments
In order to monitor the results and make comparisons, we fixed the colours in descending order of the clusters across all the experiments, as given in Table 1. We used this method to test data that had been clustered by the Kuwil algorithm, taking five clustered dataset cases: one is 3D and the other four are 2D, and one of them consists of real data. Table 2 shows the matrix of distances between clusters. Table 3 shows the weight of every cluster relative to the whole dataset, together with ESF(c.k-oth), ESF(c.k-n) and ISF(c.k) for every cluster. The apply factor A.F here is 0.158, which means the dataset fits the concepts of the S.C.A definition well. M.F = 0.555, which shows that the distance between the closest two clusters is not small enough to make merging significant. Table 3 shows that c(2) has the strongest ESF(c.2-oth) because of its location among the other clusters, while c(4) has the weakest ESF(c.4-oth). In the graph of Figure 3, c(2) has the strongest ESF(c.2-n), while both c(1) and c(4) clearly have the weakest ESF(c.k-n), as also shown in Table 3. The strongest cluster internally is c(2) (the lowest ISF, ISF(c.2) = 0.169), and the weakest is c(1) (ISF(c.1) = 0.526). Figure 4 shows the result of the MATLAB program, where the seven clusters are represented in one graph for easy comparison and understanding. Table 4 shows that c(1) has the highest IQR, further evidence of a weak relation among its points, unlike c(2), whose IQR is smaller and whose points are therefore more cohesive. Table 4 also shows that the IQR values of all the clusters are close to one another, and that the outlier values, coloured in red, were ignored in c(2) and c(3).
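A matrix of distances between clusters like the one in Table 2 can be reproduced from the raw points. The sketch below uses the minimum pairwise distance as the inter-cluster distance, an assumption chosen because S.C.A links points through their closest neighbours; the paper's exact definition may differ:

```python
import numpy as np

def cluster_distance_matrix(clusters):
    """Symmetric matrix of inter-cluster distances.

    `clusters` is a list of (n_i, d) arrays. The minimum pairwise
    distance between two clusters is an illustrative choice, not
    necessarily the definition used in the paper's Table 2.
    """
    k = len(clusters)
    m = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            diff = clusters[i][:, None, :] - clusters[j][None, :, :]
            m[i, j] = m[j, i] = np.sqrt((diff ** 2).sum(axis=-1)).min()
    return m
```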

Experiment 2: Unreal Data, 2D
This clustered dataset has a good A.F (0.110), but its M.F (0.949) indicates a significant possibility of merging. Table 6 and Figure 5 show that c(1) and c(5) are close enough to each other to be merged; they both have ESF(c.k-n) = 0.049 and ISF(c.k) = 0.949.

Experiment 4: Real Data, 2D
The dataset contains two variables, air pollution and renewable energies, for 30 European countries over the 9 years from 2006 to 2014. The data were collected from the European Economic Association (http://ec.europa.eu/eurostat/data/database). A.F in this case is good, but M.F shows a significant possibility of merging. Specifically, in Figure 7 and Table 10, ESF(c.k-n) for c(2) and c(5) shows how close the two clusters are to each other, and ISF(c.5) = 0.961 indicates an internal weakness relative to the distance to the nearest cluster. Table 11 clearly shows that the second cluster is coherent and interrelated (IQR = 609), while the third is less so (IQR = 2022).

Experiment 5: Unreal Data, 2D
This case is quite different. A.F is relatively high (0.779) and close to one, which means the dataset does not fit the concepts of S.C.A well, while M.F (0.943) shows a significant possibility of merging. The three other factors also illustrate the weakness of all of the clusters. We can therefore say that this dataset cannot be clustered strongly.

Discussion
Before turning to our empirical comparison of the two techniques, comparative relations and ANOVA for distances, we first present a theoretical evaluation of how they measure the effectiveness of applying S.C.A by the Kuwil method. This paper reports five practical experiments conducted on several types of data, in addition to dozens of experiments that cannot be reported here.
First, the theoretical evaluation. The laws adopted to assess the effectiveness of the algorithm's defining principles use the distance between two points and relative relations, together with statistical tools such as ANOVA and the quartile deviation, whose results, while not perfectly precise, are not affected by extreme values. The results were acceptable and logical in all the experiments and for all types of data: real, unreal, 2D, 3D and different numbers of clusters. User-generated data can be controlled to match the nature of the algorithm, but real data may accept the application of the algorithm completely or only partly, so we need to measure the effectiveness in the accepted cases; this helps the user or researcher, whether in statistics or finance, to determine whether the results are acceptable given the nature of the data under study. In addition, the merge factors between clusters give the user indicators for deciding what fits the data under study. It has also become easier to apply S.C.A to more than 3D, since the evaluation of the results no longer depends on the graph alone but also on mathematical and statistical measurements.
Secondly, the empirical comparison. Table 14 shows the five cases with the most important factors, A.F and M.F. Figure 10 shows that experiment 3 is the best in terms of accepting the implementation of S.C.A and experiment 5 the least acceptable. The greatest need to merge two or more clusters occurs in experiments 2 and 4, and the least in experiments 1 and 3.

Conclusion and Future Work
As we have seen, the outputs of S.C.A under the Kuwil method can be tested by mathematical and statistical techniques, which makes the algorithm more reliable and should widen its use. We can also use these techniques to test S.C.A under another method, or even, after modifying the techniques, to test any algorithm used in data mining and artificial intelligence: S.C.A depends on the closest distances among the points regardless of the total distance needed to connect them all, while other algorithms depend on other factors, such as the total distance for the TSP algorithm or the central points for the k-means algorithm. We can therefore say that the door has been opened for further studies evaluating the performance of various algorithms on different types of data, as well as measuring their effectiveness. Because the method studies the nature of the data in detail and analyses all the relationships within the dataset, implementation is fast and effective on small and medium data, but huge data, such as image and sound files, will need improvement using the OpenMP technique of parallel programming.