FAST K-MEANS COLOR IMAGE CLUSTERING WITH NORMALIZED DISTANCE VALUES

Image segmentation is an intermediate image processing stage in which the pixels of an image are grouped into clusters so that the resulting data is more meaningful for the next stage. Many clustering methods are widely used to segment images, and most of them rely on the features of the image pixels. While some clustering methods consider the local features of images by taking into account the neighborhood system of the pixels, others consider the global features of images. The K-means clustering algorithm, which is easy to understand and simple to implement, works on the global features of the entire image. In this algorithm, the number of clusters is given by the user as an input value. If the segmentation process uses the distribution of the pixels over a histogram, the algorithm runs faster. The values in the histogram must be discrete and lie within a certain range. In this paper, we use the Euclidean distance between the color values of the pixels and the mean color values of the entire image in order to take advantage of all the color values of the pixels. To obtain a histogram that consists of discrete values, we normalize the distance values to a specific range and round them to the nearest integers. We tested the versions of K-means with the gray-level histogram and with the distance value histogram on an urban image dataset taken from the ISPRS WG III/4 2D Semantic Labeling dataset. In the comparison of the two histograms, the distance value histogram proposed in this paper gives better results than the gray-level histogram.


INTRODUCTION
The main goal of clustering methods is to group the given data according to their similarities so that the elements of any group are similar entities (de Amorim and Makarenkov, 2016). Clustering methods are commonly used in image processing applications to solve many practical problems, such as image segmentation. Image segmentation is an important intermediate stage of digital image processing that makes the data more useful and meaningful for the next stages. Digital images are the discrete forms of images, which makes them easy to process and to store in binary format. Pixels are the smallest and most fundamental elements of a digital image (Gonzalez and Woods, 2007).
Segmentation of an image mostly depends on the color values of the image pixels and the neighborhood relations between them. Local segmentation algorithms, which take the neighborhood of pixels into account, tend to reach the goal more quickly than global segmentation algorithms, which consider the features of the entire image in each operation, because local segmentation algorithms have inherently low computational complexity (Felzenszwalb and Huttenlocher, 2004; Saglam and Baykan, 2017). However, global segmentation algorithms are favorable for analyzing an image, especially if the number of clusters has been specified, because they treat the image elements as a whole. The K-means algorithm, which is also commonly used to segment images, is one of the most popular clustering algorithms, and it needs a specified number of clusters. This algorithm is also known as Lloyd's algorithm (Lloyd, 1982).
The K-means algorithm initially chooses the cluster centers either randomly or in some specified way. A cluster center represents the cluster that contains it. K-means proceeds iteratively. In the first iteration, each element of the data becomes a member of a cluster according to its closeness to the cluster centers. In the next iteration, each cluster center is recalculated as the average feature value (or feature vector, if the feature is a vector) of the elements in the cluster. After the cluster centers are calculated, the membership of every element in the data is updated according to the new cluster centers. If there is no change in the cluster centers, the algorithm ends (Jain et al., 1999).
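The iteration described above can be illustrated with a minimal sketch for scalar feature values. The function name `kmeans_1d`, the default loop limit, and the rule that an empty cluster keeps its previous center are assumptions made for this illustration and are not taken from the cited works.

```python
def kmeans_1d(values, centers, max_iters=50):
    """Plain K-means on scalar values; returns (labels, centers)."""
    centers = list(centers)
    labels = [0] * len(values)
    for _ in range(max_iters):
        # Assignment step: each element joins the cluster with the closest center.
        labels = [min(range(len(centers)), key=lambda j: abs(v - centers[j]))
                  for v in values]
        # Update step: each center becomes the mean of its current members.
        new_centers = []
        for j, old in enumerate(centers):
            members = [v for v, lab in zip(values, labels) if lab == j]
            # Assumption: an empty cluster keeps its previous center.
            new_centers.append(sum(members) / len(members) if members else old)
        # Stop when no center has changed.
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers
```

For example, `kmeans_1d([10, 12, 11, 200, 202, 198], [0, 255])` separates the three dark values from the three bright ones and stops once the two centers no longer change.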
The Euclidean distance has been widely used in the literature to measure the closeness between the elements and the cluster centers in the clustering algorithm (Lin et al., 2014; Cheng et al., 2001; Rupali and Shweta, 2014). When the K-means algorithm is applied to a digital image for segmentation, the distances between many pixels and the cluster centers must be calculated in each iteration of the algorithm, which consumes a great deal of processing time. To reduce the computation time, the histogram of pixel values, such as grayscale or brightness values, is commonly used. In digital image processing, each pixel of a gray-level image has a discrete integer gray-level value ranging from 0 to 255 (Lin et al., 2014). For this reason, a gray-level histogram has 256 bins, each of which represents one gray-level value. In a gray-level histogram, each bin holds an integer that is the number of pixels sharing the same gray value, and that gray value equals the bin number. When the closest cluster center of a bin number is calculated, all of the pixels whose gray-level values equal the bin number are assigned to the same cluster.
Many image segmentation methods use the color values of color images instead of their gray-level values to obtain more accurate results. However, if a histogram over all color components is intended, too many combinations arise (256 × 256 × 256 for the RGB color space). This requires too much processing time; therefore, in these cases, the histogram-based approach is not used, and the distance values between every pixel and every cluster center are calculated in each iteration (Lin et al., 2014).

In this paper, to use the color values while still taking advantage of the fast histogram-based approach, we replace the gray-level values of the pixels with the distance values between the color values of the pixels and the mean color values of all pixels of the entire image (the distance values are normalized to a specific range and rounded to the nearest integers). The RGB color space is used as the color space, and the Euclidean distance is used as the distance measure (Mignotte 2008; Cheng et al. 2001). We applied the fast gray-level histogram-based approach and the proposed approach based on the histogram of color distances to images obtained by airborne sensors, taken from the ISPRS WG III/4 2D Semantic Labeling dataset (Axelsson 2000; Labeling and Vaihingen 2016). The ground truth of the images is also available in the data set. The comparison of the two histogram-based approaches indicates that the color distance histogram proposed in this paper performs better than the gray-level histogram.
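A minimal sketch of how such a distance value histogram could be built is given below. The text states only that the distances are normalized to a specific range and rounded. The min-max scaling to 0..255, the function name `distance_histogram`, and the `levels` parameter are therefore assumptions of this illustration.

```python
import math

def distance_histogram(pixels, levels=256):
    """pixels: list of (r, g, b) tuples. Returns (bin index per pixel, histogram)."""
    n = len(pixels)
    # Mean color of the entire image, one mean per channel.
    mean = [sum(p[c] for p in pixels) / n for c in range(3)]
    # Euclidean distance of every pixel color to the mean color.
    dists = [math.sqrt(sum((p[c] - mean[c]) ** 2 for c in range(3)))
             for p in pixels]
    # Assumption: min-max normalization to the range [0, levels - 1].
    lo, hi = min(dists), max(dists)
    span = hi - lo
    # Round to the nearest integer (values are non-negative, so int(x + 0.5) works).
    bins = [int((d - lo) / span * (levels - 1) + 0.5) if span else 0
            for d in dists]
    # Count how many pixels fall into each discrete distance value.
    hist = [0] * levels
    for b in bins:
        hist[b] += 1
    return bins, hist
```

The resulting bin indices play the same role as gray-level values, so the same histogram-based K-means can be applied to them without change.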

MATERIAL AND METHOD
Let $P = \{p_1, p_2, \ldots, p_n\}$ be the pixel set of the input image $I$, where $n$ is the number of pixels of the input image.

Traditional K-means Algorithm for Gray Level Image Segmentation
Let $P^g = \{p_1^g, p_2^g, \ldots, p_n^g\}$ be the set of gray-level values of the pixels of the input image $I$, such that $p_i^g$ is the gray-level value of the pixel $p_i$, where $i = 1, 2, \ldots, n$, and let $C^g = \{c_1^g, c_2^g, \ldots, c_k^g\}$ be the cluster centers in the gray-level space, where $k$ is the specified number of clusters (Jain and Dubes 1988; Jain 2010). The distance value $d(p_i^g, c_j^g)$ between the gray value of the $i$th pixel and the $j$th cluster center is calculated as in Eq. 1.
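For scalar gray levels, the Euclidean distance reduces to an absolute difference. Writing $p_i^g$ for the gray level of the $i$th pixel and $c_j^g$ for the $j$th cluster center (symbol names assumed here), Eq. 1 can be sketched as:

```latex
d(p_i^g, c_j^g) = \sqrt{(p_i^g - c_j^g)^2} = \left| p_i^g - c_j^g \right| \tag{1}
```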
Normally, the K-means algorithm ends when the positions of the cluster centers no longer change. However, for image segmentation, reaching this state may need too many loops because of the large number of pixels in the image. For this reason, a tolerance value can be applied to the change in the positions of the cluster centers (Lin et al. 2014; Likas et al. 2003). In this paper, we limit the number of loops to an upper limit value $T$ instead of applying a tolerance value. Nevertheless, in our experiments, all cluster centers reached an unchanging state for all images in the dataset before the upper limit of loops was reached. The plots of the total changes of the cluster center values for each image appear in the Result and Discussion section.
The output set $L = \{l_1, l_2, \ldots, l_n\}$ defines the segmented image. It consists of labels such that $l_i$ is the label number of the $i$th pixel of the input image $I$, where $i = 1, 2, \ldots, n$, $l_i \in \{1, 2, \ldots, k\}$, and $k$ is the number of clusters. The labels of pixels within the same segment have the same label number (Figure 1); e.g., if $p_a$ and $p_b$ fall within the same segment, $l_a$ is equal to $l_b$.
The algorithm of the traditional K-means for gray levels is shown in Figure 2.
An ordinary image consists of many pixels. According to the traditional K-means, the distance value between every pixel and every cluster center must be calculated in each loop, which consumes a lot of time. Using the gray-level histogram of the image is a favorable way to reduce this time (de Amorim and Makarenkov 2016; Lin et al. 2014). A gray-level histogram of a digital image has a fixed number of bins, each of which refers to an integer gray-level value from 0 to 255. For every bin, the histogram holds a corresponding integer value that denotes the number of pixels having that gray-level value (Figure 3). The algorithm of the fast K-means based on the gray-level histogram can be seen in Figure 4.
The advantage of the algorithm, which accelerates the process, is that it calculates the distance values to the cluster centers for 256 bins instead of for $n$ pixels in each loop.
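The histogram-based variant can be sketched as follows. This is an illustration under stated assumptions rather than the algorithm of Figure 4 itself: the function name `histogram_kmeans`, the default loop limit of 50, and the handling of empty clusters (the previous center is kept) are hypothetical choices.

```python
def histogram_kmeans(gray_values, centers, max_iters=50, bins=256):
    """K-means over histogram bins; returns (label per pixel, centers)."""
    # hist[b] is the number of pixels whose gray level equals b.
    hist = [0] * bins
    for g in gray_values:
        hist[g] += 1
    centers = list(centers)
    k = len(centers)
    bin_label = [0] * bins
    for _ in range(max_iters):
        # One nearest-center search per bin instead of one per pixel.
        bin_label = [min(range(k), key=lambda j: abs(b - centers[j]))
                     for b in range(bins)]
        # Weighted mean: each bin value counts once for every pixel it holds.
        new_centers = []
        for j in range(k):
            weight = sum(hist[b] for b in range(bins) if bin_label[b] == j)
            total = sum(b * hist[b] for b in range(bins) if bin_label[b] == j)
            # Assumption: a cluster without pixels keeps its previous center.
            new_centers.append(total / weight if weight else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    # Every pixel inherits the label of its bin.
    labels = [bin_label[g] for g in gray_values]
    return labels, centers
```

The per-loop cost depends on the number of bins and clusters only; the pixel count enters once when the histogram is built and once when the labels are written out.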

Figure 3. A gray-level image and its gray-level histogram

The Proposed Method Based on the Color Distance Histogram
Using more features of images, such as the color values of pixels in color images, provides more accurate results than using gray-level values. In the RGB color space, each pixel of a color image has three color values: red, green, and blue (Peng et al. 2013; Cheng et al. 2001; Dai et al. 2015; Mignotte 2008). Therefore, each cluster center is a vector of three values instead of a single value. The Euclidean distance is generally used to measure the distance between the color value vectors and the cluster center vectors, and its equation is given in Eq. 2 (Lin et al. 2014).
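Writing $p_{i,R}$, $p_{i,G}$, $p_{i,B}$ for the red, green, and blue values of the $i$th pixel and $c_{j,R}$, $c_{j,G}$, $c_{j,B}$ for the components of the $j$th cluster center (symbol names assumed here), the Euclidean distance in the RGB color space takes the standard form:

```latex
d(p_i^c, c_j^c) = \sqrt{(p_{i,R} - c_{j,R})^2 + (p_{i,G} - c_{j,G})^2 + (p_{i,B} - c_{j,B})^2} \tag{2}
```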

RESULT AND DISCUSSION
We use the 2D Semantic Labeling - Vaihingen data to test the accuracy of the proposed method. This data set was captured over the region of Vaihingen in Germany and consists of sub-parts of the data used for the test of digital aerial cameras carried out by the German Association of Photogrammetry and Remote Sensing (DGPF) (Axelsson 2000; Labeling and Vaihingen 2016). The data set contains 33 area images of different sizes, each consisting of a true orthophoto (TOP) extracted from a larger TOP mosaic, as seen in Figure 8. However, ground truth is available for only 16 of these images. Hence, we use the 16 images that have a ground truth.

Figure 8. The larger true orthophoto mosaic and the areas

The data set was actually prepared for the automatic extraction of urban objects such as buildings, roads, or trees. This process includes classification and object recognition issues. We use the data set for clustering; therefore, it does not matter which segment corresponds to which object. The main problem of the clustering process is separating the objects from each other. Among the 16 ground truth images, 11 consist of 5 clusters, while 5 consist of 6 clusters. Thus, we set the cluster number parameter $k$ according to these numbers for each image.
We compare the gray-level histogram-based K-means and the distance value histogram-based K-means. For both methods, the initial cluster center values have an important effect on the clustering results. Thus, we fix the initial cluster centers to specific values to keep the comparison stable and fair. To specify the initial cluster centers for the gray-level histogram-based method, we use the mean gray-level values of the pixels of the intended region of each cluster, looked up from the ground truth; similarly, for the distance value histogram-based method, we use the mean distance values of the pixels of the intended regions. As the upper limit value $T$ for the number of loops, we chose 50. All of the cluster centers became stable for all images in the dataset before the upper limit of 50 was reached. The plots of the total changes of the cluster center values for each image can be seen in Figure 9 for the gray-level histogram-based method and in Figure 10 for the normalized distance value histogram-based method. After the segmentation process, we apply a 3 × 3 median filter to the output label values as post-processing to remove small cluster particles. Figure 11 shows an example image and its colored segmentation results for the two methods. We use MATLAB to test the algorithms. Table 1 gives the PSNR results of the two methods. PSNR stands for "peak signal-to-noise ratio", and it can be used as a measure of segmentation quality (Punjab and Punjab 2012). The signals compared in this data set are the segmentation results and the ground truth images. A higher PSNR value means that the segmentation result is of higher quality. The results indicate that the distance value histogram-based method performs better than the gray-level histogram-based method for all of the images.
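For reference, the standard PSNR definition can be sketched as below. How the label images are mapped to pixel values before the comparison is not specified in the text. The sketch therefore assumes two flattened sequences of intensity values with a peak value of 255, and the function name `psnr` is chosen for this illustration only.

```python
import math

def psnr(result, reference, peak=255):
    """PSNR in dB between two equally sized sequences of pixel values."""
    if len(result) != len(reference):
        raise ValueError("inputs must have the same number of pixels")
    # Mean squared error between the two signals.
    mse = sum((a - b) ** 2 for a, b in zip(result, reference)) / len(result)
    if mse == 0:
        return math.inf  # identical signals
    # Standard definition: 10 * log10(peak^2 / MSE).
    return 10 * math.log10(peak ** 2 / mse)
```

A smaller mean squared error against the ground truth gives a larger PSNR, which is why higher values are read as better segmentation quality.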

CONCLUSION
In this paper, we present the traditional K-means clustering algorithm, its fast form based on the gray-level histogram, and the proposed method based on the distance value histogram for color images. The histogram-based K-means algorithm is considerably faster than the traditional K-means algorithm when both are applied to gray-level images. Color image pixels have a feature vector, commonly of three values, instead of a single feature value. This complicates the implementation of the histogram-based approach for color images. To utilize the fast approach, color images are generally converted to gray-level images before the histogram-based algorithm is applied; however, the contribution of the color values to the segmentation process is then lost. In this study, we aim to take advantage of the color values, and for this purpose, we use the normalized distance values of color images in the histogram-based approach. In a normalized distance value histogram, the distance refers to the Euclidean distance between the color value vector of a pixel and the mean color value vector of the entire image. The results show that the proposed method provides better segmentation results than the gray-level histogram-based method.

ACKNOWLEDGMENT
This study was supported by "Scientific Research Projects of Selcuk University". It was presented as an oral presentation at the ISCAS'2016 conference held in Antalya (Turkey), 27-30 September 2016, and selected for SUJEST (Selcuk University Journal of Engineering, Science and Technology).
According to the algorithm in Figure 2, the initial values of the cluster centers are assigned randomly from 0 to 255. In the literature, however, assigning the initial values of the cluster centers is a problem in its own right, and many researchers have introduced various techniques for it (Tian et al. 2013; Gingles and Celebi 2014), because the initialization of the cluster center values can considerably influence the results of K-means. In each loop, each data element chooses the cluster center closest to itself, and its label number is set to the index of the selected cluster center. After this process, all of the cluster centers are updated, and the algorithm continues to the next loop. This process is repeated until there is no change in the values of the cluster centers or the number of loops reaches the limit value $T$.

Figure 2. The algorithm of the traditional K-means for gray-level image segmentation

However, if the histograms for every color value are used, combining them into one histogram yields too many bins (256 × 256 × 256 for the RGB color space). To take advantage of both the color features of a color image and the speed of the histogram-based approach, we propose a method based on the distance values between the color values of the pixels and the mean color values of the entire image (Figure 5). Let the mean color values be $M = \{m_R, m_G, m_B\}$ for the means of the red, green, and blue values, respectively. To use the distance values in a histogram, the values need to be discrete and normalized. Therefore, we first calculate the distance values between

Figure 4. The algorithm of the fast K-means based on gray-level histogram

Figure 5. The Euclidean distance in the RGB color space between the color values of a pixel and the mean color values of the entire given color image

Figure 7. A color image and its distance value histogram

Figure 9. Total changes of the cluster center values for each image for the gray-level histogram-based method

$p_{i,R}$, $p_{i,G}$, and $p_{i,B}$ refer to the red, green, and blue values, respectively, of the $i$th pixel of the given color image. $c_{j,R}$, $c_{j,G}$, and $c_{j,B}$ refer to the vector values of the center of the $j$th cluster. The distance value $d(p_i^c, c_j^c)$ refers to the Euclidean distance between the color vector $p_i^c$ of the $i$th pixel and the $j$th cluster center $c_j^c$.