In Chapter 2 we saw how clustering methods such as the K-means algorithm and hi-erarchical clustering can be used to identify data point that are similar to each otherin such a way that they belong to the same group (cluster) of points. However, whendealing with data (especially large data sets), there are often some limitations with thesealgorithms already discussed. We want to apply clustering algorithms whilst making thefewest assumptions possible and not having to specify the number of clusters to use be-fore starting.
Also, we want clustering algorithms to discover clusters of arbitrary shapeand be very efficient with large data sets. Although, the algorithms we have alreadyseen do offer solutions to some of these problems, none of which offer a solution to allof these problems. This is why in this chapter we will look at more advanced clusteringtechniques which may provide a better insight to the data and overcome these problems.3.1Density-Based Clustering (DBSCAN)In section 3.
1 I have used pages 226-231 of M. Ester, H.Kriegel, J.snder and X.Xu’sjournal, A Density-Based Algorithm for Discovering Clusters in Large Spatial Databaseswith Noise [9] as a guideline for my definitions and discussions. If we look at Figure 3.1,we see the result of the K-means algorithm with K = 5 being applied to a syntheticdata set designed to expose the limitations of K-means clustering. Looking at the plotwithout the clustering there seems to be some noise within the data but also five distinctclusters, two elliptical clusters, two linear clusters and a small clustering in the bottomright hand corner. We recognise these clusters because of the high density of data pointsin a certain area and we associate data points as areas of noise when the density is muchlower. However, the clustering of the data in Figure 3.1 does not match these expectedclusters and is clearly interpreting the data incorrectly, hence we should use a differentclustering algorithm to group this data.We will now look in particular at DBSCAN (Density-Based Spatial Clustering andApplication with Noise), which is a density-based clustering algorithm, which is used to identify cluster of arbitrary shape in data consisting of noise and outliers. In thisalgorithm, each data point that belongs to a cluster has at least a minimum number ofpoints within a neighbourhood of a given radius of itself. If a data point doesn’t belongto a cluster, the neighbourhood of the point doesn’t have the required number of pointswithin it. Although we can use any appropriate similarity measure (Section 2.1) whendefining the µ-neighbourhood of a point, we will use the Euclidean distance from hereonwards in this section.Definition 10 (µ-neighbourhood of a point). The µ-neighbourhood of a point p is definedby(p) = {q €€ D|d(p, q) < µ},(3.1)where D is the data set. Any point which lies in the neighbourhood of p is called aneighbour of pTherefore, to start DBSCAN algorithm we need two parameters, µ and n, where µis the radius of the neighbourhood around point p and n is the minimum number ofpoints that must lie in the neighbourhood for that point to feature in the cluster. Anypoint p with a neighbour count greater than n is called a core point.If the neighbour count is less than n, but it belongs to the µ-neighbourhood of some point r, the point iscalled a border point. If a point is neither of these, it is called a noise point or an outlier.Before we state the algorithm, we start by defining three terms that are required in theunderstanding of the DBSCAN.Definition 11 (Direct density reachable). A point p is directly density reachable fromanother point q if: p is in the µ-neighbourhood of q and q is a core point.Definition 12 (Density reachable). A point p is density reachable from q if there existsa set of core points leading from q to p.Definition 13 (Density connected). Two points p and q are called density connected ifthere exists a core point r, such that both p and q are density reachable from r.Now that we have these three definitions, we can introduce the DBSCAN algorithmwhich is stated below.DBSCAN algorithm1. For each point p in the data, find all of its neighbour points. If a point, p, hasa neighbour count greater than n, then mark point p as a core point.2. For each core point find all of the points that are density connected and assignthem to the same cluster as the core point. If the point is not already assignedto a cluster, create a new cluster.3. For each point that isn't a core point or a border point, treat as an outlier ornoise.If we have a border point in the data, it may belong to more than one cluster, inthis case the point will be assigned to the cluster that was discovered first. However,generally the result of DBSCAN is independent of the order in which the points arevisited.In an ideal situation we would know µ and n of each cluster and at least one pointfrom each cluster. With this information, we could find accurate clustering of the data byusing density reachability. However, there is usually no way of getting this informationin advance. But we can determine the parameters of the least dense cluster which is notconsidered to be noise, which is then a good candidate for these parameters.To do this we first look at finding the value of µ. Note that the DBSCAN algorithm isvery sensitive to the choice of µ, with particular regard to clusters with different densities.For example, if µ is too large, denser clusters may merge together and if µ is too small,sparse clusters may be considered to be noise. This means if there is a large difference in cluster densities, a single value for µ may not suffice. However, for the time being wewill try and find a single optimal value for µ.Firstly we calculate the distance between every point in the data and the K-nearestneighbour (K-NN) of this point.The idea is then to calculate the average of the distancesbetween every point and its K-NN where K is given by n. We then sort these distancesinto ascending order and plot them. On this graph, we now look for a point whichcorresponds to the optimal µ. This optimal µ will be the point such that all the pointsto the left of it correspond to core points and all the points to the right of it correspondto noise. This means we are looking for a knee in the data, which will correspondto the threshold where a sharp change occurs along the K-distance curve and will inturn correspond to the value of µ It turns out K-distance graphs for k > 4 do not differgreatly from the 4-distance graph, therefore for 2-dimensional data we will eliminate thisparameter by setting n = 4.Figure 3.2: Finding the optimal value for µLooking at Figure 3.2 we can see that the knee of the data is at 0.15, thereforewhen applying the DBSCAN algorithm we set µ = 0.15 and n = 4. Now that we have thetwo parameter values we need, lets apply this algorithm to the data. Figure 3.3 showsus the visualization of the DBSCAN algorithm. We can see that this is a much moreappropriate clustering than that provided by the K-means algorithm which is shownin Figure 3.1. Since the DBSCAN algorithm can group points into arbitrary shapes, itmeans it can distinguish what appears to be the correct clusters within the data whereasthe K-means algorithm wasn’t able to do this. If we look at Figure 3.1 the inaccuraciesof the plot all seem to be accounted for by the K-means algorithm not being able todistinguish arbitrary shapes. Also note how the K-means algorithm will include every single point into a cluster, whereas the black points in the DBSCAN algorithm accountsfor outliers, meaning not every point is clustered. This in turn leads to a more robustresult as it is less likely to be affected by unusually behaving data.Figure 3.3: Visualization of DBSCAN algorithmReviewing DBSCAN, we can say it is a very good clustering algorithm to use whenthe data involves lots of noise/outliers. It is also more appropriate than algorithmssuch as the K-means algorithm when the clusters are of arbitrary shapes and not justspherical. Finally, unlike the algorithms we have seen before, at no point do we have tospecify the number of clusters to split the data into. This means less prior knowledgeof the data is needed. Although, as we have seen, the algorithm is very sensitive to thechoices of µ and we need to try and choose the most appropriate value for µ.
Cite this essay
In Chapter 2 we saw how clustering methods such as the Kmeans. (2019, Aug 20). Retrieved from https://studymoose.com/in-chapter-2-we-saw-how-clustering-methods-such-as-the-kmeans-essay
How to Avoid Plagiarism
Use multiple resourses when assembling your essay
Use Plagiarism Checker to double check your essay
Get help from professional writers when not sure you can do it yourself
Rate this post A thirteen year old girl with concerns and fears for her future has a dream to make a difference, not just for her, but for everyone, strangers, friends and family. She dreams that everyone will be able to experience the sort of life she has. But mostly she dreams for change. Severn...
Rate this post The tiny emerald land on the west coast of India is best known as a place of sandy beaches and parties. But far from the popular tourist hotspots, sandy beaches and loud parties – lush greenery, sparkling waterfalls and the calm of the countryside is the other face of Goa. Party all...
Rate this post Academic dishonesty is a vice that is plaguing higher education in today’s society. While lecturers and exam invigilators always seem to be a step behind in catching the culprits, the techniques being invented by the students to beat the system can almost be viewed with grim fascination. The emphasis that has been...
Rate this post A review of related literature and studies is the theories which the researchers use to explain the existence of a research problem and use as a bases in analyzing relationship between variables can be generated from reference books and of collecting, selecting and reading books, journals, reports, abstract, and other reference materials....
Rate this post In order to discuss fully the concept of cultural differences in human communication, it is important to tackle this in light of Latin communication. Don't use plagiarized sources. Get Your Custom Essay on In Chapter 2 we saw how clustering methods such as the Kmeans Just from $13,9/Page Get custom paper Aside...
Not Finding What You Need?
Search for essay samples now
Copying content is not allowed on this website
Ask a professional writer to help you with your text