All the previous chapters have been in preparation for this last chapter, where we investigate a data set to try and identify bots. A bot is a software application that automatically performs simple and structurally repetitive tasks over the internet at much higher rates than humanly possible.
Bots were initially introduced to perform important tasks on the internet, especially in relation to operating search engines. However, around 66% of internet bots are now considered to be malicious. These bots can do several things when visiting a website, including scraping personal information or making the website unavailable to users by generating large volumes of traffic against it.
Due to the dangers these bots can create, it has become essential to try and identify them and stop them from visiting your webpage. When malicious bots first came into use, they were relatively easy to identify due to their low level of sophistication. However, bots are now much more advanced and often behave almost identically to a human, making them almost impossible to detect.
Hence in our analysis, we have to accept that we are very unlikely to find all the bots within the data set, but detecting any will be a big step in protecting the website from future attacks.
The data we will be looking at was provided by a local company, Clicksco, and contains information about the visits to a webpage on 2018-06-10. The data set contains 1,048,575 records, each of which holds 17 variables.
The variables include information such as session ID (a unique identifier for the browsing session), the start and end time of each session, the device type the web page was visited on and whether or not that profile is known to be a bot. The final goal is to isolate the variables that matter most when identifying a bot and then use these to find the bots within the data set.
From the initial variables, we plot the history score, device type and end time type in Figure 6.1. We then use some of the initial variables to create 3 new variables: average time spent on a web page, number of web pages visited per session and the hour in which the session first started. The plots of these three variables are also shown in Figure 6.1.
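As a rough illustration of how the three derived variables could be computed, the sketch below assumes a layout that the text does not confirm: one record per page view, holding a session ID, a start time and an end time. The record layout, the sample values and the function name `derive_session_features` are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Hypothetical page-view records: (session_id, start_time, end_time).
# The real column names and layout are not given in the text.
views = [
    ("s1", "2018-06-10 09:00:00", "2018-06-10 09:00:30"),
    ("s1", "2018-06-10 09:00:30", "2018-06-10 09:01:30"),
    ("s2", "2018-06-10 14:05:00", "2018-06-10 14:05:00"),
]


def derive_session_features(rows):
    """Return {session_id: (avg_seconds_per_page, pages_visited, start_hour)}."""
    durations = defaultdict(list)
    first_start = {}
    for session_id, start, end in rows:
        t0 = datetime.strptime(start, FMT)
        t1 = datetime.strptime(end, FMT)
        # Time spent on this page, in seconds.
        durations[session_id].append((t1 - t0).total_seconds())
        # Track the earliest page view to find when the session began.
        if session_id not in first_start or t0 < first_start[session_id]:
            first_start[session_id] = t0
    return {
        sid: (sum(d) / len(d), len(d), first_start[sid].hour)
        for sid, d in durations.items()
    }


features = derive_session_features(views)
```

Here session `s1` averages 45 seconds over two pages and starts in hour 9, while `s2` is a single zero-length visit in hour 14.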
Looking at Figure 6.1, we can immediately see that the plots of Device type and End time type fail to add much information. End time type has only two responses, both of which seem relatively similar to each other, whereas Device type is dominated by just two devices: desktop and smartphone. Hence, neither is likely to add valuable information when trying to locate anomalous behaviour. Let’s now look at the variable containing information about the hour in which the page was visited, shown by the Start hour plot. The plot shows what you’d expect for hits on a web page: a low frequency in the early hours of the morning, increasing steadily to a peak in the early afternoon, followed by a steady decline into the night.
However, this is still an interesting pattern and we will investigate this variable further. We now look at the history score, a variable that takes values in the range 0 to 1, with 0 being unlike bot behaviour and 1 being certain bot behaviour. There is a clear peak in the plot at 0.38, followed by numerous smaller peaks at higher scores. This could also lead to some interesting insights with respect to anomalous behaviour, so we will carry on investigating this variable too.
From the definition of a bot, we are looking for data points that visit a lot of pages in a short space of time. We therefore plot the average time on a web page against the number of page visits in a session in Figure 6.2. There seems to be a clear shape to this graph, with the vast majority of points having very low values for both variables. However, the data points that have very low values for average time but high values for number of pages visited are the ones we would expect, from the definition, to be potential bots.
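The rule of thumb that follows from the definition (very low average time, very high page count) can be written as a simple filter. This is only a sketch: the two cut-off values below are hypothetical and chosen purely for illustration, since the text does not state any thresholds.

```python
# Hypothetical cut-offs for illustration only; the text gives no thresholds.
MAX_AVG_SECONDS = 1.0
MIN_PAGE_VISITS = 50


def is_bot_candidate(avg_seconds, page_visits,
                     max_avg_seconds=MAX_AVG_SECONDS,
                     min_page_visits=MIN_PAGE_VISITS):
    """True when a session is both very fast per page and visits many pages."""
    return avg_seconds <= max_avg_seconds and page_visits >= min_page_visits


# Hypothetical sessions: session_id -> (avg_seconds_per_page, pages_visited).
sessions = {
    "s1": (45.0, 2),    # slow and short: human-like
    "s2": (0.2, 300),   # fast and long: matches the definition of a bot
    "s3": (0.0, 1),     # fast but a single page: not flagged by this rule
}

candidates = sorted(
    sid for sid, (avg, pages) in sessions.items() if is_bot_candidate(avg, pages)
)
```

Only `s2` is flagged. A fixed rule like this finds only bots that fit the textbook definition, so it would miss any that browse slowly or visit few pages.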
If we next look at the variable declaring whether or not a data point is a bot, we can get a better understanding of what a bot might look like and use this to identify more of them. Figure 6.3 shows four plots of what are considered to be the key variables of the data set, restricted to the known bots. The plot of time agrees with our definition of a bot, with most of the known bots having an average time very close to zero. However, around 12 points have a time much greater than zero, demonstrating how bots can evolve to try and mask themselves. The plot of number of page visits shows that most known bots visit fewer than five web pages, which is unexpected given the definition. The third plot shows that all known bots have a history score of at least 0.38, with common scores being 0.38, 0.44 and 0.48. Interestingly, the maximum history score obtained by any known bot was 0.63, which is well below the score of 1 given for certain bot-like behaviour.
Finally, the fourth plot shows the hour in which the bot started visiting the webpages. From our initial analysis, we thought that this variable might be interesting and reveal potential bot-like behaviour, but this plot looks almost like a uniform distribution and hence offers very little insight into the data. Therefore, we will discard this fourth variable and focus on the other three.
The data set we are considering is very large, so before we can apply some of the algorithms discussed earlier in the report, we have to condense the data whilst losing as little information as possible. Since there is so much data with time = 0 and page visits less than or equal to 3, we will delete these records from the data set, simply because we can’t get much information from them. It is worth noting that some of these data points may in fact be bots, but because the frequency of their page visits is so low, they would not be choking up the web page and we can discard them without much worry. Also, we have just seen that all known bots have a history score of at least 0.38, so it seems sensible to reduce the data set even further by removing any records with a history score less than 0.38.
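The two reduction rules can be sketched as a single predicate. This is one reading of the description, assuming a record is dropped when it has zero average time together with three or fewer page visits, or when its history score is below 0.38. The `Session` type and its field names are hypothetical, not the data set's actual columns.

```python
from typing import NamedTuple


class Session(NamedTuple):
    """Hypothetical record; field names are assumptions, not the real columns."""
    avg_time: float       # average seconds per page
    page_visits: int      # pages visited in the session
    history_score: float  # score in [0, 1]
    known_bot: bool


HISTORY_THRESHOLD = 0.38  # lowest history score seen among known bots
MAX_LOW_VISITS = 3        # "page visits less than or equal to 3"


def keep(s):
    """Return True if the session survives both reduction rules."""
    uninformative = s.avg_time == 0 and s.page_visits <= MAX_LOW_VISITS
    low_score = s.history_score < HISTORY_THRESHOLD
    return not uninformative and not low_score


sessions = [
    Session(0.0, 2, 0.44, False),   # dropped: zero time and few visits
    Session(0.0, 10, 0.44, True),   # kept: zero time but many visits
    Session(5.0, 2, 0.20, False),   # dropped: history score below 0.38
    Session(5.0, 2, 0.38, True),    # kept: a score of exactly 0.38 is retained
]

reduced = [s for s in sessions if keep(s)]
```

Using a strict `<` on the history score keeps records scoring exactly 0.38, which matters because 0.38 is one of the common scores among the known bots.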
We now plot the three variables (average time spent on page, number of page visits and history score) for the reduced data set, that is, all records with a history score of at least 0.38, excluding those with zero time and three or fewer page visits. This is the data we will consider from now on. Using a 3D scatter plot we obtain Figure 6.4, where the red data points correspond to the known bots and the black points correspond to the rest of the data, for which we are yet to decide whether they are bots or not. The known bots all have an average time close to zero, which is what we would expect, but their numbers of page visits are all relatively low. Also, note that the known bots sit amongst the bulk of the data, so it could be difficult to find similar anomalous points that can be explained by bots.