
How to filter bots from analytics platforms with User Agent data

Kirstin R

8/23/2021 4:30 PM

Analytics, Device Detection, Tutorial, User Agent Analysis

Using User Agent data as a starting point, you can investigate unusual website traffic spikes

I start my day like many other marketing professionals by checking website traffic data within an analytics platform. After a few months, I started to notice something unusual in our reports on visitor traffic to 51Degrees.com.

Some weeks we’d have an unexplained increase in traffic. This traffic wasn’t the result of an email I sent out or a paid social media campaign. In fact, it had no correlation to the marketing I was pushing out at all.

I would analyze this traffic spike, investigating the acquisition, what country it came from, what page it visited and where it went next, and yet I couldn’t find a common thread.

This type of traffic was skewing my data reports. I couldn’t find any commonality between the traffic spikes, so I was unable to remove it from my reports. This led to many awkward conversations with colleagues as I sheepishly explained I didn’t know how to identify and filter out the traffic!

After spending a lot of time searching the web for an answer, I found a few good guides to filtering crawlers and identifying unnatural traffic. They were helpful, but they didn’t fully solve the problem I was seeing.

So, I set out to find a different solution to filtering bot traffic in our analytics platform. This blog delves deeper into the bot filtering method we discovered by using 51Degrees real-time data services.

Spoiler alert: our method removed the fluctuations in traffic that I couldn’t explain, producing a smoother website traffic report. With added confidence in our reports, we can now focus on the marketing channels that bring real people to our site.

before-after-graph
The resulting traffic once unnatural bots and crawlers were filtered from our website reports.

The problem: how to identify bot traffic

An internet bot is an automated process that has been programmed to do certain tasks. “Good” bots like Googlebot or Bingbot are designed by search engines to crawl a site and index its pages. These bots will proudly announce themselves as bots within their User Agent:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

“Bad” bots often have malicious purposes, such as crawling for email addresses to send spam to, or trying to break into people's accounts. These bots don’t want to be caught doing this, so they often masquerade as something else by simulating a different User Agent string. It's these “bad” bots that we want to filter out of our website traffic reports.

At 51Degrees, we have our own bot filtering system that can remove known bots. The 51Degrees bot filter acts like a bouncer, stopping unwanted bots and crawlers before they even get through the doors to your website.

Certain analytics platforms have their own bot filtering system. For example, with Google Analytics you can click the “exclude all hits from known bots and spiders” button in the admin settings. Unfortunately, some bots are still able to slip through the cracks, especially when they pretend to be something they’re not.

I analyzed our website traffic over several weeks and was able to spot a few noticeable patterns. Often when we had a large, unexplained increase in traffic, the biggest change in traffic had a source and medium of (direct) / (none), a 100% bounce rate, and a 0 second average session duration.

Landing page                      Users   Bounce Rate   Pages/Session   Avg. Session Duration
/apple-touch-iconon-152x152.png   29      100.00%       1.00            00:00:00
/index.xml                        28      100.00%       1.00            00:00:00
/blog/rss/1641-1                  25      100.00%       1.00            00:00:00

These attributes aren’t always indicative of a bot. Someone could visit your page via a bookmark or by typing your URL, read the content, then immediately close it. An analytics platform could record this as a (direct) / (none), a 0 second session duration, and a 100% bounce rate. Why? Because these figures are calculated as time between pages – if they only visited one page, there is nothing to compare it to.

In reality, this type of direct/none traffic couldn’t account for the sudden increase in website visitors. Furthermore, when you compare this unusual traffic with other factors, more insights are revealed.

As an example, you may spot that the sudden traffic increase all led to one page. In our case, we would often see traffic spikes to error 404 landing pages that contained “wp-admin” in the URL – clearly some malicious bots scraping the web trying to find and attack websites hosted on WordPress.

So, we’ve found a common thread between the unusual traffic spikes. The only problem is, we can’t filter out this traffic confidently without filtering out potential human traffic. We needed more data gathered from custom dimensions.

The User Agent custom dimension

One way to gather further information on website visitors is via their User Agent string. If you are unfamiliar with the User Agent string and what information can be parsed from it, take a look at our User Agents and Device Detection blog.

Having User Agent data within our analytics platform would help us to identify any areas of commonality between the unusual traffic. To record User Agent information within our analytics platform, we had to create a custom dimension.

More Visibility have a good guide to getting started with creating a User Agent custom dimension. If you aren’t confident with implementing the change, a developer may be able to help you.
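
If it helps to picture the implementation, here is a minimal sketch of the approach such guides describe, assuming a Universal Analytics (analytics.js) setup and a hit-scoped custom dimension created at index 1 in the admin settings; your property ID and dimension index will differ.

```typescript
// Minimal sketch: record the visitor's User Agent in a Google Analytics
// (Universal Analytics / analytics.js) custom dimension.
// Assumptions: the analytics.js snippet is already loaded on the page, and a
// hit-scoped custom dimension has been created at index 1 in the GA admin.
// Replace 'UA-XXXXXXX-Y' and the dimension index with your own values.

declare function ga(...args: unknown[]): void;

ga('create', 'UA-XXXXXXX-Y', 'auto');

// Set the raw User Agent string before the pageview is sent,
// so the dimension is attached to that hit.
ga('set', 'dimension1', navigator.userAgent);

ga('send', 'pageview');
```

The key detail is that the dimension is set before the pageview hit is sent, so the User Agent travels with every hit you want to analyze later.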

Once your User Agent custom dimension has been set up and you gather enough visitor data, you can start to investigate the unusual traffic. For us, a lot of the traffic spikes were due to one User Agent, or User Agents with older browser versions.

Let’s take an example from the earlier table. We identified that the traffic that landed on the non-existent /index.xml page had all the characteristics of a crawler or bot. And it all shared a single User Agent:

Mozilla/5.0 (Windows NT 6.1; WOW64; rv:39.0) Gecko/20100101 Firefox/39.0

Seems normal, right? But when you search for this User Agent within our User Agent parser, it turns out that Firefox version 39 was discontinued in 2015!

Most people's web browsers are updated automatically. Older browsers, and especially ones that were discontinued years ago, are rarely used by real humans. We can therefore safely conclude that this traffic came from a bot or crawler simulating a different User Agent string to hide its true identity.
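
To make the idea concrete, here is a deliberately simplified, hypothetical check. It only looks at the Firefox major version number in the string, which is nothing like a full device detection lookup, but it shows the kind of signal we are after; the version threshold is made up.

```typescript
// Simplified illustration only: flag a Firefox User Agent whose major version
// is far behind a chosen minimum. A real solution needs a maintained User
// Agent database, since this regex covers a single browser family and User
// Agents can be spoofed.

const MIN_FIREFOX_VERSION = 90; // hypothetical threshold for "recent enough"

function looksLikeOutdatedFirefox(userAgent: string): boolean {
  const match = userAgent.match(/Firefox\/(\d+)/);
  if (!match) {
    return false; // not Firefox, or not recognizable by this simple pattern
  }
  return parseInt(match[1], 10) < MIN_FIREFOX_VERSION;
}

// The suspicious User Agent from the example above:
const oldFirefox = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:39.0) Gecko/20100101 Firefox/39.0';
console.log(looksLikeOutdatedFirefox(oldFirefox)); // true: Firefox 39 was discontinued in 2015
```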

After further analysis of the unusual traffic spikes over a period of a few weeks, I noticed that most of the User Agent strings contained browser versions that were at least a few years old. I had the hypothesis, but I needed to test it.

The browser age custom dimensions

Our device detection can parse any User Agent to determine the age of the browser it contains. We have two specific device detection properties that come in handy here: BrowserReleaseAge and BrowserDiscontinuedAge.

Referring to our property dictionary, BrowserReleaseAge indicates the age in months of the browser since it was released. And BrowserDiscontinuedAge indicates the age in months since the browser was discontinued, or in other words, no longer supported.
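
To make the meaning of these two properties concrete, here is a hypothetical sketch of the same kind of calculation: month counts derived from a release date and an optional discontinued date. The dates are placeholders, and this is not how 51Degrees computes the properties; the real values come from the 51Degrees device data.

```typescript
// Hypothetical sketch of what "age in months" means for these two properties.
// The dates below are placeholders; the real release and discontinuation data
// is maintained in the 51Degrees device detection database.

interface BrowserLifecycle {
  released: Date;
  discontinued?: Date; // undefined while the browser version is still supported
}

function monthsBetween(from: Date, to: Date): number {
  return (to.getFullYear() - from.getFullYear()) * 12 + (to.getMonth() - from.getMonth());
}

function browserReleaseAge(browser: BrowserLifecycle, now = new Date()): number {
  return monthsBetween(browser.released, now);
}

function browserDiscontinuedAge(browser: BrowserLifecycle, now = new Date()): number {
  // 0 means the version is still supported, which is how the segment later on treats it.
  return browser.discontinued ? monthsBetween(browser.discontinued, now) : 0;
}

// Placeholder dates roughly matching the old Firefox 39 example:
const firefox39: BrowserLifecycle = {
  released: new Date('2015-07-01'),
  discontinued: new Date('2015-08-01'),
};
console.log(browserReleaseAge(firefox39), browserDiscontinuedAge(firefox39));
```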

When I filtered our website traffic using the BrowserReleaseAge custom dimension, most visitors used browsers that were relatively new, with ages of 0-6 months. Unusually though, there were a few outliers in the form of 60- or 124-month-old browsers. Who would be using a browser version that is over 10 years old?

With our three custom dimensions, the stage was set to remove the stubborn crawler traffic, but there was one final step to complete. I needed to create a filter that would finally remove the bot traffic.

The segment filter

I added the filter as a segment within the property; the data is still captured and recorded, but I can choose whether to see it. Plus, this method allows for comparisons between data sets, before the browser age filter is applied and after.

We found that the segment worked best when we filtered out any browser ages older than 6 months. This may not be the ideal number for your needs as it all depends on your typical audience and the data that you see within your analytics platform.

For reference, the segment I created to filter out the unusual traffic includes a BrowserReleaseAge of 0-6, as well as a BrowserDiscontinuedAge of 0. If the BrowserDiscontinuedAge is anything other than 0, it means the browser version is old and no longer supported by the software company.

I set up the filter to only include traffic that had a BrowserReleaseAge that exactly matched 0, 1, 2, 3, 4, 5, 6. It also included traffic where the BrowserDiscontinuedAge exactly matched 0. I’m sure there are cleaner ways to create this segment, but this works on a basic level. Now it’s time to put this filter to the test.
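
If you export hits with these custom dimensions attached, the same segment logic can be written as a simple filter. Here is a sketch using hypothetical field names; it combines the two conditions with OR, so a hit is kept if its browser was released within the last six months or is still the supported version, and you should adjust both the logic and the threshold to mirror your own segment.

```typescript
// Sketch of the segment logic described above, applied to rows exported from
// an analytics platform. The field names are hypothetical, and the two
// conditions are combined with OR here; adjust both to mirror your own segment.

interface TrafficRow {
  landingPage: string;
  users: number;
  browserReleaseAge: number;      // months since the browser version was released
  browserDiscontinuedAge: number; // 0 means the version is still supported
}

const MAX_RELEASE_AGE_MONTHS = 6; // the threshold that worked for our audience

function isLikelyHuman(row: TrafficRow): boolean {
  const recentlyReleased = row.browserReleaseAge <= MAX_RELEASE_AGE_MONTHS;
  const stillSupported = row.browserDiscontinuedAge === 0;
  return recentlyReleased || stillSupported;
}

function filterBotTraffic(rows: TrafficRow[]): TrafficRow[] {
  return rows.filter(isLikelyHuman);
}
```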

The results of our bot traffic filter

To summarize, this is the method I used to filter out crawlers from our analytics platform using our device detection service:

  1. Add three custom dimensions to identify the User Agent String, BrowserReleaseAge, and BrowserDiscontinuedAge.
  2. Analyze the data. Do the bots use browser versions of a certain age?
  3. Decide on what browser ages you want to remove.
  4. Create the segment filter.
before-after-graph
The unexplained spikes in website traffic (depicted in the red line graph) are smoothed out with the BrowserReleaseAge filter (shown with the blue line).

It’s worth mentioning that when you filter out unusual traffic using the BrowserReleaseAge filter, you will see a reduction in your overall total number of website visitors. With a filter like this, you may even filter out one or two real human beings – the chances that someone out there is browsing your website using a decade-old tablet that hasn’t been updated since it was bought are slim, but not impossible.

However, for a general overview of your traffic, adding the BrowserReleaseAge filter leaves you with data that is richer, more reliable, and less susceptible to sudden invasions of bot, crawler, or spam traffic.

As long as bad actors see value in sending bots to your website, completely blocking this malicious traffic will be an endless task, especially as bot technology becomes more sophisticated and harder to detect.

Using our own website as an example, we have shown that anyone can benefit from richer insights into their website traffic. You may get one million hits a month, but how many of those are real humans looking to convert on your website? By filtering out stubborn bots pretending to be real human traffic, we can be confident in the data we report on.

All of this was possible due to our thorough device detection. Without our extensive User Agent database, the analytics platform is limited in what data it can collect. Adding the custom dimensions (which could only be populated when combined with our device detection) allowed us to deduce how old the browser in the User Agent string was, and ultimately find the link between unusual traffic spikes and browser age.

If filtering out unusual traffic is your SEO goal, then I recommend our bot filtering properties. To get started with our cloud device detection, choose the BrowserReleaseAge and BrowserDiscontinuedAge properties on the cloud configurator and implement the code. You will need to sign up to one of our pricing plans to get started.
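
For illustration only, here is a rough sketch of what a lookup for those two properties might look like against the cloud service. The endpoint shape, parameter name, and response fields are my assumptions rather than confirmed details; the cloud configurator generates the exact request and snippet for your resource key and chosen properties.

```typescript
// Rough, hedged sketch of querying the 51Degrees cloud for the two browser-age
// properties of a single User Agent. The endpoint shape, the 'user-agent'
// parameter, and the response field names are assumptions; the cloud
// configurator generates the exact snippet for your resource key.

const RESOURCE_KEY = 'YOUR_RESOURCE_KEY'; // issued when you sign up and configure the cloud service

async function getBrowserAges(userAgent: string): Promise<{ releaseAge?: number; discontinuedAge?: number }> {
  const url = `https://cloud.51degrees.com/api/v4/${RESOURCE_KEY}.json` +
    `?user-agent=${encodeURIComponent(userAgent)}`;
  const response = await fetch(url);
  const data = await response.json();
  // Assumed to mirror the property dictionary names, lower-cased.
  return {
    releaseAge: data.device?.browserreleaseage,
    discontinuedAge: data.device?.browserdiscontinuedage,
  };
}

// Example: look up the suspicious User Agent from earlier.
getBrowserAges('Mozilla/5.0 (Windows NT 6.1; WOW64; rv:39.0) Gecko/20100101 Firefox/39.0')
  .then((ages) => console.log(ages));
```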

Ultimately, I can now discuss the website traffic reports with more confidence. Finally, I can forget about the spam bot traffic and focus my marketing efforts on the channels that bring people to our website.