wrong-way

Why regular expressions can't spot fake user-agents

Engineering

4/1/2015 9:26 PM

Device Detection Development

The dangers of regular expression based device detection

What is regular expression device detection?

A regular expression (Regex) is a very small computer programme that can locate sequences of characters in text. Consider the following User-Agent from a Nexus 4 device running the Android operating system.

Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 4 Build/JOP40D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Mobile Safari/535.19

The version of Android, in this case "4.2.1", can be extracted with the regex "(?<=Android\s)\d+\.\d+\.\d+".

A check could be performed to determine if the device is a Nexus 4 using the regex "\sNexus 4\s".
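As a minimal sketch, the two checks above could be written in Python as follows. The version pattern is assumed to use a lookbehind for "Android " so that only the version number is returned, and the function names are hypothetical, chosen for illustration.

```python
import re

UA = ("Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 4 Build/JOP40D) "
      "AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Mobile Safari/535.19")

# Assumed pattern: three dot-separated numbers that directly follow "Android ".
ANDROID_VERSION = re.compile(r"(?<=Android\s)\d+\.\d+\.\d+")
# "Nexus 4" surrounded by whitespace on both sides.
NEXUS_4 = re.compile(r"\sNexus 4\s")


def android_version(user_agent):
    """Return the Android version found in the user agent, or None."""
    match = ANDROID_VERSION.search(user_agent)
    return match.group(0) if match else None


def is_nexus_4(user_agent):
    """Return True if the user agent contains the Nexus 4 model name."""
    return NEXUS_4.search(user_agent) is not None


print(android_version(UA))  # 4.2.1
print(is_nexus_4(UA))       # True
```

Both patterns depend on the exact spelling and spacing of the string, so a user agent that formats the same information differently would not be matched.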

More precise regexes can also anchor a match to a position in the string. For example, the regex "^Mozilla/" matches only user agents that begin with "Mozilla/", while in the regex "Safari$" the $ symbol marks the end of the string, so it matches only user agents that end with the word "Safari".
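The anchored patterns can be tried in Python in the same way. This is only an illustration; the helper names and the second, shortened user agent are hypothetical.

```python
import re

NEXUS_4_UA = ("Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 4 Build/JOP40D) "
              "AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Mobile Safari/535.19")

# Hypothetical user agent that really does end with the word "Safari".
ENDS_WITH_SAFARI_UA = "Mozilla/5.0 (Example) Safari"


def starts_with_mozilla(user_agent):
    """True when the string begins with "Mozilla/"."""
    return re.search(r"^Mozilla/", user_agent) is not None


def ends_with_safari(user_agent):
    """True when the string ends with the word "Safari"."""
    return re.search(r"Safari$", user_agent) is not None


print(starts_with_mozilla(NEXUS_4_UA))        # True
print(ends_with_safari(NEXUS_4_UA))           # False
print(ends_with_safari(ENDS_WITH_SAFARI_UA))  # True
```

The Nexus 4 user agent ends with "Safari/535.19" rather than "Safari", so the "Safari$" pattern does not match it.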

Such regular expressions have become a popular way to detect the model, operating system and browser associated with a web request. However, the volume and complexity of web-enabled devices have grown at such a rate over the past two years that regular expressions can no longer be relied on to provide accurate results.

How does 51Degrees solve the problem?

51Degrees device detection uses the relevant characters and their positions in the user agent to find the closest-matching combination of substrings.

Let's look at the Nexus 4 user agent as an example:

Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 4 Build/JOP40D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Mobile Safari/535.19

On our User Agent Tester page you can check how a particular user agent (UA) is detected.

When part of the UA differs but that part is not considered significant, the detection result is not affected and the correct device is still identified.

If we change "KHTML, like" to "Opera Mobi" in the Nexus 4 example above, detection returns the same combination of substrings, because the detection process does not treat "KHTML, like" as relevant characters in this position.

user-agent-nexus-4

Four different detection methods

51Degrees uses four different methods: Exact, Nearest, Numeric and Closest. They are tried in that order to find the best match for the user agent. Each result comes with a difference indicator that gives more insight into how closely the character positions matched.

Exact

The Exact method is the most commonly used. It applies when the relevant characters match the tested user agent string precisely. The difference indicator is always 0 for this method. See the example below.

method-exact

Nearest

When an exact match cannot be found and the positions of the relevant characters have changed, the Nearest method is used. This means that the important substrings of the user agent are the same but their positions differ.

Consider the same Nexus 4 example, where the user agent contains additional irrelevant characters (an extra X) that shift the positions of the relevant substrings. The result still indicates a Nexus 4 device, but the method and difference values show that it varies from the expected user agent.
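The difference between an exact and a nearest match can be illustrated with a toy model. The sketch below is not the 51Degrees algorithm. It assumes a signature is a list of relevant substrings with the offsets where they are expected, and that the difference is the sum of the offset shifts. The choice of relevant substrings and the name `classify` are hypothetical.

```python
REFERENCE = ("Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 4 Build/JOP40D) "
             "AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Mobile Safari/535.19")

# Hypothetical choice of relevant substrings for this device.
RELEVANT = ["Android 4.2.1", "Nexus 4 Build", "Chrome/18"]

# Toy signature: (expected offset, substring) pairs taken from the reference UA.
SIGNATURE = [(REFERENCE.index(sub), sub) for sub in RELEVANT]


def classify(user_agent):
    """Return (method, difference) for the toy signature above."""
    difference = 0
    for expected_offset, sub in SIGNATURE:
        found = user_agent.find(sub)
        if found == -1:
            # A relevant substring is missing; other methods would be needed.
            return ("Other", None)
        # Assumed difference measure: how far the substring has moved.
        difference += abs(found - expected_offset)
    return ("Exact", 0) if difference == 0 else ("Nearest", difference)


# An extra irrelevant "X" shifts every relevant substring by one character.
shifted = REFERENCE.replace("Linux", "LinuxX")

print(classify(REFERENCE))  # ('Exact', 0)
print(classify(shifted))    # ('Nearest', 3)
```

In this toy model the device is still recognised after the shift, and the non-zero difference records that the user agent is not quite the expected one.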

Numeric

The Numeric method is used where the only difference is in the numeric version in the user agent string.

Look at the example below.

method-numeric

All the relevant substrings indicate a Nexus 7 device that accessed the website using Chrome version 51. There is no substring for Chrome 51, so the most relevant substring, the one for Chrome 32, was found, and the numeric difference is 19.

A regular expression, however, cannot measure the numeric difference between relevant substrings, which can lead to accuracy issues.
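The arithmetic behind the numeric difference can be sketched as follows. This is an illustration only, not the 51Degrees implementation. The list of known Chrome versions, the shortened user agent and the function name are all hypothetical.

```python
import re

# Hypothetical set of Chrome major versions present in the data set.
KNOWN_CHROME_VERSIONS = [18, 32]

# Hypothetical, shortened user agent for a Nexus 7 running Chrome 51.
NEXUS_7_UA = "Mozilla/5.0 (Linux; Android; Nexus 7) Chrome/51.0 Safari"


def numeric_match(user_agent, known_versions=KNOWN_CHROME_VERSIONS):
    """Return (nearest known version, numeric difference), or None."""
    match = re.search(r"Chrome/(\d+)", user_agent)
    if match is None:
        return None
    seen = int(match.group(1))
    # Pick the known version with the smallest numeric distance.
    nearest = min(known_versions, key=lambda version: abs(version - seen))
    return nearest, abs(nearest - seen)


print(numeric_match(NEXUS_7_UA))  # (32, 19)
```

A plain regex such as "Chrome/32" would simply fail on this user agent. Treating the version as a number is what allows the nearest known version and the size of the gap to be reported.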

Closest

When none of the above methods can be used, the Closest method finds the most similar match of relevant substrings. It always returns a result, but the difference value will be very high and the outcome will not be reliable.

None

If the user agent consists of random characters that do not match any of the relevant substrings, the method is reported as None, indicating that 51Degrees does not have enough information to classify the user agent as any device.

In this case, where none of the methods is used, the user agent is detected as a default Desktop device.

"Wrong Way" photo courtesy of David Goehring