• About Us
  • Blog
  • Basket
  • Account
  • Sign In
  •  

How Trie Device Detection Works

What is a Trie?

First designed in the 1960s tries are ordered data structures where the keys are usually strings of characters. They are perfectly suited to the purposes of device detection where the user agent HTTP header can be used as the string key.

Consider the models of mobile devices shown in the following table. They all share the same initial characters followed by different digit and letter combinations to represent the precise model.

Trie ExampleModel
GT-I9000GT-I9000
100GGT-I9100G
P1000GT-P1000
TGT-P1000T
MGT-P1000M
S5670GT-S5670
830GT-S5830

Each character can have none, one or many child characters. The first character of the target string can be compared to the first character in the trie. If they match then the next character can be compared to the possible child characters of the first. The process continues until a precise match has been found or no more characters are available. The value associated with the key could be any data type and referenced by additional data stored with the right most character.

Trie Device Detection

A device is a collection of unique properties and values. There are 10,000s of different unique devices. On average each device combination has approximately 500 different user agents. This happens because user agents contain characters that are not relevant to the device properties. For example information about the current language the device is configured for is not something included as part of a device.

By assigning each device a unique integer key, the device key can be used as the value in the trie data structure. The following table has been amended so that the unique ID of the model is shown in the right hand column and as a bold digit under each character.

Trie ExampleModelID
G
1
T
1
-
1
I
1
9
1
0
1
0
1
0
1
GT-I90001
1
2
0
2
0
2
G
2
GT-I9100G2
P
3
1
3
0
3
0
3
0
3
GT-P10003
T
4
GT-P1000T4
M
5
GT-P1000M5
S
6
5
6
6
6
7
6
0
6
GT-S56706
8
7
3
7
0
7
GT-S58307

In addition to assigning the ID of the device to the leaf character in the trie, the "best" ID can be assigned to each character in the trie. The "best" device is the one that is statistically most likely to the correct one for a given character in the trie. The accuracy of the ID will diminish the further to the left of the leaf character it is.

When compared against the above table the target string GT-P1000 will relate to a leaf character and ID 3. A precise match.

If the target string GT-P1010 were compared against the table the second 1 character will not be found. Therefore the ID of the device found would be that associated with the first 0, the last character that matched exactly. In this case also 3.

Finally if the target string were GT-X9000 the initial hypen would be the last character that matched and ID 1 would be returned.

Big Data Sets

Tries with relatively short keys and very large data volumes are extremely quick compared to any other indexing method. They're faster than hashing tables as they avoid the need to calculate a hash code. User agents are perfectly suited as keys because they're short (less than 500 characters).

However they do require all possible characters, including those that are not relevant to be present. This drawback could be worked around with modifications that allow for wild card characters or jump ahead operations. However these modification immediately reduce the performance and are less effective than other solutions.

51Degrees has therefore opted to use basic tries and populate them with 10s of millions of user agents. As a result the data structures are large, but very fast and accurate. The data sets do need to be updated relatively often to ensure detection accuracy is maintained because they are more sensitive to irrelevant characters.

Summary

Trie's are very fast and conceptually simple to understand. However they do require gigabytes of main memory and as such are not well suited to running on the same web server as a production web site. We therefore recommend they're either deployed on dedicated servers, or incorporated into the transform layer of data warehousing tools. The Pattern detection method is better suited to general detection needs.