How 51Degrees evaluated GPU micro benchmarking and why it's not viable today
As part of a series of experiments conducted on Apple devices, 51Degrees considered using the performance of the GPU to differentiate iPhone and iPad devices for the purposes of device detection. Whilst the approach has some merit, it was not as simple as CPU benchmarking. Due to limitations within iOS concerning background graphics operations, GPU benchmarking would introduce an unacceptable delay to web page rendering.
This blog describes the techniques considered and some of the results observed. Whilst the technique is not used today, it's possible an adaptation could be useful in the future.
For those looking to find out more about the production solution based on CPU benchmarking and image hashing read more here.
An obvious drawback with GPU micro-benchmarking is the simple fact that on any regular GPU test, benchmarks are run for minutes at a time to get an accurate value for performance. This is down to the nature of GPU architecture, which is designed to be run for a sustained period, processing massive amounts of data in parallel with a large variety of inputs. Conversely, a benchmark to identify a device ideally needs to complete in a few seconds and execute in the background of a web page without adversely affecting the user experience.
For the purposes of device detection, we only require that the benchmark can discriminate between GPU configurations. It is conceivable that such a test could be performed much faster than traditional GPU benchmarks, where the objective is an absolute measure of performance.
To improve the stability of measurement, a loop is introduced that runs several frame drawing operations consecutively and averages the result. Because of the previously noted unreliability of the first result in any frame execution, this value is always excluded from the average calculation. The stability of the overall result is directly proportional to the number of iterations performed in a single test; however, a cap is required on the total number of iterations in order to keep overall test times acceptable. A value of 9 iterations per test gives a good balance between test time and stability.
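A minimal sketch of that loop, assuming a hypothetical `drawFrame` callback that renders one frame and blocks until the GPU has finished (for example via `gl.readPixels`):

```javascript
// Sketch of the frame-timing loop described above. `drawFrame` is a
// hypothetical callback, not part of any real API: it renders one frame
// and blocks until the GPU has completed the work.
function averageFrameTime(drawFrame, iterations = 9) {
  const times = [];
  for (let i = 0; i < iterations; i++) {
    const start = performance.now();
    drawFrame();
    times.push(performance.now() - start);
  }
  // The first frame is unreliable (shader compilation, pipeline warm-up),
  // so it is always excluded from the average.
  const usable = times.slice(1);
  return usable.reduce((sum, t) => sum + t, 0) / usable.length;
}
```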
The issue with measurements being either condensed into sub-millisecond times or stretched over several seconds is harder to handle. Instead, draw calls per millisecond (drw/ms) can be used in place of the absolute millisecond time. Up to this point, only a fixed number of draw calls per frame has been tested. Varying the number of draw calls provides a simple method of varying the complexity of the task performed for each frame. Starting with a small number of draw calls, and doubling it until an overall frame time of 16ms or greater is measured, provides a neat solution. That final value is then used as the basis for the drw/ms calculation. However, this approach throws up two additional problems:
- Setup Time - Setting up a series of draw call commands for rendering in WebGL is not without its own cost in CPU resources, and as devices become more and more powerful, this setup starts taking more time than the GPU test itself - e.g. millions of draw calls per frame can take several seconds per test to queue up on the WebGL context.
- Variability - There appears to be some variability with the drw/ms result based on the amount of overall work being done. This will most likely be down to the GPU's power management algorithm - throwing more power at more demanding frames and thus producing a higher drw/ms value. Ideally running a test that stresses the GPU by the same amount each time is needed.
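Setting those two problems aside for a moment, the basic ramp-up and drw/ms calculation can be sketched as follows, assuming a hypothetical `timeFrame(drawCalls)` helper that renders one frame with the given number of draw calls and returns the time taken in milliseconds:

```javascript
// Sketch of the ramp-up described above. `timeFrame` is a hypothetical
// helper that renders a frame with the given draw-call count and returns
// the measured frame time in milliseconds.
const TARGET_FRAME_MS = 16;

function measureDrawCallsPerMs(timeFrame) {
  let drawCalls = 1;
  let frameMs = timeFrame(drawCalls);
  // Double the workload until a single frame takes at least 16ms.
  while (frameMs < TARGET_FRAME_MS) {
    drawCalls *= 2;
    frameMs = timeFrame(drawCalls);
  }
  // Express the result as draw calls per millisecond rather than an
  // absolute frame time, so fast and slow GPUs land on a comparable scale.
  return drawCalls / frameMs;
}
```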
Solving the first challenge requires not only the number of draw calls to be varied, but also the complexity of the individual draw calls being rendered. This is achieved by creating a larger vertex buffer initially, and then gradually rendering more and more vertices with each additional draw call, compressing the total number of draw calls required. Increasing the draw call complexity exponentially (thus giving a logarithmically scaled value for the number of draw calls) is required to avoid crushing the resolution for lower-powered GPUs.
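As a sketch of the scaling (the growth factor here is illustrative, not a value from the tests):

```javascript
// Sketch: vertices rendered by draw call i grow exponentially, so the
// draw-call count becomes a logarithmic measure of total complexity.
// GROWTH is a hypothetical tuning constant for illustration only.
const GROWTH = 1.5;

function verticesForDrawCall(i) {
  // 6 vertices = one quad (two triangles); each further draw call
  // renders exponentially more vertices from the shared buffer.
  return Math.ceil(6 * Math.pow(GROWTH, i));
}
```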
To avoid having to create an overly large vertex buffer while still increasing the demand on the GPU, the complexity of the fragment shader is increased for each quad rendered. This could be almost any calculation. A 2D noise function with several hundred operations per pixel worked well, pushing the GPU hard enough that only a few thousand quads were needed in the vertex buffer while still ensuring headroom was always available.
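The shader source below is a hedged sketch of this idea, not the exact function used; it iterates a simple 2D hash so each fragment performs a few hundred arithmetic operations:

```javascript
// Hypothetical fragment shader source (passed to WebGL's shaderSource).
// The exact noise function used in the tests is not published; this is a
// sketch of the general shape: an iterated hash keeping each pixel busy.
const fragmentShaderSource = `
  precision highp float;
  uniform vec2 u_resolution;

  float hash(vec2 p) {
    return fract(sin(dot(p, vec2(127.1, 311.7))) * 43758.5453);
  }

  void main() {
    vec2 uv = gl_FragCoord.xy / u_resolution;
    float n = 0.0;
    // ~100 iterations of a few operations each per fragment.
    for (int i = 0; i < 100; i++) {
      n += hash(uv + float(i));
    }
    gl_FragColor = vec4(vec3(fract(n)), 1.0);
  }
`;
```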
Solving the second challenge requires some additional finessing of the initial ramping up of draw calls to hit a 16ms or higher render time. Doubling the draw call value can in theory give a final frame render time anywhere between 16ms and 32ms, and in practice can produce even larger gaps thanks again to whatever power management algorithm is in use. So rather than stop there, the final draw call value is used as a starting point for further iteration, logarithmically regressing the number of draw calls (remember the draw call value now represents an exponential value of complexity) to try to hit a 16ms frame time. This approach is limited by several assumptions:
- The act of performing readPixels doesn't carry its own fixed time cost.
- Any power management algorithm at work is scaling linearly and instantaneously with demand.
- Nothing else is using GPU cycles while the test is being performed.
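Under those assumptions, the refinement step can be sketched as a shrinking-step search around the draw-call value that first exceeded 16ms (again using a hypothetical `timeFrame` helper):

```javascript
// Sketch of the refinement step. `timeFrame(n)` is a hypothetical helper
// returning the frame time in ms for draw-call value n (which, as noted
// above, represents an exponentially scaled complexity).
function refineToTarget(timeFrame, startDrawCalls, targetMs = 16, steps = 5) {
  let n = startDrawCalls;
  let step = startDrawCalls / 2;
  for (let i = 0; i < steps; i++) {
    const ms = timeFrame(n);
    // Step toward the 16ms sweet spot, halving the step size each time.
    n += ms > targetMs ? -step : step;
    step /= 2;
  }
  return { drawCalls: n, frameMs: timeFrame(n) };
}
```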
The results showed some separation of devices, but the spread still overlaps with GPUs closely matching in performance.
The main issue is the time taken. With 9 frame iterations per test and iterative attempts to find the 16ms-per-frame sweet spot for calculating drw/ms, the total test time was between 200 and 2,000 ms. Further separation of devices could be achieved by running more iterations per test, but not without increasing total test time. This could be improved by using platform information to provide a better first guess for the number of draw calls, reducing the subsequent number of iterations - perhaps starting with a higher value for devices with higher screen resolutions, as these tend to be coupled with higher-performing GPUs.
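One possible shape for that first guess (the thresholds and values here are illustrative, not measured):

```javascript
// Sketch of a platform-informed first guess: devices with more physical
// pixels tend to ship faster GPUs, so the ramp-up can start higher.
// All constants are illustrative assumptions, not values from the tests.
function initialDrawCalls(screen) {
  const pixels = screen.width * screen.height * (screen.pixelRatio || 1) ** 2;
  if (pixels > 4_000_000) return 64; // very high resolution, likely fast GPU
  if (pixels > 2_000_000) return 16;
  return 4;                          // low resolution, start small
}
```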
Finally, in some cases the values returned for milliseconds per frame appear almost nonsensical. While this is produced relatively rarely, there are a handful of test cases which produce either 0 for the time measurements, or a number very close to zero, regardless of what operation the GPU is performing. The cause may be down to further anti-fingerprinting measures from Apple, some rare GPU bug that fails to establish a proper WebGL context or loses context almost instantly, some security setting in iOS Safari or something else unknown.
Any meaningful benchmarking operation will require 1 to 2 seconds to execute. As such the benchmark must be run in a web worker, or other non-blocking manner, to avoid degrading the web page render time and user experience.
GPU benchmarking requires access to an HTML canvas element and ideally would be conducted in a web worker. Google discuss the merits of combining OffscreenCanvas and Web Workers here.
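A simple feature check along these lines decides whether the benchmark can be moved off the main thread (`globals` is passed in, normally `window` or `self`, so the check can be exercised outside a browser):

```javascript
// Sketch: only attempt the worker-based benchmark when both Worker and
// OffscreenCanvas exist on the global object. Both are standard Web APIs,
// but OffscreenCanvas support is the limiting factor discussed here.
function canBenchmarkOffMainThread(globals) {
  return typeof globals.Worker === 'function' &&
         typeof globals.OffscreenCanvas === 'function';
}
```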
However, Apple iOS, along with many other vendors, does not yet support the feature. Given this, and the additional complexity and challenges compared to CPU benchmarking, the approach has been parked for re-evaluation should Apple iOS adopt the necessary features.
Read more about the techniques used in 51Degrees device detection here.
- 51Degrees Open Sources GPU Renderer Technique to Identify Apple Devices Using iOS 12.2 or Higher
- Multi Stage Approach to Apple iOS Device Detection
- Granular Apple iOS and iPadOS Device Detection
The full Apple identification solution is part of the 51Degrees device detection suite of services. Save the hassle of rolling your own solution: deploy 51Degrees today and get access to over 55,000 different device models with associated properties.