Percentiles are particularly useful for monitoring computer applications. Measurements are collected over a given period of time, and they refer to one or more transactions (an application transaction being a request, a call, or a function related to the application). Percentiles are used to summarize the response times of a transaction: the percentile calculation applies to the set of measured response times.
Average and median are not enough
Not suited to the effect of dependencies
In a computer program, performing a single specific function usually triggers many technical requests to the application. In addition, a technical request is itself often dependent on other requests. We therefore generally observe a phenomenon of cascading and propagation of slowness: a single slow request can impact several functions of the application, and one slow request can slow down many others. These two combined effects exacerbate the impact of a few slow requests on the entire application system, so a few slow requests can severely degrade the user experience, as the sketch below illustrates.
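As a rough illustration, suppose a user-facing function fans out into N technical requests and each request independently has a small probability p of being slow. The chance that the function as a whole is affected grows quickly with N. This is a minimal sketch with invented values for N and p:

```python
# Probability that a user-facing function is hit by at least one slow
# request, assuming it fans out into n independent requests, each of
# which is slow with probability p. The numbers below are invented.

def p_function_affected(n: int, p: float) -> float:
    """P(at least one of n independent requests is slow)."""
    return 1.0 - (1.0 - p) ** n

# Only 1% of requests are slow, yet a function issuing 50 of them
# is affected almost 40% of the time:
print(f"{p_function_affected(50, 0.01):.1%}")  # -> 39.5%
```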
When we look at queries individually, we generally do not get a good representation of the impact on the functions of the application. The metrics attached to a particular query are always more optimistic and more partial than a metric that takes as its reference the application functions needed by the end user. With a large set of calls, a few abnormally long queries will only slightly shift the mean (or median) response time of that query: the data is "smoothed" into the mass of more nominal queries. On the other hand, those same few queries can greatly degrade the availability and speed of the application function that depends on the query.
Thus, the mean and the median make it hard to observe the anomalies that have a real impact on the users of the application.
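To make this concrete, here is a minimal sketch (with invented response times) comparing the mean, the median, and a high percentile on a sample where a handful of calls are very slow:

```python
import math
import statistics

# 995 nominal calls around 100 ms plus 5 pathological calls at 2 s.
# All values are invented for illustration.
times_ms = [100] * 995 + [2000] * 5

print(statistics.mean(times_ms))    # 109.5 -> only mildly shifted
print(statistics.median(times_ms))  # 100   -> blind to the slow calls

ordered = sorted(times_ms)
rank = math.ceil(0.999 * len(ordered))  # nearest-rank 99.9th percentile
print(ordered[rank - 1])                # 2000 -> reveals the slow tail
```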
Sensitivity to singularities
Studies show that response times for transactions/queries in a computer system are generally not consistent over time. Sporadic events often result in short and infrequent spikes in response times or system downtime. These disruptions result in very slow response times compared to the vast majority of those measured.
The causes of these peaks must of course be analyzed and treated in their own context. However, these spikes should be ignored when analyzing the performance of particular queries, because they are caused by factors that are independent of the implementation of those queries. The probability of occurrence of these peaks is very low, and they are the consequence of phenomena related neither to the load of the application nor to a nominal operating context. The causes can be:
An independent external process allocating resources shared with the application (CPU, memory, network, ...), such as an automatic system update process.
A pause induced by a rare but impactful "Garbage Collector" run.
The spontaneous, manual execution of a resource-consuming batch job.
A pause related to execution in "debug mode".
A disk swap.
etc.
These peaks have an impact on the measurements, and the average is unsuitable because it is sensitive to these disruptive values, especially when they are high: the greater the disturbance, the greater the impact on the average.
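A minimal sketch (again with invented values) of how a single spike drags the mean in proportion to its magnitude, while the median does not move:

```python
import statistics

# 99 nominal calls at 100 ms plus one spike of varying magnitude.
for spike_ms in (1_000, 10_000, 100_000):
    sample = [100] * 99 + [spike_ms]
    print(spike_ms, statistics.mean(sample), statistics.median(sample))
    # mean:   109.0, 199.0, 1099.0 -> grows with the disturbance
    # median: 100 in every case
```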
Source: http://www.jybaudot.fr/
Definition of a percentile
A percentile is a particular value within a sorted set of numbers. A percentile is always associated with a percentage, which defines the proportion of values that are less than (or equal to) it. For example, the 99th percentile of a set of numbers is the largest value of the subset that contains 99% of the measurements and from which the highest values of the initial set are absent. In other words, 99% of the measurements are less than or equal to the 99th percentile, and the remaining 1%, the most extreme values, lie above it.
The percentile directly indicates the significant values that really impact their environment (up to its percentage parameter). Moreover, the percentile is only weakly sensitive to singularities, because those are represented by the highest values, which are ignored. The percentile is therefore well suited to analyzing query response times in an application, for example.
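As a minimal sketch of this definition, here is a nearest-rank percentile computation (one common convention; real tools such as NumPy or an APM may use interpolation variants that return slightly different values):

```python
import math

def percentile(values, percent):
    """Nearest-rank percentile: the smallest value such that at least
    `percent` % of the measurements are less than or equal to it."""
    if not values or not 0 < percent <= 100:
        raise ValueError("need a non-empty sample and 0 < percent <= 100")
    ordered = sorted(values)
    rank = math.ceil(percent / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]
```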
An example
Let E be a sorted set of 10 numbers: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. Then (these values are checked in the sketch after this list):
The 90th percentile of E is 9
The 80th percentile of E is 8
The 70th percentile of E is 7
etc.
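These values can be checked with the nearest-rank `percentile` sketch defined above. Note that NumPy's default linear interpolation would report 9.1 rather than 9 for the 90th percentile of this set, which is why the convention in use always matters:

```python
E = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

for pct in (90, 80, 70):
    print(pct, percentile(E, pct))  # 90 -> 9, 80 -> 8, 70 -> 7

# For comparison (recent NumPy versions):
# import numpy as np
# np.percentile(E, 90)                  # 9.1 (linear interpolation)
# np.percentile(E, 90, method="lower")  # 9.0 (matches the list above)
```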
Special cases
The 50th percentile is the median of a set of values, i.e. the value in the center.
The 100th percentile is the maximum of a set of values, i.e. the highest value.
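A quick check of both special cases, reusing the hypothetical `percentile` helper from above together with Python's standard library:

```python
import statistics

values = [3, 1, 4, 1, 5, 9, 2, 6]

assert percentile(values, 100) == max(values)  # 100th percentile = maximum
print(percentile(values, 50), statistics.median(values))
# 3 vs 3.5: with an even number of values, the nearest-rank 50th
# percentile and the interpolated median can differ slightly.
```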
The nines, or the number of nines
Only specific percentiles are generally used for application performance analysis: those whose associated percentage is written only with nines. In this case, the measure can be abbreviated as "n nines", where n is the number of nines in the percentage. Of course, beyond two nines, the additional nines appear in the decimal fraction of the percentage: two nines is the 99th percentile, three nines the 99.9th, four nines the 99.99th, and so on.
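The correspondence between a number of nines and its percentage is easy to express; a minimal sketch, following the usual convention that one nine is 90%:

```python
def nines_to_percent(n: int) -> float:
    """n nines -> the corresponding percentage: 90, 99, 99.9, 99.99, ...
    (rounding guards against floating-point noise)."""
    return round(100 * (1 - 10 ** -n), n)

for n in range(1, 6):
    print(n, nines_to_percent(n))  # 90.0, 99.0, 99.9, 99.99, 99.999
```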
There is no formula (to my knowledge) to determine the appropriate number of nines for a given measure. Instead, the threshold of nines beyond which performance degrades significantly is determined empirically. Tools (an APM, for example) let you observe your results with different percentiles; it is then interesting to focus on the percentile that shows a significant degradation compared to the previous one (with one nine fewer).
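One possible way to sketch this empirical scan, reusing the hypothetical `percentile` and `nines_to_percent` helpers above on a simulated sample:

```python
import random

random.seed(1)
# Simulated response times: mostly ~100 ms, plus a rare slow tail.
sample = [random.gauss(100, 10) for _ in range(10_000)]
sample += [random.uniform(2_000, 5_000) for _ in range(10)]

for n in range(1, 5):
    pct = nines_to_percent(n)
    print(f"{n} nine(s), p{pct}: {percentile(sample, pct):.0f} ms")
# A large jump between two consecutive lines suggests the number of
# nines at which degradation becomes visible (here, at four nines).
```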
Applications with significant performance or availability problems will analyze the results at the first or second nine, while the most robust and stable applications will refer to five nines to measure their quality of service.
The importance of the percentile with a large number of nines is also detailed on the blog "Latency Tip Of The Day": http://latencytipoftheday.blogspot.fr/2014/06/latencytipoftheday-most-page-loads.html
Conclusion
The percentile makes it possible to tackle response-time problems in an application system because it meets the following needs:
Ignore the largest values, which usually come from singularities unrelated to the observed transaction (or application).
Highlight degraded values, in order to deal with the problem cases that have the greatest cascading effect on the rest of the application system.
Adjust both of the above by varying the proportion of data to be filtered, via the percentage (or its number of nines).