Statistical Analysis of Performance Data Part 1 - Averages

by Nicholas Dille on 06/27/2011 | 7 Comments | 2,069 Views

In the first article about the statistical analysis of performance data, I will be dealing with averages. I will explain why it is a bad idea to work with averages of averages (like when storing intermediate results), why relying on averages can be dangerous and what good averages look like.

For a full list of articles published in this series, please refer to the primer.

Intermediate Results

Performance analysis usually generates huge amounts of data. This often forces us to aggregate data by storing averages over a certain time interval instead of the individual values. Although this seems to be a very convenient way of saving space and time, you need to be extremely careful when processing those aggregated values because – in general – you are not allowed to calculate the average of average values.

Only in a very special situation are you allowed to proceed in this manner: When you are working with averages that are calculated from the exact same number of values, you receive a mathematically correct result. Have a look at the proof at the end of the article.

Fortunately, there is an easy remedy to this problem. By retaining the sum and the number of the values you are on the mathematically save side and you are able to aggregate your data to save space and time. This is why Citrix EdgeSight stores at least two values per metric (like processor_time_sum and processor_time_cnt in vw_system_perf).

Downside of Using Averages

It is often considered sufficient to judge a series of values by the average value. Unfortunately, the average is susceptible to outliers. Consider the following series of values: 10, 10, 10, 10, 10, 10, 10, 10, 10, 0. The average over all values is 9 but the set contains an outlier (value 0). By excluding the outlier the average rises to 10. Apparently, the initial average has been off by 10%.

It is especially problematic to use averages when the data set includes sequences of very low or even no values. This is often the case when considering server load over time. If a server is only used during business hours (e.g. between 7am and 7pm) only half the values can be considered to be useful values. Twelve hours of low or even no load will have an especially dramatic effect on the average load. This may cause peak usage to be disguised by seemingly normal load indicators. Ultimately, users will be complaining about slow response times.

The following chart displays hourly load values for a single server over a single day. Apparently, the average load per day will be wrong by a large percentage due to the hours with low load:

Therefore it is necessary to exclude nightly hours by filtering for business hours. When considering average values for a whole server farm, some servers may be configured to not take part in load balancing and therefore not accept user sessions. This causes average values to be much lower.

The effect of wrongly including zero values can easily be analyzed in a spreadsheet. Most spreadsheet applications provide a function for calculating the average value from a column of values, like AVERAGE in Microsoft Excel. This may not lead to the desired outcome. Instead a slightly more complex formula removes apparent outliers: SUMIF/COUNTIF or even easier AVERAGEIF. See this Excel file for a demonstration of these functions.

Although your spreadsheet application offers similar functions, you need to be very careful about considering values to be outliers. From a mathematical point of view you either need to have very good arguments for excluding values or need to use a well-known method to ensure a statistically correct result. I will expand on this topic in a later article of this series.

Recognizing "Good" Averages

The easiest way to recognize good average values is to be skeptical. Always take a close look at the minimum and maximum values to get a first impression about how far the values are spread.

Some months ago, I analyzed a series of latency values. The average value was far too high and kept me wondering how users were able to work at all. A closer look at the data set revealed incorrect values that must have been caused by an error in the product because it reported negative latency values and several impossibly high values in the billions. I decided to exclude those values as they were obviously caused by a software error.

There are several properties of a series of values that can hint towards a trustful average. The standard deviation is a common value to judge how far the values are spread around the average. A small standard deviation (when compared to the average) shows the values to be clustered around the average. But your data set may contain very high as well as very low values. In which case, the standard deviation correctly reports a very widely spread data set. I will present an approach to analyzing such a data set in a later article of this series.

Many statistical properties of a data set assume that you are analyzing normally distributed values ´(see image below) which many data sets indeed are. Usually the peak value is set off to the left or right and the peak may be higher or lower. Just because your values do not look like they are distributed similarly to such a curve usually only means that you do not have a sufficiently high number of values.

(Source: Wikipedia about Normal distribution, image http://en.wikipedia.org/wiki/File:De_moivre-laplace.gif)

For such a normally distributed data set, the average value is equal to the mean. To determine the mean value, you need to sort your data set in ascending order and select the value that has an equal number of values on the left and on the right side. This property of a normal distribution leads to the insight that an average is more trustful when it is close to the mean of the data set.

Working with Few Values

Another downside of statistical analysis is the case of looking at a small data set. When your data set consists of less than 30 values, many statistical properties may not provide a secured insight.

For example, the standard deviation needs to be corrected by a small factor based on the number of values. Microsoft Excel offers two functions for calculating the standard deviation, STDEV.S and STDEV.P. The latter assumes that your data set has a sufficient number of values whereas STDEV.P is calculated from STDEV.S by multiplying by SQRT(N/(N-1)) with N being the number of values. See the end of this article for the mathematical connection between these two functions.

Fortunately, in performance analysis you are usually confronted with too many values. Although this may cause other problems (like the maximum number of rows in Excel 2003), from a statistical point of view this is a very comfortable situation!

Mathematical Proof: Why is the Average of an Average a Dangerous Thing?

This section contains the mathematical proof why calculating the average of averages is only valid in a very special case. I will reduce the problem to the average of two averages.

Let’s consider the following two sets of values. The first set consist of values a1, ..., an and the second set consist of values b1, ..., bm. The left side of the first line represents the average of the two averages and the right side shows the average over all the individual values.

To make the equation more readable, I have substituted the sums with upper letters, so that A represents the sum of all ai and B represents the sum of all bj.

The remaining lines are basic algebraic operations to simplify the equation.

Apparently, the initial equation is only valid if n and m are the same value. This translates to the result that the average of averages may only be used when the individual data sets have exactly the same size.

Mathematical Connection between both Functions for Calculating the Standard Deviation

Excel offers two functions for calculating the standard deviation. STDEV.P assumes that your data represents the entire population – meaning that you have all the data there is. Therefore, this function is only suited for large data sets with 30 or more values. During performance analysis, this is the only function to be used because we usually have more than enough values.

In case, you have less than 30 values, you must either use STDEV.S or apply a correction to STDEV.P as displayed below.

Apparently, the right-hand side can be simplified to be equal to the left-hand side.

AnhangGröße
statistical_analysis_of_performance_data_part_1_-_example.zip6.6 KB

+++ Profile Migrator 2 - Ein neuer Desktop, ein frisches Benutzerprofile und alle bewährten Einstellungen und Daten. Jetzt kostenlos und unbefristet testen!

7 responses for "Statistical Analysis of Performance Data Part 1 - Averages"

Hi Nicholas, Good post about

Hi Nicholas,

Good post about using averages in statistical analysis, I'm looking forward to read the next post.
Out of curiosity, why is 30 samples enough for a good average? Wouldn't it be good to look at the standard deviation and (for instance) the Student t-distribution (http://en.wikipedia.org/wiki/Student's_t-distribution)?

Regards,
Ingmar Verheij

Ingmar, depending on the

Ingmar,

depending on the source, you'll find 30 or even 20 to be the threshold when your samples can be modelled by the normal distribution. Below this mark you should be using the Student t-distribution which makes a difference when calculating the standard deviation.

As mentioned earlier, performance monitoring usually yields many values so that this threshold is irrelevant. Fortunately there is seldomly any need to dive deeper into statistics ;-)

I had an extensive discussion with my wife yesterday evening just about this threshold and the effect of working with small samples. In her line of business, it makes a huge difference whether you use STDEV.P or STDEV.S. Or in other words whether you need to correct the standard deviation of the normal distribution by the standard error.

Regards,
Nicholas

Nicholas, I noticed that I

Nicholas,

I noticed that I can't find anything on the internet about STAB.S and STAB.P. I'm assuming it's the german translation of STDEV.S and STDEV.P (since the connection between the functions match)?

Another thing I notices it that you switched STDEV.S and STDEV.P. STDEV.P assumes that you're calculating the standard deviation of the complete population (instead of STDEV.S). STDEV.S should add SQRT(N/(N-1)) to STDEV.P.

I absolute agree that we should prevent dive deeper into statistics as much as needed! Statistics make the analysis complex, on the other hand they help assuring the information is correct.

I'm trying to find a substantiated number (30, 50 or 100) for the number of values in a data set. Your explanation helped me a lot.

Regards,
Ingmar

Ingmar, sorry for the

Ingmar,

sorry for the confusion! STAB.* are indeed the German equivalents. I have corrected the article and the last picture.

I just read through your article about averages comparing the mean and the median. Nice!

All the best,
Nicholas

For all my German readers: In

For all my German readers:

In the English version of Excel, STDEV.P and STDEV.S are used for a population and a sample respectively. Which function to use can easily be derived from the first letter of both words.

In German, this is less intuitive: STABW.S is used for a sample and can easily be derived from the German word "Stichprobe". But the standard deviation of a population (in German: "Grundgesamtheit") is calculated by STABW.N.

Regards,
Nicholas

Thanx Nicholas, it was

Thanx Nicholas, it was necessary to explain the difference between mean and median for a future post about measuring perceived performance.

Don't wan't to nag about it, but in the "Working with Few Values" chapter you need to swap the STDEV.S and STDEV.P (also in the Excel screenshot) :-)

Regards,
Ingmar

Ingmar, sorry for being so

Ingmar,

sorry for being so slow to moderate your comments and reacting. You are so right about the chapter about working with few values. I have corrected the usage of STDEV.P and STDEV.S as well as created a new screenshow.

Thanks for the heads up!
Nicholas

Add Comment

Der Inhalt dieses Feldes wird nicht öffentlich angezeigt.
Sicherheitscode
Diese Frage hat den Zweck zu testen, ob Sie ein menschlicher Benutzer sind und um automatisierten Spam vorzubeugen.
Bild-CAPTCHA
Geben Sie die Zeichen ein, die im Bild zu sehen sind.

Microsoft Competence Blog

Application Infrastructure mit Microsoft Technologien ist ein wichtiges Fokusthema der sepago. Wir haben langjährige Projekterfahrung, sind neugierig auf neue Technologien und möchten diese bis in letzte Detail verstehen. Die Competence Blogs berichten davon.

RSS-FeedAlle Artikel des Competence Blogs abonnieren.

 

Über den Autor

Bild von nicholas
Nicholas Dille
Head of Technology and Innovation
Blogs about Centralized computing, virtualization and performance monitoring

Alle Artikel des Bloggers