Using Standard Deviations To Determine Web Analytics Benchmarks


I was asked to explain what I meant in a recent post about using standard deviations with web analytics data to formulate benchmarks.

First I’ll explain what I mean by Standard deviations. A standard deviation is the most common measure of a statistical dispersion, in other words measuring how widely spread the values in a data set are. If many data points are close to the average (or mean), then the standard deviation is small; if many data points are far from the mean, then the standard deviation is large. If all the data values are equal, then the standard deviation is zero.

The reason I use Standard deviations is to quickly spot deviations in data from normal levels. Sometimes trends are hard to spot, but with standard deviations they are easy to pick up. Take the last example I showed;

all_traffic.gif

As you can see I’ve circled the slight peak, however that is not an obvious peak to an analyst looking at data every day. Over the course of this dataset, in this graph, it looks like a fairly smooth variation – a little more at the beginning of the month than the end but nothing drastic. The reason I spotted this was because I ran the visit data through Excel and did a benchmark/standard deviation check.

The practical value in web analytics of using Standard Deviations is basically to determine when the data tells you that something is moving away from the “mean average”. It highlights things in your data that should be looked at. In the case above I drilled down over the time period to take a closer look and found that one particular site had driven more traffic from an awareness campaign that lasted approximately two weeks and then dropped off to normal levels.

The way to calcultate the standard deviation is pretty easy if you can grasp the basic maths. The longer the period of data you look at theoretically the more accurate your standard deviation should be. I am going to show the example above and how it was done which basically had a months data.

You can benchmark visitors, visits and page views (or interactions) quite easily. You can also then work out ratio benchmarks such as conversion rates by working out the number of successful “actions” with a standard deviation in the same manner (so once you have got a standard deviation for visits, then work out the standard deviation for actions and you have 2 average numbers which can then be used as your conversion ratio benchmark).

Step 1) Firstly get all the data into excel. It makes it easier. I took 30 days visit data into excel listed as days 1-30 and the visit numbers in a simple list. Like so;
Standard deviation

NB; It’s important you get the right number for every day in your list, so you have all the fluctuation in your data sample listed. If you’re not interested in the maths behind the method go straight to step 4.

Step 2) Find the mean of the numbers listed. In my excel sheet this was done by using the following function =AVERAGE(B2:B31). My mean average was 3204.2 visits per day. I do this simply to verify what the numbers are telling me look about right. The mean should be about in the middle.

Step 3) Find the standard deviation of the numbers. In excel this is all worked out for you by simply using a function =STDEVA(B2:B31). Essentially what this function does is work out the data fluctuation from the mean. By doing this I had a standard deviation of 336.55. This means that on average my daily visits are 3204, my top limit is 3204 + 336.5 and my bottom limit is 3204.2-336.5. So if my numbers fall between roughly 2868 and 3540 based on the historical data for 30 days it’s average behavior. If the figure appears over or below that period it’s showing abnormal behaviour and should be examined further.

Step 4) Use Excel to build a graph. Once you have the data it’s very fast and very simple to display your deviation. Simply do a normal line graph of your data, days along the bottom visits as your data points. Once you have your chart click the “layout tab” in excel 2007 and then the error bars. Then simply select “Error bars with standard deviation”.

standard deviation 2

What you should then see is a graph which looks a bit like this;

Final graph

As you can see the graph shows you the standard deviation shown by the black lines surrounding the the data at roughly 3540 and 2868 across the 30 day sample. Because of this we can clearly see that there is a “deviation from the norm” around the 8th till about the 15th of the month which was the period you then want to examine more closely.

So in summary standard deviations are very fast to set-up in Excel and very easily allow you to examine which trends you should be concentrating on. Here I’ve explained roughly the principles wihtout going too deep into the maths behind it. Suffice to say Excel works it all out for you. All you need to do is import your data, run a line graph and add error bars to 1 standard deviation to see roughly what your benchmarks should be.

Information and Links

Join the fray by commenting, tracking what others have to say, or linking to it from your blog.


Other Posts

Archives

Enter your email address:

Delivered by FeedBurner

Categories

Finnish blogs of note (In English)

Web Analytics Blogroll


Write a Comment

Take a moment to comment and tell us what you think. Some basic HTML is allowed for formatting.

Reader Comments

interesting
thanks

Hi Steve,

Great post, actually a really great post! and I would positively advocate for web analysts to apply it as a weapon in their toolbox. However; I hope it is ok to add a few comments.

- The first is, that I would probably make sure that my data does in fact follow a Normal Distribution. (I am quite confident most visit datasets are, so nothing wrong with your example :-)

- The second comment is, that I would probably advise to either look at smaller datasets or apply a growth (trend) attitude towards my standard deviation visualization (error bars).

In plain English; if your data does not follow a bell curve distribution it does not make much sense to apply deviation tactics to your analysis. Furthermore, if you look at too big a data set (which include growth) you will see that the beginning of the data set and the end of the data set depart from the norm.

Cheers

Dennis R. Mortensen, (COOish at IndexTools)
My Web Analytics and Online Marketing Blog

Hi Dennis,

I completely agree with you on both points, especially the second one. Many of my clients have a been growing their numbers over a year so what was a relevant standard deviation in January 2007 is not the same in 2008. As you say the end of the dataset is totally different to the beginning. We tend to either use smaller datasets (monthly or bi-monthly) or apply growth trends, depending on how accurate we need to be. However as a flag raiser the method described is a good method.

I didn’t want to complicate things too much in my post, so just tried to keep it simple.
Cheers
Steve

Thanks for the great setup!

I usually use median and then average deviation from the median for visits in order to find the typical traffic.

For flagging unusualness I just use conditional formatting (below the median minus the ave deviation or above the median plus the ave deviation) to bold and color positive or negative unusualnesses. Not as pretty as a graph, but I know the instant the data is entered into the sheet whether the number is typical or not. The math is done on a separate sheet that updates as data is entered (or I can adjust the maths sheet to select just a specific date range–dealing with growth trends).

The graph is nice for showing the degree of unusualness. I’ll be adding this to my setups shortly. Thanks!

Steve,
While I also agree that standard deviation is an underused tool for most analysts, I don’t recommend using a single standard deviation, for two reasons:

Dennis Mortensen’s point about a normal distribution is right on the money.
More important, even in a normal distribution, outliers will occur within 1 standard deviation only two-thirds of the time. Using upper control limits and lower control limits (borrowed from Six Sigma practices) to cover 2 or 3 standard deviations will account for the truly unusual more effectively. Avinahsh Kaushik had a great piece about control limits last year that’s well worth checking out.

Otherwise, keep up the great work!

Tim

Hi Tim,

Back in my quality control days we used 3 standard deviations. However the method I explained is a quick and simple thing you can run in excel. I wasnt trying to give the experts like yourself a lesson in how to do it, just raising the point that this is very easy to do and show folks with less experience how to do it. Your points like Dennis’ earlier are all valid.

Br
Steve

Hi Steve how can you determine the percentage area enclosed by the standard deviation +- 1 error.

Hi Jonathon,

The +-1 standard deviation is worked out by the statistical dispersion over the time period. Basically you’re calculating how widely spread the figures are and drawing a top and bottom limit (as shown).

[...] [...]