A:B Advert Testing – Is Statistical Significance Over-Rated?

Posted by Steve Baker on June 29th, 2007

On the face of it, that’s probably a bit of a daft question. How can you be sure that your new advert is better than the old one if you don’t wait to see whether the difference is statistically significant? And to an extent, that’s true. If you were to ignore significance completely, then the moment somebody clicked through one of your adverts, you’d decide that it was the better advert and bin the other one. It’s quite possible that you’d select the better advert only 50% of the time, so that for every improvement you make to your advert, you make another change for the worse, and you get no overall improvement at all.

But there’s a trade-off for statistical significance. Suppose that you have two adverts, one that generates a click-through rate of 5% and one that generates a click-through rate of 10%. How long should you wait before you are sure the 10% advert really is better? If you get 30 impressions per day, split evenly between the two adverts, it’ll take four days to be 85% certain (3/60 vs. 6/60 is significant at the 85% level). But if you want to be 95% certain, it’ll take eleven days (8.25/165 vs. 16.5/165 is significant at the 95% level). And to be 99% certain, it’ll take twenty days!

So, in the time that it takes to run one test at the 99% level, you can run five tests at the 85% level. Clearly, you can get far quicker improvements in your overall click-through rate, provided most of these changes are genuinely for the better.

But what about the risks? You could choose to keep adverts that are, in fact, worse than the existing ones (and you will, 15% of the time: any change to an advert will change the click-through rate; there are no ‘equally good’ adverts). But I would argue that if an advert appears better at the 85% level, then whilst it may be worse, the chances are very small that it’ll be much worse. So, if you run five tests in those twenty days, you’ll probably make one change for the (slightly) worse, and four changes for the better.
That’s still an improvement on the one change that you’d make if you were determined to wait until you were 99% certain that you were making the right choice. This is advertising, not a clinical trial!

Of course, this is a bit of an over-simplification. In reality, most of your advert tests will yield a much smaller return than doubling the click-through rate, and a lot of them will not be better than the old advert.

The first point here is quite important: the smaller the difference between the two adverts (increasingly the case once you’ve entered an ongoing process of testing), the longer it’ll take to get strong significance, and the less risk there is in taking the wrong option occasionally. For example, if you were getting 30 impressions per day and had adverts with 5% and 6% click-through rates, you’d get 85% significance after 75 days, but 95% significance is going to take 193 days, about two and a half times as long.

As for the second point, what if the new advert is performing worse than the existing one after a few days? The difference isn’t significant but, in a mirror of the argument so far, if it is in reality a better advert, is it likely to be much better? Is it worth waiting weeks to see if this advert, which is probably worse than the existing one, is actually slightly better (remember that the smaller the difference, the longer it’ll take to be sure)? Perhaps the time is better spent writing a new challenger, which may prove itself quickly.

So what level of significance should you use? Personally, I’d say that 85% is probably sufficient, but I can see an argument for 90%. I feel that running a test for two and a half to three times as long (as an 85% test) to get to 95% is excessive. Yes, you’ll get it wrong less often, but it’ll take a lot longer to generate improvements and, let’s face it, your rivals probably aren’t standing still!

There is, of course, one problem that brings the whole process to a grinding halt. What if the two adverts are producing very similar results?
It’s widely acknowledged that a small change to an advert can have a big impact but, more often than not, it has a very small impact. Everything stops until you get significant results, and the more similar the performance of the adverts, the longer it’ll take.

The solution is fairly clear: sooner or later, you’ll have to stop the test. You can either keep the existing advert, since the new advert hasn’t proven itself, or you can take whichever is the better to date, regardless of whether the difference is significant or not (this’ll be the better advert more often than not). I’d advocate the second option although, really, it doesn’t make much difference which you choose (since they are performing very similarly). It’s an interesting claim: that under certain circumstances, you should take the advert that is performing better, regardless of whether the difference is significant or not!

So what process have we arrived at?

  1. Decide before you run your new advert how long you are willing to wait for a result. This’ll depend on how long you’ve been testing (as you go on, the chances of finding a quick, big win decrease) and (obviously) on how many impressions you are getting.
  2. Set the advert live, checking regularly for significance. I’d recommend www.splittester.com, but any testing tool will do.
  3. If, after a few days (longer if you’ve got little traffic), the new advert is worse than the old one, kill it, and write a new advert.
  4. Once you’ve got 85% significance (or 90%, if you’re of a nervous disposition), keep the better advert.
  5. If the deadline set in step one is reached without a significant result, keep the better advert, regardless of how small the difference is.
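
As a rough sketch, the five steps above could be written down in Python along the following lines. The post doesn’t say which significance test it relies on, so this assumes a one-sided, pooled two-proportion z-test, which gives roughly 85% for the 3/60 vs. 6/60 example. The names `confidence`, `decide`, `threshold` and `early_days` are made up for illustration, and `early_days=4` is only a guess at what “a few days” means.

```python
from math import sqrt
from statistics import NormalDist


def confidence(clicks_a, imps_a, clicks_b, imps_b):
    """One-sided confidence that the advert with the higher observed
    click-through rate really is the better one.

    Assumption: a pooled two-proportion z-test (normal approximation).
    Both adverts must have at least one impression.
    """
    rate_a = clicks_a / imps_a
    rate_b = clicks_b / imps_b
    pooled = (clicks_a + clicks_b) / (imps_a + imps_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / imps_a + 1 / imps_b))
    if std_err == 0:
        return 0.5  # no clicks at all (or nothing but clicks): no evidence
    return NormalDist().cdf(abs(rate_a - rate_b) / std_err)


def decide(old, new, days_run, deadline_days, threshold=0.85, early_days=4):
    """Apply steps 3 to 5 to (clicks, impressions) pairs for each advert.

    Returns "keep old", "keep new", "kill new" or "keep testing".
    A dead heat at the deadline goes to the existing advert (an assumption).
    """
    old_rate = old[0] / old[1]
    new_rate = new[0] / new[1]
    better = "new" if new_rate > old_rate else "old"

    # Step 4: a significant result settles it.
    if confidence(*old, *new) >= threshold:
        return f"keep {better}"
    # Step 5: deadline reached, so take whichever is ahead.
    if days_run >= deadline_days:
        return f"keep {better}"
    # Step 3: the challenger is behind after a few days, so drop it.
    if days_run >= early_days and new_rate < old_rate:
        return "kill new"
    return "keep testing"


if __name__ == "__main__":
    # The 5% vs. 10% example after four days at 30 impressions per day.
    print(round(confidence(3, 60, 6, 60), 3))
    print(decide(old=(3, 60), new=(6, 60), days_run=4, deadline_days=30))
```

A different test (chi-squared, say, or whatever a given split-testing tool uses) will give slightly different percentages, so the threshold is best treated as approximate.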

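For step 1, it can help to estimate in advance how long a test is likely to take before picking a deadline. The sketch below is one way to do that, not the method behind the figures in the post. It assumes a one-sided, pooled two-proportion z-test, impressions split evenly between the two adverts, and observed click-through rates that match the true ones. The function name `days_needed` and its parameters are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def days_needed(impressions_per_day, ctr_a, ctr_b, confidence=0.85):
    """Estimate the days until two adverts differ at the given one-sided
    confidence level.

    Assumptions: pooled two-proportion z-test, an even split of
    impressions between the adverts, and observed click-through rates
    equal to the true rates ctr_a and ctr_b (which must differ).
    """
    pooled = (ctr_a + ctr_b) / 2
    z = NormalDist().inv_cdf(confidence)
    # Impressions each advert needs for the z statistic to reach z.
    per_advert = (z / abs(ctr_a - ctr_b)) ** 2 * 2 * pooled * (1 - pooled)
    return ceil(per_advert / (impressions_per_day / 2))


if __name__ == "__main__":
    for rates in ((0.05, 0.10), (0.05, 0.06)):
        for level in (0.85, 0.95, 0.99):
            print(rates, level, days_needed(30, *rates, confidence=level))
```

At the 85% level this reproduces the four days (5% vs. 10%) and 75 days (5% vs. 6%) quoted in the post. At 95% and 99% it lands close to the post’s figures without always matching them exactly, presumably because the post’s numbers came from a slightly different test.
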
One Response to “A:B Advert Testing – Is Statistical Significance Over-Rated?”

  1. Katie Saxon says:

    A really useful and insightful piece! I’ve been reading around the subject of PPC a lot recently, and was starting to think that I needed to be a lot more scientific in my approach to testing. It’s good to know that someone else also uses the “go with the better ad” instinct – especially as, so far, this approach has always given me great results! Cheers.
