I frequently create reports that show the average (mean) value of a metric based on any number of dimensions. For example, I might report the average cost of a product by state. Usually, I include the overall average cost at the bottom of the report, and one of the first things people do to “check” my work is to highlight the entire row of average costs, to have Excel show them the average of the entire group. These numbers rarely match because taking an average of averages is wrong.

The reason an average of averages is wrong is that it doesn’t take into account how many units went into each average. For example: Let’s say we only have orders in 2 states, New York and Pennsylvania. The average cost of the product in NY is $100, and the average cost in PA is $50. Now…let’s say order volume in NY is 10 orders, and volume in PA is 2 orders. Taking the average of the two averages would give us $75 as the overall average cost. (($100+$50 = $150)/2) = $75 However, if we take volume into account, as we should, the average is $91.67. 10 NY orders at $100 = $1,000 + 2 PA orders at $50 = $100, for a total cost of $1,100, divided by all 12 orders = $91.67. That’s a difference of over $16! Now what if we had 100,000 vs 20,000 orders and we were trying to budget or plan for next year? Instead of reporting the actual total costs of $11,400,000, we would inaccurately report that our total costs are expected to be around $9 Million. Finding out late next year that you miscalculated by $2.4 Million might be considered a big deal…

If you must calculate an average of these averages, and you have volume information, there’s a handy Excel function called sumproduct. SumProduct is a function that takes two arrays (vectors) and multiplies them together, line by line, before adding up the total. In the example above, assuming our state is in column A, Volume is in Column B, and Avg Cost in Column C, we’d use the formula =sumproduct(B2:B3,C2:C3). This takes B2*C2 + B3*C3, giving us the total of $1,100, as we had calculated previously. If we take that formula a step further, to divide by the total number of orders, we can get the actual average cost per order. =sumproduct(B2:B3,C2:C3)/sum(B2:B3)

So, the next time you need to get the average of values that have already been averaged, make sure you take the volume into account.

{ 0 comments }

How to Create a Simple Graph With R

by Jason Green on 2013/07/01

In the process of doing an ROI analysis, I wanted a simple area chart showing the negative burn down of our costs while the project was in development, through to the positive savings of the project after completion. Excel didn’t do a great job for me, so I thought I’d give R a try.

As you can see in the image below, Excel put the x axis right through the center of the chart and since I had to use a column chart, it’s more blocky/chunky than I liked. With R and ggplot2, I was able to smooth out those lines and put the x axis in the right place. I also had a little more control over the labels on the x axis, so I could show the intervals at 12/24/… months instead of 1/13/….. Of course it’s super-easy to create charts in Excel, but with a little extra effort in R, you can have a much better final product.

The code should be fairly well commented, but here’s the general idea:

  1. Install ggplot2 and scales to build the chart and format it with dollar signs.
  2. Load data from a csv file. Ours simply contains months as numbers, and the total savings.
  3. Interpolate our 60 months into 1000 separate values to reduce the choppines.
  4. Add a valence column to indicate which values are positive or negative.
  5. Plot the chart.

Area Chart in R vs Bar Chart in Excel

Download the spreadsheets here:
ROI_Calculator (For coming up with the necessary numbers.)
SampleROI.csv (Easy format for importing into R.)


### Begin R Code ###
# If you haven't already, install the packages below.
library(ggplot2) # ggplot2 for creating the charts
library(scales) # scales for the dollar formatting of the axis.

#Load the csv file containing the values:
SampleROI <- read.csv("C:/Users/convalytics/Documents/R/SampleROI.csv") # Make sure to use your own path ***

# Interpolate the data into 1000 separate values. (to smooth out the choppiness of having only a few values)
interp <- approx(SampleROI$Month, SampleROI$TotalCostVsSavings, n=1000)

#Rebuild the data frame with the interpolated values.
roi <- data.frame(Month=interp$x, Savings=interp$y)

#Add a "valence" column to indicate positive vs negative values.
#Essentially selects the not-yet-existing "valence" column on a
# subset of the positive or negative values in the "Savings" column,
# and inserts a value of "pos" or "neg" accordingly.
roi$valence[roi$Savings >= 0] <- "pos"
roi$valence[roi$Savings < 0] <- "neg"

# Plot the chart
ggplot(roi, aes(x=Month, y=Savings, width=1)) +
geom_area(aes(fill=valence, alpha=.8,stat="identity")) +
scale_x_continuous(breaks=seq(0,60,12),expand=c(0,0)) +
scale_y_continuous(breaks=seq(-4000000,6000000,1000000),labels=dollar) +
scale_fill_manual(values=c("darkred","darkgreen")) +
labs(title="Sample ROI",x="Months from Project Start",y="Running Cost vs. Savings") +
theme(plot.title=element_text(face="bold"))

# R Code by Jason Green : Convalytics.com #
### End R Code ###

I hope this makes for a good example that you can use to tweak for your own uses. If you have any questions or suggestions, please leave us a comment!

{ 0 comments }

Outlandish Measurements

June 28, 2013
Thumbnail image for Outlandish Measurements

So I’ve been reading “How to Measure Anything” by Douglas W. Hubbard and “Predictive Analytics” by Eric Siegel, and decided to create a post highlighting things that are “impossible,” or at least difficult, to measure. Then, I’ll take on the task of finding ways to measure them. Not everything has to be business-related either. Maybe […]

Onward to the full article →

Is Google Analytics a Waste of Time?

February 28, 2011

Yes, if you only look at the data and don’t act. Who cares if your bounce rate is 70% or 30%? Who cares if you have 5,000 visitors per month? Who cares if 30% of your visits come from California? None of that matters if you just look at it and say, “Yay! I have […]

Onward to the full article →

Google Analytics Referrer Spam

January 26, 2011

If you’ve launched a new website recently and were excited to see referrals from golbnet or forexmarket, you have been spammed. This is a tactic used by spammers to get webmasters, curious to research their referrers, to visit the desired website. Also referred to as log spam or referrer bombing. It’s not necessarily malicious, but […]

Onward to the full article →

What is My Conversion Rate?

January 24, 2011

If you’re just getting started with internet marketing, your first question might be, “What is a conversion rate?” Put simply, your conversion rate is the percentage of visitors who complete your desired action. If your goal is for your visitors to buy your product, a conversion would occur once that has happened. If you had […]

Onward to the full article →

Welcome to Convalytics.com

January 22, 2011

Convalytics.com was originally created to focus on the analytics and statistics behind marketing conversions.  I’d like to take that a step further and devote this website to the demonstration and discussion of all types of analytics. I will be writing about data visualization, the latest programming languages, and the latest business intelligence applications. Specifically, QlikView, […]

Onward to the full article →