Statistics: Calculating Mean, Median, Mode, Standard Deviation using Ruby, R, or Javascript
I'm currently studying statistics for my EMBA at USC, and the best way for me to learn is to both write down my notes and include some code.
Mean
Mean - The sum of all data points divided by the total number of observations.
Ruby
weight = [115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 154, 159, 164]
height = [58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72]
# Provide the average
def mean(array)
  # Sum every value, then divide by the count as a float to avoid integer division
  array.inject(0) { |sum, x| sum + x } / array.size.to_f
end
puts %Q{ Mean Weight: #{mean(weight)}, Mean Height: #{mean(height)} }
R
weight <- c(115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 154, 159, 164)
height <- c(58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72)
weight_mean <- mean(weight)
height_mean <- mean(height)
sprintf("Mean Weight: %1.4f, Mean Height: %1.4f", weight_mean, height_mean)
Javascript
const util = require("util");
const math = require("mathjs");
let weight = [115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 154, 159, 164];
let height = [58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72]
var weightMean = math.mean(weight).toFixed(2);
var heightMean = math.mean(height).toFixed(2);
console.log( util.format("Mean Weight: %s, Mean Height: %s", weightMean, heightMean) );
Ruby is such an elegant language that showing your work is fun, but I love how R has a native mean() function. In the Javascript example, I'm splitting the difference with a little help from the MathJS package.
Median
Median is the midpoint of the data once it is sorted. Suppose you have 25 observations. The midpoint would be the middle observation, or row 13.
When you are given a mean or median number, consider it the beginning of an adventure.
Ruby
weight = [115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 154, 159, 164]
height = [58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72]
# If the array has an odd number of items, pick the one in the middle.
# If the array size is even, take the mean of the two middle items.
def median(array, already_sorted=false)
  return nil if array.empty?
  array = array.sort unless already_sorted
  m_pos = array.size / 2
  # Average the middle pair inline so this snippet does not depend on mean()
  array.size.odd? ? array[m_pos] : (array[m_pos - 1] + array[m_pos]) / 2.0
end
puts %Q{ Median Weight: #{median(weight)}, Median Height: #{median(height)} }
R
weight <- c(115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 154, 159, 164)
height <- c(58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72)
weight_median <- median(weight)
height_median <- median(height)
sprintf("Median Weight: %s, Median Height: %s", weight_median, height_median)
Javascript
const util = require("util");
const math = require("mathjs");
let weight = [115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 154, 159, 164];
let height = [58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72]
var weightMedian = math.median(weight).toFixed(2);
var heightMedian = math.median(height).toFixed(2);
console.log( util.format("Median Weight: %s, Median Height: %s", weightMedian, heightMedian) );
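The "row 13 of 25" rule above is easy to check in code. Here is a minimal Ruby sketch, assuming made-up observations numbered 1 through 25; median_of is a hypothetical helper name I'm using for illustration, not a built-in.

```ruby
# Hypothetical helper: returns the median of a numeric array.
def median_of(values)
  sorted = values.sort
  mid = sorted.size / 2
  # Odd count: the single middle value. Even count: mean of the two middle values.
  sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

observations = (1..25).to_a        # made-up data: 1, 2, ..., 25
puts median_of(observations)       # 13, the value in the 13th row
puts median_of([58, 59, 60, 61])   # 59.5, the mean of 59 and 60
```

With an even number of observations there is no single middle row, so the two middle values are averaged.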
The Mode
The mode is the data point that is most prevalent in the data set. It represents the most likely outcome in a dataset.
Ruby
weight = [115, 115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 154, 159, 164]
height = [59, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72]
# Returns every value tied for the highest count (only the first one when
# find_all is false), or nil when no value repeats.
def modes(array, find_all=true)
  # Count how many times each value appears
  histogram = array.inject(Hash.new(0)) { |h, n| h[n] += 1; h }
  modes = nil # becomes [highest_count, value, value, ...]
  histogram.each_pair do |item, times|
    modes << item if modes && times == modes[0] and find_all
    modes = [times, item] if (!modes && times>1) or (modes && times>modes[0])
  end
  return modes ? modes[1...modes.size] : modes
end
puts %Q{ Mode Weight: #{modes(weight)}, Mode Height: #{modes(height)} }
R
weight <- c(115, 115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 154, 159, 164)
height <- c(58, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72)
get_mode <- function(v) {
  uniqv <- unique(v)
  # Count occurrences of each unique value and return the most frequent one
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
height_mode <- get_mode(height)
weight_mode <- get_mode(weight)
sprintf("Mode Weight: %s, Mode Height: %s", weight_mode, height_mode)
Javascript
const util = require("util");
const math = require("mathjs");
let weight = [115, 115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 154, 159, 164];
let height = [58, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72]
var weightMode = math.mode(weight);
var heightMode = math.mode(height);
console.log( util.format("Mode Weight: %s, Mode Height: %s", weightMode, heightMode) );
Standard Deviation
Standard Deviation is the square root of the average squared distance from the mean. Said differently, it's a number that measures how close your data set, as a whole, is to the mean.
This statistic will help you get a better feel for the distribution of your data points.
Ruby
weight = [115, 115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 154, 159, 164]
height = [59, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72]
def mean(array)
  array.inject(0) { |sum, x| sum + x } / array.size.to_f
end
def standard_deviation(array)
  m = mean(array)
  # Sum of squared distances from the mean
  sum_of_squares = array.inject(0) { |sum, x| sum + (x - m) ** 2 }
  # Sample standard deviation: divide by n - 1
  sd = Math.sqrt(sum_of_squares / (array.size - 1))
  # Round floating point to 4 decimals
  "%0.4f" % sd
end
puts %Q{ Weight SD: #{standard_deviation(weight)}, Height SD: #{standard_deviation(height)} }
R
R's sd() function computes the sample standard deviation (dividing by n - 1), not the population standard deviation.
weight <- c(115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 154, 159, 164)
height <- c(58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72)
weight_sd <- sd(weight)
height_sd <- sd(height)
sprintf("Weight SD: %1.4f, Height SD: %1.4f", weight_sd, height_sd)
Javascript
const util = require("util");
const math = require("mathjs");
let weight = [115, 115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 154, 159, 164];
let height = [58, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72]
var weightSD = math.std(weight).toFixed(4);
var heightSD = math.std(height).toFixed(4);
console.log( util.format("Weight SD %s, Height SD: %s", weightSD, heightSD) );
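The sample versus population distinction noted in the R section comes down to the divisor. Below is a minimal Ruby sketch of both, using the height data; sum_of_squares, sample_sd, and population_sd are hypothetical helper names of my own, not library methods.

```ruby
height = [58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72]

# Sum of squared distances from the mean, shared by both formulas.
def sum_of_squares(values)
  m = values.sum / values.size.to_f
  values.sum { |x| (x - m)**2 }
end

# Sample standard deviation: divide by n - 1 (what R's sd() reports).
def sample_sd(values)
  Math.sqrt(sum_of_squares(values) / (values.size - 1))
end

# Population standard deviation: divide by n.
def population_sd(values)
  Math.sqrt(sum_of_squares(values) / values.size)
end

puts format("Sample SD: %.4f, Population SD: %.4f", sample_sd(height), population_sd(height))
# Sample SD: 4.4721, Population SD: 4.3205
```

The sample version is always a little larger, and the gap shrinks as the number of observations grows.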
Z Scores
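Z-scores are simple arithmetic transformations of the actual measurements: subtract the mean from a measurement, then divide by the standard deviation. The result tells you how many standard deviations the measurement lies from the mean.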
R
In R, you can calculate the z-score using the scale() function.
Longhand
This uses the z-score algebraic expression directly.
weight <- c(115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 154, 159, 164)
height <- c(58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72)
x <- 50
zWeight <- (x - mean(weight) ) / sd(weight)
zHeight <- (x - mean(height) ) / sd(height)
sprintf("Weight Z: %1.2f. Height Z: %1.2f", zWeight, zHeight)
This uses R's scale() function.
weight <- c(115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 154, 159, 164)
height <- c(58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72)
x <- 50
zWeight <- scale(x, center = mean(weight), scale = sd(weight))
zHeight <- scale(x, center = mean(height), scale = sd(height))
sprintf("Weight Z: %1.2f. Height Z: %1.2f", zWeight, zHeight)
Javascript
const util = require("util");
const math = require("mathjs");
let weight = [115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 154, 159, 164];
let height = [58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72];
// How many standard deviations a datapoint lies from the mean.
// This will help you determine if a specific datapoint is an outlier.
// With n > 1, the denominator becomes the standard error of a sample mean.
function zScore(datapoint, mean, std, n=1){
  let score = (datapoint - mean) / (std / Math.sqrt(n));
  return Number(score).toFixed(4);
}
var x = 50;
var weightMean = math.mean(weight);
var weightSD = math.std(weight);
var zWeight = zScore(x, weightMean, weightSD);
var heightMean = math.mean(height);
var heightSD = math.std(height);
var zHeight = zScore(x, heightMean, heightSD);
console.log( util.format("Weight Z: %s, Height Z: %s", zWeight, zHeight) );
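For completeness, here is a minimal Ruby sketch of the same longhand z-score calculation, assuming the sample standard deviation (n - 1) so the result lines up with R's sd(). The helpers mean, sample_sd, and z_score are names I'm defining for illustration.

```ruby
weight = [115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 154, 159, 164]
height = [58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72]

def mean(values)
  values.sum / values.size.to_f
end

# Sample standard deviation (divides by n - 1).
def sample_sd(values)
  m = mean(values)
  Math.sqrt(values.sum { |v| (v - m)**2 } / (values.size - 1))
end

# z = (x - mean) / sd: how many standard deviations x lies from the mean.
def z_score(x, values)
  (x - mean(values)) / sample_sd(values)
end

x = 50
puts format("Weight Z: %.2f. Height Z: %.2f", z_score(x, weight), z_score(x, height))
# Weight Z: -5.60. Height Z: -3.35
```

Both scores are strongly negative because 50 sits far below every value in either data set.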
Correlation
This little function in R is convenient. Sometimes you might want to ask yourself, "Are these two variables correlated?" Using R, it's straightforward to compute the correlation coefficient with cor().
R
weight <- c(115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 154, 159, 164)
height <- c(58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72)
# Correlation coefficient: ranges from -1 to 1 and is not a percentage
r <- cor(weight, height)
sprintf("Correlation coefficient: %f", r)
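To show the work behind that number, here is a minimal Ruby sketch of the Pearson correlation coefficient, which is what R's cor() computes by default. pearson_correlation is a hypothetical helper name of my own, and the code assumes both arrays have the same length.

```ruby
weight = [115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 154, 159, 164]
height = [58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72]

def mean(values)
  values.sum / values.size.to_f
end

# Pearson's r: sum of cross-products of the deviations, divided by the
# product of the square roots of each variable's sum of squared deviations.
def pearson_correlation(xs, ys)
  raise ArgumentError, "arrays must be the same length" unless xs.size == ys.size
  mx = mean(xs)
  my = mean(ys)
  cross_products = xs.zip(ys).sum { |x, y| (x - mx) * (y - my) }
  spread_x = Math.sqrt(xs.sum { |x| (x - mx)**2 })
  spread_y = Math.sqrt(ys.sum { |y| (y - my)**2 })
  cross_products / (spread_x * spread_y)
end

puts format("Correlation coefficient: %.4f", pearson_correlation(weight, height))
# Correlation coefficient: 0.9955
```

A value this close to 1 says weight and height rise together almost perfectly in this data set.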