No one is average.
January 24, 2020
No, this is not about diversity in societies but about some interesting statistics. Additionally, we will see how we can easily explore this curiosity with some Julia code! If you haven't used Julia before, you will understand this anyway and might be tempted to give the language a try!
The situation
Consider a set of features, say height, age, hair length, and shoe size. We now measure each feature for a large number of people and compute the average of each feature. For example, our sample might yield that, on average, a person is 1.73 m tall, 24.1 years old, has hair of 13.4 cm length, and wears shoe size 38.1.
It seems quite obvious that one should expect some people in our sample to have approximately these properties. After all, such natural features often follow a normal distribution with a characteristic concentration around the average. But as it turns out, this is absolutely not the case! Even when considering only a handful of features, we typically do not find a single person in our sample that could be considered average in all respects.
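A quick back-of-the-envelope calculation hints at why (my own addition, assuming independent features and calling a value average if it lies within one SD of the mean): a single normally distributed feature is average with probability of about 0.68, so for, say, 15 independent features we get
p = 0.6827        # P(|z| <= 1) for a normal distribution
p^15              # ≈ 0.0033, probability of being average in all 15 features
100 * p^15        # ≈ 0.33 expected all-average individuals among 100 people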
A much nicer explanation (and story!) of this problem can be found in Matt Parker's video Does The Average Person Exist? which has been the inspiration for this blog post.
Generate some data
After I had seen that video, I immediately wanted to find out whether I could reproduce this with some decent Julia hacking, or whether it only occurs in certain real-world data sets. So, warm up your fingers, launch Julia, and let's go!
We should probably start with
using Statistics
since this is a statistics problem.
Now we need to get some data.
We keep it simple and assume that each feature has a normal distribution with a certain mean and variance.
Let's say we have F features with means μs and standard deviations (SD) σs, each randomly chosen between 0 and 1.
F = 15
μs = rand(F)
σs = rand(F)
We can use the function randn to produce normally distributed values with mean 0 and SD 1. Shifting and scaling (μ + σ * randn()) yields a normally distributed random number with mean μ and SD σ.
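As a quick sanity check (with illustrative values 1.73 and 0.1 that I just made up), we can draw many such numbers and compare their empirical mean and SD to the inputs:
samples = 1.73 .+ 0.1 .* randn(100_000)
mean(samples), std(samples)   # should be close to (1.73, 0.1)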
We utilise Julia's dot syntax and with (μs .+ σs .* randn(F))' we get a row vector (' transposes a vector) of length F that represents one person in our sample.
Remark
If you are new to this dot syntax, just read it as "do this operation elementwise". So [1, 2, 3] .+ [4, 5, 6] is equivalent to [1+4, 2+5, 3+6], and f.([1, 2, 3]) is equivalent to [f(1), f(2), f(3)].
One person is not enough, let's have N of them in one matrix. We create N rows and then stack them:
N = 100
rows = ((μs .+ σs .* randn(F))' for i in 1:N)
data = vcat(rows...)
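As an aside, the same matrix could also be built in a single broadcasted expression, without the generator (just an alternative sketch; the rest of the post sticks with data from above):
data_alt = μs' .+ σs' .* randn(N, F)   # N×F, one row per person, one column per feature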
That's something we can work with. data is a matrix with N rows, one for each individual, and F columns, one for each feature.
We could check if everything went fine so far and compute the empirical mean and SD of each feature, i.e. of each column in data:
μs_emp = mean.(eachcol(data))
σs_emp = std.(eachcol(data))
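As a side note, the same column statistics can be computed with the dims keyword of mean and std (an equivalent alternative; I stay with the eachcol version below):
vec(mean(data, dims = 1))   # same values as μs_emp
vec(std(data, dims = 1))    # same values as σs_emp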
These two arrays should be very similar to their theoretical counterparts μs and σs. We can check that by dividing them entrywise:
println(μs_emp ./ μs)
println(σs_emp ./ σs)
You should only see values near 1, for sufficiently large N.
In my case, the output was:
[1.00118, 1.0253, 0.986944, 1.1464, 1.01061, 0.921282, 1.0203, 1.13068, 1.16493, 0.902776, 0.991456, 0.924438, 0.941744, 0.972863, 1.0154]
[0.961988, 1.03478, 0.991027, 0.97863, 0.87932, 0.953272, 1.01566, 0.916768, 1.02057, 1.06744, 0.968002, 0.970934, 0.984849, 0.91135, 1.03027]
So, pretty one-ish.
The most average individual
Okay, back to the topic.
We want to find someone in our sample (that is, a row in data) that is near the average. Or, to ask the question more quantitatively: what is the minimal distance any individual in our sample has to the average?
What do we mean by *distance to the average* in the first place?
This is not trivial and there are multiple possibilities:
- any p-norm, including the maximum norm
- the maximal or minimal z-score of one feature
- come up with your own clever idea!
Additionally, we can always decide if we want to use the theoretical or empirical values for mean and SD.
Just to give an example, this would be the 2-norm (Euclidean distance) using the theoretical mean:
dist_eucl(row, fs) = (row[fs] .- μs[fs]) .^ 2 |> sum |> sqrt
(Or you can be clever and use norm from the LinearAlgebra package.) We also pass a parameter fs that determines which features are used for the computation (maybe we are only interested in fs = [1, 4, 7]). Later, we will explore the influence of this parameter.
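For completeness, that shortcut could look like this (a sketch; dist_eucl_norm is a name I made up):
using LinearAlgebra
dist_eucl_norm(row, fs) = norm(row[fs] .- μs[fs])   # norm defaults to the 2-norm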
I opt for the maximal absolute z-score of one feature using the empirical mean and SD. Here is why: In statistics, we generally consider a value to be average if it is at most one SD away from the mean, i.e. the corresponding z-score is between -1 and 1. Thus, to call an individual average in every feature, the absolute z-score should not exceed 1 for any feature. By expressing the difference to the mean in multiples of the SD (and that's what the z-score does), we take into account the different variances of each feature.
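For instance, if heights have mean 1.73 m and SD 0.10 m (illustrative numbers), a height of 1.90 m corresponds to a z-score of (1.90 - 1.73) / 0.10 = 1.7, so that person would not count as average in height.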
In Julia, it reads like this:
dist_max_abs_z(row, fs) = abs.((row[fs] .- μs_emp[fs]) ./ σs_emp[fs]) |> maximum
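For example, dist_max_abs_z(data[1, :], 1:F) would give this distance for the first individual in our sample.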
Whatever metric you decide to use, this will give you the smallest distance any row has to the average with respect to that metric:
min_dist_to_avg(dist_to_avg, fs) = minimum(dist_to_avg(row, fs) for row in eachrow(data))
The moment of truth
There we are: we have everything we need to answer the question. How close to the average does the most average individual in our sample get, considering all features?
println(min_dist_to_avg(dist_max_abs_z, 1:F))
And the output is:
1.2241054929529303
So, indeed: Every row in our matrix has one entry that is at least 1.2 standard deviations away from that column's mean. None of them could be considered average in every feature.
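By the way, if you also want to know which individual comes closest, not just how close, a small extension could look like this (findmin returns the minimum together with its index; the variable names are my own):
dists = [dist_max_abs_z(row, 1:F) for row in eachrow(data)]
best_dist, best_idx = findmin(dists)   # data[best_idx, :] is the most average person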
Influence of the number of features
Finally, we could ask a bonus question: when considering only a certain number of all F features, what minimal distance do we get then?
We need to find the set of all subsets of [1, ..., F], which is called the power set. But don't worry, no need for nested loops to construct this set! All we need is the package Combinatorics, which gives us the function powerset.
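To get a feeling for powerset, here is a tiny example (output order as produced by Combinatorics; note the empty subset, which we will skip below):
collect(powerset(1:3))
# [[], [1], [2], [3], [1, 2], [1, 3], [2, 3], [1, 2, 3]]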
We can then iterate through this power set of all features and compute the minimal distance as above. Multiple subsets have the same length, and we reduce these to one value per length by averaging them.
sums = fill(0.0, F)
lengths = fill(0.0, F)
for fs in powerset(1:F)
    # skip the empty subset
    if length(fs) == 0
        continue
    end
    # accumulate the minimal distance over subsets of this size
    sums[length(fs)] += min_dist_to_avg(dist_max_abs_z, fs)
    lengths[length(fs)] += 1   # count how many subsets have this size
end
avg_dists = sums ./ lengths
for fc in 1:F
    println("$fc features -> minimal distance $(avg_dists[fc]).")
end
This outputs:
1 features -> minimal distance 0.013784291578738696.
2 features -> minimal distance 0.1212074863165526.
3 features -> minimal distance 0.25896662807332965.
4 features -> minimal distance 0.37577905164030784.
5 features -> minimal distance 0.47341615845091006.
6 features -> minimal distance 0.5625986502091298.
7 features -> minimal distance 0.6460987469288716.
8 features -> minimal distance 0.724867174703495.
9 features -> minimal distance 0.7951927768837386.
10 features -> minimal distance 0.854560515984068.
11 features -> minimal distance 0.9083062719229041.
12 features -> minimal distance 0.9665906080745594.
13 features -> minimal distance 1.038106028492815.
14 features -> minimal distance 1.1261922721338005.
15 features -> minimal distance 1.2241054929529303.
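The distances grow steadily with the number of features. If you want to visualise this, a minimal sketch (assuming you have the UnicodePlots package installed; any plotting package works just as well):
using UnicodePlots
lineplot(1:F, avg_dists, xlabel = "features", ylabel = "avg min distance")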
Final remarks
That's it! We have seen that this interesting statistical phenomenon, that no individual in a sample is similar to the average of that sample, can easily be reproduced with only a handful of lines of Julia. If you haven't already, give it a try and experiment with it!
Here you can find the complete code that was used:
using Statistics
using Combinatorics
F = 15
μs = rand(F)
σs = rand(F)
N = 100
rows = ((μs .+ σs .* randn(F))' for i in 1:N)
data = vcat(rows...)
μs_emp = mean.(eachcol(data))
σs_emp = std.(eachcol(data))
println(μs_emp ./ μs)
println(σs_emp ./ σs)
dist_eucl(row, fs) = (row[fs] .- μs[fs]) .^ 2 |> sum |> sqrt
dist_max_abs_z(row, fs) = abs.((row[fs] .- μs_emp[fs]) ./ σs_emp[fs]) |> maximum
min_dist_to_avg(dist_to_avg, fs) = minimum(dist_to_avg(row, fs) for row in eachrow(data))
println(min_dist_to_avg(dist_max_abs_z, 1:F))
sums = fill(0.0, F)
lengths = fill(0.0, F)
for fs in powerset(1:F)
    if length(fs) == 0
        continue
    end
    sums[length(fs)] += min_dist_to_avg(dist_max_abs_z, fs)
    lengths[length(fs)] += 1
end
avg_dists = sums ./ lengths
for fc in 1:F
    println("$fc features -> minimal distance $(avg_dists[fc]).")
end