No, this is not about diversity in societies but about some interesting statistics. Additionally, we will see how we can easily explore this curiosity with some Julia code! If you haven’t used Julia before, you will understand this anyway and might be tempted to give the language a try!
Consider a certain set of features, maybe height, age, hair length, shoe size. We now measure each feature for a large amount of persons and examine the average of each feature. For example, our sample might yield that, in average, a person is 1.73 m tall, 24.1 years old, has hair of 13.4 cm length, and shoe size 38.1.
It seems quite obvious that one should expect some people in our sample to have approximately these properties. After all, such natural features often follow a normal distribution with a characteristic concentration around the average. But as it turns out, this is absolutely not the case! By considering only a few features, we typically do not find a single person in our sample that could be considered average in all respects.
A much nicer explanation (and story!) of this problem can be found in Matt Parker’s video Does The Average Person Exist? which has been the inspiration for this blog post.
After I had seen that video, I immediatly wanted to find out if I could reproduce that with some decent Julia hacking or if this only occurs in certain real-world data sets.
So, warm up your fingers, launch Julia, and let’s go!
We should probably start with
since this is a statistics problem.
Now we need to get some data.
We keep it simple and assume that each feature has a normal distribution with a certain mean and variance.
Let’s say we have
F features with means
μs and standard deviations (SD)
σs each randomly chosen between 0 and 1.
F = 15 μs = rand(F) σs = rand(F)
We can use the function
randn to produce normally distributed values with mean 0 and SD 1.
Shifting and scaling (
μ + σ * rand()) yields a normally distributed random number with mean
μ and SD
We utilise Julia’s dot syntax and with
(μs .+ σs .* randn(F))'
we get a row vector (
' transposes a vector) of length
F that represents one person in our sample.
If you are new to this dot syntax, just read it as “do this operation elementwise”. So
[1, 2, 3] .+ [4, 5, 6]is equivalent to
[1+4, 3+5, 4+6]and
f.([1, 2, 3])is equivalent to
[f(1), f(2), f(3)].
One person is not enough, let’s have
N of them in one matrix.
N rows and then stack them:
N = 100 rows = ((μs .+ σs .* randn(F))' for i in 1:N) data = vcat(rows...)
That’s something we can work with.
data is a matrix with
N rows, one for each individual, and
F columns, one for each feature.
We could check if everything went fine so far and compute the empirical mean and SD of each feature, i.e. column in
μs_emp = mean.(eachcol(data)) σs_emp = std.(eachcol(data))
These two arrays should be very similar to their theoretical counterparts
We can check that by dividing them entrywise:
println(μs_emp ./ μs) println(σs_emp ./ σs)
You should only see values near 1, for sufficiently large
In my case, the output was:
[1.00118, 1.0253, 0.986944, 1.1464, 1.01061, 0.921282, 1.0203, 1.13068, 1.16493, 0.902776, 0.991456, 0.924438, 0.941744, 0.972863, 1.0154] [0.961988, 1.03478, 0.991027, 0.97863, 0.87932, 0.953272, 1.01566, 0.916768, 1.02057, 1.06744, 0.968002, 0.970934, 0.984849, 0.91135, 1.03027]
So, pretty one-ish.
Okay, back to the topic.
We want to find someone in our sample (that is, a row in
data) that is near the average.
Or to ask this question more quantitatively: What is the minimal distance any individual in our sample has to the average?
What do we mean by distance to the average in the first place?
This is not trivial and there are multible possibilites:
Additionally, we can always decide if we want to use the theoretical or empirical values for mean and SD.
Just to give an example, this would be the 2-norm (euclidian distance) using the theoretical mean:
dist_eucl(row, fs) = (row[fs] .- μs[fs]) .^2 |> sum |> sqrt
(Or you are clever and use
norm from the
We also give a parameter
fs that determines what features are used for the computation (maybe we are only interested in
fs = [1, 4, 7]).
Later, we will explore the influence of this parameter.
I opt for the maximal absolute z-score of one feature using the empirical mean and SD. Here is why: In statistics, we generally consider a value to be average if it is at most one SD away from the mean, i.e. the corresponding z-score is between -1 and 1. Thus, to call an individual average in every feature, the absolute z-score should not exceed 1 for any feature. By expressing the difference to the mean in multiples of the SD (and that’s what the z-score does), we take into account the different variances of each feature.
In Julia, it reads like this:
dist_max_abs_z(row, fs) = abs.((row[fs] .- μs_emp[fs]) ./ σs_emp[fs]) |> maximum
Whatever metric you decided to use, this will give you the smallest distance any row has to the average, regarding that metric:
min_dist_to_avg(dist_to_avg, fs) = minimum(dist_to_avg(row, fs) for row in eachrow(data))
There we are, we have everything we need to give the answer: How far from the average are we at least away, considering all features?
And the output is:
So, indeed: Every row in our matrix has one entry that is at least 1.2 standard deviations away from that column’s mean. None of them could be considered average in every feature.
Finally, we could ask a bonus question:
When considering only a certain amount of all
F features, what minimal distance do we get then?
We need to find the set of all subsets of
[1, ..., F], which is called the power set.
But don’t worry, no need for nested loops to construct this set!
All we need is the package Combinatorics which gives us the function
We can then iterate through this power set of all features and compute the minimal distance as above.
Multiple subsets have the same length and we reduce these to one value by averaging them.
sums = fill(0.0, F) lengths = fill(0.0, F) for fs in powerset(1:F) if length(fs) == 0 continue end sums[length(fs)] += min_dist_to_avg(dist_max_abs_z, fs) lengths[length(fs)] += 1 end avg_dists = sums ./ lengths for fc in 1:F println("$fc features -> minimal distance $(avg_dists[fc]).") end
1 features -> minimal distance 0.013784291578738696. 2 features -> minimal distance 0.1212074863165526. 3 features -> minimal distance 0.25896662807332965. 4 features -> minimal distance 0.37577905164030784. 5 features -> minimal distance 0.47341615845091006. 6 features -> minimal distance 0.5625986502091298. 7 features -> minimal distance 0.6460987469288716. 8 features -> minimal distance 0.724867174703495. 9 features -> minimal distance 0.7951927768837386. 10 features -> minimal distance 0.854560515984068. 11 features -> minimal distance 0.9083062719229041. 12 features -> minimal distance 0.9665906080745594. 13 features -> minimal distance 1.038106028492815. 14 features -> minimal distance 1.1261922721338005. 15 features -> minimal distance 1.2241054929529303.
That’s it, we have seen that the interesting statistical phenomenon that no individual in a sample is similar to the average of that sample can easily be reproduced by only a handful of Julia. If you haven’t already, give this a try and play and experiment with it!
Here you can find the complete code that was used:
using Statistics using Combinatorics F = 15 μs = rand(F) σs = rand(F) N = 100 rows = ((μs .+ σs .* randn(F))' for i in 1:N) data = vcat(rows...) μs_emp = mean.(eachcol(data)) σs_emp = std.(eachcol(data)) println(μs_emp ./ μs) println(σs_emp ./ σs) dist_eucl(row, fs) = (row[fs] .- μs[fs]) .^2 |> sum |> sqrt dist_max_abs_z(row, fs) = abs.((row[fs] .- μs_emp[fs]) ./ σs_emp[fs]) |> maximum min_dist_to_avg(dist_to_avg, fs) = minimum(dist_to_avg(row, fs) for row in eachrow(data)) println(min_dist_to_avg(dist_max_abs_z, 1:F)) sums = fill(0.0, F) lengths = fill(0.0, F) for fs in powerset(1:F) if length(fs) == 0 continue end sums[length(fs)] += min_dist_to_avg(dist_max_abs_z, fs) lengths[length(fs)] += 1 end avg_dists = sums ./ lengths for fc in 1:F println("$fc features -> minimal distance $(avg_dists[fc]).") end