Sunday, 6 October 2013

An algorithm to weight influences

This is one of my first times using MathJax! If you have any trouble viewing these TeX formulas, don't hesitate to tell me.

Introduction

There are many types of contests that require subjective judging, diving as well as osu! mapping contests among them. Judges are human, so bias is unavoidable even when there are well-written guidelines to follow. The aim of this entry is to attempt to correct such bias with an algorithm, and to determine the degree of bias that occurred in the judgement.

The problem

Definition. The objective judgement is defined by a function $H:S\rightarrow I$ where $S$ is the set of all possible submissions towards the judgement and $I$ is the set of all possible scores. For each map $m\in S$, the scoring $H(m)$ is, if not objectively correct, the most recognized value. Call $H(m)$ the objective score of the submission $m$.

In this entry, we assume that the scoring is a single number, i.e. $I$ is a subset of $\mathbb{R}$. In particular, we may assume that $I = [0,100]$ for our examples. Why 100? Because our algorithm uses the percentage rank, which also ranges from 0(%) to 100(%).

Judgement is naturally a process of mapping submissions to scores. However, judges are human, and their judgement may vary with factors independent of the submissions. We therefore model the judgement as a random variable instead of a function.

Definition. The distribution of scoring by a particular judge for a particular map $m$ of fixed objective score is defined by $J_m: \Omega \rightarrow A$ where $A\subseteq I$.

For $m,m' \in S$, we assume that if $H(m) = H(m')$, then $J_m = J_{m'}$, to make our life easier. In fact, this may not be the case, as there are factors that depend on the submission (for instance, the author's name) but should be independent of the judgement. In such a case the (human) judges may produce bias.

Definition. The expected behavior of a judge can be defined by a function $J:S\rightarrow I$. For any $m \in S$, we have $J(m) = E(J_m)$, i.e. $J(m)$ is the mean (expected) scoring of the submission $m$.

Definition. A judge is defined by $j: \Omega \times I \rightarrow I$ where, for every fixed $k \in I$, the distribution of $j(\cdot, k)$ is that of $J_m$ for any $m \in S$ with $H(m) = k$.

Now we will introduce two types of bias occurring in the judgement:

Definition. The consistency of a judge is a measurement of how consistently a judge scores a fixed map. In particular, define the consistency of a judge $C_1: T\times I \rightarrow \mathbb{R}$, where $T$ is the set of all possible judges, by $C_1(J, k) = V(J_m)$ for any $m\in S$ with $H(m) = k$.

Definition. The consistency across judges is a measurement of how consistently the judges score the same map. In particular, define the consistency across judges $C_2: P(T)\times I \rightarrow \mathbb{R}$ so that for $A = \left\{ J, J',\ldots\right\} \in P(T)$, $C_2 (A,k) = V(J_m+J'_m+\ldots)$ for any $m\in S$ with $H(m) = k$.

(The covariances among distinct judges are supposed to be zero, but may be non-zero if bias exists.)
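As a toy illustration of the first measure: $C_1$ is just the variance of one judge's scores across maps of equal objective quality. The numbers below are invented purely for demonstration:

```python
# Toy illustration of C1: the variance of one judge's scores on maps of
# equal objective quality. Scores here are made up for demonstration.
import statistics

judge_a = [70, 72, 68, 71]  # a consistent judge on four equal-quality maps
judge_b = [55, 90, 60, 85]  # a less consistent judge on the same maps

c1_a = statistics.pvariance(judge_a)
c1_b = statistics.pvariance(judge_b)
assert c1_a < c1_b  # judge A has the better (smaller) C1
```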

The first type of bias can be treated as randomness, and we can correct it with a simple linear fit, given a large enough sample of submissions with close enough scores. Why do we need "close enough scores"? The reason is mathematical:

Definition. A correction function of a particular judge $J$ is defined by $f_J: I\rightarrow I$ so that $f_J\circ J = H$.

If $f_J$ is continuous we can always take a linear approximation (the linear fit here) around a small region. If it's discrete, we can still do the same, though the error is not precisely bounded at non-differentiable points of $f_J$.

The maths job is almost done, so let's explain it in words. The reason to fix the first bias is to create a unique standard: a judge should always give the same score to the same submission, and the same goes for submissions of equal quality. The reason to fix the second bias is to equalize the influence of each judge. Having different perspectives on a submission is fine (otherwise we would only need one judge, though having more judges with exactly the same judging scale reduces the first type of bias through averaging, as promised by the CLT), but the influence of each judge should be the same. For instance, if judges A, B, C and D give 51/100 to 'good' submissions and 49/100 to 'poor' ones, while judge E gives 95/100 and 5/100 respectively, then the final result depends very heavily on judge E.
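The numbers above can be checked directly: if judge E happens to disagree with the other four on a pair of submissions, E's wide scale alone decides the ranking under a plain mean. A minimal sketch (the disagreement scenario is my own extension of the example):

```python
# Toy check of the influence example: judges A-D score 51 for 'good' and
# 49 for 'poor'; judge E scores 95 and 5. If E disagrees with the other
# four on a pair of submissions, E's wide scale decides the ranking alone.
sub_x = [51, 51, 51, 51, 5]   # A-D call X good, E calls it poor
sub_y = [49, 49, 49, 49, 95]  # A-D call Y poor, E calls it good

mean_x = sum(sub_x) / len(sub_x)  # 41.8
mean_y = sum(sub_y) / len(sub_y)  # 58.2
assert mean_x < mean_y  # the majority-preferred submission ranks lower
```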

Such a correction can't easily be done by investigating the variance, because the actual distribution carries more information than the variance can represent. Mathematically, we can define something like the RMSE (root-mean-square error) to represent how far a judge's scoring is from the objective value, by defining a function $d: T\times I\rightarrow \mathbb{R}$ so that $d(J,k) = \sqrt{E\left[(J_m - k)^2\right]}$ for any $m$ with $H(m) = k$.

One simple way to correct the second type of bias is to equalize the judges' influence by dividing their scores by an influence index (in particular, the RMSE), but since we have $f_J$ on our hands, why not use that for a more precise correction? Here we define the problem:

Instance: Scores $s_{11},\ldots,s_{np}$ from judges $j_1, j_2,\ldots, j_p$ on submissions $m_1, m_2,\ldots, m_n$.
Question: Find an algorithm so that the bias defined above, if it exists, has a minimized effect on the ranking, which is supposed to reflect the real order of $H(m_1), \ldots, H(m_n)$.

The scores themselves do not matter (hence it is fine for $f_J$ not to be exactly the identity function); we want a ranking that follows the ranking of the objective values.

Our plan

Note that $(I, \leq )$ is supposed to be totally ordered before we proceed to any calculations. Then we can put $I$ into a more precise scoring system: the percentage rank, which varies in $[0,1]$.

Define the new objective judgement function $H^* : S \rightarrow [0,1]$ so that for $m \in S$, $H^*(m) = P(H(m) \geq H(m'))$ where $m' \in S$ is (uniformly) randomly selected. Our aim is to construct a function $J^*$ and make it close to $H^*$.
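For a finite pool of submissions, this percentage rank can be sketched directly; each score maps to the fraction of scores in the pool that it meets or beats (`pct_rank` is a helper name of my own):

```python
# Empirical version of H*: each score maps to the fraction of scores in
# the pool that it meets or beats.
def pct_rank(scores):
    n = len(scores)
    return [sum(1 for t in scores if s >= t) / n for s in scores]

# Three submissions with distinct objective scores get distinct ranks.
assert pct_rank([10, 20, 30]) == [1/3, 2/3, 1.0]
```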

Of course, if $I$ is designed to be discrete, then we can take a discrete subset $A\subseteq [0,1]$ following the above definition so that there exists a bijective map between $I$ and $A$. To see injectivity, suppose $H^*(m_1) = H^*(m_2)$ where $H(m_1) > H(m_2)$. Now:

$H^*(m_1) = P(H(m_1)\geq H(m')) = P(H(m_1) \geq H(m') > H(m_2)) + P(H(m_2)\geq H(m')) > H^*(m_2)$

The first probability is non-zero (take $m' = m_1$), so the inequality is in fact strict, contradicting $H^*(m_1) = H^*(m_2)$; hence the map is injective. By construction it is surjective as well, so the transformation is bijective, and any algorithm on $H^*$ can be transferred back to $H$. For the rest of the entry we assume that the mapping works on $I\rightarrow [0,1]$, a continuous interval.

There are two percentage ranks that are useful. The first one is the ranking among scores from an individual judge. Treating submissions as samples from $S$, the scores from one judge actually form a distribution of how that judge scores, i.e. the corresponding score at a supposed percentage rank (or objective score). We can't avoid the first type of bias, but every score $s_{ij}$ can be treated as an approximation of that judge's expected score: $s_{ij} = \hat{J}_i(m_j)$.

As submissions are sampled, we can approximate a function $f_{J_i} : I \rightarrow [0,1]$ so that $f_{J_i}(s_{ij}) \approx H^*(m_j)$. There is actually an assumption here: that the percentage rank within a particular judge's scores is approximately equal to the objective rank.

Meanwhile, we take the whole set of $s_{ij}$ as a sample where a random submission is sampled, then judged by a random judge. This describes how the judges score overall. Define a function $J^* : I \rightarrow [0,1]$ so that $J^*(s) = P(s \geq J_i(m'))$, the probability that the score is higher than the score a random judge gives to a random map.

Similarly, as long as the judges' decisions do not largely disagree, we have $J^*(s) \approx H^*(m)$ for $H(m) = s$.

Now we are ready to put the scoring onto a unique scale, which implies that the influence of every judge is approximately the same under the same scoring scale; the mean can then be comfortably taken as the final result. For every score $s_{ij}$, transform it by $s'_{ij} = J^{*-1}(f_{J_i}(s_{ij}))$ and take the mean of the transformed scores as the finalized score for ranking.

How do we approximate the functions? A simple KDE (kernel density estimation) would complete the job.
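As a minimal sketch of what the KDE does here, using scipy's `gaussian_kde` (the library choice is my assumption; the post names no tool): the smoothed CDF evaluated at a score is its estimated percentage rank.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Six scores from one judge; the KDE smooths them into a density whose
# CDF serves as that judge's percentage-rank function f_J.
scores = np.array([72.0, 76.0, 76.0, 83.0, 86.0, 90.0])
kde = gaussian_kde(scores)

rank_80 = kde.integrate_box_1d(-np.inf, 80.0)  # estimated rank of a score of 80
assert 0.0 < rank_80 < 1.0
```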

Algorithm

def alg($x = [[s_{11},s_{12},\ldots,s_{1p}],\ldots,[s_{n1},\ldots,s_{np}]]$, min = Null, max = Null):
    if min == Null: min = $\min(s_{ij}) - 3\sigma$
    if max == Null: max = $\max(s_{ij}) + 3\sigma$
    for every column of $x$, approximate $f_{J_i}$ using KDE
    $y$ = matrix($n$, $p$) (create an empty matrix)
    for every entry of $y$: $y_{ij} = f_{J_j}(s_{ij})$
    approximate $J^*$ using KDE
    $z$ = matrix($n$, $p$)
    for every entry of $z$: $z_{ij} = J^{*-1}(y_{ij})$
    return the row means of $z$; if the row mean of $x$ is the definite minimum, return the row mean of $z$ as the definite minimum, and similarly for the maximum.
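The pseudocode above can be sketched in Python with numpy and scipy (a sketch under my own naming; KDE bandwidths are left at library defaults, and the special-casing of the endpoints in the last step is omitted):

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_cdf(sample, points):
    """Smoothed CDF of `sample` evaluated at `points` via a Gaussian KDE."""
    kde = gaussian_kde(sample)
    return np.array([kde.integrate_box_1d(-np.inf, p) for p in points])

def alg(x):
    """x: n-by-p score matrix (rows = submissions, columns = judges)."""
    x = np.asarray(x, dtype=float)
    n, p = x.shape
    # Per-judge percentage rank y_ij = f_{J_j}(s_ij), one KDE per column.
    y = np.column_stack([kde_cdf(x[:, j], x[:, j]) for j in range(p)])
    # Pooled distribution J* over all scores from all judges,
    # evaluated on a grid padded by 3 sigma as in the pseudocode.
    pooled = x.ravel()
    grid = np.linspace(pooled.min() - 3 * pooled.std(),
                       pooled.max() + 3 * pooled.std(), 512)
    jstar = kde_cdf(pooled, grid)
    # Invert J* by interpolating the ranks back onto the score scale.
    z = np.interp(y.ravel(), jstar, grid).reshape(n, p)
    return z.mean(axis=1)  # row means: the finalized scores for ranking
```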

Analysis

The core idea of the algorithm is to sacrifice some independence of scoring to balance the influence of each judge, as if they were uniformly sampled from the population. It is supposed to correct scores around their own positions, so we expect the transformed score to be approximately equal to the original score under normal circumstances: consistent judgement without too many extreme scores.

There are two perspectives to deal with. First we transform the scores and see what we can investigate. Second we measure the performance by constructing an estimator under the usual assumption: the judges selected represent the whole pool of judges, so that we can use bootstrapping techniques.

For each iteration, we sample some judges (as many as in the original panel) and compute their rankings. We take the RMSE of the differences in rankings for the submissions (w.r.t. the rankings computed by the plain mean, or by the algorithm), and we can compute a confidence interval for the estimator.
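The bootstrap loop described above might look like this (a sketch with my own names; the combining rule is passed in so the same loop serves the simple mean, the weighted mean, or the algorithm):

```python
import numpy as np

def ranking(final_scores):
    """Rank positions, 0 = highest final score."""
    return np.argsort(np.argsort(-np.asarray(final_scores)))

def bootstrap_rmse(x, combine, iters=1000, seed=0):
    """Resample judges with replacement; RMSE of rank shifts vs full panel."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n, p = x.shape
    base = ranking(combine(x))
    rmse = np.empty(iters)
    for i in range(iters):
        cols = rng.integers(0, p, size=p)  # a resampled panel of judges
        shift = ranking(combine(x[:, cols])) - base
        rmse[i] = np.sqrt(np.mean(shift.astype(float) ** 2))
    return rmse

# Example: the simple-mean baseline on made-up scores.
x = np.random.default_rng(1).uniform(0, 100, size=(10, 5))
out = bootstrap_rmse(x, lambda m: m.mean(axis=1), iters=100)
```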

We know that the simple mean is still the best representation of the entire set of scores. We have to sacrifice some independence, and possibly representative power, to balance the influence among judges; but how much are we going to sacrifice? We can find out from the computational results. On the other hand, some of the scores exhibit such a high degree of agreement that the estimator of the fixed scores barely disperses from the original one.

Computational results

Test 1: Beatmapping contest I
Background information: high uniqueness in judgement.
Source: kept confidential to protect privacy, but the scores will be partially quoted.
Scoring range: 0-100, integers. This does not matter, since we can ignore integrality in the transformation.

We plot the graph of original score against fixed score:

What happened to the last point? Let's look at the actual scoring: 86, 76, 100, 83, 76, 72.

The 100 turns out to be the only score higher than 90, which is clearly out of the general range. Instead of letting it exert a large influence, our algorithm treats 100 (just the same as 90) as a 'top score', which is pretty much equivalent to the top score from the other judges. The re-evaluated score is then lowered.

If we remove that point and do a linear regression:

The line fits well, but the linear multiplier isn't 1. This is a really interesting phenomenon, which also occurs for the other instances. Why does this happen?

Solving $f(x) = x$ we get $x \approx 66.2$, which is also about the mean. The linear regression tells us that we enlarged the results around the mean. A possible explanation is that we are sampling selectively, so that $J^*$ is not precisely fitted. In particular, we haven't sampled enough submissions with lower scores (which makes sense, because no one wants to submit trash for judgement, right?).

Now we investigate the relation between the scores and the difference between the transformed scores and the transformed scores predicted by the linear regression:

We can be satisfied with this, as it looks entirely random. (Note that if you apply linear regression here, the slope is very close to zero; this should be expected, as these are the errors from another linear regression, and such errors come from a least-squares fit.)

Test 2: Beatmapping contest II
Background information: the minimum possible score (5) is awarded if the submission has some technical failure.
Source: https://osu.ppy.sh/p/contestresults?c=4 (combining all results from the same contest; the song does not matter)
Scoring range: 5-50, integers.

$\alpha \approx 1$ indicates that we sampled from broad enough sources, and the slightly larger variation indicates a slightly larger difference among the judges.

The difference between the predicted and actual transformed scores again looks random. But if we try the absolute error:

We see that the error is a bit larger for lower scores, indicating that the scoring for those submissions is less consistent than for the rest. Note that if we use the absolute error for test 1, the error still looks random with no significant trend.

For the second part of the computational results, we introduce another commonly used transformation, where each judge's scores are divided by their standard deviation and the mean of the weighted scores is used for ranking. By default this assumes a lower bound of zero, but we only need one more transformation step for a non-zero lowest possible score.
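This weighted-mean baseline can be sketched as follows (names are my own; the extra step for a non-zero lower bound is the shift shown in the comment):

```python
import numpy as np

def weighted_mean(x, lower_bound=0.0):
    """Divide each judge's scores by that judge's std dev, then average."""
    x = np.asarray(x, dtype=float) - lower_bound  # shift so the floor is 0
    scaled = x / x.std(axis=0)                    # per-judge (column) scaling
    return scaled.mean(axis=1)

# A wide-scale judge (column 2) no longer dominates the combined score.
x = np.array([[51, 51, 95],
              [49, 49, 5]], dtype=float)
w = weighted_mean(x)
assert w[0] > w[1]  # the majority-preferred submission now ranks higher
```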

We run 1000 iterations using the source from test 1:

Iteration: 1
Simple Mean 2.41738495073
Algorithm 2.63983901024
Weighted Mean 1.96054839267

Iteration: 2
Simple Mean 2.20794021658
Algorithm 3.37268439081
Weighted Mean 2.7041634566
...
Iteration: 1000
Simple Mean 1.53093108924
Algorithm 2.63983901024
Weighted Mean 1.77658380044

RMSE for Simple Mean 2.23874267168
RMSE using Algorithm 3.14210816968
RMSE of Weighted Mean 2.31559252676

Original-Alg CI -0.9917 -0.9449
Reject Original = Alg

Original-weighted CI -0.1149 -0.0716
Reject Original = weighted

Note that the estimator is actually the mean ranking shift for every submission, and the range is widened by 50% in the computation. Also, since the judgements highly agree with each other, the standard deviations do not vary too much, so the weighted mean behaves similarly to the simple mean.

Source from test 2:

Iteration: 1
Simple Mean 3.39504282587
Algorithm 3.68496232931
Weighted Mean 3.67781399604

Iteration: 2
Simple Mean 3.04354364101
Algorithm 2.69502465568
Weighted Mean 2.71448357015
...
Iteration: 1000
Simple Mean 0.0
Algorithm 0.858395075279
Weighted Mean 0.827170191869

RMSE for Simple Mean 3.39444980427
RMSE using Algorithm 3.46425734428
RMSE of Weighted Mean 3.47411859325

Original-Alg CI -0.1656 -0.1220
Reject Original = Alg

Original-weighted CI -0.1676 -0.1242
Reject Original = weighted

This is very interesting. Our conclusion is that the scoring in test 2 is more problematic than in test 1, yet the gap between the simple mean and the algorithm has narrowed. We can understand it this way: the scales of the two contests (i.e. $n$ and $p$) are similar, so the algorithm works with similar efficiency on both data sets; it is the simple mean (and likewise the weighted mean) that did worse, because on more problematic scoring data their ability to reflect the proper ranking drops.

Lastly, we make up an example where the scores highly agree but on different scales.

Iteration: 1
Simple Mean 0.387298334621
Algorithm 0.316227766017
Weighted Mean 0.316227766017

Iteration: 2
Simple Mean 0.0
Algorithm 0.22360679775
Weighted Mean 0.22360679775
...
Iteration: 1000
Simple Mean 0.22360679775
Algorithm 0.0
Weighted Mean 0.0

RMSE for Simple Mean 0.249699819784
RMSE using Algorithm 0.261820549232
RMSE of Weighted Mean 0.261820549232

Original-Alg CI -0.0094 0.0101
Accept Original = Alg

Original-weighted CI -0.0094 0.0101
Accept Original = weighted

Here the algorithm and the weighted mean totally agree with each other (this is fine). Meanwhile, the estimator is so small that all three ranking methods reflect the scoring accurately, and by the confidence intervals we accept that the three methods make no difference in representational power.

Conclusion

The belief behind the algorithm is that individual scoring can be, and should be, up to one's (professional) judgement, but the overall result is supposed to reflect the will of the entire group of judges. The algorithm can often scale the scoring effectively, but if the judges are not a proper sample from the population, the scoring is flawed no matter what the algorithm does, because those who score properly will be classified as giving flawed scores. At the end of the day, this is still an algorithm based on numbers, not on the submissions themselves. Picking the right judges is still the basis of unbiased scoring.

Source

Thanks for your patience in reading such a long entry! I still hate myself for not using rigorous maths to deal with these things, but since it works as a rough algorithm for handling scores, I'll happily publish it.

I won't release the scores for test 1, as the individual scores are confidential, but test 2 (which is indeed available online) as well as the last test can be shown to all readers.