Comments on MsHedgehog: Judging in the Mundial Final (Fun with Data)

@Yokoito: indeed. I agree with you that this is ar...

2016-07-28T23:02:34.243+01:00

@Yokoito: indeed. I agree with you that this is arguably a good thing, although I also think the Mundial has a somewhat paradoxical role in protecting tango from ballroomisation, because all the finalists indisputably have good-looking technique, whereas there are ballroom schools teaching a genuinely ballroomised Argentine tango with a totally inappropriate (IMV) ballroom technique and approach. However, that's a subject for further research.

@Mike: Judges rank a lot of couples equal with one another. They don't give forty different marks to forty different couples, and that really shows in the cross-plots of rankings. I haven't done the marks separately from the rankings, but it implies that at least some judges give a restricted set of integers as marks and don't try to make fine distinctions. They only see each couple dance once - have a look at the post on Music in the Mundial for a description of the procedure, and links to video.

Yes, having some slightly guilty fun - tangueros a...

2016-07-28T14:46:32.830+01:00

Yes, having some slightly guilty fun - tangueros are supposed to be in a trance rather than playing with stats, no?

Anyway, I thought about how to visualise this. I'm not sure what you're planning but I took a cross plot of the rankings for each pair of judges. If they agreed perfectly the points would lie on the 45 degree line. However, as you have obviously found, they're nothing like this. I have a neat figure showing all cross plots for a single year in the same picture, but unfortunately cannot post this in a comment!

I did wonder about the scores. Over how many dances are they averaged, for any given couple? Some judges (e.g. Garcia) only have averages that are integers or in steps of 0.5. Others (e.g. Brufman) have averages such as 7.05. Does this mean Brufman is finding fine distinctions between couples, or is it the result of averaging over several dances? Given this, rankings might be better.

Hope you're enjoying this too! Have you tried graphing the top or bottom five only, or some such?

Mike.

Congratulations for all this number crunching...I ...

2016-07-28T13:29:32.022+01:00

Congratulations on all this number crunching... I am deeply impressed and also a bit jealous of the spare time you must have. In any case, I agree that it is a good thing that Tango is not (yet) as standardized and regulated as ballroom. And I hope it never will be. So maybe the scores given are just for "it looks good", which would explain the large spread and the non-correlations.

I'm so pleased that you're having such fun...

2016-07-27T20:43:23.868+01:00

I'm so pleased that you're having such fun.

Thank you for the rank correlations; this is rather complicated to do in DAX for PowerBI, so it's super helpful to get an independent answer. I now get almost the same results for 2015, and I suspect the small differences are due to rounding at various stages of the calculation, so I will go ahead and pull in the other years. The scatter plots are pleasingly chaotic-looking, with intriguing flashes of seeming order.

Looking at the visuals, I see very little indication that there's any agreement at any level. You'd expect the bottom five to be the most likely place to find agreement if there is any, partly because of the 'floor'. Of course, most people feel there's more agreement generally on bad dancing than good, but the kind of bad that people agree on should have been eliminated by the final.

... and the rank correlations: Rank correlations ...

2016-07-27T10:05:44.003+01:00

... and the rank correlations:

Rank correlations

2015
(obs=41)

             |   blanco    roldan  delarosa    torres     copes     cejas    galera
-------------+----------------------------------------------------------------------
      blanco |   1.0000
      roldan |   0.2213    1.0000
    delarosa |   0.1491    0.2063    1.0000
      torres |   0.2315    0.2913    0.2966    1.0000
       copes |   0.1032    0.2042    0.0159   -0.0777    1.0000
       cejas |   0.1779    0.1947    0.3707    0.3181    0.1116    1.0000
      galera |   0.0428    0.3602    0.1961    0.0925    0.3121    0.0208    1.0000

B[7,1]
c1
r1 0.28
r2 0.35
r3 0.32
r4 0.31
r5 0.24
r6 0.31
r7 0.29

2014
(obs=41)

             | eduard~i  JorgeF~o  Javier~z  juliob~a  VilmaV~a  claude~a  OlgaBe~o
-------------+----------------------------------------------------------------------
eduardomasci |   1.0000
  JorgeFirpo |  -0.1276    1.0000
JavierRodr~z |  -0.0400    0.1785    1.0000
juliobalma~a |  -0.0800    0.2986    0.1484    1.0000
   VilmaVega |   0.0296    0.1269    0.3057    0.1519    1.0000
claudelina~a |  -0.1837    0.1695    0.3939    0.3282    0.2335    1.0000
   OlgaBesio |   0.1266    0.1164   -0.0595    0.1424    0.2080    0.0124    1.0000

B[7,1]
c1
r1 0.10
r2 0.25
r3 0.28
r4 0.28
r5 0.29
r6 0.28
r7 0.22

2013
(obs=40)

             |    zotto  martin~y  ermocida  schapira    roldan   brufman    garcia
-------------+----------------------------------------------------------------------
       zotto |   1.0000
 martinezpey |   0.2939    1.0000
    ermocida |   0.6370    0.4660    1.0000
    schapira |   0.2271    0.2161    0.2158    1.0000
      roldan |   0.2949    0.1630    0.2614    0.5297*   1.0000
     brufman |   0.3879    0.1293    0.1812    0.2372    0.5121*   1.0000
      garcia |  -0.0170    0.1450    0.0858    0.1841    0.2390    0.1721    1.0000

B[7,1]
c1
r1 0.40
r2 0.34
r3 0.41
r4 0.37
r5 0.43
r6 0.37
r7 0.26

2012
(obs=41)

             |   nieves     besio  balmac~a    duplaa     lubiz  rodrig~z    juarez
-------------+----------------------------------------------------------------------
      nieves |   1.0000
       besio |   0.2709    1.0000
   balmaceda |   0.0935    0.1732    1.0000
      duplaa |   0.2814    0.0504    0.0291    1.0000
       lubiz |   0.2147    0.2518    0.4319    0.0699    1.0000
   rodriguez |   0.1811    0.1309    0.3219    0.0898    0.0476    1.0000
      juarez |   0.4957    0.1259    0.1727    0.0109    0.4051    0.1436    1.0000

B[7,1]
c1
r1 0.36
r2 0.29
r3 0.32
r4 0.22
r5 0.35
r6 0.27
r7 0.34

Well, here's all the correlations, from 2015 b...

2016-07-27T10:05:08.030+01:00

Well, here's all the correlations, from 2015 back to 2012. After the standard correlations I then do all the rank correlations, in the same manner.
What do we learn?
Rank correlations don't make a lot of difference.
Only a few correlations between judges in 2013 and 2012 are statistically significant (the * indicates significant at 5%). Hence most of it is random!
There was a very cohesive set of judges in 2013
Eduardo Masci is a real outlier!
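
For a rough feel of where a 5% cut-off sits with about 41 couples, here is a minimal sketch. It uses the Fisher z approximation, which is an assumption and not necessarily the test that produced the stars. The `comparisons` argument is a hypothetical Bonferroni-style adjustment for testing all 21 judge pairs at once; whether the posted stars were adjusted in that way is not stated anywhere in these comments.

```python
from math import sqrt, tanh
from statistics import NormalDist


def critical_r(n, alpha=0.05, comparisons=1):
    """Approximate smallest |r| that is significant (two-sided) for n couples.

    Uses the Fisher z approximation: atanh(r) * sqrt(n - 3) is treated as
    standard normal when the true correlation is zero.
    `comparisons` divides alpha Bonferroni-style (an assumption, see above).
    """
    adjusted_alpha = alpha / comparisons
    z = NormalDist().inv_cdf(1 - adjusted_alpha / 2)
    return tanh(z / sqrt(n - 3))


# One pair tested on its own, 41 couples.
print(round(critical_r(41), 2))
# All 21 pairs of 7 judges tested together, 41 couples.
print(round(critical_r(41, comparisons=21), 2))
```

On these assumptions the unadjusted cut-off comes out at roughly 0.31 and the adjusted one at roughly 0.46. Several unstarred values in the ordinary-correlation tables (0.3913 in 2015, for example) lie between the two, which would fit an adjusted test, but that is only a guess.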

Might be interesting to see if judges at least agree on the best two couples (and the worst). Have to think about that.
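
One possible way to check agreement on the best (or worst) few couples is to count how many couples two judges both place in their top or bottom k. This is only a sketch: the function names, the `marks` layout and the example marks are all made up for illustration, and ties at the cut-off are broken by input order, which matters when a judge gives many equal marks.

```python
from itertools import combinations


def extreme_k(scores, k, best=True):
    """Set of couple ids a judge places in their top k (or bottom k if best=False).

    `scores` maps couple id -> mark. Ties at the boundary are broken by
    insertion order, which is arbitrary.
    """
    ordered = sorted(scores, key=scores.get, reverse=best)
    return set(ordered[:k])


def pairwise_overlap(marks, k, best=True):
    """For every pair of judges, count couples that appear in both judges' top/bottom k."""
    picks = {judge: extreme_k(scores, k, best) for judge, scores in marks.items()}
    return {(a, b): len(picks[a] & picks[b]) for a, b in combinations(picks, 2)}


# Hypothetical marks for four couples from two judges (not real Mundial data).
example = {
    "judge_a": {"c1": 9, "c2": 8, "c3": 7, "c4": 6},
    "judge_b": {"c1": 6, "c2": 9, "c3": 8, "c4": 7},
}
print(pairwise_overlap(example, k=2))              # shared couples in the top two
print(pairwise_overlap(example, k=2, best=False))  # shared couples in the bottom two
```

An overlap of k for a pair means the two judges picked the same couples; 0 means no common ground at that end of the table.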

Long set of results follows... Aargh, comment too long. Split into sections.

Mike.

2015
(obs=41)

             |   blanco    roldan  delarosa    torres     copes     cejas    galera
-------------+----------------------------------------------------------------------
      blanco |   1.0000
      roldan |   0.2409    1.0000
    delarosa |   0.2580    0.2577    1.0000
      torres |   0.2143    0.1812    0.2900    1.0000
       copes |   0.0647    0.2008    0.0553   -0.1512    1.0000
       cejas |   0.1183    0.1373    0.3761    0.2793    0.1287    1.0000
      galera |   0.0813    0.3913    0.2001    0.0049    0.2984    0.0621    1.0000

B[7,1]
c1
r1 0.28
r2 0.34
r3 0.35
r4 0.26
r5 0.23
r6 0.30
r7 0.29

2014
(obs=41)

             | eduard~i  JorgeF~o  Javier~z  juliob~a  VilmaV~a  claude~a  OlgaBe~o
-------------+----------------------------------------------------------------------
eduardomasci |   1.0000
  JorgeFirpo |  -0.0905    1.0000
JavierRodr~z |  -0.1226    0.1511    1.0000
juliobalma~a |  -0.0939    0.2734    0.2583    1.0000
   VilmaVega |   0.0448    0.1019    0.3668    0.1990    1.0000
claudelina~a |  -0.1795    0.2706    0.4213    0.3922    0.2200    1.0000
   OlgaBesio |   0.0698    0.0859   -0.0244    0.2158    0.1818    0.0386    1.0000

B[7,1]
c1
r1 0.09
r2 0.26
r3 0.29
r4 0.32
r5 0.30
r6 0.31
r7 0.22

2013
(obs=40)

             |    zotto  martin~y  ermocida  schapira    roldan   brufman    garcia
-------------+----------------------------------------------------------------------
       zotto |   1.0000
 martinezpey |   0.3627    1.0000
    ermocida |   0.6356*   0.4795*   1.0000
    schapira |   0.2730    0.2537    0.1477    1.0000
      roldan |   0.3086    0.2274    0.2218    0.5632*   1.0000
     brufman |   0.4436    0.2098    0.1779    0.3372    0.5321*   1.0000
      garcia |  -0.0559    0.1322    0.0787    0.0997    0.2217    0.1795    1.0000

B[7,1]
c1
r1 0.42
r2 0.38
r3 0.39
r4 0.38
r5 0.44
r6 0.41
r7 0.24

2012
(obs=41)

             |   nieves     besio  balmac~a    duplaa     lubiz  rodrig~z    juarez
-------------+----------------------------------------------------------------------
      nieves |   1.0000
       besio |   0.2979    1.0000
   balmaceda |   0.1973    0.2151    1.0000
      duplaa |   0.2959    0.1449    0.1307    1.0000
       lubiz |   0.2593    0.2949    0.6150*   0.1382    1.0000
   rodriguez |   0.1362    0.1603    0.3847    0.1180    0.1486    1.0000
      juarez |   0.4649*   0.1342    0.2087    0.0928    0.3948    0.1595    1.0000

B[7,1]
c1
r1 0.38
r2 0.32
r3 0.39
r4 0.27
r5 0.41
r6 0.30
r7 0.35

I've now worked out how to visualise the non-c...

2016-07-27T00:02:07.916+01:00

I've now worked out how to visualise the non-correlation between any two chosen judges in Power BI with a scatter chart. It needs some refinement, though. Calculating the Pearson correlation for the rankings turns out to be much harder in DAX than just using the scores, because you have to tell it exactly how to add up the results of RANKX, but it seems worth doing because it would be totally re-usable. Working on it!
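
Outside DAX, the same idea (a Pearson correlation computed on ranks instead of raw scores) can be sketched in a few lines of plain Python. This is an illustration only, not the Power BI measure. It assumes tied marks share the average of the ranks they span, which is the usual convention for a rank correlation; a tool that gives tied couples the same lowest rank instead would produce slightly different figures.

```python
from math import sqrt
from statistics import mean


def average_ranks(scores):
    """Rank scores so the highest gets rank 1; tied scores share the mean of their positions."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pearson(x, y):
    """Ordinary (product-moment) correlation of two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def rank_correlation(x, y):
    """Rank correlation: Pearson applied to the average ranks of each judge's scores."""
    return pearson(average_ranks(x), average_ranks(y))


# Two hypothetical judges marking the same four couples (made-up marks).
print(average_ranks([8, 7, 7, 6]))
print(rank_correlation([8, 7, 7, 6], [9, 6, 8, 7]))
```

Because only the ordering matters, a judge who marks consistently higher than another but orders the couples the same way still gets a rank correlation of 1.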

Hi again, I only posted the ordinary correlations...

2016-07-26T19:59:33.023+01:00

Hi again,

I only posted the ordinary correlations but I did try the rank correlation too - it didn't make much difference, so I didn't post it.

I'll post the results for earlier years later on (about to go and dance!). Doing it quickly, it seemed there was more agreement between judges in the earlier years.

Mike.

@Mike - Great! Just the sort of comment I was hopi...

2016-07-26T14:50:27.392+01:00

@Mike - Great! Just the sort of comment I was hoping for. In my anxiety to get this out there, after all the faffing about, I didn't get as far as measuring the disagreement more rigorously. There's a lot to say about it, and it might be worth a part 2.

As the judges don't seem to be awarding marks around any common average - some of their lines are just higher than others - I feel like we should measure the correlations between the rankings, rather than the scores themselves. Which did you use?

Hi, The inner geek in me couldn't resist tryin...

2016-07-26T09:37:10.499+01:00

Hi,
The inner geek in me couldn't resist trying this out. Here is a measure of the agreement between judges. It shows the pairwise correlations between judges, over the 41 couples (2015 data):

             |   blanco    roldan  delarosa    torres     copes     cejas    galera
-------------+----------------------------------------------------------------------
      blanco |   1.0000
      roldan |   0.2409    1.0000
    delarosa |   0.2580    0.2577    1.0000
      torres |   0.2143    0.1812    0.2900    1.0000
       copes |   0.0647    0.2008    0.0553   -0.1512    1.0000
       cejas |   0.1183    0.1373    0.3761    0.2793    0.1287    1.0000
      galera |   0.0813    0.3913    0.2001    0.0049    0.2984    0.0621    1.0000

The correlations are generally low (none are statistically significant at 5%) and Copes and Torres seem to have opposing views! Copes stands out as disagreeing more with the other judges. Here are the average correlations for each judge (against the other six):

r1 0.28
r2 0.34
r3 0.35
r4 0.26
r5 0.23
r6 0.30
r7 0.29

Copes (row 5) has the lowest score. A couple of interesting questions: (1) is the correlation higher between judges who themselves have similar dance styles? (2) In psychology, it's known that judges become more severe over the course of a day. Does the same apply here?
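
The per-judge averages can be recomputed from the 2015 table with a short sketch like the one below (the helper names are made up for illustration). One thing it brings out: the posted r1 to r7 figures match the mean of each judge's full row, including the 1.0000 on the diagonal, rather than the mean over the other six judges only. Either way Copes comes out lowest.

```python
JUDGES = ["blanco", "roldan", "delarosa", "torres", "copes", "cejas", "galera"]

# Lower triangle of the 2015 correlation table, copied from the comment above.
LOWER = [
    [1.0],
    [0.2409, 1.0],
    [0.2580, 0.2577, 1.0],
    [0.2143, 0.1812, 0.2900, 1.0],
    [0.0647, 0.2008, 0.0553, -0.1512, 1.0],
    [0.1183, 0.1373, 0.3761, 0.2793, 0.1287, 1.0],
    [0.0813, 0.3913, 0.2001, 0.0049, 0.2984, 0.0621, 1.0],
]


def full_matrix(lower):
    """Mirror a lower-triangular table into a full symmetric matrix."""
    n = len(lower)
    return [[lower[max(i, j)][min(i, j)] for j in range(n)] for i in range(n)]


def row_means(matrix, include_self):
    """Mean correlation per judge, with or without the judge's own 1.0 entry."""
    means = []
    for i, row in enumerate(matrix):
        values = row if include_self else [v for j, v in enumerate(row) if j != i]
        means.append(sum(values) / len(values))
    return means


matrix = full_matrix(LOWER)
with_self = row_means(matrix, include_self=True)     # divides by 7
others_only = row_means(matrix, include_self=False)  # divides by 6

for name, a, b in zip(JUDGES, with_self, others_only):
    print(f"{name:9s} with self: {a:.2f}   other six only: {b:.2f}")
```

Dropping the diagonal lowers every average by a similar amount, so the ordering of the judges changes very little.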

Great post!
Mike