Monday 25 July 2016

Judging in the Mundial Final (Fun with Data)

You would think it would be easy to download the scores for a fairly simple dance competition. There are forty-odd pairs of competitors, there are seven judges, the judges observe the competitors doing their thing, and each judge utters a score for each pair. The scores are recorded and tabulated, an average is calculated for each pair, and they are ranked accordingly. It's that simple. They don't even do a 'sporting average' - which would mean knocking off the highest and lowest scores before calculating. Repeat yearly.

As it turned out, it was rather a pain, but the data for 2015 was published by someone who apparently knew what they were doing and could create a relatively sensible PDF table of results, so I started there. But below, you can explore results for each year from 2012, which is where we start to get half-way useful data. [Edit: I forgot to mention that I use, here, only the Tango Pista (improvised Tango de Salon) competition in the Mundial. I do not look at Tango Escenario (choreographed 'Stage Tango'). That might be a useful comparison.]

The data is not perfect; in particular, there are errors in the names of couples, where I had to look these up from different documents that were very poorly formatted, and I didn't have time to fix all the problems. There are lots of messed-up accented characters, and some town or country names mixed in with the couple names. But the relationship between couple ID number and score should always be right, and the name recognisable, where it's available at all.

It's possible that there is cleaner data somewhere else, but I decided to go entirely from the official website and do the data cleaning myself. If two people do this independently, that's no bad thing.

Before starting, I had some questions.

  1. How much agreement is there between the judges about which couples are better than others?
  2. If the highest and lowest scores were rejected before calculating the average, as is done in most competitions with subjective scoring, how much difference would it make to the results? 
  3. Supposing there is agreement between the judges, is there anything we can observe about the couples that explains high or low scores?

Below, I've embedded a Power BI dashboard addressing these questions.

It's interactive. You can navigate between the pages using the arrows at the bottom, and select the year using buttons.  It has a page of notes, but I'm going to repeat the gist of them below. The big tables take several seconds to load. If you can't see it well, it may behave better if you make it full screen using the arrow thing at bottom right.



The data all comes from http://festivales.buenosaires.gob.ar/, but you can download my cleaned-up compilation instead (from a few minutes after posting time).

For some years, the names of the couples are not given in the final rankings, only their competition numbers. Where possible, I have looked up the names from the published scores of preliminary rounds. I assume that the couple's ID number stays the same throughout the competition. Not all couple numbers appear in the scores of preliminary rounds, perhaps because they reached the final via other rounds in other countries or other competitions. In these cases, the couple name reads "Not Provided" with the year and ID number.

In this report, as well as the official average, I also calculate what I call the "sporting average" as used in most subjectively scored competitions; that is, the average if you ignore each couple's highest and lowest scores. Finally, I calculate the standard deviation of the scores.
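
For concreteness, here is the arithmetic as a minimal sketch, in Python rather than the Power Query and DAX I actually used; the seven scores are invented, and I've used the population standard deviation (the report may differ on that detail):

from statistics import mean, pstdev

# One couple's scores from the seven judges (invented for illustration).
scores = [7.0, 7.5, 7.5, 8.0, 8.0, 8.5, 10.0]

official = mean(scores)                  # the official average
sporting = mean(sorted(scores)[1:-1])    # drop one highest and one lowest
spread = pstdev(scores)                  # population standard deviation

print(round(official, 2), round(sporting, 2), round(spread, 2))
# 8.07 7.9 0.9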

The pages are as follows:
  1. Scores chart - shows the scores given by each judge in the selected year.
  2. Hi/Lo chart - shows the high and low scores, and the average, for each couple.
  3. Ranks chart - shows how far the judges agreed on how to rank the couples.
  4. Scores table - shows how many places each couple moves if you ignore high and low scores in calculating the average.
  5. Ranks table - shows detail of how each judge ranked the couples. If they gave two couples equal scores, those couples get the same rank (see the sketch after this list).
  6. Competition ID - we'll come back to this below.
  7. Notes, basically this information.
  8. A table of all the data, not as it looks in the underlying spreadsheet, but as it looks after Power Query mashes all the years into one data set for calculations. This also shows the average score and the standard deviation calculated over the population as a whole; you can select individual years and judges.
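
As promised in item 5, here is the tie rule as a minimal Python sketch. I don't know the scrutineers' exact convention for the couple after a tie, so this assumes standard competition ranking:

def competition_ranks(averages):
    # Rank couples by average, highest first; equal averages share a rank,
    # and the next distinct average skips the tied places ('1224' style).
    order = sorted(averages, reverse=True)
    return [order.index(a) + 1 for a in averages]

# Four invented couples; the two 8.0s tie for 2nd, so the next is 4th.
print(competition_ranks([8.5, 8.0, 8.0, 7.5]))   # [1, 2, 2, 4]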

Question 1: agreement between the judges

There is not very much consensus between the judges on either the score or the ranking of any particular couple. They make it difficult for themselves to make fine distinctions by not awarding the full range of marks. Marks are out of ten, but the lowest that appears in any of the clasificatorias (the qualifying rounds, not shown in this data) is 3.75.

I see a floor in the marks for the final; in 2015 the flat lines at 7 stand out in the scatter of scores, as though the judges felt collectively that anything lower would be impolite.

The second-placed couple in 2015 has a high score of 10 and a low score equal to that of the lowest couple. The first-placed couple were not ranked first by any judge. The only couple ranked first by more than one judge was placed 9th. To find the lowest-placed couple who were ranked first by at least one judge, we have to go down to the couple placed 25th overall. The lowest-ranked couple with a top-three ranking from at least one judge were placed 39th of the 41 couples. Looking at the other years, 2015 does not look atypical. In 2012 and 2013, exactly one of the top five was placed first by more than one judge, and in 2014 two of them were, including the winners.

There seems, looking at the Hi-Lo charts, to be slightly more consensus at the bottom than at the top, but this could be just because of the unofficial floors (which it looks as though not every judge agrees on). When I look at the chart of rankings, rather than scores, I don't see any more agreement at the lower end than the higher end.

In the ranking table, you can de-select a particular judge or combination of judges to see how your favourite couple might have done without them.

On only two occasions since 2012 has any one of the top five couples been placed first by more than a single judge.

On the final page of the report you can look at the standard deviation in the scores awarded by individual judges. Some judges appear in more than one year, sometimes with their names formatted differently, as full names were given in only one year. If a judge has a higher standard deviation, it means they awarded a wider range of marks; presumably, they were more convinced that some couples were better than others. A lower standard deviation means they awarded similar marks to everyone. Unfortunately, the judges don't seem to agree on which couples they should, or should not, be so convinced about.

Question 2: Sporting Average

Because the marks are, in my view, all over the place anyway, eliminating high and low scores before calculating the average doesn't make a lot of difference to the competition overall. It does make a difference to individual couples: it would have reversed the top two in 2015, and the couple placed 30th would have risen 8 places. That is the largest gain in any year; the same gain also occurred in 2014. The largest loss is 12 places, in 2012, and there seem to be bigger losses than gains for individual couples generally: someone goes down by a lot, and everyone they drop below gains one place. This seems consistent with the observed 'marking floor'; when a judge disagrees with their peers, they apparently tend to do so by awarding a very high mark rather than by going below the general 'floor' for that year.
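
To make that concrete, here is a sketch of the 'places moved' calculation, putting the trimmed average and the tie-aware ranking together on an invented three-couple field:

from statistics import mean

def ranks(values):
    # Competition ranking, highest first; ties share a rank.
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

# Rows are couples, columns the seven judges (invented scores).
field = [
    [7.0, 7.5, 8.0, 8.0, 8.5, 9.0, 10.0],   # one very high outlier
    [7.5, 8.0, 8.0, 8.0, 8.5, 8.5, 8.5],    # consistent marks
    [7.0, 7.0, 7.5, 8.0, 8.0, 8.5, 9.0],
]
official = ranks([mean(s) for s in field])
sporting = ranks([mean(sorted(s)[1:-1]) for s in field])
moves = [o - s for o, s in zip(official, sporting)]   # positive = a rise
print(official, sporting, moves)   # [1, 2, 3] [1, 1, 3] [0, 1, 0]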

Question 3: Is there anything we can observe about the couples that goes with high or low scores?

There isn't, in my view, enough agreement between the judges - or enough good video - to say much about this question.

I noticed that there was a pattern to the numbers pinned on the couples' suits; there are a lot more low ones. Closer inspection of the source data shows that this probably has something to do with the geographical origin of the couple and their route to the final. The system of awarding numbers is not covered in the published rules, but it seems the lower numbers are given in Buenos Aires and the higher numbers further afield.

So, taking this as a proxy for where couples came from, I checked to see if it was also related to their scores, and this is shown in the final chart, "Competition ID". Answer: not really.

The line in the same chart shows the average score for each block of 10. There are more couples with lower numbers, so perhaps we'd expect their average score to end up closer to the overall average of all couples than it is; it's rather higher. But those couples are also likely to have had more serious competition in previous rounds, which should also drive their average up compared to everyone else arriving via other routes. There isn't an obvious relationship between couple number and score as such. The foreigners do fine too; there just aren't that many of them.
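
The blocking itself is simple. A minimal sketch, assuming the data is reduced to plain (competition ID, average score) pairs; the numbers are invented:

from collections import defaultdict

def average_by_block(couples, block_size=10):
    # Average the scores within each block of competition IDs (1-10, 11-20, ...).
    blocks = defaultdict(list)
    for couple_id, score in couples:
        blocks[(couple_id - 1) // block_size].append(score)
    return {f"{b * block_size + 1}-{(b + 1) * block_size}": round(sum(v) / len(v), 2)
            for b, v in sorted(blocks.items())}

# Invented (competition ID, average score) pairs.
print(average_by_block([(3, 7.9), (8, 7.6), (14, 7.4), (27, 7.5)]))
# {'1-10': 7.75, '11-20': 7.4, '21-30': 7.5}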

More precise geographical origin of the couples is at least partially given in the source data, but as it's mostly in the form of tiny flags in graphics it would be a lot more work to get it, which I haven't done.

So, basically, no, there isn't anything I can say about how to do well, based on this data. There's no couple who did so clearly well or so clearly badly that you could watch and learn.

General remarks

In my own opinion, it's rather unrealistic of me to look at the Mundial as though it were a sporting competition. If you were really going for an exciting sporting competition, or some sort of mechanism for identifying the best dancers, then you would probably design a rather different event. It might, for example, include challenging tests of the ability to dance well to a variety of music, including milonga and vals, on a floor more than one-third full. There might be more rounds, with the judges taking longer looks at fewer couples in each. Judging criteria would be a matter of public record, rather than rumour. And there would be a system for creating agreement between the judges over time, beyond simply agreeing that scores below 7 were impolite. What it is, rather, is a marketing exercise for the 'Tango Salon' industry, designed to honour the heritage and disseminate awareness of the music and dance, while bringing lots of young couples who dance in a certain popular, standard-ish way to public attention and prosperity.

If you are choosing a teacher, having reached the final in the Mundial indicates that a couple dance well in a particular style and have good tango technique, at least when dancing with their competition partner - as opposed to the very different sort of technique that is used for "Argentine Tango" on Strictly Come Dancing. It is not evidence that even one judge in the final thought they were the best. They may have been, but the chances are the judges didn't know - or if they thought they knew, they certainly didn't agree - in which case, I definitely don't know, and you don't know, either. Their ranking within the final says very little, if anything at all.

This is, in my opinion, pretty much how it should be. I don't think a true sporting competition in these circumstances would necessarily be a good idea. It didn't do ballroom any good, as a social dance.

In particular, I think it's probably a good thing that the judges don't agree. Standardisation would be toxic.

I do have a couple more questions.
  • Can we separate the level of disagreement between the judges from the question of whether there is any real difference between the couples that they could possibly measure? I can compare the real data with simulated data based on having and not having a real difference, and the results are amusing, but I think I end up assuming what I set out to prove. It might be more interesting to compare the Campeonato de la Ciudad.
  • Does the order in which the couples are called - in four rondas - have any relation to their scores? I do have at least partial data for this, but putting it together requires some more work.
  • It would be nice to have tidy data about geographical origin, but again, it's a lot of work to peer at all the little flags in the published data and write down what they are, and it probably doesn't tell us much more than the competition ID numbers do; most of the people who are both interested in entering this competition, and competent enough to do well, are Argentinians.
Anyway, enjoy interacting with the report, and go ahead, share and comment. I'll upload the data so you can download it and do your own analysis.

10 comments:

El escritor said...

Hi,
The inner geek in me couldn't resist trying this out. Here is a measure of the agreement between judges. It shows the pairwise correlations between judges, over the 41 couples (2015 data):

| blanco roldan delarosa torres copes cejas galera
-------------+---------------------------------------------------------------
blanco | 1.0000
roldan | 0.2409 1.0000
delarosa | 0.2580 0.2577 1.0000
torres | 0.2143 0.1812 0.2900 1.0000
copes | 0.0647 0.2008 0.0553 -0.1512 1.0000
cejas | 0.1183 0.1373 0.3761 0.2793 0.1287 1.0000
galera | 0.0813 0.3913 0.2001 0.0049 0.2984 0.0621 1.0000

The correlations are generally low (none are statistically significant at 5%) and Copes and Torres seem to have opposing views! Copes stands out as disagreeing more with the other judges. Here are the average correlations for each judge (against the other six):

r1 0.28
r2 0.34
r3 0.35
r4 0.26
r5 0.23
r6 0.30
r7 0.29

Copes (row 5) has the lowest score. A couple of interesting questions: (1) is the correlation higher between judges who themselves have similar dance styles? (2) In psychology, it's known that judges become more severe over the course of a day. Does the same apply here?
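
For anyone who wants to reproduce this, the calculation looks roughly like the following Python sketch (my tables above come from my own stats package, and the score matrix here is invented):

# Rows are couples, columns are judges.
import numpy as np

scores = np.array([
    [7.0, 7.5, 8.0, 7.5],
    [8.0, 7.0, 7.5, 8.5],
    [7.5, 8.0, 7.0, 7.0],
    [9.0, 8.5, 8.0, 7.5],
    [8.5, 8.0, 8.5, 8.0],
])
corr = np.corrcoef(scores, rowvar=False)   # judge-by-judge Pearson matrix
print(corr.round(4))

# Average correlation of each judge with the others (diagonal excluded).
n = corr.shape[0]
print(((corr.sum(axis=1) - 1.0) / (n - 1)).round(2))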

Great post!
Mike

msHedgehog said...

@Mike - Great! Just the sort of comment I was hoping for. In my anxiety to get this out there, after all the faffing about, I didn't get as far as measuring the disagreement more rigorously. There's a lot to say about it, and it might be worth a part 2.

As the judges don't seem to be awarding marks around any common average - some of their lines are just higher than others - I feel like we should measure the correlations between the rankings, rather than the scores themselves. Which did you use?
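
For concreteness, here's the distinction I mean, sketched in Python rather than in my DAX (the two judges' scores are invented):

# Pearson on the raw scores versus Spearman, which is Pearson applied
# to the rankings. Note spearmanr averages tied ranks, which is not
# quite the competition tie rule.
from scipy.stats import pearsonr, spearmanr

judge_a = [7.0, 7.5, 8.0, 7.0, 9.0, 7.5]   # invented
judge_b = [7.5, 7.0, 8.5, 7.5, 8.0, 9.0]   # invented

r, _ = pearsonr(judge_a, judge_b)      # correlation of the scores
rho, _ = spearmanr(judge_a, judge_b)   # correlation of the ranks
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")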

El escritor said...

Hi again,

I only posted the ordinary correlations, but I did try the rank correlation too - it didn't make much difference, so I didn't post it.

I'll post the results for earlier years later on (about to go and dance!). Doing it quickly, it seemed there was more agreement between judges in the earlier years.

Mike.

msHedgehog said...

I've now worked out how to visualise the non-correlation between any two chosen judges in Power BI with a scatter chart. It needs some refinement, though. Calculating the Pearson correlation for the rankings turns out to be much harder in DAX than just using the scores, because you have to tell it exactly how to add up the results of RANKX, but it seems worth doing because it would be totally re-usable. Working on it!

El escritor said...

Well, here's all the correlations, from 2015 back to 2012. After the standard correlations I then do all the rank correlations, in the same manner.
What do we learn?
Rank correlations don't make a lot of difference.
Only a few correlations between judges in 2013 and 2012 are statistically significant (the * indicates significance at 5%). Hence most of it is random!
There was a very cohesive set of judges in 2013.
Eduardo Masci is a real outlier!

Might be interesting to see if judges at least agree on the best two couples (and the worst). Have to think about that.

Long set of results follows... Aargh, comment too long. Split into sections.

Mike.

2015
(obs=41)

| blanco roldan delarosa torres copes cejas galera
-------------+---------------------------------------------------------------
blanco | 1.0000
roldan | 0.2409 1.0000
delarosa | 0.2580 0.2577 1.0000
torres | 0.2143 0.1812 0.2900 1.0000
copes | 0.0647 0.2008 0.0553 -0.1512 1.0000
cejas | 0.1183 0.1373 0.3761 0.2793 0.1287 1.0000
galera | 0.0813 0.3913 0.2001 0.0049 0.2984 0.0621 1.0000


B[7,1]
c1
r1 0.28
r2 0.34
r3 0.35
r4 0.26
r5 0.23
r6 0.30
r7 0.29


2014
(obs=41)

| eduard~i JorgeF~o Javier~z juliob~a VilmaV~a claude~a OlgaBe~o
-------------+---------------------------------------------------------------
eduardomasci | 1.0000
JorgeFirpo | -0.0905 1.0000
JavierRodr~z | -0.1226 0.1511 1.0000
juliobalma~a | -0.0939 0.2734 0.2583 1.0000
VilmaVega | 0.0448 0.1019 0.3668 0.1990 1.0000
claudelina~a | -0.1795 0.2706 0.4213 0.3922 0.2200 1.0000
OlgaBesio | 0.0698 0.0859 -0.0244 0.2158 0.1818 0.0386 1.0000



B[7,1]
c1
r1 0.09
r2 0.26
r3 0.29
r4 0.32
r5 0.30
r6 0.31
r7 0.22


2013
(obs=40)

| zotto martin~y ermocida schapira roldan brufman garcia
-------------+---------------------------------------------------------------
zotto | 1.0000
martinezpey | 0.3627 1.0000
ermocida | 0.6356* 0.4795* 1.0000
schapira | 0.2730 0.2537 0.1477 1.0000
roldan | 0.3086 0.2274 0.2218 0.5632* 1.0000
brufman | 0.4436 0.2098 0.1779 0.3372 0.5321* 1.0000
garcia | -0.0559 0.1322 0.0787 0.0997 0.2217 0.1795 1.0000




B[7,1]
c1
r1 0.42
r2 0.38
r3 0.39
r4 0.38
r5 0.44
r6 0.41
r7 0.24


2012
(obs=41)

| nieves besio balmac~a duplaa lubiz rodrig~z juarez
-------------+---------------------------------------------------------------
nieves | 1.0000
besio | 0.2979 1.0000
balmaceda | 0.1973 0.2151 1.0000
duplaa | 0.2959 0.1449 0.1307 1.0000
lubiz | 0.2593 0.2949 0.6150* 0.1382 1.0000
rodriguez | 0.1362 0.1603 0.3847 0.1180 0.1486 1.0000
juarez | 0.4649* 0.1342 0.2087 0.0928 0.3948 0.1595 1.0000




B[7,1]
c1
r1 0.38
r2 0.32
r3 0.39
r4 0.27
r5 0.41
r6 0.30
r7 0.35


El escritor said...

... and the rank correlations:

Rank correlations

2015
(obs=41)

| blanco roldan delarosa torres copes cejas galera
-------------+---------------------------------------------------------------
blanco | 1.0000
roldan | 0.2213 1.0000
delarosa | 0.1491 0.2063 1.0000
torres | 0.2315 0.2913 0.2966 1.0000
copes | 0.1032 0.2042 0.0159 -0.0777 1.0000
cejas | 0.1779 0.1947 0.3707 0.3181 0.1116 1.0000
galera | 0.0428 0.3602 0.1961 0.0925 0.3121 0.0208 1.0000



B[7,1]
c1
r1 0.28
r2 0.35
r3 0.32
r4 0.31
r5 0.24
r6 0.31
r7 0.29


2014
(obs=41)

| eduard~i JorgeF~o Javier~z juliob~a VilmaV~a claude~a OlgaBe~o
-------------+---------------------------------------------------------------
eduardomasci | 1.0000
JorgeFirpo | -0.1276 1.0000
JavierRodr~z | -0.0400 0.1785 1.0000
juliobalma~a | -0.0800 0.2986 0.1484 1.0000
VilmaVega | 0.0296 0.1269 0.3057 0.1519 1.0000
claudelina~a | -0.1837 0.1695 0.3939 0.3282 0.2335 1.0000
OlgaBesio | 0.1266 0.1164 -0.0595 0.1424 0.2080 0.0124 1.0000



B[7,1]
c1
r1 0.10
r2 0.25
r3 0.28
r4 0.28
r5 0.29
r6 0.28
r7 0.22


2013
(obs=40)

| zotto martin~y ermocida schapira roldan brufman garcia
-------------+---------------------------------------------------------------
zotto | 1.0000
martinezpey | 0.2939 1.0000
ermocida | 0.6370 0.4660 1.0000
schapira | 0.2271 0.2161 0.2158 1.0000
roldan | 0.2949 0.1630 0.2614 0.5297* 1.0000
brufman | 0.3879 0.1293 0.1812 0.2372 0.5121* 1.0000
garcia | -0.0170 0.1450 0.0858 0.1841 0.2390 0.1721 1.0000



B[7,1]
c1
r1 0.40
r2 0.34
r3 0.41
r4 0.37
r5 0.43
r6 0.37
r7 0.26


2012
(obs=41)

| nieves besio balmac~a duplaa lubiz rodrig~z juarez
-------------+---------------------------------------------------------------
nieves | 1.0000
besio | 0.2709 1.0000
balmaceda | 0.0935 0.1732 1.0000
duplaa | 0.2814 0.0504 0.0291 1.0000
lubiz | 0.2147 0.2518 0.4319 0.0699 1.0000
rodriguez | 0.1811 0.1309 0.3219 0.0898 0.0476 1.0000
juarez | 0.4957 0.1259 0.1727 0.0109 0.4051 0.1436 1.0000



B[7,1]
c1
r1 0.36
r2 0.29
r3 0.32
r4 0.22
r5 0.35
r6 0.27
r7 0.34

msHedgehog said...

I'm so pleased that you're having such fun.

Thank you for the rank correlations; this is rather complicated to do in DAX for Power BI, so it's super helpful to get an independent answer. I now get almost the same results for 2015, and I suspect the small differences are due to rounding at various stages of the calculation, so I will go ahead and pull in the other years. The scatter plots are pleasingly chaotic-looking, with intriguing flashes of seeming order.

Looking at the visuals, I see very little indication that there's any agreement at any level. You'd expect the bottom five to be the most likely place to find agreement if there is any, partly because of the 'floor'. Of course, most people feel there's more agreement generally on bad dancing than good, but the kind of bad that people agree on should have been eliminated by the final.

Yokoito said...

Congratulations on all this number crunching... I am deeply impressed, and also a bit jealous of the spare time you must have. In any case, I agree that it is a good thing that Tango is not (yet) as standardised and regulated as ballroom. And I hope it never will be. So maybe the scores given are just for "it looks good", which would explain the large spread and the non-correlations.

El escritor said...

Yes, having some slightly guilty fun - tangueros are supposed to be in a trance rather than playing with stats, no?

Anyway, I thought about how to visualise this. I'm not sure what you're planning but I took a cross plot of the rankings for each pair of judges. If they agreed perfectly the points would lie on the 45 degree line. However, as you have obviously found, they're nothing like this. I have a neat figure showing all cross plots for a single year in the same picture, but unfortunately cannot post this in a comment!
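
But here's the gist of a single panel as a Python sketch, with invented rankings:

# One cross plot: judge A's ranks against judge B's, with the
# 45-degree line that perfect agreement would sit on.
import matplotlib.pyplot as plt

ranks_a = [1, 2, 3, 4, 5, 6, 7, 8]
ranks_b = [3, 1, 5, 2, 8, 4, 6, 7]

fig, ax = plt.subplots()
ax.scatter(ranks_a, ranks_b)
lim = max(ranks_a + ranks_b) + 1
ax.plot([0, lim], [0, lim], linestyle="--")   # perfect agreement
ax.set(xlabel="Judge A rank", ylabel="Judge B rank")
plt.show()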

I did wonder about the scores. Over how many dances are they averaged, for any given couple? Some judges (eg Garcia) only have averages that are integers or in steps of 0.5. Others (eg Brufman) have averages such as 7.05. Does this mean Brufman is finding fine distinctions between couples, or is it the result of averaging over several dances? Given this, rankings might be better.

Hope you're enjoying this too! Have you tried graphing the top or bottom five only, or some such?

Mike.


msHedgehog said...

@Yokoito: indeed. I agree with you that this is arguably a good thing, although I also think the Mundial has a somewhat paradoxical role in protecting tango from ballroomisation: all the finalists indisputably have good-looking technique, whereas there are ballroom schools teaching a genuinely ballroomised Argentine tango with a totally inappropriate (IMV) ballroom technique and approach. However, that's a subject for further research.

@Mike: Judges rank a lot of couples equal with one another. They don't give forty different marks to forty different couples, and that really shows in the cross-plots of rankings. I haven't done the marks separately from the rankings, but it implies that at least some judges give a restricted set of integers as marks; they don't try to make fine distinctions. They only see each couple dance once - have a look at the post on Music in the Mundial for a description of the procedure, and links to video.