Friday, 29 July 2016

Judging Correlation in the Mundial de Tango (more Fun with Data)

Ok, in the previous post I said that there's no agreement among the judges during the final of the Mundial de Tango (Pista) or Tango World Championship. I showed you the charts that convinced me, but I didn't properly measure and show the degree of disagreement.

This second Power BI report has a page for each year available. You can select any two individual judges and see how far they agreed with each other about how to rank the couples. So, if you think, for example, that two of the judges dance a similar style to each other, you can see* if their opinions about the finalists' quality of dance correlate with each other. (Spoiler: nope.)**

To change the year, move to the next page using the arrows at the bottom centre. If the report is too small, misbehaves, or won't fit on your screen properly, try popping it out with the diagonal arrow thing at the bottom right hand corner. You might have to scroll the selectors right and left to see all the judges.

The judges' rankings of the couples do not correlate with one another.

1.00 is a perfect correlation: each judge agrees perfectly with himself or herself. A low correlation between two judges means they didn't agree much, and a negative correlation would mean they ranked the couples in the opposite way to each other. There are one or two cases of small negative correlations.
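For anyone who wants to check a pair of judges by hand, here is a minimal sketch of one way to compute a correlation between two judges' rankings. It assumes a Spearman-style approach (rank the marks, give tied couples the mean of the places they share, then take the ordinary correlation of the two sets of ranks); the report's exact formula isn't spelled out here, so treat that as an assumption. The marks in the example are made up, and the function names are only illustrative.

```python
from math import sqrt

def average_ranks(marks):
    """Rank marks from best (highest) to worst; tied marks share the mean rank."""
    order = sorted(range(len(marks)), key=lambda i: -marks[i])
    ranks = [0.0] * len(marks)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next couple has exactly the same mark.
        while end + 1 < len(order) and marks[order[end + 1]] == marks[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based places pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

def pearson(xs, ys):
    """Ordinary correlation; fails if either list is constant (zero variance)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def rank_correlation(marks_a, marks_b):
    """Correlation between two judges' rankings of the same couples."""
    return pearson(average_ranks(marks_a), average_ranks(marks_b))

# Made-up marks for five couples, purely for illustration.
judge_a = [9.0, 8.5, 8.0, 7.5, 7.0]
judge_b = [7.0, 7.5, 8.0, 8.5, 9.0]  # exactly the opposite order

print(rank_correlation(judge_a, judge_a))  # a judge against himself or herself
print(rank_correlation(judge_a, judge_b))  # two judges in total disagreement
```

A judge who gave every couple the same mark would have no ranking to correlate, which is why the sketch simply fails in that case rather than inventing a number.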

I'm sure all the judges' opinions on people's dancing, in various circumstances, are highly valuable - that's why they were picked to judge - but they have nothing to do with one another, and their collective decisions are therefore, to put it mildly, not much help to anyone else in distinguishing between the finalists.

One reasonable interpretation of this result is that the judges have an impossible task; all the couples in the final dance in much the same way, and there is no real difference between them that the judges could possibly agree about. It is as though you, I, and five of our mates solemnly and conscientiously gave scores to the aesthetic qualities of six eggs from the same nest.

Why are the eggs all from the same nest? Perhaps because any excellent dancer with a visually-apparent difference of style and musicality would, on the face of it, have much to lose and nothing to gain by entering this competition. But even if the dancers were different, while all good, it's not clear that would help; it might be even more meaningless to decide between them.

There may be different interpretations: go ahead and put them in the comments, and let's see if we can think of a way to tell which is right. One would be that there are real differences, but the judges don't agree about which ones are important; they are using totally individual and independent criteria. No information is published about what criteria they use.

In order to distinguish between the couples, the judges would have to agree both on what differences exist and on which ones are important. For example, because of the way the couples get to the final, one of them is usually much older and less mobile than the others. It seems to me that the judges have agreed that the differences which go with that are not important, although I don't have the couple-number data to show that; the only way to get it is to watch the video.

As for what it means, and whether it is a good thing, we began to talk about this in the comments on the previous post.

I think it is a good thing that the Mundial is not like a ballroom competition, with the rigidity and the arms-race that implies; that could be very toxic for something that wants to remain a living social dance.  I don't think that finding the best dancers out of a good bunch is what the Mundial is really for. As I said before, it makes more sense to think that its purpose is to bring a steady stream of decent young salonsters to public notice, while honouring the occasional veteran; it's a very pretty industry-promotion and heritage-publicity thingie, not a sport.

Indeed, perhaps the Mundial has a somewhat paradoxical role in protecting tango from ballroomisation. All the finalists indisputably have good looking technique, whereas there are ballroom schools teaching a genuinely ballroomised Argentine tango with a totally different technique and approach, completely clueless about the social scene. The international dance associations even include it in some of their competitions (and that, for UK readers, was what Vincent and Flavia were up to with their "Tango World Champions" thing, which I've explained elsewhere). We can fairly confidently say that nobody dancing that way would ever get to the Mundial final, at least not in the Pista category - and that is a good thing. It's good that the Mundial exists and people can discover, quite easily, that the ballroom competitions are not it. But the relationship between regular ballroom schools, various international dance organisations, and Argentine tango, is another interesting subject for further research.

It would be great to have judge-level scores from earlier rounds. I'd expect to see a lot more agreement at the lower end; if we could combine that with video, we'd be able to learn something about what criteria are really being applied. And, if so, I'd expect to find that those criteria are by their nature useless in the final. Unfortunately, that data isn't published. If you think you can obtain it, please comment.

Bottom line: there's no evidence here that there's any point in remembering who won.

*You'll notice some straight vertical and horizontal lines in the charts. Judges rank a lot of couples equal with one another. They don't give forty different marks to forty different couples. I haven't done the calculations over the marks separately from the rankings; I thought doing rankings would be clearer, as the judges don't work around any common average. Some judges give out marks only from a restricted set of integers, but others try to make fine distinctions. They see each couple dance three tracks. They see them in groups of ten to a dozen couples, and the couples don't all dance the same tracks - have a look at the post on Music in the Mundial for a description of the procedure, and links to video.
** To be fair, there is one case of a nearly 0.7 correlation, which is very impressive compared with all the others, and you probably could say the two judges involved went together. I won't spoil that one, as it would be much better if you tried to predict who it would be and then looked. Maybe it's real, or maybe it just had to happen accidentally somewhere. There are also some cases of unimpressive 0.3 or 0.4 correlations looking strong against a background of zero to negative correlations. People who are personally acquainted with the judges might feel there was something to say there, but I'm sceptical that it isn't pure chance.


Ghost said...

One thing I'm curious about. For years people have said that it's an open secret that who wins the Mundial is fixed. It's about who you know and loosely speaking, how good a tango ambassador you'll be.

Your work, which let's face it, is pretty damn thorough, seems to blow this out of the water. Ok, yes everyone except the winners could be chaos, but if the conspiracy theory is right, there should be major correlation on the winners.

Are you convinced you've debunked this theory (whether you intended to or not)? Any idea which was the year they supposedly announced the winners before the Mundial? I'd love to see your correlations for that one.

msHedgehog said...

I certainly don't think any such theory is supported by the evidence. You can see the results. These are the only years for which the data is available. Not one I've heard.

Ghost said...

This might explain the chaotic judging and the perception that judges are biased

"A friend called me as I was leaving for La Rural. She has been a judge during previous competitions, but doesn't want to do it anymore. She told me that for her, judging isn't about who you like, it's about who are the best dancers. She may not like or know someone, but if they are the best dancers of tango de salon, they get her vote. Unfortunately, most of the judges pick their friends, not the best dancers."

ie there are biased judges, but they're all biased in different ways!

It's from 2005 though, so a lot could have changed since then.

Link -

I also wonder if the earlier rounds are used to filter out the "social dancers" etc, so that it really doesn't matter who wins the Mundial, all the final round contestants are going to be the kind of couple who will make good Ambassadors for tango?

Thanks for putting in the work, so I can point people to this, next time a conversation heads into "Well, everyone knows the Mundial is fixed..."

msHedgehog said...

The whole concept of "bias" appears to me to be completely meaningless when there are no agreed and published criteria which the judges could possibly be using. They're hired because they're tango authorities, and can decide on whatever criteria they personally like. To talk about bias, you'd have to start by agreeing on what the concept of "the best dancers" actually meant.

If by definition it's whatever the judges think it is, then all we can say is there's not enough consensus there for the rest of us to regard their collective opinion as informative.

El escritor said...

Great work! The evidence strongly suggests that marks are pretty much random and that the winners are very lucky to be the ones chosen. They then go on lucrative world tours while the others generally retire back to Bs As and maybe try again next year.

There's some evidence (not tango related) that people are (considered) good because they are famous, rather than famous because they are good. Hence I guess that this may apply here too.

msHedgehog said...

@El escritor: Indeed. Just for kicks, I tried generating 40 imaginary couples and 7 imaginary judges who assign random marks with an average of 8, a ceiling of 10, a floor of 7, and a standard deviation the same as that observed in 2015. I get almost, but, I think, not quite, indistinguishable results (not yet published). Try it, by all means. I'd value input on exactly how the results can be distinguished.
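If you do want to try it, something along these lines would be one way to start. It is only a sketch under stated assumptions, not the calculation described above: the marks are drawn from a normal distribution and then clipped to the floor and ceiling (which nudges the average slightly above 8), the standard deviation SD and the rounding STEP are placeholder values because the observed 2015 figures aren't given here, and all the names are made up for illustration.

```python
import random
from itertools import combinations
from math import sqrt

N_COUPLES, N_JUDGES = 40, 7
MEAN, FLOOR, CEILING = 8.0, 7.0, 10.0
SD = 0.6    # placeholder, NOT the observed 2015 figure (not stated here)
STEP = 0.5  # placeholder for how coarsely an imaginary judge rounds marks

def random_mark(rng):
    """One random mark: normal draw, clipped to floor/ceiling, then rounded."""
    clipped = min(CEILING, max(FLOOR, rng.gauss(MEAN, SD)))
    return round(clipped / STEP) * STEP

def simulate_panel(seed):
    """panel[j][c] is the mark imaginary judge j gives imaginary couple c."""
    rng = random.Random(seed)
    return [[random_mark(rng) for _ in range(N_COUPLES)] for _ in range(N_JUDGES)]

def ranks(marks):
    """Rank 1 = best; tied marks share the mean of the places they occupy."""
    return [1 + sum(m > x for m in marks) + (sum(m == x for m in marks) - 1) / 2
            for x in marks]

def pearson(xs, ys):
    """Ordinary correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

def pairwise_rank_correlations(panel):
    """Rank correlation for every pair of judges, keyed by (judge, judge)."""
    ranked = [ranks(judge) for judge in panel]
    return {(a, b): pearson(ranked[a], ranked[b])
            for a, b in combinations(range(len(panel)), 2)}

panel = simulate_panel(seed=2015)
correlations = pairwise_rank_correlations(panel)
print(f"{len(correlations)} judge pairs, correlations from "
      f"{min(correlations.values()):.2f} to {max(correlations.values()):.2f}")
```

Running it with several different seeds gives a feel for how large a correlation can turn up between two judges purely by accident, which is the background any real correlation has to be compared against.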

Mike Barrow said...

Hi, I'm back... this is a great way to while away time between milongas. OK, so I've done something which I think is interesting. Even though the judges' decisions seem fairly random, what about the overall decision of the panel? Wisdom of the crowds and all that.

What if the judges voted (majority vote) on each couple compared to each other couple? So they would vote on couple A vs couple B, B vs C, A vs C, etc. Would this give an unequivocal winner, or at least something coherent?

A minimum requirement for consistency might be that if A is preferred to B by the panel vote, and B is preferred to C, then A should be preferred to C. So, for every triple of couples, there should be a clearly preferred winner. Unfortunately, majority voting does not guarantee this and it's possible to get a paradox. See here:

So what happens here? There are some inconsistencies, but not all that many. In 2015 there were 136. This is out of a total of 10,660 possible triples, or 1.3%. That seems quite low, but you couldn't have 10,660 inconsistencies and I don't know the maximum possible.
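The counting described above can be sketched roughly as follows. This is an illustration under assumptions rather than the exact calculation used for those figures: the panel is taken to prefer couple A to couple B when more judges mark A strictly higher than B than the other way round (judges who tie the two couples abstain), and the function names and the three-judge example are invented.

```python
from itertools import combinations

def panel_prefers(marks, a, b):
    """True if more judges mark couple a above couple b than the reverse.

    marks[j][c] is the mark judge j gave couple c. Judges who give both
    couples the same mark are treated as abstaining (an assumption).
    """
    votes_a = sum(judge[a] > judge[b] for judge in marks)
    votes_b = sum(judge[b] > judge[a] for judge in marks)
    return votes_a > votes_b

def cyclic_triples(marks):
    """All triples where the panel's pairwise votes go round in a circle."""
    n_couples = len(marks[0])
    cycles = []
    for a, b, c in combinations(range(n_couples), 3):
        one_way = (panel_prefers(marks, a, b) and panel_prefers(marks, b, c)
                   and panel_prefers(marks, c, a))
        other_way = (panel_prefers(marks, b, a) and panel_prefers(marks, c, b)
                     and panel_prefers(marks, a, c))
        if one_way or other_way:
            cycles.append((a, b, c))
    return cycles

# Invented example: three judges, three couples, the classic voting paradox.
paradox = [
    [9, 8, 7],  # judge 0 ranks couple 0 > 1 > 2
    [7, 9, 8],  # judge 1 ranks couple 1 > 2 > 0
    [8, 7, 9],  # judge 2 ranks couple 2 > 0 > 1
]
print(cyclic_triples(paradox))

# Three judges in complete agreement produce no inconsistent triple.
unanimous = [[9, 8, 7], [9, 8, 7], [9, 8, 7]]
print(cyclic_triples(unanimous))

# The share quoted above: 136 inconsistent triples out of 10,660.
print(f"{136 / 10660:.1%}")
```

How tied votes are handled is a real design choice here: with seven judges who often give equal marks, a pair of couples can end up level, and a different tie rule would change the count.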

As an example, the first three couples in 2015 comprise such a triple, perhaps unfortunately. Couple 1 beats couple 3 and they beat couple 2. However 2 beats 1. The vote is 4-3 in all cases, so close. There is some consensus overall however. Couple 1 is only beaten by couple 2 and couple 2 is itself beaten by 3 and by 5. Couple 3 is beaten by 1 and by 4, and so it goes on. So maybe couple 1 are the genuine winners.

In 2013, couple 2 beat couple 1 by 5 votes to 2, a big margin, and lost to no one. Yet couple 1 were the winner as one judge, Garcia, gave couple 2 a score of only six. Couple 1 were also beaten by couples 5 and 6, so probably were lucky winners.

Similarly, in 2012 couple 3 beat 1 who beat everyone else. And no one beat couple 3, so maybe we should be celebrating Diego Ortega and Andrea Albornoz... One judge, Rodriguez, gave them a 7 and that screwed their chances.

With a few simulations of random data, I reckon that the number of inconsistencies between the judges is about 1/3 to 1/4 of what we'd see with random marking, so there is some method to what they do. So I guess the lesson from all of this is that the averaging across judges largely works in finding (at least approximately) the best couple. (I was going to write 'wisdom of the crowd' but realise it's not quite right - that applies to a large group of people who have little, if any, knowledge of the subject.)

Caveat: I *think* I got all the calculations right...


msHedgehog said...

@Mike: that seems like an interesting approach. I'm not entirely sure what it proves - have to think about it. I think it supports the idea that the judges' task might be easier if they were simply asked to rank a smaller number of couples, rather than assign scores to each of forty.

In analysing the random data I notice it seems to matter whether the random imaginary judges award finer gradations of marks - that is, how far the scores are rounded or how many different marks each judge actually uses. That dramatically reduces the number of ties. I think that might affect your comparison, too. And I wonder why we see the occasional - but just the occasional - tenth or hundredth. I'd guess it occurs when a judge wants to do precisely that thing; rank a specific couple against another specific couple.
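To see the effect of rounding on ties, a small sketch like this one counts how many pairs of couples a single imaginary judge marks exactly equal at different levels of rounding. The distribution (normal, clipped between 7 and 10), the 0.6 standard deviation and the list of rounding steps are all placeholder assumptions, not observed values.

```python
import random
from itertools import combinations

def random_marks(step, n_couples=40, seed=1):
    """Marks from one imaginary judge, rounded to the nearest multiple of step."""
    rng = random.Random(seed)
    marks = []
    for _ in range(n_couples):
        # 8.0 mean, 7.0 floor, 10.0 ceiling; the 0.6 spread is a placeholder.
        clipped = min(10.0, max(7.0, rng.gauss(8.0, 0.6)))
        marks.append(round(round(clipped / step) * step, 2))
    return marks

def tied_pairs(marks):
    """Number of pairs of couples given exactly the same mark."""
    return sum(a == b for a, b in combinations(marks, 2))

# The same seed is used each time, so only the rounding differs.
for step in (1.0, 0.5, 0.1, 0.01):
    print(step, tied_pairs(random_marks(step)))
```

A judge marking forty couples in whole numbers from 7 to 10 has only four marks to hand out, so at least 180 of the 780 pairs must be tied whatever the judge thinks of them.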

stompyzilla said...

Kudos for some fascinating data science, and thanks for all the hard work to gather and clean the data!! I've always been curious about judging at these tournaments.

When I watch youtube clips knowing who was highly ranked, I always wonder if I would have picked those winners had I been a judge. By contrast, some couples I certainly would have placed outside the winners' circle. Perhaps that's the aggregate scoring dynamic, a concurrence that some couples are definitely not the best, revealing by contrast a pattern of winners.

I do wish they used a semantic scale to ground the meaning of a 5 versus a 10, etc, to encourage each judge to have equal ranking power. As for qualities of the dance such as feeling, style, musicality, and precision, of course they are valued differently, which explains the scatter. Yay for this variety, it helps make the dance addictive and infinite :-)

Seems like you convincingly demonstrate the judging is not wholly fixed, right?! Weak correlation is certainly better than random. There are few negative correlations. And I wonder if multiple correlations exist that we can't quite see. Not that I'm suggesting you crunch more data ... looks like fun if I had time.