There are plenty of resources on the net about Simpson’s paradox, also known as the Yule-Simpson effect or Simpson’s reversal. I used a Google spreadsheet and a couple of paper folders to explain it.
I will spend most of my time going through an imaginary example to see what Simpson’s paradox is and how we deal with it, and then I will discuss some real occurrences of it.
Please imagine that I must undergo heart surgery and, since heart surgery is a serious matter, I want to select the best available surgeon. The heart surgeons in my local hospital are Dr. A and Dr. B, and I need to know which of them is the better one.
Fortunately, a close friend of mine works in the hospital and he smuggles these two folders out to me. They hold the records of the latest operations performed in the hospital. So I just need to look at the outcome of the operations performed by each doctor to find out which one is better. In fact, I hardly need statistics; I just need to count.
I take every record from folder number 1 and I write down if the patient survived the operation and which doctor performed it. You can see survivors in green, and deceased patients in red, and here is the result: out of 12 patients of Dr. A, 2 died, but out of 34 patients of Dr. B, 8 died. It clearly seems safer to be operated on by Dr. A.
Here is the data, although it is easier to read in this spreadsheet.
First folder | dead | total | % dead |
Dr. A | 2 | 12 | 16.7% |
Dr. B | 8 | 34 | 23.5% |

Individual records, first folder, Dr. A:
dead | alive | alive |
alive | alive | alive |
alive | alive | alive |
alive | alive | dead |

Individual records, first folder, Dr. B:
dead | dead | alive | alive |
alive | alive | alive | dead |
dead | alive | alive | alive |
alive | alive | dead | alive |
dead | alive | alive | alive |
alive | alive | alive | alive |
dead | alive | alive | alive |
alive | alive | alive | alive |
dead | alive |

Second folder | dead | total | % dead |
Dr. A | 19 | 42 | 45.2% |
Dr. B | 3 | 6 | 50.0% |

Individual records, second folder, Dr. A:
dead | dead | alive | dead | alive |
alive | alive | dead | alive | dead |
dead | dead | dead | dead | dead |
alive | alive | alive | alive | alive |
dead | alive | dead | dead | alive |
alive | dead | dead | dead | alive |
dead | alive | alive | alive | alive |
alive | alive | alive | alive | alive |
dead | dead |

Individual records, second folder, Dr. B:
alive | dead |
dead | alive |
alive | dead |

Total (both folders) | dead | total | % dead |
Dr. A | 21 | 54 | 38.9% |
Dr. B | 11 | 40 | 27.5% |
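Just to make sure, I take folder 2, and the result is similar: out of 42 patients of Dr. A, 19 died, and out of 6 patients of Dr. B, 3 died, so it’s again safer to be operated on by Dr. A.
However, if I count all the records from both folders together, I get a surprising result: out of 54 patients of Dr. A, 21 died, but out of 40 patients of Dr. B, only 11 died, so it’s safer to be operated on by Dr. B.
That is: when you compare proportions within each folder, a higher proportion of Dr. A’s patients survive, but when you compare all the records together, a higher proportion of Dr. B’s patients survive. That is Simpson’s paradox.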
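If you prefer code to a spreadsheet, the counting can be sketched in a few lines of Python. This is only my illustration: the dictionary layout and the helper names `death_rate` and `safer` are my own choices, and the numbers are the dead and total counts from the two folders.

```python
# Deaths and totals per (folder, surgeon), copied from the folder counts.
records = {
    "folder 1": {"Dr. A": (2, 12), "Dr. B": (8, 34)},
    "folder 2": {"Dr. A": (19, 42), "Dr. B": (3, 6)},
}


def death_rate(dead, total):
    """Proportion of patients who died."""
    return dead / total


def safer(counts):
    """Return the surgeon with the lower death rate in `counts`."""
    return min(counts, key=lambda doctor: death_rate(*counts[doctor]))


# Pool both folders by summing deaths and totals for each surgeon.
pooled = {
    doctor: tuple(sum(records[folder][doctor][i] for folder in records) for i in range(2))
    for doctor in ("Dr. A", "Dr. B")
}

for name, counts in {**records, "both folders": pooled}.items():
    rates = ", ".join(f"{d}: {death_rate(*c):.1%}" for d, c in counts.items())
    print(f"{name}: {rates} -> safer: {safer(counts)}")
```

Each folder on its own points to Dr. A, while the pooled counts point to Dr. B.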
Before seeing how we solve the paradox, I would like to outline why there is a paradox. In fact, it’s not a true paradox but two contradictory results that, according to our intuition, should agree.
For each surgeon, the global proportion of deaths lies somewhere between the proportions for folder 1 and folder 2. Therefore, we intuitively expect that any comparison that holds in both folders will also hold for the whole. However, the global proportion doesn’t have to lie in the middle: it is pulled towards the folder that holds most of that doctor’s records, so it can be closer to folder 1 for one doctor and to folder 2 for the other.
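To make this concrete, each surgeon’s global proportion can be written as a weighted average of the two folder proportions, where each weight is the share of that surgeon’s patients filed in the folder. The sketch below is my own illustration (the function name `decompose` is made up); it only rearranges the counts from the folders.

```python
# (dead, total) for folder 1 and folder 2, per surgeon.
counts = {
    "Dr. A": [(2, 12), (19, 42)],
    "Dr. B": [(8, 34), (3, 6)],
}


def decompose(folders):
    """Return per-folder (weight, rate) pairs and the resulting global rate."""
    patients = sum(total for _, total in folders)
    parts = [(total / patients, dead / total) for dead, total in folders]
    global_rate = sum(weight * rate for weight, rate in parts)
    return parts, global_rate


for doctor, folders in counts.items():
    parts, global_rate = decompose(folders)
    terms = " + ".join(f"{w:.2f} x {r:.1%}" for w, r in parts)
    print(f"{doctor}: {terms} = {global_rate:.1%}")
```

Dr. A’s weights are about 0.22 and 0.78, so the global figure sits near the folder 2 proportion; Dr. B’s weights are 0.85 and 0.15, so the global figure sits near the folder 1 proportion.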
How can we solve the paradox and, more importantly, which surgeon should I choose?
In principle, our decision shouldn’t be based on how records are distributed among folders if they were placed there in an arbitrary way. In fact, if we move some records from folder 1 to folder 2, the paradox disappears. You can try it by going to the spreadsheet, moving rows 4 and 5 from the first folder to the second and checking the percentages again.
Indeed, if we place records at random in both folders, most of the time we won’t get any paradox, and each folder will lead to the same conclusion that we get by considering the whole set of records.
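The random placement can be tried out with a small simulation. This is a rough sketch under my own assumptions: the 94 records are rebuilt from the totals, folder 1 keeps its original size of 46 records, and "paradox" means that both folders favour Dr. A even though the pooled records favour Dr. B. The function names, the number of trials and the seed are arbitrary choices of mine.

```python
import random

# Rebuild the 94 individual records as (surgeon, died) pairs from the totals.
all_records = (
    [("A", True)] * 21 + [("A", False)] * 33
    + [("B", True)] * 11 + [("B", False)] * 29
)


def death_rate(records, surgeon):
    """Death rate of one surgeon in a folder, or None if the folder has no such record."""
    outcomes = [died for s, died in records if s == surgeon]
    return sum(outcomes) / len(outcomes) if outcomes else None


def favours_a(records):
    """True when Dr. A has a strictly lower death rate than Dr. B in `records`."""
    rate_a, rate_b = death_rate(records, "A"), death_rate(records, "B")
    return rate_a is not None and rate_b is not None and rate_a < rate_b


def paradox_frequency(trials=10_000, folder1_size=46, seed=0):
    """Share of random splits in which both folders favour Dr. A."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        shuffled = all_records[:]
        rng.shuffle(shuffled)
        folder1, folder2 = shuffled[:folder1_size], shuffled[folder1_size:]
        if favours_a(folder1) and favours_a(folder2):
            hits += 1
    return hits / trials


print(f"Paradox in {paradox_frequency():.1%} of random splits")
```

The exact percentage depends on the seed and the number of trials, so treat it as a rough estimate rather than a precise figure.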
However, I flipped the folders over and saw labels on them that changed the whole situation: folder 1 contains records of easy operations and folder 2 contains records of difficult operations. Now our folders have become a new, meaningful variable. If we read our results again, we can now say that Dr. A is better at easy operations, and that Dr. A is better at difficult operations, too. What does the overall result mean? It just says that Dr. B performs mostly easy operations while Dr. A deals with the difficult cases. That’s the reason why patients of Dr. B survive more often, even though Dr. A is better.
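One way to put a number on this, which is my own addition and not something the spreadsheet does, is to ask what death rate each surgeon would have if both operated on the same mix of easy and difficult cases. The sketch below uses the mix of all 94 operations as that common case mix (an assumption; any shared mix would do) and applies each surgeon’s per-folder death rates to it. The name `standardized_rate` is hypothetical.

```python
# (dead, total) per surgeon and difficulty, from the two folders.
counts = {
    "Dr. A": {"easy": (2, 12), "difficult": (19, 42)},
    "Dr. B": {"easy": (8, 34), "difficult": (3, 6)},
}

# Common case mix: all operations of each difficulty, whoever performed them.
case_mix = {
    level: sum(counts[doctor][level][1] for doctor in counts)
    for level in ("easy", "difficult")
}


def standardized_rate(by_level):
    """Death rate the surgeon would have if operating on the common case mix."""
    expected_deaths = sum(
        case_mix[level] * dead / total for level, (dead, total) in by_level.items()
    )
    return expected_deaths / sum(case_mix.values())


for doctor, by_level in counts.items():
    easy_share = by_level["easy"][1] / sum(total for _, total in by_level.values())
    print(f"{doctor}: {easy_share:.0%} easy operations, "
          f"standardized death rate {standardized_rate(by_level):.1%}")
```

With the same case mix for both, Dr. A comes out with the lower death rate, which agrees with the folder-by-folder comparison.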
My story of two doctors and two folders is just an example I invented, but there are real-life occurrences of Simpson’s paradox, and some of them have led to big problems. In fact, my example is inspired by data from a real study of two methods of removing kidney calculi in which the paradox arises. The two doctors in my example would be the two methods, and the two folders would be small and large calculi.
A real case of trouble caused by misunderstanding Simpson’s paradox happened in 1973, when the University of California, Berkeley was sued for sex discrimination in rejecting applications. Looked at globally, the university was rejecting a higher proportion of women than men. However, when rejections were counted by department, there was no discrimination against women in any of them; in fact, there was even a small but significant bias favouring women in some departments. The fact was just that most women were applying to the departments that were hardest to get into, and that was the only reason for women being rejected in a higher proportion than men.
Other examples are found in the social sciences when analysing wage rates or school results in societies with mixed ethnic groups. Often scores rise for every ethnic group while dipping for the population as a whole, due to changes in the ethnic composition of the society: if the groups with lower scores grow in size, global scores can dip even while they increase for each group.
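A tiny numerical sketch may help to see how this can happen. The group sizes and mean scores below are invented purely for illustration and do not come from any real study; the function name `overall_mean` is my own.

```python
# Hypothetical (group size, mean score) pairs; the numbers are invented for illustration.
before = {"group 1": (900, 80.0), "group 2": (100, 50.0)}
after = {"group 1": (600, 82.0), "group 2": (400, 55.0)}


def overall_mean(groups):
    """Population mean as the size-weighted average of the group means."""
    population = sum(size for size, _ in groups.values())
    return sum(size * mean for size, mean in groups.values()) / population


for name in before:
    print(f"{name}: {before[name][1]} -> {after[name][1]}")
print(f"overall: {overall_mean(before):.1f} -> {overall_mean(after):.1f}")
```

Both made-up groups improve, yet the overall mean falls, because the lower-scoring group makes up a larger share of the population in the second period.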
However, examples can also be found in the opposite direction: when the groups of observations have little meaning, comparisons should be made globally. Sports statistics are full of such occurrences.
Now, let me conclude that whenever we compare proportions or averages using observations split into groups, we should check whether those groups are meaningful for us and decide accordingly whether to use aggregate data or split data.
Proposed exercise:
The following table shows an extremely simplified recreation of the Berkeley case outlined above. You can take the data and compute what percentage of men’s and women’s applications are rejected in each department and in the whole university. Do the results by department and the global results contradict each other?
department | admitted men | rejected men | admitted women | rejected women |
department 1 | 1 | 1 | 2 | 2 |
department 2 | 4 | 2 | 2 | 1 |
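If you would like to check your answer afterwards, the following Python sketch is one possible way to do the counting. The data layout and the helper names `rejection_rate` and `university_totals` are my own choices; the numbers are those of the exercise table.

```python
# (admitted, rejected) per department and sex, from the exercise table.
applications = {
    "department 1": {"men": (1, 1), "women": (2, 2)},
    "department 2": {"men": (4, 2), "women": (2, 1)},
}


def rejection_rate(admitted, rejected):
    """Share of applications that were rejected."""
    return rejected / (admitted + rejected)


def university_totals(sex):
    """Sum admitted and rejected applications of one sex over all departments."""
    admitted = sum(dept[sex][0] for dept in applications.values())
    rejected = sum(dept[sex][1] for dept in applications.values())
    return admitted, rejected


for name, dept in applications.items():
    print(name, {sex: f"{rejection_rate(*counts):.1%}" for sex, counts in dept.items()})
print("university", {sex: f"{rejection_rate(*university_totals(sex)):.1%}"
                     for sex in ("men", "women")})
```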
A few references:
Berkeley admission bias:
- Paper: http://www.unc.edu/~nielsen/soci708/cdocs/Berkeley_admissions_bias.pdf
- Data visualisation: http://vudlab.com/simpsons/
Other examples:
Acknowledgements:
I would like to thank my fellow participants in the June 2016 Basic Skills and Tools to Teach Content Subjects in English course at the Universitat de Barcelona and our professor M. del Mar Suárez for their comments and encouragement.