Pandemic “Big Data Paradox”: 2 Early Vaccination Surveys Worse Than Worthless
The “big data paradox” is a mathematical tendency of big data sets to minimize one type of error, due to small sample size, but magnify another that tends to get less attention: flaws linked to systematic biases that make the sample a poor representation of the larger population. Analysts found that this tendency caused two early vaccination surveys to be misleading, a finding that holds a warning for governments and health officials who rely on such tracking efforts as they formulate policies to battle the pandemic.
When Delphi-Facebook and the U.S. Census Bureau provided estimates of COVID-19 vaccine uptake last spring, their weekly reports drew on responses from as many as 250,000 people.
The data sets boasted statistically tiny margins of error, raising confidence that the numbers were correct. But when the Centers for Disease Control and Prevention reported actual vaccination rates, the two polls were off — by a lot. By the end of May, the Delphi-Facebook study had overestimated vaccine uptake by 17 percentage points — 70 percent versus 53 percent, according to the CDC — and the Census Bureau’s Household Pulse Survey had done the same by 14 percentage points.
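To see why the margins of error looked so reassuring, here is a minimal sketch of the textbook margin-of-error calculation for a proportion. It assumes a simple random sample, a 95 percent confidence level, and 250,000 respondents, and it ignores the weighting the real surveys applied, so it illustrates the arithmetic rather than reproducing either survey's published figures. The function name `margin_of_error` is invented for this example.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Half-width of the normal-approximation interval for a proportion.

    Assumes simple random sampling; z=1.96 corresponds to roughly 95% confidence.
    """
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)

n = 250_000             # upper end of the weekly response counts cited above
survey_estimate = 0.70  # Delphi-Facebook estimate by the end of May
cdc_benchmark = 0.53    # CDC-reported uptake for the same period

moe = margin_of_error(survey_estimate, n)
actual_error = survey_estimate - cdc_benchmark

print(f"Nominal margin of error: {moe * 100:.2f} percentage points")
print(f"Actual error vs. CDC:    {actual_error * 100:.0f} percentage points")
print(f"Actual error is about {actual_error / moe:.0f} times the nominal margin")
```

Under these assumptions the nominal margin is a fraction of a percentage point, while the observed gap is 17 points. The formula measures only random sampling error, so it says nothing about who chose to respond.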
A comparative analysis by statisticians and political scientists from Harvard, Oxford, and Stanford universities concludes that the surveys fell victim to the “big data paradox,” a mathematical tendency of big data sets to minimize one type of error — due to small sample size — but magnify another that tends to get less attention: flaws linked to systematic biases that make the sample a poor representation of the larger population.
The big data paradox was identified and named by one of the study’s authors, Harvard’s Xiao-Li Meng, the Whipple V.N. Jones Professor of Statistics, in his 2018 analysis of polling during the 2016 presidential election. Famous for predicting a Hillary Clinton victory, the polls were skewed by “nonresponse bias,” which in this case was the tendency of Trump voters to either not respond or define themselves as “undecided.”
A biased big data survey can be worse than no survey at all, says Meng, because with no survey, researchers at least understand that they don’t know the answer. When underlying bias is poorly understood — as in the 2016 election — it can be masked by confidence created by the large sample size, leading researchers and readers astray.
“The larger the data size, the surer we fool ourselves when we fail to account for bias in data collection,” the paper’s authors wrote in their analysis, published Wednesday in the journal Nature.
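The effect can be illustrated with a toy simulation. The sketch below is not the authors' method. It assumes a population of one million with a true vaccination rate of 53 percent, and it invents response probabilities (vaccinated people are twice as likely to answer the survey) purely for illustration. It then compares the large self-selected sample with a random sample of only 1,000 people.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

N = 1_000_000
TRUE_RATE = 0.53  # matches the CDC figure quoted above

# 1 = vaccinated, 0 = not vaccinated
population = [1 if random.random() < TRUE_RATE else 0 for _ in range(N)]
true_rate = sum(population) / N

# Hypothetical response behavior (an assumption, not taken from the study):
# vaccinated people answer the survey twice as often as unvaccinated people.
RESPOND_IF_VACCINATED = 0.30
RESPOND_IF_NOT = 0.15

big_biased = [
    y for y in population
    if random.random() < (RESPOND_IF_VACCINATED if y else RESPOND_IF_NOT)
]
big_estimate = sum(big_biased) / len(big_biased)

# A small sample in which every person is equally likely to be chosen.
small_random = random.sample(population, 1000)
small_estimate = sum(small_random) / len(small_random)

print(f"True rate:                 {true_rate:.3f}")
print(f"Big biased survey (n={len(big_biased)}): {big_estimate:.3f}")
print(f"Small random survey (n=1000):   {small_estimate:.3f}")
```

By construction the biased survey converges on about 69 percent, because 0.53 × 0.30 / (0.53 × 0.30 + 0.47 × 0.15) ≈ 0.69, and collecting more responses only makes that wrong answer look more precise. The small random sample is noisier but centers on the true rate.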
The misleading results can be particularly harmful when actions are based on them, the authors note. The governor of a state where a survey shows that 70 percent are vaccinated against COVID, for example, might relax public health measures. If actual vaccination rates are closer to 55 percent, the move could result in a spike in cases and a rise in COVID deaths.