Data Mining

When researchers cannot draw their own random samples for epidemiological studies, they often turn to existing databases. These databases can hold useful data, but researchers sometimes manipulate the data to achieve desired results, a practice that some call “data mining.”

Researchers may manipulate such data to generate positive, interesting results and to improve their chances of publication. Data mining may involve excluding certain participants as “outliers” or pulling select subsets from the larger database. The resulting sample may then reflect the researcher’s bias rather than constitute a truly random sample.

For example, many studies on the chemical BPA rely on data mined from the National Health and Nutrition Examination Survey (NHANES),(1) a Centers for Disease Control and Prevention (CDC) program that assesses national health trends. CDC collects health data from a different group of volunteers every year through physical exams and interviews. In addition to recording the volunteers’ health ailments, the data include measurements of BPA in urine and blood. Numerous BPA studies pull data from various years to look for correlations between certain illnesses and the BPA levels in the volunteers’ urine. But the data contain only a single spot measurement of BPA for each volunteer, and BPA levels in the body vary considerably over just hours. A single measurement tells us nothing about overall exposure and hence is inappropriate for drawing conclusions about BPA risk. Yet dozens of studies that rely on BPA data from NHANES have been published in peer-reviewed journals. While these studies make headlines, they do not offer much scientific insight.(2)
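
The weakness of a single spot measurement can be illustrated with a small simulation. The sketch below is not based on NHANES data; every number in it is a hypothetical assumption chosen for illustration, including the spread between people (`between_sd`) and the hour-to-hour spread within one person (`within_sd`), both on a log scale. It shows that when within-person variation is large, one spot reading tracks a person's long-term average only weakly.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def spot_vs_average(n_people, between_sd, within_sd, seed=0):
    """Correlate each person's long-term (log) level with one spot reading.

    All parameters are hypothetical; nothing here comes from real survey data.
    """
    rng = random.Random(seed)
    long_term = [rng.gauss(0.0, between_sd) for _ in range(n_people)]
    # One spot reading = long-term level plus hour-to-hour fluctuation.
    spot = [level + rng.gauss(0.0, within_sd) for level in long_term]
    return pearson_r(long_term, spot)


def expected_correlation(between_sd, within_sd):
    """Correlation implied by this simple additive model."""
    return between_sd / math.sqrt(between_sd ** 2 + within_sd ** 2)


if __name__ == "__main__":
    steady = spot_vs_average(2000, between_sd=0.5, within_sd=0.1)
    noisy = spot_vs_average(2000, between_sd=0.5, within_sd=1.5)
    print(f"stable marker, spot vs. long-term level:   r = {steady:.2f}")
    print(f"variable marker, spot vs. long-term level: r = {noisy:.2f}")
```

Under these assumed values, the highly variable marker yields a much weaker correlation than the stable one, so any real link between long-term exposure and illness would be hard to distinguish from noise using one reading per person.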

The problem can be so bad that James Mills of the National Institute of Child Health and Human Development lamented in a 1993 article in the New England Journal of Medicine: “‘If you torture your data long enough, they will tell you whatever you want to hear’ has become a popular observation in our office. In plain English, this means that study data, if manipulated in enough different ways can prove whatever the investigator wants to prove.”(3)
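
A short simulation can make the "data torturing" point concrete. The sketch below generates pure noise, with no real link between "exposure" and "outcome," and then runs many separate analyses on it. It assumes each analysis is independent and uses the conventional 5 percent significance threshold; the sample size, the critical value of 1.984 (the approximate two-sided t cutoff for 98 degrees of freedom), and all function names are illustrative assumptions. Real subset analyses of one database overlap, so the independence assumption is only a rough approximation.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def count_spurious_hits(n_analyses, n_subjects=100, t_crit=1.984, seed=0):
    """Count 'significant' correlations found in data that is pure noise."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_analyses):
        # Exposure and outcome are drawn independently: no true association.
        exposure = [rng.gauss(0.0, 1.0) for _ in range(n_subjects)]
        outcome = [rng.gauss(0.0, 1.0) for _ in range(n_subjects)]
        r = pearson_r(exposure, outcome)
        # Standard t statistic for testing whether a correlation is zero.
        t = r * math.sqrt((n_subjects - 2) / (1.0 - r * r))
        if abs(t) > t_crit:
            hits += 1
    return hits


def chance_of_at_least_one_hit(n_analyses, alpha=0.05):
    """Probability that at least one of n independent null tests 'succeeds'."""
    return 1.0 - (1.0 - alpha) ** n_analyses


if __name__ == "__main__":
    for n in (1, 20, 100):
        print(f"{n:3d} analyses: chance of a false finding = "
              f"{chance_of_at_least_one_hit(n):.2f}")
    print("spurious hits in 1000 noise-only analyses:",
          count_spurious_hits(1000))
```

Under these assumptions, a researcher who tries 20 different cuts of the data has roughly a 64 percent chance of finding at least one "significant" result even when nothing is there, which is why reporting only the analyses that succeeded can mislead.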


(1) For more information see: National Health and Nutrition Examination Survey, Centers for Disease Control and Prevention, pages last updated August 26, 2013.
(2) Judy S. LaKind, Michael Goodman, and Daniel Q. Naiman, “Use of NHANES Data to Link Chemical Exposures to Chronic Diseases: A Cautionary Tale,” PLoS ONE, Vol. 7 No. 12 (December 2012), e51086.
(3) James L. Mills, “Data Torturing,” New England Journal of Medicine, Vol. 329 No. 16 (October 14, 1993), pp. 1196-1199.
