Art or Science: The Perils and Possibilities of Survey Sampling in the Evolving Online World

Introduction

Survey research is a relatively new endeavor in the history of business and science, and the way we identify respondents and capture data has changed considerably over its short life span. From door to door, to telephone, to online, to online mobile, the way research is conducted has been constantly morphing.

Interviewer effects and the impact of interfaces are important issues the survey research world has grappled with. But the thing that has changed more than anything else is the way we connect with the people who answer our surveys: our sample.

We’ve come a long way from the days in which George Gallup attempted true probability sampling by sending people out to knock on carefully selected doors across America. Today, sample mainly comes from a variety of online sources including communities, publishers, advertisements and social media sites. Increasingly, the sample used commercially is an unpredictable blend of all these sources.

But what effect does this have? Does the source of the sample have an impact on who is responding and how they respond? How do people’s motivations for doing surveys fit with how we approach them? Is our approach to “sample” making it more difficult to obtain good data? What are the implications for reliability? And what does that mean for the long term viability of our industry?

These are important questions, ones that we, as an industry, cannot afford to ignore. If we do, they threaten to undermine the basic premise of our offering.

In this post we review the results of a series of studies comparing the findings from different sample sources, including publishers, individual panels, and “river” sample, in the US and Canada. We compare across sources and over time. We draw on large tracking studies, as well as studies conducted specifically to focus on particular sample sources.

In the face of evidence of both reliability and unreliability, bias and accuracy, we consider the role sample source has on determining whether survey research can or should be considered a science.

We conclude by asking all of us in the industry to question how we can change our practices to treat respondents more like people, and less like the commodity we commonly call “sample”.

The death of probability sampling

Real probability sampling became extinct many, many years ago, largely before market research really took off as a discipline and business. As an industry we’ve struggled along with varied attempts to provide representative—or at least reproducible—samples. Adapting to the times, we have used face to face interviews, mail surveys, phone surveys and online surveys, all of which have their biases.

As society has changed, response rates have declined and the best way to reach representative samples has been transformed by how technology mediates our lives. In the past few years, the industry has discovered new ways to source respondents that magically both reduce costs and increase our ability to find “sample” that will fill our difficult quotas. But, as we will see, this low price sample has other costs.

The samples the market research industry uses are convenience samples that attempt to represent the population in a way that is reproducible and consistent with other markers of reality. Some do this more convincingly than others.

Political poll aggregators like Nate Silver provide useful reality checks on the quality of select sample sources [1]. And while the often poor ratings of these polls are sobering, they probably represent a relatively rosy picture of the state of the industry. Few companies want to publicize results from their truly dodgy sample. They tend to save that for market research buyers who are looking for low price sample.

A brave new world

Previous research has shown that we have reason to be concerned about the quality of sample. The Dutch NOPVO project produced pioneering work in 2006 [2], and there were seminal studies by the ARF in 2008 [3] and the MRIA in 2009 [4]. All of them raised questions about online sample quality.

But 2009 is a lifetime ago in the nascent world of online sample. Even the ARF’s FOQ2 study was fielded back in 2012 [5]. FOQ2 compared routed sample with straight panel sample, and serial routing with parallel routing, which gives us some hints about the new world of sample drawn from multiple sources.

In the comparison between traditional sample and routed sample, the FOQ2 researchers saw statistically significant differences on 10% of questions—double what you’d expect by chance. In the comparison of serial routing vs parallel routing, differences were detected on a total of 12% of questions. Not exactly confidence-inspiring stuff.
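To make the “double what you’d expect by chance” benchmark concrete, here is a minimal sketch of how one might check whether a batch of significance tests is throwing up more differences than the roughly 5% false positives expected at a 95% confidence level. The question counts below are hypothetical, not FOQ2’s actual numbers.

```python
from scipy.stats import binomtest  # requires SciPy 1.7+

# At a 95% confidence level, about 5% of comparisons will look
# "significant" by chance alone even when nothing has changed.
EXPECTED_CHANCE_RATE = 0.05

# Hypothetical illustration: 200 questions compared, 20 (10%) flagged
# as significantly different between the two sample treatments.
n_questions = 200
n_significant = 20

# One-sided binomial test: is a 10% hit rate plausibly just noise
# around the 5% chance rate?
result = binomtest(n_significant, n_questions,
                   p=EXPECTED_CHANCE_RATE, alternative="greater")
print(f"Observed rate of significant differences: {n_significant / n_questions:.0%}")
print(f"p-value against the 5% chance rate: {result.pvalue:.4f}")
```

With these illustrative numbers, the excess over 5% is itself very unlikely to be chance, which is why a 10% or 12% rate is troubling.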

Since then, sample from non-panel sources has become ubiquitous and routing has become part of the everyday reality of the sample world. Times have changed.

Findings since FOQ2

Subsequent to the early work that raised questions about online sample quality, the authors of this piece have conducted a number of studies that looked into sample reliability. We review a few highlights here.

In one example we tracked the results of a set of questions about shopping and loyalty card use. The study was conducted in Canada with 4 panels, including Maru/Matchbox’s Angus Reid Forum, and a River sample. The study was originally fielded in May 2013 and then replicated with the same questionnaire and sample sources in June 2014 [6].

The chart below shows the percentage of answers in the study that were statistically significantly different between the two waves, using a 95% confidence interval. The stores and loyalty programs we asked about were mature and there were no known notable changes in either market, so we expected the results to be relatively stable over time.

[Chart: Percentage of answers that differed significantly between waves, by sample source]
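For readers who want to replicate this kind of wave-over-wave check, the underlying comparison is a simple two-proportion test at the 95% confidence level, run question by question. A minimal sketch follows; the counts and sample sizes are hypothetical, not the study’s actual figures.

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical example: respondents saying they use a given loyalty
# card in wave 1 vs. wave 2 of the same sample source.
yes_counts = [132, 160]    # "yes" answers in wave 1, wave 2
sample_sizes = [300, 300]  # completes per wave

# Two-sided z-test for equal proportions; p < 0.05 counts as a
# statistically significant difference at the 95% level.
z_stat, p_value = proportions_ztest(yes_counts, sample_sizes)
print(f"Wave 1: {yes_counts[0] / sample_sizes[0]:.1%}, "
      f"Wave 2: {yes_counts[1] / sample_sizes[1]:.1%}")
print(f"z = {z_stat:.2f}, p = {p_value:.4f} -> "
      f"{'different' if p_value < 0.05 else 'stable'} at the 95% level")
```

Repeating this kind of test across every tracked question, then counting how often p falls below 0.05, yields the sort of percentages plotted above.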

We found that, of the five sample sources, three provided results that were notably different between waves. But the changes they measured did not agree with each other.

Some showed usage increasing, while others showed usage decreasing, with no discernible or consistent patterns. We know that the source that showed the most change (Panel B) also changed where it obtained its panel members. Two of the sources showed usage patterns remaining stable, with the only differences occurring in less than the expected 5% of cases.

So we are seeing poor reproducibility from three of the five sample sources, one of them being the river sample and one being a large “panel” which had changed the source from which it recruited panelists during the study period.

60,000 People and a lot of unexplained variability

Another study tracked familiarity with well-known brands—expected to be a stable phenomenon. It was conducted with three waves of over 19,000 people per wave over a five-month period.

The sample was drawn from seventeen sample providers, including a river source. The research revealed that “over half of sample vendors show unexplainable, yet statistically significant, differences each wave” [7].

Houston, we have a problem.

But where does the problem come from? And what does it have to do with respondent motivations?

The quiet revolution of sample sources

The advent of social media and publisher sources for sample has radically changed the “sample” we use in recent years. We have gone from relying on panels of known respondents—people we have vetted and profiled—to streams of unknown respondents whose motivations for answering our questions are variable and not always aligned with our aim of collecting reliable and useful data.

Research by Complete, a Millward Brown Company, tracks the actual URL source of sample sold through the largest panel and other sample providers in the United States. It shows that, in the past few years alone, most sample—including sample sold by “panel” sources—has become unknown, non-panel sample that is simply resold from secondary sources such as social media, paid survey sites and routers.

The chart below shows the radical change in the source of sample coming from the major sample providers across the industry [8].

[Chart: Source of sample sold by major providers across the industry, 2013–2016]

This represents a sea change in the nature of sample, one that is little discussed and even less researched. But hey, who wants to ask hard questions when we can save a few bucks on sample while also making it more feasible to fill difficult quotas?

The reality is, we as an industry need to think hard about questions around this change in sample. What implications does this have for who these people are and why they are doing surveys? Do samples coming from these sources really represent the population, beyond seeming to meet the demographic requirements? What does this mean for data quality?

As we will see, our research suggests these changes have implications for respondent engagement and, subsequently, data quality.

Connecting the dots – Linking sample source, respondent engagement and data quality

In March 2016 we conducted a study with over 9,000 American adults from 7 sample sources [9]. The study included measures of things unlikely to change in the aggregate over short periods of time (visiting the dentist in the past six months, owning a car, overall health, etc), as well as questions pertaining to motivations for participating in surveys.

The study was conducted in two waves, about a week apart, so that we could look at reliability. Of the seven sample sources, one was a river sample sourced through Fulcrum and another came from a multi-reward community—where people can earn rewards for playing games, shopping online, searching online and, of course, doing surveys. The remainder were all panels, including our SpringBoard America community, though many of these “panels” were also using sample piped in from other sources.

One of the sample sources included was a large well-known US panel which we’ve called Panel A. We focus on them in this analysis because they help us see a connection between sample source, respondent engagement and data quality.

Over the past three years, Panel A has shifted from sourcing most of its sample from its own panel to bringing in almost three quarters of the sample it sells from a range of non-panel sources.

[Chart: Panel A’s sample sources over the past three years]

When we look at the reliability of the different sample sources, we see that Panel A is the least consistent, with statistically significant differences between wave 1 and wave 2 on 11% of the 73 items we measured.

[Chart: Percentage of questions with statistically significant differences between waves, by sample source]

The river sample and the multi-reward community sample also differed between waves more often than classic sampling theory would lead us to expect.

Respondents from Panel A also tended to answer somewhat differently than those from the other sample sources, especially on the questions about why they do surveys. Overall, they were less motivated by intrinsic rewards like feeling like a “trusted advisor”, feeling they are “doing my part as a good consumer and citizen when I provide feedback”, and doing surveys to “learn new things”. The chart below shows just one such example, but it is indicative of numerous others.

[Chart: Motivations for doing surveys, by sample source]
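For those who want to test such differences formally, one simple approach is a chi-square test of independence on the cross-tabulation of sample source by response. The sketch below uses invented counts purely for illustration; it is not the study’s data.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical cross-tab: rows are sample sources, columns are responses
# (agree / neutral / disagree) to an intrinsic-motivation statement such
# as "I feel like a trusted advisor when I do surveys".
observed = np.array([
    [240, 180, 80],   # own panel
    [150, 200, 150],  # Panel A (largely non-panel sourced)
    [190, 190, 120],  # river sample
])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p_value:.4g}")
# A small p-value indicates the response distribution differs by source.
```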

With Panel A we see an example of largely non-panel sample being less engaged and providing less reliable data.

This suggests that there is a connection between sample source and the motivations of the people responding, one that has implications for the quality of the data we collect and, ultimately, the decisions made based on those data.

Sample source can create other problems as well

As part of the study we also asked about social media habits and which social media networks people were active on. The sample from the multi-reward source—often drawn in through social media—turned out to be notably different in their social media behavior. They tended to be engaged in more social media sites across the board. The chart below illustrates one of a number of differences, relative to the other sample sources and to Pew’s 2015 Social Media research.

[Chart: Social media activity among the multi-reward sample vs. other sources and Pew’s 2015 data]

Sample that is answering to get access to gated content

We also studied data coming from a well-known service that encourages publishers to “monetize your website’s content” and “join the hundreds of publishers who are using [this service] to earn revenue from their content.” So the people who are answering the questions are doing so solely because they want to get past the survey and to the content they desire.

In this study we tracked the percentage of people who said they were active on a number of the most common social media sites over a two-year period, from April 2014 to February 2016 [10]. What we found were data suggesting wild increases and decreases in social media habits.

According to the latest wave, somewhere between January 2015 and February 2016 there was a stark plunge in the number of people using social media sites, including Twitter and Instagram.

[Chart: Percentage active on various social media sites by wave, publisher-sourced sample]

These data appear implausible and, indeed, when we compare them to data from our 2016 study and to Pew’s Social Media tracking data we see a very different story. Not only does sample from this source grossly underestimate the prevalence of social media activity, it also suggests dramatic upheaval where Pew’s data would indicate there has been little change.

[Chart: Social media activity compared with Pew Research Center data]
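One straightforward way to quantify this kind of external validity check is to compute each source’s average absolute deviation from a benchmark such as Pew’s estimates. The figures below are invented placeholders rather than the actual study or Pew numbers.

```python
# Hypothetical benchmark comparison: share of adults active on each
# social media site, by sample source, vs. an external benchmark.
# All figures are invented placeholders for illustration only.
benchmark = {"Facebook": 0.72, "Instagram": 0.28, "Twitter": 0.23}

sources = {
    "Own panel":        {"Facebook": 0.70, "Instagram": 0.30, "Twitter": 0.25},
    "Publisher sample": {"Facebook": 0.45, "Instagram": 0.12, "Twitter": 0.08},
}

# Mean absolute deviation from the benchmark, per sample source:
# smaller is better, and a large gap flags a problem source.
for name, estimates in sources.items():
    deviations = [abs(estimates[site] - benchmark[site]) for site in benchmark]
    print(f"{name}: mean absolute deviation = {sum(deviations) / len(deviations):.1%}")
```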

Why such an epic fail?

There are many reasons why the data from this publisher-sourced sample could be so wrong. They could be sampling an unrepresentative set of people. The data could be weighted incorrectly because of error in the imputation methods used to estimate the demographics. And it could be that the people answering the questions just don’t care about the survey, so they don’t bother to answer correctly. It is probably a combination of all those things.

The sample might be unrepresentative, but we do know the provider sources from across a host of publishers—to mitigate the risk of getting sample mainly from people trying to get access to a book called, say, “I Hate Facebook”. By sourcing from multiple places, the hope is that this kind of bias is at least mixed and muted.

The fact that the demographics are estimated rather than measured directly is problematic, and certainly there are errors. But when we compared the weighted and unweighted data, the differences were nowhere near large enough to account for the kind of variation we observed here. That leads us back to the motivations of the people answering the survey and the impact they have on the accuracy and reliability of the data.
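For readers unfamiliar with what “weighted vs. unweighted” means in practice, here is a minimal sketch of simple cell-based weighting on one demographic variable. The respondent data and population targets are hypothetical, and real studies typically rake on several variables at once, but the sketch shows why a modest weighting adjustment usually moves an estimate by only a point or two.

```python
import pandas as pd

# Hypothetical respondent file: age group and whether they use Instagram.
df = pd.DataFrame({
    "age_group": ["18-34"] * 150 + ["35-54"] * 200 + ["55+"] * 150,
    "uses_instagram": [1] * 90 + [0] * 60      # 18-34
                    + [1] * 70 + [0] * 130     # 35-54
                    + [1] * 20 + [0] * 130,    # 55+
})

# Hypothetical population targets (share of adults in each age group).
targets = {"18-34": 0.30, "35-54": 0.34, "55+": 0.36}

# Cell weighting: weight = target share / sample share for each cell.
sample_shares = df["age_group"].value_counts(normalize=True)
df["weight"] = df["age_group"].map(lambda g: targets[g] / sample_shares[g])

unweighted = df["uses_instagram"].mean()
weighted = (df["uses_instagram"] * df["weight"]).sum() / df["weight"].sum()
print(f"Unweighted Instagram usage: {unweighted:.1%}")  # ~36%
print(f"Weighted Instagram usage:   {weighted:.1%}")    # ~35%
```

Even with a noticeable skew in the age distribution, the weighted and unweighted estimates differ by only about a point here; demographic weighting cannot rescue data when the underlying answers are careless.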

Discussion

The research reviewed in this post suggests that the source of sample matters a great deal. It has implications for the representativeness of the information and for the motivations of the people completing the survey which, in turn, have further implications for reliability and representativeness.

We believe that there is cause for concern when we see that, industry wide, there is a dramatic increase in the amount of sample coming from unknown and varied sources, often less motivated to participate in surveys for intrinsic reasons. When we can connect that to poor data quality, we are doubly concerned.

One of the reasons companies—both sample providers and buyers—have gravitated toward river-type samples of unknown respondents is that they have become a lower cost source of sample. But this cheap sample has a hidden cost: accuracy.

Trade-offs between price and quality are common in many markets. But for an industry whose raison d’etre is to provide reliable data to inform decision making, that trade-off would seem to be not just a bad deal, but a proposition that threatens the foundations of our business.

We would encourage other researchers to study questions around sample source and data quality. We would also love to see further exploration of the linkages between respondent motivations and data quality. We have seen some linkages, but this area needs to be much more fully understood.

Ultimately, we believe there is opportunity in better understanding why people do surveys. If we can better appreciate how to engage respondents and reward them in intrinsic ways, we think that we can increase the quality of the information they share with us. And, ultimately, accurate information is what it’s all about.

  1. FiveThirtyEight’s Pollster Ratings, found at http://fivethirtyeight.com/interactives/pollster-ratings/. Sourced March 20, 2016.
  2. Online Panel Research: A Data Quality Perspective, edited by Mario Callegaro, Reginald P. Baker, Jelke Bethlehem, Anja S. Göritz, Jon A. Krosnick and Paul J. Lavrakas. John Wiley & Sons Ltd, 2014.
  3. For a useful summary of the ARF FOQ effort, see Chapter 18 of Leading Edge Marketing Research: 21st-Century Tools and Practices by Robert J. Kaden, Gerald Linda, Melvin Prince. Chapter 18 is entitled Panel Online Research and Survey Quality and was written by Raymond C. Pettit.
  4. Canadian online panels: Similar or Different, by P. Chan and D. Ambrose, Vue magazine, January/February 2011.
  5. For information on the ARF FOQ2 effort see http://thearf.com/feature-orqc-knowledge-briefs
  6. The first wave of the study was conducted in Canada between April 30 and May 13, 2013, with a sample of 1,580, roughly 300 coming from each of the Angus Reid Forum, three other panels and a River sample source. The second wave was conducted in Canada in May/June 2014 with a sample of 1,580, again with roughly 300 coming from the Angus Reid Forum and the same three other panels and River sample source used previously. Study 3 consisted of 3 waves of a brand image tracking study conducted in the US. Wave 1 had a sample of 19,229 and was collected Jan 14-20, 2013. Wave 2 had a sample of 19,222 and was collected February 18-24, 2013. The third wave had a total sample of 19,605 and was collected April 29-May 5, 2013. In both cases we used a 95% CI in comparing between waves.
  7. Noah Marconi, “Know your Sample Sources: Quality Testing Sample Vendors”, August 2013. Available at http://vcu.visioncritical.com/system/files/WHITE_PAPER_KnowYourSample_September26_2013_Final.pdf. Sourced March 21, 2016.
  8. Source: data gathered by Complete, A Millward Brown Company—2013-2016, analyzed by MARU/VCR&C.
  9. The study was conducted in March 2016 with a total sample of 9,125 Americans aged 18+. For SpringBoard America, there was a sample of 1,003 for wave 1 and 1,005 for wave 2. For Panel A we had a sample of 447 for wave 1 and 500 for wave 2. For the multi-reward sample source we had a sample of 817 for wave 1 and 502 for wave 2. For the river source we had a sample of 496 for wave 1 and 543 for wave 2. For Panel B we had samples of 619 and 597 for waves 1 and 2 respectively. For Panel C we had samples of 559 and 528 for waves 1 and 2. Quotas were set for age, gender, race, education and region to ensure the sample was demographically representative of the US population. The data were then weighted to the same targets.
  10. We ran a survey that asked about social media usage and sent it out to a sample of people who wanted access to premium content, at four time periods. In April 2014 the questions were answered by 511 Americans. We then repeated them in December 2014 (n=1,888), January 2015 (n=1,510) and February 2016 (n=500). In all cases the data were weighted to be representative of the American population, based on imputed demographics. The sample was drawn to our study by the sample provider targeting Internet users who seek to access “premium content,” including news articles, videos, or other websites that would otherwise require a payment or subscription to access the content. The publishers of these websites have agreed to allow the sample provider to administer questions to their users through a corporate agreement wherein the sample provider pays the publisher for access to the potential respondents. In exchange, the respondent gains access to the content for free. The questions appear as prompts when users try to access the premium content; this prompting is also known as a “survey wall” since respondents must either answer the question or click an X to remove the question from their screen. The sample provider uses an algorithm to properly distribute the questions across the publishers’ networks.