When is big data enough? Implications of using GPS-based surveys for travel demand analysis
Traditional models of individual and household travel and activity behavior are estimated using travel diary datasets. These ask a small subset of the population of interest to record, over a period of one or two days, which activities were conducted where, when, for how long, with whom and using what mode of travel. For example, the New York Best Practices Model (NYBPM), the activity-based model of travel demand developed for the New York Metropolitan Region, was estimated using travel diary data from about 28,000 individuals collected over a period of one day. The size of that and other similar travel diary datasets pales in comparison to the volume of information that can potentially be retrieved, both now and in the future, from new technologies such as Global Positioning System (GPS) sensors and smartphones, and from social media platforms such as Twitter and Facebook.
Advances in GPS technologies in particular have received substantial attention in the last decade. Early applications used GPS data to supplement extant methods of travel diary data collection that rely on self-reporting, such as mail-back, phone-based or door-to-door travel diary surveys, because GPS data can control for factors such as trip underreporting. However, the long-term objective has always been the development of GPS-based surveys that can collect all the information that is usually collected by traditional travel diary surveys, but with very little input from survey participants.
The advantages of using GPS-based surveys are manifold. They impose fewer requirements on survey respondents, offer greater spatiotemporal precision and are potentially cheaper to implement. However, GPS-based surveys do not collect certain key inputs required for the estimation of travel demand models, such as the travel mode(s) taken or the trip purpose, relying instead on data-processing procedures to infer this information.
A 2015 study by Dr. Akshay Vij of the Institute for Choice and K. Shankari of the University of California, Berkeley assesses the impact that errors in inference can have on travel demand models estimated using data from GPS-based surveys and proposes ways in which these errors can be controlled for during both data collection and model estimation. They use simulated datasets to compare performance across different sample sizes, inference accuracies, model complexities and estimation methods. Findings from the simulated datasets are corroborated with real data collected from individuals living in the San Francisco Bay Area, United States.
Figure: Boxplots showing estimated values of time for a relatively parsimonious travel mode choice model specification, using different synthetic GPS-based datasets with different sample sizes and levels of inference accuracy. In general, the variability in estimates decreases as the sample sizes increase, but the magnitude of bias appears to be independent of the number of observations, and surprisingly large. The median estimate for value of time for 10,000 observations and an inference accuracy of 80% is $12.9/hr, off by about 35% from the true value of $20/hr. Even at higher accuracies, such as 95%, the median estimate for 10,000 observations is $17.3/hr, off by about 14%.
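The percentage figures in the caption follow from a simple relative-bias calculation. The short sketch below reproduces that arithmetic from the reported medians; the function name `relative_bias` is a hypothetical helper introduced here for illustration, not something taken from the study.

```python
def relative_bias(estimate, true_value):
    """Signed bias of an estimate, expressed as a fraction of the true value."""
    return (estimate - true_value) / true_value


TRUE_VOT = 20.0  # true value of time in $/hr, as stated in the caption

# (inference accuracy, median estimated value of time in $/hr), from the caption
reported = [(0.80, 12.9), (0.95, 17.3)]

for accuracy, median_vot in reported:
    bias = relative_bias(median_vot, TRUE_VOT)
    print(f"accuracy {accuracy:.0%}: median {median_vot} $/hr, bias {bias:+.1%}")
```

The negative sign indicates that, in both reported cases, the median estimate understates the true value of time.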
Their analysis indicates that the benefits of using GPS-based surveys will vary significantly, depending upon the sample size of the data, the accuracy of the inference algorithm and the desired complexity of the travel demand model specification. If the data is truly big enough, the quality of inference may not matter. But in many cases, gains in volume could be neutralized by losses in quality. For example, a Monte Carlo experiment finds that a relatively parsimonious model of travel mode choice behavior that could reliably be estimated using 100 high-quality observations could need 10,000 observations or more when the inputs are inferred, depending upon the accuracy of the inference algorithm. In practice, no algorithm will ever guarantee complete accuracy. For data from GPS-based surveys to still be useful for travel demand analysis, it will need either to be incredibly big, or to be supplemented with data that can be treated as a reliable source of ground truth.
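To make the idea of such a Monte Carlo experiment concrete, the following is a minimal sketch and not a reproduction of the study's design. It assumes a binary car-versus-transit logit model, illustrative coefficients chosen so that the true value of time is $20/hr, uniformly distributed time and cost differences, and an inference step that flips the recorded mode at random with probability one minus the accuracy. All names, parameter values and the error structure are assumptions made here for illustration. The study's own specifications and error patterns are richer, so this simplified setup should not be expected to reproduce its reported bias in the value of time.

```python
import math
import random

# Assumed "true" parameters (illustrative only, not taken from the study).
TRUE_B_TIME = -0.10  # utility per minute of travel time
TRUE_B_COST = -0.30  # utility per dollar of travel cost
# Implied value of time: 60 * TRUE_B_TIME / TRUE_B_COST = 20 $/hr


def simulate(n, accuracy, rng):
    """Simulate n trips; return rows of (dt, dc, inferred_mode), 1 = car."""
    rows = []
    for _ in range(n):
        dt = rng.uniform(-30.0, 30.0)  # car minus transit time, minutes
        dc = rng.uniform(-5.0, 5.0)    # car minus transit cost, dollars
        v = TRUE_B_TIME * dt + TRUE_B_COST * dc
        p_car = 1.0 / (1.0 + math.exp(-v))
        mode = 1 if rng.random() < p_car else 0
        # Assumed inference error: the label is flipped with prob. 1 - accuracy.
        if rng.random() > accuracy:
            mode = 1 - mode
        rows.append((dt, dc, mode))
    return rows


def fit_logit(rows, iterations=15):
    """Maximum likelihood for a two-parameter binary logit via Newton steps."""
    bt, bc = 0.0, 0.0
    for _ in range(iterations):
        g_t = g_c = h_tt = h_tc = h_cc = 0.0
        for dt, dc, y in rows:
            p = 1.0 / (1.0 + math.exp(-(bt * dt + bc * dc)))
            w = p * (1.0 - p)
            g_t += (y - p) * dt       # gradient of the log-likelihood
            g_c += (y - p) * dc
            h_tt += w * dt * dt       # negative Hessian entries
            h_tc += w * dt * dc
            h_cc += w * dc * dc
        det = h_tt * h_cc - h_tc * h_tc
        bt += (h_cc * g_t - h_tc * g_c) / det
        bc += (h_tt * g_c - h_tc * g_t) / det
    return bt, bc


def estimate_vot(n, accuracy, seed):
    """Return (b_time, b_cost, value of time in $/hr) for one simulated dataset."""
    rng = random.Random(seed)
    bt, bc = fit_logit(simulate(n, accuracy, rng))
    return bt, bc, 60.0 * bt / bc


if __name__ == "__main__":
    for accuracy in (1.0, 0.95, 0.80):
        bt, bc, vot = estimate_vot(5000, accuracy, seed=0)
        print(f"accuracy {accuracy:.2f}: b_time={bt:.4f} b_cost={bc:.4f} VOT={vot:.1f} $/hr")
```

Repeating `estimate_vot` over many seeds and sample sizes, and summarizing the resulting estimates by their median and spread, gives the kind of comparison the boxplots describe. In this simplified setting, random label errors mainly shrink the estimated coefficients toward zero; how much the implied value of time moves depends on the structure of the inference errors, which is an assumption here.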