This is a repository for research resources that might be of interest to academics and practitioners who use discrete choice methods. Below, you can find access to datasets that might be helpful for your analysis, code in various programming languages for the estimation of different discrete choice models, and working papers. The Institute is deeply committed to the open data and open source movement. The objective of this repository is to encourage opportunities for further analysis, replication, verification and refinement. 


Working Papers 

Modeling and Forecasting the Evolution of Preferences over Time: A Hidden Markov Model of Travel Behavior

El Zarwi, Vij and Walker

Abstract: Preferences, as denoted by taste parameters and consideration sets, may evolve over time in response to changes in demographic and situational variables, psychological, sociological and biological constructs, and available alternatives and their attributes. However, existing representations typically overlook the influence of past experiences on present preferences. This study develops a hidden Markov model with a discrete choice kernel for modeling and forecasting the evolution of individual preferences over time. The hidden states denote different latent preferences, and the evolutionary path is hypothesized to be a first-order Markov process such that an individual’s preferences during a particular time period are dependent on their preferences during the previous time period. The framework is applied to study the evolution of modal preferences, or modality styles, over time, in response to a major change in the public transportation system. Empirical findings reveal two complementary narratives. At the population level, there are significant shifts in the distribution of individuals across modality styles before and after the change in the system, but the distribution is relatively stable in the periods after the change. At the individual level, greater instability in preferences is observed, much after the change, despite accounting for the inertial influence of past preferences. A comparison between the proposed dynamic framework and comparable static frameworks reveals corresponding differences in aggregate forecasts for different policy scenarios, demonstrating the value of the proposed framework for both individual and population-level policy analysis.
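To make the structure described in the abstract concrete, the following is a minimal sketch of how the likelihood of one individual's sequence of choices could be computed under a first-order hidden Markov model, using the standard forward recursion. It is not the authors' code. The function name and the inputs are hypothetical, and the transition probabilities are assumed to be constants, whereas the paper may let them depend on covariates. The state-conditional choice probabilities are taken as given; in the paper's setting they would come from a discrete choice kernel.

```python
def sequence_likelihood(initial, transition, choice_probs):
    """Likelihood of one individual's observed choices under a first-order HMM.

    initial[s]         : probability of starting in latent state s
    transition[r][s]   : probability of moving from state r to state s
    choice_probs[t][s] : probability of the choice observed in period t,
                         conditional on being in state s (assumed to be
                         supplied by a discrete choice kernel, e.g. logit)
    """
    n_states = len(initial)
    # Forward variable for the first period.
    alpha = [initial[s] * choice_probs[0][s] for s in range(n_states)]
    # Propagate through the remaining periods.
    for t in range(1, len(choice_probs)):
        alpha = [
            sum(alpha[r] * transition[r][s] for r in range(n_states))
            * choice_probs[t][s]
            for s in range(n_states)
        ]
    return sum(alpha)


# Hypothetical two-state, two-period example.
initial = [0.5, 0.5]
transition = [[0.9, 0.1], [0.2, 0.8]]
choice_probs = [[0.8, 0.3], [0.8, 0.3]]
likelihood = sequence_likelihood(initial, transition, choice_probs)
print(round(likelihood, 4))  # 0.36
```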

Download Here


Moving past random taste heterogeneity in discrete choice models: Multivariate nonparametric finite mixture distributions

Vij, A.

Abstract: This study develops an expectation maximization algorithm for the estimation of mixed logit models with multivariate nonparametric finite mixture distributions, where the support of the distribution is specified as a high-dimensional grid over the coefficient space, with equal or unequal intervals between successive points along the same dimension, and the location of each point on the grid and the probability mass at that point are model parameters that need to be estimated. The framework does not require the analyst to specify the shape of the distribution prior to model estimation, but can approximate any multivariate probability distribution function to any arbitrary degree of accuracy. The estimation algorithm can feasibly estimate behaviorally meaningful models with multivariate distributions over high-dimensional coefficient spaces with hundreds of mass points. Multiple synthetic datasets and a case study on travel mode choice behavior are used to demonstrate the value of the model framework and estimation algorithm. The literature on discrete choice models is replete with ways to incorporate random taste heterogeneity. By proposing a fully flexible and computationally tractable approach, this study aims to bring to a close the question of how best to include random taste heterogeneity within existing representations of decision-making.
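As a rough illustration of the kind of update an expectation maximization algorithm performs for a finite mixture, the sketch below runs one EM iteration for the probability masses only, holding the grid locations fixed. This is a deliberate simplification: the abstract states that the grid locations are estimated as well, and that step is not shown. The function name and inputs are hypothetical, and the per-person likelihoods at each grid point are assumed to have been computed beforehand.

```python
def em_update_masses(masses, likelihoods):
    """One EM iteration for the probability masses on a fixed grid.

    masses[k]         : current probability mass at grid point k
    likelihoods[n][k] : likelihood of person n's observed choices when the
                        coefficients equal grid point k (assumed precomputed)
    """
    n_points = len(masses)
    new_masses = [0.0] * n_points
    for person in likelihoods:
        joint = [masses[k] * person[k] for k in range(n_points)]
        total = sum(joint)
        for k in range(n_points):
            # E-step: posterior probability that this person sits at point k.
            new_masses[k] += joint[k] / total
    # M-step: the updated mass is the average posterior across people.
    return [m / len(likelihoods) for m in new_masses]


# Hypothetical example with two grid points and two people.
masses = [0.5, 0.5]
likelihoods = [[0.2, 0.6], [0.5, 0.5]]
updated = em_update_masses(masses, likelihoods)
print(updated)
```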

Download Here


California Household Travel Survey 2012:

Tour mode choice data from the San Francisco Bay Area

This data was originally collected as part of the California Household Travel Survey (CHTS) in the year 2012. Individuals belonging to sampled households were asked to report their complete activity diary data over an observation period of one day, including which activities were conducted where, when, for how long, with whom and using what mode of travel. More information on the raw data can be found in NuStats, LLC (2013).

The data included here corresponds to individuals from the subset of households located in the nine-county San Francisco Bay Area. The raw trip data was processed into home-based tours that can be used for the purpose of tour-based travel mode choice analysis. The resulting dataset includes 27,054 tours made by 17,717 individuals from 8,228 households.
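The idea of turning trips into home-based tours can be sketched as follows. This is an illustrative assumption about the logic, not the processing script included in the download: the function name, the tuple layout and the location labels are all hypothetical. A tour is taken to start when a trip leaves home and to end when a later trip returns home.

```python
def build_home_based_tours(trips, home="home"):
    """Group one person's ordered trips into home-based tours.

    Each trip is an (origin, destination, mode) tuple. A tour opens with a
    trip that leaves home and closes with the next trip that returns home.
    """
    tours, current = [], []
    for origin, destination, mode in trips:
        if not current and origin != home:
            continue  # trip does not belong to a tour that started at home
        current.append((origin, destination, mode))
        if destination == home:
            tours.append(current)
            current = []
    return tours  # a tour that never returns home is dropped


# Hypothetical one-day diary for a single person.
day = [
    ("home", "work", "drive"),
    ("work", "lunch", "walk"),
    ("lunch", "work", "walk"),
    ("work", "home", "drive"),
    ("home", "shop", "bike"),
    ("shop", "home", "bike"),
]
tours = build_home_based_tours(day)
print(len(tours))  # 2
```

In this sketch the first tour contains four trips, including a work-based sub-loop, and the second contains two.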

For each tour, six possible travel mode alternatives are defined: private vehicle, private transit, walk to public transit, drive to public transit, bike, and walk. Private vehicle refers to cases where the individual used a motorized vehicle owned by themselves (or someone they know) as a driver or a passenger. Private transit includes the use of travel modes such as taxis, Uber, carshare, rental cars and private shuttles. Walk to public transit captures all cases in which an individual only used non-motorized travel modes to access public transit, and drive to public transit captures all cases in which a motorized travel mode was used to access public transit.

The level-of-service attributes, namely travel times and costs, for each of the six travel modes for each tour are determined using network skims from the SF MTC for 2010, generated using version 3 of their travel demand model. We are unable to decompose travel time into its constituent elements, such as in-vehicle time and waiting time, as this information was unavailable at the time of processing. Travel costs are in 2000 US dollars.

The download link below contains five files: the processed data file, the Python script used to process the raw data, an IPython notebook included as an example of how to use the data file for analysis, the data dictionary for the raw data and a readme file.

A subset of this data was originally used by Vij et al. (2017) for understanding modal preference shifts in the San Francisco Bay Area over time. For more details, please refer to the original study. And if you have any questions, feel free to contact

Download Here


NuStats, LLC (2013). 2010–2012 California Household Travel Survey Final Report.

Vij, A., Gorripaty, S., & Walker, J. L. (2017). From trend spotting to trend’splaining: Understanding modal preference shifts in the San Francisco Bay Area. Transportation Research Part A: Policy and Practice, 95, 238-258.

Estimation Code

Python estimation code for flexible Latent Class Choice Models (LCCMs)

Lccm is a Python package for estimating latent class choice models using the Expectation Maximization (EM) algorithm to maximize the likelihood function. The package was developed by Feras El Zarwi, a PhD candidate at the University of California, Berkeley, with assistance from Akshay Vij from the Institute for Choice. The package offers significant improvements over other estimation packages, including the following:

  • Supports datasets with multiple observations per decision-maker
  • Supports datasets where the choice set differs across observations
  • Supports model specifications where the coefficient for a given variable may be generic (same coefficient across all alternatives) or alternative specific (coefficients varying across all alternatives or subsets of alternatives) in each latent class
  • Accounts for sampling weights when the data you are working with is choice-based, i.e. uses Weighted Exogenous Sample Maximum Likelihood (WESML) (Ben-Akiva and Lerman, 1983) to yield consistent estimates
  • Allows the choice set to be constrained across latent classes, so that each latent class can have its own subset of alternatives in its choice set
  • Allows the availability of latent classes to be constrained across individuals, so that a certain latent class or set of latent classes can be made unavailable to certain decision-makers
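To show how several of these features fit together, the sketch below computes the weighted log-likelihood contribution of a single decision-maker in a latent class choice model. It does not use the lccm package and does not reflect its API; every function and variable name here is hypothetical. For brevity, each class is assumed to have a fixed utility per alternative, whereas a real specification would compute utilities from observation-level attributes and class-specific coefficients.

```python
import math


def logit_probs(utilities, feasible):
    """Multinomial logit probabilities over the feasible alternatives."""
    exp_u = {j: math.exp(u) for j, u in utilities.items() if j in feasible}
    total = sum(exp_u.values())
    return {j: v / total for j, v in exp_u.items()}


def person_log_likelihood(observations, class_shares, class_utilities,
                          class_choice_sets, weight=1.0):
    """Weighted log-likelihood contribution of one decision-maker.

    observations      : list of (chosen, available_alternatives), one entry
                        per observation for this decision-maker
    class_shares      : {class: membership probability}; a share of 0.0
                        marks a class as unavailable to this person
    class_utilities   : {class: {alternative: systematic utility}}
    class_choice_sets : {class: set of alternatives the class considers}
    weight            : sampling weight applied to the log-likelihood
    """
    likelihood = 0.0
    for s, share in class_shares.items():
        if share == 0.0:
            continue
        conditional = 1.0
        for chosen, available in observations:
            # Alternatives both available now and considered by class s.
            feasible = set(available) & class_choice_sets[s]
            if chosen not in feasible:
                conditional = 0.0  # class s cannot explain this choice
                break
            conditional *= logit_probs(class_utilities[s], feasible)[chosen]
        likelihood += share * conditional
    # Assumes at least one class can explain the observed choices.
    return weight * math.log(likelihood)


# Hypothetical two-class example with two observations for one person.
class_shares = {"car_only": 0.6, "multimodal": 0.4}
class_utilities = {
    "car_only": {"car": 0.0},
    "multimodal": {"car": 0.0, "transit": 0.0, "bike": 0.0},
}
class_choice_sets = {
    "car_only": {"car"},
    "multimodal": {"car", "transit", "bike"},
}
observations = [
    ("car", ["car", "transit", "bike"]),
    ("car", ["car", "transit"]),
]
ll = person_log_likelihood(observations, class_shares, class_utilities,
                           class_choice_sets)
print(round(ll, 4))  # -0.4055
```

The class-conditional probabilities are multiplied across a person's observations before mixing over classes, which is what allows multiple observations per decision-maker to inform class membership.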

For more information about the estimation code, see El Zarwi (2017). If the package is useful in your research or work, please cite the dissertation referenced below and the package itself. For any questions, please contact Feras at


El Zarwi, F. (2017). Modeling and Forecasting the Impact of Major Technological and Infrastructural Changes on Travel Demand. PhD dissertation, University of California, Berkeley.
