Making Sure People Count with Big Data and Machine Learning

By OmniSci

The US Constitution bases representation in Congress on an "actual enumeration" carried out at least once every ten years.

Last year, there was a large fight over this obscure clause, ending in a Supreme Court decision reaffirming that all permanent residents should be counted, including non-citizens.

At stake were as many as 3 of California's 53 House seats and the allocation of $700 billion in annual funding nationally.

Yet even if we agree in principle to count everyone, it has always been problematic in practice.  

There has always been a census undercount because some people are hard to locate, contact, and interview.  Last year there were even more challenges due to the outbreak of Covid-19.  

Changes in mobility made it more challenging for the mail to reach people on time or at all.

Census counting ended early last year, limiting the time available to track down missing respondents.

Essential workers worked long hours away from home, and many service workers were unemployed and may have moved.

We were curious if new data sources and approaches might help shed new light on these issues.  

In this post, we use OmniSci to visualize a massive GPS mobility dataset, correlate our observations with historical censuses, and then predict census undercount.

Prior Undercounts

The census explicitly measures undercounting.  Census researchers have found a high correlation between mail return rates and actual undercounts.

We can look directly at this variable and see how it connects with geographies and demographics. For example, minority groups are particularly affected.

There are also some surprises.  For instance, "white rural renters" actually had the highest undercount in the 1990 census.

 

Figure 1: Major Undercounted Census groups

Historically, a household with a complex family structure has a higher chance of being undercounted because such complexity may cause ambiguity for census respondents about whom to include on the household roster.

GPS Data

This work explores how anonymized GPS data could have improved the COVID-19 pandemic response. We examined 3.1 billion location observations from the beginning of the pandemic in the US (February and March of 2020).

Performing almost any spatial analysis on 3B+ records with conventional tools falls somewhere between excruciatingly slow and impossible.  

Fortunately, OmniSci's GPU-enabled architecture allowed us to explore and analyze the data interactively.  

The dashboard shown in Figure 2 operates at the speed of curiosity, allowing for the choice of any date range, time range, census block, or device speed throughout the entire corpus of data.

Figure 2: Location data (3.1B) filtered to a single census block group (135k)

Data Enrichment

The location data contained basic information like an anonymized device id and GPS location.

We geospatially enriched each GPS observation with attributes like the current census block, county, and state.
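To make the enrichment step concrete, here is a minimal sketch of assigning a GPS point to the polygon that contains it, using a plain ray-casting test. It only illustrates the idea and is not the OmniSci geospatial join used on the 3.1 billion points; the block IDs, coordinates, and function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


def point_in_polygon(pt: Point, ring: List[Point]) -> bool:
    """Ray casting: count how often a ray going east from pt crosses an edge."""
    x, y = pt
    inside = False
    for i in range(len(ring)):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % len(ring)]
        # Only edges that straddle the horizontal line through pt can be crossed.
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge meets that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def enrich(pt: Point, blocks: Dict[str, List[Point]]) -> Optional[str]:
    """Return the ID of the first block containing pt, or None if there is none."""
    for block_id, ring in blocks.items():
        if point_in_polygon(pt, ring):
            return block_id
    return None


# Hypothetical block outlines (made-up IDs and coordinates).
blocks = {
    "block_A": [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)],
    "block_B": [(1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (1.0, 1.0)],
}
pings = [(0.5, 0.5), (1.5, 0.25), (3.0, 3.0)]
labels = [enrich(p, blocks) for p in pings]
print(labels)
```

A linear scan over polygons like this would not scale to billions of points; at that size a spatial index or a GPU-accelerated join is needed.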

Since the census measures residential population at various levels of geography, we estimated the points' demographic characteristics and compared them to overall census demographics.  

This estimation is important because we had little understanding of how representative the location data might be before starting.

In addition to demographic enrichment from the census, we measured device mobility.  

We did this with general measures like distance traveled and with measures that considered trip purpose.  In particular, we computed the distance to the nearest road segment, building, and SafeGraph Place, along with the "dwell time" at each.  
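As a rough illustration of two such measures, the sketch below computes the distance a device traveled and its dwell time near a place from a list of timestamped pings. It is a simplification under stated assumptions: the place is treated as a single point with a radius rather than a road segment or building footprint, and the ping format, coordinates, and function names are hypothetical.

```python
import math
from typing import List, Tuple

Ping = Tuple[float, float, float]  # (timestamp in seconds, latitude, longitude)
EARTH_RADIUS_M = 6_371_000.0


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def distance_traveled_m(pings: List[Ping]) -> float:
    """Sum of straight-line hops between consecutive pings, ordered by time."""
    ordered = sorted(pings)
    return sum(haversine_m(a[1], a[2], b[1], b[2]) for a, b in zip(ordered, ordered[1:]))


def dwell_time_s(pings: List[Ping], place_lat: float, place_lon: float, radius_m: float) -> float:
    """Sum the gaps between consecutive pings that both lie within radius_m of the place."""
    ordered = sorted(pings)
    total = 0.0
    for a, b in zip(ordered, ordered[1:]):
        a_near = haversine_m(a[1], a[2], place_lat, place_lon) <= radius_m
        b_near = haversine_m(b[1], b[2], place_lat, place_lon) <= radius_m
        if a_near and b_near:
            total += b[0] - a[0]
    return total


# Hypothetical device: ten minutes at one spot, then a move 0.01 degrees north.
pings = [(0.0, 40.0, -75.0), (600.0, 40.0, -75.0), (1200.0, 40.01, -75.0)]
dist = distance_traveled_m(pings)
dwell = dwell_time_s(pings, place_lat=40.0, place_lon=-75.0, radius_m=50.0)
print(f"distance: {dist:.0f} m, dwell: {dwell:.0f} s")
```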

Beyond this specific project, some of these measures are useful for retail and transportation planning.  But here, we only considered if such "mobility indices" were predictive of census undercount.  

We also enriched the Census demographic dataset with the GPS location data, aggregated to the block level to keep it anonymous, and with information from before and after the start of the pandemic (Figure 2).

The location data had an average sampling rate of 0.45% at the block level. We were interested to see if, by enriching the Census dataset with the location data, we could better understand the reasons for census undercount.
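For readers who want to reproduce a figure like this, one plausible definition of the sampling rate is the number of distinct devices observed in a block divided by that block's census population. The post does not spell out the exact formula, so treat this definition, along with the device IDs, block IDs, and populations below, as assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def sampling_rates(
    observations: Iterable[Tuple[str, str]],
    population: Dict[str, int],
) -> Dict[str, float]:
    """Distinct devices seen per block, divided by the block's census population."""
    devices = defaultdict(set)
    for device_id, block_id in observations:
        devices[block_id].add(device_id)
    # Skip blocks with zero population to avoid dividing by zero.
    return {b: len(devices[b]) / pop for b, pop in population.items() if pop > 0}


# Hypothetical (device_id, block_id) observations and block populations.
observations = [
    ("dev1", "blockA"),
    ("dev1", "blockA"),  # repeat pings from one device count once
    ("dev2", "blockA"),
    ("dev3", "blockB"),
]
population = {"blockA": 400, "blockB": 250}

rates = sampling_rates(observations, population)
print(rates)
```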

In Figure 3, we can see the sampling at a block level across the United States.

Figure 3: Mobility across the United States pre/during Covid. Mobility ratio is defined as the ratio of the vehicles travelling on the road at an instant to the capacity of the vehicular traffic.

Figure 4: Sampling and Census Mail Return Rate

Toolkit

To accomplish our analysis, we used OmniSci and MLflow:

  • OmniSci Data Science Foundation
    • Maintains tight integration with JupyterLab
    • Data stored in OmniSciDB is available for use in JupyterLab through Ibis connections
    • OmniSci's visual analytics platform, Immerse, visualizes the results
    • An expansive library of analytical packages pre-installed

Figure 5: Single click Jupyter access from OmniSci

  • MLflow
    • An open-source platform to organize and manage the ML lifecycle
    • Supports model experimentation and iteration
    • The code can be quickly packaged and reproduced elsewhere
    • Central registry where all the model results and parameters are stored

Figure 6: MLflow Lifecycle

Currently, backend support for MLflow is available to OmniSci through Python, and the option to support it through Immerse is under evaluation.

Here is a sample MLflow implementation of Random Forest model training. The with statement opens a model run, and the log_param function records the parameters used in training.

Figure 7: Sample Random Forest Model Building using MLflow (full source)

In Figure 7, we use the hyperparameters corresponding to the best model for prediction. The log_metric and log_model functions store the corresponding metrics and the trained model.

Figure 8: Sample Random Forest Model Prediction using MLflow (full source)

Figure 9: MLflow UX

Figure 9 depicts the default MLflow UX, which shows the various model runs, their parameters, and their results.

Methodology

As described above, the mail return rate in the census survey is used as a proxy variable for the population undercount at the Census Block level.

In our study, we used the sociodemographic variables from the article published in the Oxford Academic journal to see if we could replicate the same results by modeling the data using ML.

We trained various models on the data, including the Elastic Net model, Random Forest, and Gradient Boosting.
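A comparison of this kind can be sketched with scikit-learn's cross-validation helper. The example below is an assumption-laden stand-in: it uses synthetic regression data instead of the census and mobility features, arbitrary hyperparameters, and scikit-learn's GradientBoostingRegressor in place of the XGBoost model referred to elsewhere in this post.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the block-level feature table.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Hyperparameters here are illustrative, not tuned.
models = {
    "elastic_net": ElasticNet(alpha=0.1, l1_ratio=0.5),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

# Mean R^2 over 5 cross-validation folds for each model family.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in models.items()
}

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: mean R^2 = {score:.3f}")
```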

Results

Figure 10: Comparing Model Results for different ML models

Figure 11: Feature importance for the XGBoost Model

Inferences

Our models predicted Total Population more accurately than the Mail Return Rate (R² score of 0.98 vs. 0.65), and XGBoost was the best predictor.
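For reference, the R² score compares a model's squared error with that of always predicting the mean: 1.0 means perfect predictions and 0.0 means no better than the mean. A minimal plain-Python version of the standard definition, with made-up numbers, looks like this:

```python
from typing import Sequence


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up values for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r2_score(y_true, y_true)        # predictions match exactly
baseline = r2_score(y_true, [2.5] * 4)    # always predict the mean
print(perfect, baseline)
```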

This result is not unexpected because Total Population is assumed to be the most effective way to predict the undercount proxy.  

However, we were surprised that the location data variables didn't add much explanatory power to the model.

The location data may have less explanatory power due to the sample size, as the percentage sampling of the GPS points is relatively low compared to the census.

Another possibility is that the location data is particularly sparse in precisely those block groups that matter most for the census undercount.

However, the differences here were less than we had initially anticipated.  

Visually, there is no apparent skew between the location data sampling rates and the distribution of Census mail return rates.  

When we compare block groups with low (<50%) and high (>85%) response rates, we find sampling is higher in low response areas (0.77% versus 0.49%).
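The comparison above amounts to bucketing block groups by mail return rate and averaging the sampling rate within each bucket. A small sketch with hypothetical numbers (not the study's data) shows the mechanics; the thresholds match the ones quoted in the text.

```python
from statistics import mean
from typing import List, Tuple

# Hypothetical (mail_return_rate, sampling_rate) pairs, one per block group.
block_groups: List[Tuple[float, float]] = [
    (0.42, 0.009),
    (0.48, 0.007),
    (0.70, 0.005),  # falls in neither bucket
    (0.88, 0.004),
    (0.91, 0.006),
]


def mean_sampling_by_response(
    rows: List[Tuple[float, float]], low: float = 0.50, high: float = 0.85
) -> Tuple[float, float]:
    """Average sampling rate for low-response (< low) and high-response (> high) groups."""
    low_rates = [s for r, s in rows if r < low]
    high_rates = [s for r, s in rows if r > high]
    return mean(low_rates), mean(high_rates)


low_mean, high_mean = mean_sampling_by_response(block_groups)
print(f"low response: {low_mean:.2%}, high response: {high_mean:.2%}")
```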

Figure 12: Location Data Sample Density versus Census Mail Response Rates

Ultimately, a third hypothesis about the lack of prediction difference is simpler: when it comes to counting people, demographics are more predictive than mobility patterns.  

To be more precise, we should say the conventional census demographic measures in the literature are more predictive than the specific mobility measures we applied.  

While this was somewhat analytically disappointing to us, the good news is that widely available methods are very robust; neither significant differences in methods nor the use of big data changed our results.  

So from a public policy perspective, if we want to reduce census undercount in the future, we should continue to focus on the special-needs populations described above.

Conclusions

The robust correlation between the location data and the census, as well as the lack of evident demographic biases, bodes well for its use elsewhere within retail analytics or public policy. 

In particular, the visibility this data gives into travel patterns could prove very useful to transportation planners.  

The sample sizes here are so large that they completely dwarf ground-based survey methods commonly used, for example, in travel demand forecasting.

In terms of tools and methods, we found the combination of OmniSci with MLflow very handy. As the examples above illustrate, we interrogated the data interactively during initial exploration, feature engineering, and the review of our conclusions.  

MLflow allowed us to rapidly build a medium-large archive of experimental model runs and made it visually apparent which hyperparameters and models performed better than others.

The Jupyter notebooks and models are available as open source (MIT license) from OmniSci's GitHub repository.

Try it out for yourself today: download OmniSci Free, a full-featured version available for use at no cost.
