Large-scale Predictive Modeling Helps the FCC Identify Underserved Areas

In 2009, Congress directed the Federal Communications Commission (FCC) to develop a National Broadband Plan to ensure every American has “access to broadband capability.”

As the regulator of all telecommunications companies, the FCC maintains large databases about them and the customers they serve. For this project, the FCC wished to map broadband availability for every Census Block in the country. There are almost ten million of them: on average, a Block comprises only 35 people, about ten households. That is far more detailed information than the companies would agree to provide. “The complexity of this analysis is driven by the need for a very granular geographic view of the capabilities of all the major types of broadband infrastructure as they are deployed today….” Additional data and statistical analysis were needed to achieve the FCC’s goal.

The FCC turned to the private sector for help. It hired CostQuest Associates to collect and model the data. CostQuest retained Dr. Huber to lead the statistical effort and produce the information for a National Broadband Map.

The statistical modeling began with a comprehensive review of the academic literature on factors relating to the availability and cost of Internet access. The review identified hundreds of candidate predictive variables that had been used to model average broadband availability in states, counties, and ZIP codes. However, relationships observed at those aggregate levels do not necessarily hold at more granular ones; assuming they do is the “ecological fallacy.” An entirely new model had to be built at the Census Block level.
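To see why aggregate results cannot simply be carried down to finer geography, consider the following small synthetic demonstration. All numbers are invented for illustration; a correlation measured on county averages can look much stronger than the corresponding relationship among individual households:

```python
# Synthetic illustration of the ecological fallacy: a correlation measured
# on county-level averages can be much stronger than the corresponding
# household-level correlation. All numbers here are invented.
import numpy as np

rng = np.random.default_rng(1)
n_counties, n_households = 50, 200

# County averages: wealthier counties tend to have better access.
county_income = rng.normal(50_000, 10_000, n_counties)
county_access = 0.00005 * county_income + rng.normal(0, 0.5, n_counties)

# Households scatter widely around their county's averages, so the
# individual-level relationship is heavily diluted.
hh_income = np.concatenate(
    [rng.normal(m, 5_000, n_households) for m in county_income])
hh_access = np.concatenate(
    [a + rng.normal(0, 1.0, n_households) for a in county_access])

print("county-level r:   ", np.corrcoef(county_income, county_access)[0, 1])
print("household-level r:", np.corrcoef(hh_income, hh_access)[0, 1])
```

On this synthetic data, the county-level correlation is roughly twice the household-level one, even though both are generated from the same underlying process.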

Using large, detailed databases of broadband infrastructure created for several states, as well as proprietary engineering data provided by CostQuest and its telecommunications clients, Dr. Huber’s team collected, studied, and created thousands of variables: social, demographic, economic, engineering, geographic, and geophysical. They built and tested logistic regression models for predicting broadband availability across a range of speeds. Dr. Huber also used spatial analyses to create and test new variables, such as distances to key telecommunications infrastructure. After extensive cross-validation and testing, the team identified 150 variables and combinations of variables that together provided over 90% predictive accuracy.
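As a rough sketch of this modeling step, and not the team’s actual pipeline, the following Python fragment fits a logistic regression to synthetic block-level features (the variable names are invented placeholders) and reports cross-validated accuracy:

```python
# Rough sketch (not the actual pipeline): fit a logistic regression that
# predicts broadband availability from block-level features and measure
# out-of-sample accuracy with 10-fold cross-validation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_blocks = 10_000

# Hypothetical block-level predictors (placeholder names, invented data):
# household density, median income, and distance to key infrastructure,
# the kind of spatial variable described above.
density = rng.lognormal(3.0, 1.0, n_blocks)      # households / sq. mile
income = rng.normal(50_000, 15_000, n_blocks)    # median income, USD
dist_km = rng.exponential(5.0, n_blocks)         # km to central office
X = np.column_stack([density, income, dist_km])

# Synthetic labels: denser, wealthier, closer blocks are more likely to
# have broadband available at the modeled speed tier.
logit = 0.005 * density + 0.00002 * income - 0.4 * dist_km - 0.5
y = (rng.random(n_blocks) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

# Standardize features so the solver converges on the raw dollar scale.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
print(f"cross-validated accuracy: {scores.mean():.3f}")
```

Cross-validation of this kind, scoring the model only on blocks held out of each fit, is what supports an out-of-sample accuracy claim rather than one inflated by overfitting.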

Predicting availability across America required linking the statistical output to a massive database built by CostQuest. Dr. Huber created software to translate the fitted model into thousands of lines of SQL code that extract the relevant data and perform the statistical calculations directly in the database. This ensured the model results were correctly computed and mapped. The output is the foundation of the statistical summaries and analyses in the 376-page National Broadband Plan presented to Congress in 2010.
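A minimal sketch of that translation step, assuming a hypothetical schema and coefficients (nothing here is CostQuest’s actual code or model): the fitted logistic coefficients become a SQL expression that applies the inverse-logit to each Census Block’s features, so scoring runs inside the database rather than in a separate statistics package.

```python
# Rough sketch of translating a fitted logistic model into SQL.
# Hypothetical table and column names, invented coefficients.

def model_to_sql(intercept: float, coefs: dict[str, float],
                 table: str = "census_block_features") -> str:
    """Emit SQL that applies the inverse-logit to each block's features."""
    terms = [f"{intercept:+.6f}"]  # linear predictor starts at the intercept
    terms += [f"{c:+.6f} * {col}" for col, c in coefs.items()]
    linear = "\n            ".join(terms)
    return (
        "SELECT block_id,\n"
        "       1.0 / (1.0 + EXP(-(\n"
        f"            {linear}\n"
        "       ))) AS p_available\n"
        f"FROM {table};"
    )

print(model_to_sql(
    intercept=-0.5,
    coefs={
        "household_density": 0.005,
        "median_income": 0.00002,
        "dist_to_central_office_km": -0.4,
    },
))
```

Generating the SQL mechanically from the fitted coefficients, rather than transcribing it by hand, is one way to guarantee that what the database computes matches what the statistical model estimated.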