If you use RegData, please cite:
Al-Ubaydli, O. and McLaughlin, P.A. (2015) “RegData: A numerical database on industry-specific regulations for all United States industries and federal regulations, 1997-2012.” Regulation & Governance, doi: 10.1111/rego.12107.
The History of RegData
Patrick McLaughlin initially conceived RegData as a way of improving the measurement of regulations and the regulatory process. He quickly partnered with George Mason University economics professor Omar Al-Ubaydli to create the first two versions of RegData, and he later brought Oliver Sherouse onto the RegData team to develop the latest iterations of the database.
In years past, regulation largely went unmeasured, which left discussions and research on the regulatory process qualitative and abstruse. On the rare occasions when regulation was quantified, researchers measured it by counting pages published in the Federal Register. The problems with this approach are well documented. The measure is noisy, because many pages have nothing to do with regulation, and it risks counting deregulation as an increase in regulation, because deregulation also requires publishing pages in the Federal Register. Even more important, not all regulations are created equal in their effects on different sectors of the economy or on the economy as a whole. One page of regulatory text can differ greatly from another in content and consequence, so counting pages misses detail that could be useful in understanding the causes and effects of regulation.
RegData introduced an objective, replicable, and transparent methodology for measuring regulation. RegData improved on existing measures of regulation in two principal ways:
- RegData quantifies regulations based on the actual content of regulatory text. In other words, our custom-made program examines the regulatory text itself and counts the number of binding constraints, or “restrictions”: words that indicate an obligation to comply, such as “shall” or “must.” This is important because some regulatory programs can be hundreds of pages long with relatively few restrictions, while others have only a few paragraphs with a relatively high number of restrictions.
- RegData quantifies regulation by industry. We assess the probability that a given regulatory restriction is targeting a specific industry, and this allows us to create industry-specific measures of regulation over time. RegData uses the same industry classes as the North American Industry Classification System (NAICS), which categorizes and describes each industry in the US economy. Using industry-specific quantifications of regulation, users can examine the growth of regulation relevant to a particular industry over time or compare growth rates across industries.
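The two ideas above can be illustrated with a minimal Python sketch. It is not RegData's actual program: the term list contains only the two example words named in the text, the function names are hypothetical, the NAICS codes and probabilities in `sample` are invented for illustration, and weighting each restriction count by an industry probability is an assumption about how the two measures might be combined.

```python
import re

# Illustrative only: the text names "shall" and "must" as examples;
# the full term list used by RegData is not given here.
RESTRICTION_TERMS = ("shall", "must")


def count_restrictions(text, terms=RESTRICTION_TERMS):
    """Count whole-word, case-insensitive occurrences of restriction terms."""
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def industry_restrictions(parts):
    """Sum restriction counts weighted by per-industry probabilities.

    `parts` is an iterable of (text, {naics_code: probability}) pairs.
    The weighting scheme is an assumption made for this sketch.
    """
    totals = {}
    for text, probabilities in parts:
        n = count_restrictions(text)
        for code, p in probabilities.items():
            totals[code] = totals.get(code, 0.0) + n * p
    return totals


# Hypothetical regulatory text with made-up industry probabilities.
sample = [
    ("The operator shall file a report. Records must be kept. "
     "Operators must comply.", {"211": 0.5, "324": 0.25}),
    ("This part describes definitions.", {"211": 0.5}),
]

print(industry_restrictions(sample))
```

The first passage is long on obligations and the second has none, so a page count would treat them similarly while the restriction count does not.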
There are several potential uses of industry-specific measurements of regulation. Both the causes and the consequences of regulation are likely to differ from one industry to the next. Quantifying regulations for all industries lets users test whether industry characteristics, such as industry growth, dynamism, unionization, or a penchant for lobbying, are correlated with industry-specific regulation levels.
We have been transparent about our methodology as it has evolved, and to date we have released four iterations of the database.
Version 1.0 (1997–2010), created by McLaughlin and Al-Ubaydli
Released in 2012, RegData 1.0 introduced the restrictions metric, the method of measuring regulation by counting words such as “shall” and “must” within regulatory text. It also introduced the idea of creating industry-specific measures of regulation. These industry metrics were based on a human-assisted search algorithm, which took the description of each industry in the North American Industry Classification System and derived a set of search terms or keywords from it. The data in version 1.0 covered the years 1997–2010 and included two- and three-digit NAICS-coded industries.
Version 2.0 (1997–2012), created by McLaughlin and Al-Ubaydli
RegData 2.0 gave users the ability to quantify the regulation that specific federal regulators (including agencies, offices, bureaus, commissions, or administrations) have produced. For example, with version 2.0, a user could see how many restrictions a specific administration of the Department of Transportation (e.g., the National Highway Traffic Safety Administration) has produced in each year. Version 2.0 also added the years 2011 and 2012 and four-digit NAICS-coded industries to the database. It was bundled with a new data set that calculated the probabilities of specific industry search terms occurring in written English, compiled using the Google NGram Viewer’s underlying database.
Version 2.1 (1975–2013), created by McLaughlin and Sherouse
RegData 2.1 introduced machine-learning algorithms to the project. While versions 1.0 and 2.0 had relied upon search terms, devised using a scheme initially conceptualized by McLaughlin and Al-Ubaydli that created permutations of individual industry descriptions, machine-learning algorithms did not require humans to tell the program what specific words or phrases to search for. Instead, we found thousands of documents that we knew were about specific industries and used those documents to train our programs. Our programs parsed the training documents and identified which words and phrases were used in reference to specific industries. This permitted industry-specific classification of regulation to be much more accurate, primarily by avoiding false positives.
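The general idea of training on documents with known industries can be sketched in Python. The text does not say which algorithm RegData 2.1 used, so this sketch substitutes a small multinomial naive Bayes classifier with add-one smoothing purely for illustration. The function names, the industry labels, and the two training documents are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(labeled_docs):
    """Learn per-industry word counts from (industry, text) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for industry, text in labeled_docs:
        doc_counts[industry] += 1
        word_counts[industry].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def industry_probabilities(model, text):
    """Return normalized probabilities that `text` concerns each industry."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for industry in doc_counts:
        counts = word_counts[industry]
        denom = sum(counts.values()) + len(vocab)  # add-one smoothing
        score = math.log(doc_counts[industry] / total_docs)
        for word in tokenize(text):
            if word in vocab:  # ignore words never seen in training
                score += math.log((counts[word] + 1) / denom)
        log_scores[industry] = score
    # Convert log scores to probabilities in a numerically stable way.
    top = max(log_scores.values())
    weights = {k: math.exp(v - top) for k, v in log_scores.items()}
    norm = sum(weights.values())
    return {k: w / norm for k, w in weights.items()}


# Hypothetical training documents known to be about one industry each.
training_docs = [
    ("mining", "coal mine operators ventilation shafts underground mine safety"),
    ("banking", "bank deposits loans capital reserve lending institutions"),
]
model = train(training_docs)
probs = industry_probabilities(model, "Each mine operator shall maintain ventilation.")
print(probs)
```

No one tells the classifier which words to look for. It learns from the training documents that words like "mine" and "ventilation" point toward one industry rather than the other.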
RegData 2.1 extended the coverage of the data to 1975–2013 and included three-digit NAICS-coded industries. It also introduced the public law database (PLDB), which mapped specific regulations to their authorizing statutes from 1980 to 2013.
Version 2.2 (1975–2014), created by McLaughlin and Sherouse
RegData 2.2 included significant refinements to the machine-learning algorithm used to classify regulations by industry. We also expanded the machine-learning–based dataset to include two- and four-digit NAICS-coded industries and added the year 2014 to both the regulations data and the PLDB. As a result, RegData 2.2 covers 1975–2014 and the PLDB covers 1980–2014.
We plan for future versions to expand the data to cover five- and six-digit NAICS industries and to quantify the regulations of other jurisdictions, including US states and other countries. Our database will also continue to add statistics and information about legal documents other than regulations. Check this page for future updates.
For more information, contact Patrick McLaughlin at email@example.com.