RegData provides an estimate of the degree of federal regulation faced by each industry in the United States from 1997 to 2012. RegData relies on a transparent and replicable methodology developed by Patrick A. McLaughlin and Omar Al-Ubaydli to quantify federal regulation, as explained in detail in this working paper.
The database offers multiple novel, objective measures of regulation, both overall and for individual industries in the United States. RegData uses text analysis to count the number of restrictions in the text of federal regulations, which are codified in the Code of Federal Regulations (CFR). In addition, it measures the degree to which different groups of regulations target specific industries.
This process yields a set of variables. They are combined to produce the Growth of Industry Regulation metric, which shows how much regulation targeting specific industries has grown over time, and to produce regulator-specific metrics of regulation.
Growth of Industry Regulation – Growth of Industry Regulation measures the growth of regulation targeting a given industry. It is calculated relative to 1997. Values greater than 0 indicate growth in regulation relative to 1997, and values less than 0 indicate a reduction in regulation relative to 1997. It is constructed by dividing Industry Regulation from a specific regulator (or group of regulators) α in year y by Industry Regulation in the base year, 1997:
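One way to write this, offered as a reconstruction from the description rather than as the authors’ exact expression: let IR denote Industry Regulation for industry i, from a regulator or group of regulators denoted α here, in year y. A plain ratio equals 1 in the base year, while the description treats 0 as the no-change value, so the sketch assumes the ratio is expressed as a change relative to 1997. The subtraction of 1 is that assumption.

```latex
\[
  \mathrm{Growth}_{i,\alpha,y}
    = \frac{\mathrm{IR}_{i,\alpha,y}}{\mathrm{IR}_{i,\alpha,1997}} - 1
\]
```

Under this reading, a value of 0.25 would mean that Industry Regulation is 25 percent higher than in 1997.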
Industry Regulation – Industry Regulation is the product of two of the objective measures produced by text analysis of the CFR: restrictions and Industry Relevance. RegData’s interface permits the user to examine this by looking at the table view for a specific combination of industries and regulators. In the calculation of Industry Regulation, restrictions are only considered if they are in the same part as one or more industry search terms; in other words, the part’s Industry Relevance must be greater than zero for restrictions in that part to apply to that industry. The value of Industry Regulation for industry i by an agency or set of agencies α in year y (where a set of agencies consists of one or more agencies, numbered from 1 to A, the maximum number of agencies) would be given by:
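A hedged way to write this, assuming the product is taken part by part and then summed over the CFR parts that belong to the agencies in α. The summation and the symbols R (restrictions), Rel (Industry Relevance), and P (the set of parts) are assumptions made for illustration, not notation taken from the source.

```latex
\[
  \mathrm{IR}_{i,\alpha,y}
    = \sum_{p \,\in\, P_{\alpha,y}} R_{p,y} \times \mathrm{Rel}_{i,p,y}
\]
```

Parts with Rel equal to zero contribute nothing to the sum, which matches the condition that a part’s Industry Relevance must be greater than zero for its restrictions to count toward that industry.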
Industry Relevance – A measure of how relevant a specific part or group of parts, such as an agency’s group of parts in the CFR, is to a specific industry. RegData’s interface permits the user to examine this by looking at the table view for a specific combination of industries and regulators. It is constructed by searching for a set of phrases that may indicate that the CFR part is targeting a specific industry. The phrases that are searched for are based on the 2007 2-digit, 3-digit, and 4-digit industry descriptions given in the North American Industry Classification System (NAICS), and are developed following the rules given in Appendix A of the RegData working paper. It is normalized by the number of words in the CFR part, so that Industry Relevance is given by the number of times an industry’s descriptive phrases were found in a CFR part divided by the number of words in that part. See the example below.
Restrictions – The variable restrictions is a count of the number of occurrences of words or terms that indicate obligation or restriction. RegData counts five such terms: “shall,” “must,” “may not,” “prohibited,” and “required.”
Word Count – Word count gives the number of words found in the relevant regulatory text in the selected year or years. This does not include words in tables or graphics that are in the regulatory text.
File size (available only in the full RegData 1.0 dataset, discontinued in RegData 2.0) – Gives the size (in bytes) of each set of files that corresponds to a CFR title.
Page Count (available only in the RegData 1.0 dataset, discontinued in RegData 2.0) – Count of the number of pages in a given CFR title in a given year.
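The restrictions count defined above can be sketched in a few lines of Python. This is a minimal illustration, not RegData’s own code: the five terms come from the definition, while case-insensitive matching, whole-word boundaries, and whitespace normalization are assumptions about how the search might be run.

```python
import re
from collections import Counter

# The five restriction terms named in the definition of the variable.
RESTRICTION_TERMS = ["shall", "must", "may not", "prohibited", "required"]

# Assumption: whole-word, case-insensitive matching.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(t) for t in RESTRICTION_TERMS) + r")\b",
    re.IGNORECASE,
)


def count_restrictions(text: str) -> Counter:
    """Count occurrences of each restriction term in a block of text."""
    # Collapse runs of whitespace so "may\nnot" still matches "may not".
    normalized = re.sub(r"\s+", " ", text)
    return Counter(m.group(1).lower() for m in _PATTERN.finditer(normalized))


sample = (
    "An operator shall file a report. Records must be kept. "
    "Operators may not transfer permits. Open burning is prohibited "
    "unless a permit is required."
)
counts = count_restrictions(sample)
total_restrictions = sum(counts.values())
```

Summing the per-term counts gives the single restrictions figure for the text.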
Details on the Construction of the Industry Relevance Metric
This section explains in excruciating detail how we constructed the industry relevance metric. First, by decomposing typical NAICS industry descriptions, we describe the structure of industry descriptions. Second, we explain the rules we developed to turn the NAICS industry descriptions into a set of search strings. Third, we cover some shortcomings of our system and offer possible solutions to the individual user of the database. Finally, we explain how we calculated the industry relevance metric and discuss alternative ways to calculate it.
A. Industry Name Structure
The NAICS industry description is a collection of words or phrases linked by conjunctions or commas, such as “Agriculture, Forestry, Fishing and Hunting” and “Finance and Insurance” (we discuss some important exceptions below). The full description can be divided into an exhaustive collection of phrases that may have some overlap in shared words. For example, “Oil and Gas Extraction” can be divided into “Oil Extraction” and “Gas Extraction.”
Each individual phrase is a noun phrase. The noun phrase has up to three components.
Head noun: The main word in the phrase. This can be in the form of a present participle (Fishing) or not (Construction).
Pre-modifiers: Words that precede the head noun and modify its meaning. They can be adjectives (Educational in “Educational Services”), nouns (Waste Management in “Waste Management Services”) or a mixture (Electronic Product in “Electronic Product Manufacturing”). They can also be absent (Construction).
Post-modifiers: Words that follow the head noun and modify its meaning. They can be nouns (Companies in “Management of Companies”) or a mixture of adjectives and nouns (Economic Programs in “Administration of Economic Programs”). They can also be absent (Construction). We ignore prepositions.
B. Rules for Strings
Each of the following rules applies to each of the full phrases derived from the industry description. All searches are case insensitive.
Rule 1: The full phrase.
- Conditions: None.
- Examples: (wholesale trade).
- Exceptions: None.
Rule 2: The singular form of the full phrase.
- Conditions: The full phrase is naturally pluralized.
- Examples: (utility in “utilities”).
- Exceptions: None.
Rule 3: The person who does the full phrase (singular).
- Conditions: A commonly used version actually exists.
- Examples: (retail trader in “retail trade”).
- Exceptions: None.
Rule 4: The person who does the full phrase (plural).
- Conditions: A commonly used version actually exists.
- Examples: (retail traders in “retail trade”).
- Exceptions: None.
Rule 5: The head noun.
- Conditions: The full phrase is composed of more than one word.
- Examples: (trade in “wholesale trade”).
- Exceptions: The head noun is used extensively in the CFR to convey a meaning that is fundamentally different from the meaning in the phrase. Exclude (assistance in “social assistance”) and (services in “educational services”).
Rule 6: The base form of the head noun.
- Conditions: The full phrase is only one word AND the head noun is a present participle.
- Examples: (hunt in “hunting”).
- Exceptions: None.
Rule 7: The pre-modifiers together as a whole string.
- Conditions: The head noun has pre-modifiers.
- Examples: (waste management in “waste management services”).
- Exceptions: The pre-modifiers are used extensively in the CFR to convey a meaning that is fundamentally different from the meaning in the phrase. Exclude (public in “public administration”).
Rule 8: The post-modifiers together as a whole string.
- Conditions: The head noun has post-modifiers.
- Examples: (human resource programs in “Administration of Human Resource Programs”).
- Exceptions: The post-modifiers are used extensively in the CFR to convey a meaning that is fundamentally different from the meaning in the phrase. Exclude (enterprises in “management of enterprises”).
Rule 9: Individual words and phrases in pre-modifiers and post-modifiers.
- Conditions: The head noun has more than one pre- or post-modifier.
- Examples: (coal in “coal products manufacturing”).
- Exceptions: The pre- or post-modifiers are used extensively in the CFR to convey a meaning that is fundamentally different from the meaning in the phrase. Exclude (products in “plastics products manufacturing”).
In our database, we begin by dividing the industry description into the individual noun phrases described above; within the industry, each noun phrase is assigned a group number to distinguish its strings from those belonging to the other noun phrases. For example, in the industry “oil and gas extraction,” oil extraction is assigned group 1 and gas extraction is assigned group 2.
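The rules that depend only on the phrase structure (rules 1, 5, 7, and 8) can be sketched as follows. This is an illustrative sketch, not the authors’ code: the `NounPhrase` class, the `mechanical_strings` function, and the `excluded` parameter are hypothetical names, and the rules that need linguistic judgment (2, 3, 4, 6, and 9) are left out. Each noun phrase is assumed to have been split by hand into head noun, pre-modifiers, and post-modifiers.

```python
from dataclasses import dataclass


@dataclass
class NounPhrase:
    full: str       # the full phrase, e.g. "educational services"
    head: str       # the head noun, e.g. "services"
    pre: str = ""   # pre-modifiers as one string, e.g. "educational"
    post: str = ""  # post-modifiers as one string, prepositions dropped


def mechanical_strings(phrase, group, excluded=frozenset()):
    """Return (group, rule, string, recommended) rows for rules 1, 5, 7, 8."""
    candidates = [(1, phrase.full)]               # Rule 1: the full phrase
    if len(phrase.full.split()) > 1:
        candidates.append((5, phrase.head))       # Rule 5: the head noun
    if phrase.pre:
        candidates.append((7, phrase.pre))        # Rule 7: pre-modifiers
    if phrase.post:
        candidates.append((8, phrase.post))       # Rule 8: post-modifiers
    # Searches are case insensitive, so strings are stored in lower case.
    # The flag is False for strings the caller lists as exceptions.
    return [
        (group, rule, s.lower(), s.lower() not in excluded)
        for rule, s in candidates
    ]


rows = mechanical_strings(
    NounPhrase(full="Educational Services", head="Services", pre="Educational"),
    group=1,
    excluded={"services"},  # the rule 5 exception named in the text
)
```

Here `rows` holds one entry per rule that applies, with the rule 5 string “services” kept in the output but flagged as not recommended, so that its occurrences can still be measured.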
C. Situations When the Rules Are Ineffective
The above rules are ineffective in three infrequent classes of NAICS industry descriptions. The first is when the industry description involves a parenthetical comment, typically an exception, such as “mining (except oil and gas).” Our solution is to simply ignore the parenthetical comment. The following industries suffer from this problem:
- 212 (Mining (except Oil and Gas))
- 511 (Publishing Industries (except Internet))
- 515 (Broadcasting (except Internet))
- 533 (Lessors of Nonfinancial Intangible Assets (except Copyrighted Works))
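Ignoring the parenthetical comment amounts to a small text transformation before the rules are applied. A minimal sketch, assuming descriptions contain at most one level of parentheses; `strip_parenthetical` is a hypothetical helper name, not part of RegData.

```python
import re


def strip_parenthetical(description: str) -> str:
    """Drop any "(...)" comment, plus the space before it, from a description."""
    return re.sub(r"\s*\([^()]*\)", "", description).strip()


cleaned = [
    strip_parenthetical(d)
    for d in (
        "Mining (except Oil and Gas)",
        "Publishing Industries (except Internet)",
        "Broadcasting (except Internet)",
    )
]
```

After this step the descriptions read “Mining,” “Publishing Industries,” and “Broadcasting,” which means the resulting strings can also match text about the excepted activities.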
The second is the case of “other,” “support,” or “related” activities, such as “support activities for mining” or “furniture and related product manufacturing.” We apply the rules in the normal fashion; however, in some of these cases, the outcome is unlikely to fully reflect the spirit of the NAICS industry description. The following industries suffer from this problem:
- 56 (Administrative and Support and Waste Management and Remediation Services)
- 115 (Support Activities for Agriculture and Forestry)
- 323 (Printing and Related Support Activities)
- 337 (Furniture and Related Product Manufacturing)
- 488 (Support Activities for Transportation)
- 519 (Other Information Services)
- 525 (Funds, Trusts, and Other Financial Vehicles)
- 561 (Administrative and Support Services)
- 711 (Performing Arts, Spectator Sports, and Related Industries)
- 712 (Museums, Historical Sites, and Similar Institutions)
- 813 (Religious, Grantmaking, Civic, Professional, and Similar Organizations)
- 921 (Executive, Legislative, and Other General Government Support)
The final case is that of industry names that contain the word “general” or “miscellaneous,” such as “general merchandise stores.” We apply the rules in the normal fashion. However, in some of these cases, the outcome is unlikely to fully reflect the spirit of the NAICS industry description. The following industries suffer from this problem:
- 339 (Miscellaneous Manufacturing)
- 452 (General Merchandise Stores)
- 453 (Miscellaneous Store Retailers)
We have omitted industry description 81 (Other Services (Except Public Administration)) because any search for strings based on the words “other services” would return useless results. We have also omitted 423 (Merchant Wholesalers, Durable Goods) and 424 (Merchant Wholesalers, Nondurable Goods); they are the only three-digit industries that fall under 42 (Wholesale Trade), and we cannot think of a sensible way of distinguishing them since they do not follow the phrase structure of the other NAICS industry names. Therefore, we direct the user to the data on 42 (Wholesale Trade) only.
Each industry description is associated with a collection of strings. The strings are classified according to group and rule. For each group in each industry, each rule in the range 1–8 is associated with at most one string. Rule 9 can yield multiple strings associated with the same group or industry.
As an illustration, consider industry 316 (Leather and Allied Product Manufacturing). The industry name is composed of two phrases: leather manufacturing (group 1) and allied product manufacturing (group 2).
The resulting strings are in table A1.
In this example, based on our discretionary interpretation of the rules, we exclude manufacturing, manufacture, allied, and product. In the final database, there is a variable denoting which strings we recommend including or excluding, though we still measure the occurrence of every string to allow users to judge for themselves. Though we judge each rule-9 string on individual merit, in the default version of the final database (which we use for the figures and tables in the main text), we exclude all rule-9 strings. In appendix C, we detail strings where we struggled to decide on inclusion or exclusion.
As table A1 shows, some of the smaller strings are contained in the larger strings from the same group. More formally, each string derived from rules 1, 2, 3, or 4 can potentially contain the head noun (string from rule 5), the pre-modifier (string from rule 7) or post-modifier (string from rule 8) from the same group. (We ignore containment of the strings from rule 9 because we are excluding rule 9 strings.) We therefore create three additional dummy variables: contains_head_noun, contains_pre_modifier, and contains_post_modifier. These variables make it easy to use statistical software to eliminate double-counting. For example, every occurrence of the string “leather manufacturing” automatically implies an occurrence of the string “leather,” but we would only want to count such an occurrence once. We provide programming code for Stata that prevents double-counting by using these variables.
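The double-counting problem can also be illustrated directly on the text. The sketch below is an alternative technique, not the Stata code mentioned above and not based on the three dummy variables: it searches for all strings in one pass, longest string first, so that any stretch of text is matched by at most one string. The function name and the whole-word matching are assumptions.

```python
import re


def count_without_double_counting(text: str, strings) -> int:
    """Count string occurrences, matching each stretch of text at most once."""
    # Longest strings first, so "leather manufacturing" wins over "leather".
    ordered = sorted({s.lower() for s in strings}, key=len, reverse=True)
    if not ordered:
        return 0
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(s) for s in ordered) + r")\b",
        re.IGNORECASE,  # all searches are case insensitive
    )
    # findall returns non-overlapping matches, scanning left to right.
    return len(pattern.findall(text))


text = "Leather manufacturing is regulated; leather imports are not."
hits = count_without_double_counting(text, ["leather manufacturing", "leather"])
```

A naive approach that counts each string separately would find “leather” twice and “leather manufacturing” once, for a total of three; the single-pass search counts two.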
In some cases, a string is shared by multiple groups in the same industry (e.g., manufacturing in the example in table A1). We assign such shared strings to the first group that shares them: because we ultimately aggregate at the industry level, assigning them to multiple groups within the same industry would result in double-counting.
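Assigning a shared string to the first group that contains it can be sketched as a single pass over the groups in order. The function name is hypothetical, and the string lists below are illustrative values for industry 316 rather than a copy of table A1.

```python
def assign_shared_strings(strings_by_group):
    """Map each string to the lowest-numbered group that contains it."""
    owner = {}
    for group in sorted(strings_by_group):
        for s in strings_by_group[group]:
            # setdefault keeps the first (lowest) group seen for a string.
            owner.setdefault(s, group)
    return owner


owner = assign_shared_strings({
    1: ["leather manufacturing", "manufacturing", "leather"],
    2: ["allied product manufacturing", "manufacturing", "allied product"],
})
```

In this example “manufacturing” is owned by group 1 only, so it enters the industry-level total once.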
Once we have eliminated the possibility of double-counting for each industry or title, we sum the total occurrences of the included strings in that title. We then divide that sum by the number of words in the part and multiply by 1000 to obtain a measure of industry relevance. This measure prevents longer titles from appearing to be more relevant to an industry simply by virtue of their length. Users have the opportunity to undo this act of deflation should they so desire.
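The normalization step, and its reversal, reduce to simple arithmetic. A minimal sketch with hypothetical function names; the inputs are the deduplicated string count and the word count of the same text.

```python
def industry_relevance(string_hits: int, word_count: int, per: int = 1000) -> float:
    """Occurrences of an industry's strings per `per` words of text."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    return string_hits * per / word_count


def undo_deflation(relevance: float, word_count: int, per: int = 1000) -> float:
    """Recover the raw occurrence count from a relevance value."""
    return relevance * word_count / per


# Illustrative numbers only: 12 occurrences in 48,000 words of text.
relevance = industry_relevance(12, 48_000)
raw_hits = undo_deflation(relevance, 48_000)
```

With these illustrative numbers the relevance is 0.25 occurrences per 1,000 words, and undoing the deflation returns the original 12 occurrences.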