Using historical newspaper data to deal with measurement errors

Data that were not collected in the past are often unavailable to researchers and policymakers today. A recent and rapidly growing literature tries to circumvent this limitation by gathering information buried in old newspapers. Digitized historical newspaper articles contain information on wages, prices, and events such as strikes, episodes of violence, and natural disasters. They also offer fine-grained geographic and temporal variation. To date, historical newspapers have been used to gather information on, among other things, anti-Black attitudes (Ottinger and Winkler 2022), fertility restrictions (Beach and Hanlon 2019), and cotton seed prices and varieties (Rhode 2021). The increasing availability and scope of digital newspaper archives (such as Chronicling America), like data obtained from maps through machine-learning methods (Combes et al. 2020), pave the way for creating long-run data that were previously unavailable to researchers. This development follows the recent trend of integrating historical data and methods into mainstream economics (Margo 2017).

Typically, researchers collect newspaper-based data for use as an outcome, treatment, or control variable in statistical analysis. In a recent study (Ferrara et al. 2022), we show how data generated from historical newspaper articles can serve another important purpose: addressing measurement error in statistical analysis. We build on the framework of Chalfin and McCrary (2018), who argue that when a researcher has two mismeasured variables for the same quantity of interest, one measure can serve as an instrument for the other to recover the true coefficient of interest, provided that the errors in the two variables are uncorrelated. In practice, however, this is rarely an option, since collecting a second independent measure of the same quantity is usually prohibitively expensive. We show that such a second measure can be constructed cheaply from digitized newspapers, and we outline the conditions under which this strategy is likely to succeed in practical and empirical terms.
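The logic of using one noisy measure as an instrument for another can be illustrated with a small simulation. The sketch below is not taken from the study: the function name, sample size, coefficient, and error variances are all assumptions chosen for illustration, and it assumes classical (mean-zero, mutually independent) measurement errors.

```python
import numpy as np

def simulate_two_measures(n=200_000, beta=-0.5, seed=0):
    """Compare OLS on one noisy measure with IV using a second noisy measure.

    All parameter values are illustrative assumptions, not estimates
    from any of the studies discussed in the text.
    """
    rng = np.random.default_rng(seed)
    x_true = rng.normal(size=n)              # true, unobserved regressor
    x1 = x_true + rng.normal(size=n)         # first noisy measure
    x2 = x_true + rng.normal(size=n)         # second measure, independent error
    y = beta * x_true + rng.normal(size=n)   # outcome depends on the true value

    # OLS slope of y on the first noisy measure (attenuated toward zero)
    ols = np.cov(x1, y)[0, 1] / np.var(x1, ddof=1)
    # IV slope: the second measure instruments for the first
    iv = np.cov(x2, y)[0, 1] / np.cov(x2, x1)[0, 1]
    return ols, iv

ols_slope, iv_slope = simulate_two_measures()
print(f"OLS: {ols_slope:.3f}, IV: {iv_slope:.3f}")
```

Because the true regressor and each error are given the same variance here, the OLS slope converges to about half the true coefficient, while the IV slope converges to the true value. If the two errors were correlated, the IV estimate would no longer be consistent.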

To illustrate our framework, we replicate two recent studies, Ager et al. (2017) and Clay et al. (2019), on the effects of the boll weevil's spread on economic outcomes in the US South between 1892 and 1922. The spread of the boll weevil is usually measured from a map published by the US Department of Agriculture (USDA). Although this map is generally of high quality, it contains errors such as the crossing date lines shown in Figure 1. Each of these lines marks the farthest extent of the cotton-boll-consuming beetle in a given year, so in theory the date lines cannot cross. If such errors occur at random, a statistical analysis based on the map would underestimate the effect of the boll weevil. An accurate estimate of the pest's effect matters because it informs policymakers who want to use it as a benchmark for the spread of insects in other contexts.

Figure 1 Errors in the USDA map of the boll weevil's spread

Notes: The USDA map of the boll weevil's spread shows the year of the beetle's arrival in southern counties between 1892 and 1922. Each line marks a different year of arrival, which means that in principle these lines should not cross. In practice, such crossings occur; examples are marked by red rectangles.

To address this measurement error problem within the Chalfin and McCrary (2018) framework, we need a second measure of the boll weevil's spread over time and space that can be collected at reasonable cost. We construct such a measure from digitized newspapers by searching for articles that contain the words “boll weevil” together with the name of the county for which we want to measure the insect's arrival. Although not every county has newspapers in the archive, newspapers in the same state report on the weevil's presence in other parts of the state, as in the example shown in Figure 2.

Figure 2 Boll weevil reporting in local newspapers

Notes: The Jackson Daily News of Hinds County (left) and the Star Ledger of Attala County (right) report on the arrival of the boll weevil in Marion County, Mississippi.

We construct a noisy newspaper-based measure of the boll weevil's arrival date, based on the peak of the five-year moving average of the share of articles that mention the insect together with a county's name. Figure 3 provides an example of the newspaper-based arrival measure compared with the USDA map date for Marion County. Newspaper data tend to be noisy, which is why we smooth out some of the noise by applying the five-year moving average. It should be noted that, for our framework to hold, it does not matter whether the newspaper data provide a more or less noisy measure of the boll weevil's arrival date than the USDA map. The only necessary assumption is that the errors in the map are uncorrelated with the errors in the newspaper coverage of the insect.

Figure 3 Example of newspaper-based and USDA arrival dates for Marion County, MS

Notes: The dashed line is the share of articles mentioning both “boll weevil” and “Marion County” among all articles mentioning “Marion County” in every available newspaper outlet in Mississippi between 1882 and 1932. The solid line represents its five-year moving average. The red line shows the arrival of the boll weevil in Marion County according to the USDA map. The blue line indicates the predicted arrival at the peak of the five-year moving average.
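The construction described above (annual share of county articles that mention the boll weevil, a five-year moving average, and the year of its peak) can be sketched in a few lines. This is a minimal illustration rather than the study's code: the function names are hypothetical, the counts are invented, and the use of a centered window that shrinks at the edges is an assumption, since the text does not say how the window is aligned.

```python
def moving_average(values, window=5):
    """Centered moving average; near the edges only available years are used."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed

def newspaper_arrival_year(years, weevil_counts, county_counts, window=5):
    """Return the year in which the smoothed article share peaks.

    weevil_counts: articles mentioning both "boll weevil" and the county
    county_counts: all articles mentioning the county
    Years without any county articles get a share of zero (an assumption).
    """
    shares = [w / c if c > 0 else 0.0
              for w, c in zip(weevil_counts, county_counts)]
    smoothed = moving_average(shares, window)
    # Ties are resolved in favor of the earliest year
    peak = max(range(len(years)), key=lambda i: smoothed[i])
    return years[peak]

# Invented counts for a single hypothetical county
years = list(range(1905, 1915))
weevil = [0, 0, 1, 2, 6, 9, 7, 3, 1, 0]
county = [50] * 10
print(newspaper_arrival_year(years, weevil, county))
```

Comparing the returned year with the arrival year read off the USDA map then gives the two noisy measures of the same quantity that the framework requires.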

We outline three ways in which this second measure can be used to address measurement error in the arrival date variable derived from the USDA map: set identification, sample restriction, and a parametric bias correction. The most intuitive of these is sample restriction, where we use only those observations for which the USDA map and the newspaper-based measure agree on the arrival date. We call this the ‘agreement model’. Although either of the two measures may well be wrong on its own, it is considerably less likely that both are wrong in the same way.
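The intuition behind the sample restriction can be shown with a stylized simulation. The sketch below simplifies the setting considerably and is not the study's procedure: it assumes a binary indicator (weevil present or not) instead of an arrival year, independent misclassification with the same probability in both sources, and hypothetical parameter values and function names.

```python
import numpy as np

def slope(x, y):
    """OLS slope from a regression of y on x with an intercept."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

def agreement_demo(n=200_000, beta=1.0, flip=0.2, seed=0):
    """Compare the full sample with the subsample where both measures agree.

    flip is the assumed probability that each source misclassifies an
    observation, independently of the other source.
    """
    rng = np.random.default_rng(seed)
    d_true = rng.integers(0, 2, size=n)                   # true status
    d_map = np.where(rng.random(n) < flip, 1 - d_true, d_true)
    d_news = np.where(rng.random(n) < flip, 1 - d_true, d_true)
    y = beta * d_true + rng.normal(size=n)

    naive = slope(d_map, y)                               # attenuated
    agree = d_map == d_news                               # agreement sample
    restricted = slope(d_map[agree], y[agree])
    return naive, restricted, agree.mean()

naive, restricted, share_kept = agreement_demo()
print(f"full sample: {naive:.3f}, agreement sample: {restricted:.3f}, "
      f"share kept: {share_kept:.2f}")
```

Under these assumed parameters, the full-sample slope is attenuated to roughly 0.6 of the true effect, while the agreement sample recovers roughly 0.88 of it at the cost of dropping about a third of the observations. The restriction reduces the bias without eliminating it, because both sources can still be wrong together.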

The main takeaways from our replication of Ager et al. (2017) and Clay et al. (2019) are as follows. First, even when using the newspaper-based arrival date instead of the USDA date, we can still replicate the results of both papers. Had the USDA maps not existed, this research could have been conducted with data collected from digitized newspapers. This highlights the usefulness of digitized historical newspapers for creating new data. Second, applying the three measurement error correction methods with the newspaper-based arrival dates increases the effect sizes and strengthens the results of both papers. Although the newspaper-based measure was constructed in a crude, quick, and cost-effective way, it provides considerable value in reducing measurement error in the original USDA map.

Researchers often tend to ignore measurement error in applied settings as long as some conventional level of statistical significance is achieved. When it is not, promising research projects tend to be abandoned. We hope to provide a low-cost option for dealing with measurement error, especially in work with historical data, where measurement error is a widespread problem. Because it relies on newspaper-based information, our strategy works best for quantities that can be easily extracted from newspapers, such as events that can be identified with a single search term (or a small set of terms). Other quantities, such as prices, are considerably harder to extract, and our solution may not be a viable option in such cases.


References

Ager, P and B Herz (2019), “From Farm to Factory Floor: How Structural Transformation Triggers Fertility Change”, 16 May.

Ager, P, M Brueckner and B Herz (2017), “The boll weevil plague and its effect on the southern agricultural sector, 1889-1929”, Explorations in Economic History 65: 94-105.

Beach, B and W Hanlon (2019), “Censorship, Family Planning, and Historical Fertility Change”, 4 August.

Chalfin, A and J McCrary (2018), “Are US Cities Underpoliced? Theory and Evidence”, Review of Economics and Statistics 100 (1): 167-186.

Clay, K, E Schmick and W Troesken (2019), “The Rise and Fall of Pellagra in the American South”, Journal of Economic History 79 (1): 32-62.

Combes, P-P, G Duranton, L Gobillon, C Gorin and Y Zylberberg (2020), “(Decision) Trees and (Random) Forests: Urban Economy, Historical Data, and Machine Learning”, 17 November.

Ferrara, A, J Y Ha and R Walsh (2022), “Using Digitized Newspapers to Refine Historical Measures: The Case of the Boll Weevil”, NBER Working Paper No. 29808, February.

Margo, R (2017), “Integrating Economic History into Economics”, 3 September.

Ottinger, S and M Winkler (2022), “The Political Economy of Propaganda: Evidence from US Newspapers”, IZA Working Paper No. 15078, February.

Rhode, P (2021), “Biological Innovation Without Intellectual Property Rights: Cottonseed Markets in the Antebellum American South”, Journal of Economic History 81 (1): 198-238.
