H. L. Gassmann
This semester, I took a risk and enrolled in an elective outside the history department: Introduction to Statistical Methods. As an undergrad, I excelled in my statistics course and tutored the topic for a summer. Since starting my MA in history here at Villanova, I’ve been a little disturbed by the way some historians toss statistics into their books with very little explanation. I’ve also sensed that when we discuss an author’s use of statistics in class, none of us are really sure what exactly we are talking about. I hoped to grapple with statistics in a meaningful way so that, at the very least, I could know enough to know that I don’t know enough.
It didn’t take too many weeks surrounded by MA candidates in Statistics to realize that I didn’t know very much. When my professor gave me the opportunity to complete a project analyzing the use of statistics in a historical study, I jumped at the chance. After bouncing ideas off professors and peers, I decided to dig into Edward Baptist’s 2014 The Half Has Never Been Told: Slavery and the Making of American Capitalism. Keep reading to find out what I learned.
Baptist’s book has inspired interdisciplinary discussions. Baptist argues slave labor was an extremely efficient, productive system that was responsible for the success of American capitalism. This idea conflicts with much of the historiography on the economics of slavery. For decades historians have claimed slavery was already an outdated and inefficient labor system during the Civil War. Baptist roots his argument in quantitative data, utilizing a few different databases and complementing his text with charts and graphs. However, his use and application of quantitative data have elicited criticism. I examined one of Baptist’s quantitative sources to shed light on the author’s manipulation of statistics regarding picking productivity. By ignoring characteristics of the sample data and strategically choosing pieces of the study, Baptist inappropriately uses a regression model as evidence for his argument.
Baptist depends on “Biological Innovation and Productivity Growth in the Antebellum Cotton Economy” by Alan L. Olmstead and Paul W. Rhode as evidence to show that the quantity of cotton picked per slave per day increased between three and four times between 1800 and 1862. Baptist uses their scatter plot showing a positive association between the mean daily pounds of cotton picked per worker on southern US plantations and time (Panel A, which I’ll show you later on). In the same study, Olmstead and Rhode create a multiple regression model. Based on their model, they argue that “the increase in picking productivity was primarily due to the spread of improved cotton varieties” (Olmstead, 920). Baptist, however, presents Olmstead and Rhode’s data while ignoring their conclusions. He asserts that the increase demonstrated by Olmstead and Rhode is a result of elevated torture tactics by slave masters. Slaves picked more productively, he argues, due to a heightened fear of the harsher punishments they would endure for slow work.
Olmstead and Rhode collected picking data from documents they found in public archives and private papers across the United States. These documents included plantation journals, diaries, cotton books, ledgers, and letters. They used 704,800 individual picking entries from 142 different plantations over the course of 509 plantation crop-years from 1801 to 1862. Olmstead and Rhode explain that their sample is not “distributed uniformly with cotton production over space and time,” and that the “quantity and quality of the data differ greatly from plantation to plantation” (Olmstead and Rhode, 1146). For example, some plantations are represented with a record from one picking season. Others have records from multiple picking seasons over the course of ten to twenty years. Olmstead and Rhode note that more data was available during the later years considered in their study: 407 of the 509 plantation crop-years are recorded between 1840 and 1862. Geographically, 474 records are from southern US cotton plantations while 35 come from the Sea Islands.
For this retrospective observational study, Olmstead and Rhode’s data collection strategy was consistent with that of most historians. Because they had to work with the pieces of the existing historical record that they could access, their method can most closely be identified as convenience sampling, and the resulting sample is unlikely to be random. This method is not considered good statistical practice, but it may be the only viable option for analyzing plantation records from the 19th century. Historical records like the ones Olmstead and Rhode utilized are always vulnerable to measurement bias. It is difficult to know whether plantation managers, overseers, or whoever else was keeping plantation records may have fudged their numbers or falsified their accounts 150 to 200 years ago. Historians dealing with primary sources like these should use the archival resources available to them while thinking critically about how possible biases are influencing their analysis.
Olmstead and Rhode analyze their data in two main ways. First, they calculate the mean daily picking rate for each plantation crop-year. These averages are depicted on two scatter plots, where the x-axis represents the year and the y-axis represents the daily pounds of cotton picked per worker. One scatter plot is dedicated to southern United States cotton plantations (Panel A), while a second represents Sea Island plantations (Panel B). The data points are labeled with location abbreviations to indicate geographical placement. Both graphs show a positive association between year and daily pounds of cotton picked per worker. The figures visually demonstrate that the mean daily pounds of cotton picked per worker increases dramatically in the southern US between 1801 and 1862, while the mean daily pounds of cotton picked per worker sees only a slight increase in the Sea Islands.
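To make the first step concrete, here is a minimal sketch of how individual picking entries could be collapsed into one mean per plantation crop-year. The plantation names, years, and pound values are invented for illustration and are not drawn from Olmstead and Rhode's data; the function name crop_year_means is my own.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical picking entries: (plantation, year, pounds picked by one worker in one day).
# These values are invented for illustration, not taken from the study.
entries = [
    ("Plantation A", 1805, 40.0),
    ("Plantation A", 1805, 50.0),
    ("Plantation A", 1806, 60.0),
    ("Plantation B", 1850, 120.0),
    ("Plantation B", 1850, 140.0),
    ("Plantation B", 1850, 130.0),
]

def crop_year_means(rows):
    """Group entries by (plantation, year) and average the daily pounds picked."""
    groups = defaultdict(list)
    for plantation, year, pounds in rows:
        groups[(plantation, year)].append(pounds)
    return {key: mean(values) for key, values in groups.items()}

means = crop_year_means(entries)
for (plantation, year), average in sorted(means.items()):
    print(plantation, year, average)
```

Each resulting mean would become one point on a scatter plot. Note that a crop-year with three entries and a crop-year with thousands of entries each produce a single point, which is one reason the uneven quality of the sample matters.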
Olmstead, Alan L. and Paul W. Rhode. “Biological Innovation and Productivity Growth in the Antebellum Cotton Economy.” Journal of Economic History 68, no. 4 (2008): 1148.
Olmstead and Rhode also created a multiple regression model of the log of mean daily picking rates (Table 2). In a multiple regression model, one response variable is related to many explanatory variables. In this model, the response variable is the log of the mean daily picking rates. Olmstead and Rhode first relate the log of mean daily picking rates to time trends separately for southern US plantations and Sea Island plantations. A second regression adds the log of the average number of pickers per day, again separating plantations by location. US plantations are treated with a third regression that relates time, the log of the average number of pickers per day, the percentage of picking observations for males, and the percentage of picking observations during non-peak months. The third regression also includes a categorical variable: whether the plantation was in a New South state (AL, AR, FL, LA, MS, TN, TX). New South = 1 if the plantation was in one of these states, while New South = 0 if it was not. They calculated R² for each regression, which is the proportion of variability in the response that is accounted for by the regression model.
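For readers who prefer symbols, here is how I would write out the third regression and R² based on the description above. This is my own sketch: the variable labels (Rate, Year, Pickers, Male, NonPeak, NewSouth) and the indexing by plantation p and year t are assumptions I chose for readability, not notation copied from Table 2.

```latex
% Third regression as described in the text (labels are illustrative assumptions)
\ln\left(\overline{\text{Rate}}_{pt}\right)
  = \beta_0
  + \beta_1 \,\text{Year}_{t}
  + \beta_2 \ln\left(\text{Pickers}_{pt}\right)
  + \beta_3 \,\text{Male}_{pt}
  + \beta_4 \,\text{NonPeak}_{pt}
  + \beta_5 \,\text{NewSouth}_{p}
  + \varepsilon_{pt}

% Proportion of variability in the response accounted for by the model
R^2 = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2}
```

Because NewSouth only takes the values 0 and 1, its coefficient acts as a shift in the expected log picking rate for New South plantations relative to the others, with the remaining variables held fixed.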
Olmstead, Alan L. and Paul W. Rhode. “Biological Innovation and Productivity Growth in the Antebellum Cotton Economy.” Journal of Economic History 68, no. 4 (2008): 1149.
To create this regression model, Olmstead and Rhode calculated a regression coefficient for each explanatory variable. In a multiple regression model, a regression coefficient represents the average change in the response variable for a one-unit change in a single explanatory variable when the other explanatory variables are held constant. They also calculated the constant, which is the expected value of the response variable when every explanatory variable is equal to 0. The model for a multiple regression is written as $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i$. To this point, multiple regression models are similar to simple linear regression models. However, finding $\beta_0, \beta_1, \dots, \beta_k$ is more complicated in multiple regression models. To minimize the sum of squared residuals in a multiple regression, a system of equations called the normal equations must be solved.
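Here is a minimal sketch of what solving the normal equations looks like in practice, written in plain Python so every step is visible. It builds the system (X'X)b = X'y and solves it by elimination. The toy data are invented so that the true answer is known in advance (y = 1 + 2·x1 + 3·x2, where x2 is a 0/1 variable like New South); the function names are my own, and this is not the authors' procedure or data. Statistical software typically uses more numerically stable methods to reach the same least-squares answer.

```python
def transpose(matrix):
    """Swap rows and columns."""
    return [list(column) for column in zip(*matrix)]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Move the row with the largest entry in this column into the pivot position.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fit_ols(x_rows, y):
    """Return [b0, b1, ..., bk] that minimize the sum of squared residuals."""
    X = [[1.0] + list(row) for row in x_rows]   # leading 1.0 gives the constant b0
    Xt = transpose(X)
    XtX = matmul(Xt, X)                          # left side of the normal equations
    Xty = [sum(xi * yi for xi, yi in zip(col, y)) for col in Xt]  # right side
    return solve(XtX, Xty)

# Invented toy data: y = 1 + 2*x1 + 3*x2 exactly, with x2 a 0/1 indicator.
x_rows = [(0, 0), (1, 0), (2, 1), (3, 1), (4, 0)]
y = [1.0, 3.0, 8.0, 10.0, 9.0]

coefficients = fit_ols(x_rows, y)
print([round(b, 6) for b in coefficients])
```

Because the toy data contain no noise, the fitted coefficients recover the constant and both slopes up to floating-point rounding. With real, noisy records the same equations return the best-fitting coefficients rather than exact ones.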
In their study, Olmstead and Rhode address the possibility that slave-drivers may have been exploiting their slaves more efficiently, thus increasing productivity. After testing the significance of the coefficient of the year and the log of the picking crew size, they found that the relationship was not statistically significant at the 10% level. Their conclusion is that “managerial innovations…unlikely accounted for much of the increase in picking efficiency” (Olmstead and Rhode, 1143). It is debatable whether this coefficient is truly representative of the increased violence Baptist argues for, but it is alarming that Baptist uses this data without acknowledging this conclusion.
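For anyone unsure what "significant at the 10% level" means mechanically, here is a rough sketch of the check. It divides a coefficient by its standard error and asks how likely a value that extreme would be if the true coefficient were zero. I use a normal approximation to keep the code short; a full analysis would use a t distribution with the appropriate degrees of freedom. The coefficients and standard errors below are hypothetical numbers chosen for illustration, not values from Olmstead and Rhode's Table 2, and the function names are my own.

```python
import math

def two_sided_p_value(coefficient, standard_error):
    """Normal-approximation p-value for the null hypothesis that the coefficient is 0."""
    z = coefficient / standard_error
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

def is_significant(coefficient, standard_error, alpha=0.10):
    """True when the p-value falls below the chosen significance level."""
    return two_sided_p_value(coefficient, standard_error) < alpha

# Hypothetical values, for illustration only.
weak = two_sided_p_value(0.02, 0.02)     # coefficient equal to one standard error
strong = two_sided_p_value(0.05, 0.02)   # coefficient 2.5 standard errors from zero

print(round(weak, 4), is_significant(0.02, 0.02))
print(round(strong, 4), is_significant(0.05, 0.02))
```

In the first hypothetical case the estimate is too noisy to distinguish from zero at the 10% level; in the second it is not. A result that is not significant does not prove an effect is absent, only that the data cannot separate it from no effect.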
In Olmstead’s review of The Half Has Never Been Told, he criticizes Baptist for manipulating the statistical model he and Rhode developed. He takes issue with Baptist’s selective use of their data: Baptist reproduces their figure that shows productivity increasing dramatically on southern US cotton plantations (Panel A), but does not include their figure that shows the productivity on Sea Island plantations changing very little (Panel B). Olmstead argues that masters on the Sea Islands would have had access to the same torture techniques as their peers in the southern US, and would have implemented them to match southern output. He notes that the data had a pattern: picking rates were low at the beginning and end of a season, with the most productivity in the middle and variations throughout attributed to weather. An individual slave’s productivity, he says, varied 30% or more on any given day. These characteristics support the idea of improved cotton varieties, but do not match up nicely with Baptist’s theory. In Olmstead’s words, “it is impossible to reconcile his story with the data” (Olmstead, 920).
After examining Olmstead and Rhode’s “Biological Innovation and Productivity Growth in the Antebellum Cotton Economy,” it is evident that Baptist manipulated their study to support his argument. He used one of their visuals to support his claim of increased productivity in cotton picking. He did not take characteristics of the sample into account, ignored their discussion of the Sea Islands, and discounted much of their statistical analysis.
So what does it all mean? We don’t need to completely discount Baptist’s argument because of this – he has many other sources, primary and secondary, to take into consideration. But did Baptist intentionally pick and choose from Olmstead and Rhode’s study, or did he just not quite understand what he was looking at? Either way, it seems odd to include a secondary source that explicitly rejects your argument…as a supporting argument.
My main take-away from my foray into graduate level statistics this semester is a little piece of advice: if you don’t understand something, ask for help from someone who does.
Baptist, Edward. The Half Has Never Been Told. New York: Basic Books, 2014.
Olmstead, Alan L. “Roundtable of Reviews for The Half Has Never Been Told.” Journal of Economic History 75, no. 3 (2015): 919-931.
Olmstead, Alan L. and Paul W. Rhode. “Biological Innovation and Productivity Growth in the Antebellum Cotton Economy.” Journal of Economic History 68, no. 4 (2008): 1123-1171.
Ott, R. Lyman and Michael T. Longnecker. An Introduction to Statistical Methods and Data Analysis. 6th ed. Independence, KY: Duxbury Press, 2008.