How Data Happened
08 Jan 2023
Book | How Data Happened: A History from the Age of Reason to the Age of Algorithms
Author | Chris Wiggins, Matthew L. Jones
Principles of Analysis
John Tukey’s principles of data analysis:
Data analysis must seek for scope and usefulness rather than security.
Data analysis must be willing to err moderately often in order that inadequate evidence shall more often suggest the right answer.
Data analysis must use mathematical argument and mathematical results as bases for judgment rather than as bases for proofs or stamps of validity.
History of Data Analysis
Data analysis is detective work.
In the atmosphere of Bell Labs, Tukey and his collaborators created a wide variety of statistical and computational tools needed to make data analysis a reality. Sixteen years later, in a practical textbook, he explained that “exploratory data analysis” (EDA) is “detective work–numerical detective work–or counting detective work–or graphical detective work.” EDA offered some “general understandings” useful across domains of detective work.
Data visualization is incredibly important for effective data analysis.
Tukey celebrated the creation of new tools for that craft.
Tukey’s 1978 textbook, whose draft had circulated for years in Bell Labs circles and beyond, offered a survey of the arts of exploring data through potent means of “reexpression.” “We have not,” he explained in bold type, “looked at our results until we have displayed them effectively.”
Effective display meant efficiency with the many developing forms of visualizing data; Tukey emphasized that “much more creative effort is needed to pictorialize the output” of data analysis.
For humans, the use of appropriate pictures offers the possibility of great flexibility all along the scale from broad summary to fine detail, since pictures can be viewed in so many ways.
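As a loose illustration (not from the book), here is a minimal Python sketch of the kind of “reexpression” and display the passage describes: the same skewed, synthetic data shown raw and after a log transform, with matplotlib standing in for Tukey’s graphical toolkit. The data and parameters are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical skewed measurements standing in for a real data set.
rng = np.random.default_rng(0)
raw = rng.lognormal(mean=3.0, sigma=1.0, size=500)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Raw scale: a long right tail hides most of the structure.
axes[0].hist(raw, bins=30)
axes[0].set_title("Raw values")

# Tukey-style "reexpression": a log transform spreads the bulk of the
# data out so the display becomes informative.
axes[1].hist(np.log10(raw), bins=30)
axes[1].set_title("log10 reexpression")

plt.tight_layout()
plt.show()
```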
The Big Data Moment
A similar observation–that large data sets gathered for one purpose may yield new kinds of scientific and commercial knowledge–would be made in a diversity of computational fields over the coming decades.
Financial data and their practical analysis would give rise to technical analysis, statistical arbitrage, and later, with more computational engineering, the field of high-frequency trading. Similarly, computational biology in the 1990s and 2000s exploded with analysis of differing genomes as well as high-throughput biological assays for understanding genetic networks, large-scale mining of electronic health records, and clinical informatics.
In industry, applied computational statistical methods changed the way companies recommended books and movies early in the rise of e-commerce; later, the same techniques would be applied to wine, shoes, and eventually information and communication.
Each of these fields had its own “data moment” as it discovered anew how large quantities of data, generated for purposes other than learning, could be valuable given a bit of statistical analysis surrounded by an infrastructure needed to gather, process, and productize insights from these data. Chambers, Tukey, and others argued that the statistical analysis was a mere part of this project–the mathematical nugget at the core of “greater” statistics. But they were also warning that academic statistics was doomed to irrelevance if it didn’t begin providing the tools for learning from these data.
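To make the “bit of statistical analysis” concrete, here is a rough Python sketch, not from the book and not any particular company’s method, of item-item similarity on a user–item ratings matrix, the flavor of analysis behind early e-commerce recommendations. All names and numbers are hypothetical.

```python
import numpy as np

# Hypothetical user-by-item ratings matrix (0 = not rated), purely illustrative.
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
], dtype=float)

# Cosine similarity between item columns: data gathered to record purchases
# is repurposed to learn which items behave alike.
norms = np.linalg.norm(R, axis=0)
sim = (R.T @ R) / np.outer(norms, norms)

# Predict user 0's rating for item 2 as a similarity-weighted average
# of the items that user has already rated.
user, target = 0, 2
rated = R[user] > 0
pred = sim[target, rated] @ R[user, rated] / sim[target, rated].sum()
print(f"Predicted rating for user {user}, item {target}: {pred:.2f}")
```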