Living in Data

04 Apr 2022

Book	Living in Data: A Citizen’s Guide to a Better Information Future
Author	Jer Thorp
Published	May 4, 2021

Data: a new era for the computer

“Data” has always been a restless word. It first appeared in the English language on loan from Latin, where it meant “a thing given, a gift delivered or sent.” It spent its early years in the shared custody of theology and mathematics. The clergyman Thomas Tuke wrote this in 1614 about the difference between mystery and sacrament: “Every Sacrament is a Mysterie, but every Mysterie is not a Sacrament. Sacraments are not Nata, but Data, Not Naturall but by Divine appointment.” Here “data” holds its Latin meaning as something given, but because its giver is the almighty God, it carries with it a particular strength of truth. In 1645, the Scottish polymath Thomas Urquhart wrote The Trissotetras; or, A Most Exquisite Table for Resolving All Manner of Triangles. In it, he defined “data” as “the parts of the triangle which are given to us.” By 1704, data had found a hold in mathematics beyond geometry. Another clergyman, John Harris, defined “data” in his Lexicon Technicum as follows: “such things or quantities as are supposed to be given or known, in order to find out thereby other things or quantities which are unknown.” Data as givens, things we already know, truths like gravity and pi and the Holy Ghost.

The linguistic neighbors of “data” remained, for a century or two, consistent. “Math,” “numbers,” “quantities,” “evidence,” “unknowns.” Some new words arrived as mathematicians and philosophers worked to order their universe: “qualitative,” “quantitative,” “ordinal,” “cardinal,” “ratio.” At the turn of the twentieth century, with Galton and Pearson and the birth of modern statistics, came a new way for data to be thought of, and a new way for it to live: as the contents of a table. Fifty years after that, data became bound to one of its stalwart allies, a word that would change the way in which data is commonly understood: “computer.” Between 1970 and the end of the millennium, data changed quickly: from a thing of God and mathematics to a collection of bits and bytes. The word still adhered firmly to concepts of truth, but it was a different kind of veracity, one stamped into thin layers of silicon.

Measurements occur only in a moment

Data about anything (a sentence, a bird, the temperature of a room, the age of the universe, the sentiment of a tweet, the flow of a river) is an artifact of one fleeting moment of measurement and is, as Drucker’s concept of capta gets at, as much a record of the human doing the measuring as it is of the thing that is being measured.

Data is the end product of a messy process/system

“Data”, it seems, is being pulled by strong currents. One eddy seems to be drawing it toward a dystopic future, an inevitable payback for a decade of unsavory practices. The other, seen through a hopeful lens, might bring data to a more utopian place. One in which it is bound tightly to our lives, to the places where we work and play, to the friends with whom we share the experience of being. A future where “data” is in fact closer to “art” and “dirt” and “laughter.” Also to “community” and “empowerment” and “equality.”

Is it possible, then, that we might give it a push?

To do this, we need to unfold and examine the act of data. We need to understand how data is created, how it is computed upon, and how it is communicated. How it is made, changed, and told. We need to recognize that each of these steps cannot be looked at in isolation, that to know data we need to be able to look at the end-to-end process of it, to view data not as a noun, or a verb, or a thing but as a system and a process.

Much of the narrative of data has leaned on the idea of discovery. After all, as marketing seminars and TED talks have told us again and again, we collect data to climb a hierarchy of knowing, up to information, then to knowledge, and finally to the sharp peak of wisdom. In our ascent, we are filling in the blanks, shading in areas that are unknown (or at least unknown to us). Using a more critical eye, though, we start to see that the goal of the climb isn’t so much to find wisdom as to stake a claim: a claim to customers and insights and ad dollars, but also to territories, to names, to ways of knowing. As Kate Crawford underscores in an essay, today’s complex AI systems require “a vast planetary network, fueled by the extraction of non-renewable materials, labor, and data.”

We are told that the seemingly light and transitive computational processes of data systems happen in “the cloud.” Crawford shows us that they in fact are anchored in real places, are reliant on real people. Lithium crystals forming on the salt flats of Bolivia, labyrinthine assembly lines running 24/7 in Chinese factories, Congolese mines staffed by children as young as seven, cargo ships emitting ton after ton of CO2, Amazon workers struggling to survive in workplaces that have become surveilled to a point of absurdity. There is, as Crawford writes, “a complex structure of supply chains within supply chains, a zooming fractal of tens of thousands of suppliers, millions of kilometers of shipped materials and hundreds of thousands of workers included within the process even before the product is assembled on the line.”

A law of data: data collection has an asymmetric impact on everything downstream

Put into its most harmless verbiage, data is collected. The word “collect” appears seventeen times in Facebook’s data policy: We collect the content, communications, and other information you provide. We collect information about the people, pages, accounts, hashtags, and groups you are connected to. We collect, we also collect, we collect, we also collect, we collect, we collect, we use information collected. We require that each of our partners has the lawful right to collect. We collect. You might imagine a group of lederhosened foragers in an alpine meadow, plucking data gently from a bush. The tenderness of the word might be part of why we hardly consider collection when discussing potential harms of data. And yet the decision to collect some set of data, or not to collect another, is the biggest one to make, and how data is collected and stored deeply affects the ways in which it might later be used to make choices, to tell stories, or to act on individuals and groups. Most important, each decision made at the moment of collection is amplified as data are computed upon, inflated by algorithms, and distilled by visualization.

“Collect” is an asymmetric word. The experience of the collector is very different from that of the collected from; the benefits and risks are piled unevenly on one side. Much of the unease in our data lives comes from this lopsidedness, from being on the low end of collection’s fulcrum. Some balance can be found by blocking the collectors, through shielding and camouflage and obfuscation. By installing ad blockers, turning off our phones, encrypting our email, quitting Facebook. With these tactics we lessen the burden of being collection’s objects, but to reset the scales, we also need to learn how to be its subjects, to collect data for our own benefit and for the benefit of others.

Links between data (“interdata”) are incredibly important because they allow us to paint a bigger, more comprehensive picture

If data about data is called metadata, data between data might be called interdata. Interdata are records or measurements that act as bridges between two data sets. Among the complicated bureaucratic data systems of the U.S. government, a formal piece of interdata is your Social Security number. On the internet the most oft-used interdata are your email address and your machine’s IP address. Both of these pieces of information give the data-ers a way to find you in more than one data set, thus being able to know how your Facebook posts intermingle with your Instagrams, how your dating profile might predict what will go into your Amazon shopping cart, how your mobile calling records might connect to your voting record.
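To make the idea of interdata concrete, here is a minimal Python sketch. The two data sets, their field names, and the `link_on` helper are all hypothetical and invented for illustration; the only point is that a shared email address is enough to stitch otherwise unrelated records together.

```python
# Hypothetical records; none of these names or values come from a real system.
social_posts = [
    {"email": "ada@example.com", "post": "Loving the new trail shoes"},
    {"email": "bob@example.com", "post": "Rainy day, staying in"},
]
shopping_carts = [
    {"email": "ada@example.com", "cart": ["trail shoes", "water bottle"]},
    {"email": "cy@example.com", "cart": ["umbrella"]},
]


def link_on(key, left, right):
    """Join two lists of records wherever they share the same value for `key`."""
    # Index the right-hand records by the shared key (the interdata).
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)

    # For every left-hand record, merge in each right-hand record with the same key.
    linked = []
    for row in left:
        for match in index.get(row[key], []):
            linked.append({**row, **match})
    return linked


# Only people who appear in both data sets end up with a combined profile.
profiles = link_on("email", social_posts, shopping_carts)
for profile in profiles:
    print(profile)
```

Swapping `"email"` for any other shared field (an IP address, an ID number) would work the same way, which is why each new kind of interdata opens up new connections.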

Much of the “innovation” in the surveillance capitalism era has been around finding new kinds of interdata. This is the central promise of facial recognition: that the physiognomy of your face might give a ubiquitous, publicly recordable data bridge that will bring targeted marketing tactics into the real world. The data signature of your face becomes a kind of Social Security number readable by anyone, anywhere, used not just to confirm your identity at the bank but to follow you as you move from the gym to the grocery store, to the public health clinic, to the political rally, to pick up your child from day care.

You can produce vast amounts of data from just about anything, and data multiplies on itself. So the choices of “what to measure and how” are very important

Remember that the amount of data that can be conjured from any given thing is almost limitless. Pick up a plain gray rock from the side of the road, and play the same data-ing game we did with the bird in chapter 2. Very quickly you’ll have assembled a set of descriptors and values: size, weight, color, texture, shape, material. If you take that rock to a laboratory, these data can be made greatly more precise, and instrumentation beyond our own human sensorium can add to the list of records: temperature, chemical composition, carbon date. From there comes a kind of fractal unfolding of information that begins to occur, where each of these records in turn manifests its own data. The time at which the measurement was made, the instrument used to record it, the person who performed the task, the place where the analysis was performed. In turn, each of these new metadata records can carry its own data: the age of the person who performed the task, the model of the instrument, the temperature of the room. Data begets data, which begets metadata, repeat, repeat, repeat. It’s data all the way down.

This infinity mirror of data and metadata can be exhausting for people who are trying to decide exactly what to record about a thing.
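A rough way to see how fast this unfolding grows is to count records. The sketch below is only a back-of-the-envelope model: it assumes (hypothetically) that every record, at every level, picks up the same number of metadata records, which real data sets will not do so tidily. The function name `count_records` is invented for this example.

```python
def count_records(measurements, metadata_per_record, depth):
    """Count records when each record spawns its own metadata, `depth` levels down.

    Assumption for illustration: every record at every level gets exactly
    `metadata_per_record` metadata records of its own.
    """
    total = 0
    layer = measurements  # records at the current level
    for _ in range(depth + 1):
        total += layer
        layer *= metadata_per_record  # each record begets its own metadata
    return total


# Six descriptors for the rock (size, weight, color, texture, shape, material),
# each annotated with four metadata records (time, instrument, person, place).
for depth in range(4):
    print(depth, count_records(6, 4, depth))
```

Even with these small numbers the total grows geometrically with each level, which is why deciding where to stop recording is a real design choice.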

“Data genesis”: the creation story of a data point

Somehow all of these conversations kept coming back to this moment of data genesis, when a book or a map or an audio recording produces a catalog record, for it is in this moment when the object’s life as a thing that might be found is largely defined. If the cataloger has a bit of time, and if they are thorough, our object might get a record that will serve it well for future searches. An exact date, an accurate place, an extra line of descriptive text: all of these things privilege the object greatly in the context of being found, they increase its chances of being included in research and in the stories that get told about the past. Conversely, many objects are data-fied with a spareness that all but guarantees them a life on the last page of search results.

“Schematic” bias in data: how data is organized and/or labelled influences people’s view of it and how it’s used

We learned in the last chapters that data can bestow privilege and that its absence can push a thing out toward the margins. Here we see that the processes in which a thing is data-fied and the constraints of the structures made to hold that information can also have a profound effect on how that thing can participate in the databases and search engines, newspaper articles and court hearings, in the record of history. This effect, where data is trimmed or transfigured to match the expectations of the machine, can be called schematic bias. For those involved in building data systems, a schema is a kind of blueprint, a map of which types of information will be stored, in what form, and which types of information will be rejected. In cognitive science, a schema is a pattern of thought, a framework of preconceived ideas that directs how a person sees the world: if you observe something that fits neatly into your schema, it gets filed easily and efficiently into your memory. On the other hand, schema-foreign things will often not be noticed or remembered, or they will be modified to fit into what you expect based on the frameworks you’ve constructed. The many machines that order our data lives are working the same way, paying more attention to the things that fit neatly into their schema and ignoring things that don’t, or changing them to fit.
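As a small illustration of a schema trimming and transfiguring a record, here is a Python sketch. The schema (a 30-character title, a four-digit year, a place from a fixed list), the `fit_to_schema` helper, and the sample record are all hypothetical, made up for this example rather than taken from the book or any real catalog system.

```python
ALLOWED_PLACES = {"New York", "Boston"}  # hypothetical controlled vocabulary
SCHEMA_FIELDS = ("title", "date", "place")


def fit_to_schema(record):
    """Force a free-form record into a fixed schema and report what was lost."""
    lost = []

    # Titles are cut to 30 characters, whatever they said originally.
    fitted = {"title": str(record.get("title", ""))[:30]}

    # The schema only understands four-digit years; anything else becomes None.
    date = str(record.get("date", ""))
    if date.isdigit() and len(date) == 4:
        fitted["year"] = int(date)
    else:
        fitted["year"] = None
        if date:
            lost.append("date")

    # Places outside the controlled vocabulary are flattened to "Unknown".
    place = record.get("place")
    if place in ALLOWED_PLACES:
        fitted["place"] = place
    else:
        fitted["place"] = "Unknown"
        if place is not None:
            lost.append("place")

    # Fields the schema has no slot for are silently dropped.
    for field in record:
        if field not in SCHEMA_FIELDS:
            lost.append(field)

    return fitted, lost


original = {
    "title": "Map of the harbor",
    "date": "circa 1890",
    "place": "Lenapehoking",
    "notes": "hand-drawn, annotated in two languages",
}
fitted, lost = fit_to_schema(original)
print(fitted)
print("lost or altered:", lost)
```

Nothing in the stored record signals that the date, place, and notes ever existed in a richer form, so anyone searching the database later sees only what fit.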

It’s particularly important to understand how schematic biases are amplified. How a decision made by a developer in a newsroom affects how a data point is stored, how a visualization is made, how a story is told, how a public understands. The structures built to store data affect how things are found and lost, how histories are written and who is included in them. Algorithms, with their expansionary tendencies, can loop these omissions upon themselves until they become wide sinkholes, affecting the ways in which people live (and lose) their data lives.

Neural networks (and machine learning models, in general) take data and produce an algorithm that can predict outputs from inputs according to statistical rules. This contrasts with traditional CS where the algorithm had to be designed to produce outputs from inputs according to deterministic rules.

Even now, few have fully realized what this paradigm shift means.

There’s an important difference between the way neural networks work and the way a standard computer program does. With a run-of-the-mill program like a decision tree, we push a set of data and a list of rules into our code-based machine, and out comes an answer. With neural networks, we push in a set of data and answers, and out comes a rule. Where we were once drafting our own rules for what is and what isn’t a bird, or which prisoners may or may not re-offend, the computer now constructs those rules itself, reverse engineering them from whatever training sets it is given to consume.
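The contrast can be sketched in a few lines of Python. This is not a neural network: to keep the example tiny it swaps in a one-number threshold learner (a decision stump), and the wingspan feature, the training values, and the function names are all hypothetical. What it shares with the passage is the direction of flow: data plus answers go in, and a rule comes out.

```python
def hand_written_rule(wingspan_cm):
    """Traditional program: a person chooses the rule (cutoff picked by hand)."""
    return wingspan_cm > 10


def learn_threshold(examples):
    """Given (value, answer) pairs, pick the cutoff that misclassifies the fewest."""
    values = sorted(value for value, _ in examples)
    # Candidate cutoffs are the midpoints between neighbouring values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(cutoff):
        return sum((value > cutoff) != answer for value, answer in examples)

    return min(candidates, key=errors)


# Hypothetical training set: (wingspan in cm, is it a bird?)
training = [(2, False), (4, False), (5, False), (20, True), (25, True), (31, True)]

cutoff = learn_threshold(training)


def learned_rule(wingspan_cm):
    """Learned program: the cutoff was derived from the training set."""
    return wingspan_cm > cutoff


print("learned cutoff:", cutoff)
print("hand-written says 18 is a bird:", hand_written_rule(18))
print("learned rule says 18 is a bird:", learned_rule(18))
```

Change the training set and the learned cutoff moves with it, without anyone editing the rule, which is exactly why the contents of the training data matter so much.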

Algorithms can magnify schematic bias

Algorithms can, in themselves, be biased. They can be coded to weight certain values over others, to reject conditions their authors have defined, to adhere to specific ideas of failure and success. But more often, and perhaps more dangerously, they act as magnifiers, metastasizing existing schematic biases and further darkening the empty spaces of omission. These effects move forward as the spit-out products of algorithms are passed into visualizations and company reports, or as they’re used as inputs for other computational processes, each with its own particular amplifications and specific harms.

Data visualization is story telling

I’m starting this chapter with crayons and birthday cake to underline one of the most important things I’ve come to understand about visualization as a form of telling: that it is a simple thing. Any burdens we’ve placed on it (requirements for objectivity or truthfulness) come more from our own politics than from some innate character of the act itself. These simple examples also help to illustrate that data representation of any kind is a human act, full of human choices. As we’ve seen, the processes of making data and changing it with computers are rampant with decision points, each of which can greatly increase or greatly limit the ways in which our data systems function. When we reach the showing stage, where we decide how our data might tell its story to humans, possibility space goes critical. Each time a data designer picks a chart type or a color palette or a line weight or an axis label, they’re trimming the prospects for communication. Even before that, the choice of a medium for representation has already had a predestinatory effect. A web page, a gatefold print, a bronze parapet, a birthday cake: each of these media is embedded with its own special opportunities and its own unavoidable constraints.

Data visualization is “knowledge compression”

Search for a definition of “data visualization,” online and in books, glue the pieces together, and you’ll end up with a chimera. Data visualization is a process. It is a set of technologies. It is the use of a graphical representation. It is a situation. It is a practice. It is visual effects for communication. It is mapping, it is displaying, it is transforming and translating. It enables communication and supports decision-making. It amplifies human cognition. It is looking at the world from a data point of view. It is a form of knowledge compression, an emerging market space, a software feature, and a discipline.

Visualization tools sometimes limit exploration. In those cases, it can make sense to push the boundaries of what’s been done and come up with a new type of visual.

Visualization favors cleanly drawn lines and neatly printed dots. It wants the edge of bars and borders on maps to be cut with a sharp blade. Some of this neatness is a result of the tool, the computer’s cookie-cutter logic exposed again as things are taken out of a database and put onto the screen or the page. Some is a result of visualization’s fundamental distaste for uncertainty and ambiguity. But much of the cutting away is deliberate, done to fit the graphic to the shape that it holds in the designer’s mind.

One way to avoid the onus of omission is to give the people who are reading the visualization some control as to what they see and from which angles. While static visualizations hand viewers a carefully framed postcard of the data, exploratory visualizations give users a vehicle, where they are (to some extent) free to roam the full terrain of the data, snapping photos as they go. Almost all of my visualization work has taken the form of exploratory tools. Even in the case where the result is a static image (like the PopSci piece), I build my own vehicles, to make it easy for me to range widely across a data set’s terrain. More often, I let others drive.

Visualization isn’t just about “the morality of truth”, it’s also an art form

Reading descriptions from its canonical texts, you might be forgiven for assuming that data visualization describes a moral outlook as well as an act. In the opening chapter of The Visual Display of Quantitative Information, Tufte, the grand old man of data viz, lists nine things that graphical displays of data should do, among them a list of acceptable purposes for data viz: description, exploration, tabulation, or (we get the sense grudgingly) decoration. Stephen Few invokes the Buddha on page 9 of Show Me the Numbers, suggesting that practicing the good, conservative kind of data viz is to find a “right livelihood.” Thou shalt not “entertain,” he writes, nor “indulge in self-expression.” “We must,” he tells us (emphasis my own), “lead readers on a journey of discovery, making sure that what’s important is clearly seen and understood.” In the introduction to The Truthful Art, Alberto Cairo writes that the purpose of data visualization is to “enlighten people not to entertain them.” Quite the responsibility has been placed in this simple thing of turning numbers into dots, shapes, and colors. There are echoes of Clive Humby’s data-as-oil idea here, too. It has to be changed into gas. It must be broken down. It must be analyzed to have value.

To be a practitioner of a “truthful art” is, after all, to name yourself as a truth teller, to self-ascribe authority to your own perspectives and ways of knowing. To follow Tufte and Few is to walk the path of an ascetic, signposted by objectivity and reductionism. In a quest to avoid the daunting specter of bias, data visualization practitioners too often adhere rigidly to best practice, scrubbing and scraping at the excesses of “decoration” until, they hope, there’s nothing but the clean white bone of truth. The result of all this is that there’s a kind of meal-replacement logic at work: a conviction that a story might be blended down into a neat, easily consumed slurry, with all the essential vitamins and absent the pesky nuance. That none of us should miss the crisp snap of an apple’s skin.
