Being Data-Pilled

Andrej Karpathy was pretty data-pilled in his podcast with Dharmesh last year. He seemed bearish on AI’s current state and attributed that to the poor quality of the internet as a dataset.

This reminded me of the “data-centric AI” trend from a few years ago. During the deep learning boom, everyone obsessed over model architectures. Then people realized cleaning data could beat using fancy new models.

This is probably why AI coding has taken off the way it has. It’s not just that there’s tons of data; code is also verifiable: it either runs or it doesn’t. So its quality can be checked and improved. Compare that to the corpus of text from the internet, where quality can’t be verified and is probably poor in many cases.
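To make “verifiable” concrete, here’s a minimal sketch, assuming a toy setup where a candidate snippet arrives as a string and “correct” means matching a few expected outputs. The names (`passes_checks`, the sample `add` snippets) are hypothetical, and anything running untrusted code for real would need a sandbox rather than a bare `exec`.

```python
def passes_checks(source: str, func_name: str, cases: list) -> bool:
    """Return True if `source` defines `func_name` and it matches every expected output."""
    namespace: dict = {}
    try:
        # Run the candidate code. Illustration only: untrusted code needs a sandbox.
        exec(source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        # A syntax error, a missing function, or a runtime error all count as failure.
        return False


# Three hypothetical candidates: correct, wrong logic, and code that does not parse.
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
broken = "def add(a, b)\n    return a + b\n"

cases = [((1, 2), 3), ((0, 0), 0)]  # (arguments, expected output)

results = {
    name: passes_checks(src, "add", cases)
    for name, src in [("good", good), ("bad", bad), ("broken", broken)]
}
print(results)
```

Only the first candidate passes. The point is that each sample gets an automatic pass/fail label, which is exactly what most internet text lacks.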

Clearly, data plays a critical role upstream of AI, even in this latest wave, but in what ways is it different from the deep learning era?

AI and the Data Value Chain

The classic data value chain follows a clear progression:

Data → Insight → Action → Value

Traditional analysis/ analytics required humans to process and analyze data to uncover insights that could inform actions that create value. For example, an analysis might predict demand for 100 widgets next quarter, leading to a decision to produce ~100 widgets instead of 150. The prediction was the insight, the action was adjusting production, and the value was the cost savings from avoiding surplus.
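As a back-of-the-envelope sketch of the widget example: the 100 and 150 come from the text above, but the $20 unit cost is purely an assumed number for illustration, and `surplus_cost_avoided` is a hypothetical helper.

```python
def surplus_cost_avoided(planned_units: int, forecast_units: int, unit_cost: float) -> float:
    """Cost of the units that would have been produced beyond forecast demand."""
    surplus_units = max(planned_units - forecast_units, 0)
    return surplus_units * unit_cost


# Insight: demand forecast of 100 widgets. Action: produce 100 instead of 150.
# ASSUMPTION: a unit cost of $20 per widget, chosen only to make the arithmetic concrete.
value = surplus_cost_avoided(planned_units=150, forecast_units=100, unit_cost=20.0)
print(value)  # 1000.0
```

Under that assumed cost, the 50 widgets not produced translate to $1,000 of avoided surplus, which is the “value” at the end of the chain.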

Machine learning models moved a step closer to the “action” step by integrating into real-time systems. Recommendation engines were trained to show users “People who bought this also bought…”, directly driving additional purchases (action). ML models are machines that take in data and produce insights: they’re effectively insight machines.

The new state-of-the-art AI models go further: not only can they generate insights, they can also take action. Because LLMs are trained on the internet, which is essentially a digital representation of human reality, they can both predict the right actions and execute them in digital environments. They’re not just insight machines; they’re insight + action machines.

Consider the progression:

Analysis/ Analytics: Human or system collects data → human processes data/ generates insight → human takes action

ML Models: System collects data → system processes data/ generates insight → human takes action

LLMs/ AI Agents: System collects data → system processes data/ generates insight → system takes action

The latest era of AI effectively automates most of the data value chain (AI = Insight + Action) and is the next evolution of automating the whole process.

AI and The Space of Possible Insights/ Actions

The other difference between this wave of AI and previous ones is the immense breadth of insights these models can generate.

Again, because the internet is effectively a digital representation of human reality, these latest AI models have knowledge and can generate insights across basically all domains of human knowledge. Which means the breadth of their actions can also span all of human affairs.

The one caveat: AI is currently limited to taking action in the digital world, which means it is only effective in domains that have undergone digital transformation. Sure, software continues to eat the world, but there are many domains that haven’t yet been fully digitized.

Consider this progression:

Analysis/ Analytics: Human or system collects specific dataset → human processes data/ generates specific insight → human takes specific action

ML Models: System collects specific dataset → human trains model on specific dataset → system processes data/ generates specific insight → human takes specific action

LLMs/ AI Agents: System collects general dataset → human trains model on general dataset → system processes data/ generates general insights → system takes general actions

Central Dogma of AI

I’m becoming increasingly convinced this is the way to think about AI, and I’m calling it the Central Dogma of AI (analogous to biology’s central dogma: DNA → RNA → Protein):

Data → Analysis/Analytics → AI

Here are some of the parallels (although obviously not a perfect analogy):

  • Data only becomes useful through transformation, just like genes. It has to be compressed or transformed into insights and actions that create value.

    This has always been true. However, companies are now leaving much more value on the table by not leveraging their data, particularly unstructured data, which is now much easier to transform and operationalize using AI.

  • Poor-quality data still leads to poorly performing AI. Just like a pathogenic variant can lead to disease, incorrect data can lead to incorrect analysis, insights and actions. The same is true for AI: bad data in the training set will lead to bad insights and bad actions.

    This is also true for context and data provided at inference time. This is why context engineering has become a thing. Again, collecting and leveraging the right data is critical to making AI as valuable as possible.

  • Taking this a step further, it’s been said that any task that can be measured and verified can be optimized and solved by AI. That means companies need to collect as much verifiably high-quality data as possible so they can build agents to automate those workflows. Digitization and robust data collection are therefore even more important, because AI has sharply increased the opportunity cost of not doing those things.

  • Analysis/ analytics bridge the gap between data and AI (like RNA bridges DNA to proteins). In traditional AI/ ML model development, you need to validate models on test datasets to assess how well they perform. Deeper error analysis can uncover poor-quality data or other ways to improve the model. When deploying models, you also need to measure and track their performance in production.

    This latest wave of AI is no different. To develop effective AI-driven workflows or agents, you need to measure the performance of the output through formal evaluations (“evals” for short). Evaluations help improve an AI workflow/ agent’s performance by iterating on data/ context, prompts, etc. When deploying to production, you also need to instrument and measure the AI agent’s outputs to assess effectiveness on real-life work.

  • Similar to proteins, AIs are machines that can do work and take action. Analysis/ analytics can only give insight, not action. AI can do both. AIs are workhorses of data, just like proteins are workhorses of the cell.

    However, their actions are limited to digitized systems. The less digitized the domain, the less effective AI will be.
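The evals mentioned in the list above can be sketched in a few lines. This is a minimal illustration, not how any particular eval framework works: `run_eval` and `toy_agent` are hypothetical names, the “agent” is a lookup table standing in for a real model call, and exact-match scoring is the simplest possible grader.

```python
from typing import Callable


def run_eval(agent: Callable[[str], str], cases: list) -> dict:
    """Score an agent on (prompt, expected) pairs using case-insensitive exact match."""
    if not cases:
        raise ValueError("need at least one eval case")
    failures = []
    for prompt, expected in cases:
        output = agent(prompt)
        if output.strip().lower() != expected.strip().lower():
            # Keep failing cases around: error analysis starts from these.
            failures.append({"prompt": prompt, "expected": expected, "got": output})
    passed = len(cases) - len(failures)
    return {"pass_rate": passed / len(cases), "failures": failures}


def toy_agent(prompt: str) -> str:
    # ASSUMPTION: a hard-coded stand-in for a real model or agent call.
    answers = {"2 + 2": "4", "capital of France": "Paris"}
    return answers.get(prompt, "unknown")


cases = [("2 + 2", "4"), ("capital of France", "paris"), ("capital of Peru", "Lima")]
report = run_eval(toy_agent, cases)
print(round(report["pass_rate"], 2), report["failures"])
```

The pass rate is the headline metric, but the list of failures is the more useful output: it points at the missing data or context to fix before the next iteration.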

Implications/ Predictions

Ok, this is all good in theory, but can we use this framework to make predictions?

  • Digital transformation is a prerequisite to AI transformation. To collect the right data and automate the right workflows, those processes must already be in the digital world. AI can only provide value when it has data and can take action.

    A corollary to this: domains that are the most digital will be the first to be transformed by AI. Software engineering is a good example.

  • Data-rich domains will also be the first to be transformed by AI, particularly domains with lots of unstructured data that was previously harder to operationalize.

    This is why there’s so much excitement about AI for healthcare and bio: both are dominated by unstructured data and workflows, which have been difficult for traditional/ deterministic software solutions.

  • Task verification and data collection/ quality will become even more important. The data-centric AI trend will continue in this next wave of AI, and it will become even more sophisticated. You can see this in the rapid success of new companies for data labeling and reinforcement learning environments. Both are aimed at addressing this.

  • AI will replace the need for analysis/ analytics in almost all cases. The biggest pain points of any analysis/ analytics effort are: (a) they usually raise as many questions as they answer and (b) you still need to take action. AI is an antidote to both of these. You can generate a large space of insights, answer any follow-ups right there in the chat and then take action.

    For this reason, AI agents will likely overtake all analysis/ analytics solutions. Customers/ users will come to expect it.

Final Takeaway

The most important takeaway?

Fundamentally, AI isn’t very different from the digital and data revolutions that have been happening for the past ~50 years. In essence, it’s still about turning data into action and real value.

Orgs that treat AI as completely separate from existing digital/ data capabilities will struggle. Those that see it as the next step in their digital and analytical maturity will succeed.