5 Founders on the Future of Data – Andreessen Horowitz

It's well understood that artificial intelligence is advancing like mad right now. What's less appreciated is the role that data and infrastructure continue to play in these advances, whether that's adding new data sources to train better models, building the data infrastructure to support AI workloads, or taking advantage of more powerful hardware to do all sorts of new things. And, of course, lost in all the excitement around AI is the fact that good, old-fashioned data analysis is still a major enterprise workload and continues to see its own fair share of innovation.

We recently held our Data and AI Forum in New York City, featuring talks from a collection of our founders and other leaders in the space about where the world of data is heading. Here are some highlights (edited for readability) from the founders building products across the spectrum of use cases.

Our most fulfilled, amazing days as humans are the days that we are spending doing creative and interesting work and not doing the tedious drudgery stuff. And I think AI is here to help us achieve that state of fulfillment.

I've been working in data, data science, and data analytics my whole career. I am now the founder and CEO of a company that builds a data science and analytics tool, and our product is used by thousands of data practitioners every day. And we see them do some really creative, interesting stuff.

I think data practitioners are creatives. I know it's not the first thing that comes to mind (when I say "creatives," I think of artists or whatever), but think about what data scientists do in their day. They're asking questions, they're forming hypotheses, they're testing new things, they're building narratives, they're taking risks, they're telling stories. This is good data science; it's good data analytics. And it's what we expect from our data teams. It's an art and a science and a great use of human time.

But data work can also be really tedious. You spend a lot of time writing boilerplate and fixing dependencies and tracking down missing parentheses in a query. It can be more plumbing than science sometimes. This is where I think people wind up spending a lot of their time, and it really is a blocker to them doing their best work. So this really feels like a perfect opportunity to bring human-computer symbiosis into this creative profession.

Now, most people, when they think of this, assume it means just replacing data teams with a "magic insights" text box. Like, the next step is we'll all buy solutions where our stakeholders or executives will come in, write a question, and get a magic response back. You know: properly formatted charts and well-reasoned explanations and full business context. But that doesn't really work.

And it doesn't work, one, because these models aren't perfect. They can hallucinate, they're missing a lot of context, they don't understand the full situation of things. But it also doesn't work because humans want to be able to hear a story, and understand, and ask and answer questions of a human around these things.

Even though AI has this power to enable us to get more value out of our content, it's really challenging to do that. There's no such thing as a free lunch. And I think that there are four main challenges that prevent organizations and businesses from really being able to unlock this data right now.

The first one is scale. When we think about unstructured text and visual data, it's orders of magnitude larger than today's big data. So to put that into perspective: If we were to think about tabular data, say we had 10 million rows of tabular data, that's around 40 megabytes. And to put that into perspective, we can think of that as being like [the surface] area of Lake Tahoe in California, which is around 496 square kilometers.

If we were to think about 10 million text documents, we go from 40 megabytes to 40 gigabytes. And now we have something that's more on the scale of the Caspian Sea: 371,000 square kilometers of space. It's three orders of magnitude more data, in terms of volume, than when we think about tabular data.

And then when we think about visual data, if we had 10 million images, that would be 20 terabytes of data. That's roughly another three orders of magnitude bigger. That's like the Pacific Ocean in terms of the sheer scale of data volume. . . .
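The arithmetic behind these comparisons can be checked with a short sketch. It assumes decimal units (1 MB = 10^6 bytes) and uses only the totals quoted above; the helper names are illustrative, not from the talk. Under those assumptions, the step from tabular rows to text documents is exactly 1,000x, while the step from documents to images is 500x, a little under three orders of magnitude.

```python
import math

ITEMS = 10_000_000  # 10 million items in each scenario, as quoted above

# Total sizes quoted in the talk, in bytes (decimal units are an assumption).
TOTALS = {
    "tabular rows": 40 * 10**6,     # 40 MB
    "text documents": 40 * 10**9,   # 40 GB
    "images": 20 * 10**12,          # 20 TB
}


def per_item_bytes(total_bytes, items=ITEMS):
    """Average size of a single item implied by the quoted total."""
    return total_bytes / items


def orders_of_magnitude(smaller, larger):
    """Base-10 logarithm of the ratio between two sizes."""
    return math.log10(larger / smaller)


for name, total in TOTALS.items():
    print(f"{name}: about {per_item_bytes(total):,.0f} bytes per item")

rows_to_docs = orders_of_magnitude(TOTALS["tabular rows"], TOTALS["text documents"])
docs_to_images = orders_of_magnitude(TOTALS["text documents"], TOTALS["images"])
print(f"rows -> documents: {rows_to_docs:.2f} orders of magnitude")
print(f"documents -> images: {docs_to_images:.2f} orders of magnitude")
```

The implied per-item averages (a few bytes per row, a few kilobytes per document, a couple of megabytes per image) are back-of-the-envelope figures, not measurements.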

Right now, when we think about big data or data lakes, we have these tools and vehicles that can process that efficiently. But that's kind of like having a rowboat or a canoe: It'll get you across a lake, but I wouldn't trust that if you're trying to cross the Pacific Ocean.

In order to actually be able to unlock the value from this richer, more contextual data that we get with content, we actually need to create tools and infrastructure to process that. It's probably going to be a similar shape, in the way a seagoing boat looks somewhat similar to a rowboat, but the scale and the processing of it will have to be completely different. We'll need to prepare ourselves for the sheer volume and scale that we're thinking about when we move from a tabular view of the world to more of a content view of the world.

I think one really interesting thing that's happening, and is changing the way systems need to be architected, is that what is considered big data is actually increasing. When Google came out with the MapReduce paper in 2004, there were a lot of workloads that you had to spread across multiple machines because machines were pretty small. Like, the first AWS instances had a gigabyte of RAM and one CPU.

Now, [you can rent AWS instances with hundreds of processors and terabytes of RAM]. There are very few workloads that won't fit into that amount of hardware. . . .

I think there's a bunch of things that have to be true in order for you to really need big data systems: You've got lots of data. You need it all. You do need it all at once. The amount you're using doesn't fit on a machine. You can't get rid of that data and you can't summarize it. OK, then you need some fancy scale-out system.
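One way to read that checklist is as a conjunction: every condition has to hold before a scale-out system is warranted. The sketch below is a hypothetical illustration of that reading, not a tool from the talk; the `Workload` fields, the `needs_scale_out` name, and the 1 TB single-machine memory threshold are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a data workload (fields are assumptions)."""
    working_set_gb: float          # data actually touched by the workload
    needs_all_data: bool           # every record matters, not just a sample
    needs_all_at_once: bool        # cannot be processed in separate pieces
    can_summarize_or_prune: bool   # old data can be deleted or rolled up


def needs_scale_out(w: Workload, machine_ram_gb: float = 1024.0) -> bool:
    """Return True only if every condition from the checklist holds.

    machine_ram_gb is an assumed stand-in for the largest single machine
    available to you.
    """
    return (
        w.needs_all_data
        and w.needs_all_at_once
        and not w.can_summarize_or_prune
        and w.working_set_gb > machine_ram_gb
    )


# A large dataset that can be rolled up does not force a distributed system.
archive = Workload(5000.0, True, True, can_summarize_or_prune=True)
# A large dataset that must be kept whole and read at once does.
full_scan = Workload(5000.0, True, True, can_summarize_or_prune=False)

print(needs_scale_out(archive))
print(needs_scale_out(full_scan))
```

If any single condition fails, for example the working set fits in memory, the function returns False, which matches the argument that a single large machine is often enough.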

So what does the world look like if data size isn't the primary driver of your architecture? What are some things that you can do about it? One is: Don't be afraid to scale up. Scale-up became a dirty word, I guess, once Google published the MapReduce paper. Everybody's building these large-scale distributed systems but, actually, scale-up works really well if you clean up after your data. Just good data hygiene can get you pretty far.

Another interesting one: If you have smaller data, you can push some of that out to the user. When we built BigQuery, one of the things we said was that, with large data, you want to move the compute to the data rather than the data to the compute. Laptops used to be synonymous with underpowered, but, these days, M2 MacBooks are basically supercomputers. If you have smaller data sets, why not push the workloads out to them? . . . It's a lot less expensive to do locally than it is to do in the cloud.

There's this Cambrian explosion of new data sources and new applications every single year . . . And what that creates, of course, is data silos. You now have your most valuable data in a variety of different database systems, and it creates a lot of vendor lock-in because many of these systems are proprietary in nature, which means you can only access that data through that particular system.

So, this notion of centralizing your data: that model is much slower than it looks because you have to move all of the data out of all these different systems and get it into one place before you can do analysis. It limits your view to what is in that enterprise data warehouse, which is never the complete truth about your business. You always have data in other places. And take it from me, having spent time at Teradata: Not one of their customers had all of their data in Teradata. It's just not possible.

And, of course, there's proprietary lock-in, and it can become very expensive. And that was really the challenge for many of these early databases: Oracle, Teradata, IBM DB2. They're not bad databases by any stretch of the imagination. Even today, I would argue Teradata is a better database than Snowflake. But the market is moving away from them, and that's because they're incredibly expensive and customers feel locked in.

So, [the idea that you need to centralize] your data: not true, and also impossible. The truth is you need to optimize for decentralized data.

Most of the AI systems that are being trained today are trained on these public datasets, mostly data crawled from the web. And I think there's actually still a decent amount of public data available. Even if we're reaching the limits, say, of text, there are other modalities that folks are starting to explore: audio, video, images. I think there's a lot of really rich data sources out there, still, on the web.

There are also (I don't know the exact magnitudes, but I imagine) roughly a similar scale of private datasets out there. And I think that's going to be really important in certain applications. Imagine if you have a code generation system: It's great that it's trained on all of public GitHub, but it might be even more useful if it's trained on my own private code base. I think figuring out how to blend these public and private datasets is going to be really interesting. And I think it's going to open up a whole bunch of new applications, too.

From Character's perspective, and I guess more generally, one of the things that we're starting to see that is pretty exciting is this move away from, you could call it, static datasets: data that exists already out there, independent of AI systems. We're moving now, I think, toward datasets that are being built with AI in the loop. And so you have what people often refer to as data flywheels. You can imagine, say, for Character, we have all these rich interactions where a character is having a conversation with someone, and we get feedback on that conversation from the user, either explicitly or implicitly, and that's really the perfect data to use to make that AI system better.

And so we have these loops that I think are going to be really exciting and provide both richer and, perhaps, much larger data sources for the sort of next generation of systems.

* * *

The views expressed here are those of the individual AH Capital Management, L.L.C. (a16z) personnel quoted and are not the views of a16z or its affiliates. Certain information contained in here has been obtained from third-party sources, including from portfolio companies of funds managed by a16z. While taken from sources believed to be reliable, a16z has not independently verified such information and makes no representations about the enduring accuracy of the information or its appropriateness for a given situation. In addition, this content may include third-party advertisements; a16z has not reviewed such advertisements and does not endorse any advertising content contained therein.

This content is provided for informational purposes only, and should not be relied upon as legal, business, investment, or tax advice. You should consult your own advisers as to those matters. References to any securities or digital assets are for illustrative purposes only, and do not constitute an investment recommendation or offer to provide investment advisory services. Furthermore, this content is not directed at nor intended for use by any investors or prospective investors, and may not under any circumstances be relied upon when making a decision to invest in any fund managed by a16z. (An offering to invest in an a16z fund will be made only by the private placement memorandum, subscription agreement, and other relevant documentation of any such fund and should be read in their entirety.) Any investments or portfolio companies mentioned, referred to, or described are not representative of all investments in vehicles managed by a16z, and there can be no assurance that the investments will be profitable or that other investments made in the future will have similar characteristics or results. A list of investments made by funds managed by Andreessen Horowitz (excluding investments for which the issuer has not provided permission for a16z to disclose publicly as well as unannounced investments in publicly traded digital assets) is available at https://a16z.com/investments/.

Charts and graphs provided within are for informational purposes solely and should not be relied upon when making any investment decision. Past performance is not indicative of future results. The content speaks only as of the date indicated. Any projections, estimates, forecasts, targets, prospects, and/or opinions expressed in these materials are subject to change without notice and may differ or be contrary to opinions expressed by others. Please see https://a16z.com/disclosures for additional important information.
