SAN FRANCISCO--(BUSINESS WIRE)--Unstructured, the leader in ingestion and preprocessing for large language models (LLMs), today announced its $40 million Series B led by Menlo Ventures with participation from Databricks Ventures, IBM Ventures, Sacramento Kings Chairman Vivek Ranadivé, Datastax CEO Chet Kapoor, Allison Pickens of the New Normal Fund, and NVentures, NVIDIA’s venture capital arm, as well as existing investors Madrona, Bain Capital Ventures (BCV), and Mango Capital. Tim Tully of Menlo Ventures joined the board of directors as part of the investment, which brings the company’s total capital raised to $65 million. Unstructured will leverage this new injection of capital to grow its team and accelerate its development of data preprocessing tooling for LLMs.
Globally, more than half of organizations have increased their investment in generative AI programs over the past year, but the rise of this transformational technology presents a massive data challenge. While the advent of the “modern data stack” over the past decade unlocked structured data for advanced analytics, until now there hasn’t been an equivalent set of tooling for the more than 80% of enterprise data that is unstructured. This includes files such as emails, documents, images, videos, and other data that organizations historically haven’t been able to use at scale in conjunction with machine learning. Addressing this critical gap, Unstructured is the first and only company that can ingest and preprocess all unstructured data into formats ready for use with foundation models.
Since its founding in 2022, Unstructured has been at the forefront of the productization of enterprise LLMs, empowering organizations to quickly automate the transformation of their messy, unstructured data into the formats necessary for retrieval augmented generation (RAG) and LLM fine-tuning. Unstructured’s technology has emerged as a critical piece of infrastructure, not only delivering LLM-ready data to vector databases but also driving performance improvements of more than 20% across LLM applications without any customization. Unstructured’s open source library has been downloaded more than 6 million times and is used by more than 12,000 code bases. More than 45,000 organizations, including more than one third of the Fortune 500, are using Unstructured to preprocess their proprietary data.
In January 2024, the company released its commercial SaaS API and already has more than 1,000 paying customers. In February, Unstructured announced its enterprise platform, which is the first solution to continuously extract raw unstructured data from existing databases, transform more than 30 file types into LLM-ready formats, and automatically load this data into a vector database for RAG. Developers and data scientists spend more than 75% of their time preparing data, and Unstructured’s solution removes the critical barrier to moving LLM pilots into production. The real-time, continuous data access that Unstructured provides means that LLMs are kept up to date, have access to knowledge specific to organizations, and are less prone to hallucinations.
“Over the last decade the emergence of the modern data stack has enabled analytics products to take advantage of the cloud and structured data to deliver incredible value to organizations, but the development of LLMs nested in a RAG architecture has enabled a similar shift for the world of unstructured data. For the first time, developers are able to interact with all of their data through large foundation models. This new data stack rests on four key components: LLMs, orchestration frameworks, new cloud storage solutions, and ingestion and preprocessing tooling,” said Brian Raymond, CEO and Founder of Unstructured. “A critical bottleneck to realizing the emerging value of LLMs is the ability to ingest and preprocess any human-generated data into an LLM-ready format. 2024 will be the year of moving LLM prototypes into production and organizations of all types and sizes are hungry to build out these architectures efficiently and at scale. Automating the process of structuring data and seamlessly delivering it into storage is critical for enterprises that want to build solutions on this new tech stack and go to market quickly.”
“Unstructured has built an exceptional cloud AI platform to help developers build data pipelines for RAG, AI applications, chatbots, and more,” said Tim Tully, Partner at Menlo Ventures. “It has become the preferred way developers build AI applications and assemble data pipelines. People in the industry know that RAG quickly became the industry standard. Soon they will understand that Unstructured is the tip of the RAG spear.”
“Generative AI is key to gathering useful, intelligent insights from the massive amounts of data that enterprises create every day,” said Mohamed “Sid” Siddeek, corporate vice president and Head of NVentures at NVIDIA. “Unstructured is an emerging leader in data ingestion and preprocessing, working to make AI more accessible, useful, and powerful for all.”
“Unstructured is turning the data challenge into opportunity — helping businesses optimize for AI,” said Thomas Whiteaker, Investment Partner at IBM Ventures. “We are proud to invest in a company that shares our mission of driving AI for business and empowering enterprises to unlock greater insights from their data.”
“We are thrilled to invest and partner with the Unstructured team,” said Andrew Ferguson, VP of Corporate Development and Ventures at Databricks. “Unstructured is rapidly becoming a critical technology for delivering RAG-ready data to the Databricks platform and more than 120 customers are already using its best-in-class data preprocessing tool. We look forward to growing our partnership and accelerating enterprise adoption of generative AI.”
For organizations eager to unlock the full potential of their data, Unstructured offers an open source solution, a commercial SaaS API, Marketplace APIs with Azure and AWS, and its commercial platform, currently in beta. For details on how to get started, visit unstructured.io.
About Unstructured
Unstructured is the leading provider of LLM data preprocessing solutions, empowering organizations to transform their internal unstructured data into formats compatible with large language models. By automating the transformation of complex natural language data found in formats such as PDF, PPTX, HTML, and more, Unstructured enables enterprises to leverage the full power of their data for increased productivity and innovation. With key partnerships and a growing customer base, Unstructured is driving the adoption of enterprise LLMs worldwide. To learn more, visit unstructured.io or email hello@unstructured.io.