Google just open-sourced LangExtract, a free tool that does what $50K enterprise document extraction software does
The data extraction market has been growing quickly, quietly, and profitably. That calm just ended. Google has released an open-source tool that puts this multi-billion-dollar industry on notice. The tool, called LangExtract, tackles a problem that has long required paid software, custom pipelines, or large data teams: turning messy, unstructured text into clean, structured data at scale. This time, it’s free.
LangExtract launched in July 2025 as part of Google’s Gemini-powered information extraction stack. It’s a Python library built to pull structured information from long documents using large language models. What sets it apart is verification. Every extracted entity is tied back to its exact position in the source text, down to character offsets. The result is data that can be reviewed, audited, and traced visually.
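To make the idea of character-offset grounding concrete, here is a minimal sketch using only the Python standard library. It does not use LangExtract and does not reflect how the library is implemented; the `GroundedExtraction` record and the `ground` helper are hypothetical names introduced purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GroundedExtraction:
    """Hypothetical record: an extracted value plus where it sits in the source."""
    label: str
    text: str
    start: int  # inclusive character offset into the source
    end: int    # exclusive character offset into the source


def ground(source: str, label: str, value: str) -> Optional[GroundedExtraction]:
    """Locate `value` verbatim in `source`; return None if it does not appear."""
    start = source.find(value)
    if start == -1:
        return None  # ungrounded: nothing in the text backs this value
    return GroundedExtraction(label, value, start, start + len(value))


source = "Patient was given 250 mg of amoxicillin twice daily."
dose = ground(source, "dosage", "250 mg")
drug = ground(source, "medication", "amoxicillin")
missing = ground(source, "medication", "ibuprofen")

# A reviewer can slice the source with the stored offsets to audit each item.
for item in (dose, drug):
    print(item.label, item.start, item.end, repr(source[item.start:item.end]))
```

Because each record carries its own offsets, an extraction that cannot be located in the text (like `missing` here) is easy to flag rather than silently trusted.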
“We’re excited to introduce LangExtract, a new open-source Python library designed to empower developers to do just that. LangExtract provides a lightweight interface to various LLMs such as our Gemini models for processing large volumes of unstructured text into structured information based on your custom instructions, ensuring both flexibility and traceability,” Google said in its announcement on the Google Developers Blog.
That matters in a market where trust is fragile. Estimates of the global data extraction market for 2024 and 2025 range from roughly $1.5 billion to more than $5 billion, with forecasts reaching tens of billions by the mid-2030s. Growth has been fueled by cloud adoption, AI use inside enterprises, and pressure to automate document-heavy workflows across healthcare, finance, law, and compliance. LangExtract arrives squarely in that demand curve.
Instead of relying on brittle scripts or opaque APIs, developers define what they want using a schema and a handful of examples. LangExtract then applies that structure across large document sets, returning outputs such as JSON that remain tied to the original text. Long files are handled through chunking and parallel passes, and results can be reviewed through interactive HTML files that highlight each extraction in context.
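The chunking and parallel processing described above can be sketched in plain Python. This is an illustration under stated assumptions, not LangExtract's actual code: `chunk_with_offsets`, `fake_extract`, and `extract_document` are hypothetical names, and `fake_extract` is a simple term finder standing in for a model call. The point is the bookkeeping: each chunk remembers where it started, so chunk-local spans can be mapped back to positions in the full document.

```python
from concurrent.futures import ThreadPoolExecutor
import re

TERMS = ("invoice", "refund")  # hypothetical target vocabulary


def chunk_with_offsets(text: str, size: int, overlap: int):
    """Yield (start_offset, chunk) windows; overlap keeps boundary spans intact."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    for start in range(0, len(text), step):
        yield start, text[start:start + size]
        if start + size >= len(text):
            break  # this window already reaches the end of the text


def fake_extract(chunk: str):
    """Stand-in for a model call: returns chunk-local (start, end, label) spans."""
    spans = []
    for term in TERMS:
        for match in re.finditer(re.escape(term), chunk):
            spans.append((match.start(), match.end(), term))
    return spans


def extract_document(text: str, size: int = 40, overlap: int = 10, workers: int = 4):
    """Process chunks in parallel, then shift spans to document-level offsets."""
    chunks = list(chunk_with_offsets(text, size, overlap))
    found = set()  # a set removes duplicates produced by overlapping windows
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_chunk = pool.map(lambda item: fake_extract(item[1]), chunks)
        for (base, _), spans in zip(chunks, per_chunk):
            for start, end, label in spans:
                found.add((base + start, base + end, label))
    return sorted(found)


text = (
    "The invoice was sent in March. A refund followed in April "
    "after the invoice was disputed by the customer."
)
for start, end, label in extract_document(text):
    print(start, end, label, repr(text[start:end]))
```

In this sketch the overlap is chosen to be at least as long as the longest term, so a term split by one window boundary is still seen whole in a neighboring window.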
Before getting into the broader market impact, it helps to see what LangExtract actually does in practice — and what it makes unnecessary.
What LangExtract Does
- Extracts structured data from unstructured text using large language models
- Grounds every extracted entity to its exact location in the source document
- Handles long documents, including files exceeding 100 pages
- Produces interactive HTML files for in-context review and verification
- Works with cloud-based models and local models via tools like Ollama
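As a rough illustration of how grounded extractions lend themselves to in-context review, the sketch below turns a list of character spans into highlighted HTML using only the standard library. It is not LangExtract's visualization code; `render_highlights` is a hypothetical helper, and the span format `(start, end, label)` is an assumption made for this example.

```python
import html


def render_highlights(source: str, spans) -> str:
    """Wrap each (start, end, label) span in <mark>, escaping all other text."""
    parts = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            continue  # skip overlapping spans so the markup stays well-formed
        parts.append(html.escape(source[cursor:start]))
        parts.append(
            f'<mark title="{html.escape(label)}">'
            f"{html.escape(source[start:end])}</mark>"
        )
        cursor = end
    parts.append(html.escape(source[cursor:]))
    return "<p>" + "".join(parts) + "</p>"


source = "Dose < 500 mg: amoxicillin"
start = source.index("amoxicillin")
page = render_highlights(source, [(start, start + len("amoxicillin"), "medication")])
print(page)
# <p>Dose &lt; 500 mg: <mark title="medication">amoxicillin</mark></p>
```

Because the highlight is driven by offsets rather than by re-searching the text, the reviewer sees exactly the span the extraction claimed, even when the same word appears elsewhere in the document.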
What LangExtract Replaces
- Regex-based pattern matching that breaks on format changes
- Custom named-entity recognition pipelines that demand constant upkeep
- Paid extraction APIs that charge by volume with limited transparency
- Manual data entry workflows in document-heavy environments
This shift has broader implications for modern AI systems. Retrieval-augmented generation relies on clean, structured metadata to work well. LangExtract feeds those systems with traceable structure rather than loose text blobs, improving retrieval accuracy and reducing silent failure modes when models are asked to reason over large document collections.
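A toy example may help show why structured metadata matters for retrieval. The sketch below is an assumption-laden illustration, not part of LangExtract or of any specific RAG framework: `Chunk` and `retrieve` are hypothetical names, and the word-overlap score is a deliberately crude stand-in for a real ranking method such as embedding similarity.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    # Structured fields assumed to come from an upstream extraction step.
    metadata: dict = field(default_factory=dict)


def retrieve(chunks, query_terms, **required):
    """Keep chunks whose metadata matches `required`, then rank by term overlap."""
    candidates = [
        chunk for chunk in chunks
        if all(chunk.metadata.get(key) == value for key, value in required.items())
    ]

    def score(chunk):
        words = set(chunk.text.lower().split())
        return len(words & set(query_terms))

    return sorted(candidates, key=score, reverse=True)


chunks = [
    Chunk("Amoxicillin 250 mg twice daily for ten days.", {"medication": "amoxicillin"}),
    Chunk("Ibuprofen 200 mg as needed for pain.", {"medication": "ibuprofen"}),
    Chunk("Amoxicillin was stopped after a rash appeared.", {"medication": "amoxicillin"}),
]

hits = retrieve(chunks, {"rash", "stopped"}, medication="amoxicillin")
for hit in hits:
    print(hit.text)
```

Filtering on an extracted field before ranking removes the ibuprofen passage outright, which is the kind of silent mismatch that free-text matching alone can let through.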
Google positions LangExtract as a developer utility, yet its impact extends beyond that. By open-sourcing a tool that covers core extraction needs across industries, Google has compressed a wide category of paid products into a library call. That doesn’t erase the market overnight, but it resets expectations around pricing, differentiation, and value.
LangExtract does not promise perfection. Results still depend on the underlying model and the quality of the examples provided. The library can also supplement extracted facts with the model’s own knowledge, but such additions are inferred rather than grounded in the source text, so their accuracy depends on the model and needs separate verification. Even so, the direction is clear. Data extraction is shifting from a standalone product category into a shared layer of the AI stack.
This is what open source from Google looks like when it lands in the middle of a growing market—and why a once-comfortable industry is paying attention.
Below are examples of LangExtract in action.
