Here at Aligned, we believe that business decisions should be backed by relevant data. Like most startups, we’re constantly testing out new features and onboarding new users to our platform, which requires frequent evaluation. However, crafting and updating queries to answer every new business question quickly turns into a time-consuming task.
While some companies are now adopting Text-to-SQL solutions to simplify data querying, we found that there are no comparable solutions available for teams working with Firebase applications.
Recognizing this challenge, we developed a solution for querying Firestore databases using Large Language Models (LLMs) and retrieval-augmented generation (RAG) techniques.
Our Text-to-Firestore solution allows anyone on our team to query our database with natural language. Now our teams can simply ask questions like “How many users signed up by country in the last 30 days?” or “Show me monthly revenue so far this year and highlight any discrepancies”.
Using a new RAG technique developed by Anthropic called contextual retrieval, we were able to build a solution that delivers quick results with 93% accuracy.
This has significantly reduced the time spent on manual coding, enabling our team to learn about our users more quickly and innovate faster.
While originally developed for internal use at Aligned, our solution is flexible and can be adapted by any company that uses Firestore.
This article marks the beginning of a series of posts exploring the development of our end-to-end system which enables our team to query a Firestore database using natural language prompts.
The Problem
Providing our stakeholders with quick access to crucial data is a constant challenge as we grow. Building reports and data infrastructure takes effort to develop and requires constant iteration. Any data that falls outside the capabilities of our existing reports demands regular ad-hoc data requests, consuming valuable time and resources.
Difficult data access may lead to several issues for our company:
Incorrect or delayed decisions. Without quick access to insights, our team may rely on gut instinct rather than data-driven strategies.
Knowledge silos. Critical company information and insights get siloed within specific teams or individuals rather than shared with all key stakeholders.
Resource drain. Our skilled engineers risk spending significant time writing queries and optimizing reports, effort that could be better spent elsewhere.
Limited innovation. When data access is limited, our team is less likely to explore patterns, prototype solutions, or run quick experiments.
To tackle this challenge, we wanted to build in-house data infrastructure to allow anyone to quickly get any question about our data answered. Given our experience implementing AI-driven systems, we developed an approach using LLMs and RAG as the foundation of our solution.
Text-to-Code Generation
LLMs are now skilled at generating code across a wide variety of coding languages and tasks, making them an ideal solution for enabling users to easily query databases with natural language.
For example, Google’s Gemini AI assistant currently supports a text-to-code feature for answering user queries in Google Sheets. Gemini can generate complex formulas and even Python code snippets to summarize and analyze data in the current document.
Other LLMs also support similar capabilities. For example, by uploading data sources into ChatGPT (or by connecting to a cloud service like Google Drive), you can then ask the assistant to generate a summary of your data or retrieve specific data points.
For more robust use cases, text-to-code generation can help generate SQL queries from natural language. Online, you can find a growing body of literature surveying the current landscape of Natural Language to SQL (NL2SQL) models.
Businesses are now adopting text-to-SQL solutions to improve productivity and empower non-technical teams with data. Recently, the Pinterest Engineering team shared a blog post detailing how they were able to create a Text-to-SQL solution that led to a substantial increase in team productivity.
Our platform stores data in Firestore, a NoSQL database ideal for rapid and flexible development. Unlike traditional SQL databases, which rely on tables, data in Firestore is stored in documents, which are grouped into collections.
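To make that concrete, here is a minimal sketch of how a record might look in Firestore using the official google-cloud-firestore Python client; the "users" collection and its fields are illustrative examples, not our actual schema.

```python
# A row in a SQL "users" table becomes a document in the "users" collection.
# Documents in the same collection may have different fields.
from google.cloud import firestore

db = firestore.Client()
db.collection("users").document("user_123").set({
    "country": "US",
    "signedUpAt": "2024-01-15T09:30:00+00:00",  # illustrative names and values
    "plan": "pro",
})
```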
While many solutions already exist for converting natural language into SQL, there are surprisingly few resources available for converting natural language into executable Firestore queries.
Recognizing this opportunity, we set out to develop a system capable of querying our own Firestore database.
Our Solution: Text-to-Firestore
We built a Text-to-Firestore solution that enables us to quickly access data from our Firestore database — simply by entering straightforward, plain-language queries.
We built a simple interface to allow our team to submit queries in plain English. If the query is successful, the user is shown both the query results and the generated Python code. The Python code allows our eng/data teams to quickly iterate.
Here’s the pipeline we developed to achieve this:
User Query. Our system takes a natural language query from the user, e.g., “How many users were active in the past 30 days?”.
Database Schema Retrieval. We use RAG to retrieve the Firebase collections most relevant to the user’s query.
Code Generation. We prompt an LLM to generate Python code to answer the query, given the relevant Firestore collections, context about our platform and database, and instructions on how to query and format the results.
Code Validation. We validate the generated Python code against a tight set of security rules.
Code Execution. We execute the Python code against our Firestore database. If the code fails to execute, we retry once, starting again from the Database Schema Retrieval step.
Return Results. Return the query result in the most relevant markdown format (like a graph, table, or list) as well as the generated Python code.
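Below is a simplified sketch of how this loop can be orchestrated. The helper functions (retrieve_relevant_collections, generate_query_code, validate, execute) are hypothetical placeholders for the steps above, not our production API.

```python
# A minimal sketch of the Text-to-Firestore pipeline loop described above.
MAX_ATTEMPTS = 2  # run once, retry at most once on execution failure

def answer_question(user_query: str):
    error_context = None
    for _ in range(MAX_ATTEMPTS):
        # Database Schema Retrieval: fetch the most relevant collections via RAG
        collections = retrieve_relevant_collections(user_query, top_k=5)
        # Code Generation: prompt the LLM for Python code (include any prior error)
        code = generate_query_code(user_query, collections, error_context)
        # Code Validation: read-only, allowed collections only, no eval/exec
        validate(code, allowed_collections=collections)
        try:
            # Code Execution: run against Firestore, return Markdown result + code
            return execute(code), code
        except Exception as exc:
            error_context = str(exc)
    raise RuntimeError("Query failed after retry")
```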
Results
To test our Text-to-Firestore pipeline, we evaluated its performance against 21 unique queries. We evaluated each query three times (to account for the stochastic nature of LLM outputs), resulting in an evaluation set of 63 total runs.
For each query, we’ve also noted which specific collections were required to correctly answer the query. You can see a couple of example queries from the evaluation set below.
Using these test cases, we can now evaluate how well our RAG module performed.
We tested the RAG module’s performance at retrieving the top-K most relevant documents for three values of K (Top-3, Top-5, and Top-10).
Using a traditional RAG system, we measured an 85.71% accuracy rate for Top-3 document retrieval. Expanding the retrieval scope to the top 10 documents increases accuracy, but it also significantly increases prompt length, negating one of the main benefits of using RAG in the first place: keeping prompts short.
We pivoted to a modified RAG method developed by Anthropic called contextual retrieval to improve retrieval accuracy and overall performance. This improved Top-3 retrieval accuracy to 93.55% and achieved 100% accuracy for Top-5 retrieval.
In the next stage of our evaluation, we tested our Text-to-Firestore solution's code generation capabilities, focusing on three key areas: code validity, successful execution, and accuracy of the final output.
With contextual retrieval, we achieved 91-93% accuracy during evaluation, a significant improvement from 73.02% accuracy when using traditional RAG.
We also observed the following results:
100% of the generated Python scripts passed our validation script.
93-95% of our Python scripts were able to execute after 1-2 attempts.
We’ve observed that about ⅓ of our queries required a retry.
Queries generally take 8-12 seconds to complete, and between 12-18 seconds if a retry is required.
How We Did It
In this section, we will provide a detailed overview of the initial architecture of our Text-to-Firestore system, where we relied on non-contextual RAG techniques to manage collection schema retrieval.
Text-to-Code Generation
We tested both GPT-4o and Claude 3.5 Sonnet as our code-generation LLM. Both models performed remarkably well, with GPT-4o having marginally better accuracy overall.
We used Python to query and process the Firestore data, although we could also have used any language with a Firestore client library, such as JavaScript or Dart (via Flutter). To streamline the process further, we instruct the generated code to format its result as Markdown, which is what the client-side application ultimately displays.
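To illustrate the target output, here is the kind of script we hope the LLM produces for a question like “How many users signed up by country in the last 30 days?”. The collection and field names are illustrative, and the real generated code varies.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

from google.cloud import firestore

db = firestore.Client()
# Dates are stored as ISO-8601 strings in our database, so we compare strings.
cutoff = (datetime.now(timezone.utc) - timedelta(days=30)).isoformat()

docs = db.collection("users").where("signedUpAt", ">=", cutoff).stream()
counts = Counter(doc.to_dict().get("country", "unknown") for doc in docs)

# Format the answer as a Markdown table for the client-side application.
lines = ["| Country | Signups |", "| --- | --- |"]
lines += [f"| {country} | {n} |" for country, n in counts.most_common()]
print("\n".join(lines))
```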
However, here we encountered one major challenge. How do we let the model know how our data is structured?
Adding Company and Database Context
To guide code generation, we’ve included a comprehensive company summary document in our prompt. The document details the different types of users that use the Aligned platform, the basic flow for each user type, and an overview of how our payment system is set up. The document also explains how our business processes map to different collections within our Firestore database.
Here’s a short snippet of how this document describes the basic flow of the manager user type.
While this document gives context on the overall structure of our platform, the AI model still lacks knowledge of how the data in Firestore itself is structured.
A naive approach would be to include a JSON representation of the Firestore database in the prompt.
This JSON file should define a unified schema representation of the Firestore database:
Firestore schema representation. Each collection must have a list of fields and their corresponding data types.
Firestore indexes. Each collection must also list which fields are used for indexing.
Firestore enums. For fields with restricted value sets in each collection, specify their allowable values.
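As a hedged illustration, one entry in such a schema file might look like the snippet below; the collection, fields, and enum values are examples rather than our actual schema.

```json
{
  "collection": "users",
  "fields": {
    "country": "string",
    "signedUpAt": "string",
    "plan": "string",
    "totalSpendMicroUsd": "number"
  },
  "indexes": [["country", "signedUpAt"]],
  "enums": { "plan": ["free", "pro", "enterprise"] }
}
```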
However, if we tried to include all this context, our prompts would not only be more costly, but may also suffer from lower speed and accuracy.
Optimizing Prompts with RAG
In order to work with smaller prompts, we can adapt a retrieval-augmented generation (RAG) architecture to retrieve only the most relevant collections for each user query.
The RAG pipeline can be broken down into three main stages:
Ingestion. Knowledge is split into chunks and indexed.
Retrieval. When a query is received, search through the ingested data to find relevant chunks.
Synthesis. Include retrieved information when sending prompts to a generative model.
Traditional RAG pipelines typically use an embedding model to convert data from a knowledge base (such as our collection schemas) into vectors. These vectors are then stored and indexed in a vector database for quick retrieval later.
When a user sends a query like “Who are the top performing users this year?”, we use the RAG module to retrieve the top-K collections that seem most relevant to the user’s request.
These relevant collections will then be added to our prompt as additional context before being sent to the LLM.
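Here is a minimal sketch of what this retrieval step can look like. We keep the index in memory for clarity, whereas a real deployment would use a vector database, and the embedding model shown (OpenAI’s text-embedding-3-small) is an assumption for illustration rather than the specific model we use.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    # Embed a piece of text; the model choice here is an assumption.
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Ingestion: one chunk per collection schema. schema_texts maps a collection
# name to a text description of its fields, indexes, and enums (placeholders here).
schema_texts = {"users": "...", "payments": "...", "sessions": "..."}
index = {name: embed(text) for name, text in schema_texts.items()}

def top_k_collections(query: str, k: int = 5) -> list[str]:
    # Retrieval: rank collections by similarity to the user's query.
    q = embed(query)
    return sorted(index, key=lambda name: cosine(q, index[name]), reverse=True)[:k]
```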
Validating and Running the Query
At this point, our prompt should now have the following content:
The original user query
Context on our company/platform and the database we’re using
A set of collections relevant to our query
Instructions on how to query the database and format the results
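An illustrative (and heavily simplified) template for assembling those pieces might look like this; the exact wording of our production prompt differs.

```python
PROMPT_TEMPLATE = """You are a data analyst for the Aligned platform.

Company and database context:
{company_context}

Relevant Firestore collections (schemas, indexes, and enums):
{collection_schemas}

Write Python code using the google-cloud-firestore client that answers the
question below. Only read data, only query the collections listed above, and
print the final result as Markdown.

Question: {user_query}
"""
# Filled in with .format(company_context=..., collection_schemas=..., user_query=...)
```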
Now that we’ve built our prompt, we pass it to an LLM to produce Python code.
It’s good practice to validate the generated code first before executing. We use a separate validator script that performs the following checks:
Checks that the Python code is syntactically valid
Checks that only the provided collections are queried
Checks that the code performs only read operations
Checks that the code does not contain potentially malicious calls like “eval” or “exec”
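A hedged sketch of what such a validator can look like, using Python’s ast module; our actual checks are more extensive, and the allowed collections come from the retrieval step.

```python
import ast

BANNED_CALLS = {"eval", "exec", "compile", "__import__", "open"}
WRITE_METHODS = {"set", "update", "delete", "add"}

def validate(code: str, allowed_collections: set[str]) -> None:
    tree = ast.parse(code)  # raises SyntaxError if the code isn't valid Python
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        fn = node.func
        if isinstance(fn, ast.Name) and fn.id in BANNED_CALLS:
            raise ValueError(f"Banned call: {fn.id}")
        if isinstance(fn, ast.Attribute):
            if fn.attr in WRITE_METHODS:
                raise ValueError(f"Write operation not allowed: {fn.attr}")
            # db.collection("name") may only reference retrieved collections
            if fn.attr == "collection" and node.args:
                arg = node.args[0]
                if isinstance(arg, ast.Constant) and arg.value not in allowed_collections:
                    raise ValueError(f"Collection not allowed: {arg.value}")
```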
After the validation step, our system executes the Python code. Following execution, we expect one of two outcomes: (a) the code successfully runs and generates the correct answer to the query, or (b) the code fails during execution and returns an error.
If the latter occurs, we’ll pass both the code and the corresponding error into a new prompt to retry one more time. As the results show, this occurs about ⅓ of the time.
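The retry prompt itself can be as simple as the sketch below, with the wording simplified for illustration.

```python
RETRY_TEMPLATE = """The following Python code failed while answering this question:
{user_query}

Code:
{code}

Error:
{error}

Fix the code so it runs correctly and keeps the same Markdown output format."""
```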
When the code runs successfully, the formatted output and the generated Python code will then be sent back to the web application for display.
Why Querying Firestore is Harder than SQL
While evaluating our system, we discovered a few properties that make Firestore uniquely difficult to generate queries for:
Firestore is a NoSQL database. Data is unstructured, documents in the same collection can have different combinations of fields, and fields aren’t strictly typed.
No foreign key relationships between tables. This makes joining data across collections non-trivial.
Fields can be nullable. When aggregating data with nullable fields, null values may skew output. Because of this, we needed to be careful of how we process the data post-query.
Firestore queries themselves have certain limitations. For example, inequality (!=) filters exclude documents where the field is null or missing, and there are limits on the number of clauses you can use per query (see the workaround sketch after this list).
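As one concrete example, the null-field behavior means the generated code sometimes has to fetch a collection and filter in Python rather than in the query itself; the collection and field names below are illustrative.

```python
from google.cloud import firestore

db = firestore.Client()
# A "!=" filter excludes documents where the field is null or missing, so when
# those documents matter we fetch and filter client-side instead.
users = [doc.to_dict() for doc in db.collection("users").stream()]
not_active = [u for u in users if u.get("status") != "active"]  # includes null/missing
```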
We also encountered particular query challenges due to our business and database structure.
A recurring issue is the AI model not understanding how our data is represented. For instance, monetary values in our database are stored as MicroUSD and our dates are stored as strings rather than timestamps. Our database also uses enums to represent states across different collections.
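Two small conversions the generated code has to get right, sketched here with an assumed ISO-8601 date format:

```python
from datetime import datetime

def micro_usd_to_usd(amount_micro: int) -> float:
    # Monetary values are stored as MicroUSD, i.e. millionths of a dollar.
    return amount_micro / 1_000_000  # e.g. 12_990_000 -> 12.99

def parse_date(value: str) -> datetime:
    # Dates are stored as strings; assuming ISO-8601 like "2024-01-15T09:30:00+00:00".
    return datetime.fromisoformat(value)
```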
What’s Next?
In Part II of this series, we'll focus on how we were able to enhance the performance of our Text-to-Firestore system by adopting a new RAG technique known as contextual retrieval.
This new method expands on traditional RAG by pre-processing each chunk to include helpful context.
We’ll take a closer look at the basics of contextual retrieval, an overview of how we pre-processed our data to fit this infrastructure, and how this new method compares to traditional RAG in practice.
Stay tuned!
Need help building your own Text-to-Firestore query tools? Let us know, we’d love to help.