How to Use a List of Tables and Fields from Azure Databricks or BigQuery for LLM Queries
You want to ask a question about your data, but you’re stuck.
You’re staring at your database or your analytics platform—Azure Databricks, BigQuery, Power BI—and thinking:
"Where do I even start? What tables do I use? What fields should I include?"
I’ve been there. Sometimes, the schema of your data is so big, complex, or unfamiliar that it becomes a mental block. You know the insight is there—total sales, revenue by product, user activity over time—but writing the SQL or DAX to get it feels overwhelming.
Here’s where Large Language Models (LLMs) like GPT come in. But wait—don’t just dump your raw data into an AI model.
You can keep it simple and safe: just give the model a list of your tables and fields. That way, it has the structure it needs to suggest queries—without ever touching sensitive data.
But there’s a catch:
- The quality of the suggestions depends heavily on how your tables and fields are named.
- If your schema is clear and descriptive (like `orders`, `products`, `order_date`), the LLM will have an easier time understanding what to use.
- If your schema is cryptic or generic (`tbl_123`, `fld_x`, `col1`), the model might suggest something completely off.
So, before you start, it’s worth cleaning up your table and field names (if you can) or at least explaining key naming conventions when asking your question.
In this article, I’ll show you how to:
- Create a list of your tables and fields from Azure Databricks or BigQuery.
- Feed that structure to an LLM to ask your data questions.
- Get SQL or DAX queries you can use—while keeping your actual data private.
Sound good? Let’s dive in.
📋 Step 1: Create Your Table & Field List
Before you ask an LLM to write a query, you need to tell it what’s in your database—but without giving it access to the actual data.
- If you're using Azure Databricks, run a simple query to list your tables and fields. For example:

```sql
SHOW TABLES IN your_schema;
DESCRIBE TABLE your_table;
```

- For BigQuery, you can use `INFORMATION_SCHEMA`:

```sql
SELECT table_name, column_name
FROM your_project.dataset.INFORMATION_SCHEMA.COLUMNS;
```
Save the output as a CSV file. This file just contains the structure—table names and fields—nothing sensitive.
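If you want to script this end-to-end, here's a minimal PySpark sketch for a Databricks notebook. It assumes the built-in `spark` session and a schema called `your_schema` (both placeholders for your own setup), and writes the structure to `table_structure.csv`:

```python
import pandas as pd

# `spark` is the session Databricks notebooks provide automatically.
tables = [row.tableName for row in spark.sql("SHOW TABLES IN your_schema").collect()]

rows = []
for table in tables:
    for col in spark.sql(f"DESCRIBE TABLE your_schema.{table}").collect():
        # DESCRIBE adds metadata sections whose rows start with '#' (or are blank); skip them.
        if col.col_name and not col.col_name.startswith("#"):
            rows.append({"table_name": table,
                         "field_name": col.col_name,
                         "data_type": col.data_type})

pd.DataFrame(rows).to_csv("table_structure.csv", index=False)
```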
🛠 But don’t stop there. If you can, also include:
✅ Foreign key relationships between tables:
- Which fields link one table to another?
- For example, `orders.customer_id` → `customers.customer_id`.
- This helps the LLM understand how to join tables in the queries it generates.
If your schema doesn’t explicitly store relationships, consider documenting them manually in your CSV or adding a “notes” column.
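For instance, a hand-maintained `relationships.csv` can be as simple as this (the rows below are illustrative, not from a real schema):

```
foreign_table,foreign_column,primary_table,primary_column,notes
orders,customer_id,customers,customer_id,each order belongs to one customer
orders,product_id,products,product_id,assumes one product per order row
```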
🔄 How to Automate This
- In BigQuery, the constraint views in `INFORMATION_SCHEMA` (`TABLE_CONSTRAINTS`, `KEY_COLUMN_USAGE`, and `CONSTRAINT_COLUMN_USAGE`) can help, as long as you've declared primary and foreign keys on your tables:

```sql
SELECT
  tc.table_name   AS foreign_table,
  kcu.column_name AS foreign_column,
  ccu.table_name  AS primary_table,
  ccu.column_name AS primary_column
FROM your_project.dataset.INFORMATION_SCHEMA.TABLE_CONSTRAINTS tc
JOIN your_project.dataset.INFORMATION_SCHEMA.KEY_COLUMN_USAGE kcu
  ON tc.constraint_name = kcu.constraint_name
JOIN your_project.dataset.INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE ccu
  ON tc.constraint_name = ccu.constraint_name
WHERE tc.constraint_type = 'FOREIGN KEY';
```
This gives you foreign key relationships, which you can merge with your table-field list.
- In Azure Databricks (or Spark SQL), you might not have referential constraints enforced. In that case:
- Look for metadata tables, comments in your data dictionary, or ETL documentation.
- Or automate by pattern matching (e.g., fields ending with `_id` linked to tables with the same name) using a Python or PySpark script.
- Example Python pseudo-code:

```python
import pandas as pd

# Load the table/field list exported in Step 1.
schema_df = pd.read_csv('table_structure.csv')

relationships = []
for table in schema_df['table_name'].unique():
    table_fields = schema_df[schema_df['table_name'] == table]['field_name']
    for field in table_fields:
        if field.endswith('_id'):
            referenced_table = field[:-3] + 's'  # naive pluralization: customer_id -> customers
            relationships.append((table, field, referenced_table, 'id'))

relationships_df = pd.DataFrame(
    relationships,
    columns=['foreign_table', 'foreign_column', 'primary_table', 'primary_column'],
)
relationships_df.to_csv('relationships.csv', index=False)
```
- Combine both CSVs—schema and relationships—into a single reference file for your LLM.
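One way to do that merge, as a small sketch assuming the two files produced above (the section markers are just comment lines so the LLM can tell the parts apart):

```python
import pandas as pd

schema_df = pd.read_csv("table_structure.csv")
relationships_df = pd.read_csv("relationships.csv")

# Write both parts into one file, separated by labelled comment lines.
with open("schema_for_llm.csv", "w") as f:
    f.write("# TABLES AND FIELDS\n")
    schema_df.to_csv(f, index=False)
    f.write("\n# RELATIONSHIPS\n")
    relationships_df.to_csv(f, index=False)
```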
📝 Pro Tip:
The better your table and relationship documentation, the better your LLM suggestions. Without relationships, you risk getting random joins or incomplete queries.
🤖 Step 2: Feed the CSV to Your LLM
Now comes the fun part. Instead of asking,
"Write me a query for my orders table where product is 'X' and date is in the last 30 days,"
you do this:
- Upload your CSV with the table/field list.
- Then ask your question:
"Using the uploaded table structure, write a SQL query that returns total sales for product X over the last 30 days."
The LLM reads the table/field structure like a schema, and it can now suggest SQL or DAX queries—without ever seeing your actual data.
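If you'd rather do this from a script than the chat UI, the same idea works through an API. A minimal sketch, assuming the OpenAI Python client (`openai>=1.0`) and a placeholder model name; any chat-capable model works the same way, and only the schema text is sent, never row data:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The combined schema/relationships file from Step 1.
with open("schema_for_llm.csv") as f:
    schema_text = f.read()

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder -- use whatever model you have access to
    messages=[
        {"role": "system",
         "content": "You write SQL for the schema below. Use only the listed tables and fields.\n\n" + schema_text},
        {"role": "user",
         "content": "Write a SQL query that returns total sales for product X over the last 30 days."},
    ],
)
print(response.choices[0].message.content)
```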
🧑💻 Step 3: Use the Suggested SQL or DAX
Once you get the generated query, just copy it into your favorite tool:
- Databricks SQL
- Power BI (with DAX)
- BigQuery SQL
You might need to adjust syntax or tweak filters, but the heavy lifting—understanding the schema and suggesting query logic—is done.
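Before running a suggested query for real, you can sanity-check it cheaply. A small sketch assuming the `google-cloud-bigquery` client library: a dry run validates the SQL against your schema and reports how many bytes it would scan, without executing anything:

```python
from google.cloud import bigquery

client = bigquery.Client()

suggested_sql = """
-- paste the LLM's suggestion here
SELECT 1
"""

# dry_run validates the query and estimates cost without running it.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(suggested_sql, job_config=job_config)
print(f"Query is valid; it would scan {job.total_bytes_processed:,} bytes.")
```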
🔐 Why This Works (and Why It’s Safe)
The beauty of this method is that:
- No sensitive data is shared.
- LLMs work with structure only—like a blank blueprint.
- It’s easy to update: just refresh your CSV with updated schema if your database changes.
- You can ask iterative questions (there's a scripted version of this after the list):
- "Now, add a condition for country = 'FR'."
- "Aggregate by month instead of day."
- And here’s a hidden bonus: keeping the context window small.
- By uploading just the table and field list, you avoid overloading the LLM’s token limit.
- This means faster responses, fewer errors.
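To do that iteration from a script, carry the message history forward. This sketch continues from the API snippet in Step 2 and reuses the `client`, `schema_text`, and `response` defined there (same placeholder model name):

```python
# Keep the earlier exchange in the history so the model refines its own answer.
messages = [
    {"role": "system",
     "content": "You write SQL for the schema below. Use only the listed tables and fields.\n\n" + schema_text},
    {"role": "user",
     "content": "Write a SQL query that returns total sales for product X over the last 30 days."},
    {"role": "assistant", "content": response.choices[0].message.content},
    {"role": "user",
     "content": "Now add a condition for country = 'FR' and aggregate by month instead of day."},
]

followup = client.chat.completions.create(model="gpt-4o", messages=messages)
print(followup.choices[0].message.content)
```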
🧐 Curious to Try This?
If you’re using Azure Databricks or BigQuery, just export your schema list and give this a shot. You’ll be amazed at how much faster, safer, and lighter it feels to query your data this way.
I’m thinking of expanding this into a full template system, maybe even with a Python script to automate schema export and query prompting. (If you’re interested, drop me a message).
Also, if you haven’t yet, check out my post on creating your own AI creative workflow. It’s all about blending tools to keep control of your data and your ideas.
What’s your take? Would you use a schema-only CSV to feed an LLM for queries? Or do you think there’s a smarter way?
Drop your thoughts below—I’d love to hear from you.
