ETL Unstructured Text to BigQuery Tables with Gemini
Use BigQuery external tables and Gemini to transform GCS text files (e.g., battle reports) into structured JSON tables for SQL analytics, enabling AI agent knowledge bases without data duplication.
Streamline GCP Lab Setup for Cost-Free AI Experiments
Start with a personal Gmail account to avoid restrictions on corporate or edu emails. Redeem $5 free credits via the lab link—no payment info needed; verify in the Credits section of Google Cloud Console.
Launch Cloud Shell (G+S shortcut) for a persistent VS Code-like editor in a managed VM. Verify authentication with gcloud auth list (run gcloud auth login if needed). Clone repos: agentverse-data-engineer for ETL code and agentverse-dungeon for the gamified boss endpoint.
Run ./init.sh to auto-create a project (e.g., agentverse-scholar-XXXX), link credits, install SDKs, and set billing. Confirm project ID in yellow terminal prompt; switch with gcloud config set project <ID> if lost. Enable APIs (BigQuery, Storage, Vertex AI, Dataflow) via script—takes 20-60s, no billing until usage.
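If the prompt looks wrong, these checks cover the common failure points (a minimal sketch; the API list mirrors the services named above):

# Verify the authenticated account and the active project.
gcloud auth list
gcloud config get-value project

# Re-point Cloud Shell at the lab project if needed.
gcloud config set project agentverse-scholar-XXXX  # substitute your generated project ID

# Enable the lab's APIs (idempotent; safe to re-run).
gcloud services enable bigquery.googleapis.com storage.googleapis.com \
    aiplatform.googleapis.com dataflow.googleapis.com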
Grant IAM roles to the default Compute service account for Storage, BigQuery, Dataflow access (demo-only; production needs least-privilege per-service accounts). Pre-build/deploy the dungeon image to Artifact Registry and Cloud Run via Cloud Build for later RAG agent testing.
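A demo-only sketch of what those grants look like; the exact role list is an assumption based on the access described above:

# Look up the project number behind the default Compute service account.
PROJECT_NUMBER=$(gcloud projects describe "$PROJECT_ID" --format='value(projectNumber)')

# Demo-only: broad roles on a single account. Production should split these
# across dedicated least-privilege service accounts.
for ROLE in roles/storage.objectAdmin roles/bigquery.admin roles/dataflow.worker; do
  gcloud projects add-iam-policy-binding "$PROJECT_ID" \
    --member="serviceAccount:${PROJECT_NUMBER}-compute@developer.gserviceaccount.com" \
    --role="$ROLE"
done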
Common pitfall: Multi-project confusion—always verify active project. Refresh Cloud Shell if auth drops.
Quote: "Separation of principles is separation of roles is really good practice. This is like a demo purpose so that we using a single one for easy like so that you can easily follow."
Convert Unstructured GCS Files to Queryable BigQuery Tables
Unstructured data (PDFs, text, Word) resists SQL analytics due to inconsistent formats. Solution: Store in GCS, create BigQuery external tables as pointers—no copying, no governance headaches across envs (dev/staging/prod).
Seed GCS reports bucket with text files (e.g., historical battle reports: adventurers vs. monsters). Real-world: PhD PDFs or business docs.
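Seeding can be as simple as the following (the bucket name follows the $PROJECT_ID-reports convention used later; the local reports/ path is a hypothetical):

# Create the reports bucket and upload the sample battle reports.
gcloud storage buckets create gs://$PROJECT_ID-reports --location=us
gcloud storage cp reports/*.txt gs://$PROJECT_ID-reports/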
Create a BigQuery connection (which provisions its own service-account identity) for cross-service access; that service account then needs GCS read permission, granted below:
bq mk --connection --location=us --project_id=$PROJECT_ID --connection_type=CLOUD_RESOURCE reports_connection
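The grant itself, as a sketch (the jq path into bq's JSON output matches current bq behavior for CLOUD_RESOURCE connections, but verify against your CLI version):

# Find the service account BigQuery provisioned for the connection...
CONNECTION_SA=$(bq show --connection --format=json \
    $PROJECT_ID.us.reports_connection | jq -r '.cloudResource.serviceAccountId')

# ...and give it read access on the reports bucket.
gcloud storage buckets add-iam-policy-binding gs://$PROJECT_ID-reports \
    --member="serviceAccount:$CONNECTION_SA" \
    --role="roles/storage.objectViewer"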
Then build an external table over the GCS text files (the connection belongs in a WITH CONNECTION clause, not in OPTIONS):
CREATE EXTERNAL TABLE `project.dataset.text_reports` (
  content STRING  -- each report line becomes one row in this column
)
WITH CONNECTION `projects/$PROJECT_ID/locations/us/connections/reports_connection`
OPTIONS (
  format = 'CSV',  -- BigQuery has no 'TEXT' format; CSV with an unused delimiter reads raw lines
  field_delimiter = '\x1F',  -- assumption: this byte never appears in the reports
  uris = ['gs://$PROJECT_ID-reports/*']
);
This enables petabyte-scale queries on raw text without loading/copying, minimizing costs and security risks.
Principle: External tables decouple storage from compute—query GCS directly via SQL, transform later with Gemini.
Before/after: Raw text files → Queryable lines via SELECT * FROM text_reports LIMIT 10. No more manual parsing.
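For instance, assuming the content column from the DDL above:

-- Preview raw report lines straight from GCS; no load job involved.
SELECT content
FROM `project.dataset.text_reports`
LIMIT 10;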
Next: Use Gemini (via Vertex AI) for ETL—extract JSON schemas (e.g., monsters table: name, HP; battles: date, outcome). Each JSON key becomes a BigQuery table for analytics like ranking strongest monsters.
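One way to wire this up without leaving BigQuery is a remote model over the same connection; a sketch, not the lab's exact code (the model name gemini_extractor, the endpoint string, and the prompt are all assumptions):

-- Register a remote Gemini model through the existing connection.
CREATE OR REPLACE MODEL `project.dataset.gemini_extractor`
REMOTE WITH CONNECTION `$PROJECT_ID.us.reports_connection`
OPTIONS (ENDPOINT = 'gemini-2.0-flash');  -- assumed endpoint; pick one available in your region

-- Ask Gemini to emit the agreed JSON schema for each raw report line.
SELECT ml_generate_text_llm_result AS extracted_json
FROM ML.GENERATE_TEXT(
  MODEL `project.dataset.gemini_extractor`,
  (
    SELECT CONCAT(
      'Extract JSON with keys monsters[{name, hp}] and battles[{date, outcome}] from: ',
      content) AS prompt
    FROM `project.dataset.text_reports`
  ),
  STRUCT(0.0 AS temperature, TRUE AS flatten_json_output)
);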
Quality criteria: Structured output must enable SQL joins (e.g., SELECT adventurer, AVG(damage) FROM battles GROUP BY adventurer). Test with sample queries.
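Spelled out, the test from the criteria above (the battles schema is the hypothetical mini-table, not lab-confirmed):

-- Rank adventurers by average damage across recorded battles.
SELECT adventurer, AVG(damage) AS avg_damage
FROM `project.dataset.battles`
GROUP BY adventurer
ORDER BY avg_damage DESC;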
Quote: "How can you take text files that are uploaded to GCS, use something like Gemini to convert those text files to extracted JSONs where you extract all the relevant information from those text files, and you sort them and store organize them into JSONs?"
Build Towards RAG Agent Knowledge Base Without Data Duplication
This ETL feeds a Scholar-class agent in Agentverse (gamified lab): Past battle insights inform real-time fights. Day 2 extends to Cloud SQL RAG indexing, Dataflow pipelines, Cloud Run deployment.
Workflow fit: Prep phase for production AI pipelines. Assumes GCP basics; teaches data eng for AI builders. Practice: Follow in Cloud Shell, query external table, extend to Gemini extraction.
Trade-offs: External tables great for analytics speed/cost, but slower than native tables for heavy transforms (use Dataflow for scale). Single service account simplifies demos, risks over-privileging.
Quote: "What BigQuery external tables allows you to do is you can leave your data in one place such as GCS, but create essentially a pointer or symbolic relationship with that particular data in GCS. So, you can query it... without you having to move or copy the files."
Exercise: After setup, query text_reports for patterns (e.g., SELECT COUNT(*) FROM text_reports WHERE LOWER(content) LIKE '%dragon%'; full version below). Then script Gemini prompts for JSON extraction: Define schemas upfront (monsters: {name, level, weaknesses}); enforce with structured outputs.
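Both exercise steps in full; the extracted_reports table holding Gemini's JSON output is a hypothetical name:

-- Step 1: scan the raw lines for a pattern.
SELECT COUNT(*) AS dragon_mentions
FROM `project.dataset.text_reports`
WHERE LOWER(content) LIKE '%dragon%';

-- Step 2: pull structured fields back out of the extracted JSON.
SELECT JSON_VALUE(extracted_json, '$.monsters[0].name') AS monster,
       JSON_QUERY(extracted_json, '$.monsters[0].weaknesses') AS weaknesses
FROM `project.dataset.extracted_reports`;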
Quote: "In a real life situation, it can be for example you're researcher, PhD researcher, you want to research multiple article in PDF form... you need to figure out a way so that it can be a structured in a structured way so that AI or our computer can able to analyzing."
Key Takeaways
- Use personal Gmail + free credits for unrestricted labs; verify in Console Credits.
- Cloud Shell persists code, ideal for iterative AI data pipelines; check auth with gcloud auth list.
- ./init.sh automates project/credits/API setup; always confirm the active project ID.
- BigQuery external tables + connections query GCS text without copying and scale to petabytes cheaply.
- ETL flow: GCS raw → External table → Gemini JSON extract → Mini-tables (e.g., monsters, battles) for SQL.
- Pre-deploy non-core services (e.g., dungeon Cloud Run) via Cloud Build to focus on core logic.
- Production: Split IAM roles per service; read Data Engineer notes for real-world analogies (e.g., PDF research).
- Test transforms with SQL previews; define JSON schemas explicitly for reliable structuring.
- Fits AI agent KBs: Structured history enables RAG queries like "Best strategy vs. dragon?"