RAG Fundamentals: Solving Hallucinations with Chunking and Embeddings

Retrieval Augmented Generation (RAG) grounds LLMs in custom data to reduce hallucinations and incorporate specialized knowledge. The process: (1) chunk unstructured documents into meaningful blocks, (2) generate embeddings (numeric vectors capturing semantics), (3) retrieve nearest neighbors via similarity search (prefer cosine distance over Euclidean so direction, not magnitude, drives the match), (4) augment prompts with retrieved context for grounded generation.

Why chunk? Embedding entire documents dilutes semantics; split into paragraphs/sentences for precise retrieval. Techniques include fixed-size, recursive (e.g., by periods), or content-aware (e.g., Document AI paragraphs). Here, recursive chunking splits on sentence-ending punctuation for simple, effective blocks (a minimal sketch follows).
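
A minimal BigQuery sketch of that split on a literal string (the sample text is illustrative):

-- Split a literal string into sentence-level chunks on ., !, or ?.
SELECT chunk
FROM UNNEST(REGEXP_EXTRACT_ALL(
  'The wyvern emits a paralyzing aura. Keep your distance! Bring remedies.',
  r'[^.!?]+[.!?]+')) AS chunk;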

Embedding models: Use text-embedding-004 (768 dimensions), or newer Gemini multimodal models for text/images/video/audio. In BigQuery, ML.GENERATE_EMBEDDING calls Vertex AI without loading models locally.

Retrieval math: Embed the query, then compute cosine similarity sim(a, b) = (a · b) / (‖a‖ ‖b‖); values closer to 1 mean a stronger semantic match. Cosine distance is 1 − similarity, so lower distance means a better match. Rank by distance ascending and keep the top-K (e.g., 3) results.

Common mistake: Euclidean distance factors in vector magnitude (which can reflect document length); cosine ignores magnitude and compares direction only, yielding pure semantic similarity. A toy comparison follows.
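
A BigQuery demonstration with toy vectors (values are illustrative): two vectors pointing in the same direction but with different magnitudes have zero cosine distance yet nonzero Euclidean distance.

-- Same direction, different magnitude: cosine distance is ~0, Euclidean distance is not.
SELECT
  COSINE_DISTANCE([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]) AS cosine_dist,
  EUCLIDEAN_DISTANCE([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]) AS euclidean_dist;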

Example query embedding (ML.GENERATE_EMBEDDING is a table function, so the prompt is passed as a one-row subquery with a content column; the embedding_model remote model is created in Step 2 below):

SELECT ml_generate_embedding_result AS query_embedding
FROM ML.GENERATE_EMBEDDING(
  MODEL `your-project.your-dataset.embedding_model`,
  (SELECT 'What are tactics against a foe that causes paralysis?' AS content));

BigQuery RAG Pipeline for OLAP Workloads

BigQuery excels at analytical processing (OLAP) on large unstructured data: ETL to embeddings, then SQL-based semantic search. Assumes prior setup (e.g., the GCS connection from the Day 1 lab).

Step 1: Recursive Chunking. Split the raw documents into one row per sentence-level chunk (the documents source table name is a placeholder):

CREATE OR REPLACE TABLE `your-project.your-dataset.chunks` AS
SELECT
  id,
  chunk AS content
FROM `your-project.your-dataset.documents`,
UNNEST(REGEXP_EXTRACT_ALL(content, r'[^.!?]+[.!?]+')) AS chunk;

Outputs one row per sentence-level chunk, named content so ML.GENERATE_EMBEDDING can consume it directly, while preserving basic context. Sanity-check the result (sketch below).
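
A quick check on chunk count and size, since oversized chunks dilute semantics (table name as above):

-- Inspect chunk count and size distribution; very long chunks dilute semantics.
SELECT
  COUNT(*) AS chunk_count,
  AVG(CHAR_LENGTH(content)) AS avg_chars,
  MAX(CHAR_LENGTH(content)) AS max_chars
FROM `your-project.your-dataset.chunks`;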

Step 2: Set Up the Vertex AI Connection. Reuse the existing GCS/Vertex AI connection, then create a remote embedding model over it:

CREATE OR REPLACE MODEL `your-project.your-dataset.embedding_model`
REMOTE WITH CONNECTION `YOUR_PROJECT.YOUR_REGION.YOUR_CONNECTION`
OPTIONS (ENDPOINT = 'text-embedding-004');

Replace the placeholders with your project, region, and connection ID; BigQuery validates them when the model is created.

Step 3: Generate Embeddings

CREATE OR REPLACE TABLE `your-project.your-dataset.embeddings` AS
SELECT
  id,
  content AS chunk,
  ml_generate_embedding_result AS embedding
FROM ML.GENERATE_EMBEDDING(
  MODEL `your-project.your-dataset.embedding_model`,
  (SELECT * FROM `your-project.your-dataset.chunks`));

BigQuery fans the API calls out in parallel; expect some latency, but this scales to massive datasets. Result: one 768-dimension vector per chunk. Verify before moving on (sketch below).
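
A quick verification that every chunk received a vector of the expected dimensionality (table name as above):

-- Confirm each chunk received a 768-dimension embedding.
SELECT
  COUNT(*) AS total_rows,
  COUNTIF(ARRAY_LENGTH(embedding) = 768) AS valid_embeddings
FROM `your-project.your-dataset.embeddings`;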

Step 4: Semantic Search. Embed the query, cross join it against the stored embeddings, and rank by cosine distance, keeping the top-K:

WITH query_embedding AS (
  SELECT ml_generate_embedding_result AS embedding
  FROM ML.GENERATE_EMBEDDING(
    MODEL `your-project.your-dataset.embedding_model`,
    (SELECT 'What are tactics against a foe that causes paralysis?' AS content))
)
SELECT
  embeddings.chunk,
  COSINE_DISTANCE(query_embedding.embedding, embeddings.embedding) AS distance
FROM query_embedding, `your-project.your-dataset.embeddings` AS embeddings
ORDER BY distance ASC
LIMIT 3;

The top result matches the query semantically (e.g., it retrieves the 'paralyzing aura' chunk). Ideal for questions plain SQL filters can't express, like semantic Q&A over docs.

Trade-off: BigQuery suits batch analytics (seconds OK); not real-time.

Quality check: Inspect the execution graph for parallelism; distances near 0 indicate strong matches (cosine distance = 1 − cosine similarity). At larger scales, see the vector-index sketch below.
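
At scale, the brute-force COSINE_DISTANCE query scans every row. BigQuery's VECTOR_SEARCH table function, paired with a vector index, approximates the same top-K much faster. A hedged sketch (index options and row-count thresholds vary; treat the parameters as assumptions and check current docs):

-- Optional at scale: build a vector index (requires a minimum table size),
-- then use VECTOR_SEARCH for approximate nearest-neighbor retrieval.
CREATE VECTOR INDEX IF NOT EXISTS chunk_index
ON `your-project.your-dataset.embeddings`(embedding)
OPTIONS (index_type = 'IVF', distance_type = 'COSINE');

SELECT base.chunk, distance
FROM VECTOR_SEARCH(
  TABLE `your-project.your-dataset.embeddings`, 'embedding',
  (SELECT ml_generate_embedding_result AS embedding
   FROM ML.GENERATE_EMBEDDING(
     MODEL `your-project.your-dataset.embedding_model`,
     (SELECT 'What are tactics against a foe that causes paralysis?' AS content))),
  top_k => 3, distance_type => 'COSINE');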

Cloud SQL RAG for Real-Time OLTP Production

Shift to Cloud SQL (PostgreSQL) for transactional workloads (OLTP): sub-second latency for customer-facing agents. Uses pgvector for vector storage/indexing.

Prerequisites: Service account with 'AI Platform User' role for Vertex AI calls.

Step 1: Instance & IAM Setup. In Cloud Shell:

gcloud sql instances create rag-agent-db --database-version=POSTGRES_15 --tier=db-g1-small --region=YOUR_REGION
gcloud projects add-iam-policy-binding YOUR_PROJECT --member="serviceAccount:RAG_SA@YOUR_PROJECT.iam.gserviceaccount.com" --role="roles/aiplatform.user"

Creates a low-latency instance and binds IAM so the database can call Gemini/Vertex AI. Note: depending on the Cloud SQL version, you may also need to enable the Vertex AI integration on the instance (e.g., the cloudsql.enable_google_ml_integration database flag); check the current docs.

Step 2: Enable Extensions in SQL Studio. Connect as the postgres user and run:

CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS google_ml_integration;

vector adds the vector type and indexes (HNSW for ANN search); google_ml_integration exposes embedding generation from SQL via the embedding() function used below. Verify both installed (sketch below).
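
A quick check that both extensions are installed (names as above):

-- Verify the extensions are installed and note their versions.
SELECT extname, extversion
FROM pg_extension
WHERE extname IN ('vector', 'google_ml_integration');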

Step 3: Create Embeddings Table

CREATE TABLE embeddings (
  id SERIAL PRIMARY KEY,
  content TEXT,
  embedding VECTOR(768)
);
CREATE INDEX ON embeddings USING hnsw (embedding vector_cosine_ops);

The HNSW index accelerates approximate nearest-neighbor search; tuning parameters are sketched below.
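
Optionally, the index above can be created with explicit HNSW parameters (pgvector-documented options; the specific values here are illustrative, not tuned recommendations):

-- m and ef_construction trade build time and memory for recall (defaults: 16, 64).
CREATE INDEX ON embeddings USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- At query time, raise ef_search for better recall at some latency cost (default: 40).
SET hnsw.ef_search = 100;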

Step 4: Ingest & Embed Data. Insert chunks and generate embeddings inline; embedding() returns real[], so cast it to vector:

INSERT INTO embeddings (content, embedding)
SELECT
  chunk,
  embedding('text-embedding-004', chunk)::vector
FROM unnest(ARRAY['chunk1', 'chunk2']::TEXT[]) AS chunk;

Real-time option: embed on insert (e.g., via a trigger, sketched below) or batch-load.
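
A hedged sketch of embed-on-insert using a trigger (assumes the embedding() function above is available; note it makes one Vertex AI call per row, so prefer batching for high write volume):

-- Trigger function: populate the embedding column whenever content is inserted.
CREATE OR REPLACE FUNCTION embed_on_insert() RETURNS trigger AS $$
BEGIN
  NEW.embedding := embedding('text-embedding-004', NEW.content)::vector;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER embeddings_embed
BEFORE INSERT ON embeddings
FOR EACH ROW
WHEN (NEW.embedding IS NULL)
EXECUTE FUNCTION embed_on_insert();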

Step 5: Production Retrieval

SELECT
  content,
  e.embedding <=> embedding('text-embedding-004', 'What are tactics against a foe that causes paralysis?')::vector AS distance
FROM embeddings AS e
ORDER BY distance
LIMIT 3;

<=> is pgvector's cosine distance operator; with the HNSW index, queries typically stay well under 100 ms.

Integrate into the agent loop: retrieve → augment the Gemini prompt → generate. Scales to production use (e.g., chatbots). For repeated queries, a prepared statement keeps the retrieval step tidy (sketch below).
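
A hedged sketch of a parameterized retrieval statement an application could reuse (the statement name is illustrative):

-- Prepare once per session; execute per user question.
PREPARE rag_retrieve(text) AS
  SELECT content
  FROM embeddings AS e
  ORDER BY e.embedding <=> embedding('text-embedding-004', $1)::vector
  LIMIT 3;

EXECUTE rag_retrieve('What are tactics against a foe that causes paralysis?');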

Common pitfalls: forgetting the IAM/service-account binding (blocks Vertex AI calls); missing vector indexes (slow sequential scans); chunks that are too large (diluted semantics).

Before/after: a raw LLM hallucinates on data it never saw (e.g., the latest Pixel); with RAG it pulls facts from the database for accurate, fresh responses.

Agent Assembly and Scaling Principles

Full agent: query → embed → retrieve top-K → stuff the chunks into the Gemini prompt (e.g., via the Vertex AI SDK; a SQL-only sketch follows). Use BigQuery for ETL/index builds; Cloud SQL serves live traffic.
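
Staying in SQL, a hedged sketch of the augment-and-generate step using BigQuery's ML.GENERATE_TEXT over a remote Gemini model (the gemini_model remote model and its endpoint are assumptions; its creation mirrors Step 2 above):

-- Assumes a remote Gemini model, e.g.:
-- CREATE OR REPLACE MODEL `your-project.your-dataset.gemini_model`
-- REMOTE WITH CONNECTION `YOUR_PROJECT.YOUR_REGION.YOUR_CONNECTION`
-- OPTIONS (ENDPOINT = 'gemini-1.5-flash');
SELECT ml_generate_text_llm_result AS answer
FROM ML.GENERATE_TEXT(
  MODEL `your-project.your-dataset.gemini_model`,
  (
    -- Stuff the top-3 retrieved chunks into a grounded prompt.
    SELECT CONCAT(
      'Answer using only this context:\n',
      STRING_AGG(chunk, '\n'),
      '\n\nQuestion: What are tactics against a foe that causes paralysis?') AS prompt
    FROM (
      SELECT embeddings.chunk
      FROM `your-project.your-dataset.embeddings` AS embeddings,
           (SELECT ml_generate_embedding_result AS embedding
            FROM ML.GENERATE_EMBEDDING(
              MODEL `your-project.your-dataset.embedding_model`,
              (SELECT 'What are tactics against a foe that causes paralysis?' AS content))) AS q
      ORDER BY COSINE_DISTANCE(q.embedding, embeddings.embedding)
      LIMIT 3)
  ),
  STRUCT(TRUE AS flatten_json_output));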

Practice: load game-lore docs (e.g., monsters) and query for tactics; the same pattern extends to legal/contract search.

Assumed level: Google Cloud basics (Qwiklabs credits) and SQL comfort. Fits after the ETL lab and before agent orchestration.

Trade-offs: BigQuery is cheap for batch (billed per TB scanned); Cloud SQL bills for the running instance but serves real time. Monitor embedding API quotas.

"Retrieval augmented generation basically uh my understanding is it's trying to solve the hallucination of AI because uh AI is not always give you the accurate result doesn't necessarily have the specialized um knowledge."

"You want to make sure that you're encoding the document in meaningful chunks so that when you do the retrieval part essentially you're retrieving um aspects of the document that most directly aligns with that particular question."

"We always recommend something like cosine distance because it's more of a matter of like the similar similarity rather than just like the magnitude."

"BigQuery is meant more for OLAP workload... whereas... cloud SQL that's meant more for real time low latency transactional workloads."

Key Takeaways

  • Chunk recursively (e.g., by sentences) before embedding to preserve semantics; avoid full-doc embeds.
  • Use ML.GENERATE_EMBEDDING in BigQuery (and embedding() in Cloud SQL) for managed Vertex AI access; no local models needed.
  • Cosine distance > Euclidean for retrieval; index with HNSW in pgvector for speed.
  • BigQuery for batch OLAP (analytics); Cloud SQL + extensions for OLTP production agents.
  • Always set up connections/service accounts first; test with top-3 similarity queries.
  • Augment prompts with retrieved chunks for grounded LLM outputs.
  • Scale: Parallel in BQ; index in SQL for <100ms latency.
  • Validate: check distances (near 0 = good match for cosine distance) and execution graphs for efficiency.