Flink Treats Batch as Streaming for Unified Low-Latency Processing
Apache Flink processes unbounded streams and bounded batches with a single engine built on operators, managed state, windows, and exactly-once guarantees, eliminating the dual codebases that real-time applications such as recommendation engines (handling millions of events) otherwise require.
Unify Batch and Streaming to Eliminate Latency and Dual Systems
Real-world data like user clicks, views, and purchases arrives as continuous unbounded streams, but traditional batch processing dumps events into hourly files, introducing up to 60 minutes of latency. That delay is unacceptable for recommendation engines, where recent user behavior (e.g., a search for hiking gear) must immediately influence suggestions: tents, not laptops. Streaming systems like Storm or Kinesis process events in milliseconds but require codebases separate from batch jobs (e.g., Hadoop/MapReduce), leading to sync issues, duplicated logic, and reconciliation bugs.
Flink resolves this by treating bounded datasets as finite streams that have ended: a 5-year historical dataset is a stream started years ago and stopped today. Point the same Flink job at recent Kafka events for real-time recommendations or historical data for nightly retraining. This shares operators, clusters, and code, avoiding Lambda Architecture's two-system pain. Alibaba processes hundreds of billions of events daily across tens of thousands of machines; Netflix uses it for near-real-time anomaly detection; Uber built its analytical platform on it.
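The "bounded data is just a stream that ended" idea can be sketched in plain Python (this is an illustration of the concept, not Flink's API): one transformation function works unchanged whether its input is a finite historical batch or a live, never-ending feed.

```python
from typing import Iterable, Iterator

def enrich_and_filter(events: Iterable[dict]) -> Iterator[dict]:
    """One transformation, oblivious to whether its input ever ends."""
    for e in events:
        if e.get("user") != "bot":
            yield {**e, "score": len(e["query"])}

# Bounded input: a historical batch (a stream that has already ended).
batch = [{"user": "u1", "query": "tent"}, {"user": "bot", "query": "x"}]
print(list(enrich_and_filter(batch)))  # bot event filtered out

# Unbounded input: a live generator; same function, consumed incrementally.
def live_feed():
    while True:
        yield {"user": "u2", "query": "hiking boots"}

stream = enrich_and_filter(live_feed())
print(next(stream)["score"])  # 12
```

In Flink the same sharing happens at the job level: the identical pipeline definition is pointed at a Kafka topic for real-time serving or at historical files for nightly retraining.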
Build Stateful Pipelines with Operators, State, and Windows
Flink jobs form a dataflow DAG of sources (e.g., Kafka reads), operators (transformations like filtering bots or enriching metadata), and sinks (e.g., Redis writes). Every operator runs in parallel across cluster machines: set parallelism to 4 for a filter, and 4 subtasks process stream portions simultaneously, scaling to billions of events/day.
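A minimal sketch of what "parallelism 4" means for a filter (a single-process simulation; Flink would run the four subtasks concurrently on different task slots, and the field names here are illustrative):

```python
# Hash-partition events across 4 "subtasks", then filter each partition.
PARALLELISM = 4

def partition(events):
    subtasks = [[] for _ in range(PARALLELISM)]
    for e in events:
        # Stable key-based routing: all events for a user hit one subtask.
        subtasks[hash(e["user_id"]) % PARALLELISM].append(e)
    return subtasks

def filter_bots(partition_events):
    return [e for e in partition_events if not e["is_bot"]]

events = [{"user_id": f"u{i}", "is_bot": i % 5 == 0} for i in range(100)]
results = [filter_bots(p) for p in partition(events)]  # concurrent in Flink
total = sum(len(r) for r in results)
print(total)  # 80 non-bot events survive, regardless of partitioning
```

The result is identical to running the filter serially; partitioning only changes where the work happens, which is what lets the same job scale to billions of events per day.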
State is first-class for carrying context across events, e.g., a per-user hash map of recent views (append each new item_id, trim entries older than 10 minutes). Flink manages state snapshots to durable storage, restoring on crashes without data loss. Windows slice infinite streams into finite chunks for aggregations: tumbling (non-overlapping, e.g., hourly), sliding (overlapping, e.g., a 30-minute window every 1 minute), or session-based. Example for recommendations:
searches = readFromKafka("search-events")
clicks = readFromKafka("click-events")
userActivity = (searches + clicks)       // union of the two streams
    .keyBy(userId)
    .window(slidingWindow(size=30min, slide=1min))
    .aggregate(activityAggregator)       // {userId, recentQueries, recentClicks}
userState = userActivity.asyncMap(callUserTowerModel)  // embedding vector
// ... merge ANN/trending candidates, rank top 100, writeTo(redis)
This computes rolling user features and embeddings, generates roughly 700 deduplicated candidates (500 from ANN search plus 200 trending), fetches their features, and ranks the top 100 within seconds per user.
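The sliding window in the pseudocode above assigns every event to multiple overlapping windows. A sketch of that assignment logic (timestamps in epoch seconds; this mirrors how sliding-window assigners work conceptually, not Flink's internal code):

```python
# size=30 min, slide=1 min: each event lands in size/slide = 30 windows.
SIZE = 30 * 60
SLIDE = 60

def assign_windows(ts: int):
    """Return [start, end) intervals of every window containing ts."""
    last_start = ts - (ts % SLIDE)           # newest window start <= ts
    return [(start, start + SIZE)
            for start in range(last_start, ts - SIZE, -SLIDE)]

windows = assign_windows(ts=3_600)  # an event exactly at the 1-hour mark
print(len(windows))   # 30 overlapping windows
print(windows[0])     # (3600, 5400): newest window containing the event
print(windows[-1])    # (1860, 3660): oldest window still covering it
```

Each of those 30 windows gets the event folded into its running aggregate, which is why a 1-minute slide yields fresh per-user features every minute while still covering the last 30 minutes of behavior.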
Exactly-Once Guarantees via Lightweight Checkpoints
Flink ensures each state update applies exactly once, even across failures: periodic checkpoints snapshot operator state using Asynchronous Barrier Snapshotting (ABS). Barriers flow through the dataflow like ordinary records; each operator snapshots its state on receiving a barrier and forwards the barrier without pausing processing. After a crash, the job rolls back to the last checkpoint and replays only the input that arrived after it, an amount bounded by the tunable checkpoint interval. This partial re-execution avoids full restarts. Batch jobs run on the same runtime but with blocking data exchange (each upstream stage finishes before the downstream one starts), confirming that no separate batch engine is needed.
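The recovery contract can be shown with a toy simulation (this is not Flink's ABS protocol, just the checkpoint-and-replay idea it guarantees): snapshot operator state together with the source offset, and on failure rewind the state and replay only the input after the last checkpoint.

```python
import copy

events = list(range(10))          # a replayable source (e.g., a Kafka log)
state = {"count": 0}
checkpoint = {"state": {"count": 0}, "offset": 0}

for offset, e in enumerate(events):
    state["count"] += 1
    if offset == 4:               # a periodic checkpoint barrier arrives
        # Snapshot state AND the matching source offset atomically.
        checkpoint = {"state": copy.deepcopy(state), "offset": offset + 1}
    if offset == 7:               # simulated crash after processing event 7
        state = copy.deepcopy(checkpoint["state"])   # roll back state
        for e2 in events[checkpoint["offset"]:]:     # replay offsets 5..9
            state["count"] += 1
        break

print(state["count"])  # 10: every event counted exactly once
```

Events 5 through 7 were processed twice in wall-clock terms, but because the state rolled back to the checkpoint first, each event contributes to the final state exactly once; the replay cost is bounded by the checkpoint interval.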