Kafka Basics
Introduction
Apache Kafka is a distributed commit log for high-throughput event streaming. Producers append records to topics, which are split into partitions for parallelism; consumer groups coordinate partition assignment so that processing can scale out. Unlike traditional queues, Kafka retains messages for a configurable period, which enables replay, auditing, and multiple independent consumers.
Kafka appears in high-level design (HLD) for activity feeds, CDC pipelines, metrics ingestion, and microservice choreography. For interviews, know topics, partitions, offsets, consumer groups, and replication factor.
This lesson covers Kafka topology, ordering guarantees, and Spring Kafka integration patterns.
Understanding the topic
Key concepts
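- Topic: a named log split into partitions for parallel writes and reads.
- Partition key: records with the same key go to the same partition, which preserves per-key ordering.
- Offset: monotonically increasing position within a partition; consumers commit offsets after processing.
- Consumer group: each partition is assigned to at most one consumer in the group, so consumers beyond the partition count sit idle.
- Replication factor (e.g., RF=3): one leader plus followers per partition; the in-sync replica set (ISR) tracks the replicas that are caught up with the leader.
- Retention: by time (e.g., 7 days) or by size; Kafka is not a task queue that deletes on read.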
flowchart LR
    Producer -->|produce| Topic
    Topic --> C1[Consumer Group A]
    Topic --> C2[Consumer Group B]
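The partition-key rule can be made concrete with a small sketch. This is not Kafka's partitioner: the class and method names are hypothetical, and `Arrays.hashCode` stands in for the murmur2 hash that Kafka's default partitioner applies to the key bytes. What it shares with the real thing is the shape of the computation: hash the key, force the result non-negative, and take it modulo the partition count.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Illustrative key-to-partition mapping (hypothetical class, not Kafka's partitioner). */
final class KeyPartitionSketch {

    // Maps a record key to a partition index in [0, numPartitions).
    // Assumption: Arrays.hashCode is a stand-in for the murmur2 hash Kafka uses.
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        int hash = Arrays.hashCode(keyBytes);
        // Clear the sign bit so the modulo result is never negative.
        return (hash & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int partitions = 6;
        // "order-1" appears twice and lands on the same partition both times.
        for (String key : new String[] {"order-1", "order-2", "order-1"}) {
            System.out.println(key + " -> partition " + partitionFor(key, partitions));
        }
    }
}
```

Because the result depends on `numPartitions`, the same key can map to a different partition once the partition count changes, which is worth keeping in mind when per-key ordering matters.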
Internal architecture
Architecture overview
Step-by-step explanation
- Producers enable idempotence and set acks=all for durability.
- Broker cluster of 3+ nodes with RF=3 and min.insync.replicas=2.
- Consumers in group 'notification' and a separate group 'analytics' read the same topic independently.
- Schema Registry (Avro) for compatible schema evolution.
- Kafka Connect for CDC from PostgreSQL to downstream systems.
- Monitor consumer lag, under-replicated partitions, and disk usage.
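The interaction between acks=all, RF=3, and min.insync.replicas=2 is easy to get wrong, so here is a minimal sketch of the rule. The class and method names are hypothetical and nothing here calls Kafka; it only encodes the condition that, with acks=all, a write is accepted while the in-sync replica count is at least min.insync.replicas.

```java
/** Illustrative durability arithmetic for acks=all (hypothetical helper, not a Kafka API). */
final class InSyncReplicaSketch {

    // With acks=all, the partition leader accepts a write only while the
    // ISR size (leader included) is at least min.insync.replicas.
    static boolean acceptsAcksAllWrite(int inSyncReplicas, int minInsyncReplicas) {
        return inSyncReplicas >= minInsyncReplicas;
    }

    // How many replicas can fall out of the ISR before acks=all writes are rejected.
    static int tolerableReplicaLosses(int replicationFactor, int minInsyncReplicas) {
        return Math.max(0, replicationFactor - minInsyncReplicas);
    }

    public static void main(String[] args) {
        int rf = 3;
        int minIsr = 2;
        System.out.println("tolerable losses: " + tolerableReplicaLosses(rf, minIsr));
        System.out.println("write with ISR=2: " + acceptsAcksAllWrite(2, minIsr));
        System.out.println("write with ISR=1: " + acceptsAcksAllWrite(1, minIsr));
    }
}
```

With RF=3 and min.insync.replicas=2, one replica can be down while producers keep writing; if a second drops out, acks=all writes fail rather than being acknowledged with a single copy.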
Informative example
Spring Kafka producer with idempotence and consumer with manual offset commit:
spring:
  kafka:
    bootstrap-servers: kafka:9092
    producer:
      acks: all
      key-serializer: org.apache.kafka.common.serialization.StringSerializer
      value-serializer: org.springframework.kafka.support.serializer.JsonSerializer
      properties:
        # Kafka client setting passed through Spring Boot's generic properties map
        enable.idempotence: true
    consumer:
      group-id: order-processor
      auto-offset-reset: earliest
      enable-auto-commit: false
      key-deserializer: org.apache.kafka.common.serialization.StringDeserializer
      value-deserializer: org.springframework.kafka.support.serializer.JsonDeserializer
    listener:
      ack-mode: manual
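Partition by orderId so that all lifecycle events for an order stay in sequence. To scale out, add partitions before adding consumers, since consumers beyond the partition count sit idle. Note that adding partitions changes the key-to-partition mapping for new records.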
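To see why the configuration disables auto-commit and acknowledges manually, consider a toy simulation of one partition and one consumer that crashes while processing a record. This is not Kafka client code: the class, the method, and the crash model are all hypothetical. It relies on only one real convention, namely that the committed offset is the position of the next record to read after a restart.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy simulation of commit timing (hypothetical; does not use the Kafka client). */
final class OffsetCommitSketch {

    // Consumes offsets 0..recordCount-1, crashing while processing crashAtOffset,
    // then restarts from the committed offset. Returns the offsets whose
    // processing actually completed, in order.
    static List<Long> run(int recordCount, long crashAtOffset, boolean commitBeforeProcessing) {
        List<Long> processed = new ArrayList<>();
        long committed = 0; // next offset to read after a restart
        boolean crashed = false;

        for (long offset = 0; offset < recordCount; offset++) {
            if (commitBeforeProcessing) {
                committed = offset + 1; // offset is marked done before any work happens
            }
            if (offset == crashAtOffset) {
                crashed = true; // crash mid-processing: this record's work is not done
                break;
            }
            processed.add(offset);
            if (!commitBeforeProcessing) {
                committed = offset + 1; // commit only after the work succeeded
            }
        }

        if (crashed) {
            // The restarted consumer resumes from the last committed position.
            for (long offset = committed; offset < recordCount; offset++) {
                processed.add(offset);
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println("commit before processing: " + run(5, 2, true));
        System.out.println("commit after processing:  " + run(5, 2, false));
    }
}
```

Committing first skips the record that was in flight during the crash, while committing after processing re-reads it. The second mode is at-least-once delivery, so a record can be seen twice if the crash lands between processing and commit; handlers should therefore be idempotent.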
Real-world use
Real-world use cases
- Uber trip events: driver location, status updates to multiple services.
- Banking transaction audit log with 7-year retention compliance.
- E-commerce CDC: product changes to search and cache invalidation.
- Netflix-style viewing events to recommendation pipeline.
Best practices
- Size the partition count for target throughput up front; changing it later is costly.
- Use idempotent producers and the transactional outbox pattern for effectively-once flows.
- Route poison messages to a dead letter topic after retries are exhausted.
- Don't over-partition tiny topics; every partition adds broker overhead.
- Secure with SASL/SSL in production.
- Capacity-plan disk as retention × ingest rate, multiplied by the replication factor.
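The sizing bullets above reduce to two pieces of arithmetic, sketched below. The class and method names are hypothetical, and every input figure in `main` (event size, per-partition throughput) is an illustrative assumption rather than a benchmark; real per-partition throughput has to be measured on your own hardware and message shape.

```java
/** Back-of-the-envelope Kafka sizing (hypothetical helper; all inputs are assumptions). */
final class CapacitySketch {

    // Partitions needed = ceil(target throughput / measured per-partition throughput).
    static int partitionsFor(long targetEventsPerSec, long perPartitionEventsPerSec) {
        return (int) ((targetEventsPerSec + perPartitionEventsPerSec - 1) / perPartitionEventsPerSec);
    }

    // Raw disk = ingest rate (bytes/sec) × retention (sec) × replication factor.
    // multiplyExact throws instead of silently overflowing on extreme inputs.
    static long diskBytes(long eventsPerSec, long avgEventBytes, long retentionSeconds, int replicationFactor) {
        long bytesPerSec = Math.multiplyExact(eventsPerSec, avgEventBytes);
        long retained = Math.multiplyExact(bytesPerSec, retentionSeconds);
        return Math.multiplyExact(retained, (long) replicationFactor);
    }

    public static void main(String[] args) {
        long sevenDays = 7L * 24 * 60 * 60;
        // Assumed figures: 1M events/sec, 1,000-byte events, 50k events/sec per partition.
        System.out.println("partitions: " + partitionsFor(1_000_000L, 50_000L));
        System.out.println("disk bytes: " + diskBytes(1_000_000L, 1_000L, sevenDays, 3));
    }
}
```

Under these assumed inputs the disk figure comes to roughly 1.8 PB before compression, which shows how quickly retention and replication multiply the raw ingest rate.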
Common mistakes
- Consumer count > partitions: the extra consumers sit idle.
- No key on messages that need ordering: records spread across partitions and the sequence breaks.
- Auto-commit before processing: messages are lost on crash.
- Single broker with RF=1 in production: data loss on disk failure.
- Using Kafka for request-response: wrong latency profile.
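The first mistake is easy to demonstrate with a simplified assignment sketch. The round-robin loop below is an assumption made for illustration and is not the logic of Kafka's built-in assignors; the class and method names are hypothetical. It preserves the one property that matters here: each partition goes to exactly one consumer in the group.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified round-robin partition assignment (hypothetical; not a Kafka assignor). */
final class GroupAssignmentSketch {

    // Deals partitions 0..numPartitions-1 out to the consumers in turn.
    // Consumers that receive nothing keep an empty list, i.e. they are idle.
    static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<>());
        }
        if (consumers.isEmpty()) {
            return assignment;
        }
        for (int partition = 0; partition < numPartitions; partition++) {
            String owner = consumers.get(partition % consumers.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Four consumers, three partitions: the fourth consumer gets nothing.
        System.out.println(assign(List.of("c1", "c2", "c3", "c4"), 3));
        // Two consumers, three partitions: one consumer owns two partitions.
        System.out.println(assign(List.of("c1", "c2"), 3));
    }
}
```

Adding a fourth consumer to a three-partition topic adds no throughput; it only provides a standby that takes over if another consumer leaves the group.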
Advanced interview questions
Q1 (Beginner): What is a Kafka topic?
Q2 (Beginner): What is a consumer group?
Q3 (Intermediate): How do you preserve ordering in Kafka?
Q4 (Intermediate): How does Kafka compare with RabbitMQ?
Q5 (Advanced): How would you size Kafka for 1M events/sec?
Summary
Kafka is a durable distributed log for high-throughput events. Partitions enable parallelism; keys preserve per-entity order. Consumer groups scale processing up to the partition count. Retention enables replay and multiple independent subscribers. Idempotent producers and manual offset commits improve reliability. Event streaming patterns build on this by pairing Kafka with stream processors.