Chat Application
Introduction
Design a chat application like WhatsApp or Slack: 1:1 and group messaging, online presence, delivery receipts, and message history. The real-time high-level design (HLD) centers on WebSockets, per-conversation message ordering, fan-out for groups, and mobile push for offline users.
Assume millions of concurrent connections, billions of stored messages, at-least-once delivery with client-side deduplication, and end-to-end encryption as an optional advanced topic.
Understanding the topic
Key concepts
- WebSocket gateway: scale horizontally using connection stickiness or a shared pub/sub layer.
- Message store: Cassandra, partitioned by conversation_id and ordered by a time-based UUID.
- Delivery flow: sent → delivered → read, with receipts sent as separate lightweight events.
- Presence: client heartbeats refresh a Redis entry with a TTL to maintain the online set; also store a last-seen timestamp.
- Group chat: fan-out on write for small groups; fan-out on read for large channels.
- Push notifications (APNs/FCM) when the recipient is offline.
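The presence concept above can be sketched in a few lines. This is a minimal in-memory stand-in for the Redis TTL approach, not a Redis client: the class name `PresenceTracker` and its methods are hypothetical, and timestamps are passed in explicitly so the expiry logic is easy to follow.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory stand-in for a Redis entry with a TTL. */
final class PresenceTracker {
    private final long ttlMillis;
    // userId -> time of the most recent heartbeat (doubles as "last seen")
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    PresenceTracker(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    /** Called on each client heartbeat; with Redis this would refresh the expiry. */
    void heartbeat(String userId, long nowMillis) {
        lastHeartbeat.put(userId, nowMillis);
    }

    /** A user is online if a heartbeat arrived within the TTL window. */
    boolean isOnline(String userId, long nowMillis) {
        Long seen = lastHeartbeat.get(userId);
        return seen != null && nowMillis - seen < ttlMillis;
    }

    /** Last-seen timestamp, or -1 if the user has never sent a heartbeat. */
    long lastSeen(String userId) {
        return lastHeartbeat.getOrDefault(userId, -1L);
    }
}
```

With a 30-second TTL, a user who last sent a heartbeat at t = 1,000 ms is still online at t = 20,000 ms and offline at t = 31,000 ms, while the last-seen value stays at 1,000.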
Internal architecture
Architecture overview
flowchart TB
    Client --> WS[WebSocket Gateway]
    WS --> ChatSvc
    ChatSvc --> Kafka
    Kafka --> Presence
    ChatSvc --> Cassandra
Step-by-step explanation
- Client connects over WSS to the WebSocket Gateway cluster, which forwards to the Chat Service.
- Send message: persist to Cassandra, then publish to a Kafka topic keyed by conversationId.
- Online recipients: the gateway holding the recipient's connection receives the event (via a per-user Redis pub/sub channel or a Kafka consumer) and pushes a WebSocket frame.
- Offline recipients: the Notification service sends an FCM push, subject to the message preview policy.
- Media messages: upload to S3 via a presigned URL; the message body stores only the URL reference.
- History sync: paginated GET /conversations/{id}/messages?before=cursor.
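The history sync step above relies on cursor pagination. A minimal sketch, assuming message ids increase over time (as time-based UUIDs do) and using plain `long` ids and an in-memory `TreeMap` in place of Cassandra; `ConversationHistory` and `pageBefore` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical in-memory history for one conversation, keyed by increasing message id. */
final class ConversationHistory {
    private final NavigableMap<Long, String> messages = new TreeMap<>();

    void append(long messageId, String body) {
        messages.put(messageId, body);
    }

    /**
     * Returns up to `limit` message ids strictly older than `beforeCursor`, newest first.
     * Pass Long.MAX_VALUE for the first page; the next cursor is the last id returned.
     */
    List<Long> pageBefore(long beforeCursor, int limit) {
        List<Long> page = new ArrayList<>();
        for (Long id : messages.headMap(beforeCursor, false).descendingKeySet()) {
            if (page.size() == limit) {
                break;
            }
            page.add(id);
        }
        return page;
    }
}
```

Because the cursor is a message id rather than an offset, new messages arriving during a scroll do not shift earlier pages.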
Informative example
Message send API that persists the message and publishes it to Kafka for the WebSocket delivery workers:
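    // MessageController.java
    import org.springframework.security.core.annotation.AuthenticationPrincipal;
    import org.springframework.security.oauth2.jwt.Jwt;
    import org.springframework.web.bind.annotation.*;

    @RestController
    @RequestMapping("/api/v1/conversations/{cid}/messages")
    public class MessageController {
        private final MessageService messages;

        public MessageController(MessageService messages) { this.messages = messages; }

        @PostMapping
        public MessageDto send(@PathVariable String cid, @RequestBody SendMessageRequest req,
                               @AuthenticationPrincipal Jwt jwt) {
            // The sender comes from the authenticated token, not the request body.
            return messages.send(cid, jwt.getSubject(), req.body());
        }
    }

    // MessageService.java
    import org.springframework.kafka.core.KafkaTemplate;
    import org.springframework.stereotype.Service;

    @Service
    public class MessageService {
        private final MessageRepository cassandra;
        private final KafkaTemplate<String, ChatMessageEvent> kafka;

        public MessageService(MessageRepository cassandra,
                              KafkaTemplate<String, ChatMessageEvent> kafka) {
            this.cassandra = cassandra;
            this.kafka = kafka;
        }

        public MessageDto send(String conversationId, String senderId, String body) {
            // Persist first, then publish keyed by conversationId.
            Message msg = cassandra.save(Message.create(conversationId, senderId, body));
            kafka.send("chat.messages", conversationId,
                    new ChatMessageEvent(msg.id(), conversationId, senderId, body, msg.sentAt()));
            return MessageDto.from(msg);
        }
    }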
Partition Kafka by conversationId to preserve per-conversation ordering. The WebSocket gateway scales with connection count, so deploy and scale it separately from the REST API.
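To see why keying by conversationId gives per-conversation ordering, here is an illustrative key-to-partition mapping. It is a sketch of the idea only: Kafka's default partitioner hashes the serialized key with murmur2 rather than `String.hashCode()`, and `ConversationPartitioner` is a hypothetical name.

```java
/** Illustrative key-to-partition mapping; not Kafka's actual hashing. */
final class ConversationPartitioner {
    private final int partitionCount;

    ConversationPartitioner(int partitionCount) {
        if (partitionCount <= 0) {
            throw new IllegalArgumentException("partitionCount must be positive");
        }
        this.partitionCount = partitionCount;
    }

    /** The same conversationId always maps to the same partition. */
    int partitionFor(String conversationId) {
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(conversationId.hashCode(), partitionCount);
    }
}
```

Since every event for a conversation lands on one partition and a partition is consumed in order, delivery workers see that conversation's messages in the order they were published. Ordering across different conversations is not guaranteed, which chat does not need.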
Real-world use
Real-world use cases
- Enterprise team collaboration in the style of Slack.
- Secure telehealth messaging with HIPAA audit trails.
- In-app buyer-seller chat for e-commerce.
- Low-latency guild chat in games.
Best practices
- Make sends idempotent with a client-generated message UUID.
- Paginate history; never load a full conversation.
- Apply backpressure at the gateway when a client is a slow consumer.
- Use TLS everywhere; end-to-end encryption (E2EE) is an optional product decision.
- Run the moderation pipeline for abuse reports asynchronously.
- Load test the connection count per gateway pod.
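The first practice, idempotent sends, can be sketched as follows. This is an in-memory illustration under the assumption that the server remembers each client-generated UUID; `IdempotentSender` is a hypothetical name, and a real service would persist the id alongside the message.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Hypothetical server-side dedup keyed by the client-generated message UUID. */
final class IdempotentSender {
    // clientMessageId -> server-assigned sequence number of the stored message
    private final Map<UUID, Long> stored = new HashMap<>();
    private long nextSequence = 1;

    /** Stores the message once; a retry with the same id returns the original sequence. */
    synchronized long send(UUID clientMessageId, String body) {
        Long existing = stored.get(clientMessageId);
        if (existing != null) {
            return existing; // duplicate caused by a client retry
        }
        long sequence = nextSequence++;
        stored.put(clientMessageId, sequence); // a real store would also persist `body`
        return sequence;
    }

    synchronized int storedCount() {
        return stored.size();
    }
}
```

A client that times out and resends with the same UUID gets the same result back, so at-least-once delivery does not produce duplicate messages in the conversation.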
Common mistakes
- Polling over HTTP instead of using WebSockets: hurts battery life and latency.
- Fan-out on write to a 10k-member channel: causes write amplification.
- No ordering guarantee per conversation.
- Storing large media in the message row: bloats the store.
- A single gateway as a single point of failure (SPOF), with no plan for horizontal scaling.
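The write-amplification mistake is avoided by choosing the fan-out strategy from the group size. A minimal sketch, assuming a configurable member threshold; `FanOutRouter`, its in-memory maps, and the threshold value are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical router that picks a fan-out strategy from the member count. */
final class FanOutRouter {
    enum Strategy { FAN_OUT_ON_WRITE, FAN_OUT_ON_READ }

    private final int writeThreshold;
    private final Map<String, List<String>> inboxes = new HashMap<>();     // userId -> message ids
    private final Map<String, List<String>> channelLogs = new HashMap<>(); // channelId -> message ids

    FanOutRouter(int writeThreshold) {
        this.writeThreshold = writeThreshold;
    }

    Strategy strategyFor(int memberCount) {
        return memberCount <= writeThreshold ? Strategy.FAN_OUT_ON_WRITE : Strategy.FAN_OUT_ON_READ;
    }

    /** Small groups: one inbox write per member. Large channels: one write to a shared log. */
    Strategy publish(String channelId, List<String> members, String messageId) {
        Strategy strategy = strategyFor(members.size());
        if (strategy == Strategy.FAN_OUT_ON_WRITE) {
            for (String member : members) {
                inboxes.computeIfAbsent(member, k -> new ArrayList<>()).add(messageId);
            }
        } else {
            channelLogs.computeIfAbsent(channelId, k -> new ArrayList<>()).add(messageId);
        }
        return strategy;
    }

    int inboxSize(String userId) {
        return inboxes.getOrDefault(userId, List.of()).size();
    }

    int channelLogSize(String channelId) {
        return channelLogs.getOrDefault(channelId, List.of()).size();
    }
}
```

With fan-out on read, a message to a 10k-member channel costs one write, and each reader pulls from the shared log when they open the channel.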
Advanced interview questions
Q1 (Beginner): WebSocket vs long polling for chat?
Q2 (Beginner): How would you store chat messages?
Q3 (Intermediate): Fan-out on write vs fan-out on read for groups?
Q4 (Intermediate): How would you implement online presence?
Q5 (Advanced): Design a WhatsApp-scale system for 2B users.
Summary
A chat HLD combines WebSockets, a durable message store, and push notifications. Partition messages by conversation for ordering and efficient queries. The fan-out strategy depends on group size. Presence is tracked with Redis TTL heartbeats. Kafka bridges persistence to the live delivery workers.