Apache Pulsar

Apache Pulsar is a cloud-native messaging system built for high-throughput, low-latency, and real-time data processing.

It supports both publish-subscribe and queuing in a single system. It separates compute from storage, so serving capacity and storage capacity scale independently, without moving existing data by hand.

If your system needs to process large volumes of data quickly and reliably, Pulsar is built for the job.

What Is Apache Pulsar?

Apache Pulsar is a distributed open-source messaging and event streaming platform. It supports real-time data processing at scale. It uses a topic-based publish-subscribe model and supports queue-based messaging, batch processing, geo-replication, and serverless processing with Pulsar Functions.

Pulsar was developed at Yahoo to power large products like Yahoo Mail and Finance. Later, it was donated to the Apache Software Foundation. Today, many organizations use it to manage real-time data pipelines.

What makes Pulsar different is its architecture. It separates storage from message delivery. Messages go to a broker, then are stored in Apache BookKeeper. This allows the system to scale each part independently.

This design also supports long-term data storage and improves fault tolerance. Pulsar can grow with your data without causing service disruption.

Key features include:

Low-latency message delivery (commonly cited as under 10 ms; actual latency depends on hardware and configuration)
Topic-based publish-subscribe
Queue-based work distribution
Geo-replication across data centers
Support for multi-tenant use
Pulsar Functions for serverless compute
Tiered storage support (e.g. S3, GCS)
Multi-language clients (Java, Python, Go, C++, JavaScript)

Pulsar is cloud-native and supports Kubernetes deployments. Brokers are stateless, and load is balanced across the brokers in a cluster automatically. This makes Pulsar suitable for both on-prem and cloud environments.

How Apache Pulsar Works

Pulsar uses a layered design. It separates compute, storage, and coordination into three main parts:

Pulsar Brokers

Brokers handle communication with clients. They receive messages from producers, send them to BookKeeper for storage, and later deliver them to consumers.

Brokers do not store messages durably. This stateless setup makes it easy to replace or scale brokers. Pulsar's load balancer detects overloaded brokers and moves groups of topics (bundles) to less loaded brokers to balance traffic.

Apache BookKeeper

BookKeeper stores the actual messages. Each topic's log is split into segments called ledgers, and the entries in each ledger are replicated across several storage nodes called bookies. The number of copies is configurable.

If one bookie fails, the data remains available through the replicas. Bookies can be added without reshuffling existing data. This avoids downtime and allows continuous scaling.
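
How replicas spread across bookies can be illustrated with a small model. The sketch below is not BookKeeper code. It assumes a round-robin placement in which each entry is written to a fixed number of consecutive bookies in the group; the function names and bookie names are hypothetical.

```python
def place_entry(entry_id, ensemble, write_quorum):
    """Return the bookies that hold a copy of the entry (round-robin striping)."""
    size = len(ensemble)
    return [ensemble[(entry_id + i) % size] for i in range(write_quorum)]


def readable_after_failure(entry_id, ensemble, write_quorum, failed):
    """An entry stays readable while at least one replica is on a healthy bookie."""
    replicas = place_entry(entry_id, ensemble, write_quorum)
    return any(bookie not in failed for bookie in replicas)


ensemble = ["bookie-1", "bookie-2", "bookie-3"]
placements = {e: place_entry(e, ensemble, write_quorum=2) for e in range(4)}
for entry_id, bookies in placements.items():
    print(entry_id, bookies)

# With two copies per entry, losing any single bookie leaves every entry readable.
all_readable = all(
    readable_after_failure(e, ensemble, 2, failed={"bookie-2"}) for e in range(4)
)
print(all_readable)
```

Because placement is decided per entry, a newly added bookie can start taking new writes without any existing entry having to move.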

Apache ZooKeeper

ZooKeeper manages cluster metadata and coordinates actions between nodes. It keeps track of broker assignments and helps manage failover.

You usually do not interact with ZooKeeper directly, but it is critical for Pulsar’s internal operations. Newer Pulsar releases can also use other metadata stores, such as etcd, in place of ZooKeeper.

The Pulsar Client

Pulsar offers client libraries in several languages, including Java, Go, Python, C++, and JavaScript.

A typical message flow looks like this:

A producer sends a message to a topic
The broker receives the message and stores it in BookKeeper
Consumers subscribe to the topic and read messages
Consumers acknowledge messages individually or cumulatively (everything up to a given message)
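
The flow above can be sketched with an in-memory stand-in for a topic. This is a toy model, not the Pulsar client API: the `ToyTopic` class and its methods are hypothetical, and a plain list stands in for BookKeeper storage. It shows one acknowledgment of a single message and one acknowledgment that covers everything up to a given message.

```python
from collections import deque


class ToyTopic:
    """In-memory stand-in for a topic with one subscription."""

    def __init__(self):
        self.log = []            # stands in for durable storage
        self.pending = deque()   # message ids not yet delivered
        self.unacked = set()     # delivered but not yet acknowledged

    def send(self, payload):
        """Producer side: append the message and return its id."""
        msg_id = len(self.log)
        self.log.append(payload)
        self.pending.append(msg_id)
        return msg_id

    def receive(self):
        """Consumer side: return the next (id, payload) and track it as unacked."""
        msg_id = self.pending.popleft()
        self.unacked.add(msg_id)
        return msg_id, self.log[msg_id]

    def acknowledge(self, msg_id):
        """Acknowledge a single message."""
        self.unacked.discard(msg_id)

    def acknowledge_cumulative(self, msg_id):
        """Acknowledge every delivered message up to and including msg_id."""
        self.unacked = {m for m in self.unacked if m > msg_id}


topic = ToyTopic()
for text in ["a", "b", "c"]:
    topic.send(text)

first_id, first_payload = topic.receive()
topic.acknowledge(first_id)             # one message
topic.receive()
last_id, _ = topic.receive()
topic.acknowledge_cumulative(last_id)   # everything up to last_id
print(first_payload, topic.unacked)
```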

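Pulsar supports several subscription types:

Shared: multiple consumers receive different messages
Exclusive: only one consumer can attach to the subscription at a time
Failover: one primary consumer, with backups ready to take over
Key_Shared: multiple consumers, with all messages for the same key going to the same consumer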
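
The difference between the exclusive, failover, and shared types is easiest to see in who receives which message. The sketch below is a simplified simulation, not broker code; the `dispatch` function is hypothetical, and round-robin is only one possible way to spread messages in shared mode.

```python
def dispatch(messages, consumers, mode):
    """Return {consumer: [messages]} for a toy version of each subscription type."""
    delivered = {c: [] for c in consumers}
    if mode == "exclusive":
        if len(consumers) != 1:
            raise ValueError("an exclusive subscription allows a single consumer")
        delivered[consumers[0]] = list(messages)
    elif mode == "failover":
        # The first consumer is active; the others wait as standbys.
        delivered[consumers[0]] = list(messages)
    elif mode == "shared":
        # Each message goes to exactly one consumer.
        for i, msg in enumerate(messages):
            delivered[consumers[i % len(consumers)]].append(msg)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return delivered


msgs = ["m0", "m1", "m2", "m3"]
shared = dispatch(msgs, ["c1", "c2"], "shared")
failover = dispatch(msgs, ["c1", "c2"], "failover")
print(shared)
print(failover)
```

Shared mode gives queue-style work distribution, while exclusive and failover keep a single reader and so preserve message order.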

Pulsar Functions

Pulsar Functions are small pieces of code that run inside the Pulsar system. They let you process messages without setting up an external processing tool.

You can use them for:

Filtering messages
Transforming data
Routing messages to other topics
Enriching incoming data

They can run as multiple parallel instances and are tightly integrated into Pulsar’s architecture.
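
As a rough illustration, a function written in the language-native Python style is a plain `process(input)` function. The sketch below filters and enriches sensor readings. The threshold, the JSON field names, and the assumption that returning `None` publishes nothing are illustrative, so check them against the Pulsar Functions documentation before relying on them.

```python
import json

# Made-up cutoff for this example.
THRESHOLD = 80.0


def process(input):
    """Drop readings below THRESHOLD and tag the rest with an alert flag."""
    reading = json.loads(input)
    if reading["temperature"] < THRESHOLD:
        return None  # filtered out
    reading["alert"] = True
    return json.dumps(reading, sort_keys=True)


# Local check without a cluster: call the function directly.
hot = process('{"sensor": "s1", "temperature": 91.5}')
cold = process('{"sensor": "s2", "temperature": 20.0}')
print(hot)
print(cold)
```

Keeping the logic in a plain function means it can be unit-tested locally before it is deployed.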

Geo-Replication

Pulsar includes geo-replication features. Messages can be copied across data centers or cloud regions in real time.

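Replication is typically configured per namespace by listing the clusters that the namespace replicates to; the tenant must be allowed to use those clusters. You can configure unidirectional or bidirectional replication. If a region goes down, clients can be redirected to a healthy region. Because replication is asynchronous, messages that had not yet been replicated when the failure occurred may be unavailable until the failed region recovers.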

This supports disaster recovery and global availability.
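
Pulsar's geo-replication is asynchronous: a message is accepted in the local cluster first and copied to remote clusters afterwards. The toy model below illustrates that ordering. `ToyCluster` and the region names are hypothetical, and real replication runs continuously in the background rather than through an explicit call.

```python
from collections import deque


class ToyCluster:
    """Toy model of one region's cluster with asynchronous replication to peers."""

    def __init__(self, name):
        self.name = name
        self.messages = []
        self.peers = []
        self.outbox = deque()  # accepted locally, not yet replicated

    def publish(self, payload):
        """Accept a message locally and queue it for replication."""
        self.messages.append(payload)
        self.outbox.append(payload)

    def replicate(self):
        """Ship queued messages to every peer."""
        while self.outbox:
            payload = self.outbox.popleft()
            for peer in self.peers:
                peer.messages.append(payload)


us, eu = ToyCluster("us-east"), ToyCluster("eu-west")
us.peers.append(eu)          # one direction; add eu.peers.append(us) for two-way
us.publish("order-1")
before = list(eu.messages)   # replication has not run yet
us.replicate()
after = list(eu.messages)
print(before, after)
```

The gap between `before` and `after` is the window in which a regional outage can leave recent messages visible in only one region.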

Tiered Storage

Pulsar includes built-in support for tiered storage. Older data is moved from high-speed local storage to cloud object storage such as Amazon S3 or Google Cloud Storage.

This keeps your active storage small and fast, while still allowing long-term data access for analytics or replay.

The system handles this transfer automatically, based on policies you define.
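
One common kind of policy is a size threshold: once a topic's local data exceeds it, the oldest segments are moved out first. The sketch below models that idea only. `plan_offload`, the ledger sizes, and the rule that the newest ledger stays local are assumptions for illustration, not Pulsar's actual offload logic.

```python
def plan_offload(ledgers, threshold_bytes):
    """Pick the oldest closed ledgers to offload until local usage fits the threshold.

    `ledgers` is a list of (ledger_id, size_bytes) ordered oldest first. The last
    one is treated as still open for writes and is never offloaded.
    """
    local_bytes = sum(size for _, size in ledgers)
    to_offload = []
    for ledger_id, size in ledgers[:-1]:
        if local_bytes <= threshold_bytes:
            break
        to_offload.append(ledger_id)
        local_bytes -= size
    return to_offload, local_bytes


ledgers = [(1, 400), (2, 300), (3, 200), (4, 100)]
offloaded, remaining = plan_offload(ledgers, threshold_bytes=350)
print(offloaded, remaining)
```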

Why Apache Pulsar Scales Better

Pulsar was built to scale from the ground up. Load distribution does not depend on partition count alone. Topics in a namespace are grouped into bundles, and Pulsar can split a busy bundle and reassign bundles to other brokers on the fly.
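
A bundle can be pictured as a range of a hash space: each topic name hashes to a point, and the range containing that point decides which bundle (and so which broker) serves the topic. The sketch below is a simplified model. The hash function, the helper names, and the topic names are illustrative choices, not Pulsar's internals.

```python
import hashlib

HASH_SPACE = 2**32


def topic_hash(topic):
    """Map a topic name to a point in a fixed hash space."""
    digest = hashlib.sha256(topic.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def find_bundle(bundles, topic):
    """Return the (start, end) range that contains the topic's hash."""
    h = topic_hash(topic)
    for start, end in bundles:
        if start <= h < end:
            return (start, end)
    raise LookupError(topic)


def split_bundle(bundles, bundle):
    """Replace one bundle with two halves; other bundles are untouched."""
    start, end = bundle
    mid = (start + end) // 2
    result = []
    for b in bundles:
        result.extend([(start, mid), (mid, end)] if b == bundle else [b])
    return result


bundles = [(0, HASH_SPACE // 2), (HASH_SPACE // 2, HASH_SPACE)]
topics = [f"persistent://shop/orders/t-{i}" for i in range(8)]
hot = find_bundle(bundles, topics[0])
bundles_after = split_bundle(bundles, hot)
print(len(bundles), len(bundles_after))
```

Splitting changes only which range a topic falls into. No stored message moves, which is why the split can happen while traffic is flowing.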

Horizontal Scaling Without Downtime

You can add brokers or bookies at any time. Brokers serve clients. Bookies store data. You do not need to repartition or reassign data manually. Pulsar handles all this in the background.

Millions of Topics

Pulsar clusters can manage millions of topics. This makes it a strong option for platforms with lots of microservices, users, or data streams.

Tenants contain namespaces, and namespaces group related topics. You can set limits and access controls per namespace.
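
This hierarchy is visible in fully qualified topic names, which take the form persistent://tenant/namespace/topic (or non-persistent://...). The helper below is a small sketch that splits such a name into its parts; `parse_topic` and the example tenant and namespace are hypothetical.

```python
from typing import NamedTuple


class TopicName(NamedTuple):
    domain: str
    tenant: str
    namespace: str
    topic: str


def parse_topic(name):
    """Split 'persistent://tenant/namespace/topic' into its parts."""
    domain, sep, rest = name.partition("://")
    parts = rest.split("/", 2)
    valid_domain = domain in ("persistent", "non-persistent")
    if not sep or not valid_domain or len(parts) != 3 or not all(parts):
        raise ValueError(f"not a fully qualified topic name: {name}")
    return TopicName(domain, *parts)


parsed = parse_topic("persistent://acme/billing/invoices")
print(parsed.tenant, parsed.namespace, parsed.topic)
```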

Geo-Replication Is Built In

Some other systems rely on separate tools for cross-region data movement. Pulsar includes geo-replication out of the box, so multi-region replication needs no extra components. Client failover between regions still has to be configured.

Isolated Workloads

You can isolate workloads using namespace policies. Set bandwidth limits, quotas, and access controls per tenant. This helps you avoid performance problems caused by noisy neighbors.
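
Rate limits of this kind are often built on a token bucket. The sketch below shows the general technique with one bucket per namespace. It is not Pulsar's implementation; the namespace names and limits are made up, and time is passed in explicitly so the example is deterministic.

```python
class TokenBucket:
    """Token bucket limiter with an explicit clock."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now, cost=1):
        """Refill for the elapsed time, then spend `cost` tokens if available."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# One bucket per namespace.
limits = {"acme/billing": TokenBucket(2, 2), "acme/logs": TokenBucket(100, 100)}

# A burst of 5 publishes at t=0: the small namespace is throttled, the large one is not.
billing = [limits["acme/billing"].allow(now=0.0) for _ in range(5)]
logs = [limits["acme/logs"].allow(now=0.0) for _ in range(5)]
print(billing, logs)
```

Because each namespace has its own bucket, a burst in one namespace cannot use up another namespace's allowance.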

Long-Term Storage, Simple Setup

With tiered storage, Pulsar avoids the need for separate archive systems or ETL jobs. Old messages are moved to object storage. You can still replay them if needed.

When to Use Apache Pulsar

Pulsar works best for teams that need both scale and flexibility. Use it if you:

Need both streaming and queuing
Manage multiple products or tenants
Handle data from global regions
Run real-time analytics or alerts
Require long-term storage without complex setup
Want to avoid operational complexity when scaling

Real-World Use Cases

Multi-tenant platforms: A SaaS company supports hundreds of customers on a shared cluster. Pulsar isolates tenants and tracks resource usage without separate infrastructure.

Hybrid messaging and streaming: A risk platform uses Pulsar to send transactions to real-time analytics and also queue alerts for processing.

Global apps: A chat app syncs user activity across regions. Pulsar handles regional routing and failover.

Edge or IoT data processing: An industrial system uses Pulsar Functions to check sensor readings and raise alerts. No need for an external stream processor.

Storage-heavy analytics: A trading platform keeps six months of event history in S3 for compliance. Pulsar handles retention and replay.

High-volume data: A major retailer uses Pulsar to process logs and user events across over a million topics. Kafka could not scale without manual tuning. Pulsar works out of the box.

FAQs

What is Apache Pulsar?

It is a messaging and streaming platform for real-time data. It supports pub-sub and queuing, multi-tenancy, geo-replication, and long-term storage.

How is it different from Kafka or RabbitMQ?

Pulsar separates brokers from storage. It supports both queue and stream patterns in one system. It scales cleanly and supports features like geo-replication and tiered storage.

What message patterns are supported?

Pulsar supports publish-subscribe and queue semantics through its subscription types: exclusive, failover, shared, and key-shared.

What are Pulsar Functions?

They are serverless code units that process messages in place. You can filter, route, or transform data without extra services.

How durable is Pulsar?

A message is acknowledged to the producer only after it has been persisted in BookKeeper. Each message is stored on multiple bookies. Brokers are stateless and can be replaced without data loss.

What is tiered storage?

Old data is moved to object storage like S3. Pulsar manages this and lets you access old data without affecting performance.

Which languages does Pulsar support?

Official client libraries include Java, Python, C++, Go, and JavaScript (Node.js).

Is Pulsar cloud-friendly?

Yes. It runs well on Kubernetes and is available as a managed service from vendors such as StreamNative.

How does geo-replication work?

It copies messages across clusters in different regions. You choose which namespaces, and so which topics, are replicated. Replication is asynchronous, and client failover to another region has to be configured.

Who should use Pulsar?

Teams with high-throughput needs, global services, real-time systems, or multi-tenant apps. Pulsar is a good fit when you need flexibility and control.

Summary

Apache Pulsar is a strong choice for teams working with fast, large-scale data. It gives you a unified system for queuing and streaming, separates storage from delivery, and scales without complexity.

It includes features like geo-replication, Pulsar Functions, tiered storage, and multi-tenant support out of the box. It handles global use, heavy throughput, and long-term retention with less overhead.

If you're looking to build reliable, real-time systems, Pulsar offers the structure to do it well.
