Apache Pulsar
Apache Pulsar is a cloud-native messaging system built for high-throughput, low-latency, and real-time data processing.
It supports both publish-subscribe and queuing in a single system. It separates compute from storage and scales easily without repartitioning or manual tuning.
If your system needs to process large volumes of data quickly and reliably, Pulsar is built for the job.
What Is Apache Pulsar?
Apache Pulsar is a distributed open-source messaging and event streaming platform. It supports real-time data processing at scale. It uses a topic-based publish-subscribe model and supports queue-based messaging, batch processing, geo-replication, and serverless processing with Pulsar Functions.
Pulsar was developed at Yahoo to power large products like Yahoo Mail and Finance. Later, it was donated to the Apache Software Foundation. Today, many organizations use it to manage real-time data pipelines.
What makes Pulsar different is its architecture. It separates storage from message delivery. Messages go to a broker, then are stored in Apache BookKeeper. This allows the system to scale each part independently.
This design also supports long-term data storage and improves fault tolerance. Pulsar can grow with your data without causing service disruption.
Key features include:
- Low-latency message delivery (often under 10 ms, depending on workload and configuration)
- Topic-based publish-subscribe
- Queue-based work distribution
- Geo-replication across data centers
- Support for multi-tenant use
- Pulsar Functions for serverless compute
- Tiered storage support (e.g. S3, GCS)
- Multi-language clients (Java, Python, Go, C++, JavaScript)
Pulsar is cloud-native and supports Kubernetes deployments. Brokers are stateless and can be balanced across clusters automatically. This makes Pulsar suitable for both on-prem and cloud environments.
How Apache Pulsar Works
Pulsar uses a layered design. It separates compute, storage, and coordination into three main parts:
Pulsar Brokers
Brokers handle communication with clients. They receive messages from producers, send them to BookKeeper for storage, and later deliver them to consumers.
Brokers do not store messages. This stateless setup makes it easy to replace or scale brokers. Pulsar's load balancer detects overloaded brokers and moves groups of topics (bundles) to less loaded brokers to balance traffic.
Apache BookKeeper
BookKeeper stores the actual messages. Each topic's data is written to a sequence of ledgers, which act as segments of the topic. The entries in each ledger are replicated across multiple storage nodes called bookies, and the number of copies is configurable.
If one bookie fails, the data remains available through the replicas. Bookies can be added without reshuffling existing data. This avoids downtime and allows continuous scaling.
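The replication described above can be pictured with a small sketch. This is a toy model rather than BookKeeper's own code: the names `place_entry`, `readable_after_failure`, `ensemble`, and `write_quorum` are illustrative, and the round-robin placement is a simplified view of how entries are striped across the bookies assigned to a ledger.

```python
def place_entry(entry_id, ensemble, write_quorum):
    """Return the bookies that hold a copy of the given entry (toy model)."""
    size = len(ensemble)
    if not 1 <= write_quorum <= size:
        raise ValueError("write_quorum must be between 1 and the ensemble size")
    # Start at a position derived from the entry id and take the next
    # write_quorum bookies, wrapping around the ensemble.
    return [ensemble[(entry_id + i) % size] for i in range(write_quorum)]


def readable_after_failure(entry_id, ensemble, write_quorum, failed):
    """An entry stays readable if at least one replica is on a healthy bookie."""
    replicas = place_entry(entry_id, ensemble, write_quorum)
    return any(bookie not in failed for bookie in replicas)


ensemble = ["bookie-1", "bookie-2", "bookie-3"]
placements = {e: place_entry(e, ensemble, write_quorum=2) for e in range(4)}
```

With two copies per entry, losing any single bookie in this model still leaves one readable replica of every entry.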
Apache ZooKeeper
ZooKeeper manages cluster metadata and coordinates actions between nodes. It keeps track of broker assignments and helps manage failover.
You usually do not interact with ZooKeeper directly, but it is critical for Pulsar’s internal operations.
The Pulsar Client
Pulsar offers client libraries in several languages, including Java, Go, Python, C++, and JavaScript.
A typical message flow through the client looks like this:
- A producer sends a message to a topic
- The broker receives and stores the message in BookKeeper
- Consumers subscribe to the topic and read messages
- Messages can be acknowledged one at a time or in batches
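The flow above can be sketched with an in-memory stand-in. `ToyTopic` and its methods are hypothetical names used only for illustration. They are not the Pulsar client API, and the sketch assumes a single subscription and treats "in batches" as acknowledging everything up to a given message.

```python
class ToyTopic:
    """In-memory stand-in for a topic with a single subscription."""

    def __init__(self):
        self.log = []        # stored messages (the role BookKeeper plays)
        self.acked = set()   # ids of messages the consumer has acknowledged

    def send(self, payload):
        """Producer side: store the message and return its id."""
        self.log.append(payload)
        return len(self.log) - 1

    def unacked(self):
        """Consumer side: messages that still need to be processed."""
        return [(i, m) for i, m in enumerate(self.log) if i not in self.acked]

    def acknowledge(self, msg_id):
        """Acknowledge a single message."""
        self.acked.add(msg_id)

    def acknowledge_cumulative(self, msg_id):
        """Acknowledge every message up to and including msg_id."""
        self.acked.update(range(msg_id + 1))


topic = ToyTopic()
ids = [topic.send(p) for p in ["a", "b", "c", "d"]]
topic.acknowledge(2)             # one at a time
topic.acknowledge_cumulative(1)  # everything up to message 1
remaining = topic.unacked()
```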
Pulsar supports several subscription types:
- Shared: multiple consumers receive different messages
- Exclusive: only one consumer can subscribe at a time
- Failover: one primary consumer, with backups ready to take over
- Key_Shared: multiple consumers, with all messages for a given key delivered to the same consumer
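To make the difference between the Shared, Exclusive, and Failover types concrete, here is a toy dispatcher. The function `dispatch` and its parameters are assumptions made for this sketch. Real brokers track consumers, permits, and redelivery, none of which is modeled here.

```python
def dispatch(messages, consumers, mode, failed=()):
    """Assign messages to consumer names under a simplified subscription mode."""
    assignments = {c: [] for c in consumers}
    healthy = [c for c in consumers if c not in failed]
    if not healthy:
        return assignments
    if mode == "exclusive":
        # Only one consumer may be attached to the subscription.
        if len(consumers) > 1:
            raise ValueError("exclusive subscriptions allow a single consumer")
        assignments[healthy[0]].extend(messages)
    elif mode == "failover":
        # The first healthy consumer is active; the rest are standbys.
        assignments[healthy[0]].extend(messages)
    elif mode == "shared":
        # Messages are spread round-robin across healthy consumers.
        for i, message in enumerate(messages):
            assignments[healthy[i % len(healthy)]].append(message)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return assignments


msgs = ["m1", "m2", "m3", "m4"]
shared = dispatch(msgs, ["a", "b"], "shared")
failover = dispatch(msgs, ["a", "b"], "failover", failed={"a"})
```

In the failover call, consumer "a" is marked as failed, so the standby "b" receives every message.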
Pulsar Functions
Pulsar Functions are small pieces of code that run inside the Pulsar system. They let you process messages without setting up an external processing tool.
You can use them for:
- Filtering messages
- Transforming data
- Routing messages to other topics
- Enriching incoming data
They scale automatically and are tightly integrated into Pulsar’s architecture.
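As a rough illustration of filtering and transforming, the sketch below follows the plain-function style that Pulsar's Python runtime accepts, where a module exposes a `process(input)` function. The JSON message layout, the `temperature` field, and the threshold value are assumptions made for this example, and returning `None` to drop a message should be checked against the Pulsar Functions documentation for your version.

```python
import json

THRESHOLD_CELSIUS = 80.0  # hypothetical alert threshold


def process(input):
    """Filter and transform one sensor reading encoded as a JSON string."""
    reading = json.loads(input)
    if reading["temperature"] < THRESHOLD_CELSIUS:
        return None  # filter: nothing is emitted for normal readings
    reading["alert"] = True  # transform: tag the reading before forwarding
    return json.dumps(reading, sort_keys=True)
```

Because the function is plain Python, it can be unit-tested without a running cluster before it is deployed.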
Geo-Replication
Pulsar includes geo-replication features. Messages can be copied across data centers or cloud regions in real time.
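Replication is configured per namespace, and each tenant controls which clusters its namespaces may use. You can configure unidirectional or bidirectional replication. If a region goes down, clients can be redirected to a healthy region. Because replication is asynchronous, messages that had not yet been copied remain in the failed region until it recovers.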
This supports disaster recovery and global availability.
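Pulsar's geo-replication is asynchronous: a message is stored in the local cluster first and copied to remote clusters afterwards. The toy model below shows that ordering for a bidirectional setup. `ToyCluster` and its methods are illustrative names, not part of any Pulsar API.

```python
class ToyCluster:
    """Minimal model of one cluster taking part in replication."""

    def __init__(self, name):
        self.name = name
        self.log = []      # (origin cluster, payload) pairs stored locally
        self.peers = []    # clusters this one replicates to
        self.pending = []  # locally published messages not yet replicated

    def publish(self, payload):
        """Store a message locally and queue it for replication."""
        message = (self.name, payload)
        self.log.append(message)
        self.pending.append(message)

    def replicate(self):
        """Copy queued messages to every peer. Peers do not re-forward them."""
        for message in self.pending:
            for peer in self.peers:
                peer.log.append(message)
        self.pending.clear()


us, eu = ToyCluster("us"), ToyCluster("eu")
us.peers, eu.peers = [eu], [us]  # bidirectional replication
us.publish("order-1")
eu_before = list(eu.log)         # nothing has arrived yet
us.replicate()
eu.publish("order-2")
eu.replicate()
```

Only locally published messages are queued for replication, which keeps a message from bouncing back and forth between the two clusters.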
Tiered Storage
Pulsar includes built-in support for tiered storage. Older data is moved from high-speed local storage to cloud object storage such as Amazon S3 or Google Cloud Storage.
This keeps your active storage small and fast, while still allowing long-term data access for analytics or replay.
The system handles this transfer automatically, based on policies you define.
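One way to picture such a policy is a size threshold: once the data kept on local storage exceeds it, the oldest closed segments are moved first. The sketch below is an assumption-laden model of that idea. `select_for_offload` and its arguments are hypothetical names, not Pulsar configuration keys.

```python
def select_for_offload(ledgers, threshold_bytes):
    """Pick ledgers to move to object storage.

    ledgers: list of (ledger_id, size_bytes), oldest first. The last
    ledger is treated as still open for writes and is never offloaded.
    """
    local_bytes = sum(size for _, size in ledgers)
    to_offload = []
    for ledger_id, size in ledgers[:-1]:
        if local_bytes <= threshold_bytes:
            break  # local storage is back under the threshold
        to_offload.append(ledger_id)
        local_bytes -= size
    return to_offload


ledgers = [(1, 100), (2, 100), (3, 100), (4, 50)]
moved = select_for_offload(ledgers, threshold_bytes=200)
```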
Why Apache Pulsar Scales Better
Pulsar was built to scale from the ground up. It does not tie stored data to partitions pinned to specific brokers. Instead, topics are grouped into bundles, and Pulsar can split busy bundles and move them between brokers on the fly.
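The bundle idea can be sketched as ranges over a hash space: each topic name hashes to a point, each bundle owns a range, and splitting a bundle only divides its range. The code below is a simplified model under stated assumptions. It uses CRC32 as a stand-in hash, and `topic_hash`, `find_bundle`, and `split_bundle` are illustrative names, not Pulsar internals.

```python
import bisect
import zlib

HASH_SPACE = 2 ** 32


def topic_hash(topic):
    """Map a topic name to a point in [0, 2**32). CRC32 is a stand-in hash."""
    return zlib.crc32(topic.encode("utf-8"))


def find_bundle(boundaries, topic):
    """Return the index i of the bundle covering [boundaries[i], boundaries[i+1])."""
    return bisect.bisect_right(boundaries, topic_hash(topic)) - 1


def split_bundle(boundaries, index):
    """Split one bundle in half and return the new boundary list."""
    low, high = boundaries[index], boundaries[index + 1]
    middle = (low + high) // 2
    return boundaries[: index + 1] + [middle] + boundaries[index + 1 :]


boundaries = [0, 2 ** 31, HASH_SPACE]       # two bundles
after_split = split_bundle(boundaries, 0)   # first bundle becomes two
```

Topics outside the split range keep their place in the hash space, so only the bundle that was split needs to be reassigned.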
Horizontal Scaling Without Downtime
You can add brokers or bookies at any time. Brokers serve clients. Bookies store data. You do not need to repartition or reassign data manually. Pulsar handles all this in the background.
Millions of Topics
Pulsar clusters can manage millions of topics. This makes it a strong option for platforms with lots of microservices, users, or data streams.
Tenants own namespaces, and namespaces group related topics. You can set limits and access controls per namespace.
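This hierarchy shows up in topic names, which take the form `persistent://tenant/namespace/topic` (or `non-persistent://...`). The helper below is a small sketch for splitting such a name into its parts. `TopicName` and `parse_topic` are names invented for this example, and the tenant and namespace values are made up.

```python
from typing import NamedTuple


class TopicName(NamedTuple):
    domain: str
    tenant: str
    namespace: str
    topic: str


def parse_topic(name):
    """Split a fully qualified topic name into its four parts."""
    domain, separator, rest = name.partition("://")
    if not separator or domain not in ("persistent", "non-persistent"):
        raise ValueError(f"unsupported topic name: {name!r}")
    parts = rest.split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected tenant/namespace/topic in {name!r}")
    return TopicName(domain, *parts)


parsed = parse_topic("persistent://acme/billing/invoices")
```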
Geo-Replication Is Built In
Other systems require complex tools to support global data movement. Pulsar includes this out of the box. You get multi-region replication with automatic failover.
Isolated Workloads
You can isolate workloads using namespace policies. Set bandwidth limits, quotas, and access controls per tenant. This helps you avoid performance problems caused by noisy neighbors.
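A common way to enforce per-tenant rate limits is a token bucket. The sketch below shows that general technique, not Pulsar's implementation. `TokenBucket`, its parameters, and the namespace names are assumptions made for illustration.

```python
class TokenBucket:
    """Generic token bucket: refill at a fixed rate, spend one token per message."""

    def __init__(self, rate_per_sec, capacity):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = 0.0  # timestamp of the previous call, in seconds

    def allow(self, now, cost=1):
        """Return True if a message arriving at time `now` may pass."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# One bucket per namespace keeps a busy tenant from starving the others.
limits = {
    "acme/billing": TokenBucket(rate_per_sec=2, capacity=2),
    "globex/events": TokenBucket(rate_per_sec=100, capacity=100),
}
burst = [limits["acme/billing"].allow(now=0.0) for _ in range(3)]
```

The third message in the burst is rejected for `acme/billing`, while the bucket for `globex/events` is untouched.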
Long-Term Storage, Simple Setup
With tiered storage, Pulsar avoids the need for separate archive systems or ETL jobs. Old messages are moved to object storage. You can still replay them if needed.
When to Use Apache Pulsar
Pulsar works best for teams that need both scale and flexibility. Use it if you:
- Need both streaming and queuing
- Manage multiple products or tenants
- Handle data from global regions
- Run real-time analytics or alerts
- Require long-term storage without complex setup
- Want to avoid operational complexity when scaling
Real-World Use Cases
Multi-tenant platforms: A SaaS company supports hundreds of customers on a shared cluster. Pulsar isolates tenants and tracks resource usage without separate infrastructure.
Hybrid messaging and streaming: A risk platform uses Pulsar to send transactions to real-time analytics and also queue alerts for processing.
Global apps: A chat app syncs user activity across regions. Pulsar handles regional routing and failover.
Edge or IoT data processing: An industrial system uses Pulsar Functions to check sensor readings and raise alerts. No need for an external stream processor.
Storage-heavy analytics: A trading platform keeps six months of event history in S3 for compliance. Pulsar handles retention and replay.
High-volume data: A major retailer uses Pulsar to process logs and user events across more than a million topics. In this scenario, Kafka could not scale without manual tuning, while Pulsar handled the load out of the box.
FAQs
What is Apache Pulsar?
It is a messaging and streaming platform for real-time data. It supports pub-sub and queuing, multi-tenancy, geo-replication, and long-term storage.
How is it different from Kafka or RabbitMQ?
Pulsar separates brokers from storage. It supports both queue and stream patterns in one system. It scales cleanly and supports features like geo-replication and tiered storage.
What message patterns are supported?
Pulsar supports publish-subscribe messaging and queue-style work distribution through its shared, exclusive, and failover subscription types.
What are Pulsar Functions?
They are serverless code units that process messages in place. You can filter, route, or transform data without extra services.
How durable is Pulsar?
Messages are persisted to BookKeeper before the broker acknowledges them to the producer. Each message is stored on multiple bookies. Brokers are stateless and can be replaced without data loss.
What is tiered storage?
Old data is moved to object storage like S3. Pulsar manages this and lets you access old data without affecting performance.
Which languages does Pulsar support?
Java, Python, C++, Go, and JavaScript. Each has a full client library.
Is Pulsar cloud-friendly?
Yes. It runs well on Kubernetes and is available as a managed service through StreamNative.
How does geo-replication work?
It copies messages across clusters in different regions. You can choose which topics to replicate. Failover is automatic.
Who should use Pulsar?
Teams with high-throughput needs, global services, real-time systems, or multi-tenant apps. Pulsar is a good fit when you need flexibility and control.
Summary
Apache Pulsar is a strong choice for teams working with fast, large-scale data. It gives you a unified system for queuing and streaming, separates storage from delivery, and scales without complexity.
It includes features like geo-replication, Pulsar Functions, tiered storage, and multi-tenant support out of the box. It handles global use, heavy throughput, and long-term retention with less overhead.
If you're looking to build reliable, real-time systems, Pulsar offers the structure to do it well.