⚙️ How Datadog Built a Custom Database for Billions of Metrics Per Second

Datadog’s monitoring platform handles a staggering scale: billions of data points flowing in every second from millions of servers. To keep up, Datadog built Monocle, a custom time-series database written in Rust - optimized for raw performance, reliability, and cost efficiency. Monocle runs on a thread-per-core model and uses an LSM-tree storage design for extreme write throughput (a minimal sketch of that write path follows at the end of this post).

At the architectural level, Datadog’s Metrics Platform splits data into two specialized systems:

1. A Long-Term Store for historical analytics
2. A Real-Time Store for live dashboards and alerts - serving 99% of queries

Each incoming data point is first sent to Kafka, which powers data distribution across nodes, write-ahead logging for crash recovery, and automatic replication across availability zones for durability.

Performance is maintained under heavy load through two key systems:

1. Admission Control, which protects the cluster from overload
2. Cost-Based Scheduling, which dynamically prioritizes queries to keep latency low

🔗 To see how these systems work together under real Datadog-scale load, read the full breakdown: https://lnkd.in/esMfUBPA

Supported by our partners helping teams build and scale reliably:

Datadog - Powering observability at scale. Download the On-Call Best Practices guide: https://bit.ly/3Xxk0wZ

SonarSource - Bridging the gap between AI-generated code and human-grade quality. Verify every line for security, maintainability, and trust: https://bit.ly/47VpU00
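For readers who want something concrete, here are a few rough Rust sketches of how pieces like these might look. All type names, thresholds, and cost models are illustrative assumptions, not Monocle’s actual internals.

First, the LSM-tree write path: writes land in a sorted in-memory table, which is periodically frozen into immutable sorted runs. A minimal sketch, assuming a tiny flush threshold for the demo:

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy, Debug)]
struct Point {
    series_id: u64, // identifies one metric + tag combination (assumed encoding)
    ts: u64,        // unix timestamp in seconds
    value: f64,
}

const FLUSH_THRESHOLD: usize = 4; // tiny for the demo; real engines flush by size

#[derive(Default)]
struct LsmStore {
    memtable: BTreeMap<(u64, u64), f64>, // mutable, sorted by (series_id, ts)
    runs: Vec<Vec<((u64, u64), f64)>>,   // immutable sorted runs ("SSTables")
}

impl LsmStore {
    // Writes are an in-memory insert; reorganization is deferred, which is
    // what gives LSM designs their high ingest throughput.
    fn write(&mut self, p: Point) {
        self.memtable.insert((p.series_id, p.ts), p.value);
        if self.memtable.len() >= FLUSH_THRESHOLD {
            self.flush();
        }
    }

    // Freeze the memtable into an immutable sorted run. A real engine would
    // persist this run to disk and compact runs in the background.
    fn flush(&mut self) {
        let run: Vec<_> = std::mem::take(&mut self.memtable).into_iter().collect();
        self.runs.push(run);
    }

    // Reads consult the memtable first, then runs from newest to oldest.
    fn read(&self, series_id: u64, ts: u64) -> Option<f64> {
        if let Some(v) = self.memtable.get(&(series_id, ts)) {
            return Some(*v);
        }
        for run in self.runs.iter().rev() {
            if let Ok(i) = run.binary_search_by_key(&(series_id, ts), |e| e.0) {
                return Some(run[i].1);
            }
        }
        None
    }
}

fn main() {
    let mut store = LsmStore::default();
    for ts in 0..10 {
        store.write(Point { series_id: 1, ts, value: ts as f64 * 0.5 });
    }
    assert_eq!(store.read(1, 3), Some(1.5));
    println!("runs flushed: {}", store.runs.len());
}
```

The point this illustrates: ingestion is a cheap sorted insert plus an append-only flush, and the expensive work (merging runs) happens off the write path.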
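The two-store split can be pictured as a simple router: recent windows hit the Real-Time Store, everything else goes to the Long-Term Store. The 24-hour cutoff below is an assumed boundary for illustration - the post does not say where Datadog actually draws the line:

```rust
use std::time::{SystemTime, UNIX_EPOCH};

#[derive(Debug, PartialEq)]
enum Backend {
    RealTime, // live dashboards and alerts: the hot path serving most queries
    LongTerm, // historical analytics
}

// Assumed cutoff, purely for illustration.
const REALTIME_WINDOW_SECS: u64 = 24 * 60 * 60;

fn route(query_start_ts: u64) -> Backend {
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("clock before 1970")
        .as_secs();
    if now.saturating_sub(query_start_ts) <= REALTIME_WINDOW_SECS {
        Backend::RealTime
    } else {
        Backend::LongTerm
    }
}

fn main() {
    let now = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_secs();
    assert_eq!(route(now - 600), Backend::RealTime);            // "last 10 min" dashboard
    assert_eq!(route(now - 30 * 24 * 3600), Backend::LongTerm); // 30-day report
    println!("routing ok");
}
```

Routing by time range like this explains the 99% figure: most dashboard and alert queries only look at recent data, so the hot store can stay small and fast.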
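Kafka’s triple role - distributor, write-ahead log, and replicator - rests on one property: points for the same series always land in the same partition, giving an ordered, replayable per-series log. A dependency-free sketch of that key-to-partition mapping, with an assumed partition count and key format:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const PARTITIONS: u32 = 16; // assumed partition count for the sketch

// Hash the series key to pick a partition, so every point for one series
// lands on the same shard (and its replicas in other availability zones).
fn partition_for(series_key: &str) -> u32 {
    let mut h = DefaultHasher::new();
    series_key.hash(&mut h);
    (h.finish() % PARTITIONS as u64) as u32
}

fn main() {
    // Deterministic mapping: the same series always goes to the same partition,
    // which is what lets the partition double as a crash-recovery log.
    assert_eq!(
        partition_for("host-1:cpu.user"),
        partition_for("host-1:cpu.user")
    );
    println!("partition = {}", partition_for("host-1:cpu.user"));
}
```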
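Finally, a sketch of how Admission Control and Cost-Based Scheduling might compose: a concurrency cap that sheds load once saturated, plus a min-heap that dispatches the cheapest query first so interactive queries are not stuck behind huge scans. The cost model and limits here are invented for illustration:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

#[derive(Debug)]
struct Query {
    id: u32,
    estimated_cost: u64, // hypothetical cost, e.g. series scanned * time range
}

struct Scheduler {
    queue: BinaryHeap<Reverse<(u64, u32)>>, // min-heap on (cost, id)
    in_flight: usize,
    max_in_flight: usize, // admission-control limit (assumed)
}

impl Scheduler {
    // Admission control: shed load explicitly instead of letting the
    // cluster degrade for everyone.
    fn submit(&mut self, q: Query) -> Result<(), &'static str> {
        if self.queue.len() + self.in_flight >= self.max_in_flight * 10 {
            return Err("overloaded: query rejected");
        }
        self.queue.push(Reverse((q.estimated_cost, q.id)));
        Ok(())
    }

    // Cost-based scheduling: always dispatch the cheapest queued query next.
    fn dispatch(&mut self) -> Option<u32> {
        if self.in_flight >= self.max_in_flight {
            return None;
        }
        self.queue.pop().map(|Reverse((_, id))| {
            self.in_flight += 1;
            id
        })
    }

    fn complete(&mut self) {
        self.in_flight = self.in_flight.saturating_sub(1);
    }
}

fn main() {
    let mut s = Scheduler { queue: BinaryHeap::new(), in_flight: 0, max_in_flight: 2 };
    s.submit(Query { id: 1, estimated_cost: 500 }).unwrap();
    s.submit(Query { id: 2, estimated_cost: 5 }).unwrap();
    assert_eq!(s.dispatch(), Some(2)); // the cheap query jumps the queue
    s.complete();
    println!("scheduler ok");
}
```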