Initial commit to metrics and monitoring course

This commit is contained in:
Sumit Sulakhe 2021-02-08 05:25:23 -08:00 committed by Sumit Sulakhe
parent bbd0cd38b5
commit a3ffe9c1d0
21 changed files with 693 additions and 0 deletions

View File

@ -20,6 +20,7 @@ In this course, we are focusing on building strong foundational skills. The cour
- [NoSQL concepts](https://linkedin.github.io/school-of-sre/databases_nosql/intro/)
- [Big Data](https://linkedin.github.io/school-of-sre/big_data/intro/)
- [Systems Design](https://linkedin.github.io/school-of-sre/systems_design/intro/)
- [Metrics and Monitoring](metrics_and_monitoring/introduction.md)
- [Security](https://linkedin.github.io/school-of-sre/security/intro/)
We believe continuous learning will help in acquiring deeper knowledge and competencies in order to expand your skill sets. Every module has added references which can serve as a guide for further learning. Our hope is that by going through these modules we will be able to build the essential skills required for a Site Reliability Engineer.

View File

@ -0,0 +1,28 @@
# Proactive monitoring using alerts
Earlier we discussed different ways to collect key metric data points
from a service and its underlying infrastructure. This data gives us a
better understanding of how the service is performing. One of the main
objectives of monitoring is to detect any service degradations early
(reducing Mean Time To Detect, or MTTD) and notify stakeholders so that the issues
are either avoided or fixed early, thus reducing Mean Time To
Recover (MTTR). For example, if you are notified when resource usage by
a service exceeds 90 percent, you can take preventive measures to avoid
any service breakdown due to a shortage of resources. On the other hand,
when a service goes down due to an issue, early detection and
notification of such incidents can help you quickly fix the issue.
![An alert notification received on Slack](images/image11.png) <p align="center"> Figure 8: An alert notification received on Slack </p>
Today most of the monitoring services available provide a mechanism to
set up alerts on one or a combination of metrics to actively monitor the
service health. These alerts have a set of defined rules or conditions,
and when a rule is broken, you are notified. These rules can range from
something as simple as notifying when a metric value exceeds *n* to
something as complex as a week-over-week (WoW) comparison of standard
deviation over a period of time. Monitoring tools notify you about an
active alert, and most of
these tools support instant messaging (IM) platforms, SMS, email, or
phone calls. Figure 8 shows a sample alert notification received on
Slack for memory usage exceeding 90 percent of total RAM space on the
host.
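To make the idea concrete, here is a minimal sketch of such a threshold rule: poll a metric, compare it against a condition, and notify a channel. It is illustrative only; the `psutil` and `requests` packages, the Slack webhook URL, and the 90 percent threshold are assumptions, not part of any particular monitoring tool.

```python
# A minimal threshold-alert sketch (not a production alerting system).
# Assumes the third-party psutil and requests packages and a hypothetical
# Slack incoming-webhook URL.
import time

import psutil
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # hypothetical
MEMORY_THRESHOLD_PERCENT = 90
CHECK_INTERVAL_SECONDS = 60


def notify(message):
    # Slack incoming webhooks accept a JSON payload with a "text" field.
    requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=5)


while True:
    used_percent = psutil.virtual_memory().percent
    if used_percent > MEMORY_THRESHOLD_PERCENT:
        notify(f"High memory usage alert: {used_percent:.1f}% of RAM in use")
    time.sleep(CHECK_INTERVAL_SECONDS)
```

In practice, you would define such rules in your monitoring or alerting tool rather than hand-rolling a poller; the sketch only shows the three parts every alert rule has: a metric, a condition, and a notification channel.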

View File

@ -0,0 +1,40 @@
# Best practices for monitoring
When setting up monitoring for a service, keep the following best
practices in mind.
- **Use the right metric type** -- Most of the libraries available
today offer various metric types. Choose the appropriate metric
type for monitoring your system. Following are the types of
metrics and their purposes.
- **Gauge --** *Gauge* is a constant type of metric. After the
metric is initialized, the metric value does not change unless
you intentionally update it.
- **Timer --** *Timer* measures the time taken to complete a
task.
- **Counter --** *Counter* counts the number of occurrences of a
particular event.
For more information about these metric types, see [Data
Types](https://statsd.readthedocs.io/en/v0.5.0/types.html). A short
usage sketch follows this list.
- **Avoid over-monitoring** -- Monitoring can be a significant
engineering endeavor. Therefore, be sure not to spend too
much time and resources on monitoring services, yet make sure all
important metrics are captured.
- **Prevent alert fatigue** -- Set alerts for metrics that are
important and actionable. If you receive too many non-critical
alerts, you might start ignoring alert notifications over time. As
a result, critical alerts might get overlooked.
- **Have a runbook for alerts** -- For every alert, make sure you have
a document explaining what actions and checks need to be performed
when the alert fires. This enables any engineer on the team to
handle the alert and take necessary actions, without any help from
others.
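As a follow-up to the metric types above, here is a minimal usage sketch with the Python [statsd](https://statsd.readthedocs.io/) client; the StatsD server address and the metric names are illustrative assumptions.

```python
# A sketch of the three metric types using the Python statsd client.
# Assumes a StatsD daemon listening on localhost:8125; metric names are illustrative.
import time

import statsd

client = statsd.StatsClient("localhost", 8125)

# Counter: count occurrences of an event.
client.incr("payments.processed")

# Timer: measure how long a block of work takes.
with client.timer("payments.process_time"):
    time.sleep(0.1)  # stand-in for real work

# Gauge: record a value that holds until it is explicitly updated.
client.gauge("payments.queue_depth", 42)
```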

View File

@ -0,0 +1,98 @@
# Command-line tools
Most of the Linux distributions today come with a set of tools that
monitor the system's performance. These tools help you measure and
understand various subsystem statistics (CPU, memory, network, and so
on). Let's look at some of the tools that are predominantly used.
- `ps/top` -- The process status command (ps) displays information
about all the currently running processes in a Linux system. The
top command is similar to the ps command, but it periodically
updates the information displayed until the program is terminated.
An advanced version of top, called htop, has a more user-friendly
interface and some additional features. These command-line
utilities come with options to modify the operation and output of
the command. Following are some important options supported by the
ps command.
- `-p <pid1, pid2,...>` -- Displays information about processes
that match the specified process IDs. Similarly, you can use
`-u <uid>` and `-g <gid>` to display information about
processes belonging to a specific user or group.
- `-a` -- Displays information about other users' processes, as well
as one's own.
- `-x` -- When displaying processes matched by other options,
includes processes that do not have a controlling terminal.
![Results of top command](images/image12.png) <p align="center"> Figure 2: Results of top command </p>
- `ss` -- The socket statistics command (ss) displays information
about network sockets on the system. This tool is the successor of
[netstat](https://man7.org/linux/man-pages/man8/netstat.8.html),
which is deprecated. Following are some command-line options
supported by the ss command:
- `-t` -- Displays TCP sockets. Similarly, `-u` displays UDP
sockets, `-x` displays UNIX domain sockets, and so on.
- `-l` -- Displays only listening sockets.
- `-n` -- Instructs the command not to resolve service names and to
display port numbers instead.
![List of listening sockets on a system](images/image8.png) <p align="center"> Figure
3: List of listening sockets on a system </p>
- `free` -- The free command displays memory usage statistics on the
host, such as available, used, and free memory. Most often,
this command is used with the `-h` command-line option, which
displays the statistics in a human-readable format.
![Memory
statistics on a host in human-readable form](images/image6.png) <p align="center"> Figure 4: Memory
statistics on a host in human-readable form </p>
- `df` -- The df command displays disk space usage statistics. The
`-i` command-line option is also often used to display
[inode](https://en.wikipedia.org/wiki/Inode) usage
statistics. The `-h` command-line option is used for displaying
statistics in a human-readable format.
![Disk usage statistics on a system in human-readable form](images/image9.png) <p align="center"> Figure 5:
Disk usage statistics on a system in human-readable form </p>
- `sar` -- The sar utility monitors various subsystems, such as CPU
and memory, in real time. This data can be stored in a file
specified with the `-o` option. This tool helps to identify
anomalies.
- `iftop` -- The interface top command (`iftop`) displays bandwidth
utilization by a host on an interface. This command is often used
to identify bandwidth usage by active connections. The `-i` option
specifies which network interface to watch.
![Network bandwidth usage by
active connection on the host](images/image2.png) <p align="center"> Figure 6: Network bandwidth usage by
active connection on the host </p>
- `tcpdump` -- The tcpdump command is a network monitoring tool that
captures network packets flowing over the network and displays a
description of the captured packets. The following options are
available:
- `-i <interface>` -- Interface to listen on
- `host <IP/hostname>` -- Filters traffic going to or from the
specified host
- `src/dst` -- Displays one-way traffic from the source (src) or to
the destination (dst)
- `port <port number>` -- Filters traffic to or from a particular
port
![tcpdump of packets on an interface](images/image10.png) <p align="center"> Figure 7: *tcpdump* of packets on *docker0*
interface on a host </p>
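These tools are interactive, but the same numbers can be collected programmatically, for example by a small script feeding a metrics agent. The following sketch parses the output of `free -m`; the column layout assumed here matches recent procps versions of `free`, so treat it as an illustration rather than a portable parser.

```python
# A sketch that collects memory statistics by parsing the output of `free -m`.
# Assumes the procps layout: Mem: total used free shared buff/cache available
import subprocess

output = subprocess.run(
    ["free", "-m"], capture_output=True, text=True, check=True
).stdout

for line in output.splitlines():
    if line.startswith("Mem:"):
        fields = line.split()
        total_mb, used_mb = int(fields[1]), int(fields[2])
        print(f"memory used: {used_mb} MiB of {total_mb} MiB "
              f"({100 * used_mb / total_mb:.1f}%)")
```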

View File

@ -0,0 +1,50 @@
# Conclusion
A robust monitoring and alerting system is necessary for maintaining and
troubleshooting a system. A dashboard with key metrics can give you an
overview of service performance, all in one place. Well-defined alerts
(with realistic thresholds and notifications) further enable you to
quickly identify any anomalies in the service infrastructure and in
resource saturation. By taking necessary actions, you can avoid any
service degradations and decrease MTTD for service breakdowns.
In addition to in-house monitoring, monitoring real user experience can
help you to understand service performance as perceived by the users.
Many modules are involved in serving the user, and most of them are out
of your control. Therefore, you need to have real-user monitoring in
place.
Metrics give very abstract details on service performance. To get a
better understanding of the system and for faster recovery during
incidents, you might want to implement the other two pillars of
observability: logs and tracing. Logs and trace data can help you
understand what led to service failure or degradation.
Following are some resources to learn more about monitoring and
observability:
- [Google SRE book: Monitoring Distributed
Systems](https://sre.google/sre-book/monitoring-distributed-systems/)
- [Mastering Distributed Tracing by Yuri
Shkuro](https://learning.oreilly.com/library/view/mastering-distributed-tracing/9781788628464/)
- Engineering blogs on
[LinkedIn](https://engineering.linkedin.com/blog/topic/monitoring),
[Grafana](https://grafana.com/blog/),
[Elastic.co](https://www.elastic.co/blog/),
[OpenTelemetry](https://medium.com/opentelemetry)
## References
- [Google SRE book: Monitoring Distributed
Systems](https://sre.google/sre-book/monitoring-distributed-systems/)
- [Mastering Distributed Tracing, by Yuri
Shkuro](https://learning.oreilly.com/library/view/mastering-distributed-tracing/9781788628464/)
- [Monitoring and
Observability](https://copyconstruct.medium.com/monitoring-and-observability-8417d1952e1c)
- [Three Pillars with Zero
Answers](https://medium.com/lightstephq/three-pillars-with-zero-answers-2a98b36358b8)

12 binary image files added (not shown).

View File

@ -0,0 +1,280 @@
# Prerequisites
- [Linux Basics](https://linkedin.github.io/school-of-sre/linux_basics/intro/)
- [Python and the Web](https://linkedin.github.io/school-of-sre/python_web/intro/)
- [Systems Design](https://linkedin.github.io/school-of-sre/systems_design/intro/)
- [Linux Networking Fundamentals](https://linkedin.github.io/school-of-sre/linux_networking/intro/)
## What to expect from this course
Monitoring is an integral part of any system. As an SRE, you need to
have a basic understanding of monitoring a service infrastructure. By
the end of this course, you will gain a better understanding of the
following topics:
- What is monitoring?
- What needs to be measured
- How the metrics gathered can be used to improve business decisions and overall reliability
- Proactive monitoring with alerts
- Log processing and its importance
- What is observability?
- Distributed tracing
- Logs
- Metrics
## What is not covered in this course
- Guide to setting up a monitoring infrastructure
- Deep dive into different monitoring technologies and benchmarking or comparison of any tools
## Course content
- [Introduction](#introduction)
- [Four golden signals of monitoring](#four-golden-signals-of-monitoring)
- [Why is monitoring important?](#why-is-monitoring-important)
- [Command-line tools](command-line_tools.md)
- [Third-party monitoring](third-party_monitoring.md)
- [Proactive monitoring using alerts](alerts.md)
- [Best practices for monitoring](best_practices.md)
- [Observability](observability.md)
- [Logs](observability.md#logs)
- [Tracing](observability.md#tracing)
- [Conclusion](conclusion.md)
# Introduction
Monitoring is a process of collecting real-time performance metrics from
a system, analyzing the data to derive meaningful information, and
displaying the data to the users. In simple terms, you measure various
metrics regularly to understand the state of the system, including but
not limited to, user requests, latency, and error rate. *What gets
measured, gets fixed*---if you can measure something, you can reason
about it, understand it, discuss it, and act upon it with confidence.
## Four golden signals of monitoring
When setting up monitoring for a system, you need to decide what to
measure. The four golden signals of monitoring provide a good
understanding of service performance and lay a foundation for monitoring
a system. These four golden signals are
- Traffic
- Latency
- Error
- Saturation
These metrics help you to understand the system performance and
bottlenecks, and to create a better end-user experience. As discussed in
the [Google SRE
book](https://sre.google/sre-book/monitoring-distributed-systems/),
if you can measure only four metrics of your service, focus on these
four. Let's look at each of the four golden signals.
- **Traffic** -- *Traffic* gives a better understanding of the service
demand. Often referred to as *service QPS* (queries per second),
traffic is a measure of requests served by the service. This
signal helps you to decide when a service needs to be scaled up to
handle increasing customer demand and scaled down to be
cost-effective.
- **Latency** -- *Latency* is the measure of time taken by the service
to process the incoming request and send the response. Measuring
service latency helps in the early detection of slow degradation
of the service. Distinguishing between the latency of successful
requests and the latency of failed requests is important. For
example, an [HTTP 5XX
error](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status#server_error_responses)
triggered due to loss of connection to a database or other
critical backend might be served very quickly. However, because an
HTTP 500 error indicates a failed request, factoring 500s into
overall latency might result in misleading calculations.
- **Error (rate)** -- *Error* is the measure of failed client
requests. These failures can be easily identified based on the
response codes ([HTTP 5XX
error](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status#server_error_responses)).
There might be cases where the response is considered erroneous
due to wrong result data or due to policy violations. For example,
you might get an [HTTP
200](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/200)
response, but the body has incomplete data, or response time is
breaching the agreed-upon
[SLA](https://en.wikipedia.org/wiki/Service-level_agreement)s.
Therefore, you need to have other mechanisms (code logic or
[instrumentation](https://en.wikipedia.org/wiki/Instrumentation_(computer_programming)))
in place to capture errors in addition to the response codes.
- **Saturation** -- *Saturation* is a measure of the resource
utilization by a service. This signal tells you the state of
service resources and how full they are. These resources include
memory, compute, network I/O, and so on. Service performance
slowly degrades even before resource utilization is at 100
percent. Therefore, having a utilization target is important. An
increase in latency is a good indicator of saturation; measuring
the [99th
percentile](https://medium.com/@ankur_anand/an-in-depth-introduction-to-99-percentile-for-programmers-22e83a00caf)
of latency can help in the early detection of saturation.
Depending on the type of service, you can measure these signals in
different ways. For example, you might measure queries per second served
for a web server. In contrast, for a database server, transactions
performed and database sessions created give you an idea about the
traffic handled by the database server. With the help of additional code
logic (monitoring libraries and instrumentation), you can measure these
signals periodically and store them for future analysis. Although these
metrics give you an idea about the performance at the service end, you
need to also ensure that the same user experience is delivered at the
client end. Therefore, you might need to monitor the service from
outside the service infrastructure, which is discussed under third-party
monitoring.
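As a rough sketch of what instrumenting these signals can look like in application code, the snippet below wraps a hypothetical request handler and records traffic, errors, latency (kept separate for failed requests), and a simple saturation proxy. The in-memory counters, the `handle()` function, and the use of the 1-minute load average are assumptions for illustration, not a recommended implementation.

```python
# A minimal, in-memory sketch of recording the four golden signals around a request handler.
# In a real service these values would go to a metrics library (for example, a StatsD
# client) rather than module-level variables; handle() is a hypothetical handler.
import os
import time

request_count = 0         # traffic
error_count = 0           # errors
latency_ok = []           # latency of successful requests (seconds)
latency_failed = []       # latency of failed requests, kept separate
saturation_samples = []   # 1-minute load average as a simple saturation proxy


def handle(path):
    """Hypothetical request handler; returns an HTTP status code."""
    time.sleep(0.01)  # stand-in for real work
    return 200


def instrumented_handle(path):
    global request_count, error_count
    start = time.monotonic()
    status = 500  # assume failure unless the handler returns normally
    try:
        status = handle(path)
        return status
    finally:
        elapsed = time.monotonic() - start
        request_count += 1
        if status >= 500:
            error_count += 1
            latency_failed.append(elapsed)  # fast errors should not skew success latency
        else:
            latency_ok.append(elapsed)
        saturation_samples.append(os.getloadavg()[0])
```

Keeping the latency of failed requests separate, as above, avoids the misleading averages described earlier for HTTP 5XX responses.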
## Why is monitoring important?
Monitoring plays a key role in the success of a service. As discussed
earlier, monitoring provides performance insights for understanding
service health. With access to historical data collected over time, you
can build intelligent applications to address specific needs. Some of
the key use cases follow:
- **Reduction in time to resolve issues** -- With a good monitoring
infrastructure in place, you can identify issues quickly and
resolve them, which reduces the impact caused by the issues.
- **Business decisions** -- Data collected over a period of time can
help you make business decisions such as determining the product
release cycle, which features to invest in, and geographical areas
to focus on. Decisions based on long-term data can improve the
overall product experience.
- **Resource planning** -- By analyzing historical data, you can
forecast service compute-resource demands, and you can properly
allocate resources. This enables cost-effective decisions without
compromising the end-user experience.
Before we dive deeper into monitoring, let's understand some basic
terminologies.
- **Metric** -- A metric is a quantitative measure of a particular
system attribute---for example, memory or CPU
- **Node or host** -- A physical server, virtual machine, or container
where an application is running
- **QPS** -- *Queries Per Second*, a measure of traffic served by the
service per second
- **Latency** -- The time interval between user action and the
response from the server---for example, time spent after sending a
query to a database before the first response bit is received
- **Error rate** -- Number of errors observed over a particular
time period (usually a second)
- **Graph** -- In monitoring, a graph is a representation of one or
more values of metrics collected over time
- **Dashboard** -- A dashboard is a collection of graphs that provide
an overview of system health
- **Incident** -- An incident is an event that disrupts the normal
operations of a system
- **MTTD** -- *Mean Time To Detect* is the time interval between the
beginning of a service failure and the detection of such failure
- **MTTR** -- *Mean Time To Resolve* is the time spent to fix a service
failure and bring the service back to its normal state
Before we discuss monitoring an application, let us look at the
monitoring infrastructure. Following is an illustration of a basic
monitoring system.
![Illustration of a monitoring infrastructure](images/image1.jpg) <p align="center"> Figure 1: Illustration of a monitoring infrastructure </p>
Figure 1 shows a monitoring infrastructure mechanism for aggregating
metrics on the system, and collecting and storing the data for display.
In addition, a monitoring infrastructure includes alert subsystems for
notifying concerned parties during any abnormal behavior. Let's look at
each of these infrastructure components:
- **Host metrics agent --** A *host metrics agent* is a process
running on the host that collects performance statistics for host
subsystems such as memory, CPU, and network. These metrics are
regularly relayed to a metrics collector for storage and
visualization. Some examples are
[collectd](https://collectd.org/),
[telegraf](https://www.influxdata.com/time-series-platform/telegraf/),
and [metricbeat](https://www.elastic.co/beats/metricbeat).
- **Metric aggregator --** A *metric aggregator* is a process running
on the host. Applications running on the host collect service
metrics using
[instrumentation](https://en.wikipedia.org/wiki/Instrumentation_(computer_programming)).
Collected metrics are sent either to the aggregator process or
directly to the metrics collector over API, if available. Received
metrics are aggregated periodically and relayed to the metrics
collector in batches. An example is
[StatsD](https://github.com/statsd/statsd).
- **Metrics collector --** A *metrics collector* process collects all
the metrics from the metric aggregators running on multiple hosts.
The collector decodes this data and stores it in the
database. Metric collection and storage might be handled by a
single service such as
[InfluxDB](https://www.influxdata.com/), which we discuss
next. An example is [carbon
daemons](https://graphite.readthedocs.io/en/latest/carbon-daemons.html).
- **Storage --** A time-series database stores all of these metrics.
Examples are [OpenTSDB](http://opentsdb.net/),
[Whisper](https://graphite.readthedocs.io/en/stable/whisper.html),
and [InfluxDB](https://www.influxdata.com/).
- **Metrics server --** A *metrics server* can be as basic as a web
server that graphically renders metric data. In addition, the
metrics server provides aggregation functionalities and APIs for
fetching metric data programmatically. Some examples are
[Grafana](https://github.com/grafana/grafana) and
[Graphite-Web](https://github.com/graphite-project/graphite-web).
- **Alert manager --** The *alert manager* regularly polls the metric data
available and notifies you if any anomalies are detected.
Each alert has a set of rules for identifying such anomalies.
Today many metrics servers such as
[Grafana](https://github.com/grafana/grafana) support alert
management. We discuss alerting [in detail
later](#proactive-monitoring-using-alerts). Examples are
[Grafana](https://github.com/grafana/grafana) and
[Icinga](https://icinga.com/).
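To illustrate the hop from an application to a metric aggregator such as StatsD, here is a sketch that emits metrics over the StatsD plain-text UDP protocol (`<name>:<value>|<type>`, UDP port 8125 by default). The metric names are illustrative, and in practice you would use a client library rather than raw sockets.

```python
# A sketch of the application -> metric aggregator hop using the StatsD line protocol.
# Format: "<metric.name>:<value>|<type>", where the type is c (counter), g (gauge),
# or ms (timing). Assumes a StatsD daemon listening on UDP port 8125 on the same host.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(b"myapp.requests:1|c", ("127.0.0.1", 8125))         # increment a counter
sock.sendto(b"myapp.queue_depth:42|g", ("127.0.0.1", 8125))     # set a gauge
sock.sendto(b"myapp.request_time:320|ms", ("127.0.0.1", 8125))  # record a timing in ms
sock.close()
```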

View File

@ -0,0 +1,151 @@
# Observability
Engineers often use observability when referring to building reliable
systems. *Observability* is a term derived from control theory; it is a
measure of how well internal states of a system can be inferred from
knowledge of its external outputs. Service infrastructures used on a
daily basis are becoming more and more complex; proactive monitoring
alone is not sufficient to quickly resolve issues causing application
failures. With monitoring, you can keep known past failures from
recurring, but with a complex service architecture, many unknown factors
can cause potential problems. To address such cases, you can make the
service observable. An observable system provides highly granular
insights into the implicit failure modes. In addition, an observable
system furnishes ample context about its inner workings, which unlocks
the ability to uncover deeper systemic issues.
Monitoring enables failure detection; observability helps in gaining a
better understanding of the system. Among engineers, there is a common
misconception that monitoring and observability are two different
things. Actually, observability is a superset of monitoring; that is,
monitoring improves service observability. The goal of observability is
not only to detect problems, but also to understand where the issue is
and what is causing it. In addition to metrics, observability has two
more pillars: logs and traces, as shown in Figure 9. Although these
three components do not make a system 100 percent observable, they are
the most important and powerful components that give a better
understanding of the system. Each of these pillars has its flaws, which
are described in [Three Pillars with Zero
Answers](https://medium.com/lightstephq/three-pillars-with-zero-answers-2a98b36358b8).
![Three pillars of observability](images/image7.png) <p align="center"> Figure 9:
Three pillars of observability </p>
Because we have covered metrics already, let's look at the other two
pillars (logs and traces).
#### Logs
Logs (often referred to as *events*) are a record of activities
performed by a service during its run time, with a corresponding
timestamp. Metrics give abstract information about degradations in a
system, and logs give a detailed view of what is causing these
degradations. Logs created by the applications and infrastructure
components help in effectively understanding the behavior of the system
by providing details on application errors, exceptions, and event
timelines. Logs help you to go back in time to understand the events
that led to a failure. Therefore, examining logs is essential to
troubleshooting system failures.
Log processing involves the aggregation of different logs from
individual applications and their subsequent shipment to central
storage. Moving logs to central storage helps to preserve the logs, in
case the application instances are inaccessible, or the application
crashes due to a failure. After the logs are available in a central
place, you can analyze the logs to derive sensible information from
them. For audit and compliance purposes, you archive these logs on the
central storage for a certain period of time. Log analyzers fetch useful
information from log lines, such as request user information, request
URL (feature), response headers (such as content length), and
response time. This information is grouped based on these attributes and
made available to you through a visualization tool for quick
understanding.
You might be wondering how this log information helps. This information
gives a holistic view of activities performed on all the involved
entities. For example, let's say someone is performing a DoS (denial of
service) attack on a web application. With the help of log processing,
you can quickly look at top client IPs derived from access logs and
identify where the attack is coming from.
Similarly, if a feature in an application is causing a high error rate
when accessed with a particular request parameter value, the results of
log analysis can help you to quickly identify the misbehaving parameter
value and take further action.
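As a small example of this kind of analysis, the sketch below counts the top client IPs in an access log, which is one way to spot the source of a DoS attack. The log path and the Common Log Format layout (client IP as the first field) are assumptions; a real pipeline would do this in the log processing stack rather than an ad-hoc script.

```python
# A sketch of deriving top client IPs from an access log.
# Assumes Common Log Format, where the client IP is the first whitespace-separated field,
# and a hypothetical log path.
from collections import Counter

ip_counts = Counter()
with open("/var/log/nginx/access.log") as log_file:  # path is an assumption
    for line in log_file:
        client_ip = line.split(" ", 1)[0]
        ip_counts[client_ip] += 1

for ip, count in ip_counts.most_common(10):
    print(f"{ip}\t{count}")
```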
![Log processing and analysis using ELK stack](images/image4.jpg)
<p align="center"> Figure 10: Log processing and analysis using ELK stack </p>
Figure 10 shows a log processing platform using ELK (Elasticsearch,
Logstash, Kibana), which provides centralized log processing. Beats is a
collection of lightweight data shippers that can ship logs, audit data,
network data, and so on over the network. In this use case specifically,
we are using Filebeat as a log shipper. Filebeat watches service log
files and ships the log data to Logstash. Logstash parses these logs and
transforms the data, preparing it to store on Elasticsearch. Transformed
log data is stored on Elasticsearch and indexed for fast retrieval.
Kibana searches and displays log data stored on Elasticsearch. Kibana
also provides a set of visualizations for graphically displaying
summaries derived from log data.
Storing logs is expensive: extensive logging of every event on a
server takes up significant storage space, and this cost grows in
proportion to the number of services.
#### Tracing
So far, we covered the importance of metrics and logging. Metrics give
an abstract overview of the system, and logging gives a record of events
that occurred. Imagine a complex distributed system with multiple
microservices, where a user request is processed by multiple
microservices in the system. Metrics and logging give you some
information about how these requests are being handled by the system,
but they fail to provide detailed information across all the
microservices and how they affect a particular client request. If a slow
downstream microservice is leading to increased response times, you need
to have detailed visibility across all involved microservices to
identify such a microservice. The answer to this need is a request tracing
mechanism.
A trace is a series of spans, where each span is a record of events
performed by different microservices to serve the client's request. In
simple terms, a trace is a log of client-request serving derived from
various microservices across different physical machines. Each span
includes span metadata such as trace ID and span ID, and context, which
includes information about transactions performed.
![Trace and spans for a URL shortener request](images/image3.jpg)
<p align="center"> Figure 11: Trace and spans for a URL shortener request </p>
Figure 11 is a graphical representation of a trace captured on the [URL
shortener](https://linkedin.github.io/school-of-sre/python_web/url-shorten-app/)
example we covered earlier while learning Python.
Similar to monitoring, the tracing infrastructure comprises a few
modules for collecting traces, storing them, and accessing them. Each
microservice runs a tracing library that collects traces in the
background, creates in-memory batches, and submits them to the tracing backend.
The tracing backend normalizes received trace data and stores it on
persistent storage. Tracing data comes from multiple different
microservices; therefore, trace storage is often organized to store data
incrementally and is indexed by trace identifier. This organization
helps in the reconstruction of trace data and in visualization. Figure
12 illustrates the anatomy of distributed tracing.
![Anatomy of distributed tracing](images/image5.jpg)
<p align="center"> Figure 12: Anatomy of distributed tracing </p>
Today a set of tools and frameworks are available for building
distributed tracing solutions. Following are some of the popular tools:
- [OpenTelemetry](https://opentelemetry.io/): Observability
framework for cloud-native software
- [Jaeger](https://www.jaegertracing.io/): Open-source
distributed tracing solution
- [Zipkin](https://zipkin.io/): Open-source distributed tracing
solution
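As a taste of what instrumentation looks like with one of these tools, the sketch below creates a trace with nested spans using the OpenTelemetry Python SDK and exports them to the console. The span names are illustrative, and the exact package layout can vary between OpenTelemetry releases, so treat this as a sketch rather than a reference implementation.

```python
# A sketch of creating a trace with nested spans using the OpenTelemetry Python SDK.
# Exports spans to the console; in production you would export to a tracing backend
# such as Jaeger or Zipkin.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("url-shortener")  # instrumentation name is illustrative

with tracer.start_as_current_span("shorten-url"):        # parent span for the client request
    with tracer.start_as_current_span("generate-short-id"):
        pass  # stand-in for work done by one microservice
    with tracer.start_as_current_span("store-mapping"):
        pass  # stand-in for a call to the datastore
```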

View File

@ -0,0 +1,37 @@
# Third-party monitoring
Today most cloud providers offer a variety of monitoring solutions. In
addition, a number of companies such as
[Datadog](https://www.datadoghq.com/) offer
monitoring-as-a-service. In this section, we are not covering
monitoring-as-a-service in depth.
In recent years, more and more people have access to the internet. Many
services are offered online to cater to the increasing user base. As a
result, web pages are becoming larger, with increased client-side
scripts. Users want these services to be fast and error-free. From the
service point of view, when the response body is composed, an HTTP 200
OK response is sent, and everything looks okay. But there might be
errors during transmission or on the client side. As previously
mentioned, monitoring services from within the service infrastructure
give good visibility into service health, but this is not enough. You
need to monitor user experience, specifically the availability of
services for clients. A number of third-party services such as
[Catchpoint](https://www.catchpoint.com/),
[Pingdom](https://www.pingdom.com/), and so on are available for
achieving this goal.
Third-party monitoring services can generate synthetic traffic
simulating user requests from various parts of the world, to ensure the
service is globally accessible. Other third-party monitoring solutions
for real user monitoring (RUM) provide performance statistics such as
service uptime and response time, from different geographical locations.
This allows you to monitor the user experience from these locations,
which might have different internet backbones, different operating
systems, and different browsers and browser versions. [Catchpoint
Global Monitoring
Network](https://pages.catchpoint.com/overview-video) is a
comprehensive 3-minute video that explains the importance of monitoring
the client experience.
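Conceptually, a synthetic check is simple: probe an endpoint the way an external agent would, and record availability and response time. The sketch below shows that shape; the URL is a hypothetical endpoint, and commercial services run such probes from many geographic locations and aggregate the results for you.

```python
# A sketch of a synthetic availability and latency probe, similar in spirit to what
# third-party monitoring agents run from multiple locations. Assumes the requests package.
import requests

URL = "https://example.com/health"  # hypothetical endpoint being monitored

try:
    response = requests.get(URL, timeout=5)
    response_time = response.elapsed.total_seconds()
    available = response.status_code == 200
except requests.RequestException:
    response_time, available = None, False

print(f"available={available} response_time={response_time}")
```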

View File

@ -57,6 +57,14 @@ nav:
- Availability: systems_design/availability.md
- Fault Tolerance: systems_design/fault-tolerance.md
- Conclusion: systems_design/conclusion.md
- Metrics and Monitoring:
- Introduction: metrics_and_monitoring/introduction.md
- Command-line Tools: metrics_and_monitoring/command-line_tools.md
- Third-party Monitoring: metrics_and_monitoring/third-party_monitoring.md
- Proactive Monitoring with Alerts: metrics_and_monitoring/alerts.md
- Best Practices for Monitoring: metrics_and_monitoring/best_practices.md
- Observability: metrics_and_monitoring/observability.md
- Conclusion: metrics_and_monitoring/conclusion.md
- Security:
- Introduction: security/intro.md
- Fundamentals of Security: security/fundamentals.md