diff --git a/courses/index.md b/courses/index.md index 06ac216..47fa092 100644 --- a/courses/index.md +++ b/courses/index.md @@ -20,6 +20,7 @@ In this course, we are focusing on building strong foundational skills. The cour - [NoSQL concepts](https://linkedin.github.io/school-of-sre/databases_nosql/intro/) - [Big Data](https://linkedin.github.io/school-of-sre/big_data/intro/) - [Systems Design](https://linkedin.github.io/school-of-sre/systems_design/intro/) +- [Metrics and Monitoring](metrics_and_monitoring/introduction.md) - [Security](https://linkedin.github.io/school-of-sre/security/intro/) We believe continuous learning will help in acquiring deeper knowledge and competencies in order to expand your skill sets, every module has added references which could be a guide for further learning. Our hope is that by going through these modules we should be able to build the essential skills required for a Site Reliability Engineer. diff --git a/courses/metrics_and_monitoring/alerts.md b/courses/metrics_and_monitoring/alerts.md new file mode 100644 index 0000000..85f9ce1 --- /dev/null +++ b/courses/metrics_and_monitoring/alerts.md @@ -0,0 +1,28 @@ +## + +# Proactive monitoring using alerts +Earlier we discussed different ways to collect key metric data points +from a service and its underlying infrastructure. This data gives us a +better understanding of how the service is performing. One of the main +objectives of monitoring is to detect any service degradations early +(reduce Mean Time To Detect) and notify stakeholders so that the issues +are either avoided or can be fixed early, thus reducing Mean Time To +Recover (MTTR). For example, if you are notified when resource usage by +a service exceeds 90 percent, you can take preventive measures to avoid +any service breakdown due to a shortage of resources. On the other hand, +when a service goes down due to an issue, early detection and +notification of such incidents can help you quickly fix the issue. + +![An alert notification received on Slack](images/image11.png)

Figure 8: An alert notification received on Slack
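To make the threshold rule in the example above concrete, here is a minimal, illustrative sketch in Python. It is not tied to any monitoring product; the Slack webhook URL, the metric source, and the 90 percent threshold are placeholder assumptions. It evaluates a simple rule and posts a notification like the one in Figure 8 when the rule is broken.

```python
import json
import urllib.request

# Placeholder assumptions -- replace with your own webhook and threshold.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"
MEMORY_THRESHOLD_PERCENT = 90.0


def notify_slack(message):
    """Post an alert message to a Slack incoming webhook."""
    payload = json.dumps({"text": message}).encode()
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)


def check_memory_usage(used_percent):
    """Evaluate a simple threshold rule and notify when it is breached."""
    if used_percent > MEMORY_THRESHOLD_PERCENT:
        notify_slack(
            f"ALERT: memory usage at {used_percent:.1f}% "
            f"(threshold {MEMORY_THRESHOLD_PERCENT}%)"
        )


# The value would normally come from a host agent or the metrics store.
check_memory_usage(used_percent=92.5)
```

In practice, rule evaluation and notification are handled by the alert manager of your monitoring stack rather than by ad-hoc scripts; the sketch only shows what such a rule boils down to.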

+ +Today most of the monitoring services available provide a mechanism to +set up alerts on one or a combination of metrics to actively monitor the +service health. These alerts have a set of defined rules or conditions, +and when the rule is broken, you are notified. These rules can be as +simple as notifying when the metric value exceeds n to as complex as a +week over week (WoW) comparison of standard deviation over a period of +time. Monitoring tools notify you about an active alert, and most of +these tools support instant messaging (IM) platforms, SMS, email, or +phone calls. Figure 8 shows a sample alert notification received on +Slack for memory usage exceeding 90 percent of total RAM space on the +host. diff --git a/courses/metrics_and_monitoring/best_practices.md b/courses/metrics_and_monitoring/best_practices.md new file mode 100644 index 0000000..5454bde --- /dev/null +++ b/courses/metrics_and_monitoring/best_practices.md @@ -0,0 +1,40 @@ +## + +# Best practices for monitoring + +When setting up monitoring for a service, keep the following best +practices in mind. + +- **Use the right metric type** -- Most of the libraries available + today offer various metric types. Choose the appropriate metric + type for monitoring your system. Following are the types of + metrics and their purposes. + + - **Gauge --** *Gauge* is a constant type of metric. After the + metric is initialized, the metric value does not change unless + you intentionally update it. + + - **Timer --** *Timer* measures the time taken to complete a + task. + + - **Counter --** *Counter* counts the number of occurrences of a + particular event. + + For more information about these metric types, see [Data + Types](https://statsd.readthedocs.io/en/v0.5.0/types.html). + +- **Avoid over-monitoring** -- Monitoring can be a significant + engineering endeavor***.*** Therefore, be sure not to spend too + much time and resources on monitoring services, yet make sure all + important metrics are captured. + +- **Prevent alert fatigue** -- Set alerts for metrics that are + important and actionable. If you receive too many non-critical + alerts, you might start ignoring alert notifications over time. As + a result, critical alerts might get overlooked. + +- **Have a runbook for alerts** -- For every alert, make sure you have + a document explaining what actions and checks need to be performed + when the alert fires. This enables any engineer on the team to + handle the alert and take necessary actions, without any help from + others. \ No newline at end of file diff --git a/courses/metrics_and_monitoring/command-line_tools.md b/courses/metrics_and_monitoring/command-line_tools.md new file mode 100644 index 0000000..8371c40 --- /dev/null +++ b/courses/metrics_and_monitoring/command-line_tools.md @@ -0,0 +1,98 @@ +## + +# Command-line tools +Most of the Linux distributions today come with a set of tools that +monitor the system's performance. These tools help you measure and +understand various subsystem statistics (CPU, memory, network, and so +on). Let's look at some of the tools that are predominantly used. + +- `ps/top `-- The process status command (ps) displays information + about all the currently running processes in a Linux system. The + top command is similar to the ps command, but it periodically + updates the information displayed until the program is terminated. + An advanced version of top, called htop, has a more user-friendly + interface and some additional features. 
These command-line utilities come with options to modify their operation and output. Following are some important options supported by the ps command.

    - `-p <pid>` -- Displays information about processes that match the specified process IDs. Similarly, you can use `-u <userid>` and `-g <groupid>` to display information about processes belonging to a specific user or group.

    - `-a` -- Displays information about other users' processes, as well as one's own.

    - `-x` -- When displaying processes matched by other options, includes processes that do not have a controlling terminal.

![Results of top command](images/image12.png)

Figure 2: Results of top command
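The same information these tools print interactively can also be collected from a script, which is how simple host checks are often built. Below is an illustrative sketch (assuming a Linux host with the procps `ps` and Python 3) that pulls CPU and memory usage for a given process ID by shelling out to `ps`; command names containing spaces are not handled.

```python
import os
import subprocess


def process_stats(pid):
    """Return (%CPU, %MEM, RSS in kB, command) for a process, using ps."""
    output = subprocess.run(
        ["ps", "-p", str(pid), "-o", "pcpu=,pmem=,rss=,comm="],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    pcpu, pmem, rss, comm = output[0], output[1], output[2], output[3]
    return float(pcpu), float(pmem), int(rss), comm


# Example: report stats for the current Python process.
print(process_stats(os.getpid()))
```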

+ +- `ss` -- The socket statistics command (ss) displays information + about network sockets on the system. This tool is the successor of + [netstat](https://man7.org/linux/man-pages/man8/netstat.8.html), + which is deprecated. Following are some command-line options + supported by the ss command: + + - `-t` -- Displays the TCP socket. Similarly, `-u` displays UDP + sockets, `-x` is for UNIX domain sockets, and so on. + + - `-l` -- Displays only listening sockets. + + - `-n` -- Instructs the command to not resolve service names. + Instead displays the port numbers. + +![List of listening sockets on a system](images/image8.png)

Figure 3: List of listening sockets on a system
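If you need the same information programmatically -- for example, to verify that a service is listening on its expected port -- a short sketch like the following (assuming a Linux host with `ss` from iproute2 and Python 3) parses the output of `ss -ltn` shown in Figure 3.

```python
import subprocess


def listening_tcp_ports():
    """Return the set of local TCP ports in LISTEN state, parsed from `ss -ltn`."""
    lines = subprocess.run(
        ["ss", "-ltn"], capture_output=True, text=True, check=True
    ).stdout.splitlines()
    ports = set()
    for line in lines[1:]:                  # skip the header line
        if not line.strip():
            continue
        local_address = line.split()[3]     # e.g. "0.0.0.0:22" or "[::]:80"
        ports.add(int(local_address.rsplit(":", 1)[1]))
    return ports


print(sorted(listening_tcp_ports()))
```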

+ +- `free` -- The free command displays memory usage statistics on the + host like available memory, used memory, and free memory. Most often, + this command is used with the `-h` command-line option, which + displays the statistics in a human-readable format. + +![Memory + statistics on a host in human-readable form](images/image6.png)

Figure 4: Memory statistics on a host in human-readable form
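The numbers `free` reports come from `/proc/meminfo`, so a monitoring script can derive the same statistics directly. A rough, Linux-specific sketch that computes a used-memory percentage -- the kind of value a memory-usage alert is typically built on:

```python
def memory_used_percent(meminfo_path="/proc/meminfo"):
    """Approximate used-memory percentage from /proc/meminfo (the data `free` reads)."""
    fields = {}
    with open(meminfo_path) as meminfo:
        for line in meminfo:
            key, rest = line.split(":", 1)
            fields[key] = int(rest.split()[0])    # values are reported in kB
    total = fields["MemTotal"]
    available = fields["MemAvailable"]            # kernel's estimate of usable memory
    return 100.0 * (total - available) / total


print(f"memory used: {memory_used_percent():.1f}%")
```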

+ +- `df --` The df command displays disk space usage statistics. The + `-i` command-line option is also often used to display + [inode](https://en.wikipedia.org/wiki/Inode) usage + statistics. The `-h` command-line option is used for displaying + statistics in a human-readable format. + +![Disk usage statistics on a system in human-readable form](images/image9.png)

Figure 5: Disk usage statistics on a system in human-readable form
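The same filesystem statistics are available to programs without shelling out to `df`; for example, Python's standard library exposes them through `shutil.disk_usage`. A short sketch:

```python
import shutil


def disk_used_percent(path="/"):
    """Percentage of used space on the filesystem containing `path`."""
    usage = shutil.disk_usage(path)    # named tuple: total, used, free (in bytes)
    return 100.0 * usage.used / usage.total


print(f"{disk_used_percent('/'):.1f}% of / is in use")
```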

+ +- `sar` -- The sar utility monitors various subsystems, such as CPU + and memory, in real time. This data can be stored in a file + specified with the `-o` option. This tool helps to identify + anomalies. + +- `iftop` -- The interface top command (`iftop`) displays bandwidth + utilization by a host on an interface. This command is often used + to identify bandwidth usage by active connections. The `-i` option + specifies which network interface to watch. + +![Network bandwidth usage by + active connection on the host](images/image2.png)

Figure 6: Network bandwidth usage by active connection on the host
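Both of these tools are usually run interactively, but `sar` can also be driven from a script for ad-hoc collection. A brief sketch (assuming the sysstat package that provides `sar` is installed) that samples CPU utilization three times at one-second intervals and keeps the text report:

```python
import subprocess

# Sample CPU utilization (-u) every 1 second, 3 times, and keep the report.
report = subprocess.run(
    ["sar", "-u", "1", "3"], capture_output=True, text=True, check=True
).stdout

with open("/tmp/sar_cpu_report.txt", "w") as report_file:
    report_file.write(report)

print(report.splitlines()[-1])    # the "Average:" line summarizes the samples
```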

- `tcpdump` -- The tcpdump command is a network monitoring tool that captures packets flowing over the network and displays a description of the captured packets. The following options are available:

    - `-i <interface>` -- Interface to listen on

    - `host <hostname>` -- Filters traffic going to or from the specified host

    - `src/dst` -- Displays one-way traffic from the source (src) or to the destination (dst)

    - `port <port>` -- Filters traffic to or from a particular port

![tcpdump of packets on an interface](images/image10.png)

Figure 7: *tcpdump* of packets on the *docker0* interface on a host
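These filters can be combined on a single command line and scripted as well. The sketch below captures a handful of packets and prints their descriptions; the interface name and port are illustrative, and tcpdump normally requires root privileges.

```python
import subprocess

# Capture 5 HTTPS packets on eth0 without resolving names (-n).
# Run with sufficient privileges; tcpdump usually requires root.
capture = subprocess.run(
    ["tcpdump", "-i", "eth0", "-c", "5", "-n", "port", "443"],
    capture_output=True, text=True,
)
print(capture.stdout or capture.stderr)
```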

\ No newline at end of file diff --git a/courses/metrics_and_monitoring/conclusion.md b/courses/metrics_and_monitoring/conclusion.md new file mode 100644 index 0000000..3119fce --- /dev/null +++ b/courses/metrics_and_monitoring/conclusion.md @@ -0,0 +1,50 @@ +# Conclusion + +A robust monitoring and alerting system is necessary for maintaining and +troubleshooting a system. A dashboard with key metrics can give you an +overview of service performance, all in one place. Well-defined alerts +(with realistic thresholds and notifications) further enable you to +quickly identify any anomalies in the service infrastructure and in +resource saturation. By taking necessary actions, you can avoid any +service degradations and decrease MTTD for service breakdowns. + +In addition to in-house monitoring, monitoring real user experience can +help you to understand service performance as perceived by the users. +Many modules are involved in serving the user, and most of them are out +of your control. Therefore, you need to have real-user monitoring in +place. + +Metrics give very abstract details on service performance. To get a +better understanding of the system and for faster recovery during +incidents, you might want to implement the other two pillars of +observability: logs and tracing. Logs and trace data can help you +understand what led to service failure or degradation. + +Following are some resources to learn more about monitoring and +observability: + +- [Google SRE book: Monitoring Distributed + Systems](https://sre.google/sre-book/monitoring-distributed-systems/) + +- [Mastering Distributed Tracing by Yuri + Shkuro](https://learning.oreilly.com/library/view/mastering-distributed-tracing/9781788628464/) + +- Engineering blogs on + [LinkedIn](https://engineering.linkedin.com/blog/topic/monitoring), + [Grafana](https://grafana.com/blog/), + [Elastic.co](https://www.elastic.co/blog/), + [OpenTelemetry](https://medium.com/opentelemetry) + +## References + +- [Google SRE book: Monitoring Distributed + Systems](https://sre.google/sre-book/monitoring-distributed-systems/) + +- [Mastering Distributed Tracing, by Yuri + Shkuro](https://learning.oreilly.com/library/view/mastering-distributed-tracing/9781788628464/) + +- [Monitoring and + Observability](https://copyconstruct.medium.com/monitoring-and-observability-8417d1952e1c) + +- [Three PIllars with Zero + Answers](https://medium.com/lightstephq/three-pillars-with-zero-answers-2a98b36358b8) diff --git a/courses/metrics_and_monitoring/images/image1.jpg b/courses/metrics_and_monitoring/images/image1.jpg new file mode 100644 index 0000000..776248f Binary files /dev/null and b/courses/metrics_and_monitoring/images/image1.jpg differ diff --git a/courses/metrics_and_monitoring/images/image10.png b/courses/metrics_and_monitoring/images/image10.png new file mode 100644 index 0000000..2bae97a Binary files /dev/null and b/courses/metrics_and_monitoring/images/image10.png differ diff --git a/courses/metrics_and_monitoring/images/image11.png b/courses/metrics_and_monitoring/images/image11.png new file mode 100644 index 0000000..41bf3d3 Binary files /dev/null and b/courses/metrics_and_monitoring/images/image11.png differ diff --git a/courses/metrics_and_monitoring/images/image12.png b/courses/metrics_and_monitoring/images/image12.png new file mode 100644 index 0000000..1588af3 Binary files /dev/null and b/courses/metrics_and_monitoring/images/image12.png differ diff --git a/courses/metrics_and_monitoring/images/image2.png 
b/courses/metrics_and_monitoring/images/image2.png new file mode 100644 index 0000000..c6cee36 Binary files /dev/null and b/courses/metrics_and_monitoring/images/image2.png differ diff --git a/courses/metrics_and_monitoring/images/image3.jpg b/courses/metrics_and_monitoring/images/image3.jpg new file mode 100644 index 0000000..26c68e9 Binary files /dev/null and b/courses/metrics_and_monitoring/images/image3.jpg differ diff --git a/courses/metrics_and_monitoring/images/image4.jpg b/courses/metrics_and_monitoring/images/image4.jpg new file mode 100644 index 0000000..c3266d4 Binary files /dev/null and b/courses/metrics_and_monitoring/images/image4.jpg differ diff --git a/courses/metrics_and_monitoring/images/image5.jpg b/courses/metrics_and_monitoring/images/image5.jpg new file mode 100644 index 0000000..95abeaf Binary files /dev/null and b/courses/metrics_and_monitoring/images/image5.jpg differ diff --git a/courses/metrics_and_monitoring/images/image6.png b/courses/metrics_and_monitoring/images/image6.png new file mode 100644 index 0000000..70115ae Binary files /dev/null and b/courses/metrics_and_monitoring/images/image6.png differ diff --git a/courses/metrics_and_monitoring/images/image7.png b/courses/metrics_and_monitoring/images/image7.png new file mode 100644 index 0000000..55adbe6 Binary files /dev/null and b/courses/metrics_and_monitoring/images/image7.png differ diff --git a/courses/metrics_and_monitoring/images/image8.png b/courses/metrics_and_monitoring/images/image8.png new file mode 100644 index 0000000..67fab10 Binary files /dev/null and b/courses/metrics_and_monitoring/images/image8.png differ diff --git a/courses/metrics_and_monitoring/images/image9.png b/courses/metrics_and_monitoring/images/image9.png new file mode 100644 index 0000000..9a71bf0 Binary files /dev/null and b/courses/metrics_and_monitoring/images/image9.png differ diff --git a/courses/metrics_and_monitoring/introduction.md b/courses/metrics_and_monitoring/introduction.md new file mode 100644 index 0000000..9f01831 --- /dev/null +++ b/courses/metrics_and_monitoring/introduction.md @@ -0,0 +1,280 @@ +## + +# Prerequisites + +- [Linux Basics](https://linkedin.github.io/school-of-sre/linux_basics/intro/) + +- [Python and the Web](https://linkedin.github.io/school-of-sre/python_web/intro/) + +- [Systems Design](https://linkedin.github.io/school-of-sre/systems_design/intro/) + +- [Linux Networking Fundamentals](https://linkedin.github.io/school-of-sre/linux_networking/intro/) + + +## What to expect from this course + +Monitoring is an integral part of any system. As an SRE, you need to +have a basic understanding of monitoring a service infrastructure. By +the end of this course, you will gain a better understanding of the +following topics: + +- What is monitoring? + + - What needs to be measured + + - How the metrics gathered can be used to improve business decisions and overall reliability + + - Proactive monitoring with alerts + + - Log processing and its importance + +- What is observability? 
+ + - Distributed tracing + + - Logs + + - Metrics + +## What is not covered in this course + +- Guide to setting up a monitoring infrastructure + +- Deep dive into different monitoring technologies and benchmarking or comparison of any tools + + +## Course content + +- [Introduction](#introduction) + + - [Four golden signals of monitoring](#four-golden-signals-of-monitoring) + + - [Why is monitoring important?](#why-is-monitoring-important) + +- [Command-line tools](command-line_tools.md) + +- [Third-party monitoring](third-party_monitoring.md) + +- [Proactive monitoring using alerts](alerts.md) + +- [Best practices for monitoring](best_practices.md) + +- [Observability](observability.md) + + - [Logs](observability.md#logs) + - [Tracing](observability.md#tracing) + +[Conclusion](conclusion.md) + + +## + +# Introduction + +Monitoring is a process of collecting real-time performance metrics from +a system, analyzing the data to derive meaningful information, and +displaying the data to the users. In simple terms, you measure various +metrics regularly to understand the state of the system, including but +not limited to, user requests, latency, and error rate. *What gets +measured, gets fixed*---if you can measure something, you can reason +about it, understand it, discuss it, and act upon it with confidence. + + +## Four golden signals of monitoring + +When setting up monitoring for a system, you need to decide what to +measure. The four golden signals of monitoring provide a good +understanding of service performance and lay a foundation for monitoring +a system. These four golden signals are + +- Traffic + +- Latency + +- Error + +- Saturation + +These metrics help you to understand the system performance and +bottlenecks, and to create a better end-user experience. As discussed in +the [Google SRE +book](https://sre.google/sre-book/monitoring-distributed-systems/), +if you can measure only four metrics of your service, focus on these +four. Let's look at each of the four golden signals. + +- **Traffic** -- *Traffic* gives a better understanding of the service + demand. Often referred to as *service QPS* (queries per second), + traffic is a measure of requests served by the service. This + signal helps you to decide when a service needs to be scaled up to + handle increasing customer demand and scaled down to be + cost-effective. + +- **Latency** -- *Latency* is the measure of time taken by the service + to process the incoming request and send the response. Measuring + service latency helps in the early detection of slow degradation + of the service. Distinguishing between the latency of successful + requests and the latency of failed requests is important. For + example, an [HTTP 5XX + error](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status#server_error_responses) + triggered due to loss of connection to a database or other + critical backend might be served very quickly. However, because an + HTTP 500 error indicates a failed request, factoring 500s into + overall latency might result in misleading calculations. + +- **Error (rate)** -- *Error* is the measure of failed client + requests. These failures can be easily identified based on the + response codes ([HTTP 5XX + error](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status#server_error_responses)). + There might be cases where the response is considered erroneous + due to wrong result data or due to policy violations. 
For example, + you might get an [HTTP + 200](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/200) + response, but the body has incomplete data, or response time is + breaching the agreed-upon + [SLA](https://en.wikipedia.org/wiki/Service-level_agreement)s. + Therefore, you need to have other mechanisms (code logic or + [instrumentation](https://en.wikipedia.org/wiki/Instrumentation_(computer_programming))) + in place to capture errors in addition to the response codes. + +- **Saturation** -- *Saturation* is a measure of the resource + utilization by a service. This signal tells you the state of + service resources and how full they are. These resources include + memory, compute, network I/O, and so on. Service performance + slowly degrades even before resource utilization is at 100 + percent. Therefore, having a utilization target is important. An + increase in latency is a good indicator of saturation; measuring + the [99th + percentile](https://medium.com/@ankur_anand/an-in-depth-introduction-to-99-percentile-for-programmers-22e83a00caf) + of latency can help in the early detection of saturation. + +Depending on the type of service, you can measure these signals in +different ways. For example, you might measure queries per second served +for a web server. In contrast, for a database server, transactions +performed and database sessions created give you an idea about the +traffic handled by the database server. With the help of additional code +logic (monitoring libraries and instrumentation), you can measure these +signals periodically and store them for future analysis. Although these +metrics give you an idea about the performance at the service end, you +need to also ensure that the same user experience is delivered at the +client end. Therefore, you might need to monitor the service from +outside the service infrastructure, which is discussed under third-party +monitoring. + +## Why is monitoring important? + +Monitoring plays a key role in the success of a service. As discussed +earlier, monitoring provides performance insights for understanding +service health. With access to historical data collected over time, you +can build intelligent applications to address specific needs. Some of +the key use cases follow: + +- **Reduction in time to resolve issues** -- With a good monitoring + infrastructure in place, you can identify issues quickly and + resolve them, which reduces the impact caused by the issues. + +- **Business decisions** -- Data collected over a period of time can + help you make business decisions such as determining the product + release cycle, which features to invest in, and geographical areas + to focus on. Decisions based on long-term data can improve the + overall product experience. + +- **Resource planning** -- By analyzing historical data, you can + forecast service compute-resource demands, and you can properly + allocate resources. This allows financially effective decisions, + with no compromise in end-user experience. + +Before we dive deeper into monitoring, let's understand some basic +terminologies. 
+ +- **Metric** -- A metric is a quantitative measure of a particular + system attribute---for example, memory or CPU + +- **Node or host** -- A physical server, virtual machine, or container + where an application is running + +- **QPS** -- *Queries Per Second*, a measure of traffic served by the + service per second + +- **Latency** -- The time interval between user action and the + response from the server---for example, time spent after sending a + query to a database before the first response bit is received + +- **Error** **rate** -- Number of errors observed over a particular + time period (usually a second) + +- **Graph** -- In monitoring, a graph is a representation of one or + more values of metrics collected over time + +- **Dashboard** -- A dashboard is a collection of graphs that provide + an overview of system health + +- **Incident** -- An incident is an event that disrupts the normal + operations of a system + +- **MTTD** -- *Mean Time To Detect* is the time interval between the + beginning of a service failure and the detection of such failure + +- **MTTR** -- Mean Time To Resolve is the time spent to fix a service + failure and bring the service back to its normal state + +Before we discuss monitoring an application, let us look at the +monitoring infrastructure. Following is an illustration of a basic +monitoring system. + +![Illustration of a monitoring infrastructure](images/image1.jpg)

Figure 1: Illustration of a monitoring infrastructure
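To connect the four golden signals discussed above with the infrastructure shown in Figure 1, here is a minimal, illustrative sketch of the application side: a request handler instrumented with the `statsd` Python client (one assumed choice; any metrics library would do) that pushes traffic, error, and latency metrics to a StatsD-style aggregator listening on localhost:8125. Saturation metrics (memory, CPU, connection pools) would typically come from a host metrics agent rather than from the request path.

```python
import random
import time

import statsd  # assumed client library: `pip install statsd`

metrics = statsd.StatsClient("localhost", 8125, prefix="url_shortener")


def instrumented(handler):
    """Wrap a request handler to emit traffic, error, and latency metrics."""
    def wrapper(*args, **kwargs):
        metrics.incr("requests")                        # traffic
        start = time.time()
        try:
            return handler(*args, **kwargs)
        except Exception:
            metrics.incr("errors")                      # error rate
            raise
        finally:
            elapsed_ms = (time.time() - start) * 1000
            metrics.timing("latency_ms", elapsed_ms)    # latency
    return wrapper


@instrumented
def shorten_url(long_url):
    time.sleep(random.uniform(0.01, 0.05))              # simulate work
    return "https://sho.rt/" + str(abs(hash(long_url)) % 10_000)


for _ in range(10):
    shorten_url("https://example.com/some/long/path")
```

An aggregator such as StatsD batches these values and forwards them to the metrics collector, which is exactly the flow described by the components below.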

+ +Figure 1 shows a monitoring infrastructure mechanism for aggregating +metrics on the system, and collecting and storing the data for display. +In addition, a monitoring infrastructure includes alert subsystems for +notifying concerned parties during any abnormal behavior. Let's look at +each of these infrastructure components: + +- **Host metrics agent --** A *host metrics agent* is a process + running on the host that collects performance statistics for host + subsystems such as memory, CPU, and network. These metrics are + regularly relayed to a metrics collector for storage and + visualization. Some examples are + [collectd](https://collectd.org/), + [telegraf](https://www.influxdata.com/time-series-platform/telegraf/), + and [metricbeat](https://www.elastic.co/beats/metricbeat). + +- **Metric aggregator --** A *metric aggregator* is a process running + on the host. Applications running on the host collect service + metrics using + [instrumentation](https://en.wikipedia.org/wiki/Instrumentation_(computer_programming)). + Collected metrics are sent either to the aggregator process or + directly to the metrics collector over API, if available. Received + metrics are aggregated periodically and relayed to the metrics + collector in batches. An example is + [StatsD](https://github.com/statsd/statsd). + +- **Metrics collector --** A *metrics collector* process collects all + the metrics from the metric aggregators running on multiple hosts. + The collector takes care of decoding and stores this data on the + database. Metric collection and storage might be taken care of by + one single service such as + [InfluxDB](https://www.influxdata.com/), which we discuss + next. An example is [carbon + daemons](https://graphite.readthedocs.io/en/latest/carbon-daemons.html). + +- **Storage --** A time-series database stores all of these metrics. + Examples are [OpenTSDB](http://opentsdb.net/), + [Whisper](https://graphite.readthedocs.io/en/stable/whisper.html), + and [InfluxDB](https://www.influxdata.com/). + +- **Metrics server --** A *metrics server* can be as basic as a web + server that graphically renders metric data. In addition, the + metrics server provides aggregation functionalities and APIs for + fetching metric data programmatically. Some examples are + [Grafana](https://github.com/grafana/grafana) and + [Graphite-Web](https://github.com/graphite-project/graphite-web). + +- **Alert manager --** The *alert manager* regularly polls metric data + available and, if there are any anomalies detected, notifies you. + Each alert has a set of rules for identifying such anomalies. + Today many metrics servers such as + [Grafana](https://github.com/grafana/grafana) support alert + management. We discuss alerting [in detail + later](#proactive-monitoring-using-alerts). Examples are + [Grafana](https://github.com/grafana/grafana) and + [Icinga](https://icinga.com/). diff --git a/courses/metrics_and_monitoring/observability.md b/courses/metrics_and_monitoring/observability.md new file mode 100644 index 0000000..cdec5a4 --- /dev/null +++ b/courses/metrics_and_monitoring/observability.md @@ -0,0 +1,151 @@ +## + +# Observability + +Engineers often use observability when referring to building reliable +systems. *Observability* is a term derived from control theory, It is a +measure of how well internal states of a system can be inferred from +knowledge of its external outputs. 
Service infrastructures are becoming more and more complex; proactive monitoring alone is not sufficient to quickly resolve issues causing application failures. With monitoring, you can keep known past failures from recurring, but in a complex service architecture many unknown factors can cause potential problems. To address such cases, you can make the service observable. An observable system provides highly granular insights into its implicit failure modes. In addition, an observable system furnishes ample context about its inner workings, which unlocks the ability to uncover deeper systemic issues.

Monitoring enables failure detection; observability helps in gaining a better understanding of the system. Among engineers, there is a common misconception that monitoring and observability are two different things. Actually, observability is a superset of monitoring; that is, monitoring is one of the ways to improve a service's observability. The goal of observability is not only to detect problems, but also to understand where the issue is and what is causing it. In addition to metrics, observability has two more pillars: logs and traces, as shown in Figure 9. Although these three components do not make a system 100 percent observable, they are the most important and powerful components for gaining a better understanding of the system. Each of these pillars has its flaws, which are described in [Three Pillars with Zero Answers](https://medium.com/lightstephq/three-pillars-with-zero-answers-2a98b36358b8).

![Three pillars of observability](images/image7.png)

Figure 9: Three pillars of observability

+ +Because we have covered metrics already, let's look at the other two +pillars (logs and traces). + +#### Logs + +Logs (often referred to as *events*) are a record of activities +performed by a service during its run time, with a corresponding +timestamp. Metrics give abstract information about degradations in a +system, and logs give a detailed view of what is causing these +degradations. Logs created by the applications and infrastructure +components help in effectively understanding the behavior of the system +by providing details on application errors, exceptions, and event +timelines. Logs help you to go back in time to understand the events +that led to a failure. Therefore, examining logs is essential to +troubleshooting system failures. + +Log processing involves the aggregation of different logs from +individual applications and their subsequent shipment to central +storage. Moving logs to central storage helps to preserve the logs, in +case the application instances are inaccessible, or the application +crashes due to a failure. After the logs are available in a central +place, you can analyze the logs to derive sensible information from +them. For audit and compliance purposes, you archive these logs on the +central storage for a certain period of time. Log analyzers fetch useful +information from log lines, such as request user information, request +URL (feature), and response headers (such as content length) and +response time. This information is grouped based on these attributes and +made available to you through a visualization tool for quick +understanding. + +You might be wondering how this log information helps. This information +gives a holistic view of activities performed on all the involved +entities. For example, let's say someone is performing a DoS (denial of +service) attack on a web application. With the help of log processing, +you can quickly look at top client IPs derived from access logs and +identify where the attack is coming from. + +Similarly, if a feature in an application is causing a high error rate +when accessed with a particular request parameter value, the results of +log analysis can help you to quickly identify the misbehaving parameter +value and take further action. + +![Log processing and analysis using ELK stack](images/image4.jpg) +

Figure 10: Log processing and analysis using ELK stack
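As a small illustration of the DoS example above, the core of that analysis can be sketched with nothing more than Python's standard library. The access-log path and format are assumptions (a common layout in which the client IP is the first field on each line); in practice this aggregation would be done in the log platform itself, for example as a Kibana visualization over the indexed access logs.

```python
from collections import Counter


def top_client_ips(access_log_path, n=5):
    """Count requests per client IP, assuming the IP is the first field of each line."""
    counts = Counter()
    with open(access_log_path) as access_log:
        for line in access_log:
            if line.strip():
                counts[line.split()[0]] += 1
    return counts.most_common(n)


# Path is illustrative; use your web server's access log location.
for ip, hits in top_client_ips("/var/log/nginx/access.log"):
    print(f"{ip:15s} {hits}")
```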

Figure 10 shows a log processing platform using ELK (Elasticsearch, Logstash, Kibana), which provides centralized log processing. Beats is a collection of lightweight data shippers that can ship logs, audit data, network data, and so on over the network. In this use case specifically, we are using Filebeat as a log shipper. Filebeat watches service log files and ships the log data to Logstash. Logstash parses these logs and transforms the data, preparing it to be stored in Elasticsearch. Transformed log data is stored in Elasticsearch and indexed for fast retrieval. Kibana searches and displays log data stored in Elasticsearch. Kibana also provides a set of visualizations for graphically displaying summaries derived from log data.

Storing logs is expensive. Extensive logging of every event on the server is costly and takes up more storage space, and this cost grows with the number of services.

#### Tracing

So far, we covered the importance of metrics and logging. Metrics give an abstract overview of the system, and logging gives a record of events that occurred. Imagine a complex distributed system with multiple microservices, where a user request is processed by multiple microservices in the system. Metrics and logging give you some information about how these requests are being handled by the system, but they fail to provide detailed information across all the microservices and how they affect a particular client request. If a slow downstream microservice is leading to increased response times, you need detailed visibility across all the involved microservices to identify which one is the culprit. The answer to this need is a request tracing mechanism.

A trace is a series of spans, where each span is a record of events performed by different microservices to serve the client's request. In simple terms, a trace is a log of client-request serving derived from various microservices across different physical machines. Each span includes span metadata, such as trace ID and span ID, and context, which includes information about transactions performed.

![Trace and spans for a URL shortener request](images/image3.jpg)

Figure 11: Trace and spans for a URL shortener request

Figure 11 is a graphical representation of a trace captured on the [URL shortener](https://linkedin.github.io/school-of-sre/python_web/url-shorten-app/) example we covered earlier while learning Python.

Similar to monitoring, the tracing infrastructure comprises a few modules for collecting traces, storing them, and accessing them. Each microservice runs a tracing library that collects traces in the background, creates in-memory batches, and submits them to the tracing backend. The tracing backend normalizes the received trace data and stores it in persistent storage. Tracing data comes from multiple different microservices; therefore, trace storage is often organized to store data incrementally and is indexed by trace identifier. This organization helps in the reconstruction of trace data and in visualization. Figure 12 illustrates the anatomy of distributed tracing.

![Anatomy of distributed tracing](images/image5.jpg)

Figure 12: Anatomy of distributed tracing
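A minimal sketch of what the instrumentation side of this anatomy can look like, using the OpenTelemetry Python packages (`opentelemetry-api` and `opentelemetry-sdk`) as one possible choice among the tools listed below. For brevity the finished spans are printed to the console; a real deployment would configure an exporter that ships them to a tracing backend such as Jaeger or Zipkin.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Register a tracer provider that writes finished spans to stdout.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("url-shortener")


def shorten(long_url):
    # Parent span: one per client request.
    with tracer.start_as_current_span("POST /shorten") as span:
        span.set_attribute("url.original", long_url)
        # Child span: work done by a downstream component, such as the datastore.
        with tracer.start_as_current_span("db.save_short_url"):
            pass  # database call would go here
        return "https://sho.rt/abc123"


shorten("https://example.com/some/long/path")
```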

Today a set of tools and frameworks are available for building distributed tracing solutions. Following are some of the popular tools:

- [OpenTelemetry](https://opentelemetry.io/): Observability framework for cloud-native software

- [Jaeger](https://www.jaegertracing.io/): Open-source distributed tracing solution

- [Zipkin](https://zipkin.io/): Open-source distributed tracing solution

diff --git a/courses/metrics_and_monitoring/third-party_monitoring.md b/courses/metrics_and_monitoring/third-party_monitoring.md
new file mode 100644
index 0000000..e968caf
--- /dev/null
+++ b/courses/metrics_and_monitoring/third-party_monitoring.md
@@ -0,0 +1,37 @@

# Third-party monitoring

Today most cloud providers offer a variety of monitoring solutions. In addition, a number of companies such as [Datadog](https://www.datadoghq.com/) offer monitoring-as-a-service. In this section, we are not covering monitoring-as-a-service in depth.

In recent years, more and more people have access to the internet. Many services are offered online to cater to the increasing user base. As a result, web pages are becoming larger, with increased client-side scripts. Users want these services to be fast and error-free. From the service point of view, when the response body is composed, an HTTP 200 OK response is sent, and everything looks okay. But there might be errors during transmission or on the client side. As previously mentioned, monitoring services from within the service infrastructure gives good visibility into service health, but this is not enough. You need to monitor the user experience, specifically the availability of services for clients. A number of third-party services such as [Catchpoint](https://www.catchpoint.com/) and [Pingdom](https://www.pingdom.com/) are available for achieving this goal.

Third-party monitoring services can generate synthetic traffic simulating user requests from various parts of the world, to ensure the service is globally accessible. Other third-party monitoring solutions for real user monitoring (RUM) provide performance statistics, such as service uptime and response time, from different geographical locations. This allows you to monitor the user experience from these locations, which might have different internet backbones, different operating systems, and different browsers and browser versions. [Catchpoint Global Monitoring Network](https://pages.catchpoint.com/overview-video) is a comprehensive 3-minute video that explains the importance of monitoring the client experience.
\ No newline at end of file
diff --git a/mkdocs.yml b/mkdocs.yml
index 07e9671..14710ed 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -57,6 +57,14 @@ nav:
  - Availability: systems_design/availability.md
  - Fault Tolerance: systems_design/fault-tolerance.md
  - Conclusion: systems_design/conclusion.md
+- Metrics and Monitoring:
+  - Introduction: metrics_and_monitoring/introduction.md
+  - Command-line Tools: metrics_and_monitoring/command-line_tools.md
+  - Third-party Monitoring: metrics_and_monitoring/third-party_monitoring.md
+  - Proactive Monitoring with Alerts: metrics_and_monitoring/alerts.md
+  - Best Practices for Monitoring: metrics_and_monitoring/best_practices.md
+  - Observability: metrics_and_monitoring/observability.md
+  - Conclusion: metrics_and_monitoring/conclusion.md
- Security:
  - Introduction: security/intro.md
  - Fundamentals of Security: security/fundamentals.md