Adding course system troubleshooting and performance (#114)

Co-authored-by: Himanshu Chandwani <hchandwa@hchandwa-ld2.linkedin.biz>
3 years ago · 95b5e64cfb
parent 4b1c22ec44
commit 95b5e64cfb
17 changed files with 275 additions and 8 deletions
--- a/courses/level102/system_troubleshooting_and_performance/conclusion.md
+++ b/courses/level102/system_troubleshooting_and_performance/conclusion.md
@ -0,0 +1,12 @@
+Complex systems have many components that can go wrong: bad design and architecture, poorly managed code, poor caching policies, bad DB queries or architecture, improper use of resources, a bad OS version, a poorly monitored system, datacenter issues, network faults, and many more.
+
+As an SRE, knowing the important tools and commands, best practices, profiling, benchmarking and scaling can help you troubleshoot faster and improve the performance of the overall system.
+
+## Further readings
+
+Here are some articles, mostly from the LinkedIn Engineering Blog, written by LinkedIn engineers about the firefighting they did to keep the site up 24x7x365.
+
+- [Taming memory fragmentation in Venice with Jemalloc](https://engineering.linkedin.com/blog/2021/taming-memory-fragmentation-in-venice-with-jemalloc)
+- [Intro: Every Day Is Monday in Operations](https://www.linkedin.com/pulse/introduction-every-day-monday-operations-benjamin-purgason)
+- [Fixing Linux filesystem performance regressions](https://engineering.linkedin.com/blog/2020/fixing-linux-filesystem-performance-regressions)
+- [The impact of slow NFS on data systems](https://engineering.linkedin.com/blog/2020/the-impact-of-slow-nfs-on-data-systems)
--- a/courses/level102/system_troubleshooting_and_performance/images/FlaskCode.png
+++ b/courses/level102/system_troubleshooting_and_performance/images/FlaskCode.png
--- a/courses/level102/system_troubleshooting_and_performance/images/FlaskStart.png
+++ b/courses/level102/system_troubleshooting_and_performance/images/FlaskStart.png
--- a/courses/level102/system_troubleshooting_and_performance/images/MemUsage01.png
+++ b/courses/level102/system_troubleshooting_and_performance/images/MemUsage01.png
--- a/courses/level102/system_troubleshooting_and_performance/images/MemUsage02.png
+++ b/courses/level102/system_troubleshooting_and_performance/images/MemUsage02.png
--- a/courses/level102/system_troubleshooting_and_performance/images/MemUsage03.png
+++ b/courses/level102/system_troubleshooting_and_performance/images/MemUsage03.png
--- a/courses/level102/system_troubleshooting_and_performance/images/MemUsageChart.png
+++ b/courses/level102/system_troubleshooting_and_performance/images/MemUsageChart.png
--- a/courses/level102/system_troubleshooting_and_performance/images/Tracemalloc01.png
+++ b/courses/level102/system_troubleshooting_and_performance/images/Tracemalloc01.png
--- a/courses/level102/system_troubleshooting_and_performance/images/Tracemalloc02.png
+++ b/courses/level102/system_troubleshooting_and_performance/images/Tracemalloc02.png
--- a/courses/level102/system_troubleshooting_and_performance/images/Tracemalloc03.png
+++ b/courses/level102/system_troubleshooting_and_performance/images/Tracemalloc03.png
--- a/courses/level102/system_troubleshooting_and_performance/images/TroubleshootingFlow.jpg
+++ b/courses/level102/system_troubleshooting_and_performance/images/TroubleshootingFlow.jpg
--- a/courses/level102/system_troubleshooting_and_performance/important-tools.md
+++ b/courses/level102/system_troubleshooting_and_performance/important-tools.md
@ -0,0 +1,29 @@
+### Important Linux commands
+
+Knowing the following commands will help you find issues faster. Explaining each command in detail is out of scope; please refer to the man pages or online resources for more information and examples.
+
+- For logs parsing -: grep, sed, awk, cut, tail, head
+- For network checks -: nc, netstat, traceroute/6, mtr, ping/6, route, tcpdump, ss, ip
+- For DNS -: dig, host, nslookup
+- For tracing system calls -: strace
+- For parallel executions over ssh -: GNU parallel, xargs + ssh
+- For http/s checks -: curl, wget
+- For list of open files -: lsof
+- For modifying attributes of the system kernel -: [sysctl](https://man7.org/linux/man-pages/man8/sysctl.8.html)
+
+For distributed systems, some good third-party tools can help execute commands/instructions on many hosts at once, such as:
+
+- **SSH based tools**
+    - [ClusterSSH](https://github.com/duncs/clusterssh): ClusterSSH can help you run a command in parallel on many hosts at once.
+    - [Ansible](https://github.com/ansible/ansible): It allows you to write Ansible playbooks which you can run on hundreds/thousands of hosts at the same time.
+- **Agent Based tools**
+    - [Saltstack](https://github.com/saltstack/salt): A configuration, state and remote execution framework that gives users a lot of flexibility to execute modules on a large number of hosts at once.
+    - [Puppet](https://github.com/puppetlabs/puppet): An automated administrative engine for your Linux, Unix, and Windows systems that performs administrative tasks.
+
+### Log analysis tools
+
+These tools let you write SQL-like queries to parse and analyse logs, and provide an easy UI to create dashboards that render various types of charts based on defined queries.
+
+- [ELK](https://www.elastic.co/what-is/elk-stack): Elasticsearch, Logstash and Kibana provide a package of tools and services to parse, index and analyse logs easily and quickly. Once logs/data are parsed/filtered through Logstash and indexed in Elasticsearch, one can create dynamic dashboards in Kibana in a matter of minutes. This enables easy analysis and correlation of application errors/exceptions/warnings.
+- [Azure Kusto](https://docs.microsoft.com/en-us/azure/data-explorer): Azure Kusto is a cloud-based service similar to Elasticsearch and Kibana. It allows easy indexing of heavy logs, and provides a SQL-like interface for writing queries and an interface to create dynamic dashboards.
+
--- a/courses/level102/system_troubleshooting_and_performance/introduction.md
+++ b/courses/level102/system_troubleshooting_and_performance/introduction.md
@ -0,0 +1,66 @@
+# System troubleshooting and performance improvements
+
+## Prerequisites
+
+* [Linux Basics](https://linkedin.github.io/school-of-sre/level101/linux_basics/intro/)
+* [System design](https://linkedin.github.io/school-of-sre/level101/systems_design/intro/)
+* [Basic Networking](https://linkedin.github.io/school-of-sre/level101/linux_networking/intro/)
+* [Metrics and Monitoring](https://linkedin.github.io/school-of-sre/level101/metrics_and_monitoring/introduction/)
+
+## What to expect from this course
+
+This brief course provides a general introduction to troubleshooting system issues, such as analysing API failures,
+resource utilization, network issues, and hardware and OS issues. The course also briefly covers profiling and benchmarking to measure overall system performance.
+
+## What is not covered under this course
+
+This course does not cover the following -:
+
+* System Design and Architecture.
+* Programming practices.
+* Metrics and Monitoring.
+* OS basics.
+
+## Course Contents
+- [Introduction](https://linkedin.github.io/school-of-sre/level102/system_troubleshooting_and_performance/introduction)
+- [Troubleshooting](https://linkedin.github.io/school-of-sre/level102/system_troubleshooting_and_performance/troubleshooting)
+    - [Troubleshooting Flowchart](https://linkedin.github.io/school-of-sre/level102/system_troubleshooting_and_performance/troubleshooting/#troubleshooting-flowchart)
+    - [General Practices](https://linkedin.github.io/school-of-sre/level102/system_troubleshooting_and_performance/troubleshooting/#general-practices)
+    - [General Host issues](https://linkedin.github.io/school-of-sre/level102/system_troubleshooting_and_performance/troubleshooting/#general-host-issues)
+- [Important tools to know](https://linkedin.github.io/school-of-sre/level102/system_troubleshooting_and_performance/important-tools)
+    - [Important linux commands](https://linkedin.github.io/school-of-sre/level102/system_troubleshooting_and_performance/important-tools/#important-linux-commands)
+    - [Log analysis tools](https://linkedin.github.io/school-of-sre/level102/system_troubleshooting_and_performance/important-tools/#log-analysis-tools)
+- [Performance improvements](https://linkedin.github.io/school-of-sre/level102/system_troubleshooting_and_performance/performance-improvements)
+    - [Performance analysis commands](https://linkedin.github.io/school-of-sre/level102/system_troubleshooting_and_performance/performance-improvements/#performance-analysis-commands)
+    - [Profiling tools](https://linkedin.github.io/school-of-sre/level102/system_troubleshooting_and_performance/performance-improvements/#profiling-tools)
+    - [Benchmarking](https://linkedin.github.io/school-of-sre/level102/system_troubleshooting_and_performance/performance-improvements/#benchmarking)
+    - [Scaling](https://linkedin.github.io/school-of-sre/level102/system_troubleshooting_and_performance/performance-improvements/#scaling)
+- [Troubleshooting Example](https://linkedin.github.io/school-of-sre/level102/system_troubleshooting_and_performance/troubleshooting-example)
+- [Conclusion](https://linkedin.github.io/school-of-sre/level102/system_troubleshooting_and_performance/conclusion)
+    - [Further readings](https://linkedin.github.io/school-of-sre/level102/system_troubleshooting_and_performance/conclusion/#further-readings)
+
+## Introduction
+Troubleshooting is an important part of operations & development. It can’t be learned by just reading one article or completing a course online.
+It’s a continuous learning process; one learns it during -:
+
+* Daily operations and development.
+* Finding & Fixing application bugs.
+* Finding & Fixing system & network issues.
+* Performance analysis and improvements.
+* And more.
+
+An SRE is expected to be aware of certain topics upfront in order to troubleshoot problems in single or distributed systems.
+
+* Know your resources well and understand host specifications, like CPU, memory, network, disk, etc.
+* Understand system design and architecture.
+* Ensure important metrics are being collected/rendered properly.
+
+There is a famous quote attributed to the HP founders - **“What gets measured gets fixed”**
+
+If system components and performance metrics are captured thoroughly, there is a high chance of troubleshooting an issue successfully and early.
+
+### Scope
+There is no common approach to troubleshooting different types of applications or services, and a failure can occur at any layer. We will limit the scope of this course to web API services only.
+
+**Note -:** The Linux ecosystem is wide; there are hundreds of tools and utilities that can help with system troubleshooting, each with its own set of benefits and functionalities. We will cover some of the well-known tools, which either ship with Linux or are available in the open source world. Detailed explanation of the tools mentioned in this doc is out of scope; please explore the internet or man pages for more examples and documentation.
+
--- a/courses/level102/system_troubleshooting_and_performance/performance-improvements.md
+++ b/courses/level102/system_troubleshooting_and_performance/performance-improvements.md
@ -0,0 +1,58 @@
+Performance tools are an important part of the development/operations lifecycle, and they are essential for understanding application behavior. SREs generally use such tools to evaluate how well a service will perform and to make or suggest improvements accordingly.
+
+### Performance analysis commands
+
+Most of these commands are must-knows for performance analysis of a system or service.
+
+- top -: Shows a real-time view of the running system, processes, threads, etc.
+- htop -: Similar to the top command, but more interactive.
+- iotop -: An interactive disk I/O monitoring tool.
+- vmstat -: Virtual memory statistics explorer.
+- iostat -: Monitoring tool for input/output statistics for devices and partitions.
+- free -: Shows information about physical memory and swap memory.
+- sar -: System activity report, reports different metrics such as CPU, disk, memory, network, etc.
+- mpstat -: Displays information about CPU utilization and performance.
+- lsof -: Lists open files and the processes that opened them.
+- perf -: Performance analysing tool.
+
+### Profiling tools
+
+Profiling is an important part of performance analysis of a service. Various profiling tools are available, which can help identify the most frequent code paths and help with debugging, memory profiling, etc. They can generate a heatmap to help understand code performance under load.
+
+- [FlameGraph](https://github.com/brendangregg/FlameGraph): Flame graphs are a visualization of profiled software, allowing the most frequent code-paths to be identified quickly and accurately.
+- [Valgrind](https://valgrind.org/info/about.html): It is a programming tool for memory debugging, memory leak detection, and profiling.
+- [Gprof](https://sourceware.org/binutils/docs/gprof): GNU profiler tool uses a hybrid of instrumentation and sampling. Instrumentation is used to collect function call information, and sampling is used to gather runtime profiling information.
+
+To know how LinkedIn performs on-demand profiling on its services, read the LinkedIn blog post [ODP: An Infrastructure for On-Demand Service Profiling](https://engineering.linkedin.com/blog/2017/01/odp--an-infrastructure-for-on-demand-service-profiling).
+
+### Benchmarking
+
+Benchmarking is the process of measuring the best performance of a service: how much QPS the service can handle, its latency as load increases, host resource utilization, load average, etc. Load testing to catch performance regressions is a must before deploying the service to production.
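A minimal sketch of what such a measurement involves, using only the Python standard library: it calls a function concurrently, then reports QPS and latency percentiles. `fake_request` is a hypothetical stand-in for a real HTTP call, and the request count and concurrency values are arbitrary assumptions. For real load tests, prefer the dedicated tools listed in this section.

```python
import math
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(func):
    """Run func once and return how long it took, in seconds."""
    start = time.perf_counter()
    func()
    return time.perf_counter() - start


def benchmark(func, total_requests=200, concurrency=10):
    """Call func total_requests times from a thread pool and summarise the run."""
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(lambda _: timed_call(func), range(total_requests)))
    elapsed = time.perf_counter() - wall_start

    # Nearest-rank 95th percentile on the sorted latencies.
    p95_index = min(len(latencies) - 1, math.ceil(0.95 * len(latencies)) - 1)
    return {
        "requests": len(latencies),
        "qps": len(latencies) / elapsed,
        "p50_ms": statistics.median(latencies) * 1000,
        "p95_ms": latencies[p95_index] * 1000,
    }


def fake_request():
    """Hypothetical stand-in for a real request; sleeps for about 1 ms."""
    time.sleep(0.001)


if __name__ == "__main__":
    print(benchmark(fake_request))
```

Repeating the run with increasing `concurrency` and watching where QPS stops growing while latency keeps rising gives a rough idea of the saturation point.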
+
+**Some well-known tools -:**
+
+- [Apache Benchmark Tool, ab](https://httpd.apache.org/docs/2.4/programs/ab.html): It simulates a high load on a webapp and gathers data for analysis.
+- [Httperf](https://github.com/httperf/httperf): It sends requests to the web server at a specified rate and gathers stats. The rate can be increased until the saturation point is found.
+- [Apache JMeter](https://github.com/apache/jmeter): It is a popular open-source tool to measure web application performance. JMeter is a Java-based application, and it can be used to test not only web servers but also PHP, Java, REST, etc.
+- [Wrk](https://github.com/wg/wrk): It is another modern performance measurement tool that puts load on your web server and reports latency, requests per second, transfer per second, and other details.
+- [Locust](https://github.com/locustio/locust): Easy to use, scriptable and scalable performance testing tool.
+
+**Limitation -:**
+
+The above tools help with synthetic load or stress testing, but they do not measure the actual end user experience. They can’t show how end user resources, such as a lack of memory or CPU, or poor internet connectivity, affect application performance.
+
+To know how LinkedIn performs load testing across its fleet, read [Eliminating toil with fully automated load testing](https://engineering.linkedin.com/blog/2019/eliminating-toil-with-fully-automated-load-testing).
+
+And to know how LinkedIn makes use of Real User Monitoring (RUM) data to overcome the limitations of load testing and help improve the overall experience for end users, read [Monitor and Improve Web Performance Using RUM Data Visualization](https://engineering.linkedin.com/performance/monitor-and-improve-web-performance-using-rum-data-visualization).
+
+### Scaling
+
+Even an optimally designed system can perform only up to a certain limit, based on the availability of resources. Continuous optimization is always needed to ensure optimum use of resources at peak load. With increasing QPS, systems need to scale. We can scale either vertically or horizontally. Vertical scalability has its limits, as one can increase CPU, memory, disk, GPU and other specifications only up to a certain limit, whereas horizontal scalability can grow much further, within the limitations imposed by application design and environment attributes.
+
+Scaling a web application will require some or all of the following -:
+
+- Ease the server load by adding more hosts.
+- Distribute the traffic across servers by using load balancers.
+- Scale the DB by sharding the data and adding read replicas.
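As a rough illustration of two of the points above, the sketch below shows round-robin distribution of requests across hosts and hash-based selection of a DB shard. The host names, key format and shard count are hypothetical, and real load balancers and databases use more elaborate strategies (health checks, weights, rebalancing).

```python
import hashlib
import itertools


def round_robin(hosts):
    """Return an endless iterator that hands out hosts in turn."""
    return itertools.cycle(hosts)


def shard_for(key, shard_count):
    """Pick a shard for a key using a stable hash of the key.

    Python's built-in hash() is salted per process for strings, so a
    cryptographic digest is used to keep the mapping stable across restarts.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    balancer = round_robin(["app-host-1", "app-host-2", "app-host-3"])
    for _ in range(4):
        print(next(balancer))

    # Hypothetical user keys spread over an assumed 4 shards.
    for user in ("user-1", "user-2", "user-3"):
        print(user, "-> shard", shard_for(user, 4))
```

One trade-off of the simple modulo scheme is that changing `shard_count` remaps most keys, so resharding needs to be planned for.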
+
+Here’s a good read on how LinkedIn scaled its application stack: [A Brief History of Scaling LinkedIn](https://engineering.linkedin.com/architecture/brief-history-scaling-linkedin)
--- a/courses/level102/system_troubleshooting_and_performance/troubleshooting-example.md
+++ b/courses/level102/system_troubleshooting_and_performance/troubleshooting-example.md
@ -0,0 +1,49 @@
+In this section we will look at an example issue and try to troubleshoot it. At the end, a few well-known troubleshooting stories shared earlier by LinkedIn engineers are linked.
+
+### Example - Memory leak :
+Memory leaks often go unnoticed until the service becomes unresponsive after running for some time (days, weeks or even months), and the problem persists until the service is restarted or the bug is fixed. In such cases, the service’s memory usage shows a steadily increasing trend in the metric graph, something like the graph below.
+
+![](images/MemUsageChart.png)
+
+A memory leak is mismanagement of memory allocations by an application, where unneeded memory is not released; over time, objects continue to pile up in memory, resulting in a service crash. Generally, such unreleased objects are cleaned up automatically by the [garbage collector](https://en.wikipedia.org/wiki/Garbage_collection_(computer_science)), but sometimes, due to a bug, this fails. Debugging helps in figuring out where most of the application’s memory is being used. You then track and filter objects based on usage. If you find objects that are no longer in use but are still referenced, you can remove those references so the objects can be released, avoiding the leak. Python comes with built-in features like [tracemalloc](https://docs.python.org/3/library/tracemalloc.html), a module that can help pinpoint where an object was first allocated. Almost every language comes with a set of tools/libraries (built-in or external) that help find memory issues. Similarly, for Java there is a well-known memory leak detection tool called [Java VisualVM](http://visualvm.java.net/intro.html).
+
+Let’s look at a dummy Flask-based web app with a memory leak bug, where memory usage keeps increasing with every request, and see how we can use tracemalloc to capture the leak.
+
+Assumption -: A Python virtual environment has been created, and Flask is installed in it.
+
+**Bare minimum Flask code with the bug; read the comments for more info**
+![](images/FlaskCode.png)
+
+**Starting flask app**
+![](images/FlaskStart.png)
+
+**On start, its memory usage is around 26576 KB, i.e. approx 26 MB**
+![](images/MemUsage01.png)
+
+**Now with every subsequent GET request, we can notice that the process memory usage continues to increase slowly.**
+![](images/MemUsage02.png)
+
+**Now let’s try 10000 requests, to see if memory usage increases heavily.**
+To send a high number of requests, we use the Apache benchmarking tool called [“ab”](https://httpd.apache.org/docs/2.4/programs/ab.html). After 10000 hits, we can see that the memory usage of the Flask app has jumped almost 16 times, **from an initial 26576 KB to 419316 KB, i.e. from roughly 26 MB to 419 MB**. That’s a huge jump for such a small webapp.
+![](images/MemUsage03.png)
+
+**Let’s use the Python [tracemalloc](https://docs.python.org/3/library/tracemalloc.html) module to understand the application’s memory allocations.** Tracemalloc takes memory snapshots at particular points in time and computes various statistics on them.
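A standalone sketch of the snapshot idea, independent of the Flask code shown in the screenshots: `_leaky_cache` and `handle_request` are hypothetical names that simulate a handler keeping a reference to every payload, and the request count is an arbitrary assumption. Two snapshots are taken around the simulated traffic and compared by source line.

```python
import tracemalloc

# Hypothetical module-level list that is never cleared, which is the leak.
_leaky_cache = []


def handle_request():
    """Simulate a request handler that keeps a reference to every payload."""
    _leaky_cache.append(bytearray(1024))


def find_top_growth(requests=1000, limit=3):
    """Return the source lines whose allocations grew most during the run."""
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    for _ in range(requests):
        handle_request()
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # compare_to() returns StatisticDiff entries, largest differences first.
    return after.compare_to(before, "lineno")[:limit]


if __name__ == "__main__":
    for stat in find_top_growth():
        print(stat)
```

The first entry printed should point at the `append` line inside `handle_request`, since that is where the retained memory is allocated.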
+
+We add a bare minimum of code to our app.py file, with no change to the fetchuserdata.py file. This allows us to capture tracemalloc snapshots whenever we hit the /capture URI.
+![](images/Tracemalloc01.png)
+
+**After restart of app.py (flask run)**, we will
+- First hit http://127.0.0.1:5000/capture
+- Then hit http://127.0.0.1:5000/ 10000 times, to let the memory leak build up.
+- Finally hit http://127.0.0.1:5000/capture again to take a snapshot to know which line has the most allocation.
+![](images/Tracemalloc02.png)
+
+In the final snapshot, we can see the exact module and line number where most of the allocation happened: fetchuserdata.py, line 6, which holds 419 MB of memory after 10000 hits.
+![](images/Tracemalloc03.png)
+
+**Summary**
+
+The above example shows how a bug can lead to a memory leak, and how we can use [tracemalloc](https://docs.python.org/3/library/tracemalloc.html) to find where it is. Real-world applications are far more complex than this dummy example, and using tracemalloc may degrade application performance somewhat due to its own overhead. Be mindful about its use in production environments.
+
+If you are interested in digging deeper into Python object memory allocation internals and debugging memory leaks, have a look at an interesting talk by [Sanket Patel](https://www.linkedin.com/in/sanketplus/) at PyCon India 2019: [Debug Memory Leak In Python Flask | Python Object Memory Allocation Internals](https://www.youtube.com/watch?v=s9kAghWpzoE)
+
--- a/courses/level102/system_troubleshooting_and_performance/troubleshooting.md
+++ b/courses/level102/system_troubleshooting_and_performance/troubleshooting.md
@ -0,0 +1,47 @@
+Troubleshooting system failures can be tricky or tedious at times. It requires examining the end-to-end flow of a service and all its downstreams, and analysing logs, memory leaks, CPU usage, disk I/O, network failures, host issues, etc. Knowing certain practices and tools can help identify and mitigate failures faster. Here’s the high-level troubleshooting flowchart -:
+
+### Troubleshooting Flowchart
+![](images/TroubleshootingFlow.jpg)
+
+### General Practices
+Different systems require different approaches for finding issues. The scope of this section is limited; for a given problem, there can be many more points worth looking into. The following points cover some high-level practices for finding webapp failures and their fixes.
+
+**Reproduce problem**
+
+* Try the broken request to reproduce the issue, e.g. hit the HTTP/S request that fails.
+* Check the end-to-end flow of the request and look for return codes, mostly [3xx, 4xx or 5xx](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes). 3xx codes are mostly about redirections; 4xx codes are about client errors such as unauthorized, bad request, forbidden, etc.; and 5xx codes are mostly about server-side issues. Based on the return code, you can decide the next step.
+* Client-side issues are mainly about missing or buggy static content, like JavaScript issues, a bad image, broken JSON from an async call, etc., which can result in incorrect page rendering in browsers.
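The return-code triage described above can be sketched with the Python standard library. `classify_status` and `probe` are hypothetical helper names, and the timeout is an arbitrary assumption; note that `urllib` follows redirects on its own, so 3xx codes are rarely seen by the caller.

```python
import urllib.error
import urllib.request


def classify_status(code):
    """Map an HTTP status code to the coarse class used while triaging."""
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirection"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "other"


def probe(url, timeout=5.0):
    """Send a GET request and return (status code, class).

    urlopen() raises HTTPError for 4xx/5xx responses, so the code is read
    from the exception in that case. Connection-level failures (DNS,
    refused connection, timeout) are not caught here and will propagate.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            code = response.status
    except urllib.error.HTTPError as err:
        code = err.code
    return code, classify_status(code)


if __name__ == "__main__":
    for code in (200, 301, 403, 503):
        print(code, classify_status(code))
```

Calling `probe` against the failing URL gives a quick first hint on whether to look at the client request (4xx) or at the server and its downstreams (5xx).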
+
+**Gather Information**
+
+* Look for errors/exceptions in application logs, like "Can’t Allocate Memory" or OutOfMemoryError, "disk I/O error", or a DNS resolution error.
+* Check application and host metrics, and look for anomalies in service and host graphs: since when has CPU usage increased, since when has memory usage increased, since when has disk space reduced or disk I/O increased, when did the load average start shooting up, etc. Please read the School of SRE link for more detail around [metrics and monitoring](https://linkedin.github.io/school-of-sre/level101/metrics_and_monitoring/introduction).
+* Look for recent code or config changes which possibly are breaking the system.
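A minimal sketch of scanning logs for the kinds of errors listed above. The regular expressions and the sample log lines are assumptions made for illustration; real applications word these errors differently, so the patterns need adapting.

```python
import re
from collections import Counter

# Assumed patterns, loosely based on the error examples in the list above.
ERROR_PATTERNS = {
    "memory": re.compile(r"can(?:.t|not) allocate memory|OutOfMemoryError", re.IGNORECASE),
    "disk": re.compile(r"disk I/O error", re.IGNORECASE),
    "dns": re.compile(
        r"name or service not known|temporary failure in name resolution",
        re.IGNORECASE,
    ),
}

# Hypothetical log lines, not taken from any real service.
SAMPLE_LOG = [
    "2021-06-01 10:00:01 ERROR java.lang.OutOfMemoryError: Java heap space",
    "2021-06-01 10:00:05 ERROR OSError: [Errno 12] Cannot allocate memory",
    "2021-06-01 10:00:09 WARN  sqlite3.OperationalError: disk I/O error",
    "2021-06-01 10:00:12 ERROR socket.gaierror: [Errno -2] Name or service not known",
    "2021-06-01 10:00:15 INFO  request served in 12 ms",
]


def count_errors(lines):
    """Count how many lines match each error category."""
    counts = Counter()
    for line in lines:
        for label, pattern in ERROR_PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts


if __name__ == "__main__":
    for label, count in count_errors(SAMPLE_LOG).items():
        print(label, count)
```

The same function works on a real file by passing an open file object, since it only iterates over lines.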
+
+**Understand the problem**
+
+* Try correlating gathered data with recent actions, like an exception showing up in logs after config/code deployment.
+* Is it due to the [QPS](https://en.wikipedia.org/wiki/Queries_per_second) increase? Is it bad SQL queries? Do recent code changes demand better or more hardware?
+
+**Find a solution and apply a fix**
+
+* Based on the above findings, look for a quick fix if possible, for example rolling back changes if errors/exceptions correlate with them.
+* Try patching or [hotfixing](https://en.wikipedia.org/wiki/Hotfix) the code, preferably in a staging setup first, if you want to fix forward.
+* Try to scale up the system: if high QPS is the reason for the system failure, add resources (compute, storage, memory, etc.) as necessary.
+* Optimize SQL queries if needed.
+
+**Verify complete request flow**
+
+* Hit the requests again and ensure the responses are successful (return code 2xx).
+* Check the logs and ensure the exceptions/errors found earlier no longer occur.
+* Ensure metrics are back to normal.
+
+### General Host issues
+
+To check whether a host is healthy, look for hardware failures or performance issues. One can try the following -:
+
+* dmesg -: Shows recent errors/failures reported by the kernel. This helps in identifying hardware failures, if any.
+* ls commands -: lspci, lsblk, lscpu, lsscsi. These commands list PCI, block device, CPU and SCSI device information.
+* /var/log/messages -: Shows system app/service related errors/warnings, and also kernel issues.
+* smartd -: Checks disk health.
+
--- a/mkdocs.yml
+++ b/mkdocs.yml
@ -78,14 +78,20 @@ nav:
        - Writing Secure code: level101/security/writing_secure_code.md
        - Conclusion: level101/security/conclusion.md
 - Level 102:
-  - Linux Advanced:
-    - Containerization And Orchestration:
-      - Introduction: level102/containerization_and_orchestration/intro.md
-      - Introduction To Containers: level102/containerization_and_orchestration/intro_to_containers.md
-      - Containerization With Docker: level102/containerization_and_orchestration/containerization_with_docker.md
-      - Orchestration With Kubernetes: level102/containerization_and_orchestration/orchestration_with_kubernetes.md
-      - Conclusion: level102/containerization_and_orchestration/conclusion.md
- 
+    - Linux Advanced:
+        - Containerization And Orchestration:
+            - Introduction: level102/containerization_and_orchestration/intro.md
+            - Introduction To Containers: level102/containerization_and_orchestration/intro_to_containers.md
+            - Containerization With Docker: level102/containerization_and_orchestration/containerization_with_docker.md
+            - Orchestration With Kubernetes: level102/containerization_and_orchestration/orchestration_with_kubernetes.md
+            - Conclusion: level102/containerization_and_orchestration/conclusion.md
+    - System Troubleshooting and Performance Improvements:
+        - Introduction: level102/system_troubleshooting_and_performance/introduction.md
+        - Troubleshooting: level102/system_troubleshooting_and_performance/troubleshooting.md
+        - Important Tools: level102/system_troubleshooting_and_performance/important-tools.md
+        - Performance Improvements: level102/system_troubleshooting_and_performance/performance-improvements.md
+        - Troubleshooting Example: level102/system_troubleshooting_and_performance/troubleshooting-example.md
+        - Conclusion: level102/system_troubleshooting_and_performance/conclusion.md
 - Contribute: CONTRIBUTING.md
 - Code of Conduct: CODE_OF_CONDUCT.md
 - SRE Community: sre_community.md