Add System Design content (#116)

* Add System Design content

* Fix Phase 1 course URL

@ -0,0 +1 @@
We have looked at designing a system from scratch, scaling it up from a single server to multiple datacenters and hundreds of thousands of users. However, you might have (rightly!) guessed that there is a lot more to system design than what we have covered so far. This course should give you a sweeping glance at the things that are fundamental to any system design process. The specific solutions, frameworks and orchestration systems in use evolve rapidly; the guiding principles, however, remain the same. We hope this course helped get you started in the right direction, and that you have fun designing systems and solving interesting problems.

(Three binary image files added, not shown: 139 KiB, 49 KiB and 105 KiB.)

@ -0,0 +1,67 @@
# System Design Phase 2
## Prerequisites
- [School of SRE - System Design - Phase I](https://linkedin.github.io/school-of-sre/level101/systems_design/intro/)
## What to expect from this course
The aim is to empower the reader to understand the building blocks of a well-designed system, evaluate existing systems, understand the trade-offs, come up with their own design, and to explore the various tools available to implement such a system. In phase one of this module, we talked about the fundamentals of system design including concepts like scalability, availability and reliability. We continue to build on those fundamentals in this phase.
<div class="callout callout-info">
Throughout the course, there are callout sections that appear like
this, and talk about things that are closely related to the system
design process, but don't form a part of the system itself. They also have information about some common issues that crop up in system design. Watch out for them.
</div>
## What is not covered under this course
While this course covers many aspects of system design, it does not
cover the most fundamental concepts. For such topics, it is advised to
go through the prerequisites.
In general, this module will not go into actually implementing the architecture - we will not talk about choosing a hosting/cloud provider or an orchestration setup or a CI/CD
system. Instead, we try to focus on the fundamental considerations that need to go into system design.
## Course Contents
- [Introduction](https://linkedin.github.io/school-of-sre/level102/system_design/intro/)
- [Large system Design](https://linkedin.github.io/school-of-sre/level102/system_design/large-system-design/)
- [Scaling](https://linkedin.github.io/school-of-sre/level102/system_design/scaling/)
- [Scaling beyond the datacentre](https://linkedin.github.io/school-of-sre/level102/system_design/scaling-beyond-the-datacenter/)
- [Design patterns for resiliency](https://linkedin.github.io/school-of-sre/level102/system_design/resiliency/)
- [Conclusion](https://linkedin.github.io/school-of-sre/level102/conclusion/)
## Introduction
We talked about building a basic photo sharing application in the previous phase of this course. Our basic requirements for the application were that
1. It should work for a reasonably large number of users
2. Avoid service failures/cluster crashes in case of any issues
In other words, we wanted to build a system that was available, scalable and fault tolerant. We will continue designing that application, and cover additional concepts in the course of doing so.
The photo sharing application is a web application that will handle everything from user sign up, log in, uploads, feed generation, user interaction and interaction with uploaded content. It will also need a database to store this information. In the simplest design, both the web app and the database can run on the same server. Recall this initial design from Phase 1.
![First architecture diagram](images/initial_architecture.jpeg)
Building on that, we will talk about performance elements in system design - setting the right performance measurement metrics and using them to drive our design decisions, improving performance using caching, Content Delivery Networks (CDNs), etc. We will also talk about how to design for resilience by looking at some system design patterns -
graceful degradation, time-outs and circuit breakers.
<div class="callout callout-info">
<h4>Cost</h4>
System design considerations like availability and scalability cannot exist in isolation. When operating outside the lab, other considerations come into play, and the existing ones take on a different hue. One such consideration is cost. Real world systems almost always have budget constraints. System design, implementation and continued operation need to have predictable costs per unit output, where the output is usually the business problem you are trying to solve. Striking a balance between design goals and cost is very important.
</div>
<div class="callout callout-primary">
<h4> Understanding the capabilities of your system </h4>
A well designed system requires an intimate understanding of its building blocks in terms of their capabilities. Not all components are created equal, and understanding what a single component can do is very important - e.g., in the photo upload application it is important to know what a single database instance is capable of, in terms of read or write transactions per second, and what a reasonable expectation would be. This helps in building systems that are appropriately weighted - and will eliminate obvious sources of bottlenecks.
<br>
<br>
On a lower level, even understanding the capabilities of the underlying hardware (or a VM instance if you are on the cloud) is important. For example, all disks don't perform the same, and all disks don't perform the same per dollar. If we are planning to have an API that is expected to return a response in 100ms under normal circumstances, then it is important to know how much of that time will be spent in which parts of the system. The following link helps in getting a sense of each component's performance, all the way from the CPU cache to the network link to our end user.
<br>
<br>
<a href="https://colin-scott.github.io/personal_website/research/interactive_latency.html">Numbers every programmer should know</a>
</div>

@ -0,0 +1,164 @@
Designing a system usually starts out abstract - we have large functional blocks that need to work together and are abstracted away into frontend, backend and database layers. However, when it is time to implement the system, especially as an SRE, we have no other choice but to think in specific terms. Servers have a fixed amount of memory, storage capacity and processing power. So we need to think about the realistic expectations from our system, assess the requirements, and translate them into specific requirements from each component of the system, like network, storage and compute. This is typically how almost
all large-scale systems are built. The folks over at Google have formalized this approach to designing systems as Non-Abstract Large System Design (NALSD). According to the Google Site Reliability Workbook,
> “Practically, NALSD combines elements of capacity planning, component isolation, and graceful system degradation that are crucial to highly available production systems.”
We will be using an approach similar to this to build our system.
## Application requirements
Let us define our application requirements in more concrete terms, i.e., as
specific functions:
Our photo sharing application must let the user
- Sign up to become a member, and login to the application
- Upload photographs, and optionally add a description and tag location and/or people
- Follow other users on the platform
- See a feed comprising photos from other users that they follow
- View their own profile page, and manage who they follow
Let us define expectations for the application's performance for a
better user experience. We also need to define the health of the system.
SLIs and SLOs help us do just that.
## SLIs and SLOs
The Google SRE book defines a service level indicator (SLI) as “a carefully
defined quantitative measure of some aspect of the level of service that
is provided.” For our application, we can define multiple SLIs. One
indicator can be the response time for loading the feed for our photo
sharing application. Picking the right set of SLIs is very important
since they essentially help us define the health of the system as a
whole using concrete data. SLIs for an application are defined by the
owners of the service, in consultation with the SREs.
A service level objective (SLO) is defined as “a target value or range of
values for a service level that is measured by an SLI”. SLO is a way for
us to anchor ourselves to an optimal user experience by defining SLI
boundaries. If our application takes a long time to load the feed, users
might not open it very often. As a result, an example of SLO can be that
at least 99% of the users should see their feed loaded within 1 second.
Now that we have defined SLIs and SLOs, let us define the application's
scalability, reliability and performance characteristics in terms of
specific SLI and SLO levels.
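As a small illustration, here is a sketch of computing such an SLI from collected data, assuming we have per-request latencies (in seconds) for the feed endpoint over some measurement window; the sample values below are made up.
```
# A minimal sketch: the fraction of feed loads within the latency
# threshold is our SLI, compared against the 99% SLO target.
def feed_latency_sli(latencies_s, threshold_s=1.0):
    """Fraction of feed loads that completed within the threshold."""
    if not latencies_s:
        return None  # no data; handle separately from a failing SLO
    within = sum(1 for t in latencies_s if t <= threshold_s)
    return within / len(latencies_s)

samples = [0.42, 0.37, 1.75, 0.61, 0.88, 0.29, 1.02, 0.55]
sli = feed_latency_sli(samples)
slo_target = 0.99  # 99% of feed loads within 1 second
print(f"SLI: {sli:.2%}, meets SLO: {sli >= slo_target}")
```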
## Defining application requirements in terms of SLIs and SLOs
The following can be some of the expectations for our application:
- Once the user successfully uploads the image, it should be accessible to the user and their followers 100% of the time, barring user-elected deletion.
- At least 50000 unique visitors should be able to visit the site at any given time and view their feed.
- 99% of the users should be able to view their feeds in less than 1 second.
- Upon uploading a new image, it should show up in the feeds of the user's followers within 15 minutes.
- Users should be able to upload potentially thousands of images (as long as they are not abusing the service).
Since our ultimate aim is to learn system design, we will arbitrarily limit the functionalities of the system. This will help us keep sight of our aim, and keep us focussed.
Having defined the functionalities and expectations for our system, let us quickly sketch an initial design.
![Initial Application Sketch](images/initial_application_sketch.jpeg)
As of now, all the functionalities are residing on a single server,
which has endpoints for all of these functions. We will attempt to build
a system that satisfies our SLOs, is able to serve 50k concurrent users,
and about a million total users. In the course of this attempt, we will
discuss a string of concepts, some of which we have already seen in
Phase 1 of this course.
<div class="callout callout-danger">
<h4>Caution</h4>
Note that the numbers we have picked in the following sections are completely arbitrary. They have been chosen to demonstrate thinking about system design in a non-abstract manner. They have not been benchmarked and bear no resemblance to real-world numbers. Do not use them in any real world systems that you may be designing. You should come up with your own numbers, using the guiding principles we have relied upon here.
</div>
## Estimating resource requirements
**Single Computer**
If we wished to run the application on a single server, we would need to
perform all the above functionalities from the diagram on this server
itself. Let us perform some calculations to figure out what kind of
resources we will need.
Before anything else, we need to store the data about users, their
uploads, follower information and any other metadata. We will choose a
relational DB to store this information, like MySQL. Do note that we could also choose a NoSQL solution here; that would require a similar approach to calculate the requirements. Let us represent the users like so:
```
UserID(int)
UserName(varchar)
DisplayName(varchar)
YearOfBirth(year)
Email(varchar)
```
Photos can be represented like this:
```
PhotoID(int)
PhotoHash(varchar)
Uploadtime(datetime)
Location(varchar)
OptionalFlag(varchar)
```
Followers can be represented like this:
```
Follower(int)
Followee(int)
```
Let us quickly estimate the storage needed for a hundred million total
users. A single user would need 4B + 32B + 32B + 4B + 32B = 104 bytes. A
hundred million users would need 10.4 GB storage. A single photo would
need about 4B + 20B + 4B + 32B + 4B = 64 bytes of storage to store the metadata related to the photo. Assuming a
million photos uploaded in one day, we would need about 64 MB of storage
per day, just for the metadata. For the photo storage itself, we will need about 300GB per day,
assuming 300KB average photo size.
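Since these estimates are simple arithmetic, it can help to write them down as a small script, so they are easy to re-run when the assumptions (user counts, photo sizes) change:
```
# Back-of-the-envelope storage estimates, reproducing the numbers above.
user_row_bytes = 4 + 32 + 32 + 4 + 32    # UserID + UserName + DisplayName + YearOfBirth + Email = 104 B
photo_row_bytes = 4 + 20 + 4 + 32 + 4    # PhotoID + PhotoHash + Uploadtime + Location + OptionalFlag = 64 B

total_users = 100_000_000
photos_per_day = 1_000_000
avg_photo_bytes = 300_000                # 300 KB average photo

print(user_row_bytes * total_users / 1e9, "GB of user records")            # ~10.4 GB
print(photo_row_bytes * photos_per_day / 1e6, "MB/day of photo metadata")  # ~64 MB
print(avg_photo_bytes * photos_per_day / 1e9, "GB/day of photo storage")   # ~300 GB
```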
A single visitor opening our application simply hits our /get_feed
endpoint upon logging in to the application. Let us quickly calculate
the resources needed to serve this request. Assuming the initial feed
loads 5 images (of 300 KB size on average) and then does lazy loading
to infinitely scroll, we will need to send about 1.5 megabytes of images
to the user for their initial call. With a 1000 Mbps\* network link to the
server, we can serve only about (1000/8)/1.5, i.e. about 83 users all
loading the feed at the same time, before we saturate the network link.
If we needed to serve 50k concurrent users every second, we would need
1.5\*50000\*8 = 600000 Mbps of network throughput for every 5 images
sent, assuming we send out all 5 images within a single second. If we are
reading all of it from disk, we would likely hit disk throughput limits
far before approaching anywhere near this amount of traffic.
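The same arithmetic for the network link, as a script:
```
# Rough throughput math for serving feeds, matching the numbers above.
feed_payload_mb = 5 * 0.3        # 5 images x 300 KB = 1.5 MB per initial feed load
link_mbps = 1000                 # a 1000 Mbps server link

# Feed loads a single link can push out per second before saturating:
print(round((link_mbps / 8) / feed_payload_mb))      # ~83 users

# Throughput needed to serve 50k concurrent users within a second:
print(feed_payload_mb * 50_000 * 8, "Mbps")          # 600000 Mbps = 600 Gbps
```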
So in order to meet our application requirements, we would need a server
that has ~310GB storage for the database and the images of one day, and
about 600 Gbps link to serve 50k users concurrently, along with CPU
required to perform all this. Clearly not the task for a single server.
And do note that we have severely limited the information we are storing
in the database. We would likely need an order of magnitude more
information to be stored.
While we clearly do not have any real world server that has the
resources we calculated above, this exercise provides us some valuable
data points about what the resource cost is. Armed with this
information, let us work on scaling our system through system design to
get us as close as possible to our goals for the application.
\* Modern servers even have multi-gigabit links, but it is highly
unlikely that such a huge server will be serving our application alone.
Modern cloud providers have VMs that also boast several gigabits of
bandwidth, but they usually end up being throttled after certain limits.
## References
1. [SQL vs NoSQL databases](https://www.mongodb.com/nosql-explained/nosql-vs-sql)
2. [Introducing Non-Abstract Large System Design](https://sre.google/workbook/non-abstract-design/)

@ -0,0 +1,58 @@
A resilient system is one that can keep functioning in the face of
adversity. With our application, there can be numerous failures that act
as adversities. There can be network level failures that take out entire
data centres, there might be issues at the rack level or at the server
level, or there might be something wrong with the cloud provider. We may also run out of capacity, or there might be a wrong code push that
breaks the system. We will talk about a couple of such issues, and
understand how we might design a system to work around such things. In
some cases, a workaround might not be possible. However, it is still
valuable to know the potential vulnerabilities to system stability.
Resilient architectures leverage system design patterns such as
graceful degradation, quotas, timeouts and circuit breakers. Let us look
at some of them in this section.
## Quotas
A system may have a component or an endpoint that is consumed by
multiple components and endpoints. It is important to have something in
place that will prevent one consumer or client from overwhelming such a
system. Quotas are one way to do this - we simply assign a specific
quota for each component - by way of specifying requests per unit time.
Anyone who breaches the quota is either warned or dropped, depending on
the implementation. This way, one of our own systems misbehaving cannot
result in denial of service to others. Quotas also help us prevent cascading failures.
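As an illustration, here is a minimal single-process token bucket in Python. Treat it as a sketch of the idea only: real deployments usually enforce quotas with a shared, distributed counter so that all serving instances see the same usage, and the rates here are made up.
```
import time

class TokenBucket:
    def __init__(self, rate_per_s, burst):
        self.rate = rate_per_s         # tokens added per second
        self.capacity = burst          # maximum burst size
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # refill proportionally to the time elapsed, up to the burst cap
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                   # over quota: warn or drop the request

buckets = {}                           # one bucket per client id (hypothetical)
def check_quota(client_id, rate=100, burst=20):
    bucket = buckets.setdefault(client_id, TokenBucket(rate, burst))
    return bucket.allow()
```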
## Graceful Degradation
When a system with multiple dependencies encounters failure in one of
the dependencies, gracefully degrading to minimum viable functionality
would be a lot better than grinding the entire system to a halt. For
example, let us assume there is an endpoint (a URL for a service or a specific function) in our application whose responsibility is to parse the location information in a user-uploaded
image from the image's metadata and provide suggestions for location
tagging to the user. Rather than failing the entire upload, it is much
better to skip over this functionality and still give the user an option
to manually tag a location. Gracefully degrading is always better
compared to total failures.
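A minimal sketch of what this might look like in code, with hypothetical `store_image` and `suggest_locations` helpers standing in for the core upload path and the location-suggestion endpoint:
```
def store_image(image):
    # core upload path; assumed to exist elsewhere in the application
    return 42  # hypothetical photo id

def suggest_locations(image):
    # hypothetical call to the metadata-parsing endpoint
    raise TimeoutError("location service unavailable")

def handle_upload(image):
    photo_id = store_image(image)               # must succeed
    try:
        suggestions = suggest_locations(image)  # optional enrichment
    except Exception:
        suggestions = []                        # degrade: manual tagging still works
    return {"photo_id": photo_id, "location_suggestions": suggestions}

print(handle_upload(b"..."))  # the upload succeeds even though suggestions failed
```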
## Timeouts
We sometimes call other services or resources like databases or API endpoints in our application. When calling such a resource from our application, it is important to always have a reasonable timeout. It doesn't necessarily even have to be that the resource will fail for all requests; it just might be that a specific request falls in the high tail latency category. A reasonable timeout is helpful to keep the user experience consistent - in some cases, it is better to fail rather than to have frustratingly long delays.
## Exponential back-offs
When a service endpoint fails, retries are one way to see if it was a momentary failure. However, if the retry is also going to fail, there is no point in endlessly retrying. At large enough scale, the retries can compete with the new requests (which might very well be served as expected) and saturate the system. To avoid this, we can look at exponential back-off for retries. This essentially decreases the rate at which the clients retry, upon encountering consecutive failures on retries.
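The sketch below ties the last two sections together: every call to the dependency is bounded by a timeout, and retries back off exponentially, with a cap and some jitter so that many clients do not retry in lockstep. The `fetch` callable is a hypothetical stand-in for any dependency call that accepts a timeout.
```
import random
import time

def call_with_retries(fetch, attempts=5, base_s=0.1, cap_s=5.0, timeout_s=1.0):
    for attempt in range(attempts):
        try:
            return fetch(timeout=timeout_s)  # always bound the call itself
        except Exception:
            if attempt == attempts - 1:
                raise                        # give up after the last attempt
            # delay grows exponentially, capped, with jitter to avoid
            # synchronized retry storms across many clients
            delay = min(cap_s, base_s * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```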
## Circuit breakers
While exponential back-off is one way to deal with retry storms, circuit breakers are another. Circuit breakers can prevent failures from percolating through the entire system. Otherwise, an unmitigated failure that flows through the system may result in false alerts, worsening the mean time to detection (MTTD) and mean time to resolution (MTTR). For example, if one of the in-memory cache nodes fails, and requests reach the database after the initial cache timeouts, the database might end up overloaded. If the initial connection between the cache node failure and the DB node failure is not made, then it might result in an increased MTTD of the actual cause and, consequently, the MTTR.
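Here is a minimal circuit breaker sketch. Libraries exist for this in most ecosystems, so treat this only as an illustration of the underlying state machine: closed (normal operation), open (fail fast), and half-open (allow one trial call after a cooldown).
```
import time

class CircuitBreaker:
    def __init__(self, max_failures=5, reset_after_s=30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None            # half-open: allow a trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # open the circuit
            raise
        self.failures = 0                    # a success closes the circuit
        return result
```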
## Self healing systems
A traditionally load-balanced application with multiple instances might fail when more than a threshold of instances stop responding to requests - either because they are down, or suddenly there is a huge influx of requests, resulting in degraded performance. A self-healing system adds more instances in this scenario to replace the failed instances.
Auto-scaling like this can also help when there is a sudden spike in queries. If our application runs on a public cloud, it might simply be a matter of spinning up more [virtual machines](https://azure.microsoft.com/en-in/overview/what-is-a-virtual-machine/). If we are running on-premise out of our data center, then we will want to think about capacity planning much more carefully. Regardless of how we handle adding additional
capacity - simply adding it may not be enough. We should also think about additional potential failure modes that might be encountered. For example, the load balancing layer itself might need scaling up, to handle the influx of new backends.
## Continuous Deployment and Integration
A well designed system also needs to take into account the need for a proper staging setup that can mimic the production environment as closely as possible. There should also be a way for us to replay production traffic in the staging environment to test changes to production thoroughly.

@ -0,0 +1,33 @@
## Caching static assets
Extending the existing caching solution a bit, we arrive at Content Delivery Networks (CDNs). CDNs are the caching layer closest to the user. A significant chunk of the resources served in a webpage may not change on an hourly or even a daily basis. In those cases, we would want to cache these at the CDN level, reducing our load. CDNs not only help reduce the load on our servers by removing the burden of serving static / bandwidth-intensive resources, they also let us be present closer to our users, by way of points of presence (POPs). CDNs also let us do geo-load balancing, in case we have multiple data centres around
the world, and would want to serve from the closest data center (DC) possible.
**Taking it a step further**
With the addition of caching and distributing our application into simpler services, we have solved the problem of scaling to 50000 users. However, our users may be in geographically distributed locations and may not all be at the same distance from our data centre or our cloud region. Consistency in user experience is important; otherwise we are excluding users who are far away from our location, potentially eliminating a significant chunk of potential users. However, it is impractical to have data centers all over the world, or even in more than a couple of locations. This is where CDNs and POPs come into the picture.
## Points of Presence
CDN POPs are geographically distributed data centers aimed at being close to users. POPs reduce the round trip time by delivering content from a location that is nearest to the user. POPs typically may not have all the content, but have caching servers that cache the static assets, and fetch the rest of the content from the [origin server](https://www.cloudflare.com/en-in/learning/cdn/glossary/origin-server/) where the application actually resides. POPs can also route traffic to one of multiple possible origin DCs. This way, POPs can be leveraged to add resiliency as well as load balancing.
Now, with our image sharing application becoming more popular by the day, let us assume that we have hit 100,000 concurrent users. And we have built another data center, predicting this increase in traffic. Now we need to be able to route the service to both of these data centers in a reliable manner, while also retaining the ability to fall back to a single data center in case there is an issue with one of the two DCs. This is where sticky routing comes into play.
## Sticky Routing
When a user sends a request, there are cases in which we might want to serve a specific user's requests from one DC out of many, or from a specific server inside a DC. We may also wish to serve all requests from a specific POP out of a single data center. Sticky routing helps us do exactly that. It might simply be pinning all users to a specific DC, or pinning specific users to specific servers. This is typically done from the POP, so that as soon as the user's request reaches our servers, we can route them to the nearest DC possible.
## Geo DNS
When a user opens the application, the user can be directed to one of the multiple
globally distributed POPs. This can be done using [GeoDNS](https://jameshfisher.com/2017/02/08/how-does-geodns-work/), which, simply put, gives out different IP addresses (which are distributed geographically) depending on the location of the user making the DNS request. GeoDNS is the first step in distributing users to different locations - it is not 100% accurate, and typically makes use of IP address allotment information to guess the location of the user. However, it works well enough for \>90% of the users. After this, we can have a sticky routing service that assigns each user to a specific DC and sets a cookie. When the user next visits, the cookie can be read at the POP to decide which data center the user's traffic must be directed to.
Having multiple DCs and leveraging sticky routing has not only scaling benefits, but also adds to the resiliency of the service, albeit at the cost of additional complexity.
Let us consider another use case in which a user uploads a new profile picture for themselves. If we have multiple data centres or POPs which are not synced in real time, not all of them might have the newer picture. In such a case, it would make sense to tie that user to a specific DC/region until the update has propagated to all regions. Sticky routing would enable us to do this.
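To make this concrete, here is a toy Python sketch of cookie-based sticky routing at a POP. All names (the cookie, the DC list) are hypothetical; a real implementation would live in the POP's routing layer and would also handle failing over when the pinned DC is unhealthy.
```
import hashlib

DCS = ["dc-east", "dc-west"]        # hypothetical data centers

def pick_dc(user_id):
    # hash the user id so the same user is always pinned to the same DC
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    return DCS[int(digest, 16) % len(DCS)]

def route(request_cookies, user_id):
    dc = request_cookies.get("sticky_dc")
    if dc not in DCS:               # first visit, or the pinned DC was removed
        dc = pick_dc(user_id)
    return dc, {"sticky_dc": dc}    # forward to `dc`, and set the cookie

print(route({}, user_id=1234))      # subsequent visits reuse the cookie
```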
## References
1. [CDNs](https://www.cloudflare.com/en-in/learning/cdn/what-is-a-cdn/)
2. LinkedIn's TrafficShift [blog](https://engineering.linkedin.com/blog/2017/05/trafficshift--load-testing-at-scale) talks about sticky routing

@ -0,0 +1,142 @@
In Phase 1 of this course, we saw the AKF [scale cube](https://akfpartners.com/growth-blog/scale-cube) and how it can help in segmenting services, defining microservices and scaling the overall application. We will use a similar strategy to scale our application - while using the estimates from the previous section, so that we have a data-driven design rather than arbitrarily chosen scaling patterns.
## Splitting the application
Considering the huge volume of traffic that might be generated by our application, and the related resource requirements in terms of memory and CPU, let us split the application into smaller chunks. One of the simplest ways to do this would be to simply divide the application along the endpoints, and spin them up as separate instances. In reality, this decision would probably be a little more complicated, and you might end up having multiple endpoints running from the same instance.
The images can be stored in an [object store](https://en.wikipedia.org/wiki/Object_storage) that can be scaled independently, rather than on the servers where the application or the database resides. This would reduce the resource requirements for the servers.
## Stateful vs Stateless services
<div class="callout callout-primary">
A stateless process or service doesn't rely on stored data from its past invocations. A stateful service, on the other hand, stores its state in a datastore, and typically uses the state on every call or transaction. In some cases, we can design services in such a way that certain components are made stateless, and this helps in multiple ways. Applications can be containerized easily if they are stateless. Containerized applications are also easier to scale. Stateful services require you to scale the datastore
with the state as well. However, containerizing databases or scaling databases is out of the scope of this module.
</div>
<!--You are encouraged to refer to the containerisation module [here]()-->
The resulting design after such distribution of workloads might look
something like this.
![Microservices diagram](images/microservices.jpg)
You might notice that the diagram also has multiple databases. We will see more about this in the following sharding section.
Now that we have split the application into smaller services, we need to
look at scaling up the capacity of each of these endpoints. The popular
Pareto principle states that “80% of consequences come from 20% of the
causes”. Modifying it slightly, we can say that 80% of the traffic will
be for 20% of the images. The number of images uploaded vs. the number of images
seen by users is going to be similarly skewed. A user is much more
likely to view images on a daily basis than they are to upload new ones.
In our simple design, generating the feed page with initial 5 images
will be a matter of choosing 5 recently uploaded images from fellow
users whom this user follows. While we can dynamically fetch the images
from the database and generate the page on the fly once the user logs
on, we might soon overwhelm the database in case a large number of users
choose to login at the same time and load their feeds. There are two
things we can do here: one is caching, and the other is ahead-of-time
generation of user feeds.
A user with a million followers can potentially lead to hundreds of
thousands of calls to the DB, simply to fetch the latest photoID that
the user has uploaded. This can quickly overwhelm any DB, and can
potentially bring down the DB itself.
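To make ahead-of-time feed generation concrete, here is a toy "fan-out on write" sketch. The in-memory data structures are stand-ins for real stores, and as noted above, a user with millions of followers makes this write-side fan-out expensive - real systems often mix it with fetch-on-read for such accounts.
```
from collections import defaultdict, deque

followers = defaultdict(set)                     # user -> follower ids
feeds = defaultdict(lambda: deque(maxlen=500))   # follower -> recent photo ids

def on_upload(user_id, photo_id):
    # push the new photo into every follower's precomputed feed
    for follower in followers[user_id]:
        feeds[follower].appendleft(photo_id)

def get_feed(user_id, n=5):
    return list(feeds[user_id])[:n]              # a cheap read, no DB call

followers["alice"] = {"bob", "carol"}
on_upload("alice", photo_id=101)
print(get_feed("bob"))                           # [101]
```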
## Sharding
One way to solve the problem of DB limitations is scaling up DB writes
and reads. Sharding is one way to scale DB writes: the DB is
split into parts that reside in different instances of the DB
running on separate machines. DB reads can be scaled up similarly by
using read replicas, as we saw in Phase 1 of this module.
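As a minimal illustration, the write path can pick a shard by hashing the user ID, so a given user's rows always land on the same DB instance. The shard names are hypothetical, and a naive modulo scheme like this makes resharding painful - consistent hashing is a common refinement.
```
import hashlib

SHARDS = ["db-0", "db-1", "db-2", "db-3"]   # hypothetical DB instances

def shard_for(user_id):
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for(12345))   # the same user always maps to the same shard
```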
Compared to the number of images the popular user uploads, the number of
views generated would be massive. In that case, we should cache the
photoIDs of the user's uploads, to be returned without having to perform
a potentially expensive call to the DB.
Let us consider another endpoint in our application named
`/get_user_details`. It simply returns the page a user would see upon
clicking another user's name. This endpoint will return a list of posts
that the user has created. Normally, a call to that endpoint will
involve the application talking to the DB, fetching a list of all the
posts by the user, and returning the result. If someone's profile is
viewed thousands of times, that means there are thousands of calls to the
DB - which may result in issues like hot keys and hot partitions. As
with all other systems, an increase in load may result in worsening
response times, resulting in inconsistent and potentially bad user
experience. A simple solution here would be a cache layer - one that
would return the user's profile with posts without having to call the DB every time.
## Caching
A cache is used for the temporary storage of data that is likely to be
accessed again, often repetitively. When the data requested is found in
the cache, it is termed a `cache hit`. A `cache miss` is the obvious
complement. A well-positioned cache can greatly reduce the query
response time as well as improve the scalability of a system. Caches can
be placed at multiple levels between the user and the application. In
Phase 1, we saw how we could use caches / CDNs to service static
resources of the application, resulting in quicker response times as
well as lesser burden on the application servers. Let us look at more
situations where caching can play a role.
### In-memory caching
In memory caching is when the information to be cached is kept in the
main memory of the server, allowing it to be retrieved much faster than
a DB residing on a disk. We can cache arbitrary text (which may be HTML
fragments or JSON objects) and fetch it back really fast. An in-memory
cache is the quickest way to add a layer of fast caching, and it can
optionally be persisted to disk as well.
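Here is a toy cache-aside sketch for the `/get_user_details` flow discussed earlier, with a plain dict plus per-entry expiry standing in for a real in-memory cache such as redis or memcached; `fetch_posts_from_db` is a hypothetical stand-in for the expensive query.
```
import time

CACHE = {}                 # user id -> (expiry timestamp, posts)
TTL_SECONDS = 60.0

def fetch_posts_from_db(user_id):
    return [{"photo_id": 1, "owner": user_id}]   # pretend this is the costly query

def get_user_details(user_id):
    entry = CACHE.get(user_id)
    if entry and entry[0] > time.monotonic():
        return entry[1]                          # cache hit: the DB is never touched
    posts = fetch_posts_from_db(user_id)         # cache miss: fall through to the DB
    CACHE[user_id] = (time.monotonic() + TTL_SECONDS, posts)
    return posts

print(get_user_details(7))   # miss: fills the cache
print(get_user_details(7))   # hit: served from the cache
```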
<div class="callout callout-danger">
While caching can aid significantly in scaling up and improving
performance, there are situations where cache is suddenly not in place.
It might be that the cache was accidentally wiped, leading to all the
queries falling through to the DB layer, often multiple calls for the
same piece of information. It is important to be aware of this potential
thundering herd problem and design your system accordingly.
</div>
**Caching proxies:**
There are cases where you may want to cache entire webpages / responses
of other upstream resources that you need to respond to requests. There
are also cases where you want to let your upstream tell you what to
cache and how long to cache it for. In such cases, it might be a good
idea to have a caching solution that understands Cache related HTTP
headers. One example for our use case: when users search for a
specific term in our application, if a particular user or term is searched
for frequently enough, it might be more efficient to cache the responses
for some duration rather than performing the search anew every time.
Let's recap one of our goals - at least 50000 unique visitors should be
able to visit the site at any given time and view their feed. With the
implementation of caching, we have removed one potential bottleneck -
the DB. We also decomposed the monolith into smaller chunks that provide
individual services. Another step closer to our goal is to simply
horizontally scale the services needed for feed viewing and putting them
behind a load balancer. Please recall the scaling concepts discussed in
Phase 1 of this module.
## Cache management
<div class="callout callout-info">
While caching sounds like a simple, easy solution for a hard problem, an
even harder problem is to manage the cache efficiently. Like most things
in your system, the cache layer is not infinite. Effective cache
management means removing things from the cache at the right time, to
ensure the cache hit rate remains high. There are many strategies to
invalidate cache after a certain time period or below certain usage
thresholds. It is important to keep an eye on cache-hit rate and fine
tune your caching strategy accordingly.
</div>
## References
1. There are many object storage solutions available. [Minio](https://github.com/minio/minio) is one self hosted solution. There are also vendor-specific solutions for the cloud like [Azure Blob storage](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blobs-introduction) and [Amazon S3](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html).
2. Microservices architecture style - [Azure architecture guide](https://docs.microsoft.com/en-us/azure/architecture/guide/architecture-styles/microservices)
3. There are many in-memory caching solutions. Some of the most popular ones include [redis](https://redis.io) and [memcached](https://memcached.org). Cloud vendors also have their managed cache solutions.
4. Some of the most popular proxies include [squid](https://www.squid-cache.org) and [Apache Traffic Server](https://trafficserver.apache.org)
5. Thundering herd problem - how instagram [tackled it](https://instagram-engineering.com/thundering-herds-promises-82191c8af57d).

@ -1 +1,36 @@
div.md-content img { border: 4px solid #ddd; padding: 12px; }
/* Callout boxes used by the system design module. Rules are written
   flat, since plain CSS does not support nested selectors. */
.callout {
  padding: 20px;
  margin: 20px 0;
  border: 1px solid #eee;
  border-left-width: 5px;
  border-radius: 3px;
}
.callout h4 {
  margin-top: 0;
  margin-bottom: 1px;
}
.callout p:last-child {
  margin-bottom: 0;
}
.callout code {
  border-radius: 3px;
}
.callout-info {
  border-left-color: #428bca;
}
.callout-info h4 {
  color: #428bca;
}
.callout-primary {
  border-left-color: #5bc0de;
}
.callout-primary h4 {
  color: #5bc0de;
}
.callout-danger {
  border-left-color: #d9534f;
}
.callout-danger h4 {
  color: #d9534f;
}

@ -92,6 +92,13 @@ nav:
- RTT: level102/networking/rtt.md
- Infrastructure Services: level102/networking/infrastructure-features.md
- Conclusion: level102/networking/conclusion.md
- System Design:
- Introduction: level102/system_design/intro.md
- Large System Design: level102/system_design/large-system-design.md
- Scaling: level102/system_design/scaling.md
- Scaling Beyond the Data Center: level102/system_design/scaling-beyond-the-datacenter.md
- Resiliency: level102/system_design/resiliency.md
- Conclusion: level102/system_design/conclusion.md
- System Troubleshooting and Performance Improvements:
- Introduction: level102/system_troubleshooting_and_performance/introduction.md
- Troubleshooting: level102/system_troubleshooting_and_performance/troubleshooting.md
