diff --git a/courses/level102/networking/conclusion.md b/courses/level102/networking/conclusion.md
new file mode 100644
index 0000000..66827b9
--- /dev/null
+++ b/courses/level102/networking/conclusion.md
@@ -0,0 +1,10 @@

This course has given some background on deploying services in a data centre, the various parameters to consider, and the solutions available. Note that each of the solutions discussed here has its own pros and cons, so the right fit among them has to be identified based on the specific scenario/requirement. Since we didn't go into the depth of the various technologies/solutions in this course, the reader may be curious to learn more about some of the topics. Here are some references and online training content for further learning.

[LinkedIn engineering blog](https://engineering.linkedin.com/blog/topic/datacenter): has information about how LinkedIn data centres are set up and how some of the key problems were solved.

[IPSpace blog](https://blog.ipspace.net/tag/data-center.html): has a lot of articles about data centre networking.

[Networking Basics](https://www.edx.org/course/introduction-to-networking) course on edX.

Happy learning!

diff --git a/courses/level102/networking/infrastructure-features.md b/courses/level102/networking/infrastructure-features.md
new file mode 100644
index 0000000..e4d3ebe
--- /dev/null
+++ b/courses/level102/networking/infrastructure-features.md
@@ -0,0 +1,180 @@

> *Some of the aspects to consider are whether the underlying data
centre infrastructure supports ToR resiliency, i.e. features like link
bundling (bonds), BGP, support for anycast service, load balancer,
firewall, Quality of Service.*

As seen in the previous sections, deploying applications at scale requires certain capabilities from the underlying infrastructure. This section will cover the different options available and their suitability.

### ToR connectivity

As this is one of the most frequent points of failure (considering the scale of deployment), there are different options available to connect the servers to the ToR. We will look at them in detail below.

#### Single ToR

This is the simplest of all the options, where one NIC of the server is connected to one ToR. The advantage of this approach is that a minimal number of switch ports is used, allowing the DC fabric to support rapid growth of the server infrastructure (note: not only are the ToR ports used efficiently, but port usage in the upper switching layers of the DC fabric is efficient as well). On the downside, the servers become unreachable if there is an issue with the ToR, link or NIC. This impacts stateful apps more, as their existing connections get abruptly disconnected.

![Single ToR design](./media/Single ToR.png)

Fig 4: Single ToR design

#### Dual ToR

In this option, each server is connected to the two ToRs of the same cabinet. This can be set up in active/passive mode, thereby providing resiliency during ToR/link/NIC failures. The resiliency can be achieved either in layer 2 or in layer 3.

##### Layer 2

In this case, both links are bundled together as a [bond](https://en.wikipedia.org/wiki/Link_aggregation) on the server side (with one NIC taking the active role and the other being passive). On the switch side, these two links are made part of a [multi-chassis lag](https://en.wikipedia.org/wiki/Multi-chassis_link_aggregation_group) (similar to bonding, but spread across switches). The prerequisite here is that both ToRs should be part of the same layer 2 domain. The IP addresses are configured on the bond interface on the server side and on the SVI on the switch side.

![Dual ToR layer 2 setup](./media/Dual ToR.png)

Note: In this setup, the role of ToR 2 is only to provide resiliency.

Fig 5: Dual ToR layer 2 setup

##### Layer 3

In this case, both links are configured as separate layer 3 interfaces. The resiliency is achieved by setting up a routing protocol (like BGP), wherein one link is given higher preference over the other. Here the two ToRs can be set up independently, in layer 3 mode. The servers need a virtual address, to which the services have to be bound.

![Dual ToR layer 3 setup](./media/Dual ToR BGP.png)

Note: In this setup, the role of ToR 2 is only to provide resiliency.

Fig 6: Dual ToR layer 3 setup

Though the resiliency is better with dual ToR, the drawback is the number of ports used. As the number of access ports in the ToR doubles, the number of ports required in the Spine layer also doubles, and this keeps cascading to the higher layers.

Type | Single ToR | Dual ToR (layer 2) | Dual ToR (layer 3)
------------------| ----------------| ----------------- |-----------------
Resiliency<sup>1</sup> | No<sup>2</sup> | Yes | Yes
Port usage | 1:1 | 1:2 | 1:2
Cabling | Less | More | More
Cost of DC fabric | Low | High | High
ToR features required | Low | High | Medium

<sup>1</sup> Resiliency in terms of ToR/link/NIC failures.

<sup>2</sup> As an alternative, resiliency can be addressed at the application layer.

Along with the above, an application might need more capabilities from the infrastructure to deploy at scale. Some of them are:

### Anycast

As seen in the previous section on deploying at scale, anycast is one of the means to have services distributed across cabinets while still having traffic flow to every one of the servers. To achieve this, two things are required:

1. A routing protocol between the ToR and the server (to announce the anycast address).

2. Support for ECMP (Equal Cost Multi-Path) load balancing in the infrastructure, to distribute the flows across the cabinets.
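
The per-flow behaviour of ECMP described above can be sketched in a few lines. This is a toy model, not a switch implementation: real switches compute the hash in hardware over vendor-specific header fields, and `pick_next_hop` is a hypothetical helper used only for illustration.

```python
import hashlib

def pick_next_hop(src_ip, dst_ip, src_port, dst_port, proto, next_hops):
    # Build a stable key from the flow's 5-tuple and hash it, roughly as
    # an ECMP-capable switch would when choosing among equal-cost paths.
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = int(hashlib.sha256(key).hexdigest(), 16)
    return next_hops[digest % len(next_hops)]

# All packets of one flow hash to the same next hop, preserving packet
# order within the flow; different flows spread across the paths.
paths = ["path-1", "path-2", "path-3", "path-4"]
first = pick_next_hop("10.0.0.1", "10.1.1.1", 40001, 443, 6, paths)
second = pick_next_hop("10.0.0.1", "10.1.1.1", 40001, 443, 6, paths)
assert first == second  # same flow, same path
```

This also illustrates why an anycast setup needs uniformity across cabinets: the hash only balances flows, not bytes, so one long-lived heavy flow stays pinned to a single server.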

### Load balancing

Similar to anycast, another means to achieve load balancing across the servers (hosting a particular app) is to use load balancers. These can be implemented in different ways:

1. Hardware load balancers: An LB device is placed inline in the traffic flow and looks at the layer 3 and layer 4 information in an incoming packet. It then determines the set of real hosts to which the connections are to be redirected. As covered in the [Scale](https://linkedin.github.io/school-of-sre/level102/networking/scale/#load-balancer) topic, these load balancers can be set up in two ways:

    - Single-arm mode: In this mode, the load balancer handles only the incoming requests to the VIP. The response from the server goes directly to the clients. There are two ways to implement this:

        * L2 DSR: The load balancer and the real servers remain in the same VLAN. Upon getting an incoming request, the load balancer identifies the real server to redirect the request to and modifies the destination MAC address of that Ethernet frame. Upon processing this packet, the real server responds directly to the client.

        * [L3 DSR](https://github.com/yahoo/l3dsr): In this case, the load balancer and the real servers need not be in the same VLAN (this does away with layer 2 complexities like running STP, managing a wider broadcast domain, etc). Upon an incoming request, the load balancer redirects it to the real server by modifying the destination IP address of the packet. Along with this, the DSCP value of the packet is set to a predefined value (mapped for that VIP). Upon receipt of this packet, the real server uses the DSCP value to determine the loopback address (VIP address). The response again goes directly to the client.

    - Two-arm mode: In this case, the load balancer is inline for both incoming and outgoing traffic.

2. 
DNS based load balancer: Here the DNS servers keep checking the health of the real servers and resolve the domain in such a way that the clients connect to different servers in that cluster. This was explained in detail in the deployment at [scale](https://linkedin.github.io/school-of-sre/level102/networking/scale/#dns-based-load-balancing) section.

3. IPVS based load balancing: This is another means, where an IPVS server presents itself as the service endpoint to the clients. Upon an incoming request, the IPVS server redirects it to one of the real servers. IPVS can be set up to do health checks for the real servers.

### NAT

Network Address Translation (NAT) is required for hosts that need to connect to destinations on the Internet without exposing their configured NIC addresses. Here the address (of the internal server) is translated to a public address by a firewall. A few examples of this are proxy servers, mail servers, etc.

### QoS

Quality of Service is a means to provide differentiated treatment to some packets over others, such as priority in the forwarding queues, or bandwidth reservations. In the data centre scenario, the need for QoS varies depending upon the bandwidth subscription ratio:

1. 1:1 bandwidth subscription ratio: In this case, the server-to-ToR connectivity bandwidth (for all servers in that cabinet) is equivalent to the ToR-to-Spine switch connectivity bandwidth, and similarly for the upper layers. In this design, congestion on a link will not happen, as enough bandwidth is always available. Here the only difference QoS can bring is priority treatment for certain packets in the forwarding queue. Note: Packet buffering happens when a packet moves between ports of different speeds, like 100Gbps and 10Gbps.

2. 
Oversubscribed network: In this case, not all layers maintain a 1:1 bandwidth subscription ratio; for example, the ToR uplink may be of lower bandwidth compared to the total ToR-to-server bandwidth (this is sometimes referred to as the oversubscription ratio). Here there is a possibility of congestion, so QoS might be required, to give priority as well as bandwidth reservations for certain types of traffic flows.

diff --git a/courses/level102/networking/introduction.md b/courses/level102/networking/introduction.md
new file mode 100644
index 0000000..6651f2a
--- /dev/null
+++ b/courses/level102/networking/introduction.md
@@ -0,0 +1,123 @@

# Prerequisites

It is recommended to have basic knowledge of network security, TCP and data centre setup, and the common terminologies used in them. The readers are also expected to go through the School of SRE contents:

- [Linux Networking](http://linkedin.github.io/school-of-sre/level101/linux_networking/intro/)

- [System design](http://linkedin.github.io/school-of-sre/level101/systems_design/intro/)

- [Security](http://linkedin.github.io/school-of-sre/level101/security/intro/)

# What to expect from this course

This part will cover how a data centre infrastructure is segregated for different application needs, as well as the considerations for deciding where to place an application. These will be broadly based on security, scale, RTT (latency) and infrastructure features.

Each of these topics will be covered in detail:

Security - Will cover the threat vectors faced by services serving external/internal clients, and the potential mitigation options to consider while deploying them. This will touch upon perimeter security, [DDoS](https://en.wikipedia.org/wiki/Denial-of-service_attack) protection, network demarcation and ring-fencing the server clusters.

Scale - Deploying large scale applications requires a better understanding of infrastructure capabilities, in terms of resource availability, failure domains, and scaling options like using anycast, layer 4/7 load balancers, or DNS based load balancing.

RTT (latency) - Latency plays a key role in determining the overall performance of a distributed service/application, where calls are made between hosts to serve the users.

Infrastructure features - Some of the aspects to consider are whether the underlying data centre infrastructure supports ToR resiliency, i.e., features like link bundling (bonds), BGP (Border Gateway Protocol), support for anycast service, load balancer, firewall, Quality of Service.

# What is not covered under this course

Though these parameters play a role in designing an application, we will not go into the details of the design. Each of these topics is vast, hence the objective is to introduce the terms and the relevance of these parameters, and not to provide extensive details about each one of them.

# Course Contents
1. [Security](http://linkedin.github.io/school-of-sre/level102/networking/security/)
2. [Scale](https://linkedin.github.io/school-of-sre/level102/networking/scale/)
3. [RTT](http://linkedin.github.io/school-of-sre/level102/networking/rtt/)
4. [Infrastructure features](http://linkedin.github.io/school-of-sre/level102/networking/Infrastructure-features/)
5. [Conclusion](http://linkedin.github.io/school-of-sre/level102/networking/Conclusion/)


# Terminology

Before discussing each of the topics, it is important to get familiar with a few commonly used terms.

Cloud

This refers to hosted solutions from different providers, like Azure, AWS and GCP, wherein enterprises can host their applications for either public or private usage.

On-prem

This term refers to physical Data Center (DC) infrastructure, built and managed by enterprises themselves.
This can be used for private access as well as public access (like users connecting over the Internet).

Leaf switch (ToR)

This refers to the switch that the servers connect to in a DC. It is known by many names, like access switch, Top of the Rack (ToR) switch or leaf switch.

The term leaf switch comes from the [Spine-leaf architecture](https://searchdatacenter.techtarget.com/definition/Leaf-spine), where the access switches are called leaf switches. Spine-leaf architecture is commonly used in large/hyper-scale data centres; it brings very high scalability to the DC switching layer and is also more efficient to build and operate. This is sometimes referred to as a Clos architecture.

Spine switch

Spine switches are the aggregation point of several leaf switches; they provide inter-leaf communication and also connect to the upper layers of the DC infrastructure.

DC fabric

As a data centre grows, multiple Clos networks need to be interconnected to support the scale, and fabric switches help interconnect them.

Cabinet

This refers to the rack where the servers and the ToR are installed. One cabinet refers to the entire rack.

BGP

The Border Gateway Protocol, used to exchange routing information between routers and switches. It is one of the common protocols used on the Internet, as well as in data centres. Other protocols, like OSPF, are also used in place of BGP.

VPN

A Virtual Private Network is a tunnel solution, where two private networks (like offices, data centres, etc) can be interconnected over a public network (the Internet). These VPN tunnels encrypt the traffic before sending it over the Internet, as a security measure.

NIC

Network Interface Card refers to the module in servers that consists of the Ethernet port and its interconnection to the system bus. It is used to connect to the switches (commonly ToR switches).

Flow

A flow refers to a traffic exchange between two nodes (servers, switches, routers, etc) that has common parameters like source/destination IP address, source/destination port number and IP protocol number. This helps in tracking a particular traffic exchange session between two nodes (like a file copy session, or an HTTP connection).

ECMP

Equal Cost Multi-Path means a switch/router can distribute the traffic to a destination among multiple exit interfaces. The flow information is used to build a hash value, and based on that, an exit interface is selected. Once a flow is mapped to a particular exit interface, all the packets of that flow exit via the same interface only. This helps in preventing out-of-order delivery of packets.

RTT

This is a measure of the time it takes for a packet to travel from the source to the destination and return to the source. It is most commonly used in measuring network performance and in troubleshooting.

TCP throughput

This is the measure of the data transfer rate achieved between two nodes. It is impacted by many parameters like RTT, packet size, window size, etc.

Unicast

This refers to traffic flow from a single source to a single destination (e.g. an SSH session, where there is one-to-one communication).

Anycast

This refers to one-to-one traffic flow as above, but the endpoint could be any one of a group of hosts (i.e.) a single source can send traffic to any one of the destination hosts in that group. This is achieved by configuring the same IP address on multiple servers; every new traffic flow is mapped to one of those servers.

Multicast

This refers to one-to-many traffic flow (i.e.) a single source can send traffic to multiple destinations. To make this feasible, the network routers replicate the traffic to the different hosts (which register as members of that particular multicast group).
diff --git a/courses/level102/networking/media/Anycast.png b/courses/level102/networking/media/Anycast.png
new file mode 100644
index 0000000..d66a7d3
Binary files /dev/null and b/courses/level102/networking/media/Anycast.png differ
diff --git a/courses/level102/networking/media/Dual ToR BGP.png b/courses/level102/networking/media/Dual ToR BGP.png
new file mode 100644
index 0000000..e7cfc52
Binary files /dev/null and b/courses/level102/networking/media/Dual ToR BGP.png differ
diff --git a/courses/level102/networking/media/Dual ToR.png b/courses/level102/networking/media/Dual ToR.png
new file mode 100644
index 0000000..55c3793
Binary files /dev/null and b/courses/level102/networking/media/Dual ToR.png differ
diff --git a/courses/level102/networking/media/LB 1-Arm.png b/courses/level102/networking/media/LB 1-Arm.png
new file mode 100644
index 0000000..607924d
Binary files /dev/null and b/courses/level102/networking/media/LB 1-Arm.png differ
diff --git a/courses/level102/networking/media/LB 2-Arm.png b/courses/level102/networking/media/LB 2-Arm.png
new file mode 100644
index 0000000..4d50b97
Binary files /dev/null and b/courses/level102/networking/media/LB 2-Arm.png differ
diff --git a/courses/level102/networking/media/RTT.png b/courses/level102/networking/media/RTT.png
new file mode 100644
index 0000000..36ce2d7
Binary files /dev/null and b/courses/level102/networking/media/RTT.png differ
diff --git a/courses/level102/networking/media/Single ToR.png b/courses/level102/networking/media/Single ToR.png
new file mode 100644
index 0000000..4d902d4
Binary files /dev/null and b/courses/level102/networking/media/Single ToR.png differ
diff --git a/courses/level102/networking/rtt.md b/courses/level102/networking/rtt.md
new file mode 100644
index 0000000..ce80e4c
--- /dev/null
+++ b/courses/level102/networking/rtt.md
@@ -0,0 +1,33 @@

> *Latency plays a key role in determining the overall performance of the
distributed service/application, where calls are made between hosts to
serve the users.*

RTT is a measure of the time it takes for a packet to reach B from A and return to A. It is measured in milliseconds. This measure plays a role in determining the performance of services. Its impact is seen in the calls made between servers/services to serve a user, as well as in the TCP throughput that can be achieved.

It is fairly common that a service makes multiple calls to servers within its cluster, or to different services like authentication, logging, database, etc, to respond to each user/client request. These servers can be spread across different cabinets, at times even across different data centres in the same region. Such cases are quite likely in cloud solutions, where a deployment spreads across different sites within a region. As the RTT increases, the response time for each of these calls gets longer, and thereby has a cascading effect on the end response sent to the user.

### Relation of RTT and throughput

RTT is inversely proportional to TCP throughput: as RTT increases, TCP throughput reduces, just as it does with packet loss. Below is a formula to estimate the TCP throughput, based on the TCP MSS, RTT and packet loss.

![TCP throughput formula](./media/RTT.png)

These calculations are important not only within a data centre, but also for communication over the Internet, where clients connect to DC-hosted services over different telco networks, and the RTT is not very stable due to the unpredictability of Internet routing policies.
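
The relation above can be tried out numerically. This sketch assumes the formula in the figure is the widely used Mathis et al. approximation, throughput ≤ (MSS / RTT) × (1 / √loss); the exact form and constant in the figure may differ.

```python
from math import sqrt

def tcp_throughput_bps(mss_bytes, rtt_ms, loss_rate):
    # Mathis et al. approximation of the TCP throughput ceiling:
    #   throughput <= (MSS / RTT) * (1 / sqrt(loss))
    rtt_s = rtt_ms / 1000.0
    return (mss_bytes * 8 / rtt_s) / sqrt(loss_rate)

# Doubling the RTT halves the achievable throughput, all else being equal.
t_10ms = tcp_throughput_bps(1460, 10, 1e-4)
t_20ms = tcp_throughput_bps(1460, 20, 1e-4)
assert abs(t_10ms / t_20ms - 2.0) < 1e-6
```

This makes the cascading effect concrete: a service call that crosses sites (higher RTT) has a strictly lower throughput ceiling than the same call within a cabinet.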
diff --git a/courses/level102/networking/scale.md b/courses/level102/networking/scale.md
new file mode 100644
index 0000000..d1c4e2c
--- /dev/null
+++ b/courses/level102/networking/scale.md
@@ -0,0 +1,149 @@

> *Deploying large scale applications requires a better understanding of
infrastructure capabilities, in terms of resource availability, failure
domains, and scaling options like using anycast, layer 4/7 load balancers,
DNS based load balancing.*

Building large scale applications is a complex activity, which has to cover many aspects of design, development and operationalisation. This section will talk about the considerations to look for while deploying them.

### Failure domains

In any infrastructure, failures due to hardware or software issues are common. Though these may be a pain from a service availability perspective, such failures do happen, and a pragmatic goal is to keep them to a minimum. Hence, while deploying any service, the failure/non-availability of some of the nodes has to be factored in.

#### Server failures

A server could fail due to power, NIC or software issues. And at times it may not be a complete failure, but could be an error in the NIC that causes some packet loss. This is a very common scenario and impacts stateful services more. While designing such services, it is important to accommodate some level of tolerance to such failures.

#### ToR failures

This is another common scenario, where the leaf switch connecting the servers goes down, taking the entire cabinet down with it. More than one server of the same service can go down in this case. It requires planning to decide how much server loss can be handled without overloading the other servers. Based on this, the service can be distributed across many cabinets.
These calculations may vary, depending upon the resiliency in the ToR design, which is covered in the [ToR connectivity](https://linkedin.github.io/school-of-sre/level102/networking/infrastructure-features/#dual-tor) section.

#### Site failures

Here, site failure is a generic term that could mean a particular service being down in a site, maybe due to a new version rollout, or failures of devices like firewalls or load balancers (if the service depends on them), or loss of connectivity to remote sites (which might have limited options for resiliency), or issues with critical services like DNS. Though these events may not be common, they can have a significant impact.

In summary, these failure scenarios have to be thought about while designing the application itself, so that the application has the tolerance required to recover from unexpected failures. This helps not only with failures, but also with planned maintenance work, as it becomes easier to take parts of the infrastructure out of service.

### Resource availability

The other aspect to consider while deploying applications at scale is the availability of the required infrastructure and of the features the service depends upon. For example, for cabinet-level resiliency, one may decide to distribute a service across 5 cabinets, but if the service needs a load balancer (to distribute incoming connections to different servers), this may become challenging if load balancers are not supported in all cabinets. Or there might not be enough cabinets available that meet the minimum specification required for setting up the service. The best approach in these cases is to identify the requirements and gaps, and then work with the infrastructure team to best solve them.

#### Scaling options

While distributing the application to different cabinets, the incoming traffic to these services has to be distributed across these servers.
To achieve this, the following may be considered:

##### Anycast

This is one of the quickest ways to roll out traffic distribution across multiple cabinets. In this, each server in the cluster (where the service is set up) advertises a loopback address (a /32 IPv4 or /128 IPv6 address) to the DC switch fabric (most commonly, BGP is used for this purpose). The service has to be set up to listen on this loopback address. When the clients try to connect to the service, they get the domain resolved to this virtual address and forward their queries to it. The DC switch fabric distributes each flow to a different available next hop (eventually covering all the servers in that service cluster).

Note: The DC switch computes a hash based on the IP packet header; this could include any combination of the source and destination addresses, the source and destination ports, the MAC address and the IP protocol number. Based on this hash value, a particular next hop is picked. Since all the packets in a traffic flow carry the same values for these headers, all the packets in that flow will be mapped to the same path.

![Anycast setup](./media/Anycast.png)

*Fig 1: Anycast setup*

To achieve a proportionate distribution of flows across these servers, it is important to maintain uniformity in each of the cabinets and pods. But remember, the distribution happens only per flow, and if there are elephant (large) flows, some servers might receive a higher volume of traffic.

If there is a server or ToR failure, the advertisement of the loopback address to the switches stops, and thereby new packets are forwarded to the remaining available servers.

##### Load balancer

Another common approach is to use a load balancer. A virtual IP (VIP) is set up in the load balancers, to which the client connects while trying to access the service.
The load balancer, in turn, redirects these connections to one of the actual servers where the service is running. In order to verify that a server is in a serviceable state, the load balancer does periodic health checks, and if a server fails them, the LB stops redirecting connections to it.

The load balancer can be deployed in single-arm mode, where only the traffic to the VIP passes through the LB, and the return traffic from the server is sent directly to the client. The other option is two-arm mode, where the return traffic also passes through the LB.

![Single-arm mode](./media/LB 1-Arm.png)

Fig 2: Single-arm mode

![Two-arm mode](./media/LB 2-Arm.png)

Fig 3: Two-arm mode

One of the cons of this approach is that, at a higher scale, the load balancer can become the bottleneck in supporting higher traffic volumes or concurrent connections per second.

##### DNS based load balancing

This is similar to the above approach, with the only difference being that the load balancing is done by the DNS instead of an appliance. The clients get different IPs to connect to when they query for the DNS records of the service. The DNS server has to do health checks, to know which servers are in a good state.

This approach alleviates the bottleneck of the load balancer solution. But it requires a shorter TTL for the DNS records, so that problematic servers can be taken out of rotation quickly, which means there will be far more DNS queries.

diff --git a/courses/level102/networking/security.md b/courses/level102/networking/security.md
new file mode 100644
index 0000000..f221fbb
--- /dev/null
+++ b/courses/level102/networking/security.md
@@ -0,0 +1,175 @@

> *This section will cover threat vectors faced by services facing
external/internal clients. Potential mitigation options to consider
while deploying them.
This will touch upon perimeter security, DDoS
protection, network demarcation and operational practices.*

### Security threats

Security is one of the major considerations in any infrastructure. There are various security threats, which could amount to data theft, loss of service, fraudulent activity, etc. An attacker can use techniques like phishing, spamming, malware, DoS/DDoS, exploiting vulnerabilities, man-in-the-middle attacks, and many more. In this section, we will cover some of these threats and possible mitigations. As there are numerous means to attack and to secure an infrastructure, we will only focus on some of the most common ones.

**Phishing** is mostly done via email (and other mass communication methods), where an attacker provides links to fake websites/URLs. Upon accessing those, the victim's sensitive information, like login credentials or personal data, is collected and can be misused.

**Spamming** is similar to phishing, but the attacker doesn't collect data from the users; instead, they try to spam a particular website, probably to overwhelm it (causing slowness), and then use that opportunity to compromise the security of the attacked website.

**Malware** is like a Trojan horse, where an attacker manages to install a piece of code on the secured systems in the infrastructure. Using this, the hacker can collect sensitive data and also infect the critical services of the target company.

**Exploiting vulnerabilities** is another method by which an attacker can gain access to the systems. These could be bugs or misconfigurations in web servers, internet-facing routers/switches/firewalls, etc.

**DoS/DDoS** is one of the common attacks seen on internet-based services/solutions, especially businesses based on eyeball traffic. Here the attacker tries to overwhelm the resources of the victim by generating spurious traffic to the external-facing services. By this, the services primarily turn slow or non-responsive; during this time, the attacker could try to hack into the network, if some of the security mechanisms fail to filter the attack traffic due to overload.

### Securing the infrastructure

The first and foremost aspect of infrastructure administration is to identify the various security threats that could affect the business running over that infrastructure. Once the different threats are known, the security defence mechanisms have to be designed and implemented. Some of the common means of securing the infrastructure are:

#### Perimeter security

This is the first line of defence in any infrastructure, where unwanted/unexpected traffic flows into the infrastructure are filtered/blocked. These could be filters in the edge routers that allow expected traffic (like port 443 traffic for a web service running on HTTPS), or filters set up to block unwanted traffic, like blocking UDP ports if the services are not dependent on UDP.

Similar to the application traffic entering the network, there could be other traffic, like BGP messages from Internet peers, VPN tunnel traffic, and other services like email/DNS, etc. There are means to protect every one of these, like using authentication mechanisms (password or key based) for BGP and VPN peers, and whitelisting these specific peers to make inbound connections (in the perimeter filters). Along with these, the amount of messages/traffic can be rate-limited to a known scale or expected load, so that the resources are not overwhelmed.

#### DDoS mitigation

Protecting against DDoS attacks is another important aspect. The attack traffic will look similar to genuine user/client requests, but with the intention of flooding the externally exposed app, which could be a web server, DNS, etc.
Therefore it is essential to differentiate the attack traffic from the genuine traffic. There are different methods to do this at the application level; one example is using a CAPTCHA on a web service to catch traffic originating from bots.

For these methods to be useful, the nodes should be capable of handling both the attack traffic and the genuine traffic. In cloud-based infrastructure, it may be possible to dynamically add more virtual machines/resources to handle a sudden spike in traffic volume, but on-prem, adding additional resources can be challenging.

To handle a large volume of attack traffic, there are solutions available that can inspect the packets/traffic flows and identify anomalies, i.e., traffic patterns that don't resemble a genuine connection, like a client initiating a TCP connection but failing to complete the handshake, or a set of sources with abnormally huge traffic flows. Once this unwanted traffic is identified, it is dropped at the edge of the network itself, thereby protecting the resources of the app nodes. This topic alone could be discussed in much more detail, but that would be beyond the scope of this section.

#### Network demarcation

Network demarcation is another common strategy, deployed in different networks, where applications are grouped based on their security needs and vulnerability to attack. Some common demarcations are: the external/internet-facing nodes are grouped into one zone, whereas the nodes holding sensitive data are segregated into a separate zone. Any communication between these zones is restricted with the help of security tools, to limit exposure to unwanted hosts/ports. These inter-zone communication filters are sometimes called ring-fencing.
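The ring-fencing idea can be sketched as a policy table listing which zone-to-zone flows are permitted, with everything else denied by default. The zone names, ports and rules below are hypothetical, chosen only to illustrate the concept:

```python
# Hypothetical inter-zone policy: (source zone, destination zone, destination port).
# Anything not explicitly listed is denied, so a compromised host in one
# zone cannot freely reach the rest of the infrastructure.
ALLOWED_FLOWS = {
    ("internet", "dmz", 443),   # external users reach the web tier over HTTPS
    ("dmz", "app", 8080),       # DMZ proxies forward to internal app servers
    ("app", "data", 5432),      # app servers reach the sensitive-data zone
}

def is_allowed(src_zone: str, dst_zone: str, dst_port: int) -> bool:
    if src_zone == dst_zone:
        return True  # intra-zone traffic is not ring-fenced in this sketch
    return (src_zone, dst_zone, dst_port) in ALLOWED_FLOWS

# A compromised DMZ host cannot reach the sensitive-data zone directly:
assert not is_allowed("dmz", "data", 5432)
assert is_allowed("internet", "dmz", 443)
```

In practice these policies are enforced by firewalls or router access-lists sitting between the zones, not in application code, but the allow/deny matrix they implement looks much like this table.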
The number of zones to be created varies across deployments. For example, there could be hosts which should be able to communicate with both the external world and the internal servers, like proxy or email servers; these can be grouped under one zone, say a De-Militarized Zone (DMZ). The main advantage of creating zones is that even a compromised host doesn't act as a back-door entry to the rest of the infrastructure.

#### Node protection

Be it servers, routers, switches, load balancers, firewalls, etc., each of these devices comes with certain capabilities to secure itself, like support for filters (e.g., access-lists, iptables) to control which traffic to process and which to drop. Anti-virus software can also be used on servers to check the software installed on them.

#### Operational practices

There are numerous security threats to infrastructure, and there are different solutions to defend against them. The key part of the defence is not only identifying the right solutions and the tools for them, but also making sure there are robust operational procedures in place to respond promptly, decisively and with clarity to any security incident.

##### Standard Operating Procedures (SOP)

SOPs need to be well defined, and act as a reference for the on-call to follow during a security incident. The SOP should cover things like:

- When a security incident happens, how it will be alerted, and to whom.

- How to identify the scale and severity of the security incident.

- Who the points of escalation are, and the threshold/time to intimate them; these could be other concerned teams, the management, or even the security operations in-charge.

- Which solutions to use (and the procedure to follow with them) to mitigate the security incident.

- How the data about the security incident is to be collated for further analysis.
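Returning to the filters mentioned under node protection, access-lists and iptables chains are typically evaluated as an ordered list of rules where the first match wins, ending with a default policy. This is a small sketch of that evaluation model; the rules shown are illustrative examples, not recommendations:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Rule:
    action: str                  # "accept" or "drop"
    proto: Optional[str] = None  # None matches any protocol
    port: Optional[int] = None   # None matches any port

    def matches(self, proto: str, port: int) -> bool:
        return self.proto in (None, proto) and self.port in (None, port)

# Ordered like an access-list: the first matching rule decides.
RULES = [
    Rule("accept", proto="tcp", port=443),  # allow the HTTPS service
    Rule("accept", proto="tcp", port=22),   # allow SSH for administration
    Rule("drop", proto="udp"),              # this node's services don't use UDP
    Rule("drop"),                           # default policy: drop everything else
]

def filter_packet(proto: str, port: int) -> str:
    for rule in RULES:
        if rule.matches(proto, port):
            return rule.action
    return "drop"  # unreachable here, since the last rule matches anything

assert filter_packet("tcp", 443) == "accept"
assert filter_packet("udp", 53) == "drop"
```

Because evaluation stops at the first match, rule ordering matters: a broad drop placed before a specific accept would silently disable that service.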
Many organisations have a dedicated team focused on security, and they drive most of the activities, during an attack and even before, to come up with best practices, guidelines and compliance audits. It is the responsibility of the respective technical teams to ensure the infrastructure meets these recommendations and that gaps are fixed.

##### Periodic review

Along with defining SOPs, the entire security of the infrastructure has to be reviewed periodically. This review should include:

- Identifying any new/improved security threats that could potentially target the infrastructure.

- Reviewing the SOPs periodically, depending upon new security threats or changes in the procedures (to implement the solutions).

- Ensuring software upgrades/patches are done in a timely manner.

- Auditing the infrastructure for any non-compliance with the security standards.

- Reviewing recent security incidents and finding means to improve the defence mechanisms. diff --git a/mkdocs.yml b/mkdocs.yml index 14885d2..1eb283c 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -78,14 +78,21 @@ nav: - Writing Secure code: level101/security/writing_secure_code.md - Conclusion: level101/security/conclusion.md - Level 102: - - Linux Advanced: - - Containerization And Orchestration: - - Introduction: level102/containerization_and_orchestration/intro.md - - Introduction To Containers: level102/containerization_and_orchestration/intro_to_containers.md - - Containerization With Docker: level102/containerization_and_orchestration/containerization_with_docker.md - - Orchestration With Kubernetes: level102/containerization_and_orchestration/orchestration_with_kubernetes.md - - Conclusion: level102/containerization_and_orchestration/conclusion.md - - System Troubleshooting and Performance Improvements: + - Linux Advanced: + - Containerization And Orchestration: + - Introduction: level102/containerization_and_orchestration/intro.md + - Introduction To Containers: 
level102/containerization_and_orchestration/intro_to_containers.md + - Containerization With Docker: level102/containerization_and_orchestration/containerization_with_docker.md + - Orchestration With Kubernetes: level102/containerization_and_orchestration/orchestration_with_kubernetes.md + - Conclusion: level102/containerization_and_orchestration/conclusion.md + - Networking: + - Introduction: level102/networking/introduction.md + - Security: level102/networking/security.md + - Scale: level102/networking/scale.md + - RTT: level102/networking/rtt.md + - Infrastructure Services: level102/networking/infrastructure-features.md + - Conclusion: level102/networking/conclusion.md + - System Troubleshooting and Performance Improvements: - Introduction: level102/system_troubleshooting_and_performance/introduction.md - Troubleshooting: level102/system_troubleshooting_and_performance/troubleshooting.md - Important Tools: level102/system_troubleshooting_and_performance/important-tools.md