Adding networking course for level102 (#113)

* Adding networking course for level102

* Adding URL and minor fix

Co-authored-by: Arun Thiagarajan <athiagar@athiagar-ld2.linkedin.biz>
Co-authored-by: kalyan <ksomasundaram@linkedin.com>
Arun Thiagarajan 3 years ago committed by GitHub
parent 95b5e64cfb
commit b302d7bc06

courses/level102/networking/conclusion.md
@@ -0,0 +1,10 @@
This course would have given some background on deploying services in a data centre, the various parameters to consider, and the solutions available. It has to be noted that each of the solutions discussed here has its own pros and cons, so the right fit among them has to be identified based on the specific scenario/requirement. As we didn't go into the depth of the various technologies/solutions in this course, it might have made the reader curious to know more about some of the topics. Here are some references and online training content for further learning.

- [LinkedIn engineering blog](https://engineering.linkedin.com/blog/topic/datacenter): has information about how LinkedIn data centres are set up and how some of the key problems are solved.
- [IPSpace blog](https://blog.ipspace.net/tag/data-center.html): has a lot of articles about data centre networking.
- [Networking Basics](https://www.edx.org/course/introduction-to-networking) course on edX.
Happy learning !!

courses/level102/networking/infrastructure-features.md
@@ -0,0 +1,180 @@
> *Some of the aspects to consider are, whether the underlying data
centre infrastructure supports ToR resiliency, i.e. features like link
bundling (bonds), BGP, support for anycast service, load balancer,
firewall, Quality of Service.*
As seen in the previous sections, deploying applications at scale requires
certain capabilities to be supported by the infrastructure. This
section will cover the different options available and their suitability.
### ToR connectivity
The ToR being one of the most frequent points of failure (considering the scale of deployment), there are different options available to connect the servers to the ToR. We will look at each of them in detail below.
#### Single ToR
This is the simplest of all the options, where a NIC of the server is
connected to one ToR. The advantage of this approach is that a
minimal number of switch ports is used, allowing the DC fabric to support
rapid growth of the server infrastructure (note: not only are the ToR ports
used efficiently, port usage in the upper switching layers of the DC fabric
is efficient as well). On the downside, the servers become
unreachable if there is an issue with the ToR, the link or the NIC. This
impacts stateful applications more, as their existing connections get
abruptly disconnected.
![Single ToR design](./media/Single ToR.png)
Fig 4: Single ToR design
#### Dual ToR
In this option, each server is connected to two ToRs of the same
cabinet. This can be set up in active/passive mode, thereby providing
resiliency during ToR/link/NIC failures. The resiliency can be achieved
either in layer 2 or in layer 3.
##### Layer 2
In this case, both the links are bundled together as a [bond](https://en.wikipedia.org/wiki/Link_aggregation) on the
server side (with one NIC taking the active role and the other being
passive). On the switch side, these two links are made part of
[multi-chassis lag](https://en.wikipedia.org/wiki/Multi-chassis_link_aggregation_group) (similar to bonding, but spread across switches). The
prerequisite here is that both the ToRs should be part of the same layer 2
domain. The IP addresses are configured on the bond interface on the
server and SVI on the switch side.
![Dual ToR layer 2 setup](./media/Dual ToR.png)
Note: In this setup, the role of ToR 2 is only to provide resiliency.
Fig 5: Dual ToR layer 2 setup
##### Layer 3
In this case, both the links are configured as separate layer 3
interfaces. The resiliency is achieved by setting up a routing protocol
(like BGP), wherein one link is given higher preference over the other.
In this case, the two ToRs can be set up independently, in layer 3
mode. The servers would need a virtual address, to which the services
have to be bound.
![Dual ToR layer 3 setup](./media/Dual ToR BGP.png)
Note: In this setup, the role of ToR 2 is only to provide resiliency.
Fig 6: Dual ToR layer 3 setup
Though the resiliency is better with dual ToR, the drawback is the
number of ports used. As the number of access ports in the ToR doubles,
the number of ports required in the spine layer also doubles, and
this keeps cascading to the higher layers.
Type | Single ToR | Dual ToR (layer 2) | Dual ToR (layer 3)
------------------| ----------------| ----------------- |-----------------
Resiliency<sup>1</sup> | No<sup>2</sup> | Yes | Yes
Port usage | 1:1 | 1:2 | 1:2
Cabling | Less | More | More
Cost of DC fabric | Low | High | High
ToR features required | Low | High | Medium
<sup>1</sup> Resiliency in terms of ToR/Link/NIC
<sup>2</sup> As an alternative, resiliency can be addressed at the application layer.
Along with the above-mentioned ones, an application might need more
capabilities from the infrastructure to deploy at scale. Some of them
are:
### Anycast
As seen in the previous section on deploying at scale, anycast is one
of the means to have services distributed across cabinets and still have
traffic flowing to each one of the servers. To achieve this, two things
are required:
1. Routing protocol between ToR and server (to announce the anycast
address)
2. Support for ECMP (Equal Cost Multi-Path) load balancing in the
infrastructure, to distribute the flows across the cabinets.
### Load balancing
Similar to anycast, another means to achieve load balancing across
servers (hosting a particular app) is to use load balancers. These could be
implemented in different ways:
1. Hardware load balancers: An LB device is placed inline of the traffic
flow, and looks at the layer 3 and layer 4 information in an incoming
packet. It then determines the set of real hosts to which the connections
are to be redirected. As covered in the [Scale](https://linkedin.github.io/school-of-sre/level102/networking/scale/#load-balancer) topic, these load balancers can be set up in two ways,
- Single-arm mode: In this mode, the load balancer handles only the
incoming requests to the VIP. The response from the server goes directly
to the clients. There are two ways to implement this,
* L2 DSR: Here the load balancer and the real servers remain in the
same VLAN. Upon getting an incoming request, the load balancer
identifies the real server to redirect the request to and then modifies the
destination MAC address of that Ethernet frame. Upon processing this
packet, the real server responds directly to the client.
* [L3 DSR](https://github.com/yahoo/l3dsr): In this case, the load balancer and real servers need not be
in the same VLAN (this does away with layer 2 complexities like running STP,
managing a wider broadcast domain, etc). Upon an incoming request, the load
balancer redirects it to the real server by modifying the destination IP
address of the packet. Along with this, the DSCP value of the packet is
set to a predefined value (mapped for that VIP). Upon receipt of this
packet, the real server uses the DSCP value to determine the loopback
address (VIP address). The response again goes directly to the client (a small sketch of this DSCP-to-VIP mapping follows this list).
- Two-arm mode: In this case, the load balancer is inline for both incoming
and outgoing traffic.
2. DNS based load balancer: Here the DNS servers keep a check on the
health of the real servers and resolve the domain in such a way that the
clients can connect to different servers in that cluster. This part is
explained in detail in the deployment at [scale](https://linkedin.github.io/school-of-sre/level102/networking/scale/#dns-based-load-balancing) section.
3. IPVS based load balancing: This is another means, where an IPVS
server presents itself as the service endpoint to the clients. Upon an
incoming request, the IPVS server directs the request to one of the real servers.
IPVS can be set up to do health checks for the real servers.
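To make the L3 DSR mode above a little more concrete, here is a toy sketch of the DSCP-to-VIP mapping that a real server relies on. The DSCP values and addresses are made up for illustration, and in practice this mapping lives in the kernel (e.g. the iptables-based rules shipped with the l3dsr project), not in user-space code.

```python
# Hypothetical DSCP -> VIP mapping configured on every real server.
# The load balancer marks packets destined for VIP 198.51.100.10 with DSCP 10, etc.
DSCP_TO_VIP = {
    10: "198.51.100.10",   # www VIP
    12: "198.51.100.20",   # api VIP
}

def local_delivery_address(dscp: int, rewritten_dst_ip: str) -> str:
    """Return the loopback (VIP) address this packet should be accepted on.

    The load balancer rewrote the destination IP to this server's own address
    and encoded the original VIP in the DSCP field; the server maps it back so
    that the service bound to the VIP sees the packet and replies directly to
    the client with the VIP as the source address.
    """
    return DSCP_TO_VIP.get(dscp, rewritten_dst_ip)

print(local_delivery_address(10, "10.20.30.40"))   # -> 198.51.100.10
```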
### NAT
Network Address Translation (NAT) will be required for hosts that need
to connect to destinations on the Internet, but don't want to expose
their configured NIC address. In this case, the address (of the internal
server) is translated to a public address by a firewall. A few examples of
this are proxy servers, mail servers, etc.
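A minimal sketch of the translation idea follows: a NAPT-style table mapping an internal (address, port) pair to a port on the firewall's public address. All the addresses and ports here are made up, and real NAT is performed by the firewall/kernel, not by application code.

```python
import itertools

PUBLIC_IP = "203.0.113.10"            # hypothetical public address owned by the firewall
_port_pool = itertools.count(20000)   # public source ports handed out to translated flows

_out_table = {}   # (internal_ip, internal_port) -> public_port
_in_table = {}    # public_port -> (internal_ip, internal_port)

def translate_outbound(src_ip: str, src_port: int) -> tuple[str, int]:
    """Rewrite an internal source (ip, port) to the firewall's public address."""
    key = (src_ip, src_port)
    if key not in _out_table:
        public_port = next(_port_pool)
        _out_table[key] = public_port
        _in_table[public_port] = key
    return PUBLIC_IP, _out_table[key]

def translate_inbound(public_port: int) -> tuple[str, int]:
    """Map a reply arriving at the public address back to the internal host."""
    return _in_table[public_port]

print(translate_outbound("10.10.1.5", 51000))   # ('203.0.113.10', 20000)
print(translate_inbound(20000))                 # ('10.10.1.5', 51000)
```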
### QoS
Quality of Service is a means to provide differentiated treatment to some packets
over others. This could be priority in forwarding queues, or
bandwidth reservations. In the data centre scenario, the need for QoS
depends upon the bandwidth subscription ratio:
1. 1:1 bandwidth subscription ratio: In this case, the server-to-ToR
connectivity bandwidth (for all servers in that cabinet) should be
equivalent to the ToR-to-spine switch connectivity, and similarly for the
upper layers. In this design, congestion on a link is not going
to happen, as enough bandwidth will always be available. Here
the only difference QoS can bring is priority treatment for
certain packets in the forwarding queue. Note: packet buffering happens
when a packet moves between ports of different speeds, like 100Gbps and
10Gbps.
2. Oversubscribed network: In this case, not all layers maintain a
1:1 bandwidth subscription ratio; for example, the ToR uplink bandwidth may be
lower than the combined ToR-to-server bandwidth (this is sometimes
referred to as the oversubscription ratio). In this case, there is a
possibility of congestion. Here QoS might be required, to give priority
as well as bandwidth reservation for certain types of traffic flows (a worked example of the subscription ratio arithmetic follows this list).
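As a worked example of the subscription ratio arithmetic, the sketch below computes the oversubscription of a hypothetical cabinet; the port counts and speeds are assumptions chosen for illustration, not a recommendation.

```python
# Hypothetical cabinet: 40 servers with 25 Gbps links to the ToR,
# and 4 x 100 Gbps uplinks from the ToR towards the spine layer.
downlink_gbps = 40 * 25    # server-facing capacity of the ToR
uplink_gbps = 4 * 100      # ToR-to-spine capacity

ratio = downlink_gbps / uplink_gbps
print(f"Downlink capacity : {downlink_gbps} Gbps")
print(f"Uplink capacity   : {uplink_gbps} Gbps")
print(f"Oversubscription  : {ratio:.1f}:1")   # 2.5:1 here -> congestion is possible, QoS may help
# A 1:1 subscription would need the uplink capacity to match the downlink capacity (1000 Gbps here).
```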

courses/level102/networking/introduction.md
@@ -0,0 +1,123 @@
# Prerequisites
It is recommended to have basic knowledge of network security, TCP and
data centre setup, and the common terminologies used in them. Also, the
readers are expected to go through the School of SRE contents -
- [Linux Networking](http://linkedin.github.io/school-of-sre/level101/linux_networking/intro/)
- [system design](http://linkedin.github.io/school-of-sre/level101/systems_design/intro/)
- [security](http://linkedin.github.io/school-of-sre/level101/security/intro/)
# What to expect from this course
This part will cover how a data centre infrastructure is segregated for different
application needs, as well as the considerations for deciding where to
place an application. These will be broadly based on Security, Scale,
RTT (latency) and Infrastructure features.
Each of these topics will be covered in detail:
Security - Will cover threat vectors faced by services facing
external/internal clients, and potential mitigation options to consider
while deploying them. This will touch upon perimeter security, [DDoS](https://en.wikipedia.org/wiki/Denial-of-service_attack)
protection, network demarcation and ring-fencing the server clusters.
Scale - Deploying large scale applications requires a better
understanding of infrastructure capabilities, in terms of resource
availability, failure domains, and scaling options like using anycast, layer
4/7 load balancers and DNS based load balancing.
RTT (latency) - Latency plays a key role in determining the overall
performance of the distributed service/application, where calls are made
between hosts to serve the users.
Infrastructure features - Some of the aspects to consider are, whether
the underlying data centre infrastructure supports ToR resiliency, i.e.,
features like link bundling (bonds), BGP (Border Gateway Protocol), support for anycast service,
load balancer, firewall, Quality of Service.
# What is not covered under this course
Though these parameters play a role in designing an application, we will
not go into the details of the design. Each of these topics is vast, hence the objective is to introduce the terms and the relevance of these parameters, and not to provide extensive details about each one of them.
# Course Contents
1. [Security](http://linkedin.github.io/school-of-sre/level102/networking/security/)
2. [Scale](https://linkedin.github.io/school-of-sre/level102/networking/scale/)
3. [RTT](http://linkedin.github.io/school-of-sre/level102/networking/rtt/)
4. [Infrastructure features](http://linkedin.github.io/school-of-sre/level102/networking/Infrastructure-features/)
5. [Conclusion](http://linkedin.github.io/school-of-sre/level102/networking/Conclusion/)
# Terminology
Before discussing each of the topics, it is important to get familiar with a few commonly used terms:
Cloud
This refers to hosted solutions from different providers like Azure, AWS and GCP, wherein enterprises can host their applications for either public or private usage.
On-prem
This term refers to physical Data Center(DC) infrastructure, built and managed by enterprises themselves. This can be used for private access as well as public (like users connecting over the Internet).
Leaf switch (ToR)
This refers to the switch that the servers connect to in a DC. It is called by many names, like access switch, Top of the Rack switch or leaf switch.
The term leaf switch comes from the [Spine-leaf architecture](https://searchdatacenter.techtarget.com/definition/Leaf-spine), where the access switches are called leaf switches. Spine-leaf architecture is commonly used in large/hyper-scale data centres; it brings very high scalability to the DC switching layer and also makes building and implementing these switches more efficient. Sometimes this is referred to as a Clos architecture.
Spine switch
Spine switches are the aggregation point of several leaf switches; they provide the inter-leaf communication and also connect to the upper layer of the DC infrastructure.
DC fabric
As the data centre grows, multiple Clos networks need to be interconnected, to support the scale, and fabric switches help to interconnect them.
Cabinet
This refers to the rack, where the servers and ToR are installed. One cabinet refers to the entire rack.
BGP
It is the Border Gateway Protocol, used to exchange routing information between routers and switches. This is one of the common protocols used on the Internet as well as in data centres. Other protocols, like OSPF, are also used in place of BGP.
VPN
A Virtual Private Network is a tunnel solution, where two private networks (like offices, datacentres, etc.) can be interconnected over a public network (the Internet). These VPN tunnels encrypt the traffic before sending it over the Internet, as a security measure.
NIC
Network Interface Card refers to the module in Servers, which consists of the Ethernet port and the interconnection to the system bus. It is used to connect to the switches (commonly ToR switches).
Flow
Flows refer to a traffic exchange between two nodes (which could be servers, switches, routers, etc.) that has common parameters like source/destination IP address, source/destination port number and IP protocol number. This helps in tracking a particular traffic exchange session between two nodes (like a file copy session, or an HTTP connection).
ECMP
Equal Cost Multi-Path means a switch/router can distribute the traffic to a destination among multiple exit interfaces. The flow information is used to build a hash value, and based on that an exit interface is selected. Once a flow is mapped to a particular exit interface, all the packets of that flow exit via the same interface only. This helps in preventing out-of-order delivery of packets.
RTT
This is a measure of the time it takes for a packet to travel from the source to the destination and return to the source. It is most commonly used in measuring network performance and in troubleshooting.
TCP throughput
This is the measure of the data transfer rate achieved between two nodes. This is impacted by many parameters like RTT, packet size, window size, etc.
Unicast
This refers to the traffic flow from a single source to a single destination, e.g. ssh sessions, where there is one-to-one communication.
Anycast
This refers to a one-to-one traffic flow as above, but the endpoints could be multiple, i.e. a single source can send traffic to any one of the destination hosts in that group. This is achieved by configuring the same IP address on multiple servers, and every new traffic flow is mapped to one of those servers.
Multicast
This refers to a one-to-many traffic flow, i.e. a single source can send traffic to multiple destinations. To make it feasible, the network routers replicate the traffic to the different hosts (which register as members of that particular multicast group).

(Seven binary image files added under ./media/ — the figures referenced in this course — not shown.)

courses/level102/networking/rtt.md
@@ -0,0 +1,33 @@
> *Latency plays a key role in determining the overall performance of the
distributed service/application, where calls are made between hosts to
serve the users.*
RTT is a measure of the time it takes for a packet to reach B from A and
return to A. It is measured in milliseconds. This measure plays a role
in determining the performance of the services. Its impact is seen in
calls made between different servers/services to serve the user, as
well as in the TCP throughput that can be achieved.
It is fairly common that a service makes multiple calls to servers within
its cluster, or to different services like authentication, logging,
database, etc., to respond to each user/client request. These servers can
be spread across different cabinets, at times even across different
data centres in the same region. Such cases are quite possible in cloud
solutions, where a deployment spreads across different sites within a
region. As the RTT increases, the response time for each of the calls
gets longer and thereby has a cascading effect on the end response being
sent to the user.
### Relation of RTT and throughput
RTT is inversely proportional to the TCP throughput. As RTT increases,
it reduces the TCP throughput, just like packet loss does. Below is a formula
to estimate the TCP throughput, based on the TCP MSS, RTT and packet loss.
![TCP throughput formula](./media/RTT.png)
These calculations are important not only within a data centre, but also for
communication over the Internet, where a client connects to the DC-hosted
services over different telco networks and the RTT is not very
stable, due to the unpredictability of Internet routing policies.
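The figure above relates throughput to MSS, RTT and loss. Assuming it is the commonly used Mathis et al. approximation (throughput ≤ MSS / (RTT × √loss)), the short sketch below shows how quickly per-connection throughput drops as RTT grows; the MSS, loss rate and RTT values are illustrative assumptions only.

```python
import math

def tcp_throughput_bps(mss_bytes: int, rtt_ms: float, loss: float) -> float:
    """Rough per-connection TCP throughput estimate (Mathis et al. approximation).

    throughput <= (MSS / RTT) * (1 / sqrt(loss))
    """
    rtt_s = rtt_ms / 1000.0
    return (mss_bytes * 8 / rtt_s) / math.sqrt(loss)

# Same path quality (0.01% loss, 1460-byte MSS), different RTTs:
for rtt in (1, 10, 50, 100):   # ms: within a DC, within a region, cross-region, intercontinental
    gbps = tcp_throughput_bps(1460, rtt, 0.0001) / 1e9
    print(f"RTT {rtt:>3} ms -> ~{gbps:.3f} Gbps per TCP connection")
```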

courses/level102/networking/scale.md
@@ -0,0 +1,149 @@
> *Deploying large scale applications requires a better understanding of
infrastructure capabilities, in terms of resource availability, failure
domains, and scaling options like using anycast, layer 4/7 load balancers and
DNS based load balancing.*
Building large scale applications is a complex activity, which should
cover many aspects of design, development and
operationalisation. This section will talk about the considerations to
look at while deploying them.
### Failure domains
In any infrastructure, failures due to hardware or software issues are
common. Though these may be a pain from a service availability
perspective, these failures do happen, and a pragmatic goal would be to
try to keep them to a minimum. Hence, while deploying any
service, failures/non-availability of some of the nodes has to be factored
in.
#### Server failures
A server could fail due to a power, NIC or software issue. And at times
it may not be a complete failure, but could be an error in the NIC, which
causes some packet loss. This is a very common scenario and will impact
stateful services more. While designing such services, it is
important to accommodate some level of tolerance to such failures.
#### ToR failures
This is one of the common scenarios, where the leaf switch connecting
the servers goes down, taking down the entire cabinet along with it.
More than one server of the same service can go down
in this case. It requires planning to decide how much server loss can be
handled without overloading the other servers. Based on this, the service
can be distributed across many cabinets. These calculations may vary,
depending upon the resiliency in the ToR design, which will be covered
in the [ToR connectivity](https://linkedin.github.io/school-of-sre/level102/networking/infrastructure-features/#dual-tor) section.
#### Site failures
Here site failure is a generic term, which could mean that a particular
service is down in a site, maybe due to a new version rollout, or failures
of devices like firewalls or load balancers (if the service depends on them),
or loss of connectivity to remote sites (which might have limited
options for resiliency), or issues with critical services like DNS, etc.
Though these events may not be common, they can have a significant
impact.
In summary, handling these failure scenarios has to be thought about
while designing the application itself. That will provide the tolerance
required within the application to recover from unexpected failures.
This helps not only during failures but also during planned maintenance work,
as it becomes easier to take part of the infrastructure out of service.
### Resource availability
The other aspect to consider while deploying applications at scale is
the availability of the required infrastructure and the features the
service is dependent upon. For example, for the resiliency of a cabinet,
if one decides to distribute the service to 5 cabinets, but the service
needs a load balancer (to distribute incoming connections to different
servers), it may become challenging if load balancers are not supported
in all cabinets. Or there could be a case where there are not enough
cabinets available (that meet the minimum required specification for the
service to be set up). The best approach in these cases is to identify
the requirements and gaps and then work with the Infrastructure team to
best solve them.
#### Scaling options
While distributing the application to different cabinets, the incoming
traffic to these services has to be distributed across these servers. To
achieve this, the following may be considered
##### Anycast
This is one of the quickest ways to roll out traffic distribution across
multiple cabinets. In this, each server that is part of the cluster (where the
service is set up) advertises a loopback address (a /32 IPv4 or /128 IPv6
address) to the DC switch fabric (most commonly BGP is used for this
purpose). The service has to be set up to listen on this loopback
address. When the clients try to connect to the service, they get resolved to
this virtual address and forward their queries. The DC switch fabric
distributes each flow to different available next hops (eventually to
all the servers in that service cluster).
Note: The DC switch computes a hash based on the IP packet header; this
could include any combination of source and destination addresses,
source and destination ports, MAC address and IP protocol number. Based
on this hash value, a particular next hop is picked. Since all the
packets in a traffic flow carry the same values for these headers, all
the packets in that flow will be mapped to the same path.
![Anycast setup](./media/Anycast.png)
*Fig 1: Anycast setup*
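To make the per-flow hashing described in the note above concrete, here is a minimal sketch of ECMP-style next-hop selection. The next-hop names and 5-tuples are made up, and real switches use vendor-specific hardware hash functions rather than anything like this.

```python
import hashlib

# Hypothetical equal-cost next hops towards the anycast servers.
NEXT_HOPS = ["path-via-spine1", "path-via-spine2", "path-via-spine3", "path-via-spine4"]

def pick_next_hop(src_ip, dst_ip, proto, src_port, dst_port):
    """Hash the flow 5-tuple and map it to one of the available next hops.

    Every packet of a given flow produces the same hash, so the whole flow
    always takes the same path and packet order within the flow is preserved.
    """
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return NEXT_HOPS[digest % len(NEXT_HOPS)]

# Two different flows towards the same anycast VIP may land on different paths/servers:
print(pick_next_hop("10.1.1.10", "192.0.2.1", "tcp", 40001, 443))
print(pick_next_hop("10.1.2.20", "192.0.2.1", "tcp", 51234, 443))
```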
To achieve a proportionate distribution of flows across these servers,
it is important to maintain uniformity in each of the cabinets and pods.
But remember, the distribution happens only based on flows, and if there
are any elephant (large) flows, some servers might receive a higher
volume of traffic.
If there are any server or ToR failures, the advertisement of loopback
address to the switches will stop, and thereby the new packets will be
forwarded to the remaining available servers.
##### Load balancer
Another common approach is to use a load balancer. A virtual IP (VIP) is set
up in the load balancers, to which the client connects while trying to
access the service. The load balancer, in turn, redirects these
connections to one of the actual servers where the service is running.
In order to verify that a server is in a serviceable state, the load
balancer does periodic health checks, and if a server fails them, the LB stops
redirecting connections to that server.
The load balancer can be deployed in single-arm mode, where the traffic
to the VIP is redirected by the LB, and the return traffic from the
server to the client is sent directly. The other option is the two-arm
mode, where the return traffic is also passed through the LB.
![Single-arm mode](./media/LB 2-Arm.png)
Fig 2: Single-arm mode
![Two-arm mode](./media/LB 1-Arm.png)
Fig 3: Two-arm mode
One of the cons of this approach is that, at a higher scale, the load
balancer can become the bottleneck in supporting higher traffic volumes or
concurrent connections per second.
##### DNS based load balancing
This is similar to the above approach, with the only difference being that
instead of an appliance, the load balancing is done via DNS. The
clients get different IPs to connect to when they query for the DNS
records of the service. The DNS server has to do health checks, to know which
servers are in a good state.
This approach alleviates the bottleneck of the load balancer solution.
But it requires a shorter TTL for the DNS records, so that problematic servers
can be taken out of rotation quickly, which in turn means there will be far
more DNS queries.
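As a rough illustration of the idea (not of any particular DNS server's implementation), the sketch below health-checks a hypothetical pool of servers and hands out only the healthy addresses with a short TTL; the hostname, addresses, port and health-check method are assumptions made for the example.

```python
import random
import socket

# Hypothetical pool of real servers behind "app.example.com".
POOL = ["10.0.1.10", "10.0.2.10", "10.0.3.10"]
SERVICE_PORT = 443
TTL_SECONDS = 30   # short TTL so unhealthy servers fall out of rotation quickly

def is_healthy(ip: str, port: int = SERVICE_PORT, timeout: float = 1.0) -> bool:
    """Very simple health check: can we open a TCP connection to the service port?"""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False

def resolve(name: str) -> tuple[list[str], int]:
    """Return the healthy addresses (in random order) and the record TTL."""
    healthy = [ip for ip in POOL if is_healthy(ip)]
    random.shuffle(healthy)   # spread clients across the healthy servers
    return healthy, TTL_SECONDS

answers, ttl = resolve("app.example.com")
print(f"app.example.com -> {answers} (TTL {ttl}s)")
```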

courses/level102/networking/security.md
@@ -0,0 +1,175 @@
> *This section will cover threat vectors faced by services facing
external/internal clients, and potential mitigation options to consider
while deploying them. This will touch upon perimeter security, DDoS
protection, network demarcation and operational practices.*
### Security Threat
Security is one of the major considerations in any infrastructure. There
are various security threats, which could amount to data theft, loss of
service, fraudulent activity, etc. An attacker can use techniques like
phishing, spamming, malware, DoS/DDoS, exploiting vulnerabilities,
man-in-the-middle attack, and many more. In this section, we will cover
some of these threats and possible mitigation. As there are numerous
means to attack and secure the infrastructure, we will only focus on
some of the most common ones.
**Phishing** is mostly done via email (and other mass communication
methods), where an attacker provides links to fake websites/URLs. Upon
accessing them, the victim's sensitive information, like login
credentials or personal data, is collected and can be misused.
**Spamming** is also similar to phishing, but the attacker doesn't collect
data from users; instead, they try to spam a particular website, probably
overwhelming it (to cause slowness), and then use that opportunity to
compromise the security of the attacked website.
**Malware** is like a trojan horse, where an attacker manages to install a
piece of code on the secured systems in the infrastructure. Using this,
the attacker can collect sensitive data as well as infect the critical
services of the target company.
**Exploiting vulnerabilities** is another method by which an attacker can gain access
to the systems. These could be bugs or misconfigurations in web servers,
internet-facing routers/switches/firewalls, etc.
**DoS/DDoS** is one of the common attacks seen on internet-based
services/solutions, especially businesses that depend on eyeball
traffic. Here the attacker tries to overwhelm the resources of the
victim by generating spurious traffic towards the external-facing services.
With this, primarily, the services turn slow or non-responsive; during this
time, the attacker could also try to hack into the network, if some of the
security mechanisms fail to filter out the attack traffic due to the
overload.
### Securing the infrastructure
The first and foremost aspect of any infrastructure administration is
to identify the various security threats that could affect the business
running on this infrastructure. Once the different threats are known, the
security defence mechanisms have to be designed and implemented. Some of
the common means of securing the infrastructure are:
#### Perimeter security
This is the first line of defence in any infrastructure, where
unwanted/unexpected traffic flows into the infrastructure are
filtered/blocked. These could be filters in the edge routers that allow
expected services (like port 443 traffic for a web service running on
HTTPS); alternatively, the filters can be set up to block unwanted traffic, like
blocking UDP ports if the services are not dependent on UDP.
Similar to the application traffic entering the network, there could be
other traffic, like BGP messages from Internet peers, VPN tunnel traffic,
as well as other services like email/DNS, etc. There are means to protect
each of these, like using authentication mechanisms (password or
key-based) for BGP and VPN peers, and whitelisting these specific peers
to make inbound connections (in the perimeter filters). Along with these,
the rate of messages/traffic can be limited to the known scale or
expected load, so the resources are not overwhelmed.
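As a toy illustration of the rate-limiting idea (on real edge devices this is implemented with hardware policers or control-plane policing, not application code), here is a simple token-bucket sketch; the rate and burst numbers are arbitrary.

```python
import time

class TokenBucket:
    """Allow traffic up to `rate` units per second, with bursts up to `burst` units."""

    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, units: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= units:
            self.tokens -= units
            return True
        return False   # over the configured rate -> drop or reject

# e.g. accept at most 100 messages per second from a peer, with bursts of 20:
bucket = TokenBucket(rate=100, burst=20)
print(bucket.allow())   # True while within the allowed rate
```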
#### DDoS mitigation
Protecting against DDoS attacks is another important aspect. The attack
traffic will look similar to genuine user/client requests, but with
the intention of flooding the externally exposed app, which could be a web
server, DNS, etc. Therefore it is essential to differentiate between the
attack traffic and genuine traffic. There are different
methods to do this at the application level; one such example is using a Captcha
on a web service, to catch traffic originating from bots.
For these methods to be useful, the nodes should be capable of handling
both the attack traffic and the genuine traffic. In
cloud-based infrastructure it may be possible to dynamically add more virtual
machines/resources to handle a sudden spike in the volume of traffic, but
on-prem, adding additional resources might be challenging.
To handle a large volume of attack traffic, there are solutions
available which can inspect the packets/traffic flows and identify
anomalies, i.e. traffic patterns that don't resemble a genuine
connection, like a client initiating a TCP connection but failing to complete
the handshake, or a set of sources with abnormally large traffic
flows. Once this unwanted traffic is identified, it is dropped at the
edge of the network itself, thereby protecting the resources of the app
nodes. This topic alone could be discussed in much more detail, but that
is beyond the scope of this section.
#### Network Demarcation
Network demarcation is another common strategy deployed in different
networks, where applications are grouped based on their security needs and
their vulnerability to attack. A common demarcation is that the
external/internet-facing nodes are grouped into a separate zone, whereas
the nodes holding sensitive data are segregated into another zone.
Any communication between these zones is restricted with the help of
security tools, to limit exposure to unwanted hosts/ports. These
inter-zone communication filters are sometimes called ring-fencing. The
number of zones to be created varies for different deployments; for
example, there could be hosts which should be able to communicate with
the external world as well as with internal servers, like proxy or email servers; in
this case, these can be grouped under one zone, say a De-Militarized Zone
(DMZ). The main advantage of creating zones is that even if there is a
compromised host, it doesn't act as a back-door entry to the rest of
the infrastructure.
#### Node protection
Be it servers, routers, switches, load balancers, firewalls, etc., each of
these devices comes with certain capabilities to secure itself, like
support for filters (e.g. access-lists, iptables) to control what traffic
to process and what to drop; anti-virus software can also be used on servers
to check the software installed on them.
#### Operational practices
There are numerous security threats to infrastructure, and there are
different solutions to defend against them. The key part of the defence is not
only identifying the right solution and the tools for it, but also making
sure there are robust operational procedures in place, to respond
promptly, decisively and with clarity to any security incident.
##### Standard Operating Procedures (SOP)
SOPs need to be well defined and act as a reference for the on-call to follow
during a security incident. The SOP should cover things like,
- When a security incident happens, how it will be alerted and to whom it
will be alerted.
- How to identify the scale and severity of the security incident.
- Who the points of escalation are and the threshold/time to inform
them; these could be other concerned teams, the management, or
even the security operations in-charge.
- Which solutions to use (and the procedure to follow with them) to
mitigate the security incident.
- Also, how the data about the security incident has to be collated for
further analysis.
Many organisations have a dedicated team focused on security, and they
drive most of the activities during an attack and even before, coming
up with best practices, guidelines and compliance audits. It is the
responsibility of the respective technical teams to ensure the
infrastructure meets these recommendations and that gaps are fixed.
##### Periodic review
Along with defining SOPs, the entire security of the infrastructure has
to be reviewed periodically. This review should include,
- Identifying any new/improved security threats that could potentially
target the infrastructure.
- Reviewing the SOPs periodically, depending upon new
security threats or changes in the procedures (to implement the
solutions).
- Ensuring software upgrades/patches are done in a timely manner.
- Auditing the infrastructure for any non-compliance with the security
standards.
- Reviewing recent security incidents and finding means to improve the
defence mechanisms.

mkdocs.yml
@@ -78,14 +78,21 @@ nav:
- Writing Secure code: level101/security/writing_secure_code.md
- Conclusion: level101/security/conclusion.md
- Level 102:
- Linux Advanced:
- Containerization And Orchestration:
- Introduction: level102/containerization_and_orchestration/intro.md
- Introduction To Containers: level102/containerization_and_orchestration/intro_to_containers.md
- Containerization With Docker: level102/containerization_and_orchestration/containerization_with_docker.md
- Orchestration With Kubernetes: level102/containerization_and_orchestration/orchestration_with_kubernetes.md
- Conclusion: level102/containerization_and_orchestration/conclusion.md
- Networking:
- Introduction: level102/networking/introduction.md
- Security: level102/networking/security.md
- Scale: level102/networking/scale.md
- RTT: level102/networking/rtt.md
- Infrastructure Services: level102/networking/infrastructure-features.md
- Conclusion: level102/networking/conclusion.md
- System Troubleshooting and Performance Improvements:
- Introduction: level102/system_troubleshooting_and_performance/introduction.md
- Troubleshooting: level102/system_troubleshooting_and_performance/troubleshooting.md
- Important Tools: level102/system_troubleshooting_and_performance/important-tools.md
