Add Relational Databases course (#21)

Sumesh Premraj 3 years ago committed by GitHub

.gitignore

@@ -0,0 +1,3 @@
.DS_Store
.venv
site/

@@ -0,0 +1,98 @@
* Relational DBs are used for data storage. Even a file can be used to store data, but relational DBs are designed with specific goals:
* Efficiency
* Ease of access and management
* Organization of data
* Handling relations between data (represented as tables)
* Transaction: a unit of work that can comprise multiple statements, executed together
* ACID properties
Set of properties that guarantee data integrity of DB transactions
* Atomicity: Each transaction is atomic (succeeds or fails completely)
* Consistency: Transactions only result in valid state (which includes rules, constraints, triggers etc.)
* Isolation: Each transaction is executed independently of others safely within a concurrent system
* Durability: Completed transactions will not be lost due to any later failures
Let's take some examples to illustrate the above properties.
* Account A has a balance of ₹200 and B has ₹400. Account A is transferring ₹100 to Account B. This transaction comprises a deduction from the sender's balance and an addition to the recipient's balance. If the first operation succeeds while the second fails, A's balance would be ₹100 while B's would remain ₹400 instead of ₹500. **Atomicity** in a DB ensures that such a partially failed transaction is rolled back.
* If the second operation above fails, it leaves the DB inconsistent (sum of balance of accounts before and after the operation is not the same). **Consistency** ensures that this does not happen.
* There are three operations: one to calculate the interest for A's account, another to add that interest to A's account, then a transfer of ₹100 from B to A. Without **isolation** guarantees, concurrent execution of these 3 operations may lead to a different outcome every time.
* What happens if the system crashes before the transactions are written to disk? **Durability** ensures that the changes are applied correctly during recovery.
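Atomicity in particular is easy to demonstrate in code. The sketch below uses Python's built-in sqlite3 module rather than MySQL (the transactional behaviour shown is common to relational DBs); the account names and amounts follow the example above.

```python
import sqlite3

# In-memory database with the two accounts from the example above
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 200), ("B", 400)])
conn.commit()

def transfer(conn, sender, recipient, amount):
    try:
        with conn:  # opens a transaction: commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, sender))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, recipient))
    except sqlite3.Error:
        return False  # rolled back: neither update is applied
    return True

# A valid transfer succeeds atomically
assert transfer(conn, "A", "B", 100)
# An invalid transfer (A would go negative) violates the CHECK constraint;
# the deduction is rolled back, so no partial state is left behind
assert not transfer(conn, "A", "B", 500)
print(dict(conn.execute("SELECT name, balance FROM accounts")))  # {'A': 100, 'B': 500}
```

The `with conn:` block is what gives the all-or-nothing behaviour: when the second statement fails, the first one is undone too.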
* Relational data
* Tables represent relations
* Columns (fields) represent attributes
* Rows are individual records
* Schema describes the structure of DB
* SQL
A query language to interact with and manage data.
[CRUD operations](https://stackify.com/what-are-crud-operations/) - create, read, update, delete queries
Management operations - create DBs/tables/indexes etc, backup, import/export, users, access controls
Exercise: Classify the queries below into the four types - DDL (definition), DML (manipulation), DCL (control) and TCL (transactions) - and explain each in detail.
insert, create, drop, delete, update, commit, rollback, truncate, alter, grant, revoke
You can practice these in the [lab section](../lab.md).
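For a quick feel of the CRUD statements before the lab, here is a self-contained sketch using Python's built-in sqlite3 module; the table and values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Create: DDL for the table, then a DML insert
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO employees (name) VALUES ('Asha')")

# Read
print(conn.execute("SELECT id, name FROM employees").fetchall())  # [(1, 'Asha')]

# Update
conn.execute("UPDATE employees SET name = 'Asha K' WHERE id = 1")
print(conn.execute("SELECT name FROM employees WHERE id = 1").fetchone()[0])  # Asha K

# Delete
conn.execute("DELETE FROM employees WHERE id = 1")
print(conn.execute("SELECT count(*) FROM employees").fetchone()[0])  # 0
```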
* Constraints
Rules for the data that can be stored. A query fails if it violates any of the constraints defined on a table. \
Primary key: one or more columns that contain UNIQUE, non-NULL values. A table can have only ONE primary key. An index on it is created by default. \
Foreign key: links two tables together. Its value(s) match a primary key in a different table. \
Not null: does not allow null values. \
Unique: value of the column must be unique across all rows. \
Default: provides a default value for a column if none is specified during insert. \
Check: allows only particular values (like Balance >= 0)
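These constraints can be seen in action with a small script. A sketch using Python's built-in sqlite3 (the table definition is invented for illustration; MySQL enforces the same kinds of constraints):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE accounts (
        id      INTEGER PRIMARY KEY,                  -- unique, non-null, indexed by default
        email   TEXT NOT NULL UNIQUE,                 -- no nulls, no duplicates
        balance INTEGER DEFAULT 0 CHECK (balance >= 0)
    )""")
conn.execute("INSERT INTO accounts (email) VALUES ('a@example.com')")

# Each of these inserts violates one constraint, so the statement fails
for bad in ["INSERT INTO accounts (email) VALUES (NULL)",                          # NOT NULL
            "INSERT INTO accounts (email) VALUES ('a@example.com')",               # UNIQUE
            "INSERT INTO accounts (email, balance) VALUES ('b@example.com', -5)"]: # CHECK
    try:
        conn.execute(bad)
    except sqlite3.IntegrityError as e:
        print("rejected:", e)

# Only the original, valid row remains
print(conn.execute("SELECT count(*) FROM accounts").fetchone()[0])  # 1
```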
* [Indexes](https://datageek.blog/en/2018/06/05/rdbms-basics-indexes-and-clustered-indexes/)
Most indexes use B+ tree structure.
Why use them: Speeds up queries (in large tables that fetch only a few rows, min/max queries, by eliminating rows from consideration etc)
Types of indexes: unique, primary key, fulltext, secondary
Write-heavy loads, and workloads that mostly do full table scans or access a large number of rows, do not benefit from indexes.
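Even without MySQL, the effect of an index on a query plan can be observed with SQLite's EXPLAIN QUERY PLAN (analogous to MySQL's EXPLAIN, which the lab uses). A sketch, with a made-up table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (emp_no INTEGER PRIMARY KEY, first_name TEXT)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [(i, f"name{i}") for i in range(10000)])

def plan(sql):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail); keep the detail text
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

q = "SELECT * FROM employees WHERE first_name = 'name42'"
print(plan(q))  # a full table scan ("SCAN employees"; exact wording varies by SQLite version)

conn.execute("CREATE INDEX idx_firstname ON employees(first_name)")
print(plan(q))  # now an index search mentioning idx_firstname
```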
* [Joins](https://www.sqlservertutorial.net/sql-server-basics/sql-server-joins/)
Allows you to fetch related data from multiple tables, linking them together with some common field. Powerful but also resource-intensive and makes scaling databases difficult. This is the cause of many slow performing queries when run at scale, and the solution is almost always to find ways to reduce the joins.
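A minimal join, sketched with Python's built-in sqlite3 and two invented tables modeled on the employees sample DB used in the lab:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (dept_no TEXT PRIMARY KEY, dept_name TEXT);
    CREATE TABLE employees (emp_no INTEGER PRIMARY KEY, name TEXT,
                            dept_no TEXT REFERENCES departments(dept_no));
    INSERT INTO departments VALUES ('d001', 'Marketing'), ('d002', 'Finance');
    INSERT INTO employees VALUES (1, 'Asha', 'd001'), (2, 'Vikram', 'd002');
""")

# The common field dept_no links the two tables
rows = conn.execute("""
    SELECT e.name, d.dept_name
    FROM employees e
    JOIN departments d ON e.dept_no = d.dept_no
    ORDER BY e.emp_no
""").fetchall()
print(rows)  # [('Asha', 'Marketing'), ('Vikram', 'Finance')]
```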
* [Access control](https://dev.mysql.com/doc/refman/8.0/en/access-control.html)
DBs have privileged accounts for admin tasks, and regular accounts for clients. There are fine-grained controls on which actions (DDL, DML etc., discussed earlier) are allowed for these accounts.
The DB first verifies the user's credentials (authentication), and then examines whether the user is permitted to perform the request (authorization) by looking up this information in internal tables.
Other controls include activity auditing that allows examining the history of actions done by a user, and resource limits which define the number of queries, connections etc. allowed.
### Popular databases
Commercial, closed source - Oracle, Microsoft SQL Server, IBM DB2
Open source with optional paid support - MySQL, MariaDB, PostgreSQL
Individuals and small companies have always preferred open source DBs because of the huge cost associated with commercial software.
In recent times, even large organizations have moved away from commercial software to open source alternatives because of the flexibility and cost savings associated with it.
Lack of support is no longer a concern because of the paid support available from the developer and third parties.
MySQL is the most widely used open source DB, and it is widely supported by hosting providers, making it easy for anyone to use. It is part of the popular Linux-Apache-MySQL-PHP ([LAMP](https://en.wikipedia.org/wiki/LAMP_(software_bundle))) stack that became popular in the 2000s. We have many more choices for a programming language, but the rest of that stack is still widely used.

@@ -0,0 +1,13 @@
# Conclusion
We have covered basic concepts of SQL databases. We have also covered some of the tasks that an SRE may be responsible for - there is so much more to learn and do. We hope this course gives you a good start and inspires you to explore further.
### Further reading
* More practice with online resources like [this one](https://www.w3resource.com/sql-exercises/index.php)
* [Normalization](https://beginnersbook.com/2015/05/normalization-in-dbms/)
* [Routines](https://dev.mysql.com/doc/refman/8.0/en/stored-routines.html), [triggers](https://dev.mysql.com/doc/refman/8.0/en/trigger-syntax.html)
* [Views](https://www.essentialsql.com/what-is-a-relational-database-view/)
* [Transaction isolation levels](https://dev.mysql.com/doc/refman/8.0/en/innodb-transaction-isolation-levels.html)
* [Sharding](https://www.digitalocean.com/community/tutorials/understanding-database-sharding)
* [Setting up HA](https://severalnines.com/database-blog/introduction-database-high-availability-mysql-mariadb), [monitoring](https://blog.serverdensity.com/how-to-monitor-mysql/), [backups](https://dev.mysql.com/doc/refman/8.0/en/backup-methods.html)

Binary file not shown.


Binary file not shown.


@@ -0,0 +1,26 @@
### Why should you use this?
General purpose storage engine with row-level locking, ACID support, transactions, crash recovery and multi-version concurrency control.
### Architecture
![alt_text](images/innodb_architecture.png "InnoDB components")
### Key components:
* Memory:
* Buffer pool: LRU cache of frequently used data (table and index pages) so it can be processed directly from memory, which speeds up processing. Important for tuning performance.
* Change buffer: Caches changes to secondary index pages when those pages are not in the buffer pool, and merges them when the pages are fetched. Merging may take a long time and impact live queries. It also takes up part of the buffer pool. Avoids the extra I/O needed to read secondary index pages in.
* Adaptive hash index: Supplements InnoDB's B-Tree indexes with fast hash lookup tables, like a cache. There is a slight performance penalty for misses, plus the maintenance overhead of keeping it updated. Hash collisions cause AHI rebuilding for large DBs.
* Log buffer: Holds log data before it is flushed to disk.
The size of each of the above memory areas is configurable and impacts performance a lot. Optimal performance requires careful analysis of the workload and available resources, along with benchmarking and tuning.
* Disk:
* Tables: Stores data within rows and columns.
* Indexes: Helps find rows with specific column values quickly, avoids full table scans.
* Redo Logs: all transactions are written to them; after a crash, the recovery process corrects data written by incomplete transactions and replays any pending ones.
* Undo Logs: records associated with a single transaction that contain information on how to undo the latest change made by that transaction.
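The buffer pool's "cache of recently used pages" behaviour can be sketched as a toy LRU cache (InnoDB's real implementation uses a midpoint-insertion variant of LRU and works on fixed-size pages; this only shows the eviction idea):

```python
from collections import OrderedDict

class BufferPool:
    """Toy page cache: keeps the N most recently used pages in memory."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # page_id -> page data, oldest first

    def get(self, page_id, read_from_disk):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)      # hit: mark as most recently used
            return self.pages[page_id]
        data = read_from_disk(page_id)           # miss: costs an extra disk I/O
        self.pages[page_id] = data
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)       # evict the least recently used page
        return data

pool = BufferPool(capacity=2)
disk_reads = []
# Fake disk read: record the I/O, return the "page"
fake_disk = lambda pid: disk_reads.append(pid) or f"page-{pid}"
for pid in [1, 2, 1, 3, 1]:
    pool.get(pid, fake_disk)
print(disk_reads)  # [1, 2, 3] - page 1 stayed cached; page 2 was evicted to make room for 3
```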

@@ -0,0 +1,21 @@
# Relational Databases
### What to expect from this training
You will have an understanding of what relational databases are, their advantages, and some MySQL specific concepts.
### What is not covered under this course
* In depth implementation details
* Advanced topics like normalization, sharding
* Specific tools for administration
### Introduction
The main purpose of database systems is to manage data. This includes storage, adding new data, deleting unused data, updating existing data, retrieving data within a reasonable response time, and other maintenance tasks that keep the system running.
### Prerequisites
* Complete [Linux course](/linux_basics/intro/)
* Install Docker (for lab section)
### Pre-reads
[RDBMS Concepts](https://beginnersbook.com/2015/04/rdbms-concepts/)

@@ -0,0 +1,207 @@
**Prerequisites**
Install Docker
**Setup**
Create a working directory named sos or something similar, and cd into it.
Enter the following into a file named my.cnf under a directory named custom.
```
sos $ cat custom/my.cnf
[mysqld]
# These settings apply to MySQL server
# You can set port, socket path, buffer size etc.
# Below, we are configuring slow query settings
slow_query_log=1
slow_query_log_file=/var/log/mysqlslow.log
long_query_time=0.1
```
Start a container and enable slow query log with the following:
```
sos $ docker run --name db -v custom:/etc/mysql/conf.d -e MYSQL_ROOT_PASSWORD=realsecret -d mysql:8
sos $ docker cp custom/my.cnf $(docker ps -qf "name=db"):/etc/mysql/conf.d/custom.cnf
sos $ docker restart $(docker ps -qf "name=db")
```
Import a sample database
```
sos $ git clone git@github.com:datacharmer/test_db.git
sos $ docker cp test_db $(docker ps -qf "name=db"):/home/test_db/
sos $ docker exec -it $(docker ps -qf "name=db") bash
root@3ab5b18b0c7d:/# cd /home/test_db/
root@3ab5b18b0c7d:/# mysql -uroot -prealsecret mysql < employees.sql
root@3ab5b18b0c7d:/etc# touch /var/log/mysqlslow.log
root@3ab5b18b0c7d:/etc# chown mysql:mysql /var/log/mysqlslow.log
```
_Workshop 1: Run some sample queries_
Run the following
```
$ mysql -uroot -prealsecret mysql
mysql>
# inspect DBs and tables
# the last 4 are MySQL internal DBs
mysql> show databases;
+--------------------+
| Database |
+--------------------+
| employees |
| information_schema |
| mysql |
| performance_schema |
| sys |
+--------------------+
> use employees;
mysql> show tables;
+----------------------+
| Tables_in_employees |
+----------------------+
| current_dept_emp |
| departments |
| dept_emp |
| dept_emp_latest_date |
| dept_manager |
| employees |
| salaries |
| titles |
+----------------------+
# read a few rows
mysql> select * from employees limit 5;
# filter data by conditions
mysql> select count(*) from employees where gender = 'M' limit 5;
# find count of particular data
mysql> select count(*) from employees where first_name = 'Sachin';
```
_Workshop 2: Use explain and explain analyze to profile a query, identify and add indexes required for improving performance_
```
# View all indexes on table
#(\G is to output horizontally, replace it with a ; to get table output)
mysql> show index from employees from employees\G
*************************** 1. row ***************************
Table: employees
Non_unique: 0
Key_name: PRIMARY
Seq_in_index: 1
Column_name: emp_no
Collation: A
Cardinality: 299113
Sub_part: NULL
Packed: NULL
Null:
Index_type: BTREE
Comment:
Index_comment:
Visible: YES
Expression: NULL
# This query uses an index, identified by the 'key' field
# By prefixing the explain keyword to the command,
# we get the query plan (including the key used)
mysql> explain select * from employees where emp_no < 10005\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: employees
partitions: NULL
type: range
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: NULL
rows: 4
filtered: 100.00
Extra: Using where
# Compare that to the next query which does not utilize any index
mysql> explain select first_name, last_name from employees where first_name = 'Sachin'\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: employees
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 299113
filtered: 10.00
Extra: Using where
# Let's see how much time this query takes
mysql> explain analyze select first_name, last_name from employees where first_name = 'Sachin'\G
*************************** 1. row ***************************
EXPLAIN: -> Filter: (employees.first_name = 'Sachin') (cost=30143.55 rows=29911) (actual time=28.284..3952.428 rows=232 loops=1)
-> Table scan on employees (cost=30143.55 rows=299113) (actual time=0.095..1996.092 rows=300024 loops=1)
# Cost (estimated by the query planner) is 30143.55
# actual time=28.284ms for the first row, 3952.428ms for all rows
# Now let's try adding an index and running the query again
mysql> create index idx_firstname on employees(first_name);
Query OK, 0 rows affected (1.25 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> explain analyze select first_name, last_name from employees where first_name = 'Sachin';
+--------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN |
+--------------------------------------------------------------------------------------------------------------------------------------------+
| -> Index lookup on employees using idx_firstname (first_name='Sachin') (cost=81.20 rows=232) (actual time=0.551..2.934 rows=232 loops=1)
|
+--------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.01 sec)
# Actual time=0.551ms for first row
# 2.934ms for all rows. A huge improvement!
# Also notice that the query involves only an index lookup,
# and no table scan (reading all rows of table)
# ..which vastly reduces load on the DB.
```
_Workshop 3: Identify slow queries on a MySQL server_
```
# Run the command below in two terminal tabs to open two shells into the container.
docker exec -it $(docker ps -qf "name=db") bash
# Open a mysql prompt in one of them and execute this command
# We have configured logging of queries that take longer than 0.1s,
# so this sleep(3) will be logged
mysql -uroot -prealsecret mysql
mysql> select sleep(3);
# Now, in the other terminal, tail the slow log to find details about the query
root@62c92c89234d:/etc# tail -f /var/log/mysqlslow.log
/usr/sbin/mysqld, Version: 8.0.21 (MySQL Community Server - GPL). started with:
Tcp port: 3306 Unix socket: /var/run/mysqld/mysqld.sock
Time Id Command Argument
# Time: 2020-11-26T14:53:44.822348Z
# User@Host: root[root] @ localhost [] Id: 9
# Query_time: 5.404938 Lock_time: 0.000000 Rows_sent: 1 Rows_examined: 1
use employees;
# Time: 2020-11-26T14:53:58.015736Z
# User@Host: root[root] @ localhost [] Id: 9
# Query_time: 10.000225 Lock_time: 0.000000 Rows_sent: 1 Rows_examined: 1
SET timestamp=1606402428;
select sleep(3);
```
These were simulated examples with minimal complexity. In real life, the queries would be much more complex and the explain/analyze and slow query logs would have more details.
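In production you would usually feed the slow log to a summarizing tool, but the format is simple enough to parse directly. A minimal sketch (the regex and sample entry mirror the log format shown above; a real parser would handle multi-line queries and the remaining fields):

```python
import re

# A slow log entry in the format shown in the workshop above
SLOW_LOG_SAMPLE = """\
# Time: 2020-11-26T14:53:58.015736Z
# User@Host: root[root] @ localhost []  Id:     9
# Query_time: 10.000225  Lock_time: 0.000000 Rows_sent: 1  Rows_examined: 1
SET timestamp=1606402428;
select sleep(3);
"""

def query_times(log_text):
    """Return the Query_time value of every entry in a slow log."""
    return [float(m) for m in re.findall(r"# Query_time: ([\d.]+)", log_text)]

print(query_times(SLOW_LOG_SAMPLE))  # [10.000225]
```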

@@ -0,0 +1,38 @@
### MySQL architecture
![alt_text](images/mysql_architecture.png "MySQL architecture diagram")
MySQL architecture enables you to select the right storage engine for your needs, and abstracts away all implementation details from the end users (application engineers and [DBA](https://en.wikipedia.org/wiki/Database_administrator)) who only need to know a consistent stable API.
Application layer:
* Connection handling - each client gets its own connection, which is cached for the duration of access
* Authentication - the server checks the client's (username, password, host) info and allows or rejects the connection
* Security: server determines whether the client has privileges to execute each query (check with _show privileges_ command)
Server layer:
* Services and utilities - backup/restore, replication, cluster etc
* SQL interface - clients run queries for data access and manipulation
* SQL parser - creates a parse tree from the query (lexical/syntactic/semantic analysis and code generation)
* Optimizer - optimizes queries using various algorithms and the data available to it (table-level stats); modifies queries, the order of scanning, indexes to use etc. (check with the explain command)
* Caches and buffers - the cache stores query results, the buffer pool (InnoDB) stores table and index data in [LRU](https://en.wikipedia.org/wiki/Cache_replacement_policies#Least_recently_used_(LRU)) fashion
Storage engine options:
* InnoDB: most widely used, transaction support, ACID compliant, supports row-level locking, crash recovery and multi-version concurrency control. Default since MySQL 5.5+.
* MyISAM: fast, does not support transactions, provides table-level locking, great for read-heavy workloads, mostly used in web and data warehousing. Default up to MySQL 5.1.
* Archive: optimised for high speed inserts, compresses data as it is inserted, does not support transactions, ideal for storing and retrieving large amounts of seldom referenced historical, archived data
* Memory: tables in memory. Fastest engine, supports table-level locking, does not support transactions, ideal for creating temporary tables or quick lookups, data is lost after a shutdown
* CSV: stores data in CSV files, great for integrating into other applications that use this format
* … etc.
It is possible to migrate from one storage engine to another. But this migration locks tables for all operations and is not online, as it changes the physical layout of the data. It takes a long time and is generally not recommended. Hence, choosing the right storage engine at the beginning is important.
General guideline is to use InnoDB unless you have a specific need for one of the other storage engines.
Running `mysql> SHOW ENGINES; `shows you the supported engines on your MySQL server.

@@ -0,0 +1,64 @@
* Explain and explain+analyze
EXPLAIN <query> shows the query plan from the optimizer, including how tables are joined, which tables/rows are scanned etc.
EXPLAIN ANALYZE shows the above plus additional info like execution cost, number of rows returned and time taken.
This knowledge is useful to tweak queries and add indexes.
Watch this performance tuning [tutorial video](https://www.youtube.com/watch?v=pjRTLPeUOug).
Check out the [lab section](../lab.md) for a hands-on exercise on indexes.
* [Slow query logs](https://dev.mysql.com/doc/refman/5.7/en/slow-query-log.html)
Used to identify slow queries (configurable threshold), enabled in config or dynamically with a query
Check out the [lab section](../lab.md) about identifying slow queries.
* User management
This includes creation and changes to users, like managing privileges, changing password etc.
* Backup and restore strategies, pros and cons
Logical backup using mysqldump - slower but can be done online
Physical backup (copy the data directory or use xtrabackup) - quick backup/recovery. Copying the data directory requires locking or a shutdown; xtrabackup is an improvement because it supports backups without shutting down (hot backup).
Others - PITR, snapshots etc.
* Crash recovery process using redo logs
After a crash, when you restart the server, it reads the redo logs and replays modifications to recover.
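The idea behind redo-log recovery can be sketched as a toy write-ahead log (this shows only the concept; InnoDB's redo log records page-level changes, not key-value operations):

```python
# Toy model of redo-log recovery: changes are appended to a log before being
# applied, and only transactions with a COMMIT record are replayed after a crash.

def recover(redo_log):
    # First pass: find transactions that completed before the crash
    committed = {txn for op, txn, *_ in redo_log if op == "COMMIT"}
    # Second pass: replay only their changes
    state = {}
    for op, txn, *args in redo_log:
        if op == "SET" and txn in committed:
            key, value = args
            state[key] = value
    return state

log = [
    ("SET", "t1", "A", 100), ("COMMIT", "t1"),
    ("SET", "t2", "B", 999),   # t2 never committed - the crash happened here
]
print(recover(log))  # {'A': 100} - the incomplete transaction t2 is discarded
```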
* Monitoring MySQL
Key MySQL metrics: reads, writes, query runtime, errors, slow queries, connections, running threads, InnoDB metrics
Key OS metrics: CPU, load, memory, disk I/O, network
* Replication
Copies data from one instance to one or more instances. Helps in horizontal scaling, data protection, analytics and performance. Binlog dump thread on primary, replication I/O and SQL threads on secondary. Strategies include the standard async, semi async or group replication.
* High Availability
Ability to cope with failure at the software, hardware and network level. Essential for anyone who needs 99.9%+ uptime. Can be implemented with replication or clustering solutions from MySQL, Percona, Oracle etc. Requires expertise to set up and maintain. Failover can be manual, scripted or done using tools like Orchestrator.
* [Data directory](https://dev.mysql.com/doc/refman/8.0/en/data-directory.html)
Data is stored in a particular directory, with nested directories for the data contained in each database. There are also MySQL log files, InnoDB log files, server process ID file and some other configs. The data directory is configurable.
* [MySQL configuration](https://dev.mysql.com/doc/refman/5.7/en/server-configuration.html)
This can be done by passing [parameters during startup](https://dev.mysql.com/doc/refman/5.7/en/server-options.html), or in a [file](https://dev.mysql.com/doc/refman/8.0/en/option-files.html). There are a few [standard paths](https://dev.mysql.com/doc/refman/8.0/en/option-files.html#option-file-order) where MySQL looks for config files, `/etc/my.cnf` is one of the commonly used paths. These options are organized under headers (mysqld for server and mysql for client), you can explore them more in the lab that follows.
* [Logs](https://dev.mysql.com/doc/refman/5.7/en/server-logs.html)
MySQL has logs for various purposes - general query log, errors, binary logs (for replication), slow query log. Only the error log is enabled by default (to reduce I/O and storage requirements); the others can be enabled when required - by specifying config parameters at startup or running commands at runtime. [Log destination](https://dev.mysql.com/doc/refman/5.7/en/log-destinations.html) can also be tweaked with config parameters.

@@ -16,7 +16,7 @@ In this course we are focusing on building strong foundational skills. The cours
- [Linux Networking](https://linkedin.github.io/school-of-sre/linux_networking/intro/)
- [Python and Web](https://linkedin.github.io/school-of-sre/python_web/intro/)
- Data
- Relational databases (MySQL)
- [Relational databases(MySQL)](https://linkedin.github.io/school-of-sre/databases_sql/intro/)
- [NoSQL concepts](https://linkedin.github.io/school-of-sre/databases_nosql/intro/)
- [Big Data](https://linkedin.github.io/school-of-sre/big_data/intro/)
- [Systems Design](https://linkedin.github.io/school-of-sre/systems_design/intro/)
@@ -24,4 +24,4 @@ In this course we are focusing on building strong foundational skills. The cours
We believe continuous learning will help in acquiring deeper knowledge and competencies in order to expand your skill sets, every module has added reference which could be a guide for further learning. Our hope is that by going through these modules we should be able to build the essential skills required for a Site Reliability Engineer.
At Linkedin, we are using this curriculum for onboarding our non-traditional hires and new college grads to the SRE role. We had multiple rounds of successful onboarding experience with the new members and helped them to be productive in a very short period of time. This motivated us to opensource these contents for helping other organizations onboarding new engineers to the role and individuals to get into the role. We realize that the initial content we created is just a starting point and we hope that the community can help in the journey of refining and extending the contents.

@@ -35,6 +35,14 @@ nav:
- The URL Shortening App: python_web/url-shorten-app.md
- Conclusion: python_web/sre-conclusion.md
- Data:
- Relational Databases:
- Introduction: databases_sql/intro.md
- Key Concepts: databases_sql/concepts.md
- MySQL: databases_sql/mysql.md
- InnoDB: databases_sql/innodb.md
- Operational Concepts: databases_sql/operations.md
- Lab: databases_sql/lab.md
- Further Reading: databases_sql/reading.md
- NoSQL:
- Introduction: databases_nosql/intro.md
- Key Concepts: databases_nosql/key_concepts.md
@@ -54,6 +62,6 @@ nav:
- Fundamentals of Security: security/fundamentals.md
- Network Security: security/network_security.md
- Threat, Attacks & Defences: security/threats_attacks_defences.md
- Writing Secure code: security/writing_secure_code.md
- Conclusion: security/conclusion.md
- Contribute: CONTRIBUTING.md
