Sos mysql phase2 (#110)

* SoS MySQL phase 2

* SoS MySQL phase 2
pull/111/head
Apoorv Purohit 3 years ago committed by GitHub
parent 1ee9aeb861
commit 2eddc9ba0a
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23

@ -0,0 +1,219 @@
### Backup and Recovery
Backups are a very crucial part of any database setup. They are generally a copy of the data that can be used to reconstruct the data in case of any major or minor crisis with the database. In general terms backups can be of two types:-
- **Physical Backup** - the data directory as it is on the disk
- **Logical Backup** - the table structure and records in it
Both the above kinds of backups are supported by MySQL with different tools. It is the job of the SRE to identify which should be used when.
#### Mysqldump
This utility is available with MySQL installation. It helps in getting the logical backup of the database. It outputs a set of SQL statements to reconstruct the data. It is not recommended to use mysqldump for large tables as it might take a lot of time and the file size will be huge. However, for small tables it is the best and the quickest option.
`mysqldump [options] > dump_output.sql`
There are certain options that can be used with mysqldump to get an appropriate dump of the database.
To dump all the databases
`mysqldump -u<user> -p<pwd> --all-databases > all_dbs.sql`
To dump specific databases
`mysqldump -u<user> -p<pwd> --databases db1 db2 db3 > dbs.sql`
To dump a single database
`mysqldump -u<user> -p<pwd> --databases db1 > db1.sql`
OR
`mysqldump -u<user> -p<pwd> db1 > db1.sql`
The difference between the above two commands is that the latter one does not contain the **CREATE DATABASE** command in the backup output.
To dump specific tables in a database
`mysqldump -u<user> -p<pwd> db1 table1 table2 > db1_tables.sql`
To dump only table structures and no data
`mysqldump -u<user> -p<pwd> --no-data db1 > db1_structure.sql`
To dump only table data and no CREATE statements
`mysqldump -u<user> -p<pwd> --no-create-info db1 > db1_data.sql`
To dump only specific records from a table
`mysqldump -u<user> -p<pwd> --no-create-info db1 table1 --where=”salary>80000” > db1_table1_80000.sql`
Mysqldump can also provide output in CSV, other delimited text or XML format to support use-cases if any. The backup from mysqldump utility is offline i.e. when the backup finishes it will not have the changes to the database which were made when the backup was going on. For example, if the backup started at 3 PM and finished at 4 PM, it will not have the changes made to the database between 3 and 4 PM.
**Restoring** from mysqldump can be done in the following two ways:-
From shell
`mysql -u<user> -p<pwd> < all_dbs.sql`
OR
From shell if the database is already created
`mysql -u<user> -p<pwd> db1 < db1.sql`
From within MySQL shell
`mysql> source all_dbs.sql`
#### Percona Xtrabackup
This utility is installed separately from the MySQL server and is open source, provided by Percona. It helps in getting the full or partial physical backup of the database. It provides online backup of the database i.e. it will have the changes made to the database when the backup was going on as explained at the end of the previous section.
- **Full Backup** - the complete backup of the database.
- **Partial Backup** - Incremental
- **Cumulative** - After one full backup, the next backups will have changes post the full backup. For example, we took a full backup on Sunday, from Monday onwards every backup will have changes after Sunday; so, Tuesdays backup will have Mondays changes as well, Wednesdays backup will have changes of Monday and Tuesday as well and so on.
- **Differential** - After one full backup, the next backups will have changes post the previous incremental backup. For example, we took a full backup on Sunday, Monday will have changes done after Sunday, Tuesday will have changes done after Monday, and so on.
![partial backups - differential and cummulative](images/partial_backup.png "Differential and Cumulative Backups")
Percona xtrabackup allows us to get both full and incremental backups as we desire. However, incremental backups take less space than a full backup (if taken per day) but the restore time of incremental backups is more than that of full backups.
**Creating a full backup**
`xtrabackup --defaults-file=<location to my.cnf> --user=<mysql user> --password=<mysql password> --backup --target-dir=<location of target directory>`
Example
`xtrabackup --defaults-file=/etc/my.cnf --user=some_user --password=XXXX --backup --target-dir=/mnt/data/backup/`
Some other options
- `--stream` - can be used to stream the backup files to standard output in a specified format. xbstream is the only option for now.
- `--tmp-dir` - set this to a tmp directory to be used for temporary files while taking backups.
- `--parallel` - set this to the number of threads that can be used to parallely copy data files to target directory.
- `--compress` - by default - quicklz is used. Set this to have the backup in compressed format. Each file is a .qp compressed file and can be extracted by qpress file archiver.
- `--decompress` - decompresses all the files which were compressed with the .qp extension. It will not delete the .qp files after decompression. To do that, use `--remove-original` along with this. Please note that the decompress option should be run separately from the xtrabackup command that used the compress option.
**Preparing a backup**
Once the backup is done with the --backup option, we need to prepare it in order to restore it. This is done to make the datafiles consistent with point-in-time. There might have been some transactions going on while the backup was being executed and those have changed the data files. When we prepare a backup, all those transactions are applied to the data files.
`xtrabackup --prepare --target-dir=<where backup is taken>`
Example
`xtrabackup --prepare --target-dir=/mnt/data/backup/`
It is not recommended to halt a process which is preparing the backup as that might cause data file corruption and backup cannot be used further. The backup will have to be taken again.
**Restoring a Full Backup**
To restore the backup which is created and prepared from above commands, just copy everything from the backup target-dir to the data-dir of MySQL server, change the ownership of all files to mysql user (the linux user used by MySQL server) and start mysql.
Or the below command can be used as well,
`xtrabackup --defaults-file=/etc/my.cnf --copy-back --target-dir=/mnt/data/backups/`
**Note** - the backup has to be prepared in order to restore it.
**Creating Incremental backups**
Percona Xtrabackup helps create incremental backups, i.e only the changes can be backed up since the last backup. Every InnoDB page contains a log sequence number or LSN that is also mentioned as one of the last lines of backup and prepare commands.
```
xtrabackup: Transaction log of lsn <LSN> to <LSN> was copied.
```
OR
```
InnoDB: Shutdown completed; log sequence number <LSN>
<timestamp> completed OK!
```
This indicates that the backup has been taken till the log sequence number mentioned. This is a key information in understanding incremental backups and working towards automating one. Incremental backups do not compare data files for changes, instead, they go through the InnoDB pages and compare their LSN to the last backups LSN. So, without one full backup, the incremental backups are useless.
The xtrabackup command creates a xtrabackup_checkpoint file which has the information about the LSN of the backup. Below are the key contents of the file:-
```
backup_type = full-backuped | incremental
from_lsn = 0 (full backup) | to_lsn of last backup <LSN>
to_lsn = <LSN>
last_lsn = <LSN>
```
There is a difference between **to\_lsn** and **last\_lsn**. When the **last\_lsn** is more than **to\_lsn** that means there are transactions that ran while we took the backup and are yet to be applied. That is what --prepare is used for.
To take incremental backups, first, we require one full backup.
`xtrabackup --defaults-file=/etc/my.cnf --user=some_user --password=XXXX --backup --target-dir=/mnt/data/backup/full/`
Lets assume the contents of the xtrabackup_checkpoint file to be as follows.
```
backup_type = full-backuped
from_lsn = 0
to_lsn = 1000
last_lsn = 1000
```
Now that we have one full backup, we can have an incremental backup that takes the changes. We will go with differential incremental backups.
`xtrabackup --defaults-file=/etc/my.cnf --user=some_user --password=XXXX --backup --target-dir=/mnt/data/backup/incr1/ --incremental-basedir=/mnt/data/backup/full/`
There are delta files created in the incr1 directory like, **ibdata1.delta**, **db1/tbl1.ibd.delta** with the changes from the full directory. The xtrabackup_checkpoint file will thus have the following contents.
```
backup_type = incremental
from_lsn = 1000
to_lsn = 1500
last_lsn = 1500
```
Hence, the **from\_lsn** here is equal to the **to\_lsn** of the last backup or the basedir provided for the incremental backups. For the next incremental backup we can use this incremental backup as the basedir.
`xtrabackup --defaults-file=/etc/my.cnf --user=some_user --password=XXXX --backup --target-dir=/mnt/data/backup/incr2/ --incremental-basedir=/mnt/data/backup/incr1/`
The xtrabackup_checkpoint file will thus have the following contents.
```
backup_type = incremental
from_lsn = 1500
to_lsn = 2000
last_lsn = 2200
```
**Preparing Incremental backups**
Preparing incremental backups is not the same as preparing a full backup. When prepare runs, two operations are performed - *committed transactions are applied on the data files* and *uncommitted transactions are rolled back*. While preparing incremental backups, we have to skip rollback of uncommitted transactions as it is likely that they might get committed in the next incremental backup. If we rollback uncommitted transactions the further incremental backups cannot be applied.
We use **--apply-log-only** option along with **--prepare** to avoid the rollback phase.
From the last section, we had the following directories with complete backup
```
/mnt/data/backup/full
/mnt/data/backup/incr1
/mnt/data/backup/incr2
```
First, we prepare the full backup, but only with the --apply-log-only option.
`xtrabackup --prepare --apply-log-only --target-dir=/mnt/data/backup/full`
The output of the command will contain the following at the end.
```
InnoDB: Shutdown complete; log sequence number 1000
<timestamp> Completed OK!
```
Note the LSN mentioned at the end is the same as the to\_lsn from the xtrabackup_checkpoint created for full backup.
Next, we apply the changes from the first incremental backup to the full backup.
`xtrabackup --prepare --apply-log-only --target-dir=/mnt/data/backup/full --incremental-dir=/mnt/data/backup/incr1`
This applies the delta files in the incremental directory to the full backup directory. It rolls the data files in the full backup directory forward to the time of incremental backup and applies the redo logs as usual.
Lastly, we apply the last incremental backup same as the previous one with just a small change.
`xtrabackup --prepare --target-dir=/mnt/data/backup/full --incremental-dir=/mnt/data/backup/incr1`
We do not have to use the **--apply-log-only** option with it. It applies the *incr2 delta files* to the full backup data files taking them forward, applies redo logs on them and finally rollbacks the uncommitted transactions to produce the final result. The data now present in the full backup directory can now be used to restore.
**Note** - To create cumulative incremental backups, the incremental-basedir should always be the full backup directory for every incremental backup. While preparing, we can start with the full backup with the --apply-log-only option and use just the last incremental backup for the final --prepare as that has all the changes since the full backup.
**Restoring Incremental backups**
Once all the above steps are completed, restoring is the same as done for a full backup.
#### Further Reading
- [MySQL Point-In-Time-Recovery](https://dev.mysql.com/doc/refman/8.0/en/point-in-time-recovery.html)
- [Another MySQL backup tool - mysqlpump](https://dev.mysql.com/doc/refman/8.0/en/mysqlpump.html)
- [Another MySQL backup tool - mydumper](https://github.com/maxbube/mydumper/tree/master/docs)
- [A comparison between mysqldump, mysqlpump and mydumper](https://mydbops.wordpress.com/2019/03/26/mysqldump%E2%80%8B-vs-mysqlpump-vs-mydumper/)
- [Backup Best Practices](https://www.percona.com/blog/2020/05/27/best-practices-for-mysql-backups/)

Binary file not shown.

After

Width:  |  Height:  |  Size: 74 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 128 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 38 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 706 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 207 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 24 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 32 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 568 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 315 KiB

@ -24,6 +24,10 @@ The main purpose of database systems is to manage data. This includes storage, a
- [Key Concepts](https://linkedin.github.io/school-of-sre/databases_sql/concepts/)
- [MySQL Architecture](https://linkedin.github.io/school-of-sre/databases_sql/mysql/#mysql-architecture)
- [InnoDB](https://linkedin.github.io/school-of-sre/databases_sql/innodb/)
- [Operational Concepts](https://linkedin.github.io/school-of-sre/databases_sql/operations/)
- [Backup and Recovery](https://linkedin.github.io/school-of-sre/databases_sql/backup_recovery/)
- [MySQL Replication](https://linkedin.github.io/school-of-sre/databases_sql/replication/)
- Operational Concepts
- [SELECT Query](https://linkedin.github.io/school-of-sre/databases_sql/select_query/)
- [Query Performance](https://linkedin.github.io/school-of-sre/databases_sql/query_performance/)
- [Lab](https://linkedin.github.io/school-of-sre/databases_sql/lab/)
- [Further Reading](https://linkedin.github.io/school-of-sre/databases_sql/conclusion/#further-reading)

@ -0,0 +1,175 @@
### Query Performance Improvement
Query Performance is a very crucial aspect of relational databases. If not tuned correctly, the select queries can become slow and painful for the application, and for the MySQL server as well. The important task is to identify the slow queries and try to improve their performance by either rewriting them or creating proper indexes on the tables involved in it.
#### The Slow Query Log
The slow query log contains SQL statements that take a longer time to execute then set in the config parameter long_query_time. These queries are the candidates for optimization. There are some good utilities to summarize the slow query logs like, mysqldumpslow (provided by MySQL itself), pt-query-digest (provided by Percona), etc. Following are the config parameters that are used to enable and effectively catch slow queries
| Variable | Explanation | Example value |
| --- | --- | --- |
| slow_query_log | Enables or disables slow query logs | ON |
| slow_query_log_file | The location of the slow query log | /var/lib/mysql/mysql-slow.log |
| long_query_time | Threshold time. The query that takes longer than this time is logged in slow query log | 5 |
| log_queries_not_using_indexes | When enabled with the slow query log, the queries which do not make use of any index are also logged in the slow query log even though they take less time than long_query_time. | ON |
So, for this section, we will be enabling **slow_query_log**, **long_query_time** will be kept to **0.3 (300 ms)**, and **log_queries_not_using** index will be enabled as well.
Below are the queries that we will execute on the employees database.
1. select * from employees where last_name = 'Koblick';
2. select * from salaries where salary >= 100000;
3. select * from titles where title = 'Manager';
4. select * from employees where year(hire_date) = 1995;
5. select year(e.hire_date), max(s.salary) from employees e join salaries s on e.emp_no=s.emp_no group by year(e.hire_date);
Now, queries **1**, **3** and **4** executed under 300 ms but if we check the slow query logs, we will find these queries logged as they are not using any of the index. Queries **2** and **5** are taking longer than 300ms and also not using any index.
Use the following command to get the summary of the slow query log
`mysqldumpslow /var/lib/mysql/mysql-slow.log`
![slow query log analysis](images/mysqldumpslow_out.png "slow query log analysis")
There are some more queries in the snapshot that were along with the queries mentioned. Mysqldumpslow replaces actual values that were used by N (in case of numbers) and S (in case of strings). That can be overridden by `-a` option, however that will increase the output lines if different values are used in similar queries.
#### The EXPLAIN Plan
The **EXPLAIN** command is used with any query that we want to analyze. It describes the query execution plan, how MySQL sees and executes the query. EXPLAIN works with Select, Insert, Update and Delete statements. It tells about different aspects of the query like, how tables are joined, indexes used or not, etc. The important thing here is to understand the basic Explain plan output of a query to determine its performance.
Let's take the following query as an example,
```
mysql> explain select * from salaries where salary = 100000;
+----+-------------+----------+------------+------+---------------+------+---------+------+---------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------+------------+------+---------------+------+---------+------+---------+----------+-------------+
| 1 | SIMPLE | salaries | NULL | ALL | NULL | NULL | NULL | NULL | 2838426 | 10.00 | Using where |
+----+-------------+----------+------------+------+---------------+------+---------+------+---------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
```
The key aspects to understand in the above output are:-
- **Partitions** - the number of partitions considered while executing the query. It is only valid if the table is partitioned.
- **Possible_keys** - the list of indexes that were considered during creation of the execution plan.
- **Key** - the index that will be used while executing the query.
- **Rows** - the number of rows examined during the execution.
- **Filtered** - the percentage of rows that were filtered out of the rows examined. The maximum and most optimized result will have 100 in this field.
- **Extra** - this tells some extra information on how MySQL evaluates, whether the query is using only where clause to match target rows, any index or temporary table, etc.
So, for the above query, we can determine that there are no partitions, there are no candidate indexes to be used and so no index is used at all, over 2M rows are examined and only 10% of them are included in the result, and lastly, only a where clause is used to match the target rows.
#### Creating an Index
Indexes are used to speed up selecting relevant rows for a given column value. Without an index, MySQL starts with the first row and goes through the entire table to find matching rows. If the table has too many rows, the operation becomes costly. With indexes, MySQL determines the position to start looking for the data without reading the full table.
A primary key is also an index which is also the fastest and is stored along with the table data. Secondary indexes are stored outside of the table data and are used to further enhance the performance of SQL statements. Indexes are mostly stored as B-Trees, with some exceptions like spatial indexes use R-Trees and memory tables use hash indexes.
There are 2 ways to create indexes:-
- While creating a table - if we know beforehand the columns that will drive the most number of where clauses in select queries, then we can put an index over them while creating a table.
- Altering a Table - To improve the performance of a troubling query, we create an index on a table which already has data in it using ALTER or CREATE INDEX command. This operation does not block the table but might take some time to complete depending on the size of the table.
Lets look at the query that we discussed in the previous section. Its clear that scanning over 2M records is not a good idea when only 10% of those records are actually in the resultset.
Hence, we create an index on the salary column of the salaries table.
`create index idx_salary on salaries(salary)`
OR
`alter table salaries add index idx_salary(salary)`
And the same explain plan now looks like this
```
mysql> explain select * from salaries where salary = 100000;
+----+-------------+----------+------------+------+---------------+------------+---------+-------+------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------+------------+------+---------------+------------+---------+-------+------+----------+-------+
| 1 | SIMPLE | salaries | NULL | ref | idx_salary | idx_salary | 4 | const | 13 | 100.00 | NULL |
+----+-------------+----------+------------+------+---------------+------------+---------+-------+------+----------+-------+
1 row in set, 1 warning (0.00 sec)
```
Now the index used is idx_salary, the one we recently created. The index actually helped examine only 13 records and all of them are in the resultset. Also, the query execution time is also reduced from over 700ms to almost negligible.
Lets look at another example. Here we are searching for a specific combination of first\_name and last\_name. But, we might also search based on last_name only.
```
mysql> explain select * from employees where last_name = 'Dredge' and first_name = 'Yinghua';
+----+-------------+-----------+------------+------+---------------+------+---------+------+--------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------------+------+---------------+------+---------+------+--------+----------+-------------+
| 1 | SIMPLE | employees | NULL | ALL | NULL | NULL | NULL | NULL | 299468 | 1.00 | Using where |
+----+-------------+-----------+------------+------+---------------+------+---------+------+--------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
```
Now only 1% record out of almost 300K is the resultset. Although the query time is particularly quick as we have only 300K records, this will be a pain if the number of records are over millions. In this case, we create an index on last\_name and first\_name, not separately, but a composite index including both the columns.
`create index idx_last_first on employees(last_name, first_name)`
```
mysql> explain select * from employees where last_name = 'Dredge' and first_name = 'Yinghua';
+----+-------------+-----------+------------+------+----------------+----------------+---------+-------------+------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------------+------+----------------+----------------+---------+-------------+------+----------+-------+
| 1 | SIMPLE | employees | NULL | ref | idx_last_first | idx_last_first | 124 | const,const | 1 | 100.00 | NULL |
+----+-------------+-----------+------------+------+----------------+----------------+---------+-------------+------+----------+-------+
1 row in set, 1 warning (0.00 sec)
```
We chose to put last\_name before first\_name while creating the index as the optimizer starts from the leftmost prefix of the index while evaluating the query. For example, if we have a 3-column index like idx(c1, c2, c3), then the search capability of the index follows - (c1), (c1, c2) or (c1, c2, c3) i.e. if your where clause has only first_name this index wont work.
```
mysql> explain select * from employees where first_name = 'Yinghua';
+----+-------------+-----------+------------+------+---------------+------+---------+------+--------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------------+------+---------------+------+---------+------+--------+----------+-------------+
| 1 | SIMPLE | employees | NULL | ALL | NULL | NULL | NULL | NULL | 299468 | 10.00 | Using where |
+----+-------------+-----------+------------+------+---------------+------+---------+------+--------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
```
But, if you have only the last_name in the where clause, it will work as expected.
```
mysql> explain select * from employees where last_name = 'Dredge';
+----+-------------+-----------+------------+------+----------------+----------------+---------+-------+------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------------+------+----------------+----------------+---------+-------+------+----------+-------+
| 1 | SIMPLE | employees | NULL | ref | idx_last_first | idx_last_first | 66 | const | 200 | 100.00 | NULL |
+----+-------------+-----------+------------+------+----------------+----------------+---------+-------+------+----------+-------+
1 row in set, 1 warning (0.00 sec)
```
For another example, use the following queries:-
```
create table employees_2 like employees;
create table salaries_2 like salaries;
alter table salaries_2 drop primary key;
```
We made copies of employees and salaries tables without the Primary Key of salaries table to understand an example of Select with Join.
When you have queries like the below, it becomes tricky to identify the pain point of the query.
```
mysql> select e.first_name, e.last_name, s.salary, e.hire_date from employees_2 e join salaries_2 s on e.emp_no=s.emp_no where e.last_name='Dredge';
1860 rows in set (4.44 sec)
```
This query is taking about 4.5 seconds to complete with 1860 rows in the resultset. Lets look at the Explain plan. There will be 2 records in the Explain plan as 2 tables are used in the query.
```
mysql> explain select e.first_name, e.last_name, s.salary, e.hire_date from employees_2 e join salaries_2 s on e.emp_no=s.emp_no where e.last_name='Dredge';
+----+-------------+-------+------------+--------+------------------------+---------+---------+--------------------+---------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+--------+------------------------+---------+---------+--------------------+---------+----------+-------------+
| 1 | SIMPLE | s | NULL | ALL | NULL | NULL | NULL | NULL | 2837194 | 100.00 | NULL |
| 1 | SIMPLE | e | NULL | eq_ref | PRIMARY,idx_last_first | PRIMARY | 4 | employees.s.emp_no | 1 | 5.00 | Using where |
+----+-------------+-------+------------+--------+------------------------+---------+---------+--------------------+---------+----------+-------------+
2 rows in set, 1 warning (0.00 sec)
```
These are in order of evaluation i.e. salaries\_2 will be evaluated first and then employees\_2 will be joined to it. As it looks like, it scans almost all the rows of salaries\_2 table and tries to match the employees\_2 rows as per the join condition. Though where clause is used in fetching the final resultset, but the index corresponding to the where clause is not used for the employees\_2 table.
If the join is done on two indexes which have the same data-types, it will always be faster. So, lets create an index on the *emp_no* column of salaries_2 table and analyze the query again.
`create index idx_empno on salaries_2(emp_no);`
```
mysql> explain select e.first_name, e.last_name, s.salary, e.hire_date from employees_2 e join salaries_2 s on e.emp_no=s.emp_no where e.last_name='Dredge';
+----+-------------+-------+------------+------+------------------------+----------------+---------+--------------------+------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+------------------------+----------------+---------+--------------------+------+----------+-------+
| 1 | SIMPLE | e | NULL | ref | PRIMARY,idx_last_first | idx_last_first | 66 | const | 200 | 100.00 | NULL |
| 1 | SIMPLE | s | NULL | ref | idx_empno | idx_empno | 4 | employees.e.emp_no | 9 | 100.00 | NULL |
+----+-------------+-------+------------+------+------------------------+----------------+---------+--------------------+------+----------+-------+
2 rows in set, 1 warning (0.00 sec)
```
Now, not only did the index help the optimizer to examine only a few rows in both tables, it reversed the order of the tables in evaluation. The employees\_2 table is evaluated first and rows are selected as per the index respective to the where clause. Then the records are joined to salaries\_2 table as per the index used due to the join condition. The execution time of the query came down **from 4.5s to 0.02s**.
```
mysql> select e.first_name, e.last_name, s.salary, e.hire_date from employees_2 e join salaries_2 s on e.emp_no=s.emp_no where e.last_name='Dredge'\G
1860 rows in set (0.02 sec)
```

@ -0,0 +1,300 @@
### MySQL Replication
Replication enables data from one MySQL host (termed as Primary) to be copied to another MySQL host (termed as Replica). MySQL Replication is asynchronous in nature by default, but it can be changed to semi-synchronous with some configurations.
Some common applications of MySQL replication are:-
- **Read-scaling** - as multiple hosts can replicate the data from a single primary host, we can set up as many replicas as we need and scale reads through them, i.e. application writes will go to a single primary host and the reads can balance between all the replicas that are there. Such a setup can improve the write performance as well, as the primary is dedicated to only updates and not reads.
- **Backups using replicas** - the backup process can sometimes be a little heavy. But if we have replicas configured, then we can use one of them to get the backup without affecting the primary data at all.
- **Disaster Recovery** - a replica in some other geographical region paves a proper path to configure disaster recovery.
MySQL supports different types of synchronizations as well:-
- **Asynchronous** - this is the default synchronization method. It is one-way, i.e. one host serves as primary and one or more hosts as replica. We will discuss this method throughout the replication topic.
![replication topologies](images/replication_topologies.png "Different Replication Scenarios")
- **Semi-Synchronous** - in this type of synchronization, a commit performed on the primary host is blocked until at least one replica acknowledges it. Post the acknowledgement from any one replica, the control is returned to the session that performed the transaction. This ensures strong consistency but the replication is slower than asynchronous.
- **Delayed** - we can deliberately lag the replica in a typical MySQL replication by the number of seconds desired by the use case. This type of replication safeguards from severe human errors of dropping or corrupting the data on the primary, for example, in the above diagram for Delayed Replication, if a DROP DATABASE is executed by mistake on the primary, we still have 30 minutes to recover the data from R2 as that command has not been replicated on R2 yet.
**Pre-Requisites**
Before we dive into setting up replication, we should know about the binary logs. Binary logs play a very important role in MySQL replication. Binary logs, or commonly known as *binlogs* contain events about the changes done to the database, like table structure changes, data changes via DML operations, etc. They are not used to log SELECT statements. For replication, the primary sends the information to the replicas using its binlogs about the changes done to the database, and the replicas make the same data changes.
With respect to MySQL replication, the binary log format can be of two types that decides the main type of replication:-
- Statement-Based Replication or SBR
- Row-Based Replication or RBR
**Statement Based Binlog Format**
Originally, the replication in MySQL was based on SQL statements getting replicated and executed on the replica from the primary. This is called statement based logging. The binlog contains the exact SQL statement run by the session.
![SBR update example](images/sbr_example_update_1.png "SBR update example")
So If we run the above statements to insert 3 records and the update 3 in a single update statement, they will be logged exactly the same as when we executed them.
![SBR binlog](images/sbr_binlog_view_1.png "SBR binlog")
**Row Based Binlog Format**
The Row based is the default one in the latest MySQL releases. This is a lot different from the Statement format as here, row events are logged instead of statements. By that we mean, in the above example one update statement affected 3 records, but binlog had only one update statement; if it is a row based format, binlog will have an event for each record updated.
![RBR update example](images/rbr_example_update_1.png "RBR update example")
![RBR binlog](images/rbr_binlog_view_1.png "RBR binlog")
**Statement Based v/s Row Based binlogs**
Lets have a look at the operational differences between statement-based and row-based binlogs.
| Statement Based | Row Based |
|---|---|
| Logs SQL statements as executed | Logs row events based on SQL statements executed |
| Takes lesser disk space | Takes more disk space |
| Restoring using binlogs is faster | Restoring using binlogs is slower |
| When used for replication, if any statement has a predefined function that has its own value, like sysdate(), uuid() etc, the output could be different on the replica which makes it inconsistent. | Whatever is executed becomes a row event with values, so there will be no problem if such functions are used in SQL statements. |
| Only statements are logged so no other row events are generated. | A lot of events are generated when a table is copied into another using INSERT INTO SELECT. |
**Note** - There is another type of binlog format called **Mixed**. With mixed logging, statement based is used by default but it switches to row based in certain cases. If MySQL cannot guarantee that statement based logging is safe for the statements executed, it issues a warning and switches to row based for those statements.
We will be using binary log format as Row for the entire replication topic.
**Replication in Motion**
![replication in motion](images/replication_function.png "Replication in motion")
The above figure indicates how a typical MySQL replication works.
1. Replica_IO_Thread is responsible to fetch the binlog events from the primary binary logs to the replica
2. On the Replica host, relay logs are created which are exact copies of the binary logs. If the binary logs on primary are in row format, the relay logs will be the same.
3. Replica_SQL_Thread applies the relay logs on the replica MySQL server.
4. If log-bin is enabled on the replica, then the replica will have its own binary logs as well. If log-slave-updates is enabled, then it will have the updates from the primary logged in the binlogs as well.
#### Setting up Replication
In this section, we will set up a simple asynchronous replication. The binlogs will be in row based format. The replication will be set up on two fresh hosts with no prior data present. There are two different ways in which we can set up replication.
- **Binlog based** - Each replica keeps a record of the binlog coordinates on the primary - current binlog and position in the binlog till where it has read and processed. So, at a time different replicas might be reading different parts of the same binlog.
- **GTID based** - Every transaction gets an identifier called global transaction identifier or GTID. There is no need to keep the record of binlog coordinates, as long as the replica has all the GTIDs executed on the primary, it is consistent with the primary. A typical GTID is the server_uuid:# positive integer.
We will set up a GTID based replication in the following section but will also discuss binlog based replication setup as well.
**Primary Host Configurations**
The following config parameters should be present in the primary my.cnf file for setting up GTID based replication.
```
server-id - a unique ID for the mysql server
log-bin - the binlog location
binlog-format - ROW | STATEMENT (we will use ROW)
gtid-mode - ON
enforce-gtid-consistency - ON (allows execution of only those statements which can be logged using GTIDs)
```
**Replica Host Configurations**
The following config parameters should be present in the replica my.cnf file for setting up replication.
```
server-id - different than the primary host
log-bin - (optional, if you want replica to log its own changes as well)
binlog-format - depends on the above
gtid-mode - ON
enforce-gtid-consistency - ON
log-slave-updates - ON (if binlog is enabled, then we can enable this. This enables the replica to log the changes coming from the primary along with its own changes. Helps in setting up chain replication)
```
**Replication User**
Every replica connects to the primary using a mysql user for replicating. So there must be a mysql user account for the same on the primary host. Any user can be used for this purpose provided it has REPLICATION SLAVE privilege. If the sole purpose is replication then we can have a user with only the required privilege.
On the primary host
```
mysql> create user repl_user@<replica_IP> identified by 'xxxxx';
mysql> grant replication slave on *.* to repl_user@'<replica_IP>';
```
**Obtaining Starting position from Primary**
Run the following command on the primary host
```
mysql> show master status\G
*************************** 1. row ***************************
File: mysql-bin.000001
Position: 73
Binlog_Do_DB:
Binlog_Ignore_DB:
Executed_Gtid_Set: e17d0920-d00e-11eb-a3e6-000d3aa00f87:1-3
1 row in set (0.00 sec)
```
If we are working with binary log based replication, the top two output lines are the most important ones. That tells the current binlog on the primary host and till what position it has written. For fresh hosts we know that no data is written so we can directly set up replication using the very first binlog file and position 4. If we are setting up a replication from a backup, then that changes the way we obtain the starting position. For GTIDs, the executed_gtid_set is the value where primary is right now. Again, for a fresh setup, we dont have to specify anything about the starting point and it will start from the transaction id 1, but when we set up from a backup, the backup will contain the GTID positions till where backup has been taken.
**Setting up Replica**
The replication setup must know about the primary host, the user and password to connect, the binlog coordinates (for binlog based replication) or the GTID auto-position parameter.
The following command is used for setting up
```
change master to
master_host = '<primary host IP>',
master_port = <primary host port - default=3306>,
master_user = 'repl_user',
master_password = 'xxxxx',
master_auto_position = 1;
```
**Note** - the *Change Master To* command has been replaced with *Change Replication Source To* from Mysql 8.0.23 onwards, also all the *master* and *slave* keywords are replaced with *source* and *replica*.
If it is binlog based replication, then instead of master_auto_position, we need to specify the binlog coordinates.
```
master_log_file = 'mysql-bin.000001',
master_log_pos = 4
```
**Starting Replication and Check Status**
Now that everything is configured, we just need to start the replication on the replica via the following command
`start slave;`
OR from MySQL 8.0.23 onwards,
`start replica;`
Whether or not the replication is running successfully, we can determine by running the following command
`show slave status\G`
OR from MySQL 8.0.23 onwards,
`show replica status\G`
```
mysql> show replica status\G
*************************** 1. row ***************************
Replica_IO_State: Waiting for master to send event
Source_Host: <primary IP>
Source_User: repl_user
Source_Port: <primary port>
Connect_Retry: 60
Source_Log_File: mysql-bin.000001
Read_Source_Log_Pos: 852
Relay_Log_File: mysql-relay-bin.000002
Relay_Log_Pos: 1067
Relay_Source_Log_File: mysql-bin.000001
Replica_IO_Running: Yes
Replica_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Source_Log_Pos: 852
Relay_Log_Space: 1283
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Source_SSL_Allowed: No
Source_SSL_CA_File:
Source_SSL_CA_Path:
Source_SSL_Cert:
Source_SSL_Cipher:
Source_SSL_Key:
Seconds_Behind_Source: 0
Source_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Source_Server_Id: 1
Source_UUID: e17d0920-d00e-11eb-a3e6-000d3aa00f87
Source_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Replica_SQL_Running_State: Slave has read all relay log; waiting for more updates
Source_Retry_Count: 86400
Source_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Source_SSL_Crl:
Source_SSL_Crlpath:
Retrieved_Gtid_Set: e17d0920-d00e-11eb-a3e6-000d3aa00f87:1-3
Executed_Gtid_Set: e17d0920-d00e-11eb-a3e6-000d3aa00f87:1-3
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Source_TLS_Version:
Source_public_key_path:
Get_Source_public_key: 0
Network_Namespace:
1 row in set (0.00 sec)
```
Some of the parameters are explained below:-
- **Relay_Source_Log_File** - the primarys file where replica is currently reading from
- **Execute_Source_Log_Pos** - for the above file on which position is the replica reading currently from. These two parameters are of utmost importance when binlog based replication is used.
- **Replica_IO_Running** - IO thread of replica is running or not
- **Replica_SQL_Running** - SQL thread of replica is running or not
- **Seconds_Behind_Source** - the difference of seconds when a statement was executed on Primary and then on Replica. This indicates how much replication lag is there.
- **Source_UUID** - the uuid of the primary host
- **Retrieved_Gtid_Set** - the GTIDs fetched from the primary host by the replica to be executed.
- **Executed_Gtid_Set** - the GTIDs executed on the replica. This set remains the same for the entire cluster if the replicas are in sync.
- **Auto_Position** - it directs the replica to fetch the next GTID automatically
**Create a Replica for the already setup cluster**
The steps discussed in the previous section talks about the setting up replication on two fresh hosts. When we have to set up a replica for a host which is already serving applications, then the backup of the primary is used, either fresh backup taken for the replica (should only be done if the traffic it is serving is less) or use a recently taken backup.
If the size of the databases on the MySQL primary server is small, less than 100G recommended, then mysqldump can be used to take backup along with the following options.
`mysqldump -uroot -p -hhost_ip -P3306 --all-databases --single-transaction --master-data=1 > primary_host.bkp`
- `--single-transaction` - this option starts a transaction before taking the backup which ensures it is consistent. As transactions are isolated from each other, so no other writes affect the backup.
- `--master-data` - this option is required if binlog based replication is desired to be set up. It includes the binary log file and log file position in the backup file.
When GTID mode is enabled and **mysqldump** is executed, it includes the GTID executed to be used to start the replica after the backup position. The contents of the mysqldump output file will have the following
![GTID info in mysqldump](images/mysqldump_gtid_text.png "GTID info in mysqldump")
It is recommended to comment these before restoring otherwise they could throw errors. Also, using master-data=2 will automatically comment the master\_log\_file line.
Similarly, when taking backup of the host using **xtrabackup**, the file *xtrabckup_info* file contains the information about binlog file and file position, as well as the GTID executed set.
```
server_version = 8.0.25
start_time = 2021-06-22 03:45:17
end_time = 2021-06-22 03:45:20
lock_time = 0
binlog_pos = filename 'mysql-bin.000007', position '196', GTID of the last change 'e17d0920-d00e-11eb-a3e6-000d3aa00f87:1-5'
innodb_from_lsn = 0
innodb_to_lsn = 18153149
partial = N
incremental = N
format = file
compressed = N
encrypted = N
```
Now, after setting MySQL server on the desired host, restore the backup taken from any one of the above methods. If the intended way is binlog based replication, then use the binlog file and position info in the following command
```
change Replication Source to
source_host = primary_ip,
source_port = 3306,
source_user = repl_user,
source_password = xxxxx,
source_log_file = mysql-bin.000007,
source_log_pos = 196;
```
If the replication needs to be set via GITDs, then run the below command to tell the replica about the GTIDs already executed. On the Replica host, run th following commands
```
reset master;
set global gtid_purged = e17d0920-d00e-11eb-a3e6-000d3aa00f87:1-5
change replication source to
source_host = primary_ip,
source_port = 3306,
source_user = repl_user,
source_password = xxxxx,
source_auto_position = 1
```
The reset master command resets the position of the binary log to initial. It can be skipped if the host is a freshly installed MySQL, but we restored a backup so it is necessary. The gtid_purged global variable lets the replica know the GTIDs that have already been executed, so that the replication can start after that. Then in the change source command, we set the auto-position to 1 which automatically gets the next GTID to proceed.
#### Further Reading
- [More applications of Replication](https://dev.mysql.com/doc/refman/8.0/en/replication-solutions.html)
- [Automtaed Failovers using MySQL Orchestrator](https://github.com/openark/orchestrator/tree/master/docs)

@ -0,0 +1,423 @@
### SELECT Query
The most commonly used command while working with MySQL is SELECT. It is used to fetch the result set from one or more tables.
The general form of a typical select query looks like:-
```
SELECT expr
FROM table1
[WHERE condition]
[GROUP BY column_list HAVING condition]
[ORDER BY column_list ASC|DESC]
[LIMIT #]
```
The above general form contains some commonly used clauses of a SELECT query:-
- **expr** - comma-separated column list or * (for all columns)
- **WHERE** - a condition is provided, if true, directs the query to select only those records.
- **GROUP BY** - groups the entire result set based on the column list provided. An aggregate function is recommended to be present in the select expression of the query. **HAVING** supports grouping by putting a condition on the selected or any other aggregate function.
- **ORDER BY** - sorts the result set based on the column list in ascending or descending order.
- **LIMIT** - commonly used to limit the number of records.
Lets have a look at some examples for a better understanding of the above. The dataset used for the examples below is available [here](https://dev.mysql.com/doc/employee/en/employees-installation.html) and is free to use.
**Select all records**
```
mysql> select * from employees limit 5;
+--------+------------+------------+-----------+--------+------------+
| emp_no | birth_date | first_name | last_name | gender | hire_date |
+--------+------------+------------+-----------+--------+------------+
| 10001 | 1953-09-02 | Georgi | Facello | M | 1986-06-26 |
| 10002 | 1964-06-02 | Bezalel | Simmel | F | 1985-11-21 |
| 10003 | 1959-12-03 | Parto | Bamford | M | 1986-08-28 |
| 10004 | 1954-05-01 | Chirstian | Koblick | M | 1986-12-01 |
| 10005 | 1955-01-21 | Kyoichi | Maliniak | M | 1989-09-12 |
+--------+------------+------------+-----------+--------+------------+
5 rows in set (0.00 sec)
```
**Select specific fields for all records**
```
mysql> select first_name, last_name, gender from employees limit 5;
+------------+-----------+--------+
| first_name | last_name | gender |
+------------+-----------+--------+
| Georgi | Facello | M |
| Bezalel | Simmel | F |
| Parto | Bamford | M |
| Chirstian | Koblick | M |
| Kyoichi | Maliniak | M |
+------------+-----------+--------+
5 rows in set (0.00 sec)
```
**Select all records Where hire_date >= January 1, 1990**
```
mysql> select * from employees where hire_date >= '1990-01-01' limit 5;
+--------+------------+------------+-------------+--------+------------+
| emp_no | birth_date | first_name | last_name | gender | hire_date |
+--------+------------+------------+-------------+--------+------------+
| 10008 | 1958-02-19 | Saniya | Kalloufi | M | 1994-09-15 |
| 10011 | 1953-11-07 | Mary | Sluis | F | 1990-01-22 |
| 10012 | 1960-10-04 | Patricio | Bridgland | M | 1992-12-18 |
| 10016 | 1961-05-02 | Kazuhito | Cappelletti | M | 1995-01-27 |
| 10017 | 1958-07-06 | Cristinel | Bouloucos | F | 1993-08-03 |
+--------+------------+------------+-------------+--------+------------+
5 rows in set (0.01 sec)
```
**Select first_name and last_name from all records Where birth_date >= 1960 AND gender = F**
```
mysql> select first_name, last_name from employees where year(birth_date) >= 1960 and gender='F' limit 5;
+------------+-----------+
| first_name | last_name |
+------------+-----------+
| Bezalel | Simmel |
| Duangkaew | Piveteau |
| Divier | Reistad |
| Jeong | Reistad |
| Mingsen | Casley |
+------------+-----------+
5 rows in set (0.00 sec)
```
**Display the total number of records**
```
mysql> select count(*) from employees;
+----------+
| count(*) |
+----------+
| 300024 |
+----------+
1 row in set (0.05 sec)
```
**Display gender-wise count of all records**
```
mysql> select gender, count(*) from employees group by gender;
+--------+----------+
| gender | count(*) |
+--------+----------+
| M | 179973 |
| F | 120051 |
+--------+----------+
2 rows in set (0.14 sec)
```
**Display the year of hire_date and number of employees hired that year, also only those years where more than 20k employees were hired**
```
mysql> select year(hire_date), count(*) from employees group by year(hire_date) having count(*) > 20000;
+-----------------+----------+
| year(hire_date) | count(*) |
+-----------------+----------+
| 1985 | 35316 |
| 1986 | 36150 |
| 1987 | 33501 |
| 1988 | 31436 |
| 1989 | 28394 |
| 1990 | 25610 |
| 1991 | 22568 |
| 1992 | 20402 |
+-----------------+----------+
8 rows in set (0.14 sec)
```
**Display all records ordered by their hire_date in descending order. If hire_date is the same then in order of their birth_date ascending order**
```
mysql> select * from employees order by hire_date desc, birth_date asc limit 5;
+--------+------------+------------+-----------+--------+------------+
| emp_no | birth_date | first_name | last_name | gender | hire_date |
+--------+------------+------------+-----------+--------+------------+
| 463807 | 1964-06-12 | Bikash | Covnot | M | 2000-01-28 |
| 428377 | 1957-05-09 | Yucai | Gerlach | M | 2000-01-23 |
| 499553 | 1954-05-06 | Hideyuki | Delgrande | F | 2000-01-22 |
| 222965 | 1959-08-07 | Volkmar | Perko | F | 2000-01-13 |
| 47291 | 1960-09-09 | Ulf | Flexer | M | 2000-01-12 |
+--------+------------+------------+-----------+--------+------------+
5 rows in set (0.12 sec)
```
### SELECT - JOINS
JOIN statement is used to produce a combined result set from two or more tables based on certain conditions. It can be also used with Update and Delete statements but we will be focussing on the select query.
Following is a basic general form for joins
```
SELECT table1.col1, table2.col1, ... (any combination)
FROM
table1 <join_type> table2
ON (or USING depends on join_type) table1.column_for_joining = table2.column_for_joining
WHERE …
```
Any number of columns can be selected, but it is recommended to select only those which are relevant to increase the readability of the resultset. All other clauses like where, group by are not mandatory.
Lets discuss the types of JOINs supported by MySQL Syntax.
**Inner Join**
This joins table A with table B on a condition. Only the records where the condition is True are selected in the resultset.
Display some details of employees along with their salary
```
mysql> select e.emp_no,e.first_name,e.last_name,s.salary from employees e join salaries s on e.emp_no=s.emp_no limit 5;
+--------+------------+-----------+--------+
| emp_no | first_name | last_name | salary |
+--------+------------+-----------+--------+
| 10001 | Georgi | Facello | 60117 |
| 10001 | Georgi | Facello | 62102 |
| 10001 | Georgi | Facello | 66074 |
| 10001 | Georgi | Facello | 66596 |
| 10001 | Georgi | Facello | 66961 |
+--------+------------+-----------+--------+
5 rows in set (0.00 sec)
```
Similar result can be achieved by
```
mysql> select e.emp_no,e.first_name,e.last_name,s.salary from employees e join salaries s using (emp_no) limit 5;
+--------+------------+-----------+--------+
| emp_no | first_name | last_name | salary |
+--------+------------+-----------+--------+
| 10001 | Georgi | Facello | 60117 |
| 10001 | Georgi | Facello | 62102 |
| 10001 | Georgi | Facello | 66074 |
| 10001 | Georgi | Facello | 66596 |
| 10001 | Georgi | Facello | 66961 |
+--------+------------+-----------+--------+
5 rows in set (0.00 sec)
```
And also by
```
mysql> select e.emp_no,e.first_name,e.last_name,s.salary from employees e natural join salaries s limit 5;
+--------+------------+-----------+--------+
| emp_no | first_name | last_name | salary |
+--------+------------+-----------+--------+
| 10001 | Georgi | Facello | 60117 |
| 10001 | Georgi | Facello | 62102 |
| 10001 | Georgi | Facello | 66074 |
| 10001 | Georgi | Facello | 66596 |
| 10001 | Georgi | Facello | 66961 |
+--------+------------+-----------+--------+
5 rows in set (0.00 sec)
```
**Outer Join**
Majorly of two types:-
- **LEFT** - joining complete table A with table B on a condition. All the records from table A are selected, but from table B, only those records are selected where the condition is True.
- **RIGHT** - Exact opposite of the left join.
Let us assume the below tables for understanding left join better.
```
mysql> select * from dummy1;
+----------+------------+
| same_col | diff_col_1 |
+----------+------------+
| 1 | A |
| 2 | B |
| 3 | C |
+----------+------------+
mysql> select * from dummy2;
+----------+------------+
| same_col | diff_col_2 |
+----------+------------+
| 1 | X |
| 3 | Y |
+----------+------------+
```
A simple select join will look like the one below.
```
mysql> select * from dummy1 d1 left join dummy2 d2 on d1.same_col=d2.same_col;
+----------+------------+----------+------------+
| same_col | diff_col_1 | same_col | diff_col_2 |
+----------+------------+----------+------------+
| 1 | A | 1 | X |
| 3 | C | 3 | Y |
| 2 | B | NULL | NULL |
+----------+------------+----------+------------+
3 rows in set (0.00 sec)
```
Which can also be written as
```
mysql> select * from dummy1 d1 left join dummy2 d2 using(same_col);
+----------+------------+------------+
| same_col | diff_col_1 | diff_col_2 |
+----------+------------+------------+
| 1 | A | X |
| 3 | C | Y |
| 2 | B | NULL |
+----------+------------+------------+
3 rows in set (0.00 sec)
```
And also as
```
mysql> select * from dummy1 d1 natural left join dummy2 d2;
+----------+------------+------------+
| same_col | diff_col_1 | diff_col_2 |
+----------+------------+------------+
| 1 | A | X |
| 3 | C | Y |
| 2 | B | NULL |
+----------+------------+------------+
3 rows in set (0.00 sec)
```
**Cross Join**
This does a cross product of table A and table B without any condition. It doesnt have a lot of applications in the real world.
A Simple Cross Join looks like this
```
mysql> select * from dummy1 cross join dummy2;
+----------+------------+----------+------------+
| same_col | diff_col_1 | same_col | diff_col_2 |
+----------+------------+----------+------------+
| 1 | A | 3 | Y |
| 1 | A | 1 | X |
| 2 | B | 3 | Y |
| 2 | B | 1 | X |
| 3 | C | 3 | Y |
| 3 | C | 1 | X |
+----------+------------+----------+------------+
6 rows in set (0.01 sec)
```
One use case that can come in handy is when you have to fill in some missing entries. For example, all the entries from dummy1 must be inserted into a similar table dummy3, with each record must have 3 entries with statuses 1, 5 and 7.
```
mysql> desc dummy3;
+----------+----------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------+----------+------+-----+---------+-------+
| same_col | int | YES | | NULL | |
| value | char(15) | YES | | NULL | |
| status | smallint | YES | | NULL | |
+----------+----------+------+-----+---------+-------+
3 rows in set (0.02 sec)
```
Either you create an insert query script with as many entries as in dummy1 or use cross join to produce the required resultset.
```
mysql> select * from dummy1
cross join
(select 1 union select 5 union select 7) T2
order by same_col;
+----------+------------+---+
| same_col | diff_col_1 | 1 |
+----------+------------+---+
| 1 | A | 1 |
| 1 | A | 5 |
| 1 | A | 7 |
| 2 | B | 1 |
| 2 | B | 5 |
| 2 | B | 7 |
| 3 | C | 1 |
| 3 | C | 5 |
| 3 | C | 7 |
+----------+------------+---+
9 rows in set (0.00 sec)
```
The **T2** section in the above query is called a *sub-query*. We will discuss the same in the next section.
**Natural Join**
This implicitly selects the common column from table A and table B and performs an inner join.
```
mysql> select e.emp_no,e.first_name,e.last_name,s.salary from employees e natural join salaries s limit 5;
+--------+------------+-----------+--------+
| emp_no | first_name | last_name | salary |
+--------+------------+-----------+--------+
| 10001 | Georgi | Facello | 60117 |
| 10001 | Georgi | Facello | 62102 |
| 10001 | Georgi | Facello | 66074 |
| 10001 | Georgi | Facello | 66596 |
| 10001 | Georgi | Facello | 66961 |
+--------+------------+-----------+--------+
5 rows in set (0.00 sec)
```
Notice how natural join and using takes care that the common column is displayed only once if you are not explicitly selecting columns for the query.
Some More Examples
Display emp_no, salary, title and dept of the employees where salary > 80000
```
mysql> select e.emp_no, s.salary, t.title, d.dept_no
from
employees e
join salaries s using (emp_no)
join titles t using (emp_no)
join dept_emp d using (emp_no)
where s.salary > 80000
limit 5;
+--------+--------+--------------+---------+
| emp_no | salary | title | dept_no |
+--------+--------+--------------+---------+
| 10017 | 82163 | Senior Staff | d001 |
| 10017 | 86157 | Senior Staff | d001 |
| 10017 | 89619 | Senior Staff | d001 |
| 10017 | 91985 | Senior Staff | d001 |
| 10017 | 96122 | Senior Staff | d001 |
+--------+--------+--------------+---------+
5 rows in set (0.00 sec)
```
Display title-wise count of employees in each department order by dept_no
```
mysql> select d.dept_no, t.title, count(*)
from titles t
left join dept_emp d using (emp_no)
group by d.dept_no, t.title
order by d.dept_no
limit 10;
+---------+--------------------+----------+
| dept_no | title | count(*) |
+---------+--------------------+----------+
| d001 | Manager | 2 |
| d001 | Senior Staff | 13940 |
| d001 | Staff | 16196 |
| d002 | Manager | 2 |
| d002 | Senior Staff | 12139 |
| d002 | Staff | 13929 |
| d003 | Manager | 2 |
| d003 | Senior Staff | 12274 |
| d003 | Staff | 14342 |
| d004 | Assistant Engineer | 6445 |
+---------+--------------------+----------+
10 rows in set (1.32 sec)
```
#### SELECT - Subquery
A subquery is generally a smaller resultset that can be used to power a select query in many ways. It can be used in a where condition, can be used in place of join mostly where a join could be an overkill.
These subqueries are also termed as derived tables. They must have a table alias in the select query.
Lets look at some examples of subqueries.
Here we got the department name from the departments table by a subquery which used dept_no from dept_emp table.
```
mysql> select e.emp_no,
(select dept_name from departments where dept_no=d.dept_no) dept_name from employees e
join dept_emp d using (emp_no)
limit 5;
+--------+-----------------+
| emp_no | dept_name |
+--------+-----------------+
| 10001 | Development |
| 10002 | Sales |
| 10003 | Production |
| 10004 | Production |
| 10005 | Human Resources |
+--------+-----------------+
5 rows in set (0.01 sec)
```
Here, we used the avg query above (which got the avg salary) as a subquery to list the employees whose latest salary is more than the average.
```
mysql> select avg(salary) from salaries;
+-------------+
| avg(salary) |
+-------------+
| 63810.7448 |
+-------------+
1 row in set (0.80 sec)
mysql> select e.emp_no, max(s.salary)
from employees e
natural join salaries s
group by e.emp_no
having max(s.salary) > (select avg(salary) from salaries)
limit 10;
+--------+---------------+
| emp_no | max(s.salary) |
+--------+---------------+
| 10001 | 88958 |
| 10002 | 72527 |
| 10004 | 74057 |
| 10005 | 94692 |
| 10007 | 88070 |
| 10009 | 94443 |
| 10010 | 80324 |
| 10013 | 68901 |
| 10016 | 77935 |
| 10017 | 99651 |
+--------+---------------+
10 rows in set (0.56 sec)
```

@ -40,7 +40,11 @@ nav:
- Key Concepts: databases_sql/concepts.md
- MySQL: databases_sql/mysql.md
- InnoDB: databases_sql/innodb.md
- Operational Concepts: databases_sql/operations.md
- Backup and Recovery: databases_sql/backup_recovery.md
- MySQL Replication: databases_sql/replication.md
- Operational Concepts:
- Select Query: databases_sql/select_query.md
- Query Performance: databases_sql/query_performance.md
- Lab: databases_sql/lab.md
- Conclusion: databases_sql/conclusion.md
- NoSQL:

Loading…
Cancel
Save