Tech blog February 9, 2016
When comparing storage options, 4KiB random read/write metrics are typically used, because modern filesystems and operating systems treat disk I/O in 4KiB blocks. However, with real-world tasks, 4KiB may not be the ideal data chunk size for measurement.
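You can check the preferred I/O block size your filesystem reports; a quick Python check (the path "/" here is just an example):

```python
import os

# f_bsize is the filesystem's preferred I/O block size in bytes.
st = os.statvfs("/")
print(st.f_bsize)  # commonly 4096 on ext4/xfs
```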
In this article, we will observe a "raw" access pattern of an OLTP workload on MySQL to see what size of blocks are used in one typical application.

MySQL and Linux storage stack
First of all, let’s take a look at an overview of how MySQL works with the Linux storage stack.
We are using Ubuntu 14.04 LTS (kernel 3.19.0), and here is a great diagram showing how a user application talks with the block device.
The Linux storage stack diagrams for Linux Kernel 3.17:
The long trip of an I/O operation starts from a system call, for example read(2) or write(2), but since InnoDB, the default storage engine of MySQL 5.7, requires more performance efficiency,
io_submit(2) is used to make the operations asynchronous.
The system call then traps into the kernel, and the VFS invokes the appropriate implementation for the specific filesystem.
As you can see, the page cache sits alongside the VFS and transparently stores the contents of file I/O in DRAM.
Generally speaking, the page cache improves I/O performance, although some applications implement their own cache mechanism in user space, in which case the page cache may become overhead.
InnoDB also has its own data cache called the “buffer pool”, so sometimes bypassing the page cache is the recommended configuration.
You can set up MySQL by adding flags to your configuration file. Additionally, the buffer pool size must be increased to utilize DRAM efficiently.
[innodb_flush_method = O_DIRECT]
[innodb_buffer_pool_size = 6442450944]

Every BIO (Block I/O) to the block device goes through the block layer, where the I/O scheduler optimizes the ordering of each I/O operation. Linux kernel 3.19 provides three I/O schedulers: noop, deadline, and cfq. You can switch them at runtime:
[$ echo noop > /sys/block/sda/queue/scheduler]
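Reading the same sysfs file lists all available schedulers, with the active one marked in square brackets. A small hypothetical Python helper to pick out the active scheduler from that format:

```python
def active_scheduler(sysfs_text: str) -> str:
    """Return the scheduler marked with [brackets] in the
    contents of /sys/block/<dev>/queue/scheduler."""
    for token in sysfs_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError("no active scheduler marked")

print(active_scheduler("noop deadline [cfq]"))  # -> cfq
```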
The I/O scheduler then dispatches requests to the driver stack. In Linux, SATA devices are treated as SCSI devices, and the SAT (SCSI-ATA Translation) layer is responsible for translating each SCSI command into its ATA counterpart; finally, the ATA command is issued to our SATA SSD by the AHCI driver.
blktrace - generate traces of the i/o traffic on block devices
To create raw I/O statistics, we need to observe the BIO commands issued by the I/O scheduler to the driver, and for that task the blktrace tool is very useful.
blktrace combines kernel functionality that can log BIO commands in the Linux storage stack with a CLI tool to retrieve the kernel's BIO data for a specific device.
[$ sudo blktrace -d /dev/sda1 -o ./]
blktrace stores a series of requests in a binary format which blkparse can transform into a human readable format.
[$ blkparse -a issue -i sda.blktrace.0 -o blktrace.log]
blktrace records every type of request at each stage, but adding the extra argument “-a issue” restricts the output to only the I/O requests dispatched from the scheduler to the driver.

Results
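The resulting log can then be aggregated into a request-size distribution. A rough Python sketch, assuming blkparse's default text layout (device, CPU, sequence, timestamp, PID, action, RWBS, start sector, '+', sector count, process), where sector counts are in 512-byte units:

```python
from collections import Counter

def size_histogram(blkparse_log: str) -> Counter:
    """Count I/O request sizes (in KiB) from blkparse text output.
    Each line looks roughly like:
      8,0 0 1 0.000000000 1234 D R 5864 + 32 [mysqld]
    where the number after '+' is the length in 512-byte sectors."""
    hist = Counter()
    for line in blkparse_log.splitlines():
        fields = line.split()
        if "+" in fields:
            sectors = int(fields[fields.index("+") + 1])
            hist[sectors * 512 // 1024] += 1  # request size in KiB
    return hist

sample = ("8,0 0 1 0.000000000 1234 D R 5864 + 32 [mysqld]\n"
          "8,0 0 2 0.000100000 1234 D W 6128 + 8 [mysqld]")
print(size_histogram(sample))  # 32 sectors = 16KiB, 8 sectors = 4KiB
```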
We chose LinkBench for our benchmark workload; it is benchmark software developed by Facebook that emulates their social graph analysis. Running LinkBench and using blktrace to analyze the BIO commands issued by the I/O scheduler, we are able to create the following graph:
You can see that LinkBench and InnoDB mostly use 16KiB blocks with the default settings, with a few 4KiB write requests, while all reads use 16KiB blocks.
What does this mean?
Instead of relying on simple synthetic benchmarks that stick to 4KiB blocks for random access, real-world benchmarks such as LinkBench will help you choose the right storage device for your needs.
Article by: Takuro Iizuka