josephtarango.com

Content intended as a guide for engineers working on performance related issues. It covers the way in which performance targets are established, how to interpret and understand performance data, identifying bottlenecks, and strategies for tackling specific types of issues. It is not a comprehensive recipe for resolving any performance issue that could be encountered. Rather, its intention is to arm the reader with the tools necessary to thoroughly investigate performance and formulate reasonable hypotheses.

*Performance enhancements are a continuous and infinite optimization process so in generality performance is complete with we meet the deadline and/or customers are content with their empirical usage.

Introduction

NVMe is known as non-volatile memory express protocol designed to further optimize storage typologies for throughput and latency for non-volatile persistent memory technologies. The features can be found at https://nvmexpress.org/specifications/ with each specification defining the precise operation expectations. A typical architecture implementation uses PCIe (Peripheral Component Interconnect Express) to map operations through shared memory and acceleration commands paired with operating system specific device drivers optimized for general protocol adherence or optimized for specific devices. In general a user application will send a command to the operating system kernel to service the command then translate the format to the file-system expectation, the file-system command will then be submitted to the device driver command queue, then process it in the physical device. The key items to note are the transaction times between the software, operating system, driver, hardware activation path, and device when reviewing performance.

Before reviewing performance, it is significant to note the precise usage case or workload. A workload is typically defined as trace of known command sequences in the form of a benchmark standard or a synthetic set of sequences used to compare theoretical versus empirical expectations. The challenge between a standard benchmark versus synthetic workload is the application processing time that can manifest in anomalous events or additional overhead from the software implementation , compiler, and/or libraries. Two well known workload applications are FIO (Flexible I/O) and IOMeter with both supporting Linux and Windows operating systems. The baseline compiled images provided at these main sites are functional; however, for precise performance analysis the source code should be compiled from source with optimizations for a given native micro architecture with operating system level optimizations for latency and throughput. When using a specific device it is key to note the optimized variables in the NVMe identification command by using software such as NVMe-CLI (summary sheet).

For additional images or content details, please feel free to ready my patents.

Workload and Data Collection

When discussing performance, it is necessary to understand the workload that is being performed. Let's take a look at the key components that make up a workload:

Block Size - The size of the IO (Input-Output) being sent to the drive, typically represented in KB (Kilo-Byte)
Percent Random - Percentage of IO that are random in nature (the rest being sequential).
Percent Read - Percentage of IO that are reads (the rest being writes)
Queue Depth - The number of uncompleted commands that are allowed to flow through the drive at any given time
Workers - The number of threads or processing cores sending IO to the drive simultaneously. More workers enables the host machine to more easily saturate the drive with the entire queue depth worth of commands.
Span Size - The amount of the total LBA (Logical Block Address) range that is being accessed by the workload
Alignment - Refers to if the blocks are aligned to the native IU (Indirection Unit) boundaries (I.E. 4 KB) or not (unaligned)

When discussing a particular workload, it is customer to formulate it in the following format:

Block size, alignment, percent random percent read / percent write, queue depth, workers (threads), span size, . An example of this would be:
- 4 KB, aligned, random 70/30, QD128, W1 full span

Often times the span indicator is dropped, and in this case the workload is assumed to be full span. The same thing applies to the alignment with the default being aligned. Also, the number of workers is generally dropped when there is a single worker. Using these defaults would transform this workload to look like so: 4kB random 70/30 QD128.

There are many other important aspects of a workload such as if there are trims (De-allocation range command), administration commands, or asynchronous events such as Telemetry during the workload. The states and change of workload being constant or exhibits any bursts add possible anomalies.

Pre-Conditioning

Almost as important as the workload itself is the preconditioning of the drive before the workload is run. If the drive was prepared with random writes versus sequential writes, this can have a massive impact on both the write and read performance of workloads that are run after. The standard prep procedure is to sequentially fill the drive with sequential writes before every workload. For sequential read/write workloads, no further prep is necessary. For random write workloads, a random write workload must be run until steady state is reach (more on this later). For random read workloads, the matching random write workload is expected to be run beforehand. Obviously when looking at real world use cases, we do not get to control the preconditioning of the drive. In these cases, it is very important to understand what was performed on the drive prior to the workload that is being analyzed to get an idea for how the blocks might be laid out on physical media.

Consistency in Results

It is imperative to use the same platform (host machine), device (SSD), operating system, driver, PCIe connectivity, protocol training speed (I.E. PCIe 3.0 x4 versus PCIe 1.0 x1), data preconditioning, and workload in order to attain meaningful A vs. B comparisons. It is also important to understand and bound the run-to-run variation. In other words, we must have confidence in our A vs. A comparisons before we draw strong conclusions about A vs. B comparisons.

Identifying & Classifying Issues

Most internal performance sightings come from strictly looking at the FIO/IOMeter results overview page and tagging shortfalls compared to targets. Diving into the actual IOPS (Inputs-Outputs Per Second) plot allows for a deeper classification of an observed challenge. The criteria by which we judge ourselves is generally defined in three buckets: throughput, uniformity, and latency/QoS. These three are all essentially the same thing - rate of IO - measured on very different timescales.

Throughput Issues

Throughput is a measure of the total bandwidth while the workload is in steady state. It is the rate of IO on the longest timescale. The time it takes to reach steady state for a given workload is thus statistics should be used to comprehend stationary effects in the FIO/IOMeter schedule files. These will depend on the device; thus it is important to note.

Uniformity Issues

Uniformity looks at bandwidth in one second intervals. Uniformity is generally represented as a percentage where the slowest one second interval is compared against the average throughput during the sampling window.

Latency Issues

Latency is looking at the time to completion for individual commands. This is the finest granularity performance measurement. Latency and Quality of Service (QoS) are often used interchangeably, but there is a slight difference. Latency typically refers to the round trip time of a particular command whereas QoS is generally referring to unique latency buckets that individual times fall into. When discussing QoS, it is important to communicate in terms of number of nines. For example, if I say my four-nines is at 500 us, this means 99.99% of commands take less than or equal to 500 us. In other words, only 1 out of 10,000 commands takes over 500 us. When looking at latency, it is incredibly important to run long enough to trust the QoS buckets that are being generated.

Operation Systems and filesystems may add some noise on the measured latency data. Like most system timing on multi-core systems has a granularity (I.E. 15.5 ms) and is not alterable, then when system has several processes is running, then on FIO/IOmeter latency result, we can observe some max latency glitches which are multiple (I.E. 15.5 ms), they are from system, not real drive latency. One of the best known methods is to use internal latency tracking (I.E. NVMe 1.3 optional telemetry) can help to provide more accurate latency data reflect the real performance of the drive. In general, use logarithmic scaling, and range from 1 us to 2 s , 1216 total buckets, ~8% max error per buckets based on the precision and accuracy of probing technology.

Latency Stats

There is infrastructure to track latency buckets (QoS) internally on the drive that can be pulled out via telemetry commands. This data is expected to be correlated with whole system, host data, but should be shorter because it is only measuring from the time it is pulled off the NVMe queue/mailbox until the time it is completed to the IO submission queue. Vendor infrastructure can also be used to assert, throw exceptions, trap, and/or state snapshot on high latency events.

Diagnosis

The link between the host machine and the drive has a maximum throughput capability, and performance can never exceed this link speed. To identify the maximum achievable link speed, identify the interface of the product and look up the maximum theoretical bandwidth. Note that the practical limitation is less than the theoretical because this link must be used for transmitting command meta data in addition to user data. For reference, here are some theoretical link speeds:

SATA III = 600 MB / s
SAS 6G = 2,400 MB / s
PCIe
- Gen3 x4 = 4,000 MB / s
- Gen4 x4 = 8,ooo MB / s

CPU

It is possible that the device ASIC (Application Specific Integrated Circuit) device CPU (Central Processing Unit) is so overloaded with instructions that it is the primary bottleneck. If the processor was able to cycle through operation faster, it would result in greater IOPS. This can include the time to read and write certain registers. The first step in identifying if a workload is CPU limited is collecting a device trace and seeing if the CPU is busy at all times. If the processor is constantly doing things besides idling, there is a good chance the workload is CPU limited. This can then be further proved out by hacking out non-critical pieces of code in the fast IO paths (for example, command queue handling), to see if a correlating performance improvement is seen when removing instructions.

Buffer

The SRAM (syncronous random access memory) or transfer buffer is used as a staging buffer for moving data between the host, device, and physical media. It can be hard to determine if this is the real bottleneck because bottlenecks elsewhere in the system often manifest as traffic jams back up into the buffer. The amount of buffer in use is a good first check to determine if most of the buffer is in use at all times. If most of the buffer is not being used at all times, it is a good indication that we are not buffer limited. Even if all of the buffer is in use, more work must be done to ensure that other bottlenecks are not giving the illusion of a buffer bottleneck.

To further shed light on if buffer is the real bottleneck, the easiest thing is to short stroke (artificially constrain by reducing the allocation size) the buffers. This fakes out the host or firmware into thinking it has less available buffer and can be accomplished by hacking away at the cache management initialization function. If buffer is short stroked by 10% and a corresponding 10% performance reduction is witnessed, it is pretty good evidence that a buffer bottleneck exists.

Channel

The channel is the transfer medium between DRAM, SRAM, and physical media. Each ASIC has a finite number of channels that are capable of running at a certain speed, and the bandwidth of a workload cannot exceed the capability of the channel. The max theoretical channel bandwidth can be found my multiplying the number of channels times the channel speed and the width (generally one byte). For example:

18 ASIC channels * 200 million transfers per second * 1 byte = 3600 million bytes per second = 3433 MB/s

To see if a workload is approaching the theoretical limit, counters can be added to the host read and data transfer done callbacks to measure how many sectors are being IO'd across the channel. It is important to keep in the mind that the practical channel limitations is much lower than the theoretical due to meta data and error correction information that must be sent in addition to the user data. The exact sector meta information, code word size, etc. must be understood to more tightly bound the maximum practical channel speed.

Media Package Stacking, Die, Internal Switching Logic

When selecting a device, the media packaging will impact performance for example a single die package may have more dedicated pins and for an 8-die stacked package an additionally switching logic will be added to select the specific chip on the device. Additionally, the media technology can stack memory cells into planes that additionally add internal switching logic delays for increased density. When evaluating a device, it is possible on the device every single die is busy most of the time during a workload. This is quite often the case on smaller stock-keeping unit (SKU) and is one of the main reasons why performance scales up with capacity to a point. To get an idea of what the maximum achievable back end capability is, a paper exercise can be performed to see how much data could be churned through if the die were bust all the time. For example, in a pure write workload:

((Number die - Parity protection die) * write plane width * block size) / (tIO + tProg)

tIO is the transfer time from ASIC to physical

tProg is the time to program the physical unit granularity

If I'm looking at a single core, 6 die product with 1 parity die and 16 KB page size quad plane NAND with tIO of 100us and tProg of 1.5 ms, the equation looks like so: ((6 - 1) * 4 * 16kB) / (100us + 1500us) = 195 MB/s

The paper exercise assumes the firmware is perfect at keeping the back end loaded, which is not the case, so it should be considered an upper bound. If the empirical results are approaching the paper exercise, it is worthwhile to try and measure the back-end utilization in firmware by periodically crawling the internal media command queue to observe if every die is active with work most of the time. An alternate approach would be to use counters associated with callbacks to track busy die.

Power

The ASIC has a power governor that will prevent commands from being dispatched to a device externally or internally when the maximum power threshold has been reached. There are counters in the power governor that can be queried to determine if we are throttling commands due to saturating the power budget from NVMe telemetry.

Device and Firmware Bottlenecks

Work Items

There are a variety of work items that are used to capture meta data around transfers between different device memories. Most of these work items are virtual resources in the sense that we can make more of them with firmware. The real limitations behind making lots of them are DRAM (Dynamic Rate Random Access Memory) constraints for storing the work item array and the number of bits for tracking work item identifiers that we want to allocate and pass around. To identify work item bottlenecks, counters, internal logs, asserts/exceptions, and/or asynchronous state snapshots can be used to identify cases where one of these soft resources is desired and unavailable.

These items include:

Media Work Items - These are used for moving data between SRAM and physical non-volatile memory and vice versa

DMA Work Items - These are used for moving data between SRAM and DRAM and vice versa

Containers - Containers are used for all sorts of things, but one container is used for each IO command, so running out of them is a performance problem

DMA descriptors - These are used for moving data between SRAM and host DRAM and vice versa

CMD descriptors - This one isn't really like the others in the sense that it is actually tied to a hardware constraint, but profiling it is similar to the others, so it was placed in this section. CMDs detail host commands.

CPU Consumption

Rare events can be very compute intensive and will trigger a certain task or thread to hang onto the CPU much longer than normal. These depend on the device kernel policies can result in performance reduction events. Examples of this might include non-volatile physical media completion handling for error cases or data mismatch.

Working on Throughput Issues

Once the primary bottleneck has been identified, we must figure out how to optimize around this bottleneck. This often means trading off optimal use of some other resources in order to speed up the primary bottleneck. However, different workloads often have unique bottlenecks, so what is good for one workload may be bad for another. This is where dynamic machine learning or artificial intelligence policies can save the day.

With dynamic policies, we often leverage some statistic to identify when we are operating in certain conditions and then perform specific actions under that condition. As an example, I might want to restrict garbage collection to consuming a small amount of resources when performing BDRs (Background Data Relocation/Refresh) in a pure read workload, but I want garbage collection to get as many resources as needed to maintain Write Amp relocation in a random write workload. To accomplish this, we could view the garbage collection source media set and host write buffer in use to detect which mode I am in and put a capacity on garbage collection resources based on the observed mode.

Working on Uniformity Issues

A uniformity issue implies a sustained bandwidth drop. This often means whatever is causing the drop is happening on a long enough time scale to be detected observed history. Some of the same principles when working on throughput issues can be applied to working on uniformity issues. Typically, observed history is the first level tool to have an idea of what is going on at the time of the dip.

Was there a shift in buffer allocation?
Does the reduction event coincide with crossing some free space thresholds
Were there a unaligned zero data writes?
Trim tokens?
Device standby/shutdown command?

Most of these initial hypotheses can be ruled in or out via observed history analysis.

If observed history is not providing any clues as to what is going on, a few options can be pursued:

Add new statistics into observed history in order to track them across time and rule them in/out as contributors to the problem.

Add new Observed History logging points to determine if certain code paths are being executed and/or to achieve finer granularity with regards to time.

Trap on uniformity dip so that the entire state of the system can be analyzed for anomalies. This can be done by sending telemetry administration commands when bandwidth falls below some level and break-pointing/asserting in that command path.

It can also be done by having the drive internally measure bandwidth, setting a device or system configuration to indicate when the drive has reached steady state for the workload, and injecting an assert event when the device or system configuration is set then bandwidth falls below some threshold.

Special Considerations

Pre-fetch

Pre-fetch is a low QD sequential read bandwidth improvement. It aims to proactively fetch reads from the non-volatile media to take advantage of parallelism when the next host read request can be anticipated. Assumption is that we can take advantage of available transfer buffer and/or idle die/channels opportunistically. However, this can create issues if pre-fetch is not getting enough hits (correctly anticipated host reads) to justify the resource consumption. In mixed or pure write workloads, a case can exist due to the filesystem reading the partition table and the drive tying up a large number of resources assuming more sequential reads are coming.

Read Modify Write

If a write workload is not aligned to the drives indirection granularity, writes will incur a large read-modify-write (RMW) penalty. In order to save DRAM for the indirection table, sectors are tracked in indirection unit chunks. This means that these sectors MUST travel as a group. Because non-volatile media cannot overwrite in place, if the host only writes a subset of the indirection unit chunk or media block size, the remaining sectors must be read from the old location into the transfer buffer and then written to non-volatile media in a new location. This results in inefficient use of the transfer buffer (have to wait for read to complete, and it could be stuck behind other non-volatile media activity), increased back-end consumption, and more CPU cycles. Depending on the technologies used in a device, these can be constructed in firmware or hardware.

Multi-worker Sequential

When running with multiple sequential read/write worker threads at the host level, synchronization is not guaranteed, and the workload will likely not appear sequential to the drive. Thus, the assumptions that generally apply to a sequential workload - self-invalidating for writes or pre-fetch capable for reads - become invalid. This can create problems for small block size, high QD workloads because the host cannot saturate the IO queue without running multiple workers, but multiple workers fundamentally changes the workload. An example of intelligent optimization is "Performance Configurable Nonvolatile Memory Set" in " https://uspto.report/patent/app/20200218649 .

Admin/Control Commands During Data Collection

It has been observed on multiple programs that issuing admin/control commands can cause tail latencies. It is possible that handling these commands delays the issuing of IO commands for architectures where the core has a single command dispatcher. Examples of control commands that might be collected during performance run are:

Collect temperature data
Collect SMART data
Collect Observed Telemetry History

Hardware and Operating System Optimizations

To optimize the NVMe SSD or SSDs the use case is significant, thus before committing to a specific setup a sweep of possible configurations is necessary for each SSD. Depending on the system empirical performance characteristics can change do to operating system, software, firmware, and hardware; which is why it is important to perform controlled experiments or tests with the dependent and Independent variables. If in the event one cannot recreate the exact experiment, a statistical sample of the most simlar configurations may be necessary. These configurations can be determined Product Requirement Document (PRD), Product Specification, and NVM Express standard. The key data structure is the Identify Controller, which has key information necessary for performance tuning.

To gather the information on Linux install the NVMe CLI tool or on Windows the IMAS tool, for the content below we will only show Ubuntu x64 20.04 LTS assuming you already created a name space and formatted the device. More detailed information can be found at the NVMe wiki. Typically, Intel NVMe SSDs are designed for a 4096 Bytes optimized minimal IO Size and multiples of 4096 Bytes are encouraged to activate accelerated hardware paths. The critical piece of information is the id-ns "best" relative performance LBA format since the SSD is designed and performance tuned with these parameters in mind. When attempting to understand a performance issue, the Telemetry host/controller initiated log and smart log are recommend. Within Intel SSDs, the smart log content is within the Telemetry host/controller initiated log and when performing anomaly analysis with Intel engineers this content contains meta information necessary to evaluate hardware and firmware state spaces.

T1000@skynetAI:~$ sudo apt-get install -y nvme-cli

T1000@skynetAI:~$ sudo nvme list

Node SN Model Namespace Usage Format FW Rev

---------------- -------------------- ---------------------------------------- --------- --------------------------

/dev/nvme0n1 PHKE00000000750BGN INTEL SSDPE21K750GA 1 750.16 GB / 750.16 GB 512 B + 0 B E2010325

T1000@skynetAI:~$ sudo nvme id-ctrl /dev/nvme0n1 -H -o normal

T1000@skynetAI:~$ sudo nvme id-ns /dev/nvme0n1 -H

T1000@skynetAI:~$ sudo nvme smart-log /dev/nvme0n1 -o normal

id-ctrl

lpa : 0x2

[3:3] : 0 Telemetry host/controller initiated log page Supported

[2:2] : 0 Extended data for Get Log Page Supported

[1:1] : 0x1 Command Effects Log Page Supported

[0:0] : 0 SMART/Health Log Page per NS Supported

sqes : 0x66

[7:4] : 0x6 Max SQ Entry Size (64)

[3:0] : 0x6 Min SQ Entry Size (64)

cqes : 0x44

[7:4] : 0x4 Max CQ Entry Size (16)

[3:0] : 0x4 Min CQ Entry Size (16)

id-ns

LBA Format 0 : Metadata Size: 0 bytes - Data Size: 512 bytes - Relative Performance: 0x2 Good

LBA Format 1 : Metadata Size: 8 bytes - Data Size: 512 bytes - Relative Performance: 0x2 Good

LBA Format 2 : Metadata Size: 16 bytes - Data Size: 512 bytes - Relative Performance: 0x2 Good

LBA Format 3 : Metadata Size: 0 bytes - Data Size: 4096 bytes - Relative Performance: 0 Best

LBA Format 4 : Metadata Size: 8 bytes - Data Size: 4096 bytes - Relative Performance: 0 Best

LBA Format 5 : Metadata Size: 64 bytes - Data Size: 4096 bytes - Relative Performance: 0 Best

LBA Format 6 : Metadata Size: 128 bytes - Data Size: 4096 bytes - Relative Performance: 0 Best

smart-log

Smart Log for NVME device:nvme0n1 namespace-id:ffffffff

critical_warning : 0

temperature : 46 C

available_spare : 100%

available_spare_threshold : 0%

percentage_used : 0%

data_units_read : 64,478,107

data_units_written : 50,610,214

host_read_commands : 563,305,671

host_write_commands : 617,967,115

controller_busy_time : 477

power_cycles : 305

power_on_hours : 10,323

unsafe_shutdowns : 180

media_errors : 0

num_err_log_entries : 0

Warning Temperature Time : 0

Critical Composite Temperature Time : 0

Thermal Management T1 Trans Count : 0

Thermal Management T2 Trans Count : 0

Thermal Management T1 Total Time : 0

For those that are technically inclined, focusing on optimizations based on understanding the Media page write and read granularity will be the best approach using the technical reference information. However, being a computer scientist, I believe in empirically verifying desired configurations through automation techniques using Machine Learning it is recommend to explore the search space through hyperparameter tuning. A good place to start is the Microsofts DeepSpeed using the benchmark library. Additionally, as noted above the controlled experiment requires the exact setup and if this cannot be achieved then using K-Means for the hyper parameter sets will aid in the nearest best configuration.

Single and RAID Configurations

Due to the reduction in the host protocol from SATA to NVMe, hardware operating system optimizations were necessary. One product is Intel® Virtual RAID on CPU (Intel® VROC) with hybrid RAID solution, specifically designed for NVMe SSDs connected directly to the CPU. Intel® VROC is made possible by the new CPU feature Intel® Volume Management Device, Intel® VMD, a new hardware architecture on Intel® Xeon® Scalable Processors. Intel® VMD enhances the 48 preexisting PCIe* lanes for dependable NVMe connections. Intel® VROC capitalizes on Intel® VMD for a simpler RAID solution that requires no additional hardware. It provides compelling RAID performance that unleashes the full potential of NVMe drives.

Key Benefits:
Fewer hardware queues
Bootable RAID NVM
Host Swap or Insert/Surprise removal
Closes select RAID write hole
Device Management

Once the tuning testing is complete, configure the SSDs based on a common or best fit configuration. When attempting to tune a RAID array, we want to keep this information in mind for the stripe size. Please refer to the RAID article for background and VROC info for Intel Hardware configuration. When selecting a stripe size typical options are 16, 32, 64 and 128 KB; however, these will vary based on hardware and software support. Depending on the usage case either latency or throughput, the stripe size will determine the minimum operating system hardware fetch size. It is important to note the hardware fetch size may not be the same as the operating systems's file system fetch size; thus, when optimizing and formatting devices aligning these two sizes are critical. The fetch sizes changes the read and write performance of the overall system.

Additionally, when tuning a setup for high write throughput and/or low latency for a write the RAID spare percentage can increase the overall SSD or RAID array performance characteristics. The performance characteristics are improved by providing the SSD with additional over provisioning of the internal write area. In a typical client SSDs, the over-provisioning can vary be from 0% to 10%; while in data center SSDs can vary from 0% to 200% from the host machine visible storage area. For example, an 100 GB SSD with 15% over-provisioning actually has 115 GB of actual storage media on the device. The reasoning for over provisioning can vary from performance to endurance requirements for a given host storage space. The minimalist reason for over provisioning is when a media block fails the firmware on the SSD will remap the failing block to a good block transparently from the host.

The challenge with hardware or software RAID is anytime a new test is to be performed with new stripe or spare; the entire RAID name space must be erased quickly through trim (data may still be present at host's block location) or secure erase (erases each block with random data to ensure data privacy). If security is enabled then each media block is cryptographically protected with host keys and deletion of host keys will render the data inaccessible; however, cryptographic vulnerabilities or cryptographic cracking software may render the data still accessible so for security it is always recommend to secure erase the device. With software RAID or hardware RAID APIs (Application Programming Interfaces) some of the tasks may be automated and with Intel these are accessible through the VROC Intel Rapid Storage Technology applications or drivers.

Ubuntu

sudo apt-get install mdadm systray-mdstat

BIOS Configuration

Disable the following:

Hyper threading
Intel Turbo Mode
PCIe ASPM (Active State Power Management)
EIST (Enhanced Intel Speed Step Techonology)
C-States
P-States

Hyper-Threading

Enter the System BIOS by either pressing F2 or DEL

Select Advanced Menu

Select Processor Configuration

Set Intel ® Hyper-Threading Tech to Disabled

Power Scheme

Select Power & Performance

Set CPU Power and Performance Policy to Performance

Intel Turbo Mode

Select CPU P State Control

Set Intel Turbo Boost Technology to Disabled

EIST (Enhanced Intel Speed Step Technology)

Set Enhanced Intel SpeedStep Tech to Disabled

C-States

Select CPU C State Control
Set CPU C-State to Disabled
Set C1E Autopromote to Disabled
Set Processor C3 to Disabled
Set Processor C6 to Disabled

Max Payload Size

Select PCI Configuration
Set Memory Mapped I/O Size to 1024G

Ubuntu Performance Configuration

Ubuntu install Commands

$ sudo add-apt-repository ppa:woodrow-shen/ppa

$ sudo apt-get update

$ sudo apt install -y ipmctl libipmctl-dev ledmon ndctl zfs-initramfs zfsutils-linux zfs-initramfs libzfslinux-dev zfs-auto-snapshot libvirt-daemon-driver-storage-zfs python3-pyzfs pyzfs-doc golang-go-zfs-dev libgtk-3-dev golang git nfs-common golang-github-gotk3-gotk3-dev btrfs-progs e2fsprogs f2fs-tools dosfstools hfsutils hfsprogs jfsutils mdadm util-linux cryptsetup dmsetup lvm2 util-linux nilfs-tools nilfs-tools ntfs-3g ntfs-3g reiser4progs reiserfsprogs reiserfsprogs udftools xfsprogs xfsdump gpart gedit samba rsync grsync rar unrar p7zip-full p7zip-rar openconnect libncurses5 libtinfo5 libz1 openvpn vpnc-scripts net-tools network-manager-openvpn network-manager-l2tp-gnome postfix libsasl2-modules ca-certificates mailutils ubuntu-mate-desktop mate-desktop-environment-extras mate-tweak gnome-tweaks wine

$ wget https://www.thawte.com/roots/thawte_Premium_Server_CA.pem

$ sudo mv thawte_Premium_Server_CA.pem /etc/ssl/certs/Thawte_Premium_Server_CA.pem

$ sudo cat /etc/ssl/certs/Thawte_Premium_Server_CA.pem | sudo tee -a /etc/postfix/cacert.pem

$ sudo apt-get upgrade -y

OS Performance Optimizations

# Do not use ext3 for the OS file system, use ext4!

$ sudo preload cpupower-gui indicator-cpufreq sysv-rc-conf numad

# Modify the cpu power to be for performance

$ sudo sysctl vm.swappiness=1

$ swapoff -a

# Wait 20 minutes...

$ swapon -a

$ sudo gedit /etc/sysctl.conf

# Modify or add the value

vm.swappiness=10

vm.vfs_cache_pressure=50

vm.dirty_background_ratio = 5

vm.dirty_background_bytes = 0

vm.dirty_ratio = 10

vm.dirty_bytes = 0

vm.dirty_writeback_centisecs = 500

vm.dirty_expire_centisecs = 12000

$ sudo gedit /etc/fstab

# Add the following:

# Move /tmp to RAM

tmpfs /tmp tmpfs defaults,noexec,nosuid 0 0

$ sudo ufw logging off

$ sudo chmod +x /etc/rc.local

$ sudo gedit /etc/rc.local

# Add this above "exit 0", we want these to reflect the performance of the media read-write Input-Output size.

# Please refer to [24] and domain expert for internal mechanics of the storage device. The 'sda' is the disk location in Linux.

echo 0 > /sys/block/sda/queue/add_random

echo 0 > /sys/block/sda/queue/rotational

echo 2 > /sys/block/sda/queue/rq_affinity

echo 256 > /sys/block/sda/queue/nr_requests

echo 256 > /sys/block/sda/queue/read_ahead_kb

exit 0

$ sudo /etc/pam.d/common-session and /etc/pam.d/common-session-noninteractive

# Add

session required pam_limits.so

# The total size should be a reflection of total system memory with room for the operating system and background tasks.

# These are in KB for the file so ensure you convert properly.

# Note: the user is @perf refers to the performance user group and user_ID is for a specific user. The '-' means soft=hard limit.

# Ensure the system is calculated and configured for total users by ensuring swap space is ready for all users operating at peak usage.

# In our case we have, 24 units x 1 DIMM x 64 GB per DRAM = 1,536 GB DRAM and 6 units x 4 DIMMs x 512 GB per Optane DIMM = 12,288 GB.

# Total runtime memory is 13,824 GB ~ 13.8 TB.

# We will limit the average user to the floor power of two for the DRAM size so:

# T = 1,536 GB ~= (1,536 * 1,048,576 KB) ~= 1610612736 KB

# log(1610612736)/log(2) ~= 30.5

# floor(30.5) ~= 30, (2^30)-1 = 1073741823, we subtract 1 for unsigned Linux kernel representation.

# Note: the Linux kernel variable size is the limit, so for developers check the value in the source code.

# nofile is based on cat /proc/sys/fs/nr_open

user_ID soft nofile 32768

user_ID hard nofile 32768

user_ID soft fsize 1073741823

user_ID hard fsize 1073741823

user_ID soft data 1073741823

user_ID hard data 1073741823

user_ID soft stack 1073741823

user_ID hard stack 1073741823

user_ID soft rss 1073741823

user_ID hard rss 1073741823

user_ID soft core 1073741823

user_ID hard core 1073741823

@perf soft nofile 1048576

@perf hard nofile 1048576

@perf soft fsize unlimited

@perf hard fsize unlimited

@perf soft data unlimited

@perf hard data unlimited

@perf soft stack unlimited

@perf hard stack unlimited

@perf soft rss unlimited

@perf hard rss unlimited

@perf soft core unlimited

@perf hard core unlimited

@perf soft cpu unlimited

@perf hard cpu unlimited

@perf hard as unlimited

@perf soft as unlimited

@perf soft locks unlimited

@perf hard locks unlimited

@perf soft memlock unlimited

@perf hard memlock unlimited

@perf soft memlock unlimited

@perf hard memlock unlimited

$ sudo gedit /etc/security/limits.conf

$ sudo gedit /etc/sysctl.conf

Add following:

fs.file-max = 2097152

Run:

sysctl -p

# Disable IRQ Balancing

$ sudo gedit /etc/default/irqbalance

# ENABLED=0

# Set Linux Startup Parameter in Grub

$ sudo gedit /etc/default/grub

# idle=poll

Building and Loading FIO

$ cd <absolute_path_of_repo_path>/utils/fio

$ make

$ cd <absolute_path_of_repo_path>/lib/spdk

$ gedit <absolute_path_of_repo_path>/lib/spdk/CONFIG

# CONFIG_FIO_PLUGIN?=y

# FIO_SOURCE_DIR?= <absolute_path_of_repo_path>/utils/fio/

# CONFIG_SOFTSIM?=n

./configure --with-fio=<repo_path>/utils/fio

cd <ent_sdd_repo_path>/lib/spdk

$ make -j

$ cd /utils/fio

$ gedit spdk_config.fio

# file content: spdk_config.fio with 4k IO at a queue depth of 64 for 60 seconds

[global]

ioengine=spdk

thread=1

group_reporting=1

direct=1

verify=0

time_based=1

ramp_time=0

runtime=60

iodepth=64

rw=rw

bs=4k

filename= sim=y ns=1

[test]

numjobs=1

$ sudo LD_PRELOAD=/<absolute_path_of_repo_path>/lib/spdk/examples/nvme/fio_plugin/fio_plugin LD_LIBRARY_PATH=/<absolute_path_of_repo_path>/lib/spdk/build/lib ./fio /<absolute_path_of_repo_path>/projects/fio_scripts/spdk_config.fio

WORK in Progress

FIO Visualizer

https://github.com/intel/fiovisualizer

Example Script using Microsofts DeepSpeed (Not tested, use with caution)

#!/bin/bash

# Example, filename tune.sh

# T1000@skynetAI:~$ cd csrc/aio/py_test

# T1000@skynetAI:~$ dd if=/dev/urandom of=input.file count=400 bs=1M

# T1000@skynetAI:~$ mkdir read-logs

# T1000@skynetAI:~$ mkdir write-test-data

# T1000@skynetAI:~$ mkdir write-logs

# T1000@skynetAI:~$ ./tune.sh input.file read-logs 400 write-test-data write-logs

# T1000@skynetAI:~$ python parse_aio_stats.py --logdir read-logs/aio_perf_sweep --metric read_speed | sort -k9 -n | tail -1

# The read report best result: ('read', 'block', 'overlap', 1, 1, 32, 262144) = 3.168102406435208

# T1000@skynetAI:~$ python parse_aio_stats.py --logdir write-logs/aio_perf_sweep --metric write_speed | sort -k10 -n | tail -1

# The write report best result: ('write', '400MB', 'block', 'overlap', 8, 1, 32, 262144) = 2.5923189261116324

function prep_folder()

{

folder=$1

if [[ -d ${folder} ]]; then

rm -f ${folder}/*

else

mkdir -p ${folder}

}

function validate_enviroment()

{

validate_cmd="python ./validate_async_io.py"

eval ${validate_cmd}

res=$?

if [[ $res != 0 ]]; then

echo "Failing because environment is not properly configured"

echo "Possible fix: sudo apt-get install libaio-dev"

exit 1

}

function getLog()

{

# Example

# getLog read

# echo $LOG

# NEW_LOG_VAR=$(getLog)

# echo $NEW_LOG_VAR

if [[ $1 == "read" ]]; then

local tLOG="${LOG_DIR}/read_${sub}_${ov}_t${t}_p${p}_d${d}_bs${bs}.txt"

else

local tLOG="${LOG_DIR}/write_${sub}_${ov}_t${t}_p${p}_d${d}_bs${bs}.txt"

eval $LOG="'$tLOG'"

}

function getCmd()

{

# Example

# getCmd read

# echo $CMD

# NEW_CMD_VAR=$(getCmd)

# echo $NEW_CMD_VAR

if [[ $1 == "read" ]]; then

local tcmd="${RUN_SCRIPT} ${READ_OPT} ${OPTS} ${SCHED_OPTS} &> ${LOG}"

else

local tcmd="${RUN_SCRIPT} ${WRITE_OPT} ${OPTS} ${SCHED_OPTS} &> ${LOG}"

eval $cmd="'$tcmd'"

}

function performEval()

{

RUN_TYPE = $1

for sub in single block; do

if [[ $sub == "single" ]]; then

sub_opt="--single_submit"

else

sub_opt=""

for ov in overlap sequential; do

if [[ $ov == "overlap" ]]; then

ov_opt="--overlap_events"

else

ov_opt=""

for t in 1 2 4 8; do

for p in 1 ; do

for d in 1 2 4 8 16 32 64; do

for bs in 512 4K 16K 32K 64K 128K 256K 512K 1M; do

SCHED_OPTS="${sub_opt} ${ov_opt} --handle --threads ${t}"

OPTS="--io_parallel ${p} --queue_depth ${d} --block_size ${bs}"

LOG=getLog $RUN_TYPE

CMD_PARAM=getCmd $RUN_TYPE

echo ${DISABLE_CACHE}

echo ${cmd}

echo ${SYNC}

eval ${DISABLE_CACHE}

eval ${cmd}

eval ${SYNC}

sleep 2

done

}

validate_enviroment

INPUT_FILE=$1

if [[ ! -f ${INPUT_FILE} ]]; then

echo "Input file not found: ${INPUT_FILE}"

exit 1

if [[ $# -ne 5 ]]; then

echo "Usage: $0 <input file> <output log dir> <write size in MB> <write dir> <output log dir>"

exit 1

# Read

RLOG_DIR=$2/aio_perf_sweep

RUN_SCRIPT=./test_ds_aio.py

READ_OPT="--read_file ${INPUT_FILE}"

prep_folder ${READ_DIR}

prep_folder ${RLOG_DIR}

# Write

SIZE="$3M"

WRITE_DIR=$4

OUTPUT_FILE=${WRITE_DIR}/ds_aio_write_${SIZE}B.pt

WRITE_OPT="--write_file ${OUTPUT_FILE} --write_size ${SIZE}"

WLOG_DIR=$3/aio_perf_sweep

prep_folder ${WRITE_DIR}

prep_folder ${WLOG_DIR}

DISABLE_CACHE="sync; sudo bash -c 'echo 1 > /proc/sys/vm/drop_caches' "

SYNC="sync"

$R_RESULT = performEval read

$W_RESULT = performEval write

References:

https://www.intel.com/content/www/us/en/software/virtual-raid-on-cpu-vroc.html
https://github.com/microsoft/DeepSpeed/
http://bluestop.org/fio/
Microsoft Performance Tuning for Server Hardware
Microsoft Performance Tuning for Subsystems
Microsoft Performance Tuning for Workloads
Red Hat Enterprise Linux 7 Performance Tuning Guide
Red Hat Enterprise Linux 6 Performance Tuning Guide

Page updated

Google Sites

Report abuse