Storage Myths: Put Oracle Redo on SSD
Storage for DBAs: “My database is slow”… “Well then why not put your redo logs on SSDs?” Gaaaah. I still hear people having this discussion and it drives me mad. “Nobody got fired for putting Oracle redo on …<flash vendor>”. Yeah right, but does that mean it was worth the investment?
I’m bored of this line of illogical “reasoning”, so here are three reasons why you shouldn’t put your redo logs on SSD.
1. Solve The Right Problem
If a database is slow, find out why. Investigate, troubleshoot, resolve. Don’t throw hardware at it without understanding what the problem is. Redo is written by the Oracle Log Writer process – and the wait event log file parallel write covers the writing of redo records from the log buffer into the online redo log files. If you are seeing high average wait times for log file parallel write (or occasional high wait times in the Wait Event Histogram) maybe it’s time to investigate the speed of redo I/Os. Otherwise … leave it alone, or you are fixing the wrong issue.
Also, let’s not confuse the wait event log file sync with log file parallel write. Log file sync is experienced by foreground processes waiting on the log writer to complete a flush of the log buffer to storage. It’s tempting to assume high log file sync times are therefore a consequence of slow log writes, but as Kevin Closson points out in this must-read article, most log file sync waits are actually processing issues where the log writer is not getting enough CPU time.
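To make the distinction concrete, here is a rough sketch of the triage I have in mind. It assumes you have pulled the cumulative TOTAL_WAITS and TIME_WAITED_MICRO counters for both events from V$SYSTEM_EVENT. The sample counters, the function names and the 5 ms threshold are all made up for illustration, so treat it as a thinking aid rather than a diagnostic tool.

```python
# Sketch: compare average latencies of the two redo-related wait events.
# Inputs are the cumulative TOTAL_WAITS and TIME_WAITED_MICRO counters that
# Oracle exposes per event in V$SYSTEM_EVENT. The sample figures are invented.

def avg_wait_ms(total_waits, time_waited_micro):
    """Average wait in milliseconds from cumulative counters."""
    if total_waits == 0:
        return 0.0
    return time_waited_micro / total_waits / 1000.0


def diagnose(lfpw_ms, lfs_ms, slow_write_ms=5.0):
    """Very rough triage. slow_write_ms is an arbitrary illustrative threshold."""
    if lfpw_ms >= slow_write_ms:
        return "redo writes look slow: investigate redo I/O"
    if lfs_ms >= slow_write_ms:
        return "writes are fast but commits wait: look at LGWR CPU scheduling"
    return "redo is not the problem: look elsewhere"


# Hypothetical counters, not taken from any real system.
lfpw_ms = avg_wait_ms(total_waits=2_000_000, time_waited_micro=1_200_000_000)
lfs_ms = avg_wait_ms(total_waits=1_500_000, time_waited_micro=13_500_000_000)

print(f"log file parallel write: {lfpw_ms:.2f} ms")
print(f"log file sync:           {lfs_ms:.2f} ms")
print(diagnose(lfpw_ms, lfs_ms))
```

Because the counters are cumulative since instance startup, in practice you would want the difference between two samples taken around the slow period, which is essentially what an AWR report gives you.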
2. SSD Write Performance Sucks
Huh? You thought I was pro-SSD right? Ok so I’m being a bit crafty, because the terms SSD and Flash are not really synonymous. SSD stands for Solid State Disk (or Device depending on who you ask), which generally means a set of flash chips crafted into the shape of a hard disk drive and plugged into a HDD-shaped hole somewhere via the use of a Flash Memory Controller. This interface takes page-based flash memory and makes it look like block-based storage – and each SSD in an array has its own controller.
There is a fundamental difference between an all-flash array and a set of SSDs masquerading as disks: an all-flash array can manage the flash holistically while the SSD-populated array cannot. This matters because flash is awkward to work with. For example, flash pages must be erased before they can be written to again, and erases happen a whole block of pages at a time. The erase is both slow and cumbersome, since other pages are locked (even from reads) while it runs.
The all-flash array is able to avoid the consequences of these restrictions by managing the flash globally, so that erases do not block reads and writes. In contrast, SSDs shoved into a disk array cannot communicate with each other to indicate when they are busy performing this garbage collection process, resulting in unpredictable performance and horrible spikes in latency as I/Os queue up behind the erase process.
3. Disk Is Good Enough
You didn’t expect me to say that, did you? Don’t get me wrong, disk is terrible at random I/O. Really, truly awful. But here’s the thing: the Oracle log writer performs large, sequential writes. And disk is ok with sequential I/O, particularly if you are using faster spindles like the 15k RPM drives.
Flushing the log buffer to storage involves writing some multiple of the redo log block size (512 bytes by default, but configurable to 1024 or 4096 bytes from Oracle version 11.2). If your system is busy enough that you believe you have redo performance issues, it seems likely that those writes will be larger as more redo is created per log flush. The larger the write, the more efficient it will be on disk, because the fixed cost of the initial seek is spread over more data.
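Here is a back-of-the-envelope sketch of that effect. The seek time and transfer rate are numbers I have assumed for a generic 15k RPM drive, not figures from any datasheet. The only hard fact is the rotational latency, which follows from the spindle speed.

```python
# Sketch: how much of a disk write is spent transferring data versus positioning.
# The seek time and transfer rate are illustrative assumptions, not vendor specs.

AVG_SEEK_MS = 3.5            # assumed average seek time for a 15k RPM drive
RPM = 15_000
TRANSFER_MB_PER_S = 150.0    # assumed sustained sequential transfer rate

# Average rotational latency is half a revolution: 60000 ms / 15000 / 2 = 2 ms.
ROTATIONAL_MS = 60_000 / RPM / 2


def write_time_ms(size_bytes, positioned=False):
    """Time for one write; positioned=True means the head is already in place."""
    transfer_ms = size_bytes / (TRANSFER_MB_PER_S * 1_000_000) * 1000
    overhead_ms = 0.0 if positioned else AVG_SEEK_MS + ROTATIONAL_MS
    return overhead_ms + transfer_ms


def efficiency(size_bytes):
    """Fraction of the total write time spent actually moving data."""
    return write_time_ms(size_bytes, positioned=True) / write_time_ms(size_bytes)


for blocks in (1, 16, 256, 2048):          # multiples of a 512-byte redo block
    size = blocks * 512
    print(f"{size:>8} bytes: {write_time_ms(size):6.2f} ms, "
          f"{efficiency(size):6.1%} spent transferring")
```

This models the worst case where every write pays a full seek. A dedicated redo volume that is written sequentially should do better, since the head is usually already close to where it needs to be.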
But hey, don’t take my word for it. Trust the evidence – and it turns out there is a wealth of data out there for anyone to analyse… right here: http://www.tpc.org/
The thing about TPC-C benchmarks is that they generate redo logs like you wouldn’t believe. So if anyone needs the ultimate redo performance it’s a system like this one, which set a world record back in September 2012 (and which Oracle crowed about in its usual classy way by using it to bash IBM). The great thing about TPC results is that they come with a full disclosure report so you can see just how the vendors did it. And in the full disclosure report for this submission, where was the redo located? On a RAID set consisting of 600GB 15K RPM disk drives (see page 21). If disk is fast enough for a world record, it’s fast enough for you.
Incidentally, the datafiles in that benchmark were located on 2x Violin Memory 6616 arrays – which also tells you something important: if you are migrating from disk to flash, the first thing you need to move is the primary data, not the redo.
The Counter-Argument: Flash is Not SSD
Now I don’t want to wrap this article up giving you the impression that you shouldn’t move your redo logs to flash memory, so I’ll leave you with some counter-arguments to the above. When I build a database, I always put the redo logs on flash (not on SSD mind, but on a flash memory array). Here’s why:
1. Violin Isn’t Limited By Writes
I know, I know … that sounds like a sales pitch. I usually try to talk about flash in general, which is why I originally wrote “All-flash arrays aren’t limited by writes”, but the truth is I don’t know other all-flash arrays to the extent that I know Violin… so forgive me for sticking with what I know.
I’ll explain Violin’s methods for guaranteeing sustained ultra-low write latency some other day. For now, let’s just see the evidence:
Load Profile              Per Second    Per Transaction
~~~~~~~~~~~~         ---------------    ---------------
      DB Time(s):              197.6                2.8
       DB CPU(s):               18.8                0.3
       Redo size:    1,477,126,876.3       20,568,059.6
   Logical reads:          896,951.0           12,489.5
   Block changes:          672,039.3            9,357.7
  Physical reads:           15,529.0              216.2
 Physical writes:          166,099.8            2,312.8
That’s over 1.4GB/sec of sustained redo generation (see this post for details) using just a single Violin Memory 6616 array connected over 8Gb fibre channel. The AWR snapshot was 5 minutes long, but the workload had been running for an extended period prior to the capture. Don’t leave here with the illusion that redo on flash memory isn’t blindingly fast.
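If you want to check my arithmetic, the sketch below works through the Load Profile figures quoted above. The only inputs are the two "Redo size" values and the 5 minute snapshot length. The variable names are mine, and the transaction rate is inferred from those two values rather than read from the report.

```python
# Sketch: sanity-check the redo figures from the Load Profile above.
REDO_BYTES_PER_SEC = 1_477_126_876.3   # "Redo size", per second
REDO_BYTES_PER_TXN = 20_568_059.6      # "Redo size", per transaction
SNAPSHOT_SECONDS = 5 * 60              # the AWR snapshot covered 5 minutes

gb_per_sec = REDO_BYTES_PER_SEC / 1e9        # decimal gigabytes per second
gib_per_sec = REDO_BYTES_PER_SEC / 2**30     # binary gibibytes per second

# Inferred, not reported: redo per second divided by redo per transaction.
txn_per_sec = REDO_BYTES_PER_SEC / REDO_BYTES_PER_TXN

# Total redo written during the snapshot, assuming the rate was steady.
total_gb = REDO_BYTES_PER_SEC * SNAPSHOT_SECONDS / 1e9

print(f"redo rate:       {gb_per_sec:.3f} GB/s ({gib_per_sec:.3f} GiB/s)")
print(f"transactions:    {txn_per_sec:.1f} per second (inferred)")
print(f"redo in 5 mins:  {total_gb:.0f} GB")
```

Note that the "over 1.4GB/sec" claim holds in decimal gigabytes. Measured in binary units the same rate comes out a little under 1.4 GiB/sec.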
2. Your First Design Goal Should Be Simplicity
There is a quote often attributed to Albert Einstein which says, “Everything should be made as simple as possible, but no simpler”. This applies perfectly to system design – and is one reason why I always recommend an all-flash database design over a flash and disk hybrid. Yes it’s possible to put some datafiles here and others over there, redo logs on disk and primary data on flash, etc. But the simplest design is to put everything on high performance, low latency flash. Is it the cheapest solution? Maybe not always on list price, but it probably will be based on TCO.
Look, if you want to put your redo logs on flash, I’m not going to argue. I’m not saying that it’s a bad thing.
What is a bad thing though is the practice of taking a disk-based database and sticking some SSDs in to house the redo logs. That’s just silly. The first part of the database you should move to flash is the primary data. If it makes sense to relocate the whole database (which it almost always does, because that disk array doesn’t belong in your data centre anymore – it belongs in a museum) then go for it. Just don’t compromise on having only the redo logs on flash or SSD, because then you have essentially built yourself an anti-TPC-C benchmarking system! And what’s the opposite of a system that goes really fast…?