|
- <?xml version="1.0" encoding="UTF-8" standalone="no"?>
- <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>29.1. Reliability</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets V1.79.1" /><link rel="prev" href="wal.html" title="Chapter 29. Reliability and the Write-Ahead Log" /><link rel="next" href="wal-intro.html" title="29.2. Write-Ahead Logging (WAL)" /></head><body><div xmlns="http://www.w3.org/TR/xhtml1/transitional" class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">29.1. Reliability</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="wal.html" title="Chapter 29. Reliability and the Write-Ahead Log">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="wal.html" title="Chapter 29. Reliability and the Write-Ahead Log">Up</a></td><th width="60%" align="center">Chapter 29. Reliability and the Write-Ahead Log</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 12.4 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="wal-intro.html" title="29.2. Write-Ahead Logging (WAL)">Next</a></td></tr></table><hr></hr></div><div class="sect1" id="WAL-RELIABILITY"><div class="titlepage"><div><div><h2 class="title" style="clear: both">29.1. Reliability</h2></div></div></div><p>
- Reliability is an important property of any serious database
- system, and <span class="productname">PostgreSQL</span> does everything possible to
- guarantee reliable operation. One aspect of reliable operation is
- that all data recorded by a committed transaction should be stored
- in a nonvolatile area that is safe from power loss, operating
- system failure, and hardware failure (except failure of the
- nonvolatile area itself, of course). Successfully writing the data
- to the computer's permanent storage (disk drive or equivalent)
- ordinarily meets this requirement. In fact, even if a computer is
- fatally damaged, if the disk drives survive they can be moved to
- another computer with similar hardware and all committed
- transactions will remain intact.
- </p><p>
- While forcing data to the disk platters periodically might seem like
- a simple operation, it is not. Because disk drives are dramatically
- slower than main memory and CPUs, several layers of caching exist
- between the computer's main memory and the disk platters.
- First, there is the operating system's buffer cache, which caches
- frequently requested disk blocks and combines disk writes. Fortunately,
- all operating systems give applications a way to force writes from
- the buffer cache to disk, and <span class="productname">PostgreSQL</span> uses those
- features. (See the <a class="xref" href="runtime-config-wal.html#GUC-WAL-SYNC-METHOD">wal_sync_method</a> parameter
- to adjust how this is done.)
- </p><p>
- Next, there might be a cache in the disk drive controller; this is
- particularly common on <acronym class="acronym">RAID</acronym> controller cards. Some of
- these caches are <em class="firstterm">write-through</em>, meaning writes are sent
- to the drive as soon as they arrive. Others are
- <em class="firstterm">write-back</em>, meaning data is sent to the drive at
- some later time. Such caches can be a reliability hazard because the
- memory in the disk controller cache is volatile, and will lose its
- contents in a power failure. Better controller cards have
- <em class="firstterm">battery-backup units</em> (<acronym class="acronym">BBU</acronym>s), meaning
- the card has a battery that
- maintains power to the cache in case of system power loss. After power
- is restored the data will be written to the disk drives.
- </p><p>
- And finally, most disk drives have caches. Some are write-through
- while some are write-back, and the same concerns about data loss
- exist for write-back drive caches as for disk controller
- caches. Consumer-grade IDE and SATA drives are particularly likely
- to have write-back caches that will not survive a power failure. Many
- solid-state drives (SSD) also have volatile write-back caches.
- </p><p>
- These caches can typically be disabled; however, the method for doing
- this varies by operating system and drive type:
- </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
- On <span class="productname">Linux</span>, IDE and SATA drives can be queried using
- <code class="command">hdparm -I</code>; write caching is enabled if there is
- a <code class="literal">*</code> next to <code class="literal">Write cache</code>. <code class="command">hdparm -W 0</code>
- can be used to turn off write caching. SCSI drives can be queried
- using <a class="ulink" href="http://sg.danny.cz/sg/sdparm.html" target="_top"><span class="application">sdparm</span></a>.
- Use <code class="command">sdparm --get=WCE</code> to check
- whether the write cache is enabled and <code class="command">sdparm --clear=WCE</code>
- to disable it.
- </p></li><li class="listitem"><p>
- On <span class="productname">FreeBSD</span>, IDE drives can be queried using
- <code class="command">atacontrol</code> and write caching turned off using
- <code class="literal">hw.ata.wc=0</code> in <code class="filename">/boot/loader.conf</code>;
- SCSI drives can be queried using <code class="command">camcontrol identify</code>,
- and the write cache both queried and changed using
- <code class="command">sdparm</code> when available.
- </p></li><li class="listitem"><p>
- On <span class="productname">Solaris</span>, the disk write cache is controlled by
- <code class="command">format -e</code>.
- (The Solaris <acronym class="acronym">ZFS</acronym> file system is safe with disk write-cache
- enabled because it issues its own disk cache flush commands.)
- </p></li><li class="listitem"><p>
- On <span class="productname">Windows</span>, if <code class="varname">wal_sync_method</code> is
- <code class="literal">open_datasync</code> (the default), write caching can be disabled
- by unchecking <code class="literal">My Computer\Open\<em class="replaceable"><code>disk drive</code></em>\Properties\Hardware\Properties\Policies\Enable write caching on the disk</code>.
- Alternatively, set <code class="varname">wal_sync_method</code> to
- <code class="literal">fsync</code> or <code class="literal">fsync_writethrough</code>, which prevent
- write caching.
- </p></li><li class="listitem"><p>
- On <span class="productname">macOS</span>, write caching can be prevented by
- setting <code class="varname">wal_sync_method</code> to <code class="literal">fsync_writethrough</code>.
- </p></li></ul></div><p>
- Recent SATA drives (those following <acronym class="acronym">ATAPI-6</acronym> or later)
- offer a drive cache flush command (<code class="command">FLUSH CACHE EXT</code>),
- while SCSI drives have long supported a similar command
- <code class="command">SYNCHRONIZE CACHE</code>. These commands are not directly
- accessible to <span class="productname">PostgreSQL</span>, but some file systems
- (e.g., <acronym class="acronym">ZFS</acronym>, <acronym class="acronym">ext4</acronym>) can use them to flush
- data to the platters on write-back-enabled drives. Unfortunately, such
- file systems behave suboptimally when combined with battery-backup unit
- (<acronym class="acronym">BBU</acronym>) disk controllers. In such setups, the synchronize
- command forces all data from the controller cache to the disks,
- eliminating much of the benefit of the BBU. You can run the
- <a class="xref" href="pgtestfsync.html" title="pg_test_fsync"><span class="refentrytitle"><span class="application">pg_test_fsync</span></span></a> program to see
- if you are affected. If you are affected, the performance benefits
- of the BBU can be regained by turning off write barriers in
- the file system or reconfiguring the disk controller, if that is
- an option. If write barriers are turned off, make sure the battery
- remains functional; a faulty battery can potentially lead to data loss.
- Hopefully file system and disk controller designers will eventually
- address this suboptimal behavior.
- </p><p>
- When the operating system sends a write request to the storage hardware,
- there is little it can do to make sure the data has arrived at a truly
- non-volatile storage area. Rather, it is the
- administrator's responsibility to make certain that all storage components
- ensure integrity for both data and file-system metadata.
- Avoid disk controllers that have non-battery-backed write caches.
- At the drive level, disable write-back caching if the
- drive cannot guarantee the data will be written before shutdown.
- If you use SSDs, be aware that many of these do not honor cache flush
- commands by default.
- You can test for reliable I/O subsystem behavior using <a class="ulink" href="https://brad.livejournal.com/2116715.html" target="_top"><code class="filename">diskchecker.pl</code></a>.
- </p><p>
- Another risk of data loss is posed by the disk platter write
- operations themselves. Disk platters are divided into sectors,
- commonly 512 bytes each. Every physical read or write operation
- processes a whole sector.
- When a write request arrives at the drive, it might be for some multiple
- of 512 bytes (<span class="productname">PostgreSQL</span> typically writes 8192 bytes, or
- 16 sectors, at a time), and the process of writing could fail due
- to power loss at any time, meaning some of the 512-byte sectors were
- written while others were not. To guard against such failures,
- <span class="productname">PostgreSQL</span> periodically writes full page images to
- permanent WAL storage <span class="emphasis"><em>before</em></span> modifying the actual page on
- disk. By doing this, during crash recovery <span class="productname">PostgreSQL</span> can
- restore partially-written pages from WAL. If you have file-system software
- that prevents partial page writes (e.g., ZFS), you can turn off
- this page imaging by turning off the <a class="xref" href="runtime-config-wal.html#GUC-FULL-PAGE-WRITES">full_page_writes</a> parameter. Battery-Backed Unit
- (BBU) disk controllers do not prevent partial page writes unless
- they guarantee that data is written to the BBU as full (8kB) pages.
- </p><p>
- <span class="productname">PostgreSQL</span> also protects against some kinds of data corruption
- on storage devices that may occur because of hardware errors or media failure over time,
- such as reading/writing garbage data.
- </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
- Each individual record in a WAL file is protected by a CRC-32 (32-bit) check
- that allows us to tell if record contents are correct. The CRC value
- is set when we write each WAL record and checked during crash recovery,
- archive recovery and replication.
- </p></li><li class="listitem"><p>
- Data pages are not currently checksummed by default, though full page images
- recorded in WAL records will be protected; see <a class="link" href="app-initdb.html#APP-INITDB-DATA-CHECKSUMS"><span class="application">initdb</span></a>
- for details about enabling data page checksums.
- </p></li><li class="listitem"><p>
- Internal data structures such as <code class="filename">pg_xact</code>, <code class="filename">pg_subtrans</code>, <code class="filename">pg_multixact</code>,
- <code class="filename">pg_serial</code>, <code class="filename">pg_notify</code>, <code class="filename">pg_stat</code>, <code class="filename">pg_snapshots</code> are not directly
- checksummed, nor are pages protected by full page writes. However, where
- such data structures are persistent, WAL records are written that allow
- recent changes to be accurately rebuilt at crash recovery and those
- WAL records are protected as discussed above.
- </p></li><li class="listitem"><p>
- Individual state files in <code class="filename">pg_twophase</code> are protected by CRC-32.
- </p></li><li class="listitem"><p>
- Temporary data files used in larger SQL queries for sorts,
- materializations and intermediate results are not currently checksummed,
- nor will WAL records be written for changes to those files.
- </p></li></ul></div><p>
- </p><p>
- <span class="productname">PostgreSQL</span> does not protect against correctable memory errors
- and it is assumed you will operate using RAM that uses industry standard
- Error Correcting Codes (ECC) or better protection.
- </p></div><div class="navfooter"><hr /><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="wal.html">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="wal.html">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="wal-intro.html">Next</a></td></tr><tr><td width="40%" align="left" valign="top">Chapter 29. Reliability and the Write-Ahead Log </td><td width="20%" align="center"><a accesskey="h" href="index.html">Home</a></td><td width="40%" align="right" valign="top"> 29.2. Write-Ahead Logging (<acronym class="acronym">WAL</acronym>)</td></tr></table></div></body></html>
|