This, along with RAID-1, is probably sufficient to catch the majority of errors. But realize that these are just probabilities - if the failure can happen on the first drive, it can also happen on the second. A merkle tree is commonly used to also protect against these scenarios.
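A minimal sketch of the idea, with a toy FNV-1a hash standing in for a real cryptographic hash and made-up block counts and sizes:

```c
#include <stdint.h>
#include <stdio.h>

/* Toy 64-bit FNV-1a hash -- a real deployment would use a cryptographic hash. */
static uint64_t fnv1a(const void *data, size_t len) {
    const uint8_t *p = (const uint8_t *)data;
    uint64_t h = 0xcbf29ce484222325ULL;        /* FNV-1a offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;                 /* FNV-1a prime */
    }
    return h;
}

#define NBLOCKS  8        /* power of two keeps the tree-building loop simple */
#define BLOCK_SZ 4096

int main(void) {
    static uint8_t blocks[NBLOCKS][BLOCK_SZ];  /* pretend these came off the mirror */
    uint64_t level[NBLOCKS];

    /* Leaf level: one hash per block. */
    for (size_t i = 0; i < NBLOCKS; i++)
        level[i] = fnv1a(blocks[i], BLOCK_SZ);

    /* Hash pairs of children up to a single root. */
    for (size_t n = NBLOCKS; n > 1; n /= 2)
        for (size_t i = 0; i < n / 2; i++)
            level[i] = fnv1a(&level[2 * i], 2 * sizeof(uint64_t));

    printf("merkle root: %016llx\n", (unsigned long long)level[0]);

    /* Store the root (and ideally the interior nodes) out of band; on read,
     * recompute the path for a block and compare, so corruption is detected
     * regardless of which mirror copy the array happened to return. */
    return 0;
}
```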
Notice that using something like RAID-5 can result in data corruption migrating throughout the stripe when certain write algorithms (e.g. read-modify-write parity updates) are used.
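A toy sketch of how that migration can happen with plain XOR parity and a read-modify-write update (chunk size and values are made up; this is nothing like real array code):

```c
#include <stdint.h>
#include <stdio.h>

#define CHUNK 4   /* absurdly small chunk size, just for illustration */

static void xor3(uint8_t *out, const uint8_t *a, const uint8_t *b, const uint8_t *c) {
    for (int i = 0; i < CHUNK; i++) out[i] = a[i] ^ b[i] ^ c[i];
}

int main(void) {
    /* Stripe of three data chunks plus parity. */
    uint8_t d0[CHUNK] = {0x11, 0x11, 0x11, 0x11};
    uint8_t d1[CHUNK] = {0x22, 0x22, 0x22, 0x22};
    uint8_t d2[CHUNK] = {0x33, 0x33, 0x33, 0x33};
    uint8_t p[CHUNK];
    xor3(p, d0, d1, d2);                      /* correct parity at write time */

    /* The media silently flips a bit in d0. */
    uint8_t d0_on_disk[CHUNK] = {0x11, 0x11, 0x11, 0x10};

    /* Read-modify-write of d0: new parity = old parity ^ old d0 ^ new d0.
     * The corrupted on-disk d0 is trusted, so the bad bit is folded into parity. */
    uint8_t d0_new[CHUNK] = {0xAA, 0xAA, 0xAA, 0xAA};
    uint8_t p_new[CHUNK];
    xor3(p_new, p, d0_on_disk, d0_new);

    /* Later the drive holding d2 dies, and d2 is rebuilt from the survivors. */
    uint8_t d2_rebuilt[CHUNK];
    xor3(d2_rebuilt, d0_new, d1, p_new);

    printf("d2 original: %02x %02x %02x %02x\n", d2[0], d2[1], d2[2], d2[3]);
    printf("d2 rebuilt : %02x %02x %02x %02x   <- corruption that started on d0\n",
           d2_rebuilt[0], d2_rebuilt[1], d2_rebuilt[2], d2_rebuilt[3]);
    return 0;
}
```

The corruption that originally hit d0 ends up in the rebuilt d2, while d0 itself now reads back fine.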
The paranoid would also follow the write with a read command with the SCSI FUA (forced unit access) bit set, requiring the disk to read from the physical media and confirming the data is really written to that rotating rust. Doing something similar with SATA or NVMe drives might be more complicated, or maybe impossible. That’s the method to ensure your data is actually written to viable media and can subsequently be read.
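On Linux you can issue that FUA read yourself through the SG_IO pass-through. A rough sketch for a SCSI/SAS disk follows; the device path, LBA, and 512-byte block size are placeholders, sense handling is minimal, and it typically needs elevated privileges:

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>
#include <unistd.h>

#define BLOCK_SIZE 512   /* assumption: 512-byte logical blocks */

int main(int argc, char **argv)
{
    const char *dev = argc > 1 ? argv[1] : "/dev/sdX";   /* placeholder device */
    uint64_t lba = 1234;                                 /* block you just wrote */
    unsigned char data[BLOCK_SIZE];
    unsigned char sense[32];
    unsigned char cdb[16] = {0};

    int fd = open(dev, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    cdb[0] = 0x88;          /* READ(16) */
    cdb[1] = 0x08;          /* FUA: read from the media, not the drive cache */
    for (int i = 0; i < 8; i++)               /* LBA, big-endian, bytes 2..9 */
        cdb[2 + i] = (unsigned char)(lba >> (8 * (7 - i)));
    cdb[13] = 1;            /* transfer length: one block (bytes 10..13) */

    struct sg_io_hdr io;
    memset(&io, 0, sizeof(io));
    io.interface_id    = 'S';
    io.dxfer_direction = SG_DXFER_FROM_DEV;
    io.cmd_len         = sizeof(cdb);
    io.cmdp            = cdb;
    io.dxfer_len       = sizeof(data);
    io.dxferp          = data;
    io.mx_sb_len       = sizeof(sense);
    io.sbp             = sense;
    io.timeout         = 20000;               /* milliseconds */

    if (ioctl(fd, SG_IO, &io) < 0) { perror("SG_IO"); return 1; }
    if (io.status != 0) {
        fprintf(stderr, "SCSI status 0x%x (check sense data)\n", io.status);
        return 1;
    }

    /* Compare `data` against the buffer you just wrote to confirm the bits
     * really made it to the platter and can be read back. */
    close(fd);
    return 0;
}
```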
I’ve seen disks do off-track writes, dropped writes due to write-channel failures, and dropped writes due to the media having literally been scrubbed off the platter previously. You need an LBA-seeded CRC, along with a number of other checks, to catch these failures. I get excited when people write about this in the industry. They’re extremely interesting failure modes that I’ve been lucky enough to have been exposed to, at volume, for a large fraction of my career.
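A minimal sketch of what the LBA seeding buys you, using zlib's crc32 and made-up block size and LBAs (build with -lz). Because the stored checksum covers the address the block was meant for, data that landed on the wrong track fails verification even though the bytes themselves are internally consistent:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <zlib.h>

#define BLOCK_SIZE 4096

/* Fold the target LBA into the CRC before covering the payload. */
static uint32_t lba_seeded_crc(uint64_t lba, const unsigned char *data)
{
    unsigned char lba_be[8];
    for (int i = 0; i < 8; i++)
        lba_be[i] = (unsigned char)(lba >> (8 * (7 - i)));

    uint32_t crc = crc32(0L, Z_NULL, 0);
    crc = crc32(crc, lba_be, sizeof(lba_be));
    crc = crc32(crc, data, BLOCK_SIZE);
    return crc;
}

int main(void)
{
    unsigned char block[BLOCK_SIZE];
    memset(block, 0x5a, sizeof(block));

    uint64_t intended_lba = 1000;
    uint32_t stored_crc = lba_seeded_crc(intended_lba, block);  /* persisted next to the block */

    /* Later: the same bytes come back, but from LBA 2000 -- an off-track or
     * misdirected write put them in the wrong place. Verification is done
     * against the LBA we asked to read. */
    uint64_t actual_lba = 2000;
    uint32_t check_crc = lba_seeded_crc(actual_lba, block);

    printf("stored  crc: %08x\n", stored_crc);
    printf("checked crc: %08x  -> %s\n", check_crc,
           check_crc == stored_crc ? "ok" : "MISMATCH: block was written for a different LBA");
    return 0;
}
```

Dropped writes (the old block still sitting at the right LBA with its old, valid checksum) need additional machinery such as version or sequence tags, which is part of the "number of other checks" above.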
I thought an fsync on the containing directories of each of the logs was needed to ensure that newly created files were durably present in the directories.
Right, you do need to fsync when creating new files to ensure the directory entry is durable. However, WAL files are typically created once and then appended to for their lifetime, so the directory fsync is only needed at file creation time, not during normal operations.
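For reference, the usual create-time sequence looks something like this sketch (the paths and header bytes are made up):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char *dir_path  = "/var/lib/mydb/wal";             /* illustrative paths */
    const char *file_path = "/var/lib/mydb/wal/000042.log";

    int fd = open(file_path, O_CREAT | O_WRONLY | O_EXCL, 0644);
    if (fd < 0) { perror("open wal segment"); return 1; }

    const char header[] = "wal-segment-header";
    if (write(fd, header, sizeof(header)) < 0) { perror("write"); return 1; }
    if (fsync(fd) < 0) { perror("fsync file"); return 1; }    /* file contents durable */
    close(fd);

    int dfd = open(dir_path, O_RDONLY | O_DIRECTORY);
    if (dfd < 0) { perror("open dir"); return 1; }
    if (fsync(dfd) < 0) { perror("fsync dir"); return 1; }    /* directory entry durable */
    close(dfd);

    /* Subsequent appends to the already-created segment only need fsync (or
     * fdatasync) on the file descriptor itself. */
    return 0;
}
```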
https://en.wikipedia.org/wiki/Data_Integrity_Field
> Conclusion
> A production-grade WAL isn't just code, it's a contract.
I hate that I'm now suspicious of this formulation.
In what sense? The phrasing is just a generalization; production-grade anything needs consideration of the needs and goals of the project.
“<x> isn’t just <y>, it’s <z>” is an AI smell.