Anomalous IPv6 behavior on Linux

Linux has some strange default IPv6 behavior. Here are a few things I noticed…

You can bind to a port on an IPv4 address while all of your tools will report that the port is on IPv6. For example, if your host is 198.51.100.1, you can bind to ::FFFF:198.51.100.1 and all state-checking tools like netstat, ss or lsof will report the listener as being on an IPv6 port. You can do this in netcat with nc -6 -l ::FFFF:198.51.100.1.

Linux appears to have no concept of a default IPv6 ruleset, not even one that simply ACCEPTs everything. Instead, running ip6tables-save simply returns blank output without an error code. In a small deployment you can catch this, but in a mass deployment scenario it becomes quite problematic.

By default, binding to all IPv6 addresses will also bind to all IPv4 addresses. This can be changed system-wide by setting the sysctl net.ipv6.bindv6only to 1. This seems like a very poor default.
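
If you control the program, you do not have to depend on the sysctl. The IPV6_V6ONLY socket option makes the choice per socket. The following is a sketch rather than a complete listener, and the helper names ipv6_socket() and ipv6_socket_is_v6only() are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Creates an IPv6 TCP socket and sets IPV6_V6ONLY to `v6only`
 * (0 or 1). Returns the descriptor, or -1 with errno set.
 */
int
ipv6_socket(int v6only)
{
	int fd, saved;

	assert(v6only == 0 || v6only == 1);
	fd = socket(AF_INET6, SOCK_STREAM, 0);
	if (fd == -1)
		return (-1);
	if (setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &v6only,
	    sizeof(v6only)) == -1) {
		saved = errno;
		close(fd);
		errno = saved;
		return (-1);
	}
	return (fd);
}

/* Reads the option back: 1 if set, 0 if clear, -1 on error. */
int
ipv6_socket_is_v6only(int fd)
{
	int value = 0;
	socklen_t len = sizeof(value);

	if (getsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &value, &len) == -1)
		return (-1);
	return (value != 0);
}
```

Setting the option before bind(2) means the listener behaves the same way regardless of how the administrator has configured the host.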

Capturing Input/Output of Another Process in C

In my travels in C programming, I periodically need to run another process and redirect its standard output back to the first process. While it is straightforward to perform, it is not always obvious. This article will explain the process of how this is done in three sections.

  • High Level Overview
  • Explanation of each line
  • Code Sample

High Level Overview

  • Create three pipe(2)s for standard input, output and error
  • fork(2) the process
  • The child process runs dup2(2) over the pipes to redirect the new process’s standard input, output and error to the pipes.
  • The parent process reads from the pipe(2) descriptors as needed.

Explanation

A pipe(2) is a Unix system call that creates two file descriptors. Data written to the write end of the pipe can be read from the read end. It provides simple FIFO functionality without the need to maintain an associated data structure. The process should initially create three pipe(2) file descriptor pairs for standard input, output and error. For our purposes, they will be used to bridge communication between the parent and the child process.

Next, our program will run a standard Unix fork(2), which creates a copy of the running process, the stack and machine code, except with a different process ID. The return value for the parent is the process ID (pid) of the child, while the child returns 0.
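
To see the two return values in isolation, here is a small sketch with no pipes involved. The function name fork_and_wait() is made up for this illustration.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Forks a child that exits immediately with `code`, then waits for it.
 * Returns the child's exit status as seen by the parent, or -1 on error.
 */
int
fork_and_wait(int code)
{
	pid_t pid;
	int status;

	assert(code >= 0 && code <= 255);
	pid = fork();
	if (pid == -1)
		return (-1);		/* fork(2) failed; no child exists */
	if (pid == 0)
		_exit(code);		/* child: fork(2) returned 0 */

	/* parent: fork(2) returned the child's pid */
	if (waitpid(pid, &status, 0) == -1)
		return (-1);
	return (WIFEXITED(status) ? WEXITSTATUS(status) : -1);
}
```

The waitpid(2) call is worth keeping in real code too, since without it the exited child lingers as a zombie until the parent exits.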

dup2(2)’s documentation says it “duplicates” a file descriptor, but I found this wording misleading. In layman’s terms, dup2(2) causes any reads or writes to newfd to be redirected (pointed) to the oldfd descriptor, after whatever newfd originally referred to is closed. For our uses, the child process will use dup2(2) to redirect its standard input, output and error to the pipe(2) descriptors.
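
The redirection can be demonstrated inside a single process, without fork(2). This sketch points standard output at a pipe, writes to it, restores it and then reads the message back. The name capture_own_stdout() is hypothetical, and I cap the message at 512 bytes so the write cannot block on a full pipe.

```c
#include <assert.h>
#include <unistd.h>

/*
 * Temporarily points STDOUT_FILENO at a pipe, writes `msg` to stdout,
 * restores stdout, and reads the message back from the pipe into `buf`.
 * Returns the number of bytes captured, or -1 on error.
 */
ssize_t
capture_own_stdout(const char *msg, size_t msglen, char *buf, size_t buflen)
{
	int fds[2], saved;
	ssize_t n = -1;

	assert(msg != NULL && buf != NULL && msglen <= 512);
	if (pipe(fds) == -1)
		return (-1);
	saved = dup(STDOUT_FILENO);		/* remember the real stdout */
	if (saved == -1) {
		close(fds[0]);
		close(fds[1]);
		return (-1);
	}

	/* From here on, fd 1 refers to the write end of the pipe. */
	if (dup2(fds[1], STDOUT_FILENO) != -1) {
		if (write(STDOUT_FILENO, msg, msglen) == (ssize_t)msglen)
			n = 0;
		dup2(saved, STDOUT_FILENO);	/* put the real stdout back */
	}
	close(saved);
	close(fds[1]);				/* no write ends remain open */

	if (n == 0)
		n = read(fds[0], buf, buflen);	/* the message is in the pipe */
	close(fds[0]);
	return (n);
}
```

Note that the code writing to STDOUT_FILENO never mentions the pipe. That is the whole trick: the descriptor number stays the same while the thing it points to changes.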

At this point, the child process will run execl(3), which will replace the current process with a new process. This is different than spawning a new process, such as through system(3), though the effect would be the same. Now, because of the dup2(2) calls, any reads or writes to standard input, output or error will be redirected to the respective pipe(2)s.

On the other end, the parent process will use the other end of the pipe(2) to read or write to the child process, thus accomplishing our objective.

Example Code

#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SAMPLE_STRING	"Bismillah"

int main() {
	int fdstdin[2];
	int fdstdout[2];
	int fdstderr[2];
	int pid;

	pipe(fdstdin);
	pipe(fdstdout);
	pipe(fdstderr);

	pid = fork();

	if (pid == 0) { /* Child process */
		int ret;
		/*
		 * Have the 2nd argument (newfd) point to
		 * the first argument (oldfd)
		 */
		dup2(fdstdin[0], STDIN_FILENO);
		dup2(fdstdout[1], STDOUT_FILENO);
		dup2(fdstderr[1], STDERR_FILENO);

		close(fdstdin[1]);
		close(fdstdout[0]);
		close(fdstderr[0]);

		/* This simulates a simple writing to stderr */
		system("printf Hi > /dev/stderr");

		/* Simulates writing from stdin to a test file */
		system("cat > `mktemp`");

		/* Typical method is to run execl(3). Here just using printf */
		ret = execl("/usr/bin/printf", "printf", "Hello World!",
		    (char *)NULL);

		if (ret == -1) {
			/*
			 * execl(3) returns -1 if an error occurs. Any
			 * debugging messages to the console would be
			 * interpreted as output of the process. Therefore,
			 * we will simply exit.
			 * The parent process's read attempts will then see
			 * end-of-file and return 0
			 */
			exit(128);
		}
	}
	else { /* Parent process */
		char buf[1000];

		/* Close the other end of the pipe */
		close(fdstdin[0]);
		close(fdstdout[1]);
		close(fdstderr[1]);

		ssize_t n;

		/* Read from the stderr; read(2) does not NUL-terminate */
		n = read(fdstderr[0], buf, sizeof(buf) - 1);
		buf[n > 0 ? n : 0] = '\0';
		printf("Stderr message from child, simulated by a "
		    "system(): %s\n", buf);

		/* Sending data to the stdin of the child process */
		printf("Sending string '%s' to stdin, written to mktemp file.\n",
		    SAMPLE_STRING);

		write(fdstdin[1], SAMPLE_STRING, strlen(SAMPLE_STRING));
		/* Closing the stdin pipe */
		close(fdstdin[1]);

		/* Read from the stdout */
		n = read(fdstdout[0], buf, sizeof(buf) - 1);
		buf[n > 0 ? n : 0] = '\0';
		printf("Stdout message from child, run with an execl(): %s\n",
		    buf);
	}
}

Compiling and running this code should give you the following output.

$ ./redirect
Stderr message from child, simulated by a system(): Hi
Sending string 'Bismillah' to stdin, written to mktemp file.
Stdout message from child, run with an execl(): Hello World!

I hope this helps someone going forward! Thoughts?

This work is heavily based off of Cameron Zwarich’s excellent 1998 article Pipes in Unix from C-Scene, issue #4. I have it in hard-copy from 2001 and periodically refer back to it.

Avoiding Redundancy with Function Pointers

I am currently writing OpenGit, a BSD-licensed re-implementation of Linus Torvalds’s Git (lower-cased going forward). This frequently involves reviewing git’s source code to understand how it works under the hood. One of the things I consistently encounter is git performing similar and sizable tasks in multiple different ways and places, resulting in redundancy and a higher maintenance cost.

In this brief entry, I will discuss a classic problem and how I solve it: When minor variants of a routine result in multiple implementations.

Example Pseudo-Code Problem

Git makes heavy use of zlib, a library used to compress (deflate) and decompress (inflate) arbitrary data. In the process, git will perform different subroutines, such as calculating an object’s cyclic redundancy check (CRC) or SHA1 value, both on deflated and inflated data, consuming the data in different ways. Rather than having a single inflation function, git re-implements the inflation code in numerous different ways.

To help understand the problem, let’s use this sample pseudo-code for two decompression routines. The primary difference between the two functions is that one executes additional_routine_one() on the decompressed data, while the other executes additional_routine_two() on the compressed data.

void
decompress_routine_one(int fd, char *data, uint8_t *bin)
{
   int size;
   char compressed_data[1000];
   char *decompressed_data;

   do {
      size = read(fd, compressed_data, 1000);

      ... decompress the data ...

      additional_routine_one(decompressed_data, bin);
   } while (size > 0);
}

void
decompress_routine_two(int fd, char *var)
{
   int size;
   char compressed_data[1000];
   char *decompressed_data;

   do {
      size = read(fd, compressed_data, 1000);
      additional_routine_two(compressed_data, var);

      ... decompress the data ...

   } while (size > 0);
}

decompress_routine_one(fd, data, bin);
decompress_routine_two(fd, var);

In this example, both will decompress the data in the do-while loop, but perform different tasks and require different arguments for the routine.

  • The first executes additional_routine_one(), which uses two arguments: the decompressed_data and the bin variable.
  • The second executes additional_routine_two(), which uses the raw compressed data and the variable var.

In other words, not only does the additional task change, the location of the task changes. This could be further complicated by the number of arguments that the additional routine utilizes.

Possible Solution

The best approach I have found is to implement handler function pointers at the appropriate points in the primary routine. The routine should skip a handler that is not needed by checking if the pointer is set to NULL. This is, of course, not my own creation, but based on reviewing Linux and BSD (oh, and Windows) kernel implementations. Consider the following alternative.

typedef void datahandler(char *, int, void *);

void
additional_routine_one(char *data, int size, void *arg)
{
   char *x = arg;
   ... do something ...
}

void
additional_routine_two(char *data, int size, void *arg)
{
   int *x = arg;
   ... do something ...
}

void
decompress_routine(int fd, datahandler *decompressed_handler, void *darg,
    datahandler *compressed_handler, void *iarg)
{
   int size, decompressed_size;
   char buf[1000];
   char *decompressed_data;

   do {
      size = read(fd, buf, 1000);
      if (compressed_handler)
         compressed_handler(buf, size, iarg);

      ... decompress buf into decompressed_data / decompressed_size ...

      if (decompressed_handler)
         decompressed_handler(decompressed_data, decompressed_size, darg);
   } while (size > 0);
}

decompress_routine(fd, additional_routine_one, bin, NULL, NULL);
decompress_routine(fd, NULL, NULL, additional_routine_two, var);

In this example, decompress_routine performs the same complex decompression algorithm, but rather than having two separate functions, it checks at the appropriate points whether a function pointer was passed. If so, it calls the handler with the respective argument.

Additionally, if the program must run both additional_routine_one and additional_routine_two the program can run:

decompress_routine(fd, additional_routine_one, bin, additional_routine_two, var);

Finally, in cases where I require more than one argument, I typically pass a pointer to a structure with the data I need.
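
To make the pattern concrete, here is a compilable sketch. It is not git’s or OpenGit’s code: the “decoding” step is just upper-casing the input as a stand-in for real decompression, and the names decode_routine(), count_bytes() and struct byte_counter are all hypothetical. The structure shows how several values can travel behind one void pointer.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

typedef void datahandler(const char *, size_t, void *);

/* Hypothetical handler state: several values bundled behind one void *. */
struct byte_counter {
	size_t calls;
	size_t bytes;
};

/* A handler that records how often it ran and how much data it saw. */
void
count_bytes(const char *data, size_t size, void *arg)
{
	struct byte_counter *c = arg;

	(void)data;
	c->calls++;
	c->bytes += size;
}

/*
 * Stand-in for the real work: "decoding" here is just upper-casing
 * `in` into `out`. Each handler is optional and skipped when NULL.
 */
void
decode_routine(const char *in, size_t size, char *out,
    datahandler *raw_handler, void *rarg,
    datahandler *decoded_handler, void *darg)
{
	size_t i;

	assert(in != NULL && out != NULL);
	if (raw_handler != NULL)
		raw_handler(in, size, rarg);

	for (i = 0; i < size; i++)
		out[i] = (char)toupper((unsigned char)in[i]);

	if (decoded_handler != NULL)
		decoded_handler(out, size, darg);
}
```

A caller that only cares about the raw input passes count_bytes with its own struct byte_counter for the first handler and NULL, NULL for the second. A caller that wants both passes two separate counters.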

Potential Tradeoffs

  • Performance: There is a trivial cost to verifying if the function pointer is set to NULL or not. This may be irrelevant for general-purpose applications, but could be something to consider for extremely high-performing systems, where memory or disk utilization is not a concern.
  • Code clarity: Having a series of NULLs is “ugly” code. However, this can be trivially resolved by using macros to hide away the NULL references.

Thoughts? Comments? Threats?

Generating a “Vanity” PGP Key ID Signature

Here’s a quick bash script I used to generate a “vanity” PGP key with the last two bytes (four characters) of the key ID set to FFFF.

#!/usr/bin/env bash

while :
do
gpg --debug-quick-random -q --batch --gen-key << EOF
Key-Type: RSA
Key-Length: 2048
Name-Email: user@domain
Name-Real: Real Name
Passphrase: yourverylongpassphrasegoeshere
EOF

if gpg -q --list-keys | head -4 | tail -c 5 | grep FFFF
then
        echo Break
        exit 1
else
        gpg2 --batch -q --yes --delete-secret-and-public-key `gpg -q --list-keys | head -4 | tail -n 1`
fi

done

I also added no-secmem-warning to ~/.gnupg/options to suppress the insecure memory warnings. When I set it to a 1024-bit key, it only took about 3 hours, while a 2048-bit key took 20 hours.

It goes without saying, my use of insecure randomness is a terrible idea for those facing a serious threat model. Also, you’re basically picking a number at random out of 65,536 hoping for the right combination – but I’m just having fun with it.
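
For a rough idea of the odds, here is a small sketch that assumes every generated key ID suffix is uniformly random and independent of the others. The function name vanity_hit_probability() is my own. It computes the chance of at least one hit within a given number of attempts.

```c
#include <assert.h>

/*
 * Probability that at least one of `attempts` independent keys ends in a
 * chosen suffix of `suffix_bits` bits, i.e. 1 - (1 - 2^-bits)^attempts.
 */
double
vanity_hit_probability(unsigned int suffix_bits, unsigned long attempts)
{
	double miss_one, miss_all = 1.0;
	unsigned long i;

	assert(suffix_bits > 0 && suffix_bits < 32);
	miss_one = 1.0 - 1.0 / (double)(1UL << suffix_bits);
	for (i = 0; i < attempts; i++)
		miss_all *= miss_one;
	return (1.0 - miss_all);
}
```

Under that assumption, a 16-bit suffix needs 65,536 keys on average, yet generating exactly 65,536 keys only gives roughly a 63% chance of success, so run times can vary a lot from one attempt to the next.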

Passing by Reference: C’s Garbage Collection

The C programming language has no built-in garbage-collection mechanism – and it very likely never will. This can (and does) lead to memory leaks by even the best programmers. It is also an impetus for the Rust language. However, depending on your use-case, it is still possible to structure your code to use the stack as a sort of zero-cost “garbage collector”.

Let’s jump directly into the code!

This is how many applications instantiate and utilize a structure or arbitrary object.

struct resource *instance;
instance = malloc(sizeof(struct resource));
get_resource(instance);
...
free(instance);

While this is a perfectly fine snippet of code, it requires the program to explicitly free(3) instance when it is no longer needed or risk a memory leak. There is also a slight performance loss from the malloc(3) and free(3).

Therefore, lately I have been using another method.

struct resource instance;
get_resource(&instance);

Rather than allocating memory, this uses the stack. The variable is “destroyed” immediately after falling out of scope, without the need for a free(3).

The downside, of course, is losing the ability to pass the pointer elsewhere after the initial allocating function returns. But this can be overcome by declaring the variable in a parent function common to all those that need it.
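
Here is a fuller sketch of that arrangement. The fields of struct resource and the helper double_value() are invented for illustration, since the snippets above do not define them.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical resource; the fields are made up for this example. */
struct resource {
	char name[16];
	int value;
};

/* Fills in a caller-provided structure instead of allocating one. */
void
get_resource(struct resource *r)
{
	assert(r != NULL);
	memset(r, 0, sizeof(*r));
	strcpy(r->name, "example");
	r->value = 42;
}

/* A second consumer that only borrows the pointer. */
int
double_value(const struct resource *r)
{
	return (r->value * 2);
}

/*
 * The common caller owns the storage. Both helpers borrow it, and it is
 * released automatically when this function returns.
 */
int
use_resource(void)
{
	struct resource instance;

	get_resource(&instance);
	return (double_value(&instance));
}
```

The one rule to remember is that no helper may stash the pointer somewhere that outlives use_resource(), because the storage is gone the moment it returns.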

Thoughts?

SHA1 on FreeBSD Snippet

I needed some code that produces SHA1 digests for a project I am working on. I hunted through FreeBSD’s sha1(1) code and produced this minimal snippet. Hopefully this helps someone else in the future.

Compile and run as follows:

$ cc shatest.c -o shatest -lmd
$ ./shatest
10d0b55e0ce96e1ad711adaac266c9200cbc27e4
$ printf "bismillah" | sha1
10d0b55e0ce96e1ad711adaac266c9200cbc27e4

Thanks to FreeBSD for maintaining such clean code!

Including optimized-out kernel symbols in dtrace on FreeBSD

Warning: This is a hack that involves modifying the build scripts. tl;dr: modify /usr/src/sys/conf/kern.pre.mk to change all references of -O2 to -O0.

Have you ever had dtrace(1) on FreeBSD fail to list a probe that should exist in the kernel? This is because Clang will optimize-out some functions. The result is ctfconvert(1) will not generate debugging symbols that dtrace(1) uses to identify probes. I have a quick solution to getting those probes visible to dtrace(1).

In my case, I was trying to instrument on ieee80211_ioctl_get80211, whose sister function ieee80211_ioctl_set80211 has a dtrace(1) probe in the generic FreeBSD 11 and 12 kernels. Both functions are located in /usr/src/sys/net80211/ieee80211_ioctl.c.

My first attempt was to add to /etc/make.conf as follows and recompile the kernel.

CFLAGS+=-O0 -fno-inline-functions

This failed to produce the dtrace(1) probe. Several other attempts failed and I was getting inconsistent compilation results (Is it me or is ieee80211_ioctl.c compiled with different flags if NO_CLEAN=1 is set?). When I manually compiled the object file by copying the compilation line for the object file and adding -O0 -fno-inline-functions, nm(1) on both the object file and kernel demonstrated that the symbol was present. I installed the kernel, rebooted and it was listed as a dtrace probe. Great!

But as I continued to debug my WiFi driver (oh yeah, I’m very slowly extending rtwn(4)), I found myself rebuilding the kernel several times and frequently rebooting. Why not do this across the entire kernel?

After hacking around, my solution was to edit the build script /usr/src/sys/conf/kern.pre.mk and change every optimization level 2 to optimization level 0. The following is my diff(1) on FreeBSD 12.0-CURRENT.

diff --git a/sys/conf/kern.pre.mk b/sys/conf/kern.pre.mk
index c1bbf0d30bf..9a99f1065aa 100644
--- a/sys/conf/kern.pre.mk
+++ b/sys/conf/kern.pre.mk
@@ -57,14 +57,14 @@ CTFFLAGS+=  -g
 .if ${MACHINE_CPUARCH} == "powerpc"
 _MINUS_O=      -O      # gcc miscompiles some code at -O2
 .else
-_MINUS_O=      -O2
+_MINUS_O=      -O0
 .endif
 .endif
 .if ${MACHINE_CPUARCH} == "amd64"
 .if ${COMPILER_TYPE} == "clang"
-COPTFLAGS?=-O2 -pipe
+COPTFLAGS?=-O0 -pipe
 .else
-COPTFLAGS?=-O2 -frename-registers -pipe
+COPTFLAGS?=-O0 -frename-registers -pipe
 .endif
 .else
 COPTFLAGS?=${_MINUS_O} -pipe

My dtrace -l | wc -l went from 71432 probes to 91420 probes.

A few thoughts:

  • This seems like a hack rather than a long-term solution. Either the problem is with the hard-coded optimization flags, or the inability to overwrite them in all places in make.conf.
  • Removing optimizations is only something I would do in a non-production kernel, so it’s as if I have to choose between optimizations for a production kernel or having dtrace probes. But dtrace explicitly markets itself as not impactful on production.
  • Using the dtrace pony as your featured image on WordPress does not render properly and must be rotated and modified. Blame Bryan Cantrill.

If you have a better solution, please let me know and I will update the article, but this works for me!

Linux maintains bugs: The real reason ifconfig on Linux is deprecated

In my third installment of FreeBSD vs Linux, I will discuss underlying reasons for why Linux moved away from ifconfig(8) to ip(8).

In the past, when people said, “Linux is a kernel, not an operating system”, I knew that was true but I always thought it was a rather pedantic criticism. Of course no one runs just the Linux kernel, you run a distribution of Linux. But after reviewing userland code, I understand the significant drawbacks to developing “just a kernel” in isolation from the rest of the system.

Let’s say a userland program wants to request an object from the kernel. The kernel structure might be something like this:

struct foo {
     size_t size;
     char name[20];
     int val;
};

On POSIX systems, a typical way to communicate with the kernel is to open a file descriptor to the appropriate system and send an ioctl(2) with a pointer to where the kernel should store the responding data. FreeBSD might perform this task as follows:

struct foo x;
ioctl(fd, CMD_REQUEST_FOO, &x);

Linux should do the same and, to be fair, it typically does. This manifests as software source that requires the Linux kernel’s headers. But because userland tools are maintained independently of the kernel, and sometimes are even explicitly written to be cross-platform, they typically maintain their own copy of data structures and macros independent of the Linux source tree.

So far so good. This might even produce the exact same binary output. But what happens if the kernel structure or behavior changes? This could be due to a bug fix, an added feature or an optimization – either way, the structure may change.
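
To illustrate what goes wrong, here is a userland-only simulation with no real ioctl(2) involved. I am assuming a hypothetical newer kernel that inserted a flags member before val. The names struct foo_v2, kernel_fill() and stale_userland_val() are invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The layout userland was compiled against. */
struct foo_v1 {
	size_t size;
	char name[20];
	int val;
};

/* Hypothetical newer kernel layout: `flags` was inserted before `val`. */
struct foo_v2 {
	size_t size;
	char name[20];
	int flags;
	int val;
};

/* Simulates the kernel filling in its (newer) structure. */
void
kernel_fill(void *out)
{
	struct foo_v2 k;

	assert(out != NULL);
	memset(&k, 0, sizeof(k));
	k.size = sizeof(k);
	strcpy(k.name, "wlan0");
	k.flags = 7;
	k.val = 1234;
	memcpy(out, &k, sizeof(k));
}

/* Simulates stale userland reading the reply with the old layout. */
int
stale_userland_val(void)
{
	unsigned char reply[sizeof(struct foo_v2)];
	struct foo_v1 u;

	kernel_fill(reply);
	memcpy(&u, reply, sizeof(u));
	return (u.val);		/* actually holds the kernel's `flags` */
}
```

The stale program silently reads 7 where it expected 1234, with no error from anyone. A leading size member like the one in struct foo is one common way for either side to notice such a mismatch, provided somebody actually checks it.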

On FreeBSD this is not a problem. They update the kernel and userland tools in tandem. In fact, because both the kernel and userland application are in the same source tree they can even share the same header files. For 3rd party userland applications, FreeBSD provides highly stable libraries that do all the kernel-interactions, such as lib80211(3) – it’s worth noting that OpenBSD and NetBSD do not have these libraries because the kernel interface itself is highly stable anyways. FreeBSD even provides a COMPAT layer in the rare cases that an older binary fails to run on modern versions of FreeBSD.

Conversely on Linux, because the kernel and the rest of the operating system are not developed in tandem, updating or fixing a kernel struct would be almost guaranteed to break a downstream application. The only way to prevent this would be to conduct regular massively coordinated updates to system utilities when the kernel changes, and properly version applications for specific kernel releases. Quite a herculean endeavor. This also explains why systemtap, one of Linux’s many answers to dtrace(1), does not work on Ubuntu.

Also, Linux can never have an equivalent of a lib80211(3) because there is no single standard library set. Even for the standard C library set, Linux has Glibc, uClibC, Dietlibc, Bionic and Musl. Rather than guessing the underlying C library implementation or falling into “dependency hell”, applications default to the most low-level implementation of their requested functionality. Some tools, such as ifconfig(8), resort to just reading from the /proc filesystem.

Linux’s solution to this problem was to create a policy of never breaking userland applications. This means userland interfaces to the Linux kernel never change under any circumstances, even if they malfunction and have known bugs. That is worth reiterating. Linux maintains known bugs – and actively refuses to fix them. In fact, if you attempt to fix them, Linus will curse at you, as manifest by this email.

And this leads back to the topic. Have you ever wondered why nearly every distribution deprecated ifconfig(8), a standard networking tool dating back to classic Unix? When Linux first implemented multiple IPv4 addresses on the same physical interface, it did so by cloning the interface in software and assigning each clone a unique IPv4 address. For example, eth0 could be cloned with eth0:1, eth0:2, etc. From a programmatic perspective, eth0 still only had one IPv4 address. As time passed and developers updated the kernel, it allowed users to assign multiple IPv4 addresses directly to the same interface, bypassing the need for cloning.

But Linux’s API has not changed. It still only returns a single legacy IPv4 address per interface. An interface could have multiple IPv4 addresses but ifconfig(8) will still only report a single address. In other words, as it currently stands ifconfig(8) lies to you. I do not fully understand why they did not just update ifconfig(8) – random IRC rumors say there was a failed attempt due to ifconfig(8)’s convoluted code-base. But for whatever reason, this led to the completely new tool ip(8).
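
If you want to see every address from your own code, getifaddrs(3) is available on both Linux and the BSDs and, as far as I know, returns one entry per configured address rather than one per interface. Here is a minimal sketch. The function name count_ipv4_addresses() is hypothetical.

```c
#include <assert.h>
#include <ifaddrs.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Counts the IPv4 addresses configured on interface `ifname`.
 * Returns the count (possibly 0), or -1 if getifaddrs(3) fails.
 */
int
count_ipv4_addresses(const char *ifname)
{
	struct ifaddrs *list, *ifa;
	int count = 0;

	assert(ifname != NULL);
	if (getifaddrs(&list) == -1)
		return (-1);
	for (ifa = list; ifa != NULL; ifa = ifa->ifa_next) {
		if (ifa->ifa_addr == NULL)
			continue;	/* entries without an address exist */
		if (ifa->ifa_addr->sa_family == AF_INET &&
		    strcmp(ifa->ifa_name, ifname) == 0)
			count++;
	}
	freeifaddrs(list);
	return (count);
}
```

On a machine where eth0 carries several IPv4 addresses, I would expect count_ipv4_addresses("eth0") to return more than 1, although addresses that were added with an alias label such as eth0:1 may be listed under that label instead.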

By contrast, FreeBSD just updates their ifconfig(8) in tandem with any kernel updates and there were no problems. Simple.

This also explains why Linux has multiple tools for seemingly highly correlated network tasks. Rather than working together to create a consolidated tool, Linux has iw(8), iwconfig(8) and brctl(8), etc, whereas FreeBSD just has different drivers for its ifconfig(8) implementation. For the record, I think ip(8)’s syntax is cleaner than ifconfig(8)’s syntax, as the latter is a victim of IPv4 legacy syntax. If both tools worked just fine, it might be worth having ifconfig(8) for legacy scripts during a transitionary period, but making ip(8) the future. That would be perfectly fine, but it would be ideal if both tools just worked, rather than needing to abandon the tool because it is broken.

Written with love on a laptop running OpenBSD 6.3.

Thoughts?

My Backup Solution Leveraging OpenZFS, rsync, WOL and crontab

The other day I managed to destroy my hard drive’s partition table as I attempted to fix a grub(2) issue. To make matters worse, my backup of important files was old, and while attempting to make a boot-disk to re-install Linux, I selected the wrong disk and over-wrote the backup! Clearly I needed a more robust backup solution.

My requirements for the solution were as follows:

  • Runs during idle time, in my case 5:00 am and 5:00 pm.
  • Does not require me to leave my computer on 24/7. I try to reduce my power consumption.
  • Does not require non-standard (read: non-open source) software.

My setup is as follows:

  • Linux Mint desktop. Hate on me if you want, I love the Cinnamon desktop.
  • FreeBSD server, which serves as the ZFS backup sink (and a lot more).
  • A small Linux (soon-to-be NetBSD) Raspberry Pi 1 that functions as my IPv6 router, VPN endpoint and jump host when I need to tunnel into my home network from the outside.

Let’s begin!

At 4:58 AM, the Raspberry Pi’s crontab(1) runs a script to ping my desktop. If the ping is successful, it notes that the computer is online by writing 0 to /var/log/pc-status. Otherwise, it notes that the computer is off-line, writes 1 to /var/log/pc-status and sends a Wake-On-LAN (WOL) frame to my computer to power on the machine. The script is as follows:

#!/bin/sh
ping6 -w 0.5 -c 1 -q PC_IP6_ADDRESS 2> /dev/null > /dev/null
if [ $? -ne 0 ]; then
     # It was off, so write 1
     sudo wakeonlan PC_MAC_ADDRESS 2> /dev/null > /dev/null
     sudo sh -c 'echo 1 > /var/log/pc-status'
else
     # It's already on, so write 0
     sudo sh -c 'echo 0 > /var/log/pc-status'
fi

I put this in the crontab(1) as follows: 58 4 * * * /home/farhan/bin/wake-script.sh.

Next, at 5:00 AM, the Linux desktop performs an rsync(1) to the FreeBSD backup machine to synchronize all files. Since this is rsync(1) and not scp, it does not waste time on files that are already up to date. Upon completion, the Linux desktop queries the Raspberry Pi’s power-status record to determine whether the desktop had been on or off before the backup. If it was on, the script does nothing. If it was off, the script suspends the machine. This is done as follows in another script:

rsync -6 -q -a --exclude "VirtualBox VMs" --include ".*" /home/farhan/* farhan@FREEBSD_SERVER:/usr/local/home/farhan/pc_home/
ssh farhan@RASPBERRY_PI cat /var/log/pc-status | grep 1 -q
if [ $? -eq 0 ]; then
     sudo pm-suspend 2> /dev/null > /dev/null # The redirects are unnecessary.
fi

My storage device is 4 TB and mostly unused, so I do not even bother with the --delete option. Yes, my ~/Downloads directory will likely grow quite large in the next few weeks, but that is not a problem. However, I excluded the Virtual Machines’ directory because even just powering on a VM results in changing the VDI disk image. And similarly to the wake-script, this is in crontab(1) as follows: 0 5 * * * /home/farhan/bin/sleep-script.sh.

Finally, on the FreeBSD side I run daily OpenZFS snapshots and daily prunes of snapshots that are older than 14 days. My crontab(1) is as follows:

@daily /sbin/zfs snapshot -r Data@daily-`date "+\%Y-\%m-\%d"`
@daily /sbin/zfs list -t snapshot -o name | /usr/bin/grep tank/home/farhan/pcbackup@daily- | /usr/bin/sort -r | /usr/bin/tail -n +14 | /usr/bin/xargs -n 1 /sbin/zfs destroy -r

And that’s it! A low-powered, open-source backup solution that relies 100% on Unix tools.

Notes: The only way to make this solution more elegant would be if the Linux desktop ran OpenZFS and used the zfs(8) send command. But given that Linux Mint does not support ZFS out of the box, I am concerned what might happen if the OpenZFS module fails to load and I am stuck with a non-functional machine. Also, notice the explicit use of IPv6, not legacy IP.

fsync(2) on FreeBSD vs Linux

Even with our modern technology, hard-disk operations tend to be the slowest and most painful part of any modern system. As such, modern operating systems implement buffering mechanisms. In this scheme, when an application calls write(2), rather than immediately performing physical disk operations, the operating system stores the data in a kernel buffer. When the buffer exceeds a certain size or when an application calls the fsync(2) system call, the kernel begins writing to the disk.

This scheme is significantly faster, perhaps most demonstrably by the massive performance differential between the GNU vs BSD yes(1), as initially noted here. Note: FreeBSD’s yes(1) has now reached parity with GNU.

So far so good. But what happens when a disk write operation fails? This could be due to a hardware or network failure, but ultimately it is not the fault of the operating system. However, the operating system must properly handle the failure.

On Linux, when an application’s fsync(2) call fails, the kernel returns a disk error. However, it then marks the buffer as clean rather than “dirty” and records the failure in an error flag (EIO). When the application issues another fsync(2) and the disk succeeds, the kernel clears the error bit and reports a successful write to the application. As such, the previously failed data never hit the disk and, if the application has already discarded it, the data is lost.

On FreeBSD, when an application’s fsync(2) call fails, the kernel also reports the error to the application. But unlike Linux, it keeps the buffer marked “dirty”, so the kernel buffer is not overwritten or discarded until the data has actually been written, even if a successive fsync(2) is successful. This way, the page data is not lost.

This is another example of the superiority of FreeBSD over Linux. FreeBSD can better survive a disk failure, while Linux’s implementation is fundamentally broken. In the past I have experienced Linux’s ext4 fail into read-only mode to prevent disk corruption. While that might be a fall-back mechanism, it is not a long-term solution. Instead, userland applications have to keep track of whether the kernel was successful or not. Depending on your perspective, this is a stack violation.
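
In practice, that bookkeeping means a program should never retry fsync(2) and trust a later success. Here is a sketch of a conservative write path, under the assumption that any failure is final and the caller still holds the original data. The function name write_durably() is hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Writes `len` bytes to `path` and reports success only if write(2),
 * fsync(2) and close(2) all succeed. Returns 0 on success, -1 on error.
 */
int
write_durably(const char *path, const void *data, size_t len)
{
	const char *p = data;
	ssize_t n;
	int fd;

	assert(path != NULL && data != NULL);
	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd == -1)
		return (-1);
	while (len > 0) {
		n = write(fd, p, len);
		if (n == -1) {
			close(fd);
			return (-1);
		}
		p += n;
		len -= (size_t)n;
	}
	/*
	 * Treat a failed fsync(2) as final. Calling it again and trusting
	 * a later success is exactly the trap described above.
	 */
	if (fsync(fd) == -1) {
		close(fd);
		return (-1);
	}
	return (close(fd));
}
```

On a -1 return, the caller should rewrite the data from its own copy, ideally to a fresh file, rather than assume the kernel still has it.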

Additionally, any long-term solution to change the behavior of the operating system would mean all user-land applications would potentially break. Linus Torvalds has notoriously stated:

Breaking user programs simply isn't acceptable

In fact, he’s repeated this policy in more colorful language here. So you’re stuck with bad behavior.

Now consider if you want to build an operating system that will run for potentially a hundred years and produce zero errors or catch errors and properly perform exception handling. Go with FreeBSD.

It’s worth noting that Illumos (Solaris) properly implements fsync(2), whereas OpenBSD and NetBSD also failed on this issue and I fully anticipate them to fix the problem.