RFC: Accurate time synchronization dependencies via time-sync.target

thoughtpolice · December 3, 2018, 7:26pm

Hello *,

In the process of working on nixos/cockroachdb: create new service by thoughtpolice · Pull Request #51306 · NixOS/nixpkgs · GitHub, adding a new CockroachDB module and set of tests, I found myself needing a way to synchronize clocks in NixOS tests, but also, to block services until all clocks are initially synchronized.

This led me to wonder: what are the semantics of time-sync.target, a systemd.special(7) target that is supposed to be triggered by NTP software?

The systemd.special(7) manpage says this about time-sync.target:

Services responsible for synchronizing the system clock from a remote source (such as NTP client implementations) should pull in this target and order themselves before it. All services where correct time is essential should be ordered after this unit, but not pull it in…

This reading would seem to imply that time-sync.target should be triggered once the system clock has been synchronized correctly against a remote source, for the first time.

However, to date, all NixOS NTP daemons trigger time-sync.target once they have started, not once the initial time is synchronized. Initial time synchronization can take a noticeable amount of time after the NTP daemon is started, and it is quite likely that an adjustment will need to be made soon afterwards.

TL;DR – I propose to change this so that time-sync.target is triggered only once initial synchronization has occurred after boot. My current work implementing this is available at nixos: make time-sync.target block until initial adjustment with all NTP daemons by thoughtpolice · Pull Request #51338 · NixOS/nixpkgs · GitHub

Feel free to read on for some background, or skip past that if you’re already familiar with the basics.

NTP Background

NTP daemons do a form of measurement and adjustment for clock skew in order to keep accurate time. NTP daemons monitor the given system time and the remote time over a sampling period, in order to determine future and current estimations of “real time”, within a narrow window of error. They continuously do this measurement at all times for accurate, long-term timekeeping.

A particularly important part of this is keeping time accurate during the boot sequence of a system, when the initial measurements against a source, and the very first clock synchronization, must be performed.

Most NTP software has two modes of operation to adjust time: skipping (usually called “stepping” in NTP documentation) and slewing. Slewing makes minor adjustments by slightly changing the rate at which the system clock runs. For example, if the time is too far ahead, the daemon will adjust system time to “spread” out small time windows (a few ms), so it can fall back to the real time, gradually. If it’s too far behind, it will adjust the system time to speed up little-by-little. Most of the time this slewing works fine for normal jitter that occurs while the system is alive.

However, slewing only works well for a narrow window of anomalies, and is only meant to adjust time in very small steps that happen due to ordinary, small-scale jitter, on the order of microseconds. Large discrepancies in time would result in extremely long slew times: an adjustment of a few seconds may take minutes or hours to complete. For example, at the 500 ppm maximum slew rate that ntpd uses, correcting a 3-second offset takes about 100 minutes.

Boot time is a bit different and falls in this category of “large time discrepancies”. Generally the boot sequence will copy the onboard RTC to the system clock at startup, but adjustments may still need to be made afterwards simply due to spooky ghosts and other malicious entities. If, after the initial set of measurements after boot, the necessary time adjustment is large enough (e.g. a few seconds), slewing is going to take too long.

For this reason, most NTP software has configurable “skip” functionality – a skip being an immediate jump forward or backwards in time, meant to fix large discrepancies. The intention of this design is that most systems will start NTP at boot and rapidly do measurements, followed by a possible, immediate skip, so initial (possibly large) discrepancies are corrected fast – and over time, they will adjust further, much smaller discrepancies via slews.
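
As a concrete illustration with Chrony, the skip behaviour is controlled by the makestep directive in chrony.conf. The following is a minimal sketch rather than a recommended configuration; the 1-second threshold and the limit of 3 updates are illustrative values that I am assuming here, not something taken from the patch:

```nix
{
  services.chrony.enable = true;

  # Assumed values, for illustration only: skip (step) the clock when the
  # measured offset is larger than 1.0 second, but only during the first 3
  # clock updates after chronyd starts. Any later correction is slewed.
  services.chrony.extraConfig = ''
    makestep 1.0 3
  '';
}
```

Limiting the skip to the first few updates keeps the "fix it fast at boot" behaviour while avoiding jumps once long-running services are up.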

Clustering database example

However, skips are dangerous, because arbitrary jumps forward or backward for large periods will generally cause many systems to go haywire. Furthermore, accurate timekeeping over the long haul is also essential, or systems will also go haywire. For certain systems, it is absolutely essential for the clock to be adjusted properly before the service can be reliably deployed.

CockroachDB is a prime example of this. As a clustering database, CRDB heavily relies on globally consistent time provided by something like an NTP daemon. With CRDB, nodes that are de-synchronized beyond a threshold (default ~500ms) will be kicked out of a cluster or not allowed to join. Furthermore, a local skip on a cluster node will cause that node to kill itself. Therefore it is absolutely essential that time synchronization has occurred before the database, or any of its sibling nodes, is started.

An easy way to see this is in my CockroachDB clustering tests, as written in nixos/cockroachdb: create new service by thoughtpolice · Pull Request #51306 · NixOS/nixpkgs · GitHub. The default NixOS testing infrastructure deploys virtual machines through KVM/QEMU that have no NTP daemon deployed. However, due to system jitter, they will very often start with highly desynchronized clocks – I’ve noticed anywhere from 0.5 to 3 seconds of adjustment at bootup. I would estimate that unless CockroachDB is forced to wait for synchronization, these VM tests fail 60% of the time or more, non-deterministically.

In order to fix this, I use a KVM driver called ptp_kvm in order to provide a reference clock to the virtual machine, which is then synchronized with chrony. As every virtual machine will be synchronized to the host clock, this allows the tests to reliably pass.
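
A rough sketch of what such a guest configuration could look like is below. It is not copied from the PR: the device path /dev/ptp0 and the poll interval are assumptions, and the real test setup may differ.

```nix
{
  # Load the KVM PTP driver in the guest. It exposes the host clock as a
  # PTP hardware clock (PHC) device.
  boot.kernelModules = [ "ptp_kvm" ];

  services.chrony.enable = true;

  # The test VMs have no network time sources, so use no NTP servers and
  # rely on the reference clock alone.
  services.chrony.servers = [ ];

  # Assumption: the PHC device appears as /dev/ptp0 inside the guest.
  services.chrony.extraConfig = ''
    refclock PHC /dev/ptp0 poll 2
  '';
}
```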

But how do I get cockroachdb.service to wait for synchronization? The current way is to inject a call to chronyc waitsync into the ExecStartPre of cockroachdb.service (preStart in NixOS terms):

      # Enable CockroachDB. In order to ensure that Chrony has performed its
      # first synchronization at boot-time (which may take ~10 seconds) before
      # starting CockroachDB, we block the ExecStartPre directive using the
      # 'waitsync' command. This ensures Cockroach doesn't have its system time
      # leap forward out of nowhere during startup/execution.
      #
      # Note that the default threshold for NTP-based skew in CockroachDB is
      # ~500ms, so making sure it's started *after* accurate time
      # synchronization is extremely important.
      services.cockroachdb.enable = true;
      # ...

      # Hold startup until Chrony has performed its first measurement (which
      # will probably result in a full timeskip, thanks to makestep)
      systemd.services.cockroachdb.preStart = ''
        ${pkgs.chrony}/bin/chronyc waitsync
      '';

However, this is extremely non-modular: there is no way for the cockroachdb.nix module to know which NTP daemon the user has chosen, each with distinct wait/synchronization functionality. This means the user must inject such a blocking call into the path of their specific daemons, knowing that accurate synchronization won’t happen otherwise.

`time-sync.target` for reliable synchronization dependencies: PR #51338

This brings us back to time-sync.target in our current setup. Currently, time-sync.target is only triggered on the start of any NTP daemon, before any initial measurements or adjustments occur. nixos: make time-sync.target block until initial adjustment with all NTP daemons by thoughtpolice · Pull Request #51338 · NixOS/nixpkgs · GitHub changes this to trigger after initial synchronization.

The easiest way to show how to pull this off is to just look at how the new-and-improved Chrony service does it, in my patch:

    systemd.services.chronyd =
      { description   = "Chrony NTP daemon";
        documentation = [ "man:chronyd(8)" "man:chrony.conf(5)" "https://chrony.tuxfamily.org" ];

        wantedBy  = [ "multi-user.target" ];
        wants     = [ "ntp-adjusted-chrony.service" ];
        after     = [ "network.target" ];
        conflicts = [ "openntpd.service" "ntpd.service" "systemd-timesyncd.service" ];

        preStart = ''
          mkdir -m 0755 -p ${stateDir}
          touch ${keyFile}
          chmod 0640 ${keyFile}
          chown chrony:chrony ${stateDir} ${keyFile}
        '';

        unitConfig.ConditionCapability = "CAP_SYS_TIME";
        serviceConfig =
          { Type = "forking";
            ExecStart = "${pkgs.chrony}/bin/chronyd ${chronyFlags}";

            ProtectHome = "yes";
            ProtectSystem = "full";
            PrivateTmp = "yes";
          };
      };

    # Blocker for time-sync.target
    systemd.services.ntp-adjusted-chrony =
      { description = "initial NTP adjustment and measurement (chrony)";

        requires = [ "chronyd.service" "time-sync.target" ];
        before   = [ "time-sync.target" ];
        after    = [ "chronyd.service" ];

        serviceConfig =
          { ExecStart = "${pkgs.chrony}/bin/chronyc waitsync";
            Type = "oneshot";
            RemainAfterExit = "yes";

            ProtectHome = "yes";
            ProtectSystem = "full";
            PrivateTmp = "yes";
          };
      };

The basic idea is:

an NTP daemon service now implies the start of a new, one-shot ntp-adjust service that waits until initial measurements occur, using NTP-daemon-specific functionality
time-sync.target is no longer triggered by the NTP daemon itself, but instead by its corresponding adjustment service

With this change, services like cockroachdb.service can now just set Requires=time-sync.target and After=time-sync.target in their unit configuration (the requires and after options of a NixOS service) – Cockroach will not start until network time is properly synchronized, removing any service dependency hacks.
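
In NixOS module terms, that dependency could be written roughly as follows. This is a sketch of the consumer side only, using the standard requires and after options of systemd.services; it is not taken from the CockroachDB module itself:

```nix
{
  systemd.services.cockroachdb = {
    # Pull in time-sync.target, so that starting the database also asks
    # for initial time synchronization ...
    requires = [ "time-sync.target" ];
    # ... and order the database after it, so it does not start until the
    # target has been reached.
    after = [ "time-sync.target" ];
  };
}
```

Nothing here names a specific NTP daemon, which is the point: the module only depends on the target, and whichever daemon is enabled decides when the target is reached.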

Current progress

Currently, only Chrony and NTPD have been tested using this patch. One particular problem is that we do not really have any tests for any of the NTP daemons – this is a little obvious, because NixOS tests remove networking functionality from their virtual machines (although it can be opted into, IIRC).

The remaining issues are:

1. Testing systemd-timesyncd, which uses a new service systemd-time-wait-sync.service, part of systemd v239 (see timesyncd allows time-sync.target to be reached before time is synchronized. · Issue #5097 · systemd/systemd · GitHub and https://github.com/systemd/systemd/pull/8494 for more information). We somewhat lucked out on this, because it’s otherwise a relatively recent addition fixing a long-standing bug.
2. Figuring out what to do about openntpd, which does not offer a way to wait for synchronization, but does offer a way to force immediate synchronization.
3. Implementing proper failure scenarios: in particular, all of the new oneshot adjustment services will have infinite timeout/retry. It’s probably better to wait for some relatively long-but-not-forever amount of time (say, a 300-second timeout for initial synchronization).
4. Documenting this in the manual.
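
For the failure-scenario item, one possible shape of the fix is a start timeout on the oneshot unit. This is only a sketch: the 300-second figure is the example value mentioned above, and whether the limit belongs in systemd or in the daemon-specific wait command is still an open design choice.

```nix
{
  systemd.services.ntp-adjusted-chrony.serviceConfig = {
    # Type=oneshot units have no start timeout by default, so
    # `chronyc waitsync` would otherwise be allowed to block forever.
    # With this setting the unit is marked as failed after 300 seconds.
    TimeoutStartSec = "300s";
  };
}
```

Keep in mind that ordering alone does not propagate failure in systemd, so what should happen to time-sync.target and its dependents after such a timeout still needs to be decided.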

I don’t think #2 is particularly troublesome to work around, but the inability to wait, and only force immediate results, seems a bit “rude” to do on behalf of users (in contrast, ntpwait and chronyc waitsync both properly wait until measurements complete, based on how the user specifies their configuration – they do not force anything to occur faster than it would have otherwise.) This is an annoying asymmetry, but hopefully not a huge issue; honestly I have no idea how many OpenNTPD users there are for NixOS – I personally use Chrony on all my machines now.

Does this impact you? Speak up

Currently, time-sync.target is only used by two time-critical daemons in Nixpkgs:

Ceph
CockroachDB

However, there are quite possibly many people out there using this target themselves, or indirectly via Ceph. I would be interested in hearing what their use-case for time-sync.target is, and if they expect this to improve things (I think it’s a strict improvement, but I’m willing to be surprised).

I would also appreciate any input from openntpd users if they have suggestions for solving the time-waiting issue. Perhaps I missed something obvious.

I’d like to implement/merge this all within the next few weeks, so I appreciate your thoughts.

Mic92 · December 3, 2018, 8:28pm

Have you thought about bringing this up on the systemd issue tracker? A target like that could also be standardized. Systemd also has an optional time sync service, which we already use by default, so this would be relevant to them and to us. At least we would get their feedback on the topic.

This should already work out-of-the-box with systemd-timesyncd: https://github.com/systemd/systemd/blob/master/src/timesync/timesyncd-manager.c#L619

thoughtpolice · December 3, 2018, 11:04pm

Correct, except for one thing: as noted, systemd-time-wait-sync.service is a new service that ships with systemd v239, and it does exactly what you expect – it simply injects itself into the dependencies of time-sync.target and waits until systemd-timesyncd has done its first synchronization before allowing things to continue, making it effectively equivalent to ntpwait or chronyc waitsync.

The only thing we were missing is enabling this upstream service in our default config (see commit nixos/boot: distribute systemd-time-wait-sync.service)