NAR unpacking vulnerability -- post mortem

Last week there was a period of 14 hours where the mechanism of a serious security vulnerability in Nix 2.24 was publicly known but no patch was available. The following reconstructs what led up to that, and lists the conclusions Nix maintainers (@edolstra @tomberek @Ericson2314 @roberth @fricklerhandwerk) draw from the incident.

Summary of events

Nix maintainers were first informed of the vulnerability on Friday 2024-08-30 on Matrix by @puckipedia, with an advance warning on GitHub by @jade. @tomberek got in touch with @puckipedia the same day, started working on the issue over the weekend, and produced an analysis and a patch. @edolstra started adding tests on Wednesday 2024-09-04.

Due to an omission, @puckipedia did not get access to the GitHub advisory until Saturday 2024-09-07. In the Matrix exchange that revealed the problem, @fricklerhandwerk also communicated that the fix release was planned for the coming Monday.

On Monday 2024-09-09 the patch was ready, but the team decided to wait for feedback from @puckipedia since she had alluded to another possible attack without expanding on details.

In the European night to Tuesday 2024-09-10, @puckipedia publicly disclosed the vulnerability. Since no one except @edolstra could cut a release and @edolstra couldn’t be reached, @roberth and @cole-h redirected the installer URL to the previous version as a temporary measure. By European noon the patch was released together with an announcement.

Complete timeline

This timeline was sent for review of factual correctness to @puck, @jade, and @ktemkin.

Friday 2024-08-30

  • 03:15 CEST @jade posts a comment on the Nix 2.18 → 2.24 pull request with a warning not to merge yet.

  • 10:31 CEST @roberth notifies the team.

  • 14:08 CEST @puckipedia posts a message on the Nix development Matrix room describing the vulnerability and setting a deadline to disclose it.

    Okay, so. As mentioned on GitHub yesterday, I have found a vulnerability in Nix 2.24 that allows any untrusted user or substituter (even without trusted signing key) to potentially escalate to root on macOS and some very quirky Linux setups. Due to the severity of this bug, and my past experiences trying to disclose vulnerabilities to Nix, I am considering disclosing this vulnerability publicly in one week (seven days). Any Nix team member may message me for the details of the vulnerability.

  • 19:42 CEST @Mic92 from the NixOS security team checks with @tomberek if the issue is being addressed and offers help if needed.

  • 21:15 CEST @tomberek picks up communication with @puckipedia and leaves a note in the public Matrix room.

    @puckipedia shares the exploit with @tomberek. @tomberek suspects the cause to be the recent file system accessor refactoring by @Ericson2314. They discuss the issue and identify three exploit mechanisms. @tomberek promises to work on it over the weekend.

  • 23:21 CEST @tomberek shares his first assessment with the team.

Sunday 2024-09-01

Monday 2024-09-02

  • 14:00 CEST Regular Nix maintainer meeting begins. @tomberek is not available this time.

    @roberth and @edolstra work on other things due to lack of context.
    @fricklerhandwerk writes down a checklist for releasing security patches to ensure a smoother process this time around.

  • 20:56 CEST @tomberek creates the GitHub security advisory, tags @puckipedia as reporter. However, he misses that “reporter” (unlike “collaborator”) doesn’t grant permission to read. Tom messages Puck with the link.

  • 21:13 CEST @puckipedia lets Tom know that she cannot access the GitHub link. She also brings up a possible fourth exploit variant, wondering if the fix addresses it.

  • 21:27 CEST @tomberek acknowledges the possible fourth attack, but does not react to Puck saying she cannot access the advisory.

Wednesday 2024-09-04

Saturday 2024-09-07

Monday 2024-09-09

Tuesday 2024-09-10

  • 0:40 CEST @roberth notifies the team that the vulnerability has been exposed.

    @roberth and @fricklerhandwerk meet in a call.
    Both try to reach out to @edolstra, since he’s the only one with credentials to make a release.
    As an emergency measure, they open pull requests to change official installer URLs to a previous version.

  • 1:41 CEST

    @fricklerhandwerk reaches @grahamc, who gets @cole-h into the call – he has access to infrastructure.
    The three decide to downgrade the installer, since the earliest possible release would land in the European morning even if @edolstra appeared immediately.

    @fricklerhandwerk tries to reach @tomberek to get details on the status of the fix.

  • 2:00 CEST @cole-h has redirected the installer URL to 2.23.3.

    @Ericson2314 joins the call.

  • 2:18 CEST @roberth opens the pull request to remove Nix 2.24 from Nixpkgs.

    The group tries to reconstruct the sequence of events.

  • 3:07 CEST @cole-h skips ofborg CI and merges the pull request.

  • 10:00 CEST @fricklerhandwerk starts preparing a public statement.

  • 12:22 CEST @raboof posts an announcement on Discourse. @fricklerhandwerk follows up with details.

  • 13:00 CEST @edolstra backports the fix to 2.23 and kicks off the build.

  • 14:40 CEST @roberth and @fricklerhandwerk discuss with @tomberek if the patch is a sufficient mitigation and decide to move forward rather than recommending to downgrade.

    The point release lands, @edolstra provides updates to the Discourse announcement, recommending to reinstall Nix standalone to get 2.24.6 and downgrade to 2.23.3 if using Nixpkgs.

Wednesday 2024-09-11

  • @Mic92 brings Nix 2.24 back to Nixpkgs with the fix applied. @fricklerhandwerk updates the Discourse announcement to recommend upgrading to 2.24.6 inside Nixpkgs.

Assessment

There seems to have been no attempt to reach out to Nix maintainers or the NixOS security team before the public posts on GitHub and Matrix.

Since @tomberek and @edolstra were actively working on the issue, there was no sense of urgency among other team members. The conclusion of the Wednesday meeting that the patch was almost done contributed to that. Before that meeting, a comprehensive discussion of risks among team members did not happen.

We did not manage to aggregate all information consistently in one communication channel, which might otherwise have prevented misunderstandings.

We probably should have communicated more clearly with @puckipedia that we’re waiting for a reply before releasing the patch. Moving the deadline for disclosure, given the patch was actively worked on, was not explicitly negotiated.

Without the uncoordinated disclosure our mitigation and release process would have worked as intended.

Mitigations

As measures to prevent such incidents in the future, Nix maintainers will:

  • Update the security protocol and exercise the updated version

    While we had an informal workflow for releasing security fixes, this was only written down recently. We did not have a checklist for responding to reports though, and little shared experience to build upon.

    Nix vulnerabilities must result in a single line of communication between the reporter and the entire Nix maintainer team. Then communication such as questions on the validity of the patch, and notices of public disclosure, will not be missed.

  • Fully automate the release process

    This was not done previously because the setup worked well enough.
    Automating this will cut down the time to delivery, and allow any team member to complete the release when needed.

29 Likes

Great write-up, thanks.

For the detailed timeline: I already started putting together a public statement at 09:00 CEST (collaborating in the NixOS Marketing and NixOS Security Discussion rooms) and sent it out on social channels at 09:30. Unfortunately I didn’t immediately think to also post it to Discourse, and when I did it was stuck in the moderation queue there for a while.

I think when a vulnerability is already fairly widely public and there are workarounds available, publishing the basics (how do you know you’re affected, and what to do if you are) early would be good to add to the playbook. Of course ideally this can be prevented in the future, but if it does happen again, this is valuable information.

4 Likes

Absent additional context, this part of the timeline is difficult for me to understand as consistent with white-hat practices, and suggests a problem that is not addressed by either of the two proposed mitigations.

(If, as an innocuous example, it is explained by a failure of message delivery on Matrix, perhaps we should not be doing time-critical communication of this nature on Matrix.)

8 Likes

Some context here

I don’t see a clear answer in there, or am I missing something?

1 Like

I have no more information than you, but from those threads I got the impression that the reporter was already frustrated by earlier experiences reporting security issues. They announced they intended to disclose after 7 days and disclosed after 9 days. While obviously not great I can somewhat see where they’re coming from.

Somewhat, though: the way I read things part of the root cause seems to be the earlier frustration with lack of response when reporting security issues. Improving the security protocol and the communication with reporters would hopefully help in reducing the likelihood of such frustration (though ofc it’s not sufficient on its own).

3 Likes

It is notable that they said they were frustrated from a lack of response by the Nix team, but there was no problem of a lack of response this time — too many fragmented communication channels is a different problem.

2 Likes

The Nix team seemingly gives low priority to security issues, which is a bit concerning. (according to that lobste.rs post)


Not good!

2 Likes

If the OP is to be believed, it seems to me that in this specific instance the issue was being reasonably prioritised but there was a breakdown of communication with the reporter.

With the mitigations proposed, hopefully this can be avoided in the future.

2 Likes

Again:

If this is factually accurate—if it is the case that the team attempted to contact the reporter with a fix in hand, and seven hours later the reporter publishes without responding directly to the team—a breakdown in communication is the most charitable description possible. Updating the security protocol doesn’t address this unless the breakdown in communication is technical in nature and the new security protocol establishes a more technically reliable channel.

9 Likes

I think spending time overanalyzing what was fundamentally a social/communication issue doesn’t get us anywhere. Rather than pointing fingers, the goal should be to create systems to prevent these kinds of issues from slipping through the cracks in the future.

6 Likes

So long as we only have one side of the story, I think it’s better to assume positive intent and not malice (not that I am accusing you of this, I am merely stating my position).

1 Like

And that’s the contents of our assessment.

4 Likes

Having seen both sides of this problem professionally, it’s tremendously frustrating on all sides if this happens. Years ago, I had a vuln report languish for months because a company (since it was years ago, I can probably say it was Linksys now) had to patch a RCE in 20 different devices. However, they still did it, even if their QA cycle took forever.

From the other side, I sympathize with someone finding a vulnerability, and wanting something done about it yesterday so needing to push out a fast update. One thing I’ve learned is that these are just things that happen in software. You can adopt fairly secure coding practices and threat model everything, and something still happens. And, yes, it’s usually from areas where some legacy nonsense you warned about could have been improved, but wasn’t.

It would be very frustrating to be in contact with someone the same morning and have a zero day dropped as you’re waiting on feedback. I still don’t think the timeline answers the question on that. Were the messages missed?

On a related note, Matrix seems to eat messages sometimes (I’ve seen issues with end to end encryption and message sync on occasion) and given the general discoverability issues of messages on there, there’s no way I’d use it for a new project now. The fact that we had to create an entirely new NixOS matrix space should be an indicator to us of some reliability issues at least.

3 Likes

I also don’t think the mitigations are adequate. There is a clear violation of the principle of safe defaults here. It should have been confirmed that it wouldn’t be released; not getting that confirmation is a communication failure on the part of the NixCpp team. This would only be the fault of the reporter if they had wrongly told the team they wouldn’t publish.

This can be addressed in the future by mapping load bearing assumptions for safe disclosure, and confirming them, and if not, assuming the worst outcome.

(Also, even if the reporter was at fault, the team should still operate under no unconfirmed assumptions that can lead to untimely disclosure, and if the reporter had wrongly told the team there wouldn’t be a disclosure, then they should never be relied on again. So regardless, there is room for improvement, but it does seem to me that this was in no way a failure on the reporting party from the material presented here.)

4 Likes

I’m not looking at this from a ‘who is at fault’ lens. I don’t think it can be the case that the reporter is ‘at fault’. Imagining an entirely malicious person finding and publicizing a vuln without even attempting to disclose responsibly (which clearly this case is not), I would not even call that person ‘at fault’, because they haven’t been entrusted with any responsibility by anyone. We have to expect that hostile entities can exist and build procedures accordingly.

That said, I don’t necessarily agree that confirming assumptions like ‘you won’t publish if we remain in contact with you and demonstrate actual progress toward releasing a fix’ is the missing piece here. I think the first step is understanding whether the failure was technical or social, which will determine whether it needs a technical or a social solution. Having read a lot of threads, I still don’t know which this is—I have my suspicions about Matrix, and I have my suspicions about the attitudes of certain people with respect to making Nix look good or bad, but I am withholding judgment until some facts about this part of the timeline can be made evident.

10 Likes

There’s the original disclosure, the fedi thread, the lobsters post about the disclosure referencing the fedi thread, and then this thread referencing the lobsters thread. I’m not sure that this isn’t encouraging a game of telephone–let’s stick as far to the root as we can.

I wonder what these past experiences were that motivate such an unreasonably short deadline. As an outsider, it does not look responsible.

I’ll try to summarise relevant details in the Lobsters thread regarding the “past experiences”:

  • The same reporter reported a vulnerability in Nix on February 9th
  • The last communication they had from the Nix team was March 8th
  • They poked the Nix maintainers May 21st

As the cited vulnerability (GHSA-wf4c-57rh-9pjg) doesn’t appear to be public (searching for it only ever brings up it being mentioned in the context of the current vulnerability), I would assume it’s either unpatched or simply not public yet.

It’s probably worth saying that the gap (according to the OP) between last contact and uncoordinated disclosure is two days (excluding the attempted contact on the day of disclosure), rather than the months that the other vulnerability has been on hold.

1 Like