How to write bootstraps

The actual problem: CPython 2.7 is EOL and now I have to list it as an insecure package. I’d like to instead use PyPy for Python 2.7, but PyPy depends on Python 2.7 to build.

Approach 1: GCC depends on GCC. The stdenv bootstrap is prepared by building GCC and glibc; the result is tossed into a tarball. The tarball has to be patchelf’d before it can be used again, and stdenv is built in 5-ish different stages instead of the optimal 3 stages due to GCC + glibc interactions. See trofi’s writeups (1, 2) for details.

Approach 2: In the Haskell subsystem, GHC depends on GHC. To get binaries, some versions of GHC are imported from builds for binary distros and patchelf’d into compliance. This only works because other distros provide a niche for somewhat-portable binary GHC builds.

Approach 3: In the Perl 5 subsystem, Perl depends on Perl. During the build step, Perl (used to?) first prepares a small binary with restricted dependencies called miniperl, and then pretends that miniperl is a viable Perl for a first build stage. In the second stage, Perl rebuilds itself according to the standard 3-stage bootstrap.

One complication is that RPython can’t cross-compile. The resulting binaries are always architecture-matched to the build hosts. This means that we may need to indefinitely use CPython 2.7 or some other Python implementation which can be cross-compiled; adding e.g. riscv support to CPython would be easier than adding cross-compilation support to RPython.

Which would work for PyPy? Approach 1 looks messy and can’t cross-compile. Approach 2 only works for x86-64 and aarch64, Linux and Darwin. Approach 3 requires keeping a copy of CPython 2.7 around forever and hoping that it does not cause problems.

What do folks think? I’ll be implementing this in my rpypkgs flake first, but I’m thinking about what would work for nixpkgs in the future.


What about Tauthon?

Sure, any fork of CPython 2.7 should work, although Tauthon’s documentation warns that some of the internals have changed. If Tauthon had a better security posture, then it would be more attractive; right now, I think that the only Python 2.7 implementation receiving security fixes is PyPy.

I don’t think it nullifies the question about how to bootstrap PyPy, but FWIW in late 2022 nixpkgs switched python2 to ActiveState’s cpython fork (to which they are releasing security fixes).


Is it really a problem to use an EOL package for bootstrapping only? You can add attribute(s) to ensure that it won’t remain in closures.


According to the PyPy docs, PyPy can self-bootstrap from PyPy 2.7. Why not just pin an older commit of nixpkgs and use PyPy from that commit to bootstrap current builds?

Yeah! That would be Approach 1. The main downside right now is that it takes over an hour to iterate; building PyPy takes a lot of RAM and also a lot of non-parallelizable CPU time. The downside later on will be that we can’t add new architectures this way; we need a CPython or something else which builds from stdenv.

I see. In that case we can use an otherwise insecure Python for bootstrapping, as the vulnerable Python will not be in the runtime closure. Assuming none of the vulnerabilities affect the bootstrapping, ofc.
In the end this bootstrapping issue is an upstream issue…
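For what it's worth, the "not in the runtime closure" claim can be checked mechanically. Below is a rough Python sketch, assuming the closure is listed with `nix-store --query --requisites`. The helper names and the store paths in the usage note are hypothetical, and matching on a substring of the store path is only a heuristic, not something nixpkgs does for you.

```python
import subprocess


def runtime_closure(store_path):
    """Return the runtime closure of store_path, one store path per entry.

    Assumes the `nix-store` CLI is on PATH.
    """
    result = subprocess.run(
        ["nix-store", "--query", "--requisites", store_path],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.splitlines()


def leaked_paths(closure, forbidden_substring):
    """Return the closure entries whose path contains forbidden_substring.

    Heuristic: this only looks at store path names, not at file contents.
    """
    return [path for path in closure if forbidden_substring in path]
```

Usage would be something like `leaked_paths(runtime_closure("/nix/store/...-pypy"), "python-2.7")`, where the store path is a placeholder; an empty list suggests the bootstrap CPython did not end up in the runtime closure.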

I think so, yes. It’s also partially our issue because we disagree with upstream on how to add new architectures. We typically want to cross-compile, since we want our beachhead compiler to be something like GCC. That’s not the same pattern as this bootstrap.

PyPy requires companies, GSoC, grants, etc. to pay for each new architecture. The bringup is done with CPython. For example, s390x support in RPython was paid for by IBM and still requires CPython 2.7 today. Once added, this support is permanent and self-sustaining due to the bootstrappability of PyPy for Python 2.7; quoting from here (2016):

PyPy’s own position is that PyPy will support Python 2.7 forever—the RPython language in which PyPy is written is a subset of 2.7, and we have no plan to upgrade that.

So, for RPython builders (PyPy being the most obvious case) we should use PyPy for Python 2.7. And we should also bootstrap PyPy for Python 2.7 so that we can build PyPy with PyPy; if nothing else, it’s faster and uses less memory than CPython, lowering requirements for a famously expensive build.
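As a sketch of that preference order, here is a small Python helper that picks the first available Python 2.7 implementation, trying PyPy before CPython. The executable names (`pypy`, `pypy2`, `python2.7`, `python2`) and the function name are assumptions for illustration; a real derivation would pin the interpreter explicitly rather than search `PATH`.

```python
import shutil


def pick_rpython_host(
    candidates=("pypy", "pypy2", "python2.7", "python2"),
    which=shutil.which,
):
    """Return (name, path) for the first candidate found, or None.

    Candidates are tried in order, so PyPy is preferred over CPython.
    `which` is injectable so the lookup can be exercised without touching PATH.
    """
    for name in candidates:
        path = which(name)
        if path is not None:
            return name, path
    return None
```

If only CPython 2.7 is present, this falls back to it, which matches the bring-up situation for a new architecture; once a PyPy exists, it takes priority.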

Upstream makes the assumption that we will impurely install PyPy into our environment, similar to GHC’s assumption that we will always have a reasonably recent GHC installed. The big difference is that this bootstrap doesn’t have to be a moving target, because Python 2.7 is a dead language and so the bootstrap shouldn’t have to be rebuilt often.
