The Design of a Self-Compiling C Transpiler Targeting POSIX Shell

6 Likes

This paper is really cool, and I’ve spent at least one night staring at the ceiling after reading it trying to figure out how we’d use it to bootstrap nixpkgs. The result has been some mix of “I have no idea,” “well, can it run on gash,” “but you still need a C compiler to bootstrap it,” “but we can just distribute a version that compiles itself” and so on.

One somewhat concrete idea is that we could use the GNU Mes version of POSIX shell written in Guile, gash, but that still needs a C compiler to bootstrap the interpreter, so I don’t know where this leaves us. If we somehow got gash bootstrapped using bin0 building up a metacircular Lisp evaluator that can run gash, maybe we could build from bin0 up to gash somehow, then get pnut building itself and then building tcc which can compile gcc. The result would be a full source bootstrap except for bin0, but I also don’t know if this is realistic or if there are better options.

6 Likes

Well their idea seems to be that your bootstrap seed would be a POSIX shell (as a binary blob that needs to be trusted) and pnut.sh, i.e. the compiler in transpiled form. They argue that you’d then only need to trust the POSIX shell, really, since pnut.sh is reasonably human readable and thus can be audited just like the pnut source code itself.

The bootstrap chain would then look something like this:

  1. We have some binary POSIX shell and pnut.sh as well as pnut-exe.c (which is part of the pnut source code).
  2. We use the shell and pnut.sh to transpile pnut-exe.c to pnut-exe.sh. This is a variant of pnut that produces ELF executables targeting x86 Linux only (!). The resulting executables are much faster than shell, obviously.
  3. We use pnut-exe.sh to build an executable variant of pnut-exe.c.
  4. We use the ELF pnut-exe to build TinyCC. Edit: Note that “[w]ork to build TCC with pnut-exe is ongoing.” (p. 80).
  5. We go from there like we usually would (probably build GCC 4.8 or whatever)…

I assume it is not really practical to transpile TinyCC (would it even theoretically work?) because it would be very slow, so alternative backends to x86 Linux would still be necessary.

Edit: pnut targeting shell doesn’t implement switch fallthrough and other features of C that can’t be transpiled in a readable fashion, but pnut targeting x86 Linux ELF does. For this reason, it is not possible to build TinyCC with pnut targeting shell. As a consequence, using pnut for bootstrapping is only possible on i686-linux and x86_64-linux until other backends are implemented.

I think they make a good or rather pragmatic argument for POSIX shell as the bootstrap seed:

  • It is easy to get a POSIX shell, it has been ported to basically all platforms.
  • We’d need a POSIX shell anyways to do plumbing surrounding the compilation process.
  • There are many conforming implementations available, so it is possible to gain more confidence in the bootstrap using diverse double compilation (traditional reproducible builds efforts rely on this a lot, e.g. ArchLinux. I don’t think anyone has done this for nixpkgs’ bootstrap (on a larger scale) so far, but I might be wrong).

It seems to me that using Scheme and GNU mes is just as good, maybe even better because you “just” have to review the actual source code of GNU mes and not some transpilation result (be it readable or not) additionally.

In theory Scheme has similar benefits to POSIX shell, i.e. it is standardized, easy to port to new platforms, many implementations are available. It is also similar to POSIX shell in that most implementations implement extra nonstandard behavior.

The slight problem with GNU mes is of course that it ships its own scheme interpreter which implements a (weird?) subset of different scheme standards. So I wonder if it’s actually possible in practice to use anything except the bundled interpreter for bootstrap. If not, pnut would be a genuine improvement as they seem to have tested multiple implementations.

3 Likes

One avenue I briefly explored outside of pnut was if other approaches than GNU Mes could produce an integrated set of C compiler/binutils which then could later be used to bootstrap GCC and the GNU Binutils via some sort of interpreter. The reason to do this over a full compiler is that keeping the amount of target specific object code down at this bootstrap layer is a good idea in general, and lets you keep more of the bootstrap chain in plaintext.

Even so, having to compile binutils (M2, blood-elf, etc) adds complexity to the bootstrap. So, I looked into “fully integrated” binutils-less options. The ones I saw were:

Zig. Seems like it’s already on the table to write a Zig interpreter though it hasn’t been done; check the linked Discourse thread:

This interpreter could be written in Lua or Lisp for example, and only needs to be able to interpret exactly the latest tagged release of Zig.

If we had a Zig interpreter, we could presumably bootstrap GCC through Zig compiling itself and then compiling GCC since Zig can process C, though I don’t have any idea if Zig can actually compile GCC.

On the other hand, there’s TCC which could compile itself and then compile GCC. It’s written in C, so would need a C interpreter (more like, an interpreter that only needs to interpret TCC).

The question remains how to get a Zig or C interpreter as soon as possible after bin0 (or with bin0), and that’s where I was going with the “suitably small target specific Lisp interpreter with a metacircular evaluator” idea. (surprise, everything is a Lisp machine!)

Just an idea at this stage, but one with the potential to reduce the number of packages being bootstrapped in exchange for picking something with a fully integrated binutils like Zig or TCC.

1 Like

Zig doesn’t help you at all for bootstrapping since its C compilation ability is just a wrapper around LLVM/clang. If you have LLVM already, you can just use clang and don’t need to bother with zig.

1 Like

OK, yeah, that’s not exactly surprising that you can’t exactly get the C support “for free” with Zig and need to link against LLVM. Thanks for the clarification.