Python builds keep *.pyc in store - dockerTools

While building Python docker images with mach-nix and dockerTools, I noticed when analyzing the final image with dive that all __pycache__ folders are present in the nix store, together with their *.pyc files.

This essentially doubles the image size, give or take. Is this something to be tackled in how Python packages are built in nix, or at the dockerTools level?

Or anywhere else altogether? (Where?)

It’s a side effect of testing: the installed package gets imported, so Python’s normal behavior of generating the .py[co] files kicks in.

Thanks, I was almost suspecting this.

In your personal opinion, where should I seek to fix (better: improve) this?

I don’t think anything is being done to address cleaning up binary files on wheel installation. If you’re using nix’s dockerTools, you should get roughly a layer per package, so the actual cost should be pretty low if you’re generating all your images through it.
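
For illustration, that layer-per-package behavior comes from the layered builder; here is a minimal sketch (image name and package selection are placeholders, not taken from this thread):

  pkgs.dockerTools.buildLayeredImage {
    name = "myimage";
    tag = "latest";
    # each store path in the closure ends up in its own layer (up to
    # maxLayers), so shared dependencies are reused between images
    contents = [ (pkgs.python3.withPackages (ps: with ps; [ numpy pandas ])) ];
    maxLayers = 100; # the default; raise it for bigger closures
  }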

To answer your question about “improving” this: there are some trade-offs. pyc’s are faster to load, but obviously increase image size. I don’t think there has been anything for removing them in their entirety. There has been some work on re-compiling them for reproducibility though:

https://github.com/pypa/pip/issues/7808

Ok, thanks! That is some interesting input!

Since, when doing local debugging, I usually have bind mounts for interpreted languages, the .py[co] files should in my eyes prevail over the .py sources.

I now tend to think this is something that dockerTools might want to contemplate (removing the non-compiled files of interpreted languages)?

And I wonder how to proceed, given my current skill set, should I want to push things?

If you wanted to remove the generated files for all packages, you would probably need to create a hook and override buildPythonPackage to use the “removeBinCodeAfterInstallHook” (or whatever you want to call it) by default.

The main downside of this is that you would essentially be diverging from master and would have to build all Python packages locally.

Here’s a PR adding such a Python hook: “buildPythonPackage: recompile bytecode for reproducibility” by FRidh (NixOS/nixpkgs#90208). You could probably upstream a “removeBinaryOutputHook” (or something better) which by default doesn’t get included, and then in an overlay override buildPythonPackage to switch the hook to being used by default.
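
For concreteness, a rough sketch of how that overlay could look. This is untested; the hook name is invented, and the exact phase wiring may need adjusting, since the tests regenerate bytecode after installation:

  final: prev: {
    # hypothetical setup hook that deletes the bytecode from $out; it is
    # registered for postCheck as well, because the tests import the
    # installed package and recreate the py[co] files
    removePycHook = prev.makeSetupHook { name = "remove-pyc-hook"; } (
      prev.writeText "remove-pyc-hook.sh" ''
        removePyc() {
          find "$out" -name __pycache__ -type d -prune -exec rm -rf {} +
          find "$out" -name '*.py[co]' -type f -delete
        }
        postInstallHooks+=(removePyc)
        postCheckHooks+=(removePyc)
      ''
    );

    # switch the hook on by default for everything built with buildPythonPackage
    python3 = prev.python3.override {
      packageOverrides = pyfinal: pyprev: {
        buildPythonPackage = args:
          pyprev.buildPythonPackage (args // {
            nativeBuildInputs = (args.nativeBuildInputs or [ ]) ++ [ final.removePycHook ];
          });
      };
    };
  }

As said above, this diverges from the binary cache, so all Python packages get rebuilt locally.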

Is there any progress on this front? The following docker image takes 1.5 GB:

  pythonRunPackages = with python.pkgs; [
    python
    attrs
    jupyter
    matplotlib
    nbconvert
    nbformat
    numpy
    pandas
    pytest
    scipy
    seaborn
    structlog
  ];

  docker-image = pkgs.dockerTools.buildImage {
    name = "myimage";
    tag = "latest";
    copyToRoot = [
      pkgs.dockerTools.usrBinEnv
      pkgs.dockerTools.binSh
      pkgs.coreutils
      pkgs.gnugrep
      (python.withPackages (ps: pythonRunPackages))
    ];
  };

An image built from a Dockerfile with the same packages needs only 650 MB.

Building minimal docker images requires some work, and potentially recompiling stuff from source.

It is nothing that comes for free.

Usually the Nix build model gives us a good head start, but not always.

Getting an Erlang application even close to something comparable to a naïve and idiomatic docker approach took me days. Getting it to something that was even smaller took me another year of (occasional) investigation and PRs to different places.

Unless someone else investigates this issue before then, I will take a closer look in about 4 to 6 hours.

Absolutely! Not only for python.

Strategically speaking (in light of increasing its appeal outside of the community), I find this is a not yet widely recognized field of improvement for Nixpkgs: optimizing the distribution model for minimal bandwidth use.

There are attempts to solve this problem through smart deduplication and compression strategies (look for casync, “rolling hash”, chunking, zstd:chunked & friends).

But I feel that the classical tactics are underexploited: static builds, dependency conciseness (maybe? / sometimes?), and, as we see here, stripping the precompiled leftovers of interpreted languages.

They are definitely possible. But it doesn’t look like there’s already a “beaten happy path” in the community.

What can we do about it?

  • Apply for and join the Nixpkgs Architecture Team
  • Bring these topics to its agenda
  • Create awareness (how?)
  • Take action (PRs)

But maybe the discussion isn’t yet ripe for this?

In general, each maintainer should strive to keep runtime closure sizes small and to reuse as much as possible from nixpkgs, so that shared dependencies can be used when “combining” closures in a single environment.

Sadly this is not always easy, and we have to weigh using default stuff against using minified stuff.

We have to face the facts: nixpkgs is mostly built with NixOS in mind, and closures have been optimized to integrate properly there.

The docker builders are nice and work well enough for some ad-hoc stuff, but I really wouldn’t trust them beyond that “ad-hoc defined isolated container”. As soon as I need a container that I actually want to run permanently, or even distribute, I would make sure that its size is small and that the NixOS integration stuff got removed, leaving it with what it really needs.

I have seen systemd slipping into many containers, though it’s usually not needed there, while it doesn’t hurt in most other cases.

There are of course other things that might be the issue in the image described here.

I slightly adjusted the expression to make it build for me… And I get a compressed size of ~490 MiB.

Uncompressed I get 1.5 GiB indeed.

But what I consider a bit weird is that I see a runtime closure size of 1.1 GiB for the stuff that gets copied to root, so another 400 MiB has to be added somewhere…

On the other hand, the jupyter closure alone is 666.5 MiB, so I can’t really believe that plain docker would be below that… Perhaps it would be easier if you also showed the Dockerfile that you used for the comparison build.

For reference, I used the flake as in the spoiler to do my inspection:

flake.nix
{
  inputs.nixpkgs.url = "github:nixos/nixpkgs?ref=nixos-unstable";

  outputs = {
    self,
    nixpkgs,
  }: {
    packages.x86_64-linux.default = self.packages.x86_64-linux.docker;

    packages.x86_64-linux.docker = let
      pkgs = nixpkgs.legacyPackages.x86_64-linux;
    in
      pkgs.dockerTools.buildImage {
        name = "myimage";
        tag = "latest";
        copyToRoot = [self.packages.x86_64-linux.copyToRoot];
      };

    packages.x86_64-linux.copyToRoot = let
      pkgs = nixpkgs.legacyPackages.x86_64-linux;
      pythonRunPackages = ps:
        with ps; [
          python
          attrs
          jupyter
          matplotlib
          nbconvert
          nbformat
          numpy
          pandas
          pytest
          scipy
          seaborn
          structlog
        ];
    in
      pkgs.symlinkJoin {
        name = "myCopyToRoot";
        paths = [
          pkgs.dockerTools.usrBinEnv
          pkgs.dockerTools.binSh
          pkgs.coreutils
          pkgs.gnugrep
          (pkgs.python3.withPackages pythonRunPackages)
        ];
      };
  };
}

Have you tried the effect of pkgs.pkgsStatic.python3 (i.e. “the alpine mantra”)? Assuming that this would work at all and faithfully take static versions of the binaries in the dependency closure.

Also, Python has, iirc, a minimal version (python3Minimal in nixpkgs) without some of the optional dependencies.
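
In expression form, these two suggestions would be swaps along these lines in the flake above (hypothetical; whether the scientific packages even build in either configuration is another question):

  (pkgs.pkgsStatic.python3.withPackages pythonRunPackages)  # static, "alpine mantra"
  (pkgs.python3Minimal.withPackages pythonRunPackages)      # interpreter without optional deps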

And as mentioned above, I doubt .pyc files have been stripped in recent nixpkgs in the meantime. So there we have, roughly speaking, a factor of 2 on some subset of the closure.

A pragmatic approach would be to remove all unnecessary files from the resulting container (which is just a tar).
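
A sketch of that idea one step earlier in the pipeline (pruning the finished tar would invalidate the layer digests recorded in the image config), reusing the names from the flake above: dereference the Python environment into a fresh store path and prune the bytecode there, before dockerTools packs it.

  prunedPythonEnv = pkgs.runCommand "python-env-without-pyc" { } ''
    # -L dereferences the environment's symlinks so we get real files we can
    # delete; note the copy no longer shares store paths with other images,
    # and the interpreter itself (with the stdlib bytecode) stays in the closure
    cp -rL ${pkgs.python3.withPackages pythonRunPackages} $out
    chmod -R u+w $out
    find $out \( -name __pycache__ -o -name '*.py[co]' \) -prune -exec rm -rf {} +
  '';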

I don’t think that’s a convincing strategy, because we may assume the runtime closure is already well-specced in nixpkgs. There may occasionally be a maintainer mistake, but I wouldn’t assume it to be a systemic issue, unless you’ve seen otherwise in nixpkgs.

But if it turns out that there is a systemic issue, we have a really good Problem Statement that we should track to its root and eventually take to the Nixpkgs Architecture Team.

One systemic issue could be that packages are typically very eager and maximalist in declaring runtime dependencies… But on the other hand, that question may be very runtime-specific.

One solution to such a problem could be a post-processing tool that observes the application at runtime, keeping a record of the open calls to files in the nix store, and then reverse-engineers which dependencies may be excluded from the closure.