Force only one python instance (in a nix-built container image or devShell)

I’m experimenting with a container-image-builder flake. The goal is to do away with LAN-local, VM-based gitlab-runners completely, while still being able to run native NixOS devShells and get the same results in gitlab-runners that use the docker-executor with an image built from the same inputs as the devShell.

Now I was (again) bitten by a longstanding issue: I always seem to get multiple python3 realisations in my env/container. If the wrong one is found (e.g. by ansible), it obviously doesn’t contain the python modules I need and throws errors.

Now in my container the one under /bin/python3 is the correct one, and executing:

podman run --rm -it -v $(pwd):$(pwd) -w $(pwd) -e ANSIBLE_HOME=".submodules/ansible-defs" -e ANSIBLE_CONFIG=".submodules/ansible-defs/ansible.cfg" -e PRIVTOKEN docker://1nnoserv:15000/devops-imgs/ansible-omni-deploy:ansible2.15.0 ./deliver.sh

works fine.

However, if I execute the same script from a .gitlab-ci.yaml using the same image in a docker-executor, it somehow finds the wrong python3 interpreter (one that I actually don’t even want in my container!) and throws an error:

Failed to import the required Python library (pexpect) on runner-tb1epk3--project-39-concurrent-0's Python /nix/store/lwzzgbnj41d657lpxczk6l5f7d5zcnj1-python3-3.10.11/bin/python3.10

Sure enough, it’s the wrong one.

It can be seen that the container has a python3 instance I don’t want or need (the second one, which is the one chosen by ansible in my CI):

bash-5.2# readlink $(which python3)
/nix/store/9ab2488gcawn919c5rpqd7hgp0dy153r-python3-3.10.11-env/bin/python3
bash-5.2# find /nix/store -wholename '*/bin/python3'
/nix/store/9ab2488gcawn919c5rpqd7hgp0dy153r-python3-3.10.11-env/bin/python3
/nix/store/lwzzgbnj41d657lpxczk6l5f7d5zcnj1-python3-3.10.11/bin/python3
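
A small diagnostic along these lines can make the relationship between the two store paths easier to see. This is only a sketch: `list_pythons` is a hypothetical helper name, the directory to scan is passed as an argument, and it assumes a `readlink` that supports `-f` (GNU coreutils or busybox).

```sh
#!/bin/sh
# Hypothetical helper: list every bin/python3 below a store-like directory
# and mark the one that the python3 on PATH resolves to.
list_pythons() {
    store_dir=$1
    active=$(command -v python3 2>/dev/null) || active=
    # readlink -f follows the whole symlink chain to the final target.
    [ -n "$active" ] && active=$(readlink -f "$active")
    find "$store_dir" -path '*/bin/python3' | sort | while IFS= read -r candidate; do
        if [ "$(readlink -f "$candidate")" = "$active" ]; then
            printf 'active  %s\n' "$candidate"
        else
            printf 'other   %s\n' "$candidate"
        fi
    done
}
```

In the container this would be called as `list_pythons /nix/store`.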

So what I actually want is to make absolutely sure that only one python3 is instantiated in my devShell/container, thereby saving (image) space and solving this issue.

The flake is here for reference.

Should I be overriding the python3 attribute on the pkgs level? If so, which one is the “top-level” attribute to be overridden?

This isn’t a full answer, since it isn’t immediately clear why the “wrong” python is getting used; I’m just trying to nudge your mental model in the right direction.

There aren’t really two pythons here. python3-3.10.11-env/bin/python3 is just a wrapper for python3-3.10.11/bin/python3.

$ nix why-depends .#devShell.x86_64-linux /nix/store/lwzzgbnj41d657lpxczk6l5f7d5zcnj1-python3-3.10.11
/nix/store/cqh7dqgdgshg5kni9rrqgwr1k22hcgn1-nix-shell
└───/nix/store/wz7ap6iaiy52d7llscq6q58q718hv51d-mypy
    └───/nix/store/9ab2488gcawn919c5rpqd7hgp0dy153r-python3-3.10.11-env
        └───/nix/store/lwzzgbnj41d657lpxczk6l5f7d5zcnj1-python3-3.10.11

$ nix-shell -p binutils --command 'strings /nix/store/9ab2488gcawn919c5rpqd7hgp0dy153r-python3-3.10.11-env/bin/python'

/nix/store/dg8mpqqykmw9c7l0bgzzb5znkymlbfjw-glibc-2.37-8/lib/ld-linux-x86-64.so.2
__libc_start_main
execv
putenv
libc.so.6
GLIBC_2.2.5
GLIBC_2.34
/nix/store/dg8mpqqykmw9c7l0bgzzb5znkymlbfjw-glibc-2.37-8/lib:/nix/store/sm14bmd3l61p5m0q7wa5g7rz2bl6azqf-gcc-12.2.0-lib/lib
__gmon_start__
PTE1
H=(@@
NIX_PYTHONPREFIX=/nix/store/9ab2488gcawn919c5rpqd7hgp0dy153r-python3-3.10.11-env
NIX_PYTHONEXECUTABLE=/nix/store/9ab2488gcawn919c5rpqd7hgp0dy153r-python3-3.10.11-env/bin/python3.10
NIX_PYTHONPATH=/nix/store/9ab2488gcawn919c5rpqd7hgp0dy153r-python3-3.10.11-env/lib/python3.10/site-packages
PYTHONNOUSERSITE=true
/nix/store/lwzzgbnj41d657lpxczk6l5f7d5zcnj1-python3-3.10.11/bin/python
# ------------------------------------------------------------------------------------
# The C-code for this binary wrapper has been generated using the following command:
makeCWrapper '/nix/store/lwzzgbnj41d657lpxczk6l5f7d5zcnj1-python3-3.10.11/bin/python' \
    --set 'NIX_PYTHONPREFIX' '/nix/store/9ab2488gcawn919c5rpqd7hgp0dy153r-python3-3.10.11-env' \
    --set 'NIX_PYTHONEXECUTABLE' '/nix/store/9ab2488gcawn919c5rpqd7hgp0dy153r-python3-3.10.11-env/bin/python3.10' \
    --set 'NIX_PYTHONPATH' '/nix/store/9ab2488gcawn919c5rpqd7hgp0dy153r-python3-3.10.11-env/lib/python3.10/site-packages' \
    --set 'PYTHONNOUSERSITE' 'true'
# (Use `nix-shell -p makeBinaryWrapper` to get access to makeCWrapper in your shell)
# ------------------------------------------------------------------------------------
;*3$"
GCC: (GNU) 12.2.0
...

You won’t be able to remove this other Python. The fix may still be overriding a reference to python3 with the python-env with your packages, but I’m not sure. (There are other possibilities…)
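
The wrapper relationship visible in the strings output can be imitated with two throwaway scripts. This is a sketch, not what makeCWrapper generates: the real wrapper is a compiled binary, and the `tool` scripts, the temporary directory and the `DEMO_PREFIX` variable are hypothetical stand-ins for the interpreter and the `NIX_PYTHON*` variables.

```sh
#!/bin/sh
# Imitation of the wrapper/real-binary relationship, using throwaway files.
# All paths and the DEMO_PREFIX variable are hypothetical stand-ins.
unset DEMO_PREFIX
demo=$(mktemp -d)
mkdir -p "$demo/real/bin" "$demo/env/bin"

# The "real" program: reports the prefix it was handed via the environment.
cat > "$demo/real/bin/tool" <<'EOF'
#!/bin/sh
echo "prefix=${DEMO_PREFIX:-unset}"
EOF

# The "env" wrapper: sets a variable, then replaces itself with the real program.
cat > "$demo/env/bin/tool" <<EOF
#!/bin/sh
DEMO_PREFIX='$demo/env'
export DEMO_PREFIX
exec '$demo/real/bin/tool' "\$@"
EOF

chmod +x "$demo/real/bin/tool" "$demo/env/bin/tool"

wrapped=$("$demo/env/bin/tool")    # goes through the wrapper
direct=$("$demo/real/bin/tool")    # bypasses the wrapper
echo "$wrapped"
echo "$direct"
```

Calling the real program directly skips the environment the wrapper would have set, which is the same shape as the unwrapped python3 not seeing the env’s site-packages.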

I’m a little suspicious of (ansible-lint.override { python3=py; }). When I run nix-shell -p ansible-lint, I do indeed see pexpect in the list of packages that Nix tries to download. Did you add the override trying to fix this same issue?

I looked around (with this GH code search) and found at least two examples where it seems like people are trying to work around some kind of trouble with ansible-lint:

I’m curious what happens if you try the workaround that the 2nd link is using.

Indeed it’s related, but the other way around: I figured it’s the only package in my buildInputs that depends (directly) on python3, so I hoped that overriding it with this explicit reference would make the other reference (to the “wrong” python) “go away”. Even if that python is still in the nix store and referenced by a wrapper, as your analysis shows, it would suffice if it’s not directly accessible/linked into the current env. But alas…

I’ll be checking out the pointers you gave and report back…

It might be a coincidence / red herring.

I don’t know if any other python packages in the environment are also depending on pexpect. (I assume this same issue is why you tried adding pexpect to the python environment?)

It might be worth verifying this by either adding some debug output around whatever runs ansible-lint, or commenting it out.

Indeed ansible-lint might be a red herring: I’m using (p)expect myself in ansible scripts, and I do not (yet) rely on ansible-lint in the CI, only in the devShell. If I comment out ansible-lint entirely, however, I still get the unwanted reference in my container, and ansible chooses the wrong interpreter.

Interestingly, after removing bash from the buildInputs and updating my scripts (the ones that execute ansible-playbook) to use #!/usr/bin/env sh, I now also get the same error in the simulated container run (directly with podman). At least it’s reproducible now 😜.

EDIT: it gets weirder. With the same image that failed executing the playbook requiring pexpect, if I start sh in it and do:

sh-5.2# file /bin/{python3,ansible*}
/bin/python3:            symbolic link to /nix/store/9ab2488gcawn919c5rpqd7hgp0dy153r-python3-3.10.11-env/bin/python3
/bin/ansible:            symbolic link to /nix/store/9ab2488gcawn919c5rpqd7hgp0dy153r-python3-3.10.11-env/bin/ansible
/bin/ansible-community:  symbolic link to /nix/store/9ab2488gcawn919c5rpqd7hgp0dy153r-python3-3.10.11-env/bin/ansible-community
/bin/ansible-config:     symbolic link to /nix/store/9ab2488gcawn919c5rpqd7hgp0dy153r-python3-3.10.11-env/bin/ansible-config
/bin/ansible-connection: symbolic link to /nix/store/9ab2488gcawn919c5rpqd7hgp0dy153r-python3-3.10.11-env/bin/ansible-connection
/bin/ansible-console:    symbolic link to /nix/store/9ab2488gcawn919c5rpqd7hgp0dy153r-python3-3.10.11-env/bin/ansible-console
/bin/ansible-doc:        symbolic link to /nix/store/9ab2488gcawn919c5rpqd7hgp0dy153r-python3-3.10.11-env/bin/ansible-doc
/bin/ansible-galaxy:     symbolic link to /nix/store/9ab2488gcawn919c5rpqd7hgp0dy153r-python3-3.10.11-env/bin/ansible-galaxy
/bin/ansible-inventory:  symbolic link to /nix/store/9ab2488gcawn919c5rpqd7hgp0dy153r-python3-3.10.11-env/bin/ansible-inventory
/bin/ansible-playbook:   symbolic link to /nix/store/9ab2488gcawn919c5rpqd7hgp0dy153r-python3-3.10.11-env/bin/ansible-playbook
/bin/ansible-pull:       symbolic link to /nix/store/9ab2488gcawn919c5rpqd7hgp0dy153r-python3-3.10.11-env/bin/ansible-pull
/bin/ansible-test:       symbolic link to /nix/store/9ab2488gcawn919c5rpqd7hgp0dy153r-python3-3.10.11-env/bin/ansible-test
/bin/ansible-vault:      symbolic link to /nix/store/9ab2488gcawn919c5rpqd7hgp0dy153r-python3-3.10.11-env/bin/ansible-vault
sh-5.2# python
Python 3.10.11 (main, Apr  4 2023, 22:10:32) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pexpect
>>>

So everything looks good and as intended, but ansible somehow still manages to get the wrong interpreter 🤯

Hehe. One of those other possibilities is that something is being under-specified and left up to the environment. nix-shell/devShell try to provide an environment humans can readily hack around in, so by default they let much of your system/user environment leak in (which from time to time will obscure the fact that something isn’t well specified).

You might be able to validate that hypothesis by switching back to bash and trying the dev shell with --ignore-environment to see if it reproduces the failure. (Though I don’t think even this will be a perfect proxy…)

Can you elaborate a bit on how you know this and maybe show more of the error context? (It’s unclear to me if you’re still seeing the same error message, how you know the interpreter is wrong, and how you know ansible is where it’s going wrong. The details may help clarify where the loose threads are…)

To be clear, in most of the above I was mainly referring to container instances executing ansible playbooks (manually via podman vs. via the docker-executor of a gitlab-runner, where I have limited control over the container execution); these now behave the same in this context. But indeed I can reproduce it in the devShell.

The ansible error shows the full store path of the “wrong” interpreter, but it leaves out crucial information about how it actually found it. The peculiar thing is, it finds the correct py-env to get the pexpect module, but then tries to execute it with the wrong interpreter (even though it suggests it’s just trying /bin/python3.10 etc., which is there in the container and the “correct” one).

TASK [Testing remote connection using expect] ****************************************************************************************************
task path: /home/jeroen/devel/g1tlab/r/deploy/rctestenv.rcregistration/.submodules/rcregistration/_test-interpreter.ansible.yml:20
<localhost> ESTABLISH LOCAL CONNECTION FOR USER: root
<localhost> EXEC /bin/sh -c '( umask 77 && mkdir -p "` echo /var/tmp `"&& mkdir "` echo /var/tmp/ansible-tmp-1687935594.4501653-394-122301524649235 `" && echo ansible-tmp-1687935594.4501653-394-122301524649235="` echo /var/tmp/ansible-tmp-1687935594.4501653-394-122301524649235 `" ) && sleep 0'
<pisatellite> Attempting python interpreter discovery
<localhost> EXEC /bin/sh -c 'echo PLATFORM; uname; echo FOUND; command -v '"'"'python3.11'"'"'; command -v '"'"'python3.10'"'"'; command -v '"'"'python3.9'"'"'; command -v '"'"'python3.8'"'"'; command -v '"'"'python3.7'"'"'; command -v '"'"'python3.6'"'"'; command -v '"'"'python3.5'"'"'; command -v '"'"'/usr/bin/python3'"'"'; command -v '"'"'/usr/libexec/platform-python'"'"'; command -v '"'"'python2.7'"'"'; command -v '"'"'/usr/bin/python'"'"'; command -v '"'"'python'"'"'; echo ENDFOUND && sleep 0'
<localhost> EXEC /bin/sh -c '/nix/store/lwzzgbnj41d657lpxczk6l5f7d5zcnj1-python3-3.10.11/bin/python3.10 && sleep 0'
<pisatellite> Python interpreter discovery fallback (unable to get Linux distribution/version info)
Using module file /nix/store/9ab2488gcawn919c5rpqd7hgp0dy153r-python3-3.10.11-env/lib/python3.10/site-packages/ansible/modules/expect.py
<localhost> PUT /home/jeroen/devel/g1tlab/r/deploy/rctestenv.rcregistration/.submodules/ansible-defs/tmp/ansible-local-371oarlchpw/tmpjh1urw0j TO /nix/store/12gimb0hnqgisvzwk1nf30rlgz49l2yd-fake-nss/var/tmp/ansible-tmp-1687935594.4501653-394-122301524649235/AnsiballZ_expect.py
<localhost> EXEC /bin/sh -c 'chmod u+x /var/tmp/ansible-tmp-1687935594.4501653-394-122301524649235/ /var/tmp/ansible-tmp-1687935594.4501653-394-122301524649235/AnsiballZ_expect.py && sleep 0'
<localhost> EXEC /bin/sh -c '/nix/store/lwzzgbnj41d657lpxczk6l5f7d5zcnj1-python3-3.10.11/bin/python3.10 /var/tmp/ansible-tmp-1687935594.4501653-394-122301524649235/AnsiballZ_expect.py && sleep 0'
<localhost> EXEC /bin/sh -c 'rm -f -r /var/tmp/ansible-tmp-1687935594.4501653-394-122301524649235/ > /dev/null 2>&1 && sleep 0'
The full traceback is:
Traceback (most recent call last):
  File "/var/tmp/ansible_ansible.builtin.expect_payload_ka5qy6cq/ansible_ansible.builtin.expect_payload.zip/ansible/modules/expect.py", line 117, in <module>
ModuleNotFoundError: No module named 'pexpect'
[WARNING]: Platform linux on host pisatellite is using the discovered Python interpreter at
/nix/store/lwzzgbnj41d657lpxczk6l5f7d5zcnj1-python3-3.10.11/bin/python3.10, but future installation of another Python interpreter could change
the meaning of that path. See https://docs.ansible.com/ansible-core/2.15/reference_appendices/interpreter_discovery.html for more information.

results from doing (in the container):

ansible-playbook -vvv _test-interpreter.ansible.yml

(the script executes a simple ansible.builtin.expect task)

BUT: if I do in the container:

sh-5.2# readlink $(command -v python3.10)
/nix/store/9ab2488gcawn919c5rpqd7hgp0dy153r-python3-3.10.11-env/bin/python3.10

which is the correct python environment.
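
The probe from the discovery log can be replayed by hand to see what each candidate name resolves to. This is a sketch only: `first_python` is a hypothetical helper, the candidate list is copied from the log above, and stopping at the first hit is a simplification rather than Ansible’s full selection logic. It assumes a `readlink` with `-f`.

```sh
#!/bin/sh
# Replays the candidate list from the discovery log and prints the first
# name that `command -v` resolves, plus its fully resolved target.
first_python() {
    for name in python3.11 python3.10 python3.9 python3.8 python3.7 \
                python3.6 python3.5 /usr/bin/python3 \
                /usr/libexec/platform-python python2.7 /usr/bin/python python; do
        if path=$(command -v "$name" 2>/dev/null); then
            printf '%s -> %s\n' "$path" "$(readlink -f "$path")"
            return 0
        fi
    done
    return 1
}

first_python || echo "no candidate found"
```

If this prints the env path while Ansible still reports the unwrapped store path, the difference is introduced after the `command -v` step and not by PATH lookup.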

One thing I failed to mention before: the test script looks like this. Notably, it forces most commands to be executed on the ansible controller, while running a barebones ssh session with connection data from hosts.yaml (which is what I need for my use case, since some targets are semi-embedded and some don’t have python or pub-key auth).

- hosts: pisatellite
  name: Test whether the correct python interpreter is used by using expect

  tasks:

    - name: Testing remote connection using expect
      ansible.builtin.expect:
        timeout: 60
        command: |
          sh -c '\
            echo "We ($(hostname -s)) tell you to report your hostname..." | \
              ssh {{ ssh_custom_args }} -p{{ ansible_ssh_port }} {{ ansible_user }}@{{ ansible_host }} "\
                cat -;
                echo Hello from $(hostname -s);
          "'
        responses:
          (.*)password: "{{ ansible_ssh_pass }}"
      delegate_to: localhost
      connection: local

BTW, circling back to the binary wrapper topic: as I now understand (and checked) all binaries in my /nix/store/9ab2488gcawn919c5rpqd7hgp0dy153r-python3-3.10.11-env/bin/ path are binary wrappers, each weighing 15K, which seems awfully big considering they just need to set a few env vars?

I just found this article and sure enough my inventory also had a localhost entry…

This appears to be the root cause of the botched ansible interpreter discovery issue (i.e. removing it fixed things)!
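
Removing the localhost entry is what fixed it in this thread. A possible alternative, not verified against this setup, would be to skip discovery by pinning the interpreter explicitly. The sketch below assumes that the python3 found first on PATH is the env one, as it is in the container above. `ANSIBLE_PYTHON_INTERPRETER` is the environment form of Ansible’s `interpreter_python` setting.

```sh
#!/bin/sh
# Pin the interpreter Ansible uses for module execution to the python3
# that is first on PATH (assumed here to be the env's python3).
if py=$(command -v python3 2>/dev/null); then
    export ANSIBLE_PYTHON_INTERPRETER="$py"
    echo "pinned to $py"
else
    echo "python3 not on PATH; leaving discovery as is" >&2
fi
```

Note that this setting applies to every host that has no ansible_python_interpreter of its own, so it only suits setups like this one where the modules run on the controller.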

Lessons learnt:

  • a nix Python buildEnv environment (such as python3-3.10.11-env above) is implemented using binary wrappers (generated by makeCWrapper under the hood) that execute the “normal” binaries (i.e. package realisations from nixpkgs in /nix/store/) with added context (mostly env vars) that makes up the properties of the env.
  • subtle things in “global” configs can break things badly, possibly without producing meaningful errors.
  • since binary wrappers are relatively large (~15kB) and go “all-in” (i.e. python, python3, python3.10 are all wrappers, no symlinks used there), one might consider using buildEnv sparingly if possible.
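
To put a number on the wrapper overhead, something like the following could be run against the env’s bin directory. It is a sketch: `bin_summary` is a hypothetical helper, and `-maxdepth` is a GNU/BSD/busybox `find` extension, not POSIX.

```sh
#!/bin/sh
# Summarise a bin directory: how many entries are regular files (wrappers or
# real binaries), how many are symlinks, and the total size of the files.
bin_summary() {
    dir=$1
    # $(( ... )) normalises the whitespace some wc implementations emit.
    files=$(( $(find "$dir" -maxdepth 1 -type f | wc -l) ))
    links=$(( $(find "$dir" -maxdepth 1 -type l | wc -l) ))
    bytes=$(( $(find "$dir" -maxdepth 1 -type f -exec cat {} + | wc -c) ))
    printf 'files=%d links=%d bytes=%d\n' "$files" "$links" "$bytes"
}
```

For the env discussed here the call would be `bin_summary /nix/store/9ab2488gcawn919c5rpqd7hgp0dy153r-python3-3.10.11-env/bin`.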

Thanks @abathur for bearing with me and showing some nix introspection tricks along the way, increasing my understanding of buildEnv.
