Very strange runtime error when running a built script in a shell

I have a problem that defies all my intuitions.
A script from the Apache Spark package fails with ps: command not found at runtime.
But ps is available in the Nix shell.

I’m very interested in understanding the cause of it. The way I set up Spark is a bit hacky, but I don’t think it should cause issues.

Here is a repo where the issue can be reproduced: GitHub - Zhen-hao/spark3-nix-shell

A bit of a shot in the dark, but possibly Spark resets the PATH at some point before trying to run ps (which wouldn’t be that surprising; I’ve seen other tools do that for whatever reason).

Good point. I can’t see any PATH resetting in the bash scripts, unless some binaries get called that change the PATH.

What script? How are you running it? What does it invoke?

I downloaded the whole honkin’ archive and my best guess from grep -rnE "\bps\b" ~/Downloads/spark-3.1.1-bin-without-hadoop/**/*.sh is that this is one of:

spark-3.1.1-bin-without-hadoop/sbin/spark-daemon.sh
spark-3.1.1-bin-without-hadoop/sbin/stop-all.sh

I don’t see anything obvious in spark-daemon, but the only match in stop-all could certainly cause trouble:

stop-all.sh:39:    running=`${SPARK_HOME}/sbin/workers.sh ps -ef | grep -v grep | grep deploy.worker.Worker`

The usage description for workers.sh says:

Run a shell command on all worker hosts.

If it is indeed this script, the next step probably involves investigating the environment of the Spark workers.

Thanks for checking.
The command I used is $SPARK_HOME/sbin/start-master.sh, which shouldn’t start any workers.

Sorry, I guess I jumped to the .nix file a bit too quickly.

I don’t see anything terribly obvious in the shell scripts that I downloaded from 3.1.1, but I did notice a few more weird things that might be worth following up on:

  1. While I was reading bin/load-spark-env.sh from the archive locally, I noticed a reference to spark-env.sh, which I didn’t see in the archive. I didn’t know what to make of it at the time, but while looking up the nixpkgs expression for #2 below, I saw where the nixpkgs expression creates it: nixpkgs/default.nix at 8284fc30c84ea47e63209d1a892aca1dfcd6bdf3 · NixOS/nixpkgs · GitHub

    That may well be clobbering the PATH you are setting. It looks like you could test this theory by also setting export SPARK_ENV_LOADED=1; see the sketch after this list. (That is probably not a real fix unless you also take over setting some of what is in bin/load-spark-env.sh.)

  2. There are two references to 2.4.3 at:

    I guess this is just the version string baked into the installPhase from the original derivation (so that part might be fine), but it is possible that there have been changes between 2.4.3 and 3.1.1, especially given the major version bump, that the existing package may not account for.
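
If you want to try the theory from #1, here is a minimal sketch of how SPARK_ENV_LOADED could be set from a shell.nix. It assumes the shell is built with mkShell and a shellHook, which is a guess about the repo rather than its actual contents:

    # Sketch only; adapt to however the shell is actually defined.
    { pkgs ? import <nixpkgs> {} }:

    pkgs.mkShell {
      buildInputs = [ pkgs.spark ];
      shellHook = ''
        # bin/load-spark-env.sh only sources conf/spark-env.sh when this
        # variable is unset, so exporting it here should keep that file
        # from clobbering the PATH. Probably not a real fix on its own.
        export SPARK_ENV_LOADED=1
      '';
    }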

You are right. My hack to get Spark 3 may not work.
I don’t want anyone to spend too much time digging into this because of my dirty hack.

I was hoping to learn what could possibly cause the error. My intuition was that the PATH in the shell is visible to everything run in it.

Even without my Spark 3 (spark-nt) hack, the version from the archive already has the problem.

I can’t test the latest version on nixpkgs because it fails to build.

Did you try setting the env var I described in #1?

Yes, but then it would report that the master has already started, immediately after running the command.

That’s still progress of some sort! :slight_smile:

There may ultimately be some sort of incompatibility between 2.x and 3.x that will cause trouble, but I think there are still a few simple-ish things to try before coming to that conclusion.

Can you try:

  • removing export SPARK_ENV_LOADED=1 if you still have it set
  • removing PATH from your shellHook
  • in your override of spark, append procps to its buildInputs with something like buildInputs = old.buildInputs ++ [ pkgs.procps ]; (see the sketch below)
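
Putting that last item together, a rough sketch of what the override might look like in a shell.nix (the surrounding structure is a guess; the overrideAttrs part is the actual suggestion):

    # Sketch only; the shape of the shell.nix around the override is assumed.
    { pkgs ? import <nixpkgs> {} }:

    let
      # ps is provided by procps; appending it to spark's buildInputs
      # is the change being suggested above.
      spark = pkgs.spark.overrideAttrs (old: {
        buildInputs = old.buildInputs ++ [ pkgs.procps ];
      });
    in
    pkgs.mkShell {
      buildInputs = [ spark ];
    }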

For a little context:

I think you’re just running into some wonkiness around a ~gap in the Nix ecosystem roughly caused by the fact that shell scripts aren’t compiled:

  1. It’s easy for packagers to miss dependencies in shell scripts because Nix doesn’t have a process that’ll break/fail due to missing commands at build time.
  2. When a package’s scripts contain bare command invocations, we either have to:
    • add (~leak) all of the script’s runtime dependencies to the user or system PATH
    • find some way to patch in a fixed PATH at build time

I suspect #2 explains why the nixpkgs derivation for spark builds a fixed PATH into conf/spark-env.sh, and #1 explains why ps is missing (there may also be more).
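
As an illustration of pattern 2 (this is not the actual nixpkgs spark expression, just the shape of the idea), a derivation can generate an env file whose PATH entries are store paths fixed at build time, for example:

    # Illustration only; the real derivation generates conf/spark-env.sh
    # during its installPhase, but the idea is the same.
    { pkgs ? import <nixpkgs> {} }:

    # The generated file hard-codes store paths, so a script that sources
    # it can find ps and friends no matter what the caller's PATH is.
    pkgs.writeText "spark-env.sh" ''
      export PATH="$PATH:${pkgs.procps}/bin:${pkgs.coreutils}/bin"
    ''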

(I have been building resholve to address this ecosystem gap [i.e., resholve breaks a build when a dependency is missing, and rewrites bare invocations to absolute store paths], but the initial Nix API is focused on straightforward packages of shell scripts. The Nix integration needs more work before it can ~easily tackle cases like this.)

I added this and it works!!

Thank you for the solution and the explanation. Now I understand the issue much better :slight_smile:

Also, resholve is great. Thank you for making it!
