Intermittent crash in sshd-session, with malloc hardening

I’m seeing an intermittent crash in sshd-session, occurring once or twice a day on average (as a result of the constant probing that any machine listening on port 22 must endure):

Sep 29 15:09:33 tinka sshd[567]: error: session process 8913 for connection from [addr] to [me] killed by signal 11
Sep 30 07:03:43 tinka sshd[567]: error: session process 3501 for connection from [addr] to [me] killed by signal 11
Sep 30 07:36:47 tinka sshd[567]: error: session process 3580 for connection from [addr] to [me] killed by signal 11
Oct 01 22:58:40 tinka sshd[580]: error: session process 14767 for connection from [addr] to [me] killed by signal 11
Oct 03 02:08:06 tinka sshd[580]: error: session process 24671 for connection from [addr] to [me] killed by signal 11
Oct 04 22:38:01 tinka sshd[580]: error: session process 36547 for connection from [addr] to [me] killed by signal 11
Oct 05 00:15:39 tinka sshd[580]: error: session process 37110 for connection from [addr] to [me] killed by signal 11
Oct 05 22:06:52 tinka sshd[580]: error: session process 42567 for connection from [addr] to [me] killed by signal 11
Oct 06 09:44:42 tinka sshd[580]: error: session process 48825 for connection from [addr] to [me] killed by signal 11
Oct 06 11:42:06 tinka sshd[580]: error: session process 50376 for connection from [addr] to [me] killed by signal 11
Oct 08 00:19:40 tinka sshd[583]: error: session process 10045 for connection from [addr] to [me] killed by signal 11
Oct 08 07:35:24 tinka sshd[583]: error: session process 13044 for connection from [addr] to [me] killed by signal 11
Oct 08 12:29:39 tinka sshd[583]: error: session process 14224 for connection from [addr] to [me] killed by signal 11

I turned on core dumps and managed to capture a stack trace or two … and they point at the guts of the dynamic linker.

(gdb) backtrace
#0  0x00007bb751c1919d in _dl_find_object_update_1 (loaded=0x7bb751425000, count=<optimized out>) at dl-find_object.c:754
#1  _dl_find_object_update (new_map=new_map@entry=0x7957505ecc10) at dl-find_object.c:814
#2  0x00007bb751c013e8 in dl_open_worker_begin (a=a@entry=0x7ffd04f99830) at dl-open.c:731
#3  0x00007bb751bf64f1 in __GI__dl_catch_exception (exception=exception@entry=0x7ffd04f99690, operate=operate@entry=0x7bb751c00f30 <dl_open_worker_begin>,
    args=args@entry=0x7ffd04f99830) at dl-catch.c:241
#4  0x00007bb751c00767 in dl_open_worker (a=a@entry=0x7ffd04f99830) at dl-open.c:778
#5  0x00007bb751bf64f1 in __GI__dl_catch_exception (exception=exception@entry=0x7ffd04f99810, operate=operate@entry=0x7bb751c00730 <dl_open_worker>,
    args=args@entry=0x7ffd04f99830) at dl-catch.c:241
#6  0x00007bb751c00b5c in _dl_open (file=0x78b7505eea21 "/nix/store/j27vfym88320q12vnz1qcipdmgjmb3w9-linux-pam-1.6.1/lib/security/pam_env.so",
    mode=<optimized out>, caller_dlopen=0x7bb751be1159 <_pam_load_module+201>, nsid=<optimized out>, argc=5, argv=0x7ffd04f9a988, env=0x7837505e1110)
    at dl-open.c:880
#7  0x00007bb750a9391c in dlopen_doit (a=a@entry=0x7ffd04f99aa0) at dlopen.c:56
#8  0x00007bb751bf64f1 in __GI__dl_catch_exception (exception=exception@entry=0x7ffd04f99a00, operate=0x7bb750a938c0 <dlopen_doit>, args=0x7ffd04f99aa0)
    at dl-catch.c:241
#9  0x00007bb751bf6623 in _dl_catch_error (objname=0x7ffd04f99a58, errstring=0x7ffd04f99a60, mallocedp=0x7ffd04f99a57, operate=<optimized out>,
    args=<optimized out>) at dl-catch.c:260
#10 0x00007bb750a93397 in _dlerror_run (operate=operate@entry=0x7bb750a938c0 <dlopen_doit>, args=args@entry=0x7ffd04f99aa0) at dlerror.c:138
#11 0x00007bb750a939f1 in dlopen_implementation (file=<optimized out>, mode=<optimized out>, dl_caller=<optimized out>) at dlopen.c:71
#12 ___dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:81
#13 0x00007bb751be1159 in _pam_load_module () from /nix/store/j27vfym88320q12vnz1qcipdmgjmb3w9-linux-pam-1.6.1/lib/libpam.so.0
#14 0x00007bb751be191a in _pam_add_handler () from /nix/store/j27vfym88320q12vnz1qcipdmgjmb3w9-linux-pam-1.6.1/lib/libpam.so.0
#15 0x00007bb751be200d in _pam_parse_conf_file () from /nix/store/j27vfym88320q12vnz1qcipdmgjmb3w9-linux-pam-1.6.1/lib/libpam.so.0
#16 0x00007bb751be2603 in _pam_init_handlers () from /nix/store/j27vfym88320q12vnz1qcipdmgjmb3w9-linux-pam-1.6.1/lib/libpam.so.0
#17 0x00007bb751be3dcc in _pam_start_internal () from /nix/store/j27vfym88320q12vnz1qcipdmgjmb3w9-linux-pam-1.6.1/lib/libpam.so.0
#18 0x00006329b645a5fc in sshpam_init ()
#19 0x00006329b645b8fa in start_pam ()
#20 0x00006329b6451947 in mm_answer_pam_start ()
#21 0x00006329b645100e in monitor_read ()
#22 0x00006329b6453986 in monitor_child_preauth ()
#23 0x00006329b642d3ac in main ()

(This is with the separated debug symbols for g8zyryr9cr6540xsyg4avqkwgxpnwj2a-glibc-2.40-66 installed. As far as I can tell, neither j27vfym88320q12vnz1qcipdmgjmb3w9-linux-pam-1.6.1 nor h44876m5fzv3zwh7kpa96xi03f6pfpkf-openssh-10.0p2 offers debug symbols.)

Poking around at the assembly level, the arguments to dlopen seem sane…

(gdb) frame 13
#13 0x00007bb751be1159 in _pam_load_module ()
   from /nix/store/j27vfym88320q12vnz1qcipdmgjmb3w9-linux-pam-1.6.1/lib/libpam.so.0
(gdb) disassemble
Dump of assembler code for function _pam_load_module:
   0x00007bb751be1090 <+0>:	push   %r15
...
   0x00007bb751be114d <+189>:	mov    %rbp,%rdi
   0x00007bb751be1150 <+192>:	lea    (%r14,%rax,8),%rbx
   0x00007bb751be1154 <+196>:	call   0x7bb751be49e0 <_pam_dlopen>
=> 0x00007bb751be1159 <+201>:	mov    %rax,0x10(%rbx)
...

(gdb) disassemble _pam_dlopen
Dump of assembler code for function _pam_dlopen:
   0x00007bb751be49e0 <+0>:	mov    $0x2,%esi
   0x00007bb751be49e5 <+5>:	jmp    0x7bb751bde270 <dlopen@plt>

## so the first argument to dlopen should still be in %rbp in the _pam_load_module frame,
## and the second argument to dlopen is 0x2 == RTLD_NOW
(gdb) x/s $rbp
0x78b7505eea21:	"/nix/store/j27vfym88320q12vnz1qcipdmgjmb3w9-linux-pam-1.6.1/lib/security/pam_env.so"

… and the actual crash is because of an unexpected null pointer:

(gdb) frame 0
#0  0x00007bb751c1919d in _dl_find_object_update ()
   from /nix/store/g8zyryr9cr6540xsyg4avqkwgxpnwj2a-glibc-2.40-66/lib/ld-linux-x86-64.so.2
(gdb) disassemble
Dump of assembler code for function _dl_find_object_update:
   0x00007bb751c18e50 <+0>:	endbr64
...
   0x00007bb751c1918e <+830>:	mov    %rcx,%rax
   0x00007bb751c19191 <+833>:	mov    (%r10),%rdi
   0x00007bb751c19194 <+836>:	shl    $0x5,%rax
   0x00007bb751c19198 <+840>:	mov    0x20(%rax,%rbp,1),%rax
=> 0x00007bb751c1919d <+845>:	cmp    %rax,0x390(%rdi)
   0x00007bb751c191a4 <+852>:	jb     0x7bb751c19266 <_dl_find_object_update+1046>
(gdb) p/x $rdi
$1 = 0x0
(gdb) p/x $r10
$2 = 0x7bb751425000
(gdb) x/16xg ($r10 - 64)
0x7bb751424fc0:	0x0000000000000000	0x0000000000000000
0x7bb751424fd0:	0x0000000000000000	0x0000000000000000
0x7bb751424fe0:	0x0000000000000000	0x0000000000000000
0x7bb751424ff0:	0x0000000000000000	0x0000000000000000
0x7bb751425000:	0x0000000000000000	0x0000000000000000
0x7bb751425010:	0x0000000000000000	0x0000000000000000
0x7bb751425020:	0x0000000000000000	0x0000000000000000
0x7bb751425030:	0x0000000000000000	0x0000000000000000

I am using a hardened malloc:

  environment.memoryAllocator.provider = "scudo";
  environment.variables.SCUDO_OPTIONS = "pattern_fill_contents=1";

so I am suspicious that this might be a bad interaction between scudo and the dynamic linker; however, it’s still a bug in something and I would appreciate suggestions for how to troubleshoot further.

Certain software have very sensitive behaviors to the actual malloc implementation (sometimes due to bugs in the software itself).

Is scudo known to work with sshd-session? I’d look at who is using scudo with sshd and see if they have some patches.

(I recommend GitHub - Inria-Prosecco/StarMalloc: A verified, modern, hardened memory allocator. [maintainer=@cmovcc] which has been tested quite.)

Right now my working hypothesis is that this isn’t anything to do with scudo specifically, but rather the C library has a bug related to calling dlopen in a process that has had any malloc replacement injected via ld(-nix).so.preload. If I’m right, then setting LD_BIND_NOW=t globally should be a workaround, and I’m testing that now; but it’ll be a week or so before I can be sure if it worked, because the crashes only happen about once a day.

StarMalloc doesn’t seem to be an option for environment.memoryAllocator.provider, at least not in 25.05.

Indeed, it requires packaging.

Right, this seems to be a rabbit hole in glibc fun I’d assume.

Yep, and PAM is involved as well, as are, potentially, some of the odd things sshd does in the name of hardening itself. I’m hoping I can come up with a reliable reproducer and then dump the problem on the glibc people.

1 Like

update: global LD_BIND_NOW=t did not suppress the crashes. I’m now testing whether an alternative malloc is involved at all.