I’m seeing an intermittent crash in sshd-session, occurring once or twice a day on average (as a result of the constant probing that any machine listening on port 22 must endure):
Sep 29 15:09:33 tinka sshd[567]: error: session process 8913 for connection from [addr] to [me] killed by signal 11
Sep 30 07:03:43 tinka sshd[567]: error: session process 3501 for connection from [addr] to [me] killed by signal 11
Sep 30 07:36:47 tinka sshd[567]: error: session process 3580 for connection from [addr] to [me] killed by signal 11
Oct 01 22:58:40 tinka sshd[580]: error: session process 14767 for connection from [addr] to [me] killed by signal 11
Oct 03 02:08:06 tinka sshd[580]: error: session process 24671 for connection from [addr] to [me] killed by signal 11
Oct 04 22:38:01 tinka sshd[580]: error: session process 36547 for connection from [addr] to [me] killed by signal 11
Oct 05 00:15:39 tinka sshd[580]: error: session process 37110 for connection from [addr] to [me] killed by signal 11
Oct 05 22:06:52 tinka sshd[580]: error: session process 42567 for connection from [addr] to [me] killed by signal 11
Oct 06 09:44:42 tinka sshd[580]: error: session process 48825 for connection from [addr] to [me] killed by signal 11
Oct 06 11:42:06 tinka sshd[580]: error: session process 50376 for connection from [addr] to [me] killed by signal 11
Oct 08 00:19:40 tinka sshd[583]: error: session process 10045 for connection from [addr] to [me] killed by signal 11
Oct 08 07:35:24 tinka sshd[583]: error: session process 13044 for connection from [addr] to [me] killed by signal 11
Oct 08 12:29:39 tinka sshd[583]: error: session process 14224 for connection from [addr] to [me] killed by signal 11
I turned on core dumps and managed to capture a stack trace or two … and they point at the guts of the dynamic linker.
(gdb) backtrace
#0 0x00007bb751c1919d in _dl_find_object_update_1 (loaded=0x7bb751425000, count=<optimized out>) at dl-find_object.c:754
#1 _dl_find_object_update (new_map=new_map@entry=0x7957505ecc10) at dl-find_object.c:814
#2 0x00007bb751c013e8 in dl_open_worker_begin (a=a@entry=0x7ffd04f99830) at dl-open.c:731
#3 0x00007bb751bf64f1 in __GI__dl_catch_exception (exception=exception@entry=0x7ffd04f99690, operate=operate@entry=0x7bb751c00f30 <dl_open_worker_begin>,
args=args@entry=0x7ffd04f99830) at dl-catch.c:241
#4 0x00007bb751c00767 in dl_open_worker (a=a@entry=0x7ffd04f99830) at dl-open.c:778
#5 0x00007bb751bf64f1 in __GI__dl_catch_exception (exception=exception@entry=0x7ffd04f99810, operate=operate@entry=0x7bb751c00730 <dl_open_worker>,
args=args@entry=0x7ffd04f99830) at dl-catch.c:241
#6 0x00007bb751c00b5c in _dl_open (file=0x78b7505eea21 "/nix/store/j27vfym88320q12vnz1qcipdmgjmb3w9-linux-pam-1.6.1/lib/security/pam_env.so",
mode=<optimized out>, caller_dlopen=0x7bb751be1159 <_pam_load_module+201>, nsid=<optimized out>, argc=5, argv=0x7ffd04f9a988, env=0x7837505e1110)
at dl-open.c:880
#7 0x00007bb750a9391c in dlopen_doit (a=a@entry=0x7ffd04f99aa0) at dlopen.c:56
#8 0x00007bb751bf64f1 in __GI__dl_catch_exception (exception=exception@entry=0x7ffd04f99a00, operate=0x7bb750a938c0 <dlopen_doit>, args=0x7ffd04f99aa0)
at dl-catch.c:241
#9 0x00007bb751bf6623 in _dl_catch_error (objname=0x7ffd04f99a58, errstring=0x7ffd04f99a60, mallocedp=0x7ffd04f99a57, operate=<optimized out>,
args=<optimized out>) at dl-catch.c:260
#10 0x00007bb750a93397 in _dlerror_run (operate=operate@entry=0x7bb750a938c0 <dlopen_doit>, args=args@entry=0x7ffd04f99aa0) at dlerror.c:138
#11 0x00007bb750a939f1 in dlopen_implementation (file=<optimized out>, mode=<optimized out>, dl_caller=<optimized out>) at dlopen.c:71
#12 ___dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:81
#13 0x00007bb751be1159 in _pam_load_module () from /nix/store/j27vfym88320q12vnz1qcipdmgjmb3w9-linux-pam-1.6.1/lib/libpam.so.0
#14 0x00007bb751be191a in _pam_add_handler () from /nix/store/j27vfym88320q12vnz1qcipdmgjmb3w9-linux-pam-1.6.1/lib/libpam.so.0
#15 0x00007bb751be200d in _pam_parse_conf_file () from /nix/store/j27vfym88320q12vnz1qcipdmgjmb3w9-linux-pam-1.6.1/lib/libpam.so.0
#16 0x00007bb751be2603 in _pam_init_handlers () from /nix/store/j27vfym88320q12vnz1qcipdmgjmb3w9-linux-pam-1.6.1/lib/libpam.so.0
#17 0x00007bb751be3dcc in _pam_start_internal () from /nix/store/j27vfym88320q12vnz1qcipdmgjmb3w9-linux-pam-1.6.1/lib/libpam.so.0
#18 0x00006329b645a5fc in sshpam_init ()
#19 0x00006329b645b8fa in start_pam ()
#20 0x00006329b6451947 in mm_answer_pam_start ()
#21 0x00006329b645100e in monitor_read ()
#22 0x00006329b6453986 in monitor_child_preauth ()
#23 0x00006329b642d3ac in main ()
(This is with the separated debug symbols for g8zyryr9cr6540xsyg4avqkwgxpnwj2a-glibc-2.40-66 installed. As far as I can tell, neither j27vfym88320q12vnz1qcipdmgjmb3w9-linux-pam-1.6.1 nor h44876m5fzv3zwh7kpa96xi03f6pfpkf-openssh-10.0p2 offers debug symbols.)
Poking around at the assembly level, the arguments to dlopen seem sane…
(gdb) frame 13
#13 0x00007bb751be1159 in _pam_load_module ()
from /nix/store/j27vfym88320q12vnz1qcipdmgjmb3w9-linux-pam-1.6.1/lib/libpam.so.0
(gdb) disassemble
Dump of assembler code for function _pam_load_module:
0x00007bb751be1090 <+0>: push %r15
...
0x00007bb751be114d <+189>: mov %rbp,%rdi
0x00007bb751be1150 <+192>: lea (%r14,%rax,8),%rbx
0x00007bb751be1154 <+196>: call 0x7bb751be49e0 <_pam_dlopen>
=> 0x00007bb751be1159 <+201>: mov %rax,0x10(%rbx)
...
(gdb) disassemble _pam_dlopen
Dump of assembler code for function _pam_dlopen:
0x00007bb751be49e0 <+0>: mov $0x2,%esi
0x00007bb751be49e5 <+5>: jmp 0x7bb751bde270 <dlopen@plt>
## so the first argument to dlopen should still be in %rbp in the _pam_load_module frame,
## and the second argument to dlopen is 0x2 == RTLD_NOW
(gdb) x/s $rbp
0x78b7505eea21: "/nix/store/j27vfym88320q12vnz1qcipdmgjmb3w9-linux-pam-1.6.1/lib/security/pam_env.so"
… and the actual crash is because of an unexpected null pointer:
(gdb) frame 0
#0 0x00007bb751c1919d in _dl_find_object_update ()
from /nix/store/g8zyryr9cr6540xsyg4avqkwgxpnwj2a-glibc-2.40-66/lib/ld-linux-x86-64.so.2
(gdb) disassemble
Dump of assembler code for function _dl_find_object_update:
0x00007bb751c18e50 <+0>: endbr64
...
0x00007bb751c1918e <+830>: mov %rcx,%rax
0x00007bb751c19191 <+833>: mov (%r10),%rdi
0x00007bb751c19194 <+836>: shl $0x5,%rax
0x00007bb751c19198 <+840>: mov 0x20(%rax,%rbp,1),%rax
=> 0x00007bb751c1919d <+845>: cmp %rax,0x390(%rdi)
0x00007bb751c191a4 <+852>: jb 0x7bb751c19266 <_dl_find_object_update+1046>
(gdb) p/x $rdi
$1 = 0x0
(gdb) p/x $r10
$2 = 0x7bb751425000
(gdb) x/16xg ($r10 - 64)
0x7bb751424fc0: 0x0000000000000000 0x0000000000000000
0x7bb751424fd0: 0x0000000000000000 0x0000000000000000
0x7bb751424fe0: 0x0000000000000000 0x0000000000000000
0x7bb751424ff0: 0x0000000000000000 0x0000000000000000
0x7bb751425000: 0x0000000000000000 0x0000000000000000
0x7bb751425010: 0x0000000000000000 0x0000000000000000
0x7bb751425020: 0x0000000000000000 0x0000000000000000
0x7bb751425030: 0x0000000000000000 0x0000000000000000
I am using a hardened malloc:
environment.memoryAllocator.provider = "scudo";
environment.variables.SCUDO_OPTIONS = "pattern_fill_contents=1";
so I am suspicious that this might be a bad interaction between scudo and the dynamic linker; however, it’s still a bug in something and I would appreciate suggestions for how to troubleshoot further.