• Peter

    3 months ago
    Hey again 👋 Just a note, with a CPU soft lockup issue I mentioned above. It appears that in Ubunut's 5.4 LTS Kernel two "new" linux-audit patches were bundled: [RFC PATCH] audit: improve audit queue handling when "audit=1" on cmdline [PATCH] audit: ensure userspace is penalized the same as the kernel when under pressure These first made it into the AWS, Azure, GKE, and GCP Ubuntu Kernels with versions greater than 5.4.0-1068. Unfortunately, this appears to have a detrimental impact on systems running osquery - or with audit rules installed which generate quite a lot of events. Under load we see kernels above this version reliably killing systems entirely due to CPU soft lockups.
    kernel: watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kauditd:22]
    Downgrading to Kernel 5.4.0.1068.50 and rebooting the system with no other changes resolves the issue. Subsequent load spikes do not experience this issue. However, if we roll the Kernel forward to an "affected" version, we once again see the soft lockups.
  • The osquery managed audit rules are as follows:
    -a always,exit -S connect,accept,bind,execve,accept4,execveat
    New printk'd messages prior to the lockup:
    Apr 07 04:39:35 ip-10-0-74-131 kernel: audit_log_start: 1773 callbacks suppressed
    Apr 07 04:39:35 ip-10-0-74-131 kernel: audit: audit_backlog=4098 > audit_backlog_limit=4096
    Apr 07 04:39:35 ip-10-0-74-131 kernel: audit: audit_lost=27 audit_rate_limit=0 audit_backlog_limit=4096
    Apr 07 04:39:51 ip-10-0-74-131 kernel: audit: audit_backlog=4098 > audit_backlog_limit=4096
    Apr 07 04:39:51 ip-10-0-74-131 kernel: audit: audit_lost=28 audit_rate_limit=0 audit_backlog_limit=4096
  • Immediately followed by a trace:
    Apr 07 04:38:57 ip-10-0-74-131 kernel: watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kauditd:22]
    Apr 07 04:38:57 ip-10-0-74-131 kernel: Modules linked in: ppdev serio_raw parport_pc parport sch_fq_codel ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper ena
    Apr 07 04:38:57 ip-10-0-74-131 kernel: CPU: 1 PID: 22 Comm: kauditd Not tainted 5.4.0-1071-aws #76~18.04.1-Ubuntu
    Apr 07 04:38:57 ip-10-0-74-131 kernel: Hardware name: Amazon EC2 t3.small/, BIOS 1.0 10/16/2017
    Apr 07 04:38:57 ip-10-0-74-131 kernel: RIP: 0010:_raw_spin_unlock_irqrestore+0x15/0x20
    Apr 07 04:38:57 ip-10-0-74-131 kernel: Code: e8 b0 a3 65 ff 4c 29 e0 4c 39 f0 76 cf 80 0b 08 eb 8a 90 90 90 0f 1f 44 00 00 55 48 89 e5 e8 66 0c 68 ff 66 90 48 89 f7 57 9d <0f> 1f 44 00 00 5d c3 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 c6 07
    Apr 07 04:38:57 ip-10-0-74-131 kernel: RSP: 0018:ffffa385c00cbe08 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
    Apr 07 04:38:57 ip-10-0-74-131 kernel: RAX: 0000000000000001 RBX: ffffffff96fd3b70 RCX: 0000000000000769
    Apr 07 04:38:57 ip-10-0-74-131 kernel: RDX: 000000000000076a RSI: 0000000000000246 RDI: 0000000000000246
    Apr 07 04:38:57 ip-10-0-74-131 kernel: RBP: ffffa385c00cbe08 R08: 0000000000000000 R09: ffffffff958b4801
    Apr 07 04:38:57 ip-10-0-74-131 kernel: R10: ffffa385c00cbd98 R11: 0000000000000000 R12: ffff907db699ff00
    Apr 07 04:38:57 ip-10-0-74-131 kernel: R13: ffffffff96fd3b84 R14: ffff907db699ffd4 R15: ffff907db699ff00
    Apr 07 04:38:57 ip-10-0-74-131 kernel: FS:  0000000000000000(0000) GS:ffff907dfb500000(0000) knlGS:0000000000000000
    Apr 07 04:38:57 ip-10-0-74-131 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Apr 07 04:38:57 ip-10-0-74-131 kernel: CR2: 000055c63d5d9ee0 CR3: 0000000008a0a006 CR4: 00000000007606e0
    Apr 07 04:38:57 ip-10-0-74-131 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    Apr 07 04:38:57 ip-10-0-74-131 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Apr 07 04:38:57 ip-10-0-74-131 kernel: PKRU: 55555554
    Apr 07 04:38:57 ip-10-0-74-131 kernel: Call Trace:
    Apr 07 04:38:57 ip-10-0-74-131 kernel:  skb_queue_head+0x47/0x50
    Apr 07 04:38:57 ip-10-0-74-131 kernel:  kauditd_rehold_skb+0x18/0x20
    Apr 07 04:38:57 ip-10-0-74-131 kernel:  kauditd_send_queue+0xb6/0xf0
    Apr 07 04:38:57 ip-10-0-74-131 kernel:  ? kauditd_retry_skb+0x20/0x20
    Apr 07 04:38:57 ip-10-0-74-131 kernel:  kauditd_thread+0xa0/0x250
    Apr 07 04:38:57 ip-10-0-74-131 kernel:  ? __wake_up_pollfree+0x40/0x40
    Apr 07 04:38:57 ip-10-0-74-131 kernel:  kthread+0x121/0x140
    Apr 07 04:38:57 ip-10-0-74-131 kernel:  ? auditd_reset+0xa0/0xa0
• I can provide an easy script to generate a repro for anyone interested in helping me validate. I've seen posts on the Ubuntu Launchpad tracker and Stack Overflow which seem very similar, and which indicate that this issue also occurs in Ubuntu 20.04 and the associated AWS, Azure, and "generic" Kernels. It feels to me to be one of two things:
    1. One of the patches has an unexpected condition which is leading to this behaviour on loaded systems.
    2. The new "penalty" code makes it impossible for osquery to consume audit data on a loaded system in a timely manner - though this doesn't seem to explain the lockups directly.
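    A load generator along these lines should exercise the same path - this is a hypothetical sketch, not the script mentioned above. It assumes the audit rule shown earlier is loaded, since each fork+execve then emits matching audit records:

```shell
#!/bin/sh
# Hypothetical repro sketch (not the actual script from the thread):
# spawn many short-lived processes so the execve rule floods the audit
# queue. With a large N on a small instance this should push audit_backlog
# past audit_backlog_limit.
N=${N:-200}        # raise to ~50000 on a disposable test box
i=0
while [ "$i" -lt "$N" ]; do
    /bin/true &    # backgrounded so records arrive in bursts
    i=$((i + 1))
done
wait
echo "spawned $N processes"
```

    With the rule loaded, watching the journal for `audit_backlog=... > audit_backlog_limit=...` messages while this runs should tell you whether the queue is overflowing.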
  • Stefano Bonicatti

    3 months ago
    @Peter could you try passing to osquery
    --audit_backlog_wait_time=60000
    and see if this improves anything?
  • Peter

    3 months ago
    Sure can, just a mo.
  • Unfortunately not. Same issue, CPU soft lockup under load.
  • Stefano Bonicatti

    3 months ago
    I see; I only gave this a quick look, but it's certainly confusing. The first patch you linked seems to have been made to actually fix a soft lockup issue: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1965723 So the combination of the two somehow made it worse? Also, I'm curious: what's the osquery CPU usage in all this?
• Also, does setting the wait time to 60k make the lockup happen sooner, later, or at the same time?
  • Peter

    3 months ago
    Yeah, that's what has us a bit perplexed. The Kernel we're using definitely has the "fix" patch, but we can reliably reproduce the issue under load. It definitely seems to take longer before the soft lockups appear with this wait time set. Just trying to get the load information for you now, which is a bit challenging as the system stops responding to terminal input entirely, so we may only get information from just before it locks up.
  • Stefano Bonicatti

    3 months ago
    actually, now that I remember, have you verified that osquery is not getting killed by the watchdog?
• because, as far as I recall, when there's high activity like that the default limit of 200MB of RAM won't cut it
  • Peter

    3 months ago
    Let me check around the watchdog. Thanks for the assistance, by the way! Here's our full flags file; note that we increase the memory limit slightly:
    --audit_allow_config=true
    --audit_allow_sockets=true
    --audit_persist=true
    --audit_allow_accept_socket_events=true
    --audit_allow_null_accept_socket_events=true
    --disable_audit=false
    --disable_events=false
    --enable_file_events=true
    --events_expiry=1
    --events_max=50000
    --audit_backlog_wait_time=60000
    --events_optimize=true
    --host_identifier=uuid
    --pack_delimiter=_
    --verbose=false
    --schedule_splay_percent=10
    --logger_snapshot_event_type=true
    --watchdog_memory_limit=350
    --watchdog_utilization_limit=30
  • Stefano Bonicatti

    3 months ago
    So, I would check whether osquery is getting killed. A way to see that more quickly is to launch osquery in the shell, so that the watchdog process is able to write the message about CPU or memory limits being hit (otherwise the message gets lost, because the watchdog doesn't implement the logger plugins that permit it to write them to file or via TLS).
• I understand that the behavior of osquery and the watchdog would be the same pre- and post-patch, but it's worth ruling out, since it might be amplifying the issue you're seeing post-patch
  • Peter

    3 months ago
    Gotcha, let me give that a try. In the meantime, here's the last tmux frame before the system became entirely unresponsive. It's interesting that pid 930 of osqueryd is in an uninterruptible sleep. That said, the CPU and memory seem okay?
• My knowledge of the kauditd internals isn't great, but I wonder whether osqueryd is disconnecting from the audit socket - perhaps due to watchdog termination - which is triggering unexpected behaviour downstream of the call to auditd_reset. Given the new "penalty" code, I wonder if the call to auditd_reset after this disconnection is being blocked in the kthread unexpectedly - as if it were the client that had been penalised?
• That's strange. If I run osqueryd in the foreground, using the same flags with the addition of --verbose, the issue doesn't present itself 🤔
    osqueryd --verbose --flagfile /etc/osquery/osquery.flags --config_path /etc/osquery/osquery.conf
• I just ^C'd the osqueryd running in the foreground (which didn't have any issues), started the service via systemd, ran the reproduction script, and it started soft locking up again.
  • Stefano Bonicatti

    3 months ago
    I was going to say that there’s a limit in systemd
  • Peter

    3 months ago
    I just disabled the CPUQuota line in the systemd unit, and running under systemd it no longer locks up.
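    For reference, assuming the packaged unit is osqueryd.service and it sets a CPUQuota= line (both assumptions; check the shipped unit file), the quota can be cleared with a drop-in instead of editing the unit itself:

```ini
# /etc/systemd/system/osqueryd.service.d/no-cpu-quota.conf
# Hypothetical drop-in: an empty CPUQuota= resets any quota inherited from
# the packaged unit. Apply with:
#   systemctl daemon-reload && systemctl restart osqueryd
[Service]
CPUQuota=
```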
  • I have a strong feeling this is something wonky with the penalty code when the client disconnects from the audit socket.
  • Stefano Bonicatti

    3 months ago
    though I don't understand why osquery should disconnect. Now the problem is understanding whether osquery was getting killed due to too much memory usage. I've had a PR open for a while which addresses the issue of osquery not having enough CPU, and especially of the audit socket reading thread going faster than the other processing threads can keep up with
• and so an internal buffer would grow "indefinitely", causing the memory limit to be hit
  • Peter

    3 months ago
    Interesting, yeah, that's a good question. Without that CPUQuota flag the reproduction script completes and osquery keeps running. Let me run it under sar to see what happens with memory utilisation.
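    Alongside sar, a crude watcher that samples the process's VmRSS from /proc once a second keeps emitting data right up until the box stops responding - a sketch, with the PID defaulting to the current shell purely for illustration:

```shell
#!/bin/sh
# Hypothetical memory watcher: print a process's resident set size every
# second, so there is data from just before a lockup.
PID=${PID:-$$}            # on a real box: PID=$(pgrep -o osqueryd)
SAMPLES=${SAMPLES:-5}     # raise this to keep sampling for longer
n=0
while [ "$n" -lt "$SAMPLES" ] && kill -0 "$PID" 2>/dev/null; do
    rss=$(awk '/^VmRSS/ {print $2}' "/proc/$PID/status")
    echo "$(date +%T) pid=$PID rss_kb=$rss"
    n=$((n + 1))
    sleep 1
done
```

    Piping the output to `tee` on persistent storage means the samples survive even if the terminal session is lost when the machine locks up.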
  • Stefano Bonicatti

    3 months ago
    I'm probably missing some interaction in the penalty code, but all I could see was a sleep in audit_receive - which, by the way, osquery sets to 0 by default, so I would've expected that change to do nothing.
  • Peter

    3 months ago
    The semantics of why osquery might be disconnected from the socket are still a mystery - if that is even occurring. That said, in that case, what happens if auditd_reset is called while the wait queue is "active"? https://elixir.bootlin.com/linux/v5.4.189/source/kernel/audit.c#L1559
  • Thanks again for the assistance here. I'm going to go do some more reading on the topic, and have a think about where I can get a debugger attached to get better visibility into what's occurring. Be back in a few.
• Ah, I see what you mean around audit_backlog_wait_time. Well, now I'm just even more confused 😂
    Looping back on this one: it appears that the AWS 1072.77 Kernel resolves this issue, which is no longer triggerable on that Kernel. According to the following diff, the 1072.77 Kernel includes some changes to audit.c. What's weird is that the diff for linux-aws 5.4.0-1068.72 to 5.4.0-1069.73 includes these same audit.c patches, yet I can't reproduce the issue on the new Kernel, whereas I can reliably on the older ones. 🤔