# fleet

Alon Starikov

11/10/2020, 12:33 PM
Hey guys, my Fleet process has started acting weird lately. It starts, takes over all of the CPU and RAM, and then stops after a few minutes. I have about 15000 hosts and this has never happened before. Any ideas?

zwass

11/10/2020, 2:42 PM
What version of Fleet are you running?

Alon Starikov

11/10/2020, 2:55 PM
Currently 3.0.0, planning on upgrading to 3.3.0 in the near future

zwass

11/10/2020, 3:21 PM
Do your logs by chance include many requests to the EnrollAgent endpoint?

Alon Starikov

11/10/2020, 3:27 PM
Yes

zwass

11/10/2020, 3:30 PM
Is that expected for you? Do you have quite a few new agents enrolling?
If not, is it possible you've deployed a number of hosts with the same hardware UUID? Perhaps by copying a VM?
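One way to check for shared hardware UUIDs is to group hosts by the UUID they report. The sketch below is only an illustration: it assumes you can export (hostname, UUID) pairs from some source outside Fleet, such as a hypervisor or asset inventory, or by running `SELECT uuid FROM system_info;` on each host. The function name and the sample data are hypothetical.

```python
from collections import defaultdict


def find_shared_uuids(inventory):
    """Return {uuid: sorted hostnames} for every UUID reported by more than one host."""
    hosts_by_uuid = defaultdict(set)
    for hostname, hw_uuid in inventory:
        # Normalize case and whitespace so the same UUID always lands in one bucket.
        hosts_by_uuid[hw_uuid.strip().lower()].add(hostname)
    return {u: sorted(names) for u, names in hosts_by_uuid.items() if len(names) > 1}


# Hypothetical inventory: web-02 was cloned from web-01 and kept its UUID.
inventory = [
    ("web-01", "4C4C4544-0000-1000-8000-AAAAAAAAAAAA"),
    ("web-02", "4c4c4544-0000-1000-8000-aaaaaaaaaaaa"),
    ("db-01", "4C4C4544-0000-1000-8000-BBBBBBBBBBBB"),
]

print(find_shared_uuids(inventory))
```

Any UUID that appears in the result is shared by several hosts and is a candidate for the problem described here.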

Alon Starikov

11/10/2020, 3:38 PM
That might be the case. Is that the cause?

zwass

11/10/2020, 3:49 PM
We saw something similar with another user. The problem is that enrollment is a fairly expensive operation, and if multiple hosts appear to Fleet to be the same host, they will continually overwrite each other's enrollment.
Here are some notes from that conversation:
Status Quo (host_identifier=uuid)
- Works until hosts have the same UUID. Seems to be an issue in your (current) environment.
- Not viable in your (current) environment due to hosts overwriting enrollment.
host_identifier=instance
- A new, osquery-specific UUID will be generated and stored in the osquery DB for each host
- Works until a VM image is copied with the osquery DB already initialized (though host_identifier=uuid will fail in the same way)
- Changing this now will cause Fleet to see every host as a fresh enrollment, leading to a single duplicate for each host in Fleet. The duplicates will have to be cleaned up later (though this can be automated with the host_expiry setting in Fleet).
Redeploy offending hosts with properly reset UUIDs
- No idea if this is viable for your situation, but if the duplicate issue described above seems worse than doing this, it is worth considering
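The overwrite behaviour in these notes can be illustrated with a toy model. This is a sketch under stated assumptions, not Fleet's actual enrollment code: it assumes the server keeps a single record per host identifier, and the names `enroll_all`, `make_host`, and the sample hosts are hypothetical.

```python
import uuid


def make_host(name, hardware_uuid):
    """Build a hypothetical host; instance_uuid stands in for the per-install osquery UUID."""
    return {
        "name": name,
        "hardware_uuid": hardware_uuid,
        "instance_uuid": str(uuid.uuid4()),
    }


def enroll_all(hosts, identifier_mode):
    """Simulate a server that stores one enrollment record per host identifier.

    Returns (records, overwrites), where overwrites counts how often a host
    replaced a record that belonged to a different host.
    """
    key_field = "hardware_uuid" if identifier_mode == "uuid" else "instance_uuid"
    records = {}
    overwrites = 0
    for host in hosts:
        key = host[key_field]
        if key in records and records[key] != host["name"]:
            overwrites += 1
        records[key] = host["name"]
    return records, overwrites


# Two VMs cloned from one image, so both report the same hardware UUID.
cloned = [make_host("vm-a", "SAME-UUID"), make_host("vm-b", "SAME-UUID")]

# Each host checks in twice: vm-a, vm-b, vm-a, vm-b.
for mode in ("uuid", "instance"):
    records, overwrites = enroll_all(cloned * 2, mode)
    print(mode, "records:", len(records), "overwrites:", overwrites)
```

In this model the `uuid` mode ends with one record that the two hosts keep fighting over, while the `instance` mode ends with two stable records. If both hosts also shared the same `instance_uuid` (the copied-image case from the notes), the collision would come back.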

Alon Starikov

11/10/2020, 3:54 PM
Right, I’ll look into it. Thanks!

zwass

11/10/2020, 4:54 PM
Please let me know how that goes. Of course we also need to fix Fleet to alert the user and not fall over in this situation.
@Alon Starikov are you still encountering this issue? Would it be possible for you to generate a debug archive so that I can try to understand what is going on (https://github.com/fleetdm/fleet/blob/master/docs/infrastructure/performance.md#generate-debug-archive-fleet-340)? I am going to implement a fix that will rate limit enrollment but I'd also really like to debug the issue that is being triggered before that is fixed.
@Alon Starikov we've pushed a cooldown period for host enrollment in Fleet 3.5.0 that is likely to resolve the issue for you. If you have a chance before upgrading we would really appreciate a debug archive. It's easy to do and may help us prevent similar problems in the future.
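For readers unfamiliar with the idea, an enrollment cooldown can be sketched as a per-identifier rate limit. This is not Fleet's implementation; the class name, the 60-second window, and the injectable clock are all assumptions made for illustration.

```python
import time


class EnrollCooldown:
    """Allow at most one enrollment per host identifier within a cooldown window."""

    def __init__(self, cooldown_seconds, clock=time.monotonic):
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock  # injectable so the behaviour can be tested without sleeping
        self._last_enroll = {}

    def allow(self, host_identifier):
        """Return True and record the time if enrollment is allowed, else False."""
        now = self.clock()
        last = self._last_enroll.get(host_identifier)
        if last is not None and now - last < self.cooldown_seconds:
            return False
        self._last_enroll[host_identifier] = now
        return True


# Demonstration with a fake clock (the 60-second window is a hypothetical value).
fake_now = [0.0]
limiter = EnrollCooldown(60, clock=lambda: fake_now[0])

results = [limiter.allow("SAME-UUID")]      # first enrollment
fake_now[0] = 10
results.append(limiter.allow("SAME-UUID"))  # inside the window
fake_now[0] = 61
results.append(limiter.allow("SAME-UUID"))  # window has passed
print(results)
```

A limit like this caps the cost of repeated enrollments from hosts sharing an identifier, but it does not make those hosts distinct; that still requires unique identifiers.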

Alon Starikov

12/12/2020, 10:33 AM
Apologies, I won’t be able to get around to it this week unfortunately... I will try to get it done as soon as I can. host_identifier=instance actually seems to do the trick for me though. I haven’t encountered any problems since changing this setting.