    Francisco Huerta

    1 year ago
    Hi, everyone. Hoping someone can provide some hints on a Fleet sizing problem we're seeing in our labs: we're running stress tests with a single Fleet node, and at a certain point we start seeing "enrolling too often" errors that make Fleet unstable. Assuming there is a breaking point somewhere, are there techniques to prevent this problem? E.g., enabling multiple network interfaces (currently we only have one) for osquery <> Fleet traffic? Any config parameters to tweak?
  • Any guidance, similar experiences, or best practices would be very useful at this stage. Thanks much!

    zwass

    1 year ago
    Are you running multiple instances of osquery on the same host to do this load testing?

    Francisco Huerta

    1 year ago
    hey, @zwass, our setup is as follows: 10 hosts running approx. 500 Docker containers each, for a total of 5,000 osquery instances. Those 5,000 endpoints hit a single Fleet DM server (an eight-core machine).
  • We get two types of errors: "enrolling too often" and "TLS handshake error: EOF".
  • MySQL is running on a separate VM configured with a maximum of 400 simultaneous connections.
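  • (For reference, that cap is MySQL's max_connections setting. A minimal sketch of where it lives, assuming a standard my.cnf layout; the file path varies by distro:)

    ```ini
    # my.cnf (illustrative; location varies, e.g. /etc/mysql/my.cnf)
    [mysqld]
    max_connections = 400
    ```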

    zwass

    1 year ago
    EOF could be due to running out of open sockets/file descriptors and might require adjusting ulimit on the server and/or Docker hosts.
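  • (A quick sketch of checking and raising the descriptor limit on a Linux host; the 4096 value and paths are illustrative, not from this thread:)

    ```shell
    # Show the current soft limit on open files for this shell
    ulimit -n

    # Raise the soft limit for the current session (cannot exceed the hard
    # limit; 4096 is an illustrative value)
    ulimit -n 4096

    # Count how many descriptors a process actually holds (Linux /proc;
    # $$ = this shell, just as a demo)
    ls /proc/$$/fd | wc -l
    ```

    For a persistent change, limits are usually set in /etc/security/limits.conf or in the service unit, depending on how Fleet is run.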

    Francisco Huerta

    1 year ago
    that's something we suspected too, but we increased the ulimit parameter, and at peak moments we are not close to that limit.

    zwass

    1 year ago
    What are you setting for --host_identifier on the Docker hosts?
  • Depending on the deployment scenario, that is often a cause of "enrolling too often"
  • Setting it to instance tends to help.
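  • (In an osquery flags file, that would look something like the sketch below; the hostname and secret path are placeholders, not from this thread:)

    ```
    # osquery.flags (illustrative)
    --host_identifier=instance
    --tls_hostname=fleet.example.com:8080
    --enroll_secret_path=/etc/osquery/enroll_secret
    ```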

    Francisco Huerta

    1 year ago
    we are not setting it, so I guess it must be getting the default value.
  • I cannot see instance in the documentation. Do you mean setting it as --host_identifier=instance?
  • what would this be helpful for?
  • (appreciate all prompt replies, by the way, thanks!) 👍

    zwass

    1 year ago
    If the containers share hardware UUIDs, this helps Fleet see each container as a separate instance of osquery (with the default identifier, colliding UUIDs can look like a single host re-enrolling over and over, hence "enrolling too often").

    Francisco Huerta

    1 year ago
    got you. we will give it a try. thanks so much!
  • sorry @zwass, do you mean --host_identifier=uuid, or is it --host_identifier=instance? Just to confirm I'm doing it right.

    zwass

    1 year ago
    Try using instance.

    Francisco Huerta

    1 year ago
    👍
  • Hey, @zwass. As an update, we've been testing performance with --host_identifier=instance and we don't see much of a difference. Past a certain threshold, we again see EOF messages popping up.
  • When this happens, we see an increase in the number of database connections (from a flat average of 50 when everything works fine to peaks of 400, our limit)
  • CPU consumption also spikes to 100%.
  • we've tried creating a second network interface to balance incoming connections from the agents to the Fleet manager, but we don't see a significant improvement there either.
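  • (One knob on the Fleet side that relates to the connection spikes above is the MySQL pool size in the server config. A sketch only; option names and values are assumptions and should be checked against the Fleet configuration reference for your version:)

    ```yaml
    # fleet.yml (illustrative fragment)
    mysql:
      address: mysql.internal:3306   # hypothetical host
      max_open_conns: 50             # cap well below MySQL's max_connections (400 here)
      max_idle_conns: 50
    ```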

    zwass

    1 year ago
    This is CPU consumption on the Fleet server or the MySQL server?
  • FWIW I've load tested Fleet to 150,000+ simulated devices and folks are using Fleet in production on close to 100,000 devices.
  • At around 5,000 devices you might want to think about adding a load balancer routing traffic to multiple Fleet servers. But I know I can get more than that running on just my Mac laptop.
  • The TLS EOF errors are in the osquery logs or the Fleet server logs?
  • As for the "enrolling too often" error, I found a bug in osquery that is probably causing this (https://github.com/osquery/osquery/issues/6993). Because of that we are disabling the enrollment cooldown by default in the next release, coming out today. Pulling that down once we release it could help address that issue.
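  • (For the load-balancer route, a minimal TLS pass-through sketch using nginx's stream module; the hostnames are hypothetical and TLS still terminates at the Fleet servers:)

    ```nginx
    # nginx.conf fragment (illustrative); requires the stream module
    stream {
        upstream fleet_backends {
            least_conn;                   # send new connections to the least-busy server
            server fleet1.internal:8080;  # hypothetical Fleet servers
            server fleet2.internal:8080;
        }
        server {
            listen 443;
            proxy_pass fleet_backends;    # pass TLS through untouched
        }
    }
    ```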

    Francisco Huerta

    1 year ago
    Thanks! trying to answer your questions / comments in order:
  • CPU consumption refers to the Fleet server
  • Yes, we've got a load balancer in front of the server
  • TLS EOF are reported by the Fleet server
  • thanks for the indication on the bug, we will look into it 👍
  • An extra insight: we see improvements when setting tls_session_reuse = false
  • it seems that in our case (and maybe due to the way we create simulated endpoints) a number of connections are kept open over time, eventually causing Fleet to become unstable.
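  • (For anyone following along: the setting mentioned above is an osquery flag, so as a flags-file sketch it would be:)

    ```
    # osquery.flags (illustrative)
    --tls_session_reuse=false
    ```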