• Jocelyn Bothe

    Jocelyn Bothe

    1 year ago
    we tried upgrading to v4 today and it did not go well after upgrade, our mysql db keeps crashing on this query
    Query (2b66089b0a90): INSERT IGNORE INTO scheduled_query_stats (     scheduled_query_id,     host_id,     average_memory,     denylisted,     executions,     schedule_interval,     last_executed,     output_size,     system_time,     user_time,     wall_time    )    VALUES ((SELECT sq.id FROM scheduled_queries sq JOIN packs p ON (sq.pack_id = p.id) WHERE p.name = ? AND sq.name = ?),?,?,?,?,?,?,?,?,?,?),((SELECT sq.id FROM scheduled_queries sq JOIN packs p ON (sq.pack_id = p.id) WHERE p.name = ? AND sq.name = ?),?,?,?,?,?,?,?,?,?,?),((SELECT sq.id FROM scheduled_queries sq JOIN packs p ON (sq.pack_id = p.id) WHERE p.name = ? AND sq.name = ?),?,?,?,?,?,?,?,?,?,?),((SELECT sq.id FROM scheduled_queries sq JOIN packs p ON (sq.pack_id = p.id) WHERE p.name = ? AND sq.name = ?),?,?,?,?,?,?,?,?,?,?),((SELECT sq.id FROM scheduled_queries sq JOIN packs p ON (sq.pack_id = p.id) WHERE p.name = ? AND sq.name = ?),?,?,?,?,?,?,?,?,?,?),((SELECT sq.id FROM scheduled_queries sq JOIN packs p ON (sq.pack_id = p.id) WHERE p.name = ? AND sq.name = ?),?
  • Tomas Touceda

    Tomas Touceda

    1 year ago
    hi Jocelyn, do you have logs you can share to see what the error is?
  • Jocelyn Bothe

    Jocelyn Bothe

    1 year ago
    that is from the mysql error log
  • 2021-07-21T20:00:48.271896Z 0 [Note] Server socket created on IP: '::'.
    2021-07-21T20:00:48.298284Z 0 [Warning] 'user' entry 'mysql.sys@localhost' ignored in --skip-name-resolve mode.
    2021-07-21T20:00:48.298300Z 0 [Warning] 'user' entry 'rdsadmin@localhost' ignored in --skip-name-resolve mode.
    2021-07-21T20:00:48.360779Z 0 [Note] Event Scheduler: Loaded 1 event
    2021-07-21T20:00:48.360853Z 1 [Note] Event Scheduler: scheduler thread started with id 1
    2021-07-21T20:00:48.467787Z 0 [Note] /rdsdbbin/oscar/bin/mysqld: ready for connections.
    Version: '5.7.12'  socket: '/tmp/mysql.sock'  port: 3306  MySQL Community Server (GPL)
    20:00:51 UTC - mysqld got signal 11 ;
    This could be because you hit a bug. It is also possible that this binary
    or one of the libraries it was linked against is corrupt, improperly built,
    or misconfigured. This error can also be caused by malfunctioning hardware.
    Attempting to collect some information that could help diagnose the problem.
    As this is a crash and something is definitely wrong, the information
    collection process might fail.
    
    key_buffer_size=16777216
    read_buffer_size=262144
    max_used_connections=84
    max_threads=3000
    thread_count=39
    connection_count=38
    connection_count=36
    It is possible that mysqld could use up to 
    key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1601274 K  bytes of memory
    Hope that's ok; if not, decrease some variables in the equation.
    
    
    Trying to get some variables.
    Some pointers may be invalid and cause the dump to abort.
    Query (2b85dd1b1a90): INSERT IGNORE INTO scheduled_query_stats (     scheduled_query_id,     host_id,     average_memory,     denylisted,     executions,     schedule_interval,     last_executed,     output_size,     system_time,     user_time,     wall_time    )    VALUES ((SELECT sq.id FROM scheduled_queries sq JOIN packs p ON (sq.pack_id = p.id) WHERE p.name = ? AND sq.name = ?),?,?,?,?,?,?,?,?,?,?),((SELECT sq.id FROM scheduled_queries sq JOIN packs p ON (sq.pack_id = p.id) WHERE p.name = ? AND sq.name = ?),?,?,?,?,?,?,?,?,?,?),((SELECT sq.id FROM scheduled_queries sq JOIN packs p ON (sq.pack_id = p.id) WHERE p.name = ? AND sq.name = ?),?,?,?,?,?,?,?,?,?,?),((SELECT sq.id FROM scheduled_queries sq JOIN packs p ON (sq.pack_id = p.id) WHERE p.name = ? AND sq.name = ?),?,?,?,?,?,?,?,?,?,?),((SELECT sq.id FROM scheduled_queries sq JOIN packs p ON (sq.pack_id = p.id) WHERE p.name = ? AND sq.name = ?),?,?,?,?,?,?,?,?,?,?),((SELECT sq.id FROM scheduled_queries sq JOIN packs p ON (sq.pack_id = p.id) WHERE p.name = ? AND sq.name = ?),?
    Connection ID (thread ID): 124
    Status: NOT_KILLED
    
    The manual page at <http://dev.mysql.com/doc/mysql/en/crashing.html> contains
    information that should help you find out what is causing the crash.
    Writing a core file
  • it would crash on that as soon as it started
  • we have over 20k hosts, so I suspect when it's trying to do the initial data population for scheduled_query_stats it's too much for the db to handle
  • Tomas Touceda

    Tomas Touceda

    1 year ago
    would you be able to try with mysql 5.7.35? I don't recall any crashes like this one before. I would be surprised if a bug in fleet could cause that
  • do you have CPU/memory metrics for the database?
  • Jocelyn Bothe

    Jocelyn Bothe

    1 year ago
    we're using an AWS Aurora mysql global db
  • cpu, mem, and connections were all fine
  • it was literally just dying on that query over and over
  • the scheduled_query_stats table doesn't exist in 3.10, it looks like when 4.0 starts it tries to populate data into that new table?
  • so what is scheduled_query_stats for, and can we disable it somehow, for testing?
  • Tomas Touceda

    Tomas Touceda

    1 year ago
    it's updated at different points as the hosts do requests against fleet for different things
  • Jocelyn Bothe

    Jocelyn Bothe

    1 year ago
    why start storing stats in the db?
  • Tomas Touceda

    Tomas Touceda

    1 year ago
    I would have to get back to you on that, as I wasn't involved in that
  • Jocelyn Bothe

    Jocelyn Bothe

    1 year ago
    fair enough
  • I'd just love to get v4 working
  • Tomas Touceda

    Tomas Touceda

    1 year ago
    It looks like the LTS version of aurora is 2.07, and based on the logs you are in at most 2.03.1
  • would you be able to test upgrading to 2.07?
  • Jocelyn Bothe

    Jocelyn Bothe

    1 year ago
    we're on
    Engine version
    5.7.mysql_aurora.2.07.2
  • Instance class
    db.r5.2xlarge
    vCPU
    8
    RAM
    64 GB
  • Tomas Touceda

    Tomas Touceda

    1 year ago
    huh, weird, the log shows
    Version: '5.7.12'  socket: '/tmp/mysql.sock'  port: 3306
    which according to the logs
    For Aurora MySQL 2.x, all versions 2.03.1 and lower are represented by the engine version 5.7.12.
  • let me do some digging and will get back to you
  • so it seems others have encountered similar issues, and they solved it by upgrading to 2.10 https://osquery.slack.com/archives/C01DXJL16D8/p1623355865176100?thread_ts=1623178672.165300&amp;cid=C01DXJL16D8
  • Jocelyn Bothe

    Jocelyn Bothe

    1 year ago
    mysql> select aurora_version(), @@aurora_version;
    +------------------+------------------+
    | aurora_version() | @@aurora_version |
    +------------------+------------------+
    | 2.07.2           | 2.07.2           |
    +------------------+------------------+
    1 row in set (0.00 sec)
  • sweet, we'll give that a try tomorrow
  • Noah Talerman

    Noah Talerman

    1 year ago
    @Jocelyn Bothe thank you for your patience and help troubleshooting the upgrade. Adding context to your earlier question:
    why start storing stats in the db?
    Fleet 3.12.0 introduced scheduled query information (“Name”, “Description”, “Frequency”, and “Last run”) to the Host details page. This way you can verify if, and when, a scheduled query ran successfully against a specific host without having to check osquery status logs or a configured log destination. In the coming months, the Fleet team plans to surface more scheduled query performance info in the Fleet UI. This UI will likely use
    scheduled_query_stats
    .
  • Jocelyn Bothe

    Jocelyn Bothe

    1 year ago
    thanks for the info!
  • During our investigation, we identified the source of these failovers due to following sql statement:
    INSERT IGNORE INTO scheduled_query_stats (...) VALUES (SELECT ... JOIN ...)(...)
    This is a known Mysql Community edition bug and is already fixed in Aurora with the 2.09.1 release.  We recommend you to upgrade your cluster to 2.09.1 at your earliest convenience.
    In the meantime, if possible could you avoid running this query to prevent further crashes.  You may instead modify your sql statement to insert single row at a time to avoid this issue.
  • might be worth adding a note to your upgrade guide to mention the new requirement for mysql version