Operations grimoire/Incidents/2018-12-08-Mastodon

📕📁📜 Old technical information :: content warning

⌛ This Nasqueron Operations Grimoire page hasn't been updated for a long time.

☣ As our infrastructure evolves quickly, there is a good chance this information is outdated or now inaccurate. Be careful and consider updating it.

➡️ To check whether the information is still up to date, you can review the history of the relevant role in our Operations repository.

Tracked at https://devcentral.nasqueron.org/T1492.

Incident timeline

  • ffmpeg processes are stuck in the queue and prevent other jobs from running
  • Several people report the issue by DM but, of course, timelines aren't updated, including the notifications one
  • 2018-12-08 11:35 Investigation to see what happens on the queue
  • 2018-12-08 PM The stuck ffmpeg processes were killed
  • 2018-12-08 PM The queue started to drain
  • 2018-12-08 PM Worker capacity was increased
  • 2018-12-08 17:17 Prepared a script to kill stuck ffmpeg processes:
    rOPS: roles/paas-docker/containers/files/mastodon/clear-video-queue.py
  • 2018-12-08 PM The queue drained more quickly
  • 2018-12-08 19:46 The situation was fixed

Analysis

There is a clear monitoring issue here: we need to get an alert if the queue contains more than a few hundred jobs.
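
As an illustration of such a check, here is a minimal sketch, not what is deployed. It relies on the way Sidekiq stores its data in Redis: a set named queues holds the queue names, and each queue is a list named queue:<name>. The function names, the threshold of 300 jobs and the optional namespace prefix are assumptions made for the example. The client argument is any object exposing smembers and llen, such as a redis-py client.

```python
from typing import Dict


def queue_sizes(client, namespace: str = "") -> Dict[str, int]:
    """Return the number of pending jobs for each Sidekiq queue.

    `client` is any object exposing `smembers` and `llen`, such as a
    redis-py client. `namespace` is an assumption: leave it empty when
    Sidekiq writes to Redis without a key prefix.
    """
    prefix = f"{namespace}:" if namespace else ""
    sizes = {}
    # Sidekiq keeps the known queue names in the "queues" set.
    for raw in client.smembers(f"{prefix}queues"):
        name = raw.decode() if isinstance(raw, bytes) else raw
        # Each queue is a Redis list named "queue:<name>".
        sizes[name] = client.llen(f"{prefix}queue:{name}")
    return sizes


def overloaded_queues(sizes: Dict[str, int], threshold: int = 300) -> Dict[str, int]:
    """Keep only the queues holding more jobs than `threshold` (hypothetical value)."""
    return {name: size for name, size in sizes.items() if size > threshold}
```

A monitoring probe could call overloaded_queues(queue_sizes(client)) every few minutes and raise an alert whenever the result is not empty.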

Actionables

  • Monitor Sidekiq queue
  • Report issue upstream (Mastodon and ffmpeg?)
  • Provide a watchdog to kill stuck ffmpeg processes (script is done, but a cron to run it is still needed)
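
For reference, a watchdog of this kind could look like the sketch below. It is not the content of clear-video-queue.py, only an illustration under stated assumptions: a process is treated as stuck when it is named ffmpeg and has been running for more than 10 minutes (a hypothetical threshold), the procps version of ps is available, and the function names are invented for the example.

```python
import os
import signal
import subprocess

MAX_AGE_SECONDS = 600  # assumption: a transcoding job older than this is stuck


def parse_ps(output):
    """Parse headerless `ps` output (pid, elapsed seconds, command name)."""
    processes = []
    for line in output.splitlines():
        fields = line.split(None, 2)
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        pid, age, command = fields
        processes.append((int(pid), int(age), command.strip()))
    return processes


def stuck_pids(processes, command="ffmpeg", max_age=MAX_AGE_SECONDS):
    """Return the PIDs of `command` processes running for longer than `max_age`."""
    return [pid for pid, age, name in processes if name == command and age > max_age]


def kill_stuck(max_age=MAX_AGE_SECONDS):
    """Kill every stuck ffmpeg process and return the list of targeted PIDs."""
    # One -o option per column; a trailing "=" suppresses the column header.
    result = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "etimes=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    )
    pids = stuck_pids(parse_ps(result.stdout), max_age=max_age)
    for pid in pids:
        try:
            # SIGKILL, since a stuck process may ignore a polite SIGTERM.
            os.kill(pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the process exited between the listing and the kill
    return pids
```

Calling kill_stuck() from a periodic job, for example a cron entry running every few minutes, would cover the part of this actionable that is still missing.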