Operations grimoire/Incidents/2018-12-08-Mastodon

From Nasqueron Agora

Latest revision as of 02:34, 9 December 2018

Tracked at https://devcentral.nasqueron.org/T1492.

Incident timeline

  • ffmpeg processes are stuck in the queue and prevent other jobs from running
  • Several people report the issue by DM but, of course, timelines aren't updated, including the notifications one
  • 2018-12-08 11:35 Investigation to see what happens on the queue
  • 2018-12-08 PM The stuck ffmpeg processes were killed
  • 2018-12-08 PM The queue started to drain
  • 2018-12-08 PM Worker capacity was increased
  • 2018-12-08 17:17 Prepared a script to kill stuck ffmpeg processes:
    rOPS: roles/paas-docker/containers/files/mastodon/clear-video-queue.py
  • 2018-12-08 PM The queue drained more quickly
  • 2018-12-08 19:46 The situation was fixed

Analysis

There is a clear monitoring issue here: we need to get an alert when the queue contains more than a few hundred jobs.
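
As an illustration of the alert described above, here is a minimal sketch of the threshold check. It is not an existing Nasqueron check: the function name, the 300-job threshold and the sample queue names are assumptions made for the example. Sidekiq keeps each queue as a Redis list under the key `queue:<name>`, so the sizes passed in could be collected with `LLEN`; that collection step is left out to keep the sketch free of dependencies.

```python
from typing import Dict, List

# Assumption: the report does not give an exact figure for the alert
# threshold, so 300 is purely illustrative.
QUEUE_ALERT_THRESHOLD = 300


def queues_over_threshold(
    sizes: Dict[str, int], threshold: int = QUEUE_ALERT_THRESHOLD
) -> List[str]:
    """Return the sorted names of queues holding more jobs than the threshold.

    `sizes` maps a queue name to its current number of pending jobs
    (hypothetical input; it could be filled from Redis LLEN results).
    """
    return sorted(name for name, size in sizes.items() if size > threshold)
```

A monitoring probe could call `queues_over_threshold({"default": 1250, "push": 12})` and raise an alert for every name returned, here only `default`.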

Actionables

  • Monitor Sidekiq queue
  • Report issue upstream (Mastodon and ffmpeg?)
  • Provide a watchdog to kill stuck ffmpeg processes (the script is done, but a cron job to run it is still needed)
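
To make the watchdog idea concrete, here is a hedged sketch of how stuck ffmpeg processes could be detected and killed. It is not the content of clear-video-queue.py, which is not reproduced here. The 30-minute age limit, the function names and the choice of SIGKILL are assumptions, and the `etimes` column (elapsed seconds) is assumed to be available in the local `ps`, as it is in procps on Linux.

```python
import os
import signal
import subprocess
from typing import List, Tuple

# Assumption: an ffmpeg process older than this is considered stuck.
MAX_AGE_SECONDS = 30 * 60


def parse_ps(output: str) -> List[Tuple[int, int, str]]:
    """Parse `ps -e -o pid= -o etimes= -o comm=` output.

    Returns (pid, age_in_seconds, command_name) tuples.
    """
    processes = []
    for line in output.splitlines():
        fields = line.split(None, 2)
        if len(fields) == 3:
            processes.append((int(fields[0]), int(fields[1]), fields[2].strip()))
    return processes


def stuck_pids(
    processes: List[Tuple[int, int, str]],
    command: str = "ffmpeg",
    max_age: int = MAX_AGE_SECONDS,
) -> List[int]:
    """Select the PIDs of `command` processes running longer than `max_age`."""
    return [pid for pid, age, name in processes if name == command and age > max_age]


def kill_stuck(dry_run: bool = True) -> List[int]:
    """List stuck ffmpeg processes and, unless dry_run is set, kill them."""
    output = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "etimes=", "-o", "comm="],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    pids = stuck_pids(parse_ps(output))
    if not dry_run:
        for pid in pids:
            os.kill(pid, signal.SIGKILL)
    return pids
```

Calling `kill_stuck()` only reports the matching PIDs; `kill_stuck(dry_run=False)` actually sends the signal. A cron job running such a script every few minutes would cover the missing piece noted in the actionable.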