Operations grimoire/Incidents/2018-12-08-Mastodon
From Nasqueron Agora
Tracked at https://devcentral.nasqueron.org/T1492.
Incident timeline
- ffmpeg processes are stuck in the queue and prevent other jobs from running
- Several people report the issue by DM, but of course the timelines aren't updated, including the notifications one
- 2018-12-08 11:35 Investigation to see what happens on the queue
- 2018-12-08 PM The stuck ffmpeg processes were killed
- 2018-12-08 PM The queue started to drain
- 2018-12-08 PM Worker capacity was increased
- 2018-12-08 17:17 Prepare a script to kill stuck ffmpeg processes:
rOPS: roles/paas-docker/containers/files/mastodon/clear-video-queue.py
- 2018-12-08 PM The queue drained more quickly
- 2018-12-08 19:46 The situation was fixed
Analysis
There is a clear monitoring issue here: we need to be alerted when the queue contains more than a few hundred jobs.
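As an illustration of such a check, here is a minimal sketch in Python. It assumes Sidekiq's usual Redis layout, where each queue is a list stored under the key queue:<name>, so the number of pending jobs can be read with LLEN. The function name oversized_queues, the queue names and the threshold of 300 are hypothetical and should be adapted to the actual deployment.

```python
from typing import Dict, Iterable


def oversized_queues(client, queue_names: Iterable[str], threshold: int = 300) -> Dict[str, int]:
    """Return {queue name: size} for every queue holding more jobs than threshold.

    `client` is any object exposing llen(key), such as a redis-py client.
    The "queue:<name>" key layout is an assumption about how Sidekiq
    stores its queues in Redis.
    """
    sizes = {name: client.llen(f"queue:{name}") for name in queue_names}
    # Keep only the queues that exceed the alert threshold.
    return {name: size for name, size in sizes.items() if size > threshold}
```

With a redis-py client connected to the Redis instance used by Mastodon, a non-empty result from oversized_queues(client, ["default", "push", "pull", "mailers"]) could be turned into an alert by whatever monitoring system is in place.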
Actionables
- Monitor Sidekiq queue
- Report issue upstream (Mastodon and ffmpeg?)
- Provide a watchdog to kill stuck ffmpeg processes (the script is done, but a cron job to run it is still needed)
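The watchdog idea can be sketched as follows. This is not the content of clear-video-queue.py, only a hypothetical illustration of the approach: list processes with ps, select the ffmpeg ones that have been running longer than a limit, and kill them. The 30-minute limit and the function names are assumptions, and the etimes column requires the procps version of ps found on Linux.

```python
import os
import signal
import subprocess

MAX_AGE_SECONDS = 1800  # hypothetical limit: 30 minutes


def find_stuck(ps_output, max_age=MAX_AGE_SECONDS, command="ffmpeg"):
    """Return the PIDs of `command` processes older than max_age seconds.

    ps_output is expected to hold one "PID ELAPSED_SECONDS COMMAND" per line.
    """
    stuck = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        pid, elapsed, comm = fields
        if not (pid.isdigit() and elapsed.isdigit()):
            continue  # skip lines without numeric PID and age
        if comm.strip() == command and int(elapsed) > max_age:
            stuck.append(int(pid))
    return stuck


def kill_stuck():
    """Kill every ffmpeg process that exceeded the age limit."""
    # "pid=,etimes=,comm=" prints the three columns without a header line.
    output = subprocess.run(
        ["ps", "-eo", "pid=,etimes=,comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    for pid in find_stuck(output):
        try:
            os.kill(pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the process exited between listing and killing
```

A cron entry calling kill_stuck() every few minutes would cover the part that is still missing. Keeping the selection logic in find_stuck, separate from the kill itself, makes it possible to check which processes would be targeted before enabling the job.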