Operations grimoire/Incidents/2018-12-08-Mastodon
From Nasqueron Agora
Latest revision as of 02:34, 9 December 2018
Tracked at https://devcentral.nasqueron.org/T1492.
Incident timeline
- ffmpeg processes are stuck in the queue and prevent other jobs from running
- Several people report the issue by DM, but timelines, including the notifications one, are no longer updated
- 2018-12-08 11:35 Investigation to see what happens on the queue
- 2018-12-08 PM The stuck ffmpeg processes were killed
- 2018-12-08 PM The queue started to drain
- 2018-12-08 PM Worker capacity was increased
- 2018-12-08 17:17 Prepare a script to kill stuck ffmpeg processes:
  rOPS: roles/paas-docker/containers/files/mastodon/clear-video-queue.py
- 2018-12-08 PM The queue drained more quickly
- 2018-12-08 19:46 The situation was fixed
Analysis
There is a clear monitoring issue here: we need an alert when the queue contains more than a few hundred jobs.
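As a rough illustration of such an alert, here is a minimal sketch of a queue size check. It relies on Sidekiq keeping each queue as a Redis list under the key queue:<name>. Everything else is an assumption made for the example: the threshold of 300 jobs, the function name, and the absence of a Redis namespace prefix. The client only needs an llen method, which a redis-py Redis object provides.

```python
# Sketch only: threshold and names are illustrative, not taken from the incident.
QUEUE_THRESHOLD = 300  # assumption: "a few hundred jobs"


def oversized_queues(client, queue_names, threshold=QUEUE_THRESHOLD):
    """Return {queue name: size} for every queue longer than threshold.

    client is any object exposing llen(key), for example a redis-py
    Redis instance. Sidekiq stores each queue as a Redis list named
    "queue:<name>" (assuming no namespace prefix is configured).
    """
    alerts = {}
    for name in queue_names:
        size = client.llen(f"queue:{name}")
        if size > threshold:
            alerts[name] = size
    return alerts
```

A monitoring probe could call oversized_queues(redis.Redis(host=...), ["default", "push", "pull"]) at a regular interval and raise an alert whenever the result is not empty. The queue names and host here are placeholders to adapt to the actual deployment.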
Actionables
- Monitor Sidekiq queue
- Report issue upstream (Mastodon and ffmpeg?)
- Provide a watchdog to kill stuck ffmpeg processes (script is done, but a cron to run it is still needed)
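To make the watchdog idea concrete, here is a minimal sketch of one way such a check could work. It is not the content of clear-video-queue.py, which is not reproduced on this page. The assumptions are: a Linux host where ps supports the etimes column (elapsed seconds), a process is considered stuck after 15 minutes, and the function names are hypothetical.

```python
import os
import signal
import subprocess

MAX_AGE_SECONDS = 15 * 60  # assumption: an ffmpeg job older than this is stuck


def find_stuck(ps_output, command="ffmpeg", max_age=MAX_AGE_SECONDS):
    """Return the PIDs of `command` processes older than max_age seconds.

    ps_output is the text printed by `ps -eo pid,etimes,comm`.
    Lines that do not start with two integers (such as the header)
    are skipped.
    """
    stuck = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) != 3:
            continue
        if not (fields[0].isdigit() and fields[1].isdigit()):
            continue
        pid, elapsed, name = int(fields[0]), int(fields[1]), fields[2].strip()
        if name == command and elapsed > max_age:
            stuck.append(pid)
    return stuck


def kill_stuck(pids, sig=signal.SIGTERM):
    """Send sig to each PID, ignoring processes that already exited."""
    for pid in pids:
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass


def run_once():
    """List processes with ps, then terminate the stuck ffmpeg ones."""
    result = subprocess.run(
        ["ps", "-eo", "pid,etimes,comm"],
        capture_output=True, text=True, check=True,
    )
    kill_stuck(find_stuck(result.stdout))
```

Calling run_once() from a cron job every few minutes would cover the missing scheduling part. Keeping the parsing in find_stuck, separate from the kill step, makes it possible to dry-run the check against captured ps output before letting it terminate anything.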