Operations grimoire/Incidents/2017-03-01-Eglide: Difference between revisions

From Nasqueron Agora



Revision as of 13:05, 6 March 2017

Tracked at https://devcentral.nasqueron.org/T1162.

Incident timeline

  • 03:18:56 amj weechat client timeouts on Freenode
  • 03:19:05 Odderon timeouts too
  • 04:22:47 tomjerr asks if it's down
  • 19:52:41 Sandlayth rebooted the server

After the incident, it was noticed that Odderon didn't reconnect automatically:

  • 21:08:05 Odderon joins channel

Analysis

The root cause of the outage isn't known: the logs don't contain any relevant information.

A simple reboot was enough to restore service, but 16 hours were needed to reach Sandlayth, the only person with the credentials to do it.
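As a rough check of the figures above, the durations can be recomputed from the timeline timestamps. This is a minimal sketch, assuming all timestamps fall on the same day and in the same timezone; the helper name `elapsed` and the variable names are illustrative only.

```python
from datetime import datetime

FMT = "%H:%M:%S"

def elapsed(start: str, end: str):
    """Return the timedelta between two same-day HH:MM:SS timestamps."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)

# Timestamps copied from the incident timeline
# (assumption: same day, same timezone).
first_timeout = "03:18:56"   # amj weechat client timeouts on Freenode
reboot = "19:52:41"          # Sandlayth rebooted the server
bot_back = "21:08:05"        # Odderon joins channel

time_to_reboot = elapsed(first_timeout, reboot)
time_to_bot = elapsed(first_timeout, bot_back)

print(time_to_reboot)  # 16:33:45
print(time_to_bot)     # 17:49:09
```

Under these assumptions, the server was unreachable for a little over 16 and a half hours, and Odderon came back about 75 minutes after the reboot.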

Actionables

  • Get operations squad contact information on file (T1164) *
  • Ensure Scaleway account is accessible (T1165) *
  • [DONE] Enable Odderon service (T1163)
  • Items marked * are tasks restricted to ops