<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://agora.nasqueron.org/index.php?action=history&amp;feed=atom&amp;title=Operations_grimoire%2FIncidents%2F2025-12-14-Hypervisor</id>
	<title>Operations grimoire/Incidents/2025-12-14-Hypervisor - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://agora.nasqueron.org/index.php?action=history&amp;feed=atom&amp;title=Operations_grimoire%2FIncidents%2F2025-12-14-Hypervisor"/>
	<link rel="alternate" type="text/html" href="https://agora.nasqueron.org/index.php?title=Operations_grimoire/Incidents/2025-12-14-Hypervisor&amp;action=history"/>
	<updated>2026-04-14T06:37:53Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.46.0-alpha</generator>
	<entry>
		<id>https://agora.nasqueron.org/index.php?title=Operations_grimoire/Incidents/2025-12-14-Hypervisor&amp;diff=2182&amp;oldid=prev</id>
		<title>Dereckson: Created page with &quot;Our main hypervisor for production VMs didn&#039;t answer at all. A reboot was needed at datacenter level.  Impact was wide: web sites (including wiki), databases, web services, Docker containers (including DevCentral), Drake router, primary DNS  == Incident timeline == &#039;&#039;All timestamps are UTC.&#039;&#039;  ; 2025-12-14 * 14:11 - Agora (web-001) and DevCentral (docker-002) up.  * 14:02 - Routing restored between WindRiver and router-001 to help troubleshoot other GRE connections and r...&quot;</title>
		<link rel="alternate" type="text/html" href="https://agora.nasqueron.org/index.php?title=Operations_grimoire/Incidents/2025-12-14-Hypervisor&amp;diff=2182&amp;oldid=prev"/>
		<updated>2025-12-14T14:21:51Z</updated>

		<summary type="html">&lt;p&gt;Created page with &amp;quot;Our main hypervisor for production VMs didn&amp;#039;t answer at all. A reboot was needed at datacenter level.  Impact was wide: web sites (including wiki), databases, web services, Docker containers (including DevCentral), Drake router, primary DNS  == Incident timeline == &amp;#039;&amp;#039;All timestamps are UTC.&amp;#039;&amp;#039;  ; 2025-12-14 * 14:11 - Agora (web-001) and DevCentral (docker-002) up.  * 14:02 - Routing restored between WindRiver and router-001 to help troubleshoot other GRE connections and r...&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;Our main hypervisor for production VMs didn&amp;#039;t answer at all. A reboot was needed at datacenter level.&lt;br /&gt;
&lt;br /&gt;
Impact was wide: websites (including the wiki), databases, web services, Docker containers (including DevCentral), the Drake router, and primary DNS.&lt;br /&gt;
&lt;br /&gt;
== Incident timeline ==&lt;br /&gt;
&amp;#039;&amp;#039;All timestamps are UTC.&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
; 2025-12-14&lt;br /&gt;
* 14:11 - Agora (web-001) and DevCentral (docker-002) up.&lt;br /&gt;
&lt;br /&gt;
* 14:02 - Routing restored between WindRiver and router-001 to help troubleshoot other GRE connections and routing.&lt;br /&gt;
&lt;br /&gt;
* 13:42 - On docker-002, nginx answers and serves 502 errors.&lt;br /&gt;
&lt;br /&gt;
* 13:40 - Dorian started all machines.&lt;br /&gt;
&lt;br /&gt;
* 13:38 - Dorian confirms autostart is disabled on each machine.&lt;br /&gt;
&lt;br /&gt;
* 13:23 - Server rebooted by OVH, VMware console answers.&lt;br /&gt;
&lt;br /&gt;
* 13:17 - Dereckson confirmed IPMI wasn&amp;#039;t reachable, neither for console nor for reboot.&lt;br /&gt;
&lt;br /&gt;
* 12:54 - OVH monitoring warns us hyper-001 server is down.&lt;br /&gt;
&lt;br /&gt;
Timestamps are estimates, but 10:22 for the MariaDB restart is accurate (from Salt).&lt;br /&gt;
&lt;br /&gt;
== Analysis ==&lt;br /&gt;
Server didn&amp;#039;t respond to ping. Root cause is still unknown.&lt;br /&gt;
&lt;br /&gt;
== Fix ==&lt;br /&gt;
At OVH level, soft reboot of the server.&lt;br /&gt;
&lt;br /&gt;
We also enabled autostart of the VMs; it was previously disabled, so Dorian had to start them manually.&lt;br /&gt;
&lt;br /&gt;
== Actionables ==&lt;br /&gt;
* {{T|2197}} Fix GRE tunnel between WindRiver and router-001&lt;br /&gt;
* Enable autostart for VMs (done)&lt;br /&gt;
* Enable autostart for Docker containers (to do)&lt;br /&gt;
** Create a network for the containers that need acquisitariat; this will be more stable than links for autostart&lt;/div&gt;</summary>
		<author><name>Dereckson</name></author>
	</entry>
</feed>