Operations grimoire/Router: Difference between revisions

From Nasqueron Agora
Duranzed (talk | contribs)
Created page with "= Router = == Troubleshoot == === Switch primary router === This procedure can be used when the current primary router doesn't allow access and network access through the GRE/CARP setup is degraded. Typical symptoms observed: * ICMP may still work, but TCP sessions cannot actually be used * `nc` can connect to TCP ports, but SSH or Vault requests hang * Vault queries time out during TLS handshake * access through Windriver becomes unstable * router-003 does not appea..."
 
Duranzed (talk | contribs)
 
(9 intermediate revisions by the same user not shown)
Line 1: Line 1:
= Router =
= Routers =


== Troubleshoot ==
== Troubleshoot ==
Line 13: Line 13:
* access through Windriver becomes unstable
* access through Windriver becomes unstable
* router-003 does not appear to restore SSH access properly when it becomes primary
* router-003 does not appear to restore SSH access properly when it becomes primary
* the issue may disappear temporarily after switching the primary router (router-002 mostly remains usable), but can come back later
* the issue may disappear temporarily after switching the primary router but can come back later


Observed troubleshooting notes:
Observed troubleshooting notes:
Line 25: Line 25:


<syntaxhighlight lang="bash">
<syntaxhighlight lang="bash">
ifconfig vmx1 down
# On router-002 (current primary):
ifconfig vmx1 down

# On router-003 (new primary):
ifconfig vmx1 up
</syntaxhighlight>
</syntaxhighlight>


To re-enable the interface later:
The unused router needs to be in INIT status, not BACKUP, as BACKUP status blocks access for an unknown reason.


<syntaxhighlight lang="bash">
Check CARP logs after finishing the procedure.
ifconfig vmx1 up
</syntaxhighlight>
 
If the switch does not restore connectivity, it may be necessary to revert the operation by disabling `vmx1` again on the new primary and re-enabling it on the previous router.
 
==== Checks ====
 
Check current routing table:
 
<syntaxhighlight lang="bash">
netstat -rn
</syntaxhighlight>
 
Check TCP connectivity:


<syntaxhighlight lang="bash">
<syntaxhighlight lang="bash">
nc -zv 172.27.27.7 22
tail -f /var/log/carp.log
nc -zv 172.27.27.7 8201
</syntaxhighlight>
</syntaxhighlight>
Check Vault status:
<syntaxhighlight lang="bash">
vault status
</syntaxhighlight>
==== Notes ====


This is currently a workaround, not a permanent fix.
This is currently a workaround, not a permanent fix.


The network issue is still unstable and requires further investigation.
The network issue is still unstable and requires further investigation.

Latest revision as of 15:36, 20 May 2026

Routers

Troubleshoot

Switch primary router

Use this procedure when the current primary router no longer allows access and network access through the GRE/CARP setup is degraded.

Typical symptoms observed:

  • ICMP may still work, but TCP sessions cannot actually be used
  • `nc` can connect to TCP ports, but SSH or Vault requests hang
  • Vault queries time out during TLS handshake
  • access through Windriver becomes unstable
  • router-003 does not appear to restore SSH access properly when it becomes primary
  • the issue may disappear temporarily after switching the primary router but can come back later

Observed troubleshooting notes:

  • the issue can reappear after about 1 hour, but sometimes only after up to 48 hours
  • GRE tunnels may remain pingable while application access times out
  • when the issue occurs, routing and primary router state should be checked
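For the routing part of that check, a minimal sketch follows. It assumes a BSD-style `netstat -rn` table in which a default route has `default` in the first column and the gateway in the second. The `default_gateway` helper name is hypothetical and not part of the original procedure.

```shell
# Hypothetical helper: print the gateway of each default route found in
# `netstat -rn` output read from stdin. Assumes the BSD column layout
# (Destination, Gateway, Flags, Netif, ...).
default_gateway() {
    awk '$1 == "default" { print $2 }'
}

# Usage on a router (not run here):
#   netstat -rn | default_gateway
```

If both IPv4 and IPv6 default routes exist, one gateway is printed per line. Comparing the output on both routers before and after a switch may help confirm which one traffic is actually leaving through.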

Procedure

If `router-002` is currently primary and needs to be switched out temporarily, disable `vmx1` on `router-002` and enable it on `router-003` so that `router-003` can take over:

# On router-002 (current primary):
ifconfig vmx1 down

# On router-003 (new primary):
ifconfig vmx1 up

The unused router needs to be in INIT status, not BACKUP, as BACKUP status blocks access for an unknown reason.
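One way to verify this is to read the CARP state from `ifconfig`. The sketch below assumes the interface output contains a line of the form `carp: BACKUP vhid 1 advbase 1 advskew 100`, as FreeBSD prints it. The helper names `carp_state` and `unused_router_ok` are hypothetical.

```shell
# Hypothetical helper: print the CARP state(s) found in `ifconfig` output
# read from stdin. Assumes lines such as
# "carp: BACKUP vhid 1 advbase 1 advskew 100".
carp_state() {
    awk '$1 == "carp:" { print $2 }'
}

# Hypothetical check: succeed only if at least one CARP state is reported
# and every reported state is INIT.
unused_router_ok() {
    states=$(carp_state)
    [ -n "$states" ] || return 1
    for s in $states; do
        [ "$s" = "INIT" ] || return 1
    done
    return 0
}

# Usage on the unused router (not run here):
#   ifconfig vmx1 | unused_router_ok || echo "vmx1 is not in INIT state"
```

The check deliberately fails when no `carp:` line is present, so that an unexpected output format is not mistaken for a healthy state.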

Check CARP logs after finishing the procedure.

tail -f /var/log/carp.log
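To review past state changes rather than follow the log live, the file can be filtered for transition lines. This is a sketch under an assumption: the format of `/var/log/carp.log` is not shown here, so the pattern assumes kernel-style messages such as `carp: 1@vmx1: BACKUP -> MASTER (master timed out)` end up in that file. The `carp_transitions` name is hypothetical.

```shell
# Hypothetical helper: keep only lines from stdin that record a CARP state
# transition, e.g. "carp: 1@vmx1: BACKUP -> MASTER (master timed out)".
carp_transitions() {
    grep -E '(INIT|BACKUP|MASTER) -> (INIT|BACKUP|MASTER)'
}

# Usage (not run here):
#   carp_transitions < /var/log/carp.log | tail -n 20
```

If the log uses a different message format, the pattern needs to be adjusted accordingly.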

This is currently a workaround, not a permanent fix.

The network issue is still unstable and requires further investigation.