Operations grimoire/Router

From Nasqueron Agora
Duranzed (talk | contribs)
Latest revision as of 15:36, 20 May 2026

Routers

Troubleshoot

Switch primary router

Use this procedure when the current primary router does not allow access and network access through the GRE/CARP setup is degraded.

Typical symptoms observed:

  • ICMP may still work, but TCP sessions cannot actually be used
  • `nc` can connect to TCP ports, but SSH or Vault requests hang
  • Vault queries time out during TLS handshake
  • access through Windriver becomes unstable
  • router-003 does not appear to restore SSH access properly when it becomes primary
  • the issue may disappear temporarily after switching the primary router but can come back later
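The symptoms above separate a path that is merely reachable from one that is actually usable. A minimal triage sketch, under stated assumptions: `classify_path`, `result` and `probe` are hypothetical helper names that are not part of the grimoire, and the SSH check stands in for any application-level request (a Vault query would fit equally well). `ConnectTimeout` does not bound a session that hangs later, so an external timeout may still be needed.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the original procedure.

# classify_path ICMP TCP APP: each argument is "ok" or "fail".
# Prints a one-line diagnosis matching the symptoms listed above.
classify_path() {
    icmp=$1 tcp=$2 app=$3
    if [ "$app" = ok ]; then
        echo "healthy"
    elif [ "$tcp" = ok ]; then
        # TCP connect succeeds (as with nc) but the application hangs
        echo "degraded: connects but hangs"
    elif [ "$icmp" = ok ]; then
        echo "degraded: ICMP only"
    else
        echo "unreachable"
    fi
}

# result CMD...: run a command quietly and print "ok" or "fail".
result() {
    if "$@" >/dev/null 2>&1; then echo ok; else echo fail; fi
}

# probe HOST PORT: run the three checks and classify the outcome.
# BatchMode makes ssh fail instead of prompting for a password.
probe() {
    host=$1 port=$2
    icmp=$(result ping -c 1 "$host")
    tcp=$(result nc -z -w 5 "$host" "$port")
    app=$(result ssh -o BatchMode=yes -o ConnectTimeout=10 "$host" true)
    classify_path "$icmp" "$tcp" "$app"
}
```

For example, `probe router-003 22` would report "degraded: connects but hangs" when `nc` connects but SSH does not complete, which is the pattern described in this section.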

Observed troubleshooting notes:

  • the issue can reappear after about 1 hour, though it sometimes takes up to 48 hours
  • GRE tunnels may remain pingable while application access times out
  • when the issue occurs, routing and primary router state should be checked

Procedure

If `router-002` is currently primary and needs to be switched out temporarily, disable `vmx1` on `router-002` so that `router-003` can take over:

ifconfig vmx1 down   # on router-002
ifconfig vmx1 up     # on router-003

The unused router needs to be in INIT status, not BACKUP: BACKUP status blocks access for a reason that is not yet understood.
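The INIT requirement can be checked from the `ifconfig` output of the unused router. A minimal sketch, assuming FreeBSD-style output in which each CARP entry appears on a line beginning with `carp:` followed by the state (for example `carp: BACKUP vhid 1 advbase 1 advskew 100`). The function names `carp_states` and `unused_router_ok` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers; the "carp: STATE ..." line format is an assumption.

# carp_states: read ifconfig output on stdin and print one CARP state
# per line (INIT, BACKUP or MASTER).
carp_states() {
    awk '$1 == "carp:" { print $2 }'
}

# unused_router_ok: read ifconfig output on stdin and succeed only if at
# least one CARP entry is present and every entry is in INIT.
unused_router_ok() {
    states=$(carp_states)
    [ -n "$states" ] || return 1
    for s in $states; do
        [ "$s" = INIT ] || return 1
    done
    return 0
}
```

On the unused router this could be run as `ifconfig vmx1 | unused_router_ok || echo "vmx1 is not in INIT"`.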

Check CARP logs after finishing the procedure.

tail -f /var/log/carp.log

This is currently a workaround, not a permanent fix.

The network is still unstable, and the underlying issue requires further investigation.