Operations grimoire/Router

Routers

Troubleshoot

Switch primary router

This procedure applies when the current primary router no longer allows access and connectivity through the GRE/CARP setup is degraded.

Typical symptoms observed:

  • ICMP may still work, but TCP sessions cannot actually be used
  • `nc` can connect to TCP ports, but SSH or Vault requests hang
  • Vault queries time out during TLS handshake
  • access through Windriver becomes unstable
  • router-003 does not appear to restore SSH access properly when it becomes primary
  • the issue may disappear temporarily after switching the primary router (router-002 mostly remains usable), but can come back later

Observed troubleshooting notes:

  • the issue can reappear after about 1 hour, but sometimes only after up to 48 hours
  • GRE tunnels may remain pingable while application access times out
  • when the issue occurs, routing and primary router state should be checked

Procedure

If `router-002` is currently primary and needs to be switched out temporarily, disable `vmx1` on `router-002` so that `router-003` can take over:

On router-002:

ifconfig vmx1 down

On router-003:

ifconfig vmx1 up

The unused router needs to be in INIT status, not BACKUP: for an unknown reason, BACKUP status blocks access.
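To confirm which CARP state a router is in, the state can be read from the `ifconfig` output. The following is a minimal sketch, assuming FreeBSD-style output where the interface shows a line such as `carp: MASTER vhid 1 advbase 1 advskew 0`; the helper name `carp_state` is hypothetical and not part of the existing tooling.

```shell
#!/bin/sh
# Print the CARP state(s) (INIT, BACKUP or MASTER) found in ifconfig
# output read from stdin.
# Assumption: FreeBSD-style lines of the form
#   carp: <STATE> vhid <n> advbase <n> advskew <n>
carp_state() {
    awk '$1 == "carp:" { print $2 }'
}

# Usage on a router (interface name taken from the procedure above):
#   ifconfig vmx1 | carp_state
```

On the unused router, the expected output is `INIT`; if it prints `BACKUP`, access is likely to be blocked as described above.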

To re-enable the interface later:

ifconfig vmx1 up

If the switch does not restore connectivity, it may be necessary to revert the operation by disabling `vmx1` on the new primary and re-enabling it on the previous primary.

Checks

Check CARP logs:

tail -f /var/log/carp.log

Check current routing table:

netstat -rn
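When only the default route matters, the routing table output can be filtered. This is a small sketch, assuming the FreeBSD column layout `Destination Gateway Flags Netif Expire`; the helper name `default_route` is hypothetical, and the column index may need adjusting on other systems.

```shell
#!/bin/sh
# Print the gateway and interface of each default route found in
# `netstat -rn` output read from stdin.
# Assumption: columns are Destination, Gateway, Flags, Netif, Expire.
default_route() {
    awk '$1 == "default" { print $2, $4 }'
}

# Usage:
#   netstat -rn | default_route
```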

Check TCP connectivity:

nc -zv 172.27.27.7 22
nc -zv 172.27.27.7 8201
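Since `nc` can connect successfully while SSH still hangs, a successful port check alone does not prove the session is usable. A possible complement is to check whether the SSH server actually sends its identification string, which starts with `SSH-`. The sketch below is an illustration only: `classify_banner` is a hypothetical helper, and the exact behaviour of `nc -w` differs between netcat implementations.

```shell
#!/bin/sh
# Classify the first line received from a TCP service on stdin.
# An SSH server is expected to send a line starting with "SSH-".
classify_banner() {
    IFS= read -r line || true
    case "$line" in
        SSH-*) echo "ok" ;;          # banner received, data flows
        "")    echo "no-data" ;;     # connected, but nothing came back
        *)     echo "unexpected" ;;  # something other than SSH answered
    esac
}

# Usage (5 second timeout, host taken from the checks above):
#   nc -w 5 172.27.27.7 22 < /dev/null | classify_banner
```

A `no-data` result while `nc -zv` succeeds would match the symptom described at the top of this page.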

Check Vault status:

vault status
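Because Vault queries can hang during the TLS handshake, it may help to bound the wait and interpret the exit code. The sketch below assumes the documented `vault status` exit codes (0 for unsealed, 1 for error, 2 for sealed) and the `VAULT_CLIENT_TIMEOUT` environment variable; `vault_status_label` is a hypothetical helper, and the 10 second value is an arbitrary choice.

```shell
#!/bin/sh
# Map the exit code of `vault status` to a short label.
# Assumption: 0 = unsealed, 2 = sealed, anything else = error
# (a timeout is expected to surface as an error).
vault_status_label() {
    case "$1" in
        0) echo "unsealed" ;;
        2) echo "sealed" ;;
        *) echo "error" ;;
    esac
}

# Usage:
#   VAULT_CLIENT_TIMEOUT=10s vault status > /dev/null 2>&1
#   vault_status_label "$?"
```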


This is currently a workaround, not a permanent fix.

The network is still unstable and the underlying issue requires further investigation.