Operations grimoire/Router
Routers
Troubleshoot
Switch primary router
Use this procedure when the current primary router cannot be accessed and network access through the GRE/CARP setup is degraded.
Typical symptoms observed:
- ICMP may still work, but TCP sessions cannot actually be used
- `nc` can connect to TCP ports, but SSH or Vault requests hang
- Vault queries time out during TLS handshake
- access through Windriver becomes unstable
- router-003 does not appear to restore SSH access properly when it becomes primary
- the issue may disappear temporarily after switching the primary router but can come back later
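The symptoms above can be told apart with three probes of increasing depth (ICMP, TCP connect, a real SSH session). The following is a minimal sketch, not an established tool: the function names `classify_probe` and `probe_host`, the 5 and 10 second timeouts, and the availability of `timeout(1)` on the machine running the check are all assumptions.

```shell
#!/bin/sh
# Classify three probe results ("ok" or "fail") into a short diagnosis.
# Arguments: ICMP result, TCP connect result, application-level result.
classify_probe() {
    icmp=$1; tcp=$2; app=$3
    if [ "$app" = "ok" ]; then
        echo "healthy"
    elif [ "$tcp" = "ok" ]; then
        # Matches the symptom above: the port opens but the session hangs.
        echo "degraded: TCP connects but the session is unusable"
    elif [ "$icmp" = "ok" ]; then
        echo "degraded: only ICMP works"
    else
        echo "unreachable"
    fi
}

# Run the real probes against a host and classify the outcome.
# Hypothetical helper: timeouts are arbitrary, and timeout(1) is assumed
# to be installed so that a hung SSH handshake cannot block the script.
probe_host() {
    host=$1
    icmp=fail; tcp=fail; app=fail
    timeout 5 ping -c 1 "$host" >/dev/null 2>&1 && icmp=ok
    nc -z -w 5 "$host" 22 >/dev/null 2>&1 && tcp=ok
    timeout 10 ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true >/dev/null 2>&1 && app=ok
    classify_probe "$icmp" "$tcp" "$app"
}
```

For example, `probe_host router-002` would report the degraded TCP case when `nc` connects but the SSH session never completes.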
Observed troubleshooting notes:
- the issue can reappear after about 1 hour, but sometimes only after as long as 48 hours
- GRE tunnels may remain pingable while application access times out
- when the issue occurs, routing and primary router state should be checked
Procedure
If `router-002` is currently primary and needs to be switched out temporarily, disable `vmx1` on `router-002` so that `router-003` can take over:
On `router-002`:
ifconfig vmx1 down
On `router-003`:
ifconfig vmx1 up
The unused router needs to be in INIT status, not BACKUP: for a reason that is not yet understood, access is blocked while the unused router is in BACKUP status.
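The INIT requirement can be checked from the `ifconfig` output of the unused router. This is a sketch under an assumption: it expects a line of the form `carp: STATE vhid N ...` in the output, which may differ on your system. The helper names `carp_state` and `standby_ok` are hypothetical.

```shell
#!/bin/sh
# Print the CARP state(s) found in `ifconfig` output read from stdin.
# Assumes lines of the form "carp: STATE vhid N ..." (format is an assumption).
carp_state() {
    awk '$1 == "carp:" { print $2 }'
}

# Succeed only if the single reported state is INIT.
# With several vhids on one interface the output has several lines
# and this check fails on purpose, so it should be inspected by hand.
standby_ok() {
    state=$(carp_state)
    [ "$state" = "INIT" ]
}
```

On the unused router, `ifconfig vmx1 | standby_ok && echo "standby is in INIT"` would confirm the expected status; no message means the router should be inspected manually.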
To re-enable the interface on `router-002` later:
ifconfig vmx1 up
If the switch does not restore connectivity, it may be necessary to revert the operation: disable `vmx1` on the new primary and re-enable it on the previous primary.
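Because the forward switch and the revert are mirror images, it can help to print the commands for a given direction before running anything. The helper below is a hypothetical dry-run sketch: `switch_plan` is not an existing tool, it executes nothing, and it assumes the interface is `vmx1` unless a third argument is given.

```shell
#!/bin/sh
# Print the commands needed to move the primary role from one router to another.
# Nothing is executed; the output is meant to be reviewed and run by hand.
switch_plan() {
    from=$1; to=$2; ifc=${3:-vmx1}
    echo "on $from: ifconfig $ifc down"
    echo "on $to: ifconfig $ifc up"
}

switch_plan router-002 router-003   # forward switch
switch_plan router-003 router-002   # revert
```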
Check CARP logs after finishing the procedure.
tail -f /var/log/carp.log
This is currently a workaround, not a permanent fix.
The network is still unstable, and the underlying issue requires further investigation.
