Operations grimoire/Router: Difference between revisions
From Nasqueron Agora
Latest revision as of 15:36, 20 May 2026
Routers
Troubleshoot
Switch primary router
This procedure can be used when the current primary router doesn't allow access and network access through the GRE/CARP setup is degraded.
Typical symptoms observed:
- ICMP may still work, but TCP sessions cannot actually be used
- `nc` can connect to TCP ports, but SSH or Vault requests hang
- Vault queries time out during TLS handshake
- access through Windriver becomes unstable
- router-003 does not appear to restore SSH access properly when it becomes primary
- the issue may disappear temporarily after switching the primary router but can come back later
Observed troubleshooting notes:
- the issue can reappear as soon as about 1 hour after a switch, or as late as 48 hours after
- GRE tunnels may remain pingable while application access times out
- when the issue occurs, routing and primary router state should be checked
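The symptom pattern above (ICMP and TCP connect succeed, application traffic hangs) can be sketched as a small triage helper. This is only a sketch: the function names, the labels it prints, the timeouts and the choice of port 22 are assumptions for illustration, not part of the existing tooling.

```shell
#!/bin/sh
# Hypothetical triage helper; names, labels and timeouts are assumptions.

# probe CMD [ARGS...]: run a command quietly, print "ok" or "fail".
probe() {
    if "$@" >/dev/null 2>&1; then echo ok; else echo fail; fi
}

# classify_symptoms ICMP TCP APP: each argument is "ok" or "fail".
classify_symptoms() {
    if [ "$3" = "ok" ]; then
        echo "healthy"
    elif [ "$2" = "ok" ]; then
        # TCP connects but the application hangs: the pattern described above.
        echo "app-hang"
    elif [ "$1" = "ok" ]; then
        echo "tcp-blocked"
    else
        echo "unreachable"
    fi
}

# check_host HOST: probe one host at three levels and classify the result.
check_host() {
    icmp=$(probe ping -c 3 "$1")
    tcp=$(probe nc -z -w 5 "$1" 22)
    # timeout(1) bounds the whole attempt, since the hang may occur
    # after the TCP connection is already established.
    app=$(probe timeout 15 ssh -o BatchMode=yes -o ConnectTimeout=10 "$1" true)
    classify_symptoms "$icmp" "$tcp" "$app"
}
```

Running `check_host <hostname>` against a host behind the routers prints one label. An `app-hang` result would suggest checking routing and the primary router state, as noted above.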
Procedure
If `router-002` is currently primary and needs to be switched out temporarily, disable `vmx1` on `router-002` so that `router-003` can take over:
On `router-002`: `ifconfig vmx1 down`
On `router-003`: `ifconfig vmx1 up`
The unused router needs to be in INIT status, not BACKUP: for a reason not yet identified, BACKUP status blocks access.
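To confirm the CARP status on each router after the switch, something like the following could be used. It assumes `ifconfig` prints a line of the form `carp: STATE vhid ...` for the interface, as FreeBSD does; the function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers; assume ifconfig output contains "carp: STATE vhid ..." lines.

# carp_state: read `ifconfig <iface>` output on stdin, print each CARP state found.
carp_state() {
    awk '$1 == "carp:" { print $2 }'
}

# standby_state_ok STATE: succeed only if the unused router is in INIT.
standby_state_ok() {
    [ "$1" = "INIT" ]
}
```

On the unused router, `ifconfig vmx1 | carp_state` should then print `INIT`. If it prints `BACKUP`, the access problem described above is likely to persist.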
Check the CARP logs after finishing the procedure:
`tail -f /var/log/carp.log`
This is currently a workaround, not a permanent fix.
The network remains unstable, and the root cause requires further investigation.
