BGP Sessions Remain Idle After Restarting systemd-networkd or Rebooting on VM in FRR 10.2.1 (Physical Machine & KVM) #17860

Open
2 tasks done
skyblueted opened this issue Jan 15, 2025 · 6 comments
Assignees
Labels
bgp triage Needs further investigation

Comments

@skyblueted
skyblueted commented Jan 15, 2025

Description

After upgrading FRR to 10.2.1 on both the bare-metal (BM) hosts and the VMs, we ran into the following problem.
When the VM is rebooted or systemd-networkd is restarted on it, the BGP session stays stuck in Idle until we either bring the bridge interface down and up or restart FRR on the physical machine (bare metal).

However, with FRR 9.0.2 or 9.0.5 on the physical machine, the BGP session re-establishes automatically without issues.
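
The two workarounds can be sketched as a pair of shell helpers for the physical machine. This is only a minimal sketch under assumptions: the function names are hypothetical, br1 is the bridge named in the BM config in this report, and the frr unit name is taken from the journalctl -u frr call later in this thread.

```shell
#!/bin/sh
# Hypothetical helpers for the two workarounds described above.
# Intended to be run as root on the physical machine.

# Bring a bridge interface down and back up (default: br1).
bounce_bridge() {
    dev="${1:-br1}"
    ip link set "$dev" down && ip link set "$dev" up
}

# Restart FRR, assuming it runs as the "frr" systemd unit.
restart_frr() {
    systemctl restart frr
}
```

Calling either bounce_bridge br1 or restart_frr should correspond to the manual recovery steps; neither addresses the underlying cause.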

FRR log (VM):

Jan 15 19:03:01 vm-01 bgpd[3231]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 15 19:03:01 vm-01 bgpd[3231]: [H4B4J-DCW2R][EC 33554455] ens4 [Error] bgp_read_packet error: Connection reset by peer
Jan 15 19:03:11 vm-01 bgpd[3231]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 15 19:03:11 vm-01 bgpd[3231]: [H4B4J-DCW2R][EC 33554455] ens4 [Error] bgp_read_packet error: Connection reset by peer
Jan 15 19:03:21 vm-01 bgpd[3231]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 15 19:03:31 vm-01 bgpd[3231]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 15 19:03:31 vm-01 bgpd[3231]: [H4B4J-DCW2R][EC 33554455] ens4 [Error] bgp_read_packet error: Connection reset by peer
Jan 15 19:03:41 vm-01 bgpd[3231]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 15 19:03:41 vm-01 bgpd[3231]: [H4B4J-DCW2R][EC 33554455] ens4 [Error] bgp_read_packet error: Connection reset by peer
Jan 15 19:03:51 vm-01 bgpd[3231]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 15 19:03:51 vm-01 bgpd[3231]: [H4B4J-DCW2R][EC 33554455] ens4 [Error] bgp_read_packet error: Connection reset by peer
Jan 15 19:04:01 vm-01 bgpd[3231]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected

Packet sniffer (VM):

root@vm-01:~# tcpdump -i ens4 'tcp port 179' -n
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on ens4, link-type EN10MB (Ethernet), snapshot length 262144 bytes
19:04:40.697365 IP6 fe80::666f:5bff:fef0:82d7.44616 > fe80::182d:37ff:fe5d:3c0e.179: Flags [S], seq 3840576242, win 62580, options [mss 8940,sackOK,TS val 2973803148 ecr 0,nop,wscale 14], length 0
19:04:40.697490 IP6 fe80::182d:37ff:fe5d:3c0e.179 > fe80::666f:5bff:fef0:82d7.44616: Flags [S.], seq 1164270323, ack 3840576243, win 62496, options [mss 8940,sackOK,TS val 1447149278 ecr 2973803148,nop,wscale 14], length 0
19:04:40.697526 IP6 fe80::666f:5bff:fef0:82d7.44616 > fe80::182d:37ff:fe5d:3c0e.179: Flags [.], ack 1, win 4, options [nop,nop,TS val 2973803148 ecr 1447149278], length 0
19:04:40.697603 IP6 fe80::666f:5bff:fef0:82d7.44616 > fe80::182d:37ff:fe5d:3c0e.179: Flags [P.], seq 1:184, ack 1, win 4, options [nop,nop,TS val 2973803148 ecr 1447149278], length 183: BGP
19:04:40.697684 IP6 fe80::182d:37ff:fe5d:3c0e.179 > fe80::666f:5bff:fef0:82d7.44616: Flags [.], ack 184, win 4, options [nop,nop,TS val 1447149278 ecr 2973803148], length 0
19:04:40.697687 IP6 fe80::182d:37ff:fe5d:3c0e.179 > fe80::666f:5bff:fef0:82d7.44616: Flags [R.], seq 1, ack 184, win 4, options [nop,nop,TS val 1447149278 ecr 2973803148], length 0
19:04:40.697720 IP6 fe80::666f:5bff:fef0:82d7.44616 > fe80::182d:37ff:fe5d:3c0e.179: Flags [R], seq 3840576426, win 0, length 0
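
To see at a glance which side tears the connection down, a saved capture can be reduced to a count of RST segments per source address. A minimal sketch, assuming tcpdump -n text output in the format shown above; the function name rst_senders and the file name in the usage note are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count TCP RST segments per source address.
# Reads "tcpdump -n" text output on stdin.
rst_senders() {
    awk '$6 == "Flags" && $7 ~ /^\[R/ {
             src = $3
             sub(/\.[0-9]+$/, "", src)   # drop the trailing ".port"
             count[src]++
         }
         END { for (s in count) print count[s], s }' | LC_ALL=C sort -k2
}
```

For example, rst_senders < capture.txt prints one line per address, with the RST count first.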

BGP session (abnormal):

vm-01# show ip bgp summary

IPv4 Unicast Summary:
BGP router identifier 100.111.91.240, local AS number 65500 VRF default vrf-id 0
BGP table version 3
RIB entries 4, using 512 bytes of memory
Peers 1, using 24 KiB of memory
Peer groups 1, using 64 bytes of memory

Neighbor        V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt Desc
ens4            4      65500         0        59        0    0    0    never       Active        0 N/A

After bringing the bridge interface down and up, or restarting FRR on the physical machine:

root@vm-01:~# tcpdump -i ens4 'tcp port 179' -n
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on ens4, link-type EN10MB (Ethernet), snapshot length 262144 bytes
19:05:42.900870 IP6 fe80::666f:5bff:fef0:82d7.60170 > fe80::182d:37ff:fe5d:3c0e.179: Flags [S], seq 1472385439, win 62580, options [mss 8940,sackOK,TS val 2973865351 ecr 0,nop,wscale 14], length 0
19:05:42.901019 IP6 fe80::182d:37ff:fe5d:3c0e.179 > fe80::666f:5bff:fef0:82d7.60170: Flags [S.], seq 3060964090, ack 1472385440, win 62496, options [mss 8940,sackOK,TS val 1447211481 ecr 2973865351,nop,wscale 14], length 0
19:05:42.901081 IP6 fe80::666f:5bff:fef0:82d7.60170 > fe80::182d:37ff:fe5d:3c0e.179: Flags [.], ack 1, win 4, options [nop,nop,TS val 2973865352 ecr 1447211481], length 0
19:05:42.901237 IP6 fe80::666f:5bff:fef0:82d7.60170 > fe80::182d:37ff:fe5d:3c0e.179: Flags [P.], seq 1:184, ack 1, win 4, options [nop,nop,TS val 2973865352 ecr 1447211481], length 183: BGP
19:05:42.901287 IP6 fe80::182d:37ff:fe5d:3c0e.179 > fe80::666f:5bff:fef0:82d7.60170: Flags [F.], seq 1, ack 1, win 4, options [nop,nop,TS val 1447211481 ecr 2973865352], length 0
19:05:42.901316 IP6 fe80::182d:37ff:fe5d:3c0e.179 > fe80::666f:5bff:fef0:82d7.60170: Flags [R], seq 3060964091, win 0, length 0
19:05:52.901889 IP6 fe80::666f:5bff:fef0:82d7.44206 > fe80::182d:37ff:fe5d:3c0e.179: Flags [S], seq 4072060639, win 62580, options [mss 8940,sackOK,TS val 2973875352 ecr 0,nop,wscale 14], length 0
19:05:53.928066 IP6 fe80::666f:5bff:fef0:82d7.44206 > fe80::182d:37ff:fe5d:3c0e.179: Flags [S], seq 4072060639, win 62580, options [mss 8940,sackOK,TS val 2973876379 ecr 0,nop,wscale 14], length 0
19:05:54.227238 IP6 fe80::182d:37ff:fe5d:3c0e.179 > fe80::666f:5bff:fef0:82d7.44206: Flags [S.], seq 3255217763, ack 4072060640, win 62496, options [mss 8940,sackOK,TS val 1447222508 ecr 2973876379,nop,wscale 14], length 0
19:05:54.227268 IP6 fe80::666f:5bff:fef0:82d7.44206 > fe80::182d:37ff:fe5d:3c0e.179: Flags [.], ack 1, win 4, options [nop,nop,TS val 2973876678 ecr 1447222508], length 0
19:05:54.227432 IP6 fe80::182d:37ff:fe5d:3c0e.179 > fe80::666f:5bff:fef0:82d7.44206: Flags [F.], seq 1, ack 1, win 4, options [nop,nop,TS val 1447222807 ecr 2973876678], length 0
19:05:54.227460 IP6 fe80::666f:5bff:fef0:82d7.44206 > fe80::182d:37ff:fe5d:3c0e.179: Flags [P.], seq 1:184, ack 2, win 4, options [nop,nop,TS val 2973876678 ecr 1447222807], length 183: BGP
19:05:54.227553 IP6 fe80::182d:37ff:fe5d:3c0e.179 > fe80::666f:5bff:fef0:82d7.44206: Flags [R], seq 3255217765, win 0, length 0
19:05:54.576228 IP6 fe80::182d:37ff:fe5d:3c0e.56812 > fe80::666f:5bff:fef0:82d7.179: Flags [S], seq 1796987062, win 62580, options [mss 8940,sackOK,TS val 1447223156 ecr 0,nop,wscale 14], length 0
19:05:54.576269 IP6 fe80::666f:5bff:fef0:82d7.179 > fe80::182d:37ff:fe5d:3c0e.56812: Flags [S.], seq 3311269946, ack 1796987063, win 62496, options [mss 8940,sackOK,TS val 2973877027 ecr 1447223156,nop,wscale 14], length 0
19:05:54.576328 IP6 fe80::182d:37ff:fe5d:3c0e.56812 > fe80::666f:5bff:fef0:82d7.179: Flags [.], ack 1, win 4, options [nop,nop,TS val 1447223156 ecr 2973877027], length 0
19:05:54.576400 IP6 fe80::182d:37ff:fe5d:3c0e.56812 > fe80::666f:5bff:fef0:82d7.179: Flags [P.], seq 1:170, ack 1, win 4, options [nop,nop,TS val 1447223156 ecr 2973877027], length 169: BGP
19:05:54.576429 IP6 fe80::666f:5bff:fef0:82d7.179 > fe80::182d:37ff:fe5d:3c0e.56812: Flags [.], ack 170, win 4, options [nop,nop,TS val 2973877027 ecr 1447223156], length 0
19:05:54.576699 IP6 fe80::666f:5bff:fef0:82d7.179 > fe80::182d:37ff:fe5d:3c0e.56812: Flags [P.], seq 1:184, ack 170, win 4, options [nop,nop,TS val 2973877027 ecr 1447223156], length 183: BGP
19:05:54.576744 IP6 fe80::182d:37ff:fe5d:3c0e.56812 > fe80::666f:5bff:fef0:82d7.179: Flags [.], ack 184, win 4, options [nop,nop,TS val 1447223156 ecr 2973877027], length 0
19:05:54.576798 IP6 fe80::182d:37ff:fe5d:3c0e.56812 > fe80::666f:5bff:fef0:82d7.179: Flags [P.], seq 170:189, ack 184, win 4, options [nop,nop,TS val 1447223156 ecr 2973877027], length 19: BGP
19:05:54.576829 IP6 fe80::666f:5bff:fef0:82d7.179 > fe80::182d:37ff:fe5d:3c0e.56812: Flags [P.], seq 184:203, ack 189, win 4, options [nop,nop,TS val 2973877027 ecr 1447223156], length 19: BGP
19:05:54.619076 IP6 fe80::182d:37ff:fe5d:3c0e.56812 > fe80::666f:5bff:fef0:82d7.179: Flags [.], ack 203, win 4, options [nop,nop,TS val 1447223199 ecr 2973877027], length 0
19:05:55.677175 IP6 fe80::666f:5bff:fef0:82d7.179 > fe80::182d:37ff:fe5d:3c0e.56812: Flags [P.], seq 203:725, ack 189, win 4, options [nop,nop,TS val 2973878128 ecr 1447223199], length 522: BGP
19:05:55.677311 IP6 fe80::182d:37ff:fe5d:3c0e.56812 > fe80::666f:5bff:fef0:82d7.179: Flags [.], ack 725, win 4, options [nop,nop,TS val 1447224257 ecr 2973878128], length 0
19:05:55.877183 IP6 fe80::182d:37ff:fe5d:3c0e.56812 > fe80::666f:5bff:fef0:82d7.179: Flags [P.], seq 189:506, ack 725, win 4, options [nop,nop,TS val 1447224457 ecr 2973878128], length 317: BGP
19:05:55.920065 IP6 fe80::666f:5bff:fef0:82d7.179 > fe80::182d:37ff:fe5d:3c0e.56812: Flags [.], ack 506, win 4, options [nop,nop,TS val 2973878371 ecr 1447224457], length 0
19:06:04.021284 IP6 fe80::666f:5bff:fef0:82d7.179 > fe80::182d:37ff:fe5d:3c0e.56812: Flags [P.], seq 725:875, ack 506, win 4, options [nop,nop,TS val 2973886472 ecr 1447224457], length 150: BGP
19:06:04.021397 IP6 fe80::182d:37ff:fe5d:3c0e.56812 > fe80::666f:5bff:fef0:82d7.179: Flags [.], ack 875, win 4, options [nop,nop,TS val 1447232601 ecr 2973886472], length 0
^C
29 packets captured
29 packets received by filter
0 packets dropped by kernel
root@vm-01:~#
root@vm-01:~# vtysh

Hello, this is FRRouting (version 10.2.1).
Copyright 1996-2005 Kunihiro Ishiguro, et al.

vm-01# show ip bgp summary

IPv4 Unicast Summary:
BGP router identifier 100.111.91.240, local AS number 65500 VRF default vrf-id 0
BGP table version 5
RIB entries 6, using 768 bytes of memory
Peers 1, using 24 KiB of memory
Peer groups 1, using 64 bytes of memory

Neighbor        V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt Desc
ens4            4      65500         7        74        5    0    0 00:00:23            2        3 FRRouting/10.2.1

Version

OS/kernel: Ubuntu 22.04/6.2.0-39-generic
FRR version: 10.2.1

How to reproduce

Both the bare-metal host and the VM run FRR 10.2.1.

BM's FRR config:

hostname bm-01
log file /var/log/frr/bgpd.log informational
log syslog informational
no zebra nexthop kernel enable
service integrated-vtysh-config
!
interface eno5
 ipv6 nd ra-interval 4
 ipv6 nd ra-lifetime 10
 no ipv6 nd suppress-ra
!
interface ens3f0
 ipv6 nd ra-interval 4
 ipv6 nd ra-lifetime 10
 no ipv6 nd suppress-ra
!
interface br1
 ipv6 nd ra-interval 4
 ipv6 nd ra-lifetime 10
 no ipv6 nd suppress-ra
exit
!
router bgp 65500 vrf vm
 bgp router-id 100.111.11.228
 neighbor vm_fabric peer-group
 neighbor vm_fabric remote-as 65500
 neighbor vm_fabric description Internal VM Network
 neighbor vm_fabric bfd
 neighbor vm_fabric bfd profile bfd_template
 neighbor vm_fabric timers connect 10
 neighbor vm_fabric capability extended-nexthop
 neighbor br1 interface peer-group vm_fabric
 !
 address-family ipv4 unicast
  redistribute kernel route-map route_filter
  redistribute connected route-map route_filter
  neighbor vm_fabric default-originate
  neighbor vm_fabric soft-reconfiguration inbound
  maximum-paths 64
  maximum-paths ibgp 64
 exit-address-family
 !
 address-family ipv6 unicast
  redistribute kernel route-map v6_route_filter
  redistribute connected route-map v6_route_filter
  neighbor vm_fabric activate
  neighbor vm_fabric default-originate
  neighbor vm_fabric soft-reconfiguration inbound
  maximum-paths 64
  maximum-paths ibgp 64
 exit-address-family
exit
!
router bgp 64705
 bgp router-id 100.111.11.228
 neighbor fabric peer-group
 neighbor fabric remote-as 64705
 neighbor fabric description Interal Fabric Network
 neighbor fabric bfd
 neighbor fabric bfd profile bfd_template
 neighbor fabric timers connect 10
 neighbor fabric capability extended-nexthop
 neighbor eno5 interface peer-group fabric
 neighbor ens3f0 interface peer-group fabric
 !
 address-family ipv4 unicast
  redistribute kernel route-map route_filter
  redistribute connected route-map route_filter
  neighbor fabric soft-reconfiguration inbound
  maximum-paths 64
  maximum-paths ibgp 64
  import vrf vm
 exit-address-family
 !
 address-family ipv6 unicast
  redistribute kernel route-map v6_route_filter
  redistribute connected route-map v6_route_filter
  neighbor fabric activate
  neighbor fabric soft-reconfiguration inbound
  maximum-paths 64
  maximum-paths ibgp 64
  import vrf vm
 exit-address-family
exit
!
access-list block_default seq 5 permit 0.0.0.0/0 exact-match
!
ipv6 access-list v6_block_default seq 5 permit ::/0 exact-match
!
route-map route_filter deny 10
 match ip address block_default
exit
!
route-map route_filter permit 20
exit
!
route-map v6_route_filter deny 10
 match ipv6 address v6_block_default
exit
!
route-map v6_route_filter permit 20
exit
!
bfd
 profile bfd_template
  detect-multiplier 6
  transmit-interval 500
  receive-interval 500
 exit
 !
exit
!

VM's FRR config:

log file /var/log/frr/bgpd.log informational
log syslog informational
service integrated-vtysh-config
no zebra nexthop kernel enable
!
interface ens4
 ipv6 nd ra-interval 4
 ipv6 nd ra-lifetime 10
 no ipv6 nd suppress-ra
!
router bgp 65500
 bgp router-id 100.111.91.240
 neighbor fabric peer-group
 neighbor ens4 peer-group fabric
 neighbor fabric remote-as 65500
 neighbor fabric description Interal Fabric Network
 neighbor fabric bfd
 neighbor fabric bfd profile bfd_template
 neighbor fabric timers connect 10
 neighbor fabric capability extended-nexthop
 neighbor ens4 interface peer-group fabric
 !
 address-family ipv4 unicast
  redistribute kernel route-map route_filter
  redistribute connected route-map route_filter
  neighbor fabric soft-reconfiguration inbound
  maximum-paths 64
  maximum-paths ibgp 64
 exit-address-family
!
 address-family ipv6 unicast
  redistribute kernel route-map v6_route_filter
  redistribute connected route-map v6_route_filter
  neighbor fabric activate
  neighbor fabric soft-reconfiguration inbound
  maximum-paths 64
  maximum-paths ibgp 64
 exit-address-family
!
access-list block_default seq 5 permit 0.0.0.0/0 exact-match
!
ipv6 access-list v6_block_default seq 5 permit ::/0 exact-match
!
route-map route_filter deny 10
 match ip address block_default
!
route-map route_filter permit 20
!
route-map v6_route_filter deny 10
 match ipv6 address v6_block_default
exit
!
route-map v6_route_filter permit 20
exit
!
bfd
 profile bfd_template
  detect-multiplier 6
  transmit-interval 500
  receive-interval 500
 exit
 !
exit
!
line vty
!

Expected behavior

After rebooting the VM or restarting systemd-networkd, the BGP session re-establishes automatically.

Actual behavior

The BGP session remains in the Idle state.

Additional context

No response

Checklist

  • I have searched the open issues for this bug.
  • I have not included sensitive information in this report.
@skyblueted added the triage (Needs further investigation) label on Jan 15, 2025
@skyblueted changed the title from "BGP Sessions Remain Idle After Restarting systemd-networkd on VM in FRR 10.2.1 (Physical Machine & KVM)" to "BGP Sessions Remain Idle After Restarting systemd-networkd or Rebooting on VM in FRR 10.2.1 (Physical Machine & KVM)" on Jan 15, 2025
@ton31337
Member

Could you show the output of debug bgp bfd, debug bfd peer, debug bfd zebra?
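
One way to turn these debugs on from a shell is through vtysh -c. This is a minimal sketch, not a confirmed procedure for this setup: the function names are hypothetical, and the second helper rests on the assumption that debug messages are emitted at the debugging log level, so a target set to informational (as in the configs above) may filter them out.

```shell
#!/bin/sh
# Hypothetical helpers; the function names are illustrative, not part of FRR.

# Enable the debugs requested above on the running daemons.
enable_bfd_debugs() {
    vtysh -c 'debug bgp bfd' \
          -c 'debug bfd peer' \
          -c 'debug bfd zebra'
}

# Assumption: debug output is logged at the "debugging" level, so the
# syslog target may need raising from "informational" to show it.
raise_syslog_level() {
    vtysh -c 'configure terminal' \
          -c 'log syslog debugging'
}
```

Both changes affect only the running daemons and would need repeating after an FRR restart.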

@skyblueted
Author

Hi @ton31337 ,

After executing the commands "debug bgp bfd", "debug bfd peer", and "debug bfd zebra", I observed the following results in the FRR log.

VM:

Jan 17 20:08:44 vm-01 bgpd[55995]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 17 20:08:54 vm-01 bgpd[55995]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 17 20:09:04 vm-01 bgpd[55995]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 17 20:09:14 vm-01 bgpd[55995]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 17 20:09:24 vm-01 bgpd[55995]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 17 20:09:24 vm-01 bgpd[55995]: [QYZDQ-4PHG5][EC 100663316] Attempting to process an I/O event but for fd: 27(8) no thread to handle this!
Jan 17 20:09:34 vm-01 bgpd[55995]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 17 20:09:34 vm-01 bgpd[55995]: [H4B4J-DCW2R][EC 33554455] ens4 [Error] bgp_read_packet error: Connection reset by peer
Jan 17 20:09:44 vm-01 bgpd[55995]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 17 20:09:44 vm-01 bgpd[55995]: [H4B4J-DCW2R][EC 33554455] ens4 [Error] bgp_read_packet error: Connection reset by peer
Jan 17 20:09:54 vm-01 bgpd[55995]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 17 20:10:05 vm-01 bgpd[55995]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 17 20:10:15 vm-01 bgpd[55995]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 17 20:10:15 vm-01 bgpd[55995]: [H4B4J-DCW2R][EC 33554455] ens4 [Error] bgp_read_packet error: Connection reset by peer
Jan 17 20:10:25 vm-01 bgpd[55995]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 17 20:10:25 vm-01 bgpd[55995]: [H4B4J-DCW2R][EC 33554455] ens4 [Error] bgp_read_packet error: Connection reset by peer
Jan 17 20:10:35 vm-01 bgpd[55995]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 17 20:10:45 vm-01 bgpd[55995]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 17 20:10:55 vm-01 bgpd[55995]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 17 20:10:55 vm-01 bgpd[55995]: [H4B4J-DCW2R][EC 33554455] ens4 [Error] bgp_read_packet error: Connection reset by peer
Jan 17 20:11:05 vm-01 bgpd[55995]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 17 20:11:15 vm-01 bgpd[55995]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 17 20:11:25 vm-01 bgpd[55995]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 17 20:11:25 vm-01 bgpd[55995]: [H4B4J-DCW2R][EC 33554455] ens4 [Error] bgp_read_packet error: Connection reset by peer
Jan 17 20:11:35 vm-01 bgpd[55995]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 17 20:11:45 vm-01 bgpd[55995]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 17 20:11:45 vm-01 bgpd[55995]: [H4B4J-DCW2R][EC 33554455] ens4 [Error] bgp_read_packet error: Connection reset by peer
Jan 17 20:11:55 vm-01 bgpd[55995]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 17 20:12:05 vm-01 bgpd[55995]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected

BM (physical machine):

Jan 17 20:08:12 bm-01 bgpd[9599]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 17 20:08:22 bm-01 bgpd[9599]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 17 20:08:22 bm-01 bgpd[9599]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 17 20:08:32 bm-01 bgpd[9599]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 17 20:08:32 bm-01 bgpd[9599]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 17 20:08:42 bm-01 bgpd[9599]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 17 20:08:42 bm-01 bgpd[9599]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 17 20:08:52 bm-01 bgpd[9599]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 17 20:08:52 bm-01 bgpd[9599]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 17 20:09:02 bm-01 bgpd[9599]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 17 20:09:02 bm-01 bgpd[9599]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 17 20:09:12 bm-01 bgpd[9599]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 17 20:09:12 bm-01 bgpd[9599]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 17 20:09:12 bm-01 bgpd[9599]: [QYZDQ-4PHG5][EC 100663316] Attempting to process an I/O event but for fd: 31(8) no thread to handle this!
Jan 17 20:09:22 bm-01 bgpd[9599]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 17 20:09:22 bm-01 bgpd[9599]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 17 20:09:32 bm-01 bgpd[9599]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 17 20:09:32 bm-01 bgpd[9599]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 17 20:09:42 bm-01 bgpd[9599]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 17 20:09:42 bm-01 bgpd[9599]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 17 20:09:52 bm-01 bgpd[9599]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 17 20:09:52 bm-01 bgpd[9599]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 17 20:10:02 bm-01 bgpd[9599]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected

@ton31337
Member

OK, that's not what I expected to see. Could you also enable debug bgp neighbor?

@ton31337 added the bgp label on Jan 18, 2025
@ton31337 self-assigned this on Jan 18, 2025
@skyblueted
Author

skyblueted commented Jan 20, 2025

Hi @ton31337 ,
Here are the updated results after running "debug bgp bfd", "debug bfd peer", "debug bfd zebra", and "debug bgp neighbor".

Both VM and BM use version 10.2.1:

VM:

root@vm-01:~# date; systemctl restart systemd-networkd
Mon Jan 20 07:05:06 PM CST 2025

Note: In our environment, after restarting systemd-networkd, the frr service will be automatically restarted.

Jan 20 19:05:07 vm-01 bgpd[1575698]: [J9K4Q-T8STY][EC 33554466] ens4 [FSM] Failure handling event BGP_Start in state Idle, prior events BGP_S>
Jan 20 19:05:08 vm-01 zebra[1575691]: [V98V0-MTWPF] client 51 says hello and bids fair to announce only bgp routes vrf=0
Jan 20 19:05:11 vm-01 bgpd[1575698]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 20 19:05:12 vm-01 watchfrr[1575678]: [QDG3Y-BY5TN] mgmtd state -> up : connect succeeded
Jan 20 19:05:12 vm-01 watchfrr[1575678]: [QDG3Y-BY5TN] bgpd state -> up : connect succeeded
Jan 20 19:05:12 vm-01 watchfrr[1575678]: [QDG3Y-BY5TN] zebra state -> up : connect succeeded
Jan 20 19:05:12 vm-01 watchfrr[1575678]: [QDG3Y-BY5TN] bfdd state -> up : connect succeeded
Jan 20 19:05:12 vm-01 watchfrr[1575678]: [QDG3Y-BY5TN] staticd state -> up : connect succeeded
Jan 20 19:05:12 vm-01 watchfrr[1575678]: [KWE5Q-QNGFC] all daemons up, doing startup-complete notify
Jan 20 19:05:12 vm-01 frrinit.sh[1575668]:  * Started watchfrr
Jan 20 19:05:12 vm-01 systemd[1]: Started FRRouting.
Jan 20 19:05:21 vm-01 bgpd[1575698]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 20 19:05:31 vm-01 bgpd[1575698]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 20 19:05:31 vm-01 bgpd[1575698]: [H4B4J-DCW2R][EC 33554455] ens4 [Error] bgp_read_packet error: Connection reset by peer
Jan 20 19:05:41 vm-01 bgpd[1575698]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 20 19:05:51 vm-01 bgpd[1575698]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 20 19:05:51 vm-01 bgpd[1575698]: [H4B4J-DCW2R][EC 33554455] ens4 [Error] bgp_read_packet error: Connection reset by peer
Jan 20 19:06:01 vm-01 bgpd[1575698]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 20 19:06:05 vm-01 bgpd[1575698]: [M59KS-A3ZXZ] bgp_update_receive: rcvd End-of-RIB for IPv6 Unicast from ens4 in vrf default
Jan 20 19:06:05 vm-01 bgpd[1575698]: [M59KS-A3ZXZ] bgp_update_receive: rcvd End-of-RIB for IPv4 Unicast from ens4 in vrf default

BM:

root@bm-01:~# ip l set br1 down; ip l set br1 up;
root@bm-01:~# date
Mon Jan 20 07:06:04 PM CST 2025
bm-01:~# journalctl -u frr -S 19:05
Jan 20 19:05:07 bm-01 bgpd[46043]: [HZN6M-XRM1G] %NOTIFICATION(Hard Reset): sent to neighbor br1 6/10 (Cease/BFD Down) 0 bytes
Jan 20 19:06:03 bm-01 bgpd[46043]: [TXY0T-CYY6F][EC 100663299] Can't get remote address and port: Transport endpoint is not connected
Jan 20 19:06:04 bm-01 bgpd[46043]: [M59KS-A3ZXZ] bgp_update_receive: rcvd End-of-RIB for IPv4 Unicast from br1 in vrf vm
Jan 20 19:06:04 bm-01 bgpd[46043]: [M59KS-A3ZXZ] bgp_update_receive: rcvd End-of-RIB for IPv6 Unicast from br1 in vrf vm

By the way, we also tested downgrading only the BM to 10.1.2, and a similar issue still occurred.
When we restart systemd-networkd on the BM or reboot it, the BGP session goes Idle until we manually bring the VM's network interface down and up.

However, when both the BM and the VM are downgraded to 10.1.2, restarting systemd-networkd or rebooting on either side re-establishes the BGP session automatically.

@ton31337
Copy link
Member

Could you remove BFD from the configuration and test it out? I want to narrow-down the scope a bit.
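
For reference, removing BFD from a peer-group could look roughly like the helper below. A minimal sketch, assuming that no neighbor PEER bfd also drops the attached profile; the function name disable_bgp_bfd and its arguments are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: remove BFD from a BGP peer-group via vtysh.
# Usage: disable_bgp_bfd ASN PEER_GROUP [VRF]
disable_bgp_bfd() {
    asn="$1"
    group="$2"
    vrf="${3:-}"
    router="router bgp $asn"
    if [ -n "$vrf" ]; then
        router="$router vrf $vrf"
    fi
    vtysh -c 'configure terminal' \
          -c "$router" \
          -c "no neighbor $group bfd"
}
```

With the configs in this report, that would be disable_bgp_bfd 65500 fabric on the VM and disable_bgp_bfd 65500 vm_fabric vm on the BM. The change applies to the running configuration only, so the bfd lines would also need removing from the saved config (for example with write memory) to survive the FRR restart that follows a systemd-networkd restart in this environment.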

@skyblueted
Author

skyblueted commented Jan 21, 2025

Sure.

After removing the BFD-related configuration, BGP behaves as expected.

Tests on both the VM and the BM show that the BGP session now re-establishes automatically after restarting systemd-networkd or rebooting.
