# Digging into Network Latency

The WolfspyreLabs Blog, July 2022

## What the hell??

You know how it goes… You’re sitting there… Playing around on the internet… And then you wonder… Hey… Why is this SO SLOW?!?!

Well… I started looking into possible causes… And that’s how I got myself [nerd sniped][xkcd-nerdsnipe-comic].

Fortunately, the wonderful humans over at [calomel.org] have put together a whole lot of really helpful information. Thanks gang!! In particular, I found their [FreeBSD network tuning guide][calomel-fbsdnetwork] especially helpful!

## The Problem I’m noticing

I noticed that from my OSX desktop, and most of the hosts here, I could download assets significantly faster from El Internetto™ than I could from my TrueNAS SCALE box.

`ss` to the rescue: `root@pine# ss -i4tdM -f inet -E`

A line from `ss` output:

```Shell
tcp UNCONN 0 0 198.18.198.23:sunrpc 198.18.198.44:658
    wscale:7,7 rto:204 rtt:0.124/0.057 ato:40 mss:8948 pmtu:9216 rcvmss:536 advmss:9164 cwnd:10 bytes_sent:56 bytes_acked:57 bytes_received:97 segs_out:3 segs_in:6 data_segs_out:1 data_segs_in:1 send 5.77Gbps lastrcv:4 pacing_rate 11.5Gbps delivery_rate 918Mbps delivered:3 rcv_space:56360 rcv_ssthresh:56360 minrtt:0.078
```

as well as

```Shell
tcp ESTAB 0 0 198.18.198.23:nfs 198.18.198.43:832
    cubic wscale:7,7 rto:204 rtt:0.191/0.092 ato:40 mss:8948 pmtu:9216 rcvmss:1128 advmss:9164 cwnd:10 bytes_sent:19641644 bytes_acked:19641644 bytes_received:16077088 segs_out:94854 segs_in:186296 data_segs_out:94767 data_segs_in:94766 send 3.75Gbps lastsnd:5324 lastrcv:5324 lastack:5324 pacing_rate 7.49Gbps delivery_rate 1.59Gbps delivered:94768 app_limited busy:51972ms rcv_rtt:1 rcv_space:56360 rcv_ssthresh:56360 minrtt:0.046
tcp ESTAB 0 0 198.18.40.23:nfs 198.18.40.45:895
    cubic wscale:7,7 rto:204 rtt:0.25/0.053 ato:40 mss:1448 pmtu:9216 rcvmss:536 advmss:9164 cwnd:10 bytes_sent:1588948 bytes_acked:1588948 bytes_received:1893348 segs_out:23839 segs_in:36113 data_segs_out:12295 data_segs_in:12299 send 463Mbps lastsnd:34624 lastrcv:34624 lastack:34624 pacing_rate 924Mbps delivery_rate 176Mbps delivered:12296 app_limited busy:1700ms rcv_rtt:281950 rcv_space:64316 rcv_ssthresh:56360 minrtt:0.066
```

```
root@pine:/mnt/Itchy# wget https://update.freenas.org/scale/TrueNAS-SCALE-Bluefin-Nightlies/TrueNAS-SCALE-22.12-MASTER-20220629-092907.update
--2022-07-04 14:11:12--  https://update.freenas.org/scale/TrueNAS-SCALE-Bluefin-Nightlies/TrueNAS-SCALE-22.12-MASTER-20220629-092907.update
Resolving update.freenas.org (update.freenas.org)... 68.70.205.2, 68.70.205.4, 68.70.205.3, ...
Connecting to update.freenas.org (update.freenas.org)|68.70.205.2|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1033240576 (985M) [application/octet-stream]
Saving to: ‘TrueNAS-SCALE-22.12-MASTER-20220629-092907.update’

TrueNAS-SCALE-22.12-MASTER-20220629-0929 100%[================================================================================>] 985.38M 5.10MB/s
```

```
root@pine:/mnt/Itchy# wget https://update.freenas.org/scale/TrueNAS-SCALE-Bluefin-Nightlies/TrueNAS-SCALE-22.12-MASTER-20220629-092907.update
--2022-07-04 21:48:29--  https://update.freenas.org/scale/TrueNAS-SCALE-Bluefin-Nightlies/TrueNAS-SCALE-22.12-MASTER-20220629-092907.update
Resolving update.freenas.org (update.freenas.org)... 68.70.205.1, 68.70.205.2, 68.70.205.4, ...
Connecting to update.freenas.org (update.freenas.org)|68.70.205.1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1033240576 (985M) [application/octet-stream]
Saving to: ‘TrueNAS-SCALE-22.12-MASTER-20220629-092907.update’

TrueNAS-SCALE-22.12-MASTER-20220629-0929 100%[================================================================================>] 985.38M 61.6MB/s

2022-07-04 21:48:43 (69.0 MB/s) - ‘TrueNAS-SCALE-22.12-MASTER-20220629-092907.update’ saved [1033240576/1033240576]

root@pine:/mnt/Itchy#
```

WEIRD. Same file, same box: 5.10MB/s in the afternoon, 61.6MB/s that evening.

## Stuff to look at
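One thing worth looking at first is the `send` figure in the `ss` lines above: it looks derived rather than measured. All three samples are consistent with `cwnd × mss × 8 / rtt`, i.e. the rate the sender would reach if it pushed one full congestion window per round trip. Here is a minimal sketch under that assumption; the helper name `ss_send_mbps` is made up for illustration, and it assumes `awk` is available.

```shell
# Hypothetical helper: estimate the "send" rate that ss prints, in Mbit/s.
# Assumption: send = cwnd * mss * 8 / rtt, with rtt in milliseconds as ss shows it.
ss_send_mbps() {
    awk -v cwnd="$1" -v mss="$2" -v rtt_ms="$3" \
        'BEGIN { printf "%.0f\n", cwnd * mss * 8 / (rtt_ms / 1000) / 1000000 }'
}

ss_send_mbps 10 8948 0.124   # sunrpc socket, ss reported 5.77Gbps
ss_send_mbps 10 8948 0.191   # first nfs socket, ss reported 3.75Gbps
ss_send_mbps 10 1448 0.25    # second nfs socket, ss reported 463Mbps
```

With `cwnd:10` on every socket, the figure is driven entirely by `mss` and `rtt`, which is why the connection with `mss:1448` shows a far lower `send` value than the jumbo-frame ones.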
### OPNSense tunings

    netstat -idb
    netstat -m

### loader.conf settings to tweak

#### Stuff found on calomel.org

Disabling hyperthreading:

    machdep.hyperthreading_allowed="0"

Use the improved TCP stream algorithm:

    net.inet.tcp.soreceive_stream="1"

Add the H-TCP congestion control algorithm (plus CDG and CUBIC) to loader.conf:

    cc_htcp_load="YES"
    cc_cdg_load="YES"
    cc_cubic_load="YES"
    keyrate="250.34"   # keyboard delay to 250 ms and repeat to 34 cps

These settings combined improve the packet-handling pipeline: they create a stream processing pipeline per CPU core and pin the threads.

    net.isr.maxthreads=-1
    net.isr.bindthreads="1"
    hw.igb.num_queues="0"
    hw.igb.enable_msix=1

    # qlimit for igmp, arp, ether and ip6 queues only (netstat -Q) (default 256)
    #net.isr.defaultqlimit="2048"   # (default 256)

    # Size of the syncache hash table, must be a power of 2 (default 512)
    #net.inet.tcp.syncache.hashsize="1024"

    # Limit the number of entries permitted in each bucket of the hash table. (default 30)
    #net.inet.tcp.syncache.bucketlimit="100"

    #autoboot_delay="-1"   # (default 10) seconds

Info

IPC Socket Buffer: the maximum combined socket buffer size, in bytes, defined by SO_SNDBUF and SO_RCVBUF. `kern.ipc.maxsockbuf` is also used to define the window scaling factor (wscale in tcpdump) our server will advertise. The window scaling factor is the shift applied to the 16-bit TCP window field; the scaled window is the maximum volume of data allowed in transit before the receiving server is required to send an ACK packet (acknowledgment) to the sending server. FreeBSD’s default maxsockbuf value is two (2) megabytes, which corresponds to a window scaling factor (wscale) of six (6), allowing the remote sender to transmit up to 2^6 × 65,535 bytes = 4,194,240 bytes (4MB) in flight on the network before requiring an ACK packet from our server. In order to support the throughput of modern long fat networks (LFN) with variable latency, we suggest increasing the maximum socket buffer to at least 16MB if the system has enough RAM.

`netstat -m` displays the amount of network buffers used.
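The arithmetic in the Info note above can be checked from a shell. This is a rough sketch, assuming the kernel advertises the smallest shift (capped at 14) whose scaled 65,535-byte window covers `kern.ipc.maxsockbuf`. The function names are invented for illustration, and the 10 Gbit/s, 80 ms path is a made-up example rather than anything measured here.

```shell
# Hypothetical helpers; the selection rule is an assumption, not copied from the FreeBSD source.

# Smallest window-scale shift (0..14) whose scaled 65,535-byte window covers the buffer.
# The body runs in a subshell, so buf and ws do not leak into the caller.
wscale_for_buffer() (
    buf=$1
    ws=0
    while [ "$ws" -lt 14 ] && [ $(( 65535 << ws )) -lt "$buf" ]; do
        ws=$(( ws + 1 ))
    done
    echo "$ws"
)

# Largest window, in bytes, that a given shift can advertise.
max_window_bytes() {
    echo $(( 65535 << $1 ))
}

# Bandwidth-delay product in bytes for a link speed in Mbit/s and an RTT in ms.
# Mbit/s * ms = kilobits; kilobits * 1000 / 8 = bytes, hence the factor 125.
bdp_bytes() {
    echo $(( $1 * $2 * 125 ))
}

wscale_for_buffer 2097152     # 2 MB default buffer
max_window_bytes 6            # window allowed by a shift of 6
wscale_for_buffer 16777216    # 16 MB buffer
bdp_bytes 10000 80            # 10 Gbit/s path with 80 ms RTT
```

The first two calls reproduce the figures in the note (a shift of 6 and 4,194,240 bytes). The last call gives a bandwidth-delay product of 100,000,000 bytes, far more than a shift of 6 can cover, which is the argument for raising the buffer on long fat networks.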
Increase `kern.ipc.maxsockbuf` if the `netstat -m` counters for “mbufs denied” or “mbufs delayed” are greater than zero.

See https://en.wikipedia.org/wiki/TCP_window_scale_option and https://en.wikipedia.org/wiki/Bandwidth-delay_product

    kern.ipc.maxsockbuf=157286400

    # enable scaling
    net.inet.tcp.rfc1323=1            # (default 1)
    net.inet.tcp.rfc3042=1            # (default 1)
    net.inet.tcp.rfc3390=1            # (default 1)

    #net.inet.tcp.recvbuf_inc=65536   # (default 16384)
    net.inet.tcp.recvbuf_max=4194304  # (default 2097152)
    net.inet.tcp.recvspace=65536      # (default 65536)
    net.inet.tcp.sendbuf_inc=65536    # (default 8192)
    net.inet.tcp.sendbuf_max=4194304  # (default 2097152)
    net.inet.tcp.sendspace=65536      # (default 32768)

Tuning of the jumbo memory buffers is necessary as well: `kern.ipc.nmbjumbop`

```sysctl
net.inet.tcp.delayed_ack=1          # (default 1)
net.inet.tcp.delacktime=10          # (default 100)
net.inet.tcp.mssdflt=1460
#net.inet.tcp.mssdflt=1240          # goog's http/3 QUIC spec
#net.inet.tcp.mssdflt=8000
net.inet.tcp.minmss=536             # (default 216)
net.inet.tcp.abc_l_var=44           # (default 2)
net.inet.tcp.initcwnd_segments=44

# selective acknowledgement
net.inet.tcp.sack.enable=1
net.inet.tcp.rfc6675_pipe=1         # (default 0)

# use RFC8511 TCP Alternative Backoff with ECN
net.inet.tcp.cc.abe=1               # (default 0, disabled)
net.inet.tcp.syncache.rexmtlimit=0  # (default 3)

# disable syncookies
net.inet.tcp.syncookies=0

# Disable TCP Segmentation Offload
net.inet.tcp.tso=0

# Increasing the number of packets able to be processed in an interrupt is advisable.
# The default 0 indicates 16 frames (less than 24kB); see "man iflib" for more info.
dev.ixl.0.iflib.rx_budget=65535
dev.ixl.1.iflib.rx_budget=65535
dev.ixl.2.iflib.rx_budget=65535
dev.ixl.3.iflib.rx_budget=65535
```

```sysctl
kern.random.fortuna.minpoolsize=128
kern.random.harvest.mask=33119

# Hardening
kern.ipc.shm_use_phys=1
kern.msgbuf_show_timestamp=1
net.inet.ip.portrange.randomtime=5
net.inet.tcp.blackhole=2
net.inet.tcp.fast_finwait2_recycle=1    # recycle FIN/WAIT states quickly, helps against DoS, but may cause false RST (default 0)
net.inet.tcp.fastopen.client_enable=0   # disable TCP Fast Open client side, enforce three way TCP handshake (default 1, enabled)
net.inet.tcp.fastopen.server_enable=0   # disable TCP Fast Open server side, enforce three way TCP handshake (default 0)
net.inet.tcp.finwait2_timeout=1000      # TCP FIN_WAIT_2 timeout waiting for client FIN packet before state close (default 60000, 60 sec)
net.inet.tcp.icmp_may_rst=0             # icmp may not send RST to avoid spoofed icmp/udp floods (default 1)
net.inet.tcp.keepcnt=2                  # amount of tcp keep alive probe failures before socket is forced closed (default 8)
net.inet.tcp.keepidle=62000             # time before starting tcp keep alive probes on an idle TCP connection (default 7200000, 7200 secs)
net.inet.tcp.keepinit=5000              # tcp keep alive client reply timeout (default 75000, 75 secs)
net.inet.tcp.msl=2500                   # Maximum Segment Lifetime, time the connection spends in TIME_WAIT state (default 30000, 2*MSL = 60 sec)
net.inet.tcp.path_mtu_discovery=1       # disable for mtu=1500 as most paths drop ICMP type 3 packets, but keep enabled for mtu=9000 (default 1)
net.inet.udp.blackhole=1                # drop udp packets destined for closed sockets (default 0)
net.inet.udp.recvspace=1048576          # UDP receive space, HTTP/3 webserver, "netstat -sn -p udp" and increase if full socket buffers (default 42080)
#security.bsd.hardlink_check_gid=1      # unprivileged processes may not create hard links to files owned by other groups, DISABLE for mailman (default 0)
#security.bsd.hardlink_check_uid=1      # unprivileged processes may not create hard links to files owned by other users, DISABLE for mailman (default 0)
security.bsd.see_other_gids=0           # groups only see their own processes. root can see all (default 1)
security.bsd.see_other_uids=0           # users only see their own processes. root can see all (default 1)
security.bsd.stack_guard_page=1         # insert a stack guard page ahead of growable segments, stack smashing protection (SSP) (default 0)
security.bsd.unprivileged_proc_debug=0  # unprivileged processes may not use process debugging (default 1)
security.bsd.unprivileged_read_msgbuf=0 # unprivileged processes may not read the kernel message buffer (default 1)

net.inet.raw.maxdgram=128000
net.inet.raw.recvspace=128000
net.local.stream.sendspace=128000
net.local.stream.recvspace=128000
kern.ipc.soacceptqueue=2048

# increase max threads per process
kern.threads.max_threads_per_proc=9000

# tune tcp keepalives
net.inet.tcp.keepidle=10000             # (default 7200000); later than the 62000 above, so this value wins
net.inet.tcp.keepintvl=5000             # (default 75000)
net.inet.tcp.always_keepalive=1         # (default 1)

vfs.read_max=128
```

#### Stuff found elsewhere

Unrelated-ish. Found some GitHub issues, [4141][powerdns-issue-4141] and [5745][powerdns-issue-5745], on PowerDNS overloading the local net buffers: 131072

    net.local.stream.recvspace: 8192
    net.local.stream.sendspace: 8192
    net.local.dgram.recvspace: 65536
    net.local.dgram.maxdgram: 65535
    net.local.seqpacket.recvspace: 8192
    net.local.seqpacket.maxseqpacket: 8192

    hw.hn.vf_transparent: 1
    hw.hn.use_if_start: 0

`net.link.ifqmaxlen`: 50 -> 2048, per https://redmine.pfsense.org/issues/10311

## Things I tried

    pkg install devcpu-data-intel-20220510
    iovctl

### ixl driver

`hw.ixl.rx_itr`: The RX interrupt rate value, set to 62 (124 usec) by default.

`hw.ixl.tx_itr`: The TX interrupt rate value, set to 122 (244 usec) by default.
`hw.ixl.i2c_access_method`: Access method that the driver will use for I2C reads and writes via sysctl(8) or the verbose ifconfig(8) information display:

- 0 - best available method
- 1 - bit bang via I2CPARAMS register
- 2 - register read/write via I2CCMD register
- 3 - Use Admin Queue command (default best)

Using the Admin Queue is only supported on 710 devices with FW version 1.7 or newer. Set to 0 by default.

`hw.ixl.enable_tx_fc_filter`: Filter out packets with Ethertype 0x8808 from being sent out by non-adapter sources. This prevents (potentially untrusted) software or iavf(4) devices from sending out flow control packets and creating a DoS (Denial of Service) event. Enabled by default.

`hw.ixl.enable_head_writeback`: When the driver is finding the last TX descriptor processed by the hardware, use a value written to memory by the hardware instead of scanning the descriptor ring for completed descriptors. Enabled by default; disable to mimic the TX behavior found in ixgbe(4).

#### SYSCTL PROCEDURES

`dev.ixl.#.fc`: Sets the 802.3x flow control mode that the adapter will advertise on the link. The negotiated flow control setting can be viewed in the interface’s media field of ifconfig(8).

- 0 disables flow control
- 1 is RX pause
- 2 is TX pause
- 3 enables full

`dev.ixl.#.advertise_speed`: Set the speeds that the interface will advertise on the link. `dev.ixl.#.supported_speeds` contains the speeds that are allowed to be set.

`dev.ixl.#.current_speed`: Displays the current speed.

`dev.ixl.#.fw_version`: Displays the current firmware and NVM versions of the adapter.

`dev.ixl.#.debug.switch_vlans`: Set the Ethertype used by the hardware itself to handle internal services. Frames with this Ethertype will be dropped without notice. Defaults to 0x88a8, which is a well-known number for IEEE 802.1ad VLAN stacking. If you need 802.1ad support, set this number to any other Ethertype, e.g. 0xffff.
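The ITR defaults quoted in the tunables above (62 for 124 usec, 122 for 244 usec) imply that one ITR unit is 2 usec. Here is a small sketch under that assumption, with made-up helper names, converting an ITR value to its interval and to the rough ceiling on interrupts per second per queue that the interval implies.

```shell
# Hypothetical helpers. Assumption: one ITR unit = 2 usec, inferred from the
# defaults in the text (62 -> 124 usec, 122 -> 244 usec).
itr_to_usec() {
    echo $(( $1 * 2 ))
}

# Rough upper bound on interrupts per second for one queue at a given ITR value
# (integer division, so the result is rounded down).
itr_max_irq_per_sec() {
    echo $(( 1000000 / ($1 * 2) ))
}

itr_to_usec 62            # default RX ITR as an interval in usec
itr_max_irq_per_sec 62    # ceiling for one RX queue at the default
itr_max_irq_per_sec 122   # ceiling for one TX queue at the default
```

At the defaults this works out to roughly 8,000 RX and 4,000 TX interrupts per second per queue, so the total scales with the number of queues in use.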
#### INTERRUPT STORMS

It is important to note that 40G operation can generate high numbers of interrupts, often incorrectly being interpreted as a storm condition in the kernel. It is suggested that this be resolved by setting `hw.intr_storm_threshold: 0`.

#### IOVCTL OPTIONS

The driver supports additional optional parameters for created VFs (Virtual Functions) when using iovctl(8):

- `mac-addr` (unicast-mac): Set the Ethernet MAC address that the VF will use. If unspecified, the VF will use a randomly generated MAC address.
- `mac-anti-spoof` (bool): Prevent the VF from sending Ethernet frames with a source address that does not match its own.
- `allow-set-mac` (bool): Allow the VF to set its own Ethernet MAC address.
- `allow-promisc` (bool): Allow the VF to inspect all of the traffic sent to the port.
- `num-queues` (uint16_t): Specify the number of queues the VF will have. By default, this is set to the number of MSI-X vectors supported by the VF minus one.

An up-to-date list of parameters and their defaults can be found by using iovctl(8) with the -S option.
## Tests

```
Result 1
Interface   lagg0_vlan2
Start Time  2022-07-04 20:12:10 -0500
Port        27449

General
Time        Tue, 05 Jul 2022 01:14:37 UTC
Duration    30
Block Size  131072

Connection
Local Host   192.0.2.1
Local Port   27449
Remote Host  192.0.2.23
Remote Port  42218

CPU Usage
Host Total     78.54
Host User      14.81
Host System    63.75
Remote Total   0.00
Remote User    0.00
Remote System  0.00

Performance Data
Start            0          0
End              30.000273  30.000273
Seconds          30.000273  30.000273
Bytes            0          36067520404
Bits Per Second  0          9617917918.01361

PINETEST: iperf 3.9
PINETEST:
PINETEST: Linux pine 5.15.45+truenas #1 SMP Fri Jun 17 19:32:18 UTC 2022 x86_64
Control connection MSS 9044
PINETEST: Time: Tue, 05 Jul 2022 01:14:37 GMT
PINETEST: Connecting to host 192.0.2.1, port 27449
PINETEST: Cookie: 6yl23mxmlzitkz3wwertio5zl2nqcw6g553a
PINETEST: TCP MSS: 9044 (default)
PINETEST: [ 5] local 192.0.2.23 port 42218 connected to 192.0.2.1 port 27449
PINETEST: Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting 0 seconds, 30 second test, tos 0
PINETEST: [ ID] Interval Transfer Bitrate Retr Cwnd
PINETEST: [ 5] 0.00-1.00 sec 1.15 GBytes 9.88 Gbits/sec 0 1.99 MBytes
PINETEST: [ 5] 1.00-2.00 sec 1.15 GBytes 9.83 Gbits/sec 0 2.09 MBytes
PINETEST: [ 5] 2.00-3.00 sec 1.13 GBytes 9.72 Gbits/sec 0 2.09 MBytes
PINETEST: [ 5] 3.00-4.00 sec 1.14 GBytes 9.75 Gbits/sec 0 2.09 MBytes
PINETEST: [ 5] 4.00-5.00 sec 1.10 GBytes 9.42 Gbits/sec 0 2.09 MBytes
PINETEST: [ 5] 5.00-6.00 sec 1.02 GBytes 8.76 Gbits/sec 0 2.09 MBytes
PINETEST: [ 5] 6.00-7.00 sec 1.08 GBytes 9.31 Gbits/sec 0 2.09 MBytes
PINETEST: [ 5] 7.00-8.00 sec 1.09 GBytes 9.32 Gbits/sec 0 2.09 MBytes
PINETEST: [ 5] 8.00-9.00 sec 1.12 GBytes 9.60 Gbits/sec 0 2.09 MBytes
PINETEST: [ 5] 9.00-10.00 sec 1.15 GBytes 9.88 Gbits/sec 0 2.09 MBytes
PINETEST: [ 5] 10.00-11.00 sec 1.15 GBytes 9.89 Gbits/sec 0 2.09 MBytes
PINETEST: [ 5] 11.00-12.00 sec 1.15 GBytes 9.86 Gbits/sec 0 2.09 MBytes
PINETEST: [ 5] 12.00-13.00 sec 1.15 GBytes 9.83 Gbits/sec 0 2.09 MBytes
PINETEST: [ 5] 13.00-14.00 sec 1.15 GBytes 9.86 Gbits/sec 0 2.09 MBytes
PINETEST: [ 5] 14.00-15.00 sec 1.11 GBytes 9.50 Gbits/sec 0 2.09 MBytes
PINETEST: [ 5] 15.00-16.00 sec 1.10 GBytes 9.43 Gbits/sec 0 2.09 MBytes
PINETEST: [ 5] 16.00-17.00 sec 1.13 GBytes 9.70 Gbits/sec 0 2.09 MBytes
PINETEST: [ 5] 17.00-18.00 sec 1.14 GBytes 9.83 Gbits/sec 0 2.09 MBytes
PINETEST: [ 5] 18.00-19.00 sec 1.15 GBytes 9.87 Gbits/sec 0 2.09 MBytes
PINETEST: [ 5] 19.00-20.00 sec 1.15 GBytes 9.86 Gbits/sec 0 2.09 MBytes
PINETEST: [ 5] 20.00-21.00 sec 1.08 GBytes 9.26 Gbits/sec 0 2.09 MBytes
PINETEST: [ 5] 21.00-22.00 sec 1.13 GBytes 9.70 Gbits/sec 0 2.09 MBytes
PINETEST: [ 5] 22.00-23.00 sec 1.13 GBytes 9.69 Gbits/sec 0 2.09 MBytes
PINETEST: [ 5] 23.00-24.00 sec 1.05 GBytes 9.03 Gbits/sec 0 2.09 MBytes
PINETEST: [ 5] 24.00-25.00 sec 1.06 GBytes 9.10 Gbits/sec 0 2.09 MBytes
PINETEST: [ 5] 25.00-26.00 sec 1.15 GBytes 9.86 Gbits/sec 0 2.09 MBytes
PINETEST: [ 5] 26.00-27.00 sec 1.15 GBytes 9.86 Gbits/sec 0 2.09 MBytes
PINETEST: [ 5] 27.00-28.00 sec 1.14 GBytes 9.80 Gbits/sec 0 2.09 MBytes
PINETEST: [ 5] 28.00-29.00 sec 1.09 GBytes 9.39 Gbits/sec 0 2.09 MBytes
PINETEST: [ 5] 29.00-30.00 sec 1.14 GBytes 9.78 Gbits/sec 0 2.09 MBytes
PINETEST: - - - - - - - - - - - - - - - - - - - - - - - - -
PINETEST: Test Complete. Summary Results:
PINETEST: [ ID] Interval Transfer Bitrate Retr
PINETEST: [ 5] 0.00-30.00 sec 33.6 GBytes 9.62 Gbits/sec 0 sender
PINETEST: [ 5] 0.00-30.00 sec 33.6 GBytes 9.62 Gbits/sec receiver
PINETEST: CPU Utilization: local/sender 52.7% (1.3%u/51.4%s), remote/receiver 78.5% (14.8%u/63.7%s)
PINETEST: snd_tcp_congestion cubic
PINETEST: rcv_tcp_congestion newreno
PINETEST:
PINETEST: iperf Done.
```
[xkcd-nerdsnipe-img]: <https://imgs.xkcd.com/comics/nerd_sniping.png>
[xkcd-nerdsnipe-comic]: <https://xkcd.com/356>
[calomel.org]: <https://calomel.org>
[cloudflare-fragpost]: <https://blog.cloudflare.com/ip-fragmentation-is-broken/>
[icmptest]: <http://icmpcheck.popcount.org>
[powerdns-issue-5745]: <https://github.com/PowerDNS/pdns/issues/5745>
[powerdns-issue-4141]: <https://github.com/PowerDNS/pdns/issues/4141>
[pfsense-redmine-vf-settings]: <https://redmine.pfsense.org/issues/9647>
[pfsense-forums-vf-settings]: <https://forum.netgate.com/topic/169884/after-upgrade-inter-v-lan-communication-is-very-slow-on-hyper-v/50?lang=en-US>