This article walks through the basic operating principles of Intel HDSLB and how to deploy and configure it, in the hope of helping readers get the HDSLB-DPVS project up and running.
In the previous article, 《 Intel HDSLB 高性能四層負載均衡器 — 快速入門和應用場景 》, we focused on the positioning of HDSLB (High Density Scalable Load Balancer) as a new generation of high-performance Layer-4 load balancer, analyzed its strengths in cloud-computing and edge-computing scenarios, and interpreted its performance test data.
Going a step further, this article concentrates on HDSLB's basic operating principles and its deployment and configuration, with an emphasis on hands-on practice. To make it easy for a broad range of developers to study HDSLB, the open-source HDSLB-DPVS edition is used throughout.
As the name suggests, HDSLB-DPVS is a secondary development based on DPVS. DPVS, also known as DPDK-LVS, is a Layer-4 load balancer that borrows the design principles of the in-kernel LVS load balancer and re-implements them on top of the DPDK user-space data-plane acceleration framework. The HDSLB-DPVS technology stack is therefore built from four layers: LVS, DPDK, DPVS, and HDSLB-DPVS itself.
To understand the basic implementation principles of HDSLB-DPVS clearly, we need to start from the beginning.
LVS (Linux Virtual Server) is an open-source Layer-4 load balancer project born in 1998. Its goal is to use load-balancing and server-cluster techniques to build a Virtual Server with good Scalability, Reliability, and Manageability.
Looking at it today, the performance of LVS's kernel-based data plane is no longer adequate, but at the level of logical architecture its core terminology is still in use, including terms such as VS/VIP (Virtual Server / Virtual IP), DS (Director Server), and RS (Real Server).
For more detail on LVS, recommended reading: 《 LVS & Keepalived 實現(xiàn) L4 高可用負載均衡器 》
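As a quick refresher on how those terms fit together, here is a minimal kernel-LVS (ipvsadm) sketch; it simply reuses the VIP and RS addresses that appear later in this article and is for illustration only:
# On the Director Server (DS): create a Virtual Server on the VIP
$ ipvsadm -A -t 10.0.0.100:80 -s rr
# Attach two Real Servers (RS) in DR mode (-g); use -m for NAT mode instead
$ ipvsadm -a -t 10.0.0.100:80 -r 192.168.100.2:80 -g
$ ipvsadm -a -t 10.0.0.100:80 -r 192.168.100.3:80 -g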
After the IEEE 802.3 committee published the 802.3ba standard for 40GbE and 100GbE Ethernet in 2010, data centers formally entered the 100G era. From that point on, the network-processing performance of the Linux kernel protocol stack has been under constant challenge.
At 100GbE line rate, a NIC has to cope with roughly 150 million minimum-size packets per second, which leaves a per-packet processing budget of only a few nanoseconds.
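To put a number on that budget: a minimum-size Ethernet frame carries 64 bytes and occupies another 20 bytes of preamble and inter-frame gap on the wire, so for a single 100G port:
(64 + 20) bytes × 8 = 672 bits on the wire per frame
100 Gbit/s ÷ 672 bits ≈ 148.8 Mpps, i.e. ≈ 6.7 ns per packet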
Clearly, the kernel-based data plane had reached an inflection point. DPDK was born for exactly this reason, and it achieves 100G-class forwarding through user-space acceleration techniques such as poll-mode drivers, hugepages, and CPU core binding.
For more detail on DPDK, recommended reading: 《 DPDK — 數據加速方案的核心思想 》
In summary, because the LVS data plane is a Linux kernel module (ipvs) whose performance cannot meet modern requirements, the Chinese company iQIYI developed DPVS on top of DPDK. It is worth mentioning that, since DPVS is open-sourced and maintained by a Chinese company, its community is especially friendly to Chinese-speaking developers.
Beyond the performance optimizations, DPVS also provides a richer feature set, including Full-NAT and SNAT forwarding modes, IPv6 support, and the UOA/TOA mechanisms for passing the real client address to back-end servers.
In terms of software architecture, DPVS keeps the separation between control plane and data plane as well as the Keepalived-based Master-Backup high-availability architecture.
Both HDSLB-DPVS and DPVS position themselves as high-performance load balancers, so what is the essential difference between the two? The answer: even stronger performance.
We usually use RBP (Ratio of Bandwidth and Performance growth rate) to express how fast network bandwidth grows relative to CPU performance, i.e. RBP = BW growth rate / CPU performance growth rate.
As the chart below shows, before 2010 network bandwidth grew at roughly 30% per year, rising to 35% by 2015 and to about 45% in recent years. Over the same periods, CPU performance growth fell from about 23% a decade ago to 12%, and recently to just 3.5%. Accordingly, RBP climbed from around 1 (when I/O pressure was not yet visible) to around 3, and in recent years has exceeded 10.
Clearly, the CPU can hardly keep up with the growth of network bandwidth on its own, and the purely software-based, CPU-centric acceleration model of DPDK is being challenged.
Back to the essential difference between DPVS and HDSLB-DPVS. At the design level, DPVS aims to deliver software-level acceleration on top of the DPDK framework, whereas HDSLB-DPVS goes a step further and pushes that acceleration onto a hardware platform where CPU and NIC cooperate, achieving its two headline goals of "high density" (more concurrent sessions and throughput per core) and "scalability" (performance that keeps growing as cores and NIC queues are added).
In practice, this has been demonstrated on the newest Intel Xeon CPUs (e.g. 3rd and 4th generation) paired with E810 100G NICs.
We already analyzed the performance data showing how HDSLB-DPVS surpasses its predecessors in 《 Intel HDSLB 高性能四層負載均衡器 — 快速入門和應用場景 》, so it is not repeated here.
Now for the hands-on part, which focuses on building, deploying, and configuring HDSLB-DPVS. To lower the barrier to entry, this article mainly uses a low-spec development-machine setup for deployment and debugging.
| | Recommended physical test machine | Low-spec virtual dev machine |
|---|---|---|
| CPU architecture | 4th-generation Intel Xeon | An Intel CPU that supports the AVX512 instruction sets, e.g. Skylake or later |
| CPU resources | 2 NUMA nodes, Hyper-Threading disabled | 1 NUMA node, 4 cores |
| Memory | 128 GB | 16 GB |
| NIC | Intel E810 100G | virtio driver with multi-queue support |
CPU information used in this article:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 57 bits virtual
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 106
Model name: Intel(R) Xeon(R) Platinum 8350C CPU @ 2.60GHz
Stepping: 6
CPU MHz: 2599.994
BogoMIPS: 5199.98
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 96 KiB
L1i cache: 64 KiB
L2 cache: 2.5 MiB
L3 cache: 48 MiB
NUMA node0 CPU(s): 0-3
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Vulnerable
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid arch_capabilities
Note that the Ubuntu /boot partition should be larger than 2 GB, to avoid kernel-upgrade failures. Reference: https://askubuntu.com/questions/1207958/error-24-write-error-cannot-write-compressed-block
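A quick way to confirm the partition is large enough before upgrading the kernel:
$ df -h /boot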
OS information used in this article:
# Update the system
$ sudo apt-get update -y && sudo apt-get upgrade -y
# Dev
$ sudo apt-get install git vim wget patch unzip -y
# popt
$ sudo apt-get install libpopt-dev -y
# NUMA
$ sudo apt-get install libnuma-dev -y
$ sudo apt-get install numactl -y
# Pcap
$ sudo apt-get install libpcap-dev -y
# SSL
$ sudo apt-get install libssl-dev -y
# Kernel 5.4.0-136
$ uname -r
5.4.0-136-generic
$ ll /boot/vmlinuz*
lrwxrwxrwx 1 root root 25 Dec 27 2022 /boot/vmlinuz -> vmlinuz-5.4.0-136-generic
-rw------- 1 root root 13660416 Aug 10 2022 /boot/vmlinuz-5.4.0-125-generic
-rw------- 1 root root 13668608 Nov 24 2022 /boot/vmlinuz-5.4.0-136-generic
-rw------- 1 root root 11657976 Apr 21 2020 /boot/vmlinuz-5.4.0-26-generic
lrwxrwxrwx 1 root root 25 Dec 27 2022 /boot/vmlinuz.old -> vmlinuz-5.4.0-125-generic
$ dpkg -l | egrep "linux-(signed|modules|image|headers)" | grep $(uname -r)
ii linux-headers-5.4.0-136-generic 5.4.0-136.153 amd64 Linux kernel headers for version 5.4.0 on 64 bit x86 SMP
ii linux-image-5.4.0-136-generic 5.4.0-136.153 amd64 Signed kernel image generic
ii linux-modules-5.4.0-136-generic 5.4.0-136.153 amd64 Linux kernel extra modules for version 5.4.0 on 64 bit x86 SMP
ii linux-modules-extra-5.4.0-136-generic 5.4.0-136.153 amd64 Linux kernel extra modules for version 5.4.0 on 64 bit x86 SMP
# GCC 9.4.0
$ gcc --version
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
# File descriptor limit
$ ulimit -n 655350
$ echo "ulimit -n 655350" >> /etc/rc.local
$ chmod a+x /etc/rc.local
For details on installing and deploying DPDK, recommended reading: 《 DPDK — 安裝部署 》.
$ cd /root/
$ git clone https://github.com/intel/high-density-scalable-load-balancer hdslb
$ wget http://fast.dpdk.org/rel/dpdk-20.08.tar.xz
$ tar vxf dpdk-20.08.tar.xz
# Apply the HDSLB patches
$ cp hdslb/patch/dpdk-20.08/*.patch dpdk-20.08/
$ cd dpdk-20.08/
$ patch -p 1 < 0002-support-large_memory.patch
$ patch -p 1 < 0003-net-i40e-ice-support-rx-markid-ofb.patch
# Build
$ make config T=x86_64-native-linuxapp-gcc MAKE_PAUSE=n
$ make MAKE_PAUSE=n -j 4
The HDSLB-DPVS build depends on a number of CPU hardware acceleration instruction sets, such as AVX2 and AVX512. For the build to succeed, two things are required: the CPU must actually support those instruction sets, and the compiler must be new enough to emit them (see Issue 10 in the troubleshooting section below).
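One optional way to sanity-check both conditions before building:
# The CPU should report AVX512 feature flags (avx512f, avx512bw, ...)
$ grep -o 'avx512[a-z0-9_]*' /proc/cpuinfo | sort -u
# The compiler should define AVX512 macros when targeting the local CPU
$ echo | gcc -march=native -dM -E - | grep -c __AVX512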
$ cd dpdk-20.08/
$ export RTE_SDK=$PWD
$ cd hdslb/
$ chmod +x tools/keepalived/configure
# Build and install
$ make -j 4
$ make install
In a physical test environment, give HDSLB as much hugepage memory as you can: the LB connection pool allocates a large amount of memory, and its size directly determines the achievable performance specs (see Issue 9 in the troubleshooting section for concrete numbers).
$ mkdir /mnt/huge_1GB
$ mount -t hugetlbfs nodev /mnt/huge_1GB
$ vim /etc/fstab
nodev /mnt/huge_1GB hugetlbfs pagesize=1GB 0 0
$ # for NUMA machine
$ echo 15 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
$ vim /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="${GRUB_CMDLINE_LINUX_DEFAULT} default_hugepagesz=1G hugepagesz=1G hugepages=15"
$ sudo update-grub
$ init 6 # reboot so that the new hugepage boot parameters take effect
$ cat /proc/meminfo | grep Huge
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
FileHugePages: 0 kB
HugePages_Total: 13
HugePages_Free: 13
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 1048576 kB
Hugetlb: 13631488 kB
$ free -h
total used free shared buff/cache available
Mem: 15Gi 13Gi 2.0Gi 2.0Mi 408Mi 2.2Gi
Swap: 0B 0B 0B
$ modprobe vfio-pci
$ modprobe vfio enable_unsafe_noiommu_mode=1 # https://stackoverflow.com/questions/75840973/dpdk20-11-3-cannot-bind-device-to-vfio-pci
$ echo 1 > /sys/module/vfio/parameters/enable_unsafe_noiommu_mode
$ cd dpdk-20.08/
$ export RTE_SDK=$PWD
$ insmod ${RTE_SDK}/build/kmod/rte_kni.ko
$ ${RTE_SDK}/usertools/dpdk-devbind.py --status-dev net
Network devices using kernel driver
===================================
0000:01:00.0 'Virtio network device 1000' if=eth0 drv=virtio-pci unused=vfio-pci *Active*
0000:03:00.0 'Virtio network device 1000' if=eth1 drv=virtio-pci unused=vfio-pci
0000:04:00.0 'Virtio network device 1000' if=eth2 drv=virtio-pci unused=vfio-pci
$ ifconfig eth1 down # 0000:03:00.0
$ ifconfig eth2 down # 0000:04:00.0
$ ${RTE_SDK}/usertools/dpdk-devbind.py -b vfio-pci 0000:03:00.0 0000:04:00.0
$ ${RTE_SDK}/usertools/dpdk-devbind.py --status-dev net
Network devices using DPDK-compatible driver
============================================
0000:03:00.0 'Virtio network device 1000' drv=vfio-pci unused=
0000:04:00.0 'Virtio network device 1000' drv=vfio-pci unused=
Network devices using kernel driver
===================================
0000:01:00.0 'Virtio network device 1000' if=eth0 drv=virtio-pci unused=vfio-pci *Active*
$ cp conf/hdslb.conf.sample /etc/hdslb.conf
# Walk through the configuration
$ cat /etc/hdslb.conf
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! This is hdslb default configuration file.
!
! The attribute "<init>" denotes the configuration item at initialization stage. Item of
! this type is configured oneshoot and not reloadable. If invalid value configured in the
! file, hdslb would use its default value.
!
! Note that hdslb configuration file supports the following comment type:
! * line comment: using '#' or '!'
! * inline range comment: using '<' and '>', put comment in between
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! global config
global_defs {
log_level DEBUG # handy for debugging
! log_file /var/log/hdslb.log
! log_async_mode on
}
! netif config
netif_defs {
pktpool_size 1048575
pktpool_cache 256
# LAN interface configuration
device dpdk0 {
rx {
queue_number 3
descriptor_number 1024
! rss all
}
tx {
queue_number 3
descriptor_number 1024
}
fdir {
mode perfect
pballoc 64k
status matched
}
! promisc_mode
kni_name dpdk0.kni
}
# WAN interface configuration
device dpdk1 {
rx {
queue_number 3
descriptor_number 1024
! rss all
}
tx {
queue_number 3
descriptor_number 1024
}
fdir {
mode perfect
pballoc 64k
status matched
}
! promisc_mode
kni_name dpdk1.kni
}
! bonding bond0 {
! mode 0
! slave dpdk0
! slave dpdk1
! primary dpdk0
! kni_name bond0.kni
!}
}
! worker config (lcores)
worker_defs {
# control plane CPU
worker cpu0 {
type master
cpu_id 0
}
# data plane CPU
# the same rx/tx queue index on dpdk0 and dpdk1 is handled by the same CPU
worker cpu1 {
type slave
cpu_id 1
port dpdk0 {
rx_queue_ids 0
tx_queue_ids 0
! isol_rx_cpu_ids 9
! isol_rxq_ring_sz 1048576
}
port dpdk1 {
rx_queue_ids 0
tx_queue_ids 0
! isol_rx_cpu_ids 9
! isol_rxq_ring_sz 1048576
}
}
worker cpu2 {
type slave
cpu_id 2
port dpdk0 {
rx_queue_ids 1
tx_queue_ids 1
! isol_rx_cpu_ids 10
! isol_rxq_ring_sz 1048576
}
port dpdk1 {
rx_queue_ids 1
tx_queue_ids 1
! isol_rx_cpu_ids 10
! isol_rxq_ring_sz 1048576
}
}
worker cpu3 {
type slave
cpu_id 3
port dpdk0 {
rx_queue_ids 2
tx_queue_ids 2
! isol_rx_cpu_ids 11
! isol_rxq_ring_sz 1048576
}
port dpdk1 {
rx_queue_ids 2
tx_queue_ids 2
! isol_rx_cpu_ids 11
! isol_rxq_ring_sz 1048576
}
}
}
! timer config
timer_defs {
# cpu job loops to schedule dpdk timer management
schedule_interval 500
}
! hdslb neighbor config
neigh_defs {
unres_queue_length 128
timeout 60
}
! hdslb ipv4 config
ipv4_defs {
forwarding off
default_ttl 64
fragment {
bucket_number 4096
bucket_entries 16
max_entries 4096
ttl 1
}
}
! hdslb ipv6 config
ipv6_defs {
disable off
forwarding off
route6 {
method hlist
recycle_time 10
}
}
! control plane config
ctrl_defs {
lcore_msg {
ring_size 4096
sync_msg_timeout_us 30000000
priority_level low
}
ipc_msg {
unix_domain /var/run/hdslb_ctrl
}
}
! ipvs config
ipvs_defs {
conn {
conn_pool_size 2097152
conn_pool_cache 256
conn_init_timeout 30
! expire_quiescent_template
! fast_xmit_close
! redirect off
}
udp {
! defence_udp_drop
uoa_mode opp
uoa_max_trail 3
timeout {
normal 300
last 3
}
}
tcp {
! defence_tcp_drop
timeout {
none 2
established 90
syn_sent 3
syn_recv 30
fin_wait 7
time_wait 7
close 3
close_wait 7
last_ack 7
listen 120
synack 30
last 2
}
synproxy {
synack_options {
mss 1452
ttl 63
sack
! wscale
! timestamp
}
! defer_rs_syn
rs_syn_max_retry 3
ack_storm_thresh 10
max_ack_saved 3
conn_reuse_state {
close
time_wait
! fin_wait
! close_wait
! last_ack
}
}
}
}
! sa_pool config
sa_pool {
pool_hash_size 16
}
$ cd hdslb/
$ ./bin/hdslb
current thread affinity is set to F
EAL: Detected 4 lcore(s)
EAL: Detected 1 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Selected IOVA mode 'PA'
EAL: Probing VFIO support...
EAL: VFIO support initialized
EAL: Invalid NUMA socket, default to 0
EAL: Probe PCI driver: net_virtio (1af4:1000) device: 0000:01:00.0 (socket 0)
EAL: Invalid NUMA socket, default to 0
EAL: Probe PCI driver: net_virtio (1af4:1000) device: 0000:03:00.0 (socket 0)
EAL: using IOMMU type 8 (No-IOMMU)
EAL: Ignore mapping IO port bar(0)
EAL: Invalid NUMA socket, default to 0
EAL: Probe PCI driver: net_virtio (1af4:1000) device: 0000:04:00.0 (socket 0)
EAL: Ignore mapping IO port bar(0)
EAL: No legacy callbacks, legacy socket not created
DPVS: HDSLB version: , build on 2024.05.24.14:37:02
CFG_FILE: Opening configuration file '/etc/hdslb.conf'.
CFG_FILE: log_level = WARNING
NETIF: dpdk0:rx_queue_number = 3
NETIF: dpdk1:rx_queue_number = 3
NETIF: worker cpu1:dpdk0 rx_queue_id += 0
NETIF: worker cpu1:dpdk0 tx_queue_id += 0
NETIF: worker cpu1:dpdk1 rx_queue_id += 0
NETIF: worker cpu1:dpdk1 tx_queue_id += 0
NETIF: worker cpu2:dpdk0 rx_queue_id += 1
NETIF: worker cpu2:dpdk0 tx_queue_id += 1
NETIF: worker cpu2:dpdk1 rx_queue_id += 1
NETIF: worker cpu2:dpdk1 tx_queue_id += 1
NETIF: worker cpu3:dpdk0 rx_queue_id += 2
NETIF: worker cpu3:dpdk0 tx_queue_id += 2
NETIF: worker cpu3:dpdk1 rx_queue_id += 2
NETIF: worker cpu3:dpdk1 tx_queue_id += 2
Kni: kni_add_dev: fail to set mac FA:27:00:00:0A:02 for dpdk0.kni: Timer expired
Kni: kni_add_dev: fail to set mac FA:27:00:00:0B:F6 for dpdk1.kni: Timer expired
Once the HDSLB-DPVS process is up, you can see the two DPDK ports and their two corresponding KNI interfaces. The DPDK ports carry the LB data-plane forwarding, while the KNI interfaces are used for the Keepalived HA deployment mode.
$ cd hdslb/bin/
$ ./dpip link show
1: dpdk0: socket 0 mtu 1500 rx-queue 3 tx-queue 3
UP 10000 Mbps half-duplex auto-nego
addr FA:27:00:00:0A:02 OF_TX_IP_CSUM
2: dpdk1: socket 0 mtu 1500 rx-queue 3 tx-queue 3
UP 10000 Mbps half-duplex auto-nego
addr FA:27:00:00:0B:F6 OF_TX_IP_CSUM
$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether fa:27:00:00:00:c0 brd ff:ff:ff:ff:ff:ff
inet 192.168.0.4/25 brd 192.168.0.127 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::f827:ff:fe00:c0/64 scope link
valid_lft forever preferred_lft forever
71: dpdk0.kni: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether fa:27:00:00:0a:02 brd ff:ff:ff:ff:ff:ff
72: dpdk1.kni: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether fa:27:00:00:0b:f6 brd ff:ff:ff:ff:ff:ff
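The two KNI interfaces come up in DOWN state. If you plan to run Keepalived over them for the Master-Backup deployment, they can be managed from the kernel side like ordinary interfaces; a minimal sketch (exact addressing depends on your HA design):
$ ip link set dpdk0.kni up
$ ip link set dpdk1.kni up
$ ip -br link | grep kni # check their state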
$ cd hdslb/bin
# add VIP to WAN interface
$ ./dpip addr add 10.0.0.100/32 dev dpdk1
# route for WAN/LAN access
$ ./dpip route add 10.0.0.0/16 dev dpdk1
$ ./dpip route add 192.168.100.0/24 dev dpdk0
# add routes for other network or default route if needed.
$ ./dpip route show
inet 10.0.0.100/32 via 0.0.0.0 src 0.0.0.0 dev dpdk1 mtu 1500 tos 0 scope host metric 0 proto auto
inet 192.168.100.0/24 via 0.0.0.0 src 0.0.0.0 dev dpdk0 mtu 1500 tos 0 scope link metric 0 proto auto
inet 10.0.0.0/16 via 0.0.0.0 src 0.0.0.0 dev dpdk1 mtu 1500 tos 0 scope link metric 0 proto auto
# add service to forwarding, scheduling mode is RR.
$ ./ipvsadm -A -t 10.0.0.100:80 -s rr
# add two RS for service, forwarding mode is FNAT (-b)
$ ./ipvsadm -a -t 10.0.0.100:80 -r 192.168.100.2 -b
$ ./ipvsadm -a -t 10.0.0.100:80 -r 192.168.100.3 -b
# add at least one Local-IP (LIP) for FNAT on LAN interface
$ ./ipvsadm --add-laddr -z 192.168.100.200 -t 10.0.0.100:80 -F dpdk0
# Check
$ ./ipvsadm -Ln
IP Virtual Server version 0.0.0 (size=0)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP 10.0.0.100:80 rr
-> 192.168.100.2:80 FullNat 1 0 0
-> 192.168.100.3:80 FullNat 1 0 0
$ python -m SimpleHTTPServer 80 # run on each RS (192.168.100.2 / 192.168.100.3)
$ curl 10.0.0.100 # run from a client that can reach the VIP
Issue 1: hdslb/tools/keepalived/configure is not executable.
make[1]: Leaving directory '/root/hdslb/src'
make[1]: Entering directory '/root/hdslb/tools'
if [ ! -f keepalived/Makefile ]; then \
cd keepalived && \
./configure && \
cd -; \
fi
/bin/sh: 3: ./configure: Permission denied
make[1]: *** [Makefile:29: keepalived_conf] Error 126
make[1]: Leaving directory '/root/hdslb/tools'
make: *** [Makefile:33: all] Error 1
# Fix
$ chmod +x /root/hdslb/tools/keepalived/configure
Issue 2: the configuration file is missing
Cause: ports in DPDK RTE (2) != ports in dpvs.conf(0)
# Fix
$ cp conf/hdslb.conf.sample /etc/hdslb.conf
Issue 3: the dev machine's 2 MB hugepage size is too small
Cause: Cannot init mbuf pool on socket 0
# Fix: configure the hugepage size as 1G
# ref:https://stackoverflow.com/questions/51630926/cannot-create-mbuf-pool-with-dpdk
Issue 4: the rte_kni module is missing
Cause: add KNI port fail, exiting...
# Fix
$ insmod ${RTE_SDK}/build/kmod/rte_kni.ko
Issue 5: not enough hugepage memory on the dev machine
Kni: kni_add_dev: fail to set mac FA:27:00:00:07:AA for dpdk0.kni: Timer expired
Kni: kni_add_dev: fail to set mac FA:27:00:00:00:E1 for dpdk1.kni: Timer expired
IPSET: ipset_init: lcore 3: nothing to do.
IPVS: dp_vs_conn_init: lcore 3: nothing to do.
IPVS: fail to init synproxy: no memory
Segmentation fault (core dumped)
# Fix: increase hugepage memory to 15G.
Issue 6: the dev machine's NIC does not support the hardware offloads that HDSLB-DPVS requests.
Kni: kni_add_dev: fail to set mac FA:27:00:00:0A:02 for dpdk0.kni: Timer expired
Kni: kni_add_dev: fail to set mac FA:27:00:00:0B:F6 for dpdk1.kni: Timer expired
Ethdev port_id=0 requested Rx offloads 0x62f doesn't match Rx offloads capabilities 0xa1d in rte_eth_dev_configure()
NETIF: netif_port_start: fail to config dpdk0
EAL: Error - exiting with code: 1
Cause: Start dpdk0 failed, skipping ...
# Fix: modify the netif module so that the unsupported offloads are not enabled.
static struct rte_eth_conf default_port_conf = {
.rxmode = {
......
.offloads = 0,
//.offloads = DEV_RX_OFFLOAD_CHECKSUM | DEV_RX_OFFLOAD_VLAN,
},
......
.txmode = {
......
.offloads = 0,
//.offloads = DEV_TX_OFFLOAD_IPV4_CKSUM | DEV_TX_OFFLOAD_UDP_CKSUM | DEV_TX_OFFLOAD_TCP_CKSUM | DEV_TX_OFFLOAD_MBUF_FAST_FREE,
},
NOTE: according to the DPDK documentation, each bit in the offloads mask corresponds to a specific offload capability (the DEV_RX_OFFLOAD_* / DEV_TX_OFFLOAD_* flags).
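The two masks from the error message can also be compared directly to see which requested Rx offload bits the virtio NIC lacks; a quick shell calculation, for example:
# requested (0x62f) minus supported (0xa1d) -> the offending offload bits
$ printf '0x%x\n' $(( 0x62f & ~0xa1d ))
0x422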
Issue 7: the dev machine's NIC does not support RSS multi-queue. "valid value: 0x0" means the NIC supports no RSS hash function at all.
Kni: kni_add_dev: fail to set mac FA:27:00:00:0A:02 for dpdk0.kni: Timer expired
Kni: kni_add_dev: fail to set mac FA:27:00:00:0B:F6 for dpdk1.kni: Timer expired
Ethdev port_id=0 invalid rss_hf: 0x3afbc, valid value: 0x0
NETIF: netif_port_start: fail to config dpdk0
EAL: Error - exiting with code: 1
Cause: Start dpdk0 failed, skipping ...
# Fix option 1: use a NIC that supports multi-queue and RSS hashing.
# Fix option 2: modify the netif module so that multi-queue and RSS hashing are not enabled.
static struct rte_eth_conf default_port_conf = {
.rxmode = {
//.mq_mode = ETH_MQ_RX_RSS,
.mq_mode = ETH_MQ_RX_NONE,
......
},
.rx_adv_conf = {
.rss_conf = {
.rss_key = NULL,
//.rss_hf = /*ETH_RSS_IP*/ ETH_RSS_TCP,
.rss_hf = 0
},
},
......
port->dev_conf.rx_adv_conf.rss_conf.rss_hf = 0;
Issue 8: multicast address configuration is not supported
Kni: kni_add_dev: fail to set mac FA:27:00:00:0A:02 for dpdk0.kni: Timer expired
Kni: kni_add_dev: fail to set mac FA:27:00:00:0B:F6 for dpdk1.kni: Timer expired
NETIF: multicast address add failed for device dpdk0
EAL: Error - exiting with code: 1
Cause: Start dpdk0 failed, skipping ...
# Fix: disable the multicast address setup
//ret = idev_add_mcast_init(port);
//if (ret != EDPVS_OK) {
// RTE_LOG(WARNING, NETIF, "multicast address add failed for device %s\n", port->name);
// return ret;
//}
Issue 9: the LB connection pool memory is too small and the program crashes.
$ ./ipvsadm -A -t 10.0.0.100:80 -s rr
[sockopt_msg_recv] socket msg header recv error -- 0/88 recieved
IPVS: lb_conn_hash_table_init: lcore 0: create conn_hash_tbl failed. Requested size: 1073741824 bytes. LB_CONN_CACHE_LINES_DEF: 1, LB_CONN_TBL_SIZE: 16777216
# Fix option 1: keep increasing hugepage memory until it meets the actual requirement.
# Fix option 2:
# 1) free up the hugepage memory used by one lcore
# 2) shrink DPVS_CONN_POOL_SIZE_DEF from 2097152 to 1048576
//#define DPVS_CONN_POOL_SIZE_DEF 2097152
#define DPVS_CONN_POOL_SIZE_DEF 1048576
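As a sanity check, the "Requested size" in the log matches the constants printed alongside it: 16,777,216 table entries × 1 cache line per entry × 64 bytes per cache line comes to exactly 1 GiB for each lcore's conn hash table:
$ echo $(( 16777216 * 1 * 64 ))
1073741824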
Issue 10: the compiler version is too old and lacks the required instruction support.
error: inlining failed in call to always_inline "'_mm256_cmpeq_epi64_mask':" : target specific option mismatch
# Fix:
# 1) upgrade GCC to 9.4.0
# 2) confirm that the CPU supports the required instruction set. ref: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#expand=3828,301,2553&text=_mm256_cmpeq_epi64_mask&ig_expand=872
Note that the issues above were recorded while the author was debugging on a low-spec development machine; on a properly resourced physical test machine, most of the resource-related problems simply do not occur.
Finally, this article has covered the basic operating principles of Intel HDSLB and how to deploy and configure it, hopefully helping readers get the HDSLB-DPVS project up and running. In the next article, 《Intel HDSLB 高性能四層負載均衡器 — 高級特性和代碼剖析》, we will build on this development-machine environment to dig deeper into HDSLB-DPVS's advanced features, software architecture, and source code. Stay tuned. :)