IP OVER INFINIBAND

The ib_ipoib driver is an implementation of the IP over InfiniBand
protocol as specified by RFCs 4391 and 4392, issued by the IETF ipoib
working group. It is a "native" implementation in the sense of
setting the interface type to ARPHRD_INFINIBAND and the hardware
address length to 20 (earlier proprietary implementations
masqueraded to the kernel as Ethernet interfaces).

Partitions and P_Keys

When the IPoIB driver is loaded, it creates one interface for each
port using the P_Key at index 0. To create an interface with a
different P_Key, write the desired P_Key into the main interface's
/sys/class/net/<intf name>/create_child file. For example:

  echo 0x8001 > /sys/class/net/ib0/create_child

This will create an interface named ib0.8001 with P_Key 0x8001. To
remove a subinterface, use the "delete_child" file:

  echo 0x8001 > /sys/class/net/ib0/delete_child

The P_Key for any interface is given by the "pkey" file, and the
main interface for a subinterface is given by the "parent" file.

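As a rough sketch of how these files fit together, the following
shell function prints the P_Key and parent of each subinterface of a
main interface. The name ipoib_list_children and the SYSFS_NET
variable are illustrative and not part of the driver, and the sketch
assumes that subinterfaces follow the <parent>.<pkey> naming shown
above.

```shell
# Hypothetical helper: list the subinterfaces of a main IPoIB interface.
# SYSFS_NET is an assumption that lets the sysfs root be overridden.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

ipoib_list_children() {
    parent="$1"
    for dir in "$SYSFS_NET/$parent".*; do
        # Skip the unexpanded glob when no subinterface exists.
        [ -r "$dir/pkey" ] || continue
        printf '%s pkey=%s parent=%s\n' \
            "${dir##*/}" "$(cat "$dir/pkey")" "$(cat "$dir/parent")"
    done
}
```

For example, "ipoib_list_children ib0" would print one line per
subinterface of ib0.
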
Datagram vs Connected modes

The IPoIB driver supports two modes of operation: datagram and
connected. The mode is set and read through an interface's
/sys/class/net/<intf name>/mode file.

In datagram mode, the IB UD (Unreliable Datagram) transport is used
and so the interface MTU is equal to the IB L2 MTU minus the
IPoIB encapsulation header (4 bytes). For example, in a typical IB
fabric with a 2K MTU, the IPoIB MTU will be 2048 - 4 = 2044 bytes.

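The datagram-mode arithmetic can be written out as a small shell
sketch. The function name ipoib_ud_mtu is made up for illustration;
only the 4-byte encapsulation header comes from the text above.

```shell
# Hypothetical helper: datagram-mode IPoIB MTU for a given IB L2 MTU.
IPOIB_ENCAP_LEN=4   # IPoIB encapsulation header, in bytes

ipoib_ud_mtu() {
    echo $(( $1 - IPOIB_ENCAP_LEN ))
}

ipoib_ud_mtu 2048   # the 2K fabric from the text
ipoib_ud_mtu 4096   # a 4K fabric, for comparison
```

The first call reproduces the 2044-byte figure given above.
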
In connected mode, the IB RC (Reliable Connected) transport is used.
Connected mode takes advantage of the connected nature of the IB
transport and allows an MTU up to the maximal IP packet size of 64K,
which reduces the number of IP packets needed for handling large UDP
datagrams, TCP segments, etc., and increases the performance for large
messages.

In connected mode, the interface's UD QP is still used for multicast
and communication with peers that don't support connected mode. In
this case, RX emulation of ICMP PMTU packets is used to cause the
networking stack to use the smaller UD MTU for these neighbours.

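Switching modes through the "mode" file might look like the sketch
below. The helper name ipoib_set_mode and the SYSFS_NET variable are
illustrative, and the sketch assumes the file accepts the mode names
"datagram" and "connected" as plain strings.

```shell
# Hypothetical helper: set an interface's IPoIB mode and read it back.
# SYSFS_NET is an assumption that lets the sysfs root be overridden.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

ipoib_set_mode() {
    intf="$1"
    mode="$2"
    # Accept only the two modes named in the text.
    case "$mode" in
        datagram|connected) ;;
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    echo "$mode" > "$SYSFS_NET/$intf/mode" && cat "$SYSFS_NET/$intf/mode"
}
```

For example, "ipoib_set_mode ib0 connected" writes the new mode and
prints what the file reports afterwards.
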
Stateless offloads

If the IB HW supports IPoIB stateless offloads, IPoIB advertises
TCP/IP checksum and/or Large Send (LSO) offloading capability to the
network stack.

Large Receive (LRO) offloading is also implemented and may be turned
on/off using ethtool calls. Currently LRO is supported only for
checksum offload capable devices.

Stateless offloads are supported only in datagram mode.

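A minimal sketch of the ethtool calls involved, wrapped in two
illustrative helpers (the function names are not part of any tool).
Whether the "lro" keyword takes effect depends on the kernel, the
ethtool version, and the device.

```shell
# Hypothetical helper: show the offload features an interface advertises.
ipoib_show_offloads() {
    ethtool -k "$1"
}

# Hypothetical helper: turn LRO on or off; $2 is "on" or "off".
ipoib_set_lro() {
    ethtool -K "$1" lro "$2"
}
```

For example, "ipoib_set_lro ib0 off" disables LRO on ib0, and
"ipoib_show_offloads ib0" lists the resulting feature state.
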
Interrupt moderation

If the underlying IB device supports CQ event moderation, one can
use ethtool to set interrupt mitigation parameters and thus reduce
the overhead incurred by handling interrupts. The main code path of
IPoIB doesn't use events for TX completion signaling, so only RX
moderation is supported.

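RX moderation could be set along the following lines. The helper name
is illustrative and any values passed to it are placeholders, not
recommendations; the sketch assumes the device honours ethtool's
rx-usecs and rx-frames coalescing parameters.

```shell
# Hypothetical helper: set RX interrupt moderation on an interface.
# $2 is the delay in microseconds, $3 the maximum number of frames.
ipoib_set_rx_moderation() {
    ethtool -C "$1" rx-usecs "$2" rx-frames "$3"
}

# Hypothetical helper: show the current coalescing settings.
ipoib_show_moderation() {
    ethtool -c "$1"
}
```

For example, "ipoib_set_rx_moderation ib0 16 32" requests an interrupt
after 16 microseconds or 32 received frames.
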
Debugging Information

By compiling the IPoIB driver with CONFIG_INFINIBAND_IPOIB_DEBUG set
to 'y', tracing messages are compiled into the driver. They are
turned on by setting the module parameters debug_level and
mcast_debug_level to 1. These parameters can be controlled at
runtime through files in /sys/module/ib_ipoib/.

CONFIG_INFINIBAND_IPOIB_DEBUG also enables files in the debugfs
virtual filesystem. By mounting this filesystem, for example with

  mount -t debugfs none /sys/kernel/debug

it is possible to get statistics about multicast groups from the
files /sys/kernel/debug/ipoib/ib0_mcg and so on.

The performance impact of this option is negligible, so it
is safe to enable this option with debug_level set to 0 for normal
operation.

CONFIG_INFINIBAND_IPOIB_DEBUG_DATA enables even more debug output in
the data path when data_debug_level is set to 1. However, even with
the output disabled, enabling this configuration option will affect
performance, because it adds tests to the fast path.

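Toggling the tracing messages at runtime might be sketched as
follows. The helper name and the IPOIB_PARAMS variable are
illustrative, and the "parameters" subdirectory is an assumption
based on where module parameters are normally exposed in sysfs.

```shell
# Hypothetical helper: set both tracing parameters to the same level.
# IPOIB_PARAMS is an assumption that lets the directory be overridden.
IPOIB_PARAMS="${IPOIB_PARAMS:-/sys/module/ib_ipoib/parameters}"

ipoib_set_debug() {
    level="$1"
    for param in debug_level mcast_debug_level; do
        echo "$level" > "$IPOIB_PARAMS/$param"
    done
}
```

For example, "ipoib_set_debug 1" turns the messages on and
"ipoib_set_debug 0" turns them off again.
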
References

Transmission of IP over InfiniBand (IPoIB) (RFC 4391)
  http://ietf.org/rfc/rfc4391.txt
IP over InfiniBand (IPoIB) Architecture (RFC 4392)
  http://ietf.org/rfc/rfc4392.txt
IP over InfiniBand: Connected Mode (RFC 4755)
  http://ietf.org/rfc/rfc4755.txt