this_cpu operations
-------------------

this_cpu operations are a way of optimizing access to per cpu
variables associated with the *currently* executing processor through
the use of segment registers (or a dedicated register in which the cpu
permanently stores the beginning of the per cpu area for a specific
processor).

The this_cpu operations add a per cpu variable offset to the processor
specific percpu base and encode that operation in the instruction
operating on the per cpu variable.
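
The base plus offset arithmetic can be modelled in user space. The
sketch below is not kernel code: NR_CPUS, struct percpu_area, areas,
X_OFFSET, percpu_addr() and read_x_on() are names invented for this
illustration, and the per cpu areas are an ordinary array. It only
shows that one offset yields a different address for each processor
base.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 4			/* hypothetical processor count */

/* Layout of one simulated per cpu area (an assumption of this model). */
struct percpu_area {
	long pad;
	int x;
};

static struct percpu_area areas[NR_CPUS];

/* The "per cpu variable" x is only an offset into each area. */
#define X_OFFSET offsetof(struct percpu_area, x)

/* Relocation: processor specific base + per cpu variable offset. */
static void *percpu_addr(int cpu, size_t offset)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return (char *)&areas[cpu] + offset;
}

/* Read the copy of x that belongs to the given processor. */
static int read_x_on(int cpu)
{
	return *(int *)percpu_addr(cpu, X_OFFSET);
}
```

In the real implementation the addition is not a separate step at all;
it is folded into the instruction that accesses the variable.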

This means there are no atomicity issues between the calculation of
the offset and the operation on the data. Therefore it is not
necessary to disable preemption or interrupts to ensure that the
processor is not changed between the calculation of the address and
the operation on the data.

Read-modify-write operations are of particular interest. Frequently
processors have special lower latency instructions that can operate
without the typical synchronization overhead but still provide some
sort of relaxed atomicity guarantee. The x86 for example can execute
RMW (Read Modify Write) instructions like inc/dec/cmpxchg without the
lock prefix and the associated latency penalty.

Access to the variable without the lock prefix is not synchronized but
synchronization is not necessary since we are dealing with per cpu
data specific to the currently executing processor. Only the current
processor should be accessing that variable and therefore there are no
concurrency issues with other processors in the system.

On x86 the fs: or the gs: segment registers contain the base of the
per cpu area. It is then possible to simply use the segment override
to relocate a per cpu relative address to the proper per cpu area for
the processor. So the relocation to the per cpu base is encoded in the
instruction via a segment register prefix.

For example:

	DEFINE_PER_CPU(int, x);
	int z;

	z = this_cpu_read(x);

results in a single instruction

	mov ax, gs:[x]

instead of a sequence that first calculates the address and then
fetches from that address, as occurs with the percpu operations. Before
this_cpu_ops such a sequence also required preempt disable/enable to
prevent the kernel from moving the thread to a different processor
while the calculation is performed.
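
Why the two-step sequence needs protection can be shown with a small
user-space model. This is a sketch under stated assumptions, not
kernel code: current_cpu, x[], address_of_x(),
two_step_inc_with_migration() and one_step_inc() are hypothetical
names, and the migration is simulated by assigning to current_cpu.

```c
#include <assert.h>

#define NR_CPUS 2		/* hypothetical processor count */

static int current_cpu;		/* processor the simulated task runs on */
static int x[NR_CPUS];		/* one copy of x per simulated processor */

/* Step 1 of the two-step sequence: compute the address for this cpu. */
static int *address_of_x(void)
{
	assert(current_cpu >= 0 && current_cpu < NR_CPUS);
	return &x[current_cpu];
}

/* Two-step increment with a migration between address and access. */
static void two_step_inc_with_migration(int new_cpu)
{
	int *y = address_of_x();	/* address belongs to the old cpu */

	current_cpu = new_cpu;		/* simulated migration */
	(*y)++;				/* updates the old cpu's copy */
}

/* Model of a single instruction this_cpu_inc(): no window in between. */
static void one_step_inc(void)
{
	x[current_cpu]++;
}
```

In the model the two-step variant ends up modifying the copy of a
processor the task no longer runs on, which is the situation the
preempt disable/enable pair used to prevent.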

The main use of the this_cpu operations has been to optimize counter
operations.

	this_cpu_inc(x)

results in the following single instruction (no lock prefix!)

	inc gs:[x]

instead of the following operations required if there is no segment
register:

	int *y;
	int cpu;

	cpu = get_cpu();
	y = per_cpu_ptr(&x, cpu);
	(*y)++;
	put_cpu();

Note that these operations can only be used on percpu data that is
reserved for a specific processor. Without disabling preemption in the
surrounding code this_cpu_inc() will only guarantee that one of the
percpu counters is correctly incremented. However, there is no
guarantee that the OS will not move the process directly before or
after the this_cpu instruction is executed. In general this means that
the values of the individual counters for each processor are
meaningless. The sum of all the per cpu counters is the only value
that is of interest.

Per cpu variables are used for performance reasons. Bouncing cache
lines can be avoided if multiple processors concurrently go through
the same code paths. Since each processor has its own per cpu
variables no concurrent cacheline updates take place. The price that
has to be paid for this optimization is the need to add up the per cpu
counters when the value of the counter is needed.
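
The summation can be sketched in user space as follows. The names
counter[], model_this_cpu_inc(), counter_sum() and migrate_and_count()
are hypothetical and the processors are simulated by array slots; in
the kernel the loop would typically walk the possible cpus with
for_each_possible_cpu() and per_cpu().

```c
#include <assert.h>

#define NR_CPUS 4		/* hypothetical processor count */

static long counter[NR_CPUS];	/* one counter slot per simulated cpu */

/* Model of this_cpu_inc(): touch only the slot of the executing cpu. */
static void model_this_cpu_inc(int current_cpu)
{
	assert(current_cpu >= 0 && current_cpu < NR_CPUS);
	counter[current_cpu]++;
}

/* The only meaningful value: the sum over all per cpu slots. */
static long counter_sum(void)
{
	long sum = 0;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		sum += counter[cpu];
	return sum;
}

/* A task that counts three events while being migrated once. */
static long migrate_and_count(void)
{
	model_this_cpu_inc(0);	/* event seen while on cpu 0 */
	model_this_cpu_inc(3);	/* task was moved to cpu 3 */
	model_this_cpu_inc(3);
	return counter_sum();
}
```

How the three events are spread over the slots depends on where the
task happened to run; only the total is stable.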


Special operations:
-------------------

	y = this_cpu_ptr(&x)

Takes the offset of a per cpu variable (&x !) and returns the address
of the per cpu variable that belongs to the currently executing
processor. this_cpu_ptr avoids multiple steps that the common
get_cpu/put_cpu sequence requires. No processor number is
available. Instead the offset of the local per cpu area is simply
added to the percpu offset.



Per cpu variables and offsets
-----------------------------

Per cpu variables have *offsets* to the beginning of the percpu
area. They do not have addresses although they look like that in the
code. Offsets cannot be directly dereferenced. The offset must be
added to a base pointer of a percpu area of a processor in order to
form a valid address.

Therefore the use of x or &x outside of the context of per cpu
operations is invalid and will generally be treated like a NULL
pointer dereference.

In the context of per cpu operations

	x is a per cpu variable. Most this_cpu operations take a per cpu
	variable.

	&x is the *offset* of a per cpu variable. this_cpu_ptr() takes
	the offset of a per cpu variable which makes this look a bit
	strange.



Operations on a field of a per cpu structure
--------------------------------------------

Let's say we have a percpu structure

	struct s {
		int n,m;
	};

	DEFINE_PER_CPU(struct s, p);


Operations on these fields are straightforward:

	this_cpu_inc(p.m)

	z = this_cpu_cmpxchg(p.m, 0, 1);


If we have an offset to struct s:

	struct s __percpu *ps = &p;

	this_cpu_dec(ps->m);

	z = this_cpu_inc_return(ps->n);


The calculation of the pointer may require the use of this_cpu_ptr()
if we do not make use of this_cpu ops later to manipulate fields:

	struct s *pp;

	pp = this_cpu_ptr(&p);

	pp->m--;

	z = pp->n++;


Variants of this_cpu ops
------------------------

this_cpu ops are interrupt safe. Some architectures do not support
these per cpu local operations. In that case the operation must be
replaced by code that disables interrupts, then does the operations
that are guaranteed to be atomic and then re-enables interrupts. Doing
so is expensive. If there are other reasons why the scheduler cannot
change the processor we are executing on then there is no reason to
disable interrupts. For that purpose the __this_cpu operations are
provided. For example:

	__this_cpu_inc(x);

will increment x and will not fall back to code that disables
interrupts on platforms that cannot accomplish atomicity through
address relocation and a Read-Modify-Write operation in the same
instruction.
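
The difference between the two variants on such a platform can be
modelled in user space. This is only a sketch: irqs_enabled,
irq_disable_count, model_irq_save(), model_irq_restore(),
model_this_cpu_inc() and model_unprotected_inc() are hypothetical
names, and interrupts are represented by a plain flag rather than by
real hardware state.

```c
#include <assert.h>
#include <stdbool.h>

static bool irqs_enabled = true;	/* simulated interrupt flag */
static int irq_disable_count;		/* how often the flag was cleared */
static int x_model;			/* stands in for this cpu's copy of x */

/* Remember the old state and disable simulated interrupts. */
static bool model_irq_save(void)
{
	bool flags = irqs_enabled;

	irqs_enabled = false;
	irq_disable_count++;
	return flags;
}

/* Put the simulated interrupt flag back to its saved state. */
static void model_irq_restore(bool flags)
{
	irqs_enabled = flags;
}

/* Fallback model of this_cpu_inc(): guard the read-modify-write. */
static void model_this_cpu_inc(int *var)
{
	bool flags = model_irq_save();

	assert(!irqs_enabled);
	*var += 1;
	model_irq_restore(flags);
}

/* Model of __this_cpu_inc(): the caller already guarantees safety. */
static void model_unprotected_inc(int *var)
{
	*var += 1;
}
```

Both functions produce the same counter value; the second one simply
skips the save and restore that make the fallback expensive.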



&this_cpu_ptr(pp)->n vs this_cpu_ptr(&pp->n)
--------------------------------------------

The first operation takes the offset and forms an address and then
adds the offset of the n field.

The second one first adds the two offsets and then does the
relocation. IMHO the second form looks cleaner and has an easier time
with (). The second form also is consistent with the way
this_cpu_read() and friends are used.
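
That both orderings reach the same address can be checked with a
user-space model. The sketch assumes a made-up area layout (struct
percpu_area, areas) and a hypothetical model_this_cpu_ptr() that takes
the processor number explicitly; it uses the m field because n sits at
offset zero in struct s and would make the comparison trivial.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 2		/* hypothetical processor count */

struct s {
	int n, m;
};

/* Layout of one simulated per cpu area (an assumption of this model). */
struct percpu_area {
	long pad;
	struct s p;
};

static struct percpu_area areas[NR_CPUS];

/* Model of this_cpu_ptr(): add the base of the given cpu's area. */
static void *model_this_cpu_ptr(int cpu, size_t offset)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return (char *)&areas[cpu] + offset;
}

/* First form: relocate the structure, then step to the field. */
static int *field_after_relocation(int cpu)
{
	struct s *sp = model_this_cpu_ptr(cpu, offsetof(struct percpu_area, p));

	return &sp->m;
}

/* Second form: add the two offsets first, then relocate. */
static int *relocation_after_field(int cpu)
{
	size_t off = offsetof(struct percpu_area, p) + offsetof(struct s, m);

	return model_this_cpu_ptr(cpu, off);
}
```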


Christoph Lameter, April 3rd, 2013