Wait/Wound Deadlock-Proof Mutex Design
======================================

Please read mutex-design.txt first, as it applies to wait/wound mutexes too.

Motivation for WW-Mutexes
-------------------------

GPUs do operations that commonly involve many buffers. Those buffers
can be shared across contexts/processes, exist in different memory
domains (for example VRAM vs system memory), and so on. And with
PRIME / dmabuf, they can even be shared across devices. So there are
a handful of situations where the driver needs to wait for buffers to
become ready. If you think about this in terms of waiting on a buffer
mutex for it to become available, this presents a problem because
there is no way to guarantee that buffers appear in an execbuf/batch in
the same order in all contexts. That is directly under the control of
userspace, and a result of the sequence of GL calls that an application
makes, which results in the potential for deadlock. The problem gets
more complex when you consider that the kernel may need to migrate the
buffer(s) into VRAM before the GPU operates on the buffer(s), which
may in turn require evicting some other buffers (and you don't want to
evict other buffers which are already queued up to the GPU), but for a
simplified understanding of the problem you can ignore this.

The algorithm that the TTM graphics subsystem came up with for dealing with
this problem is quite simple. For each group of buffers (execbuf) that need
to be locked, the caller would be assigned a unique reservation id/ticket,
from a global counter. In case of deadlock while locking all the buffers
associated with an execbuf, the one with the lowest reservation ticket (i.e.
the oldest task) wins, and the one with the higher reservation id (i.e. the
younger task) unlocks all of the buffers that it has already locked, and then
tries again.

In the RDBMS literature this deadlock handling approach is called wait/wound:
The older task waits until it can acquire the contended lock. The younger task
needs to back off and drop all the locks it is currently holding, i.e. the
younger task is wounded.

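Conceptually the conflict resolution boils down to a ticket comparison,
roughly like the following sketch (all helper names here are made up for
illustration only):

/* on contention, compare tickets: a lower ticket means an older task */
if (our_ticket < holder_ticket) {
	wait_for_unlock();	/* older task: wait for the contended lock */
} else {
	unlock_all_held();	/* younger task is wounded: back off, */
	retry_acquisition();	/* then retry with the same ticket */
}
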
Concepts
--------

Compared to normal mutexes, two additional concepts/objects show up in the lock
interface for w/w mutexes:

Acquire context: To ensure eventual forward progress it is important that a task
trying to acquire locks doesn't grab a new reservation id, but keeps the one it
acquired when starting the lock acquisition. This ticket is stored in the
acquire context. Furthermore the acquire context keeps track of debugging state
to catch w/w mutex interface abuse.

W/w class: In contrast to normal mutexes the lock class needs to be explicit for
w/w mutexes, since it is required to initialize the acquire context.

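For illustration, the lifecycle of an acquire context looks roughly like this
(a minimal sketch; the class and function names are just for this example):

static DEFINE_WW_CLASS(example_ww_class);

void example(void)
{
	struct ww_acquire_ctx ctx;

	/* all mutexes of this class are set up with
	 * ww_mutex_init(&obj->lock, &example_ww_class) */

	ww_acquire_init(&ctx, &example_ww_class);	/* grabs a ticket */

	/* ... lock all needed w/w mutexes with &ctx, handling -EDEADLK ... */

	ww_acquire_done(&ctx);	/* optional: no more locks will be taken */

	/* ... use the locked resources, then unlock all of them ... */

	ww_acquire_fini(&ctx);	/* drops the ticket */
}
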
Furthermore there are three different classes of w/w lock acquire functions:

* Normal lock acquisition with a context, using ww_mutex_lock.

* Slowpath lock acquisition on the contending lock, used by the wounded task
  after having dropped all already acquired locks. These functions have the
  _slow postfix.

  From a simple semantics point-of-view the _slow functions are not strictly
  required, since simply calling the normal ww_mutex_lock functions on the
  contending lock (after having dropped all other already acquired locks) will
  work correctly. After all, if no other ww mutex has been acquired yet there's
  no deadlock potential and hence the ww_mutex_lock call will block and not
  prematurely return -EDEADLK. The advantage of the _slow functions is in
  interface safety:
  - ww_mutex_lock has a __must_check int return type, whereas ww_mutex_lock_slow
    has a void return type. Note that since ww mutex code needs loops/retries
    anyway the __must_check doesn't result in spurious warnings, even though the
    very first lock operation can never fail.
  - When full debugging is enabled ww_mutex_lock_slow checks that all acquired
    ww mutexes have been released (preventing deadlocks) and makes sure that we
    block on the contending lock (preventing spinning through the -EDEADLK
    slowpath until the contended lock can be acquired).

* Functions to only acquire a single w/w mutex, which results in the exact same
  semantics as a normal mutex. This is done by calling ww_mutex_lock with a NULL
  context.

  Again this is not strictly required. But often you only want to acquire a
  single lock in which case it's pointless to set up an acquire context (and so
  better to avoid grabbing a deadlock avoidance ticket).

Of course, all the usual variants for handling wake-ups due to signals are also
provided.

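For example, a signal-aware acquisition could look like this (a sketch only;
obj stands for any structure embedding a w/w mutex, and the -EINTR unwinding
is up to the caller):

ret = ww_mutex_lock_interruptible(&obj->lock, ctx);
if (ret == -EINTR) {
	/* interrupted by a signal: unwind as in the -EDEADLK case,
	 * but without taking the contended lock afterwards */
}

/* the corresponding slowpath variant: */
ret = ww_mutex_lock_slow_interruptible(&obj->lock, ctx);
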
Usage
-----

There are three different ways to acquire locks within the same w/w class.
Common definitions for methods #1 and #2:

static DEFINE_WW_CLASS(ww_class);

struct obj {
	struct ww_mutex lock;
	/* obj data */
};

struct obj_entry {
	struct list_head head;
	struct obj *obj;
};

Method 1, using a list in execbuf->buffers that's not allowed to be reordered.
This is useful if a list of required objects is already tracked somewhere.
Furthermore the lock helper can propagate the -EALREADY return code back to
the caller as a signal that an object appears twice on the list. This is useful
if the list is constructed from userspace input and the ABI requires userspace
to not have duplicate entries (e.g. for a GPU commandbuffer submission ioctl).

int lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
{
	struct obj *res_obj = NULL;
	struct obj_entry *contended_entry = NULL;
	struct obj_entry *entry;
	int ret;

	ww_acquire_init(ctx, &ww_class);

retry:
	list_for_each_entry (entry, list, head) {
		if (entry->obj == res_obj) {
			res_obj = NULL;
			continue;
		}
		ret = ww_mutex_lock(&entry->obj->lock, ctx);
		if (ret < 0) {
			contended_entry = entry;
			goto err;
		}
	}

	ww_acquire_done(ctx);
	return 0;

err:
	list_for_each_entry_continue_reverse (entry, list, head)
		ww_mutex_unlock(&entry->obj->lock);

	if (res_obj)
		ww_mutex_unlock(&res_obj->lock);

	if (ret == -EDEADLK) {
		/* we lost out in a seqno race, lock and retry.. */
		ww_mutex_lock_slow(&contended_entry->obj->lock, ctx);
		res_obj = contended_entry->obj;
		goto retry;
	}
	ww_acquire_fini(ctx);

	return ret;
}

Method 2, using a list in execbuf->buffers that can be reordered. Same semantics
of duplicate-entry detection using -EALREADY as in method 1 above. But the
list-reordering allows for a bit more idiomatic code.

int lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
{
	struct obj_entry *entry, *entry2;
	int ret;

	ww_acquire_init(ctx, &ww_class);

	list_for_each_entry (entry, list, head) {
		ret = ww_mutex_lock(&entry->obj->lock, ctx);
		if (ret < 0) {
			entry2 = entry;

			list_for_each_entry_continue_reverse (entry2, list, head)
				ww_mutex_unlock(&entry2->obj->lock);

			if (ret != -EDEADLK) {
				ww_acquire_fini(ctx);
				return ret;
			}

			/* we lost out in a seqno race, lock and retry.. */
			ww_mutex_lock_slow(&entry->obj->lock, ctx);

			/*
			 * Move buf to head of the list, this will point
			 * buf->next to the first unlocked entry,
			 * restarting the for loop.
			 */
			list_del(&entry->head);
			list_add(&entry->head, list);
		}
	}

	ww_acquire_done(ctx);
	return 0;
}

Unlocking works the same way for both methods #1 and #2:

void unlock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
{
	struct obj_entry *entry;

	list_for_each_entry (entry, list, head)
		ww_mutex_unlock(&entry->obj->lock);

	ww_acquire_fini(ctx);
}

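A caller would then use these helpers roughly as follows (a sketch; the
function name and the surrounding submission logic are assumptions of this
example):

int submit_execbuf(struct list_head *buffers)
{
	struct ww_acquire_ctx ctx;
	int ret;

	ret = lock_objs(buffers, &ctx);
	if (ret)
		return ret;	/* e.g. -EALREADY for duplicate objects */

	/* ... all buffers are locked, process the submission ... */

	unlock_objs(buffers, &ctx);
	return 0;
}
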
Method 3 is useful if the list of objects is constructed ad-hoc and not upfront,
e.g. when adjusting edges in a graph where each node has its own ww_mutex lock,
and edges can only be changed when holding the locks of all involved nodes. w/w
mutexes are a natural fit for such a case for two reasons:
- They can handle lock-acquisition in any order, which allows us to start
  walking a graph from a starting point and then iteratively discovering new
  edges and locking down the nodes those edges connect to.
- Due to the -EALREADY return code signalling that a given object is already
  held there's no need for additional book-keeping to break cycles in the graph
  or keep track of which locks are already held (when using more than one node
  as a starting point).

Note that this approach differs in two important ways from the above methods:
- Since the list of objects is dynamically constructed (and might very well be
  different when retrying due to hitting the -EDEADLK wound condition) there's
  no need to keep any object on a persistent list when it's not locked. We can
  therefore move the list_head into the object itself.
- On the other hand the dynamic object list construction also means that the
  -EALREADY return code can't be propagated.

Note also that methods #1 and #2 can be combined with method #3, e.g. to first
lock a list of starting nodes (passed in from userspace) using one of the above
methods, and then lock any additional objects affected by the operations using
method #3 below. The backoff/retry procedure will be a bit more involved, since
when the dynamic locking step hits -EDEADLK we also need to unlock all the
objects acquired with the fixed list (see the sketch below). But the w/w mutex
debug checks will catch any interface misuse for these cases.

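The combined flow could be structured like the following skeleton (a rough
sketch, not a complete implementation: lock_fixed()/unlock_fixed() stand for a
method #1/#2 style helper pair without the context setup, assumed to handle
their own -EDEADLK retries and to drop their locks on failure;
lock_extra()/unlock_extra() stand for a method #3 style pair that reports the
contended mutex; the bookkeeping for re-encountering the already-held contended
lock on the retry is omitted):

int lock_combined(struct list_head *fixed, struct list_head *extra,
		  struct ww_acquire_ctx *ctx)
{
	struct ww_mutex *contended = NULL;
	int ret;

	ww_acquire_init(ctx, &ww_class);

retry:
	ret = lock_fixed(fixed, ctx);
	if (ret)
		goto err;

	ret = lock_extra(extra, &contended, ctx);
	if (ret == -EDEADLK) {
		/* back off: drop the locks acquired in *both* steps */
		unlock_extra(extra);
		unlock_fixed(fixed);

		ww_mutex_lock_slow(contended, ctx);
		goto retry;
	}
	if (ret)
		goto err;

	ww_acquire_done(ctx);
	return 0;

err:
	ww_acquire_fini(ctx);
	return ret;
}
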
Also, method 3 can't fail the lock acquisition step since it doesn't return
-EALREADY. Of course this would be different when using the _interruptible
variants, but that's outside of the scope of these examples here.

struct obj {
	struct ww_mutex ww_mutex;
	struct list_head locked_list;
};

static DEFINE_WW_CLASS(ww_class);

void __unlock_objs(struct list_head *list)
{
	struct obj *entry, *temp;

	list_for_each_entry_safe (entry, temp, list, locked_list) {
		/* need to do that before unlocking, since only the current
		 * lock holder is allowed to use the object */
		list_del(&entry->locked_list);
		ww_mutex_unlock(&entry->ww_mutex);
	}
}

void lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
{
	struct obj *obj;
	int ret;

	ww_acquire_init(ctx, &ww_class);

retry:
	/* re-init loop start state */
	loop {
		/* magic code which walks over a graph and decides which objects
		 * to lock */

		ret = ww_mutex_lock(&obj->ww_mutex, ctx);
		if (ret == -EALREADY) {
			/* we have that one already, get to the next object */
			continue;
		}
		if (ret == -EDEADLK) {
			__unlock_objs(list);

			ww_mutex_lock_slow(&obj->ww_mutex, ctx);
			list_add(&obj->locked_list, list);
			goto retry;
		}

		/* locked a new object, add it to the list */
		list_add_tail(&obj->locked_list, list);
	}

	ww_acquire_done(ctx);
}

void unlock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
{
	__unlock_objs(list);
	ww_acquire_fini(ctx);
}

Method 4: Only locking a single object. In that case deadlock detection and
prevention is obviously overkill, since with grabbing just one lock you can't
produce a deadlock within just one class. To simplify this case the w/w mutex
api can be used with a NULL context.

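For example (a minimal sketch, with obj embedding a w/w mutex as in the
examples above):

int ret;

/* with a NULL context this behaves like a normal mutex lock and can
 * never return -EDEADLK or -EALREADY, so ret is always 0 here */
ret = ww_mutex_lock(&obj->ww_mutex, NULL);

/* ... use obj ... */

ww_mutex_unlock(&obj->ww_mutex);
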
Implementation Details
----------------------

Design:
ww_mutex currently encapsulates a struct mutex, which means no extra overhead
for normal mutex locks, which are far more common. As such there is only a
small increase in code size if wait/wound mutexes are not used.

In general, not much contention is expected. The locks are typically used to
serialize access to resources for devices. The only way to make wakeups
smarter would be at the cost of adding a field to struct mutex_waiter. This
would add overhead to all cases where normal mutexes are used, and
ww_mutexes are generally less performance sensitive.

Lockdep:
Special care has been taken to warn for as many cases of api abuse
as possible. Some common api abuses will be caught with
CONFIG_DEBUG_MUTEXES, but CONFIG_PROVE_LOCKING is recommended.

Some of the errors which will be warned about:
- Forgetting to call ww_acquire_fini or ww_acquire_init.
- Attempting to lock more mutexes after ww_acquire_done.
- Attempting to lock the wrong mutex after -EDEADLK and
  unlocking all mutexes.
- Attempting to lock the right mutex after -EDEADLK,
  before unlocking all mutexes.
- Calling ww_mutex_lock_slow before -EDEADLK was returned.
- Unlocking mutexes with the wrong unlock function.
- Calling one of the ww_acquire_* twice on the same context.
- Using a different ww_class for the mutex than for the ww_acquire_ctx.
- Normal lockdep errors that can result in deadlocks.

Some of the lockdep errors that can result in deadlocks:
- Calling ww_acquire_init to initialize a second ww_acquire_ctx before
  having called ww_acquire_fini on the first.
- 'normal' deadlocks that can occur.

FIXME: Update this section once we have the TASK_DEADLOCK task state flag magic
implemented.