Root/
Source at commit fbf123cd4cc0c097fe9a99c90109ebb2a5e94a50 created 10 years 3 months ago. By Lars-Peter Clausen, dma: jz4740: Dequeue descriptor from active list before completing it | |
---|---|
1 | Futex Requeue PI |
2 | ---------------- |
3 | |
4 | Requeueing of tasks from a non-PI futex to a PI futex requires |
5 | special handling in order to ensure the underlying rt_mutex is never |
6 | left without an owner if it has waiters; doing so would break the PI |
7 | boosting logic [see rt-mutex-desgin.txt] For the purposes of |
8 | brevity, this action will be referred to as "requeue_pi" throughout |
9 | this document. Priority inheritance is abbreviated throughout as |
10 | "PI". |
11 | |
12 | Motivation |
13 | ---------- |
14 | |
15 | Without requeue_pi, the glibc implementation of |
16 | pthread_cond_broadcast() must resort to waking all the tasks waiting |
17 | on a pthread_condvar and letting them try to sort out which task |
18 | gets to run first in classic thundering-herd formation. An ideal |
19 | implementation would wake the highest-priority waiter, and leave the |
20 | rest to the natural wakeup inherent in unlocking the mutex |
21 | associated with the condvar. |
22 | |
23 | Consider the simplified glibc calls: |
24 | |
25 | /* caller must lock mutex */ |
26 | pthread_cond_wait(cond, mutex) |
27 | { |
28 | lock(cond->__data.__lock); |
29 | unlock(mutex); |
30 | do { |
31 | unlock(cond->__data.__lock); |
32 | futex_wait(cond->__data.__futex); |
33 | lock(cond->__data.__lock); |
34 | } while(...) |
35 | unlock(cond->__data.__lock); |
36 | lock(mutex); |
37 | } |
38 | |
39 | pthread_cond_broadcast(cond) |
40 | { |
41 | lock(cond->__data.__lock); |
42 | unlock(cond->__data.__lock); |
43 | futex_requeue(cond->data.__futex, cond->mutex); |
44 | } |
45 | |
46 | Once pthread_cond_broadcast() requeues the tasks, the cond->mutex |
47 | has waiters. Note that pthread_cond_wait() attempts to lock the |
48 | mutex only after it has returned to user space. This will leave the |
49 | underlying rt_mutex with waiters, and no owner, breaking the |
50 | previously mentioned PI-boosting algorithms. |
51 | |
52 | In order to support PI-aware pthread_condvar's, the kernel needs to |
53 | be able to requeue tasks to PI futexes. This support implies that |
54 | upon a successful futex_wait system call, the caller would return to |
55 | user space already holding the PI futex. The glibc implementation |
56 | would be modified as follows: |
57 | |
58 | |
59 | /* caller must lock mutex */ |
60 | pthread_cond_wait_pi(cond, mutex) |
61 | { |
62 | lock(cond->__data.__lock); |
63 | unlock(mutex); |
64 | do { |
65 | unlock(cond->__data.__lock); |
66 | futex_wait_requeue_pi(cond->__data.__futex); |
67 | lock(cond->__data.__lock); |
68 | } while(...) |
69 | unlock(cond->__data.__lock); |
70 | /* the kernel acquired the mutex for us */ |
71 | } |
72 | |
73 | pthread_cond_broadcast_pi(cond) |
74 | { |
75 | lock(cond->__data.__lock); |
76 | unlock(cond->__data.__lock); |
77 | futex_requeue_pi(cond->data.__futex, cond->mutex); |
78 | } |
79 | |
80 | The actual glibc implementation will likely test for PI and make the |
81 | necessary changes inside the existing calls rather than creating new |
82 | calls for the PI cases. Similar changes are needed for |
83 | pthread_cond_timedwait() and pthread_cond_signal(). |
84 | |
85 | Implementation |
86 | -------------- |
87 | |
88 | In order to ensure the rt_mutex has an owner if it has waiters, it |
89 | is necessary for both the requeue code, as well as the waiting code, |
90 | to be able to acquire the rt_mutex before returning to user space. |
91 | The requeue code cannot simply wake the waiter and leave it to |
92 | acquire the rt_mutex as it would open a race window between the |
93 | requeue call returning to user space and the waiter waking and |
94 | starting to run. This is especially true in the uncontended case. |
95 | |
96 | The solution involves two new rt_mutex helper routines, |
97 | rt_mutex_start_proxy_lock() and rt_mutex_finish_proxy_lock(), which |
98 | allow the requeue code to acquire an uncontended rt_mutex on behalf |
99 | of the waiter and to enqueue the waiter on a contended rt_mutex. |
100 | Two new system calls provide the kernel<->user interface to |
101 | requeue_pi: FUTEX_WAIT_REQUEUE_PI and FUTEX_REQUEUE_CMP_PI. |
102 | |
103 | FUTEX_WAIT_REQUEUE_PI is called by the waiter (pthread_cond_wait() |
104 | and pthread_cond_timedwait()) to block on the initial futex and wait |
105 | to be requeued to a PI-aware futex. The implementation is the |
106 | result of a high-speed collision between futex_wait() and |
107 | futex_lock_pi(), with some extra logic to check for the additional |
108 | wake-up scenarios. |
109 | |
110 | FUTEX_REQUEUE_CMP_PI is called by the waker |
111 | (pthread_cond_broadcast() and pthread_cond_signal()) to requeue and |
112 | possibly wake the waiting tasks. Internally, this system call is |
113 | still handled by futex_requeue (by passing requeue_pi=1). Before |
114 | requeueing, futex_requeue() attempts to acquire the requeue target |
115 | PI futex on behalf of the top waiter. If it can, this waiter is |
116 | woken. futex_requeue() then proceeds to requeue the remaining |
117 | nr_wake+nr_requeue tasks to the PI futex, calling |
118 | rt_mutex_start_proxy_lock() prior to each requeue to prepare the |
119 | task as a waiter on the underlying rt_mutex. It is possible that |
120 | the lock can be acquired at this stage as well, if so, the next |
121 | waiter is woken to finish the acquisition of the lock. |
122 | |
123 | FUTEX_REQUEUE_PI accepts nr_wake and nr_requeue as arguments, but |
124 | their sum is all that really matters. futex_requeue() will wake or |
125 | requeue up to nr_wake + nr_requeue tasks. It will wake only as many |
126 | tasks as it can acquire the lock for, which in the majority of cases |
127 | should be 0 as good programming practice dictates that the caller of |
128 | either pthread_cond_broadcast() or pthread_cond_signal() acquire the |
129 | mutex prior to making the call. FUTEX_REQUEUE_PI requires that |
130 | nr_wake=1. nr_requeue should be INT_MAX for broadcast and 0 for |
131 | signal. |
132 |
Branches:
ben-wpan
ben-wpan-stefan
javiroman/ks7010
jz-2.6.34
jz-2.6.34-rc5
jz-2.6.34-rc6
jz-2.6.34-rc7
jz-2.6.35
jz-2.6.36
jz-2.6.37
jz-2.6.38
jz-2.6.39
jz-3.0
jz-3.1
jz-3.11
jz-3.12
jz-3.13
jz-3.15
jz-3.16
jz-3.18-dt
jz-3.2
jz-3.3
jz-3.4
jz-3.5
jz-3.6
jz-3.6-rc2-pwm
jz-3.9
jz-3.9-clk
jz-3.9-rc8
jz47xx
jz47xx-2.6.38
master
Tags:
od-2011-09-04
od-2011-09-18
v2.6.34-rc5
v2.6.34-rc6
v2.6.34-rc7
v3.9