Root/
1 | Real-Time group scheduling |
2 | -------------------------- |
3 | |
4 | CONTENTS |
5 | ======== |
6 | |
7 | 0. WARNING |
8 | 1. Overview |
9 | 1.1 The problem |
10 | 1.2 The solution |
11 | 2. The interface |
12 | 2.1 System-wide settings |
13 | 2.2 Default behaviour |
14 | 2.3 Basis for grouping tasks |
15 | 3. Future plans |
16 | |
17 | |
18 | 0. WARNING |
19 | ========== |
20 | |
21 | Fiddling with these settings can result in an unstable system, the knobs are |
22 | root only and assumes root knows what he is doing. |
23 | |
24 | Most notable: |
25 | |
26 | * very small values in sched_rt_period_us can result in an unstable |
27 | system when the period is smaller than either the available hrtimer |
28 | resolution, or the time it takes to handle the budget refresh itself. |
29 | |
30 | * very small values in sched_rt_runtime_us can result in an unstable |
31 | system when the runtime is so small the system has difficulty making |
32 | forward progress (NOTE: the migration thread and kstopmachine both |
33 | are real-time processes). |
34 | |
35 | 1. Overview |
36 | =========== |
37 | |
38 | |
39 | 1.1 The problem |
40 | --------------- |
41 | |
42 | Realtime scheduling is all about determinism, a group has to be able to rely on |
43 | the amount of bandwidth (eg. CPU time) being constant. In order to schedule |
44 | multiple groups of realtime tasks, each group must be assigned a fixed portion |
45 | of the CPU time available. Without a minimum guarantee a realtime group can |
46 | obviously fall short. A fuzzy upper limit is of no use since it cannot be |
47 | relied upon. Which leaves us with just the single fixed portion. |
48 | |
49 | 1.2 The solution |
50 | ---------------- |
51 | |
52 | CPU time is divided by means of specifying how much time can be spent running |
53 | in a given period. We allocate this "run time" for each realtime group which |
54 | the other realtime groups will not be permitted to use. |
55 | |
56 | Any time not allocated to a realtime group will be used to run normal priority |
57 | tasks (SCHED_OTHER). Any allocated run time not used will also be picked up by |
58 | SCHED_OTHER. |
59 | |
60 | Let's consider an example: a frame fixed realtime renderer must deliver 25 |
61 | frames a second, which yields a period of 0.04s per frame. Now say it will also |
62 | have to play some music and respond to input, leaving it with around 80% CPU |
63 | time dedicated for the graphics. We can then give this group a run time of 0.8 |
64 | * 0.04s = 0.032s. |
65 | |
66 | This way the graphics group will have a 0.04s period with a 0.032s run time |
67 | limit. Now if the audio thread needs to refill the DMA buffer every 0.005s, but |
68 | needs only about 3% CPU time to do so, it can do with a 0.03 * 0.005s = |
69 | 0.00015s. So this group can be scheduled with a period of 0.005s and a run time |
70 | of 0.00015s. |
71 | |
72 | The remaining CPU time will be used for user input and other tasks. Because |
73 | realtime tasks have explicitly allocated the CPU time they need to perform |
74 | their tasks, buffer underruns in the graphics or audio can be eliminated. |
75 | |
76 | NOTE: the above example is not fully implemented yet. We still |
77 | lack an EDF scheduler to make non-uniform periods usable. |
78 | |
79 | |
80 | 2. The Interface |
81 | ================ |
82 | |
83 | |
84 | 2.1 System wide settings |
85 | ------------------------ |
86 | |
87 | The system wide settings are configured under the /proc virtual file system: |
88 | |
89 | /proc/sys/kernel/sched_rt_period_us: |
90 | The scheduling period that is equivalent to 100% CPU bandwidth |
91 | |
92 | /proc/sys/kernel/sched_rt_runtime_us: |
93 | A global limit on how much time realtime scheduling may use. Even without |
94 | CONFIG_RT_GROUP_SCHED enabled, this will limit time reserved to realtime |
95 | processes. With CONFIG_RT_GROUP_SCHED it signifies the total bandwidth |
96 | available to all realtime groups. |
97 | |
98 | * Time is specified in us because the interface is s32. This gives an |
99 | operating range from 1us to about 35 minutes. |
100 | * sched_rt_period_us takes values from 1 to INT_MAX. |
101 | * sched_rt_runtime_us takes values from -1 to (INT_MAX - 1). |
102 | * A run time of -1 specifies runtime == period, ie. no limit. |
103 | |
104 | |
105 | 2.2 Default behaviour |
106 | --------------------- |
107 | |
108 | The default values for sched_rt_period_us (1000000 or 1s) and |
109 | sched_rt_runtime_us (950000 or 0.95s). This gives 0.05s to be used by |
110 | SCHED_OTHER (non-RT tasks). These defaults were chosen so that a run-away |
111 | realtime tasks will not lock up the machine but leave a little time to recover |
112 | it. By setting runtime to -1 you'd get the old behaviour back. |
113 | |
114 | By default all bandwidth is assigned to the root group and new groups get the |
115 | period from /proc/sys/kernel/sched_rt_period_us and a run time of 0. If you |
116 | want to assign bandwidth to another group, reduce the root group's bandwidth |
117 | and assign some or all of the difference to another group. |
118 | |
119 | Realtime group scheduling means you have to assign a portion of total CPU |
120 | bandwidth to the group before it will accept realtime tasks. Therefore you will |
121 | not be able to run realtime tasks as any user other than root until you have |
122 | done that, even if the user has the rights to run processes with realtime |
123 | priority! |
124 | |
125 | |
126 | 2.3 Basis for grouping tasks |
127 | ---------------------------- |
128 | |
129 | There are two compile-time settings for allocating CPU bandwidth. These are |
130 | configured using the "Basis for grouping tasks" multiple choice menu under |
131 | General setup > Group CPU Scheduler: |
132 | |
133 | a. CONFIG_USER_SCHED (aka "Basis for grouping tasks" = "user id") |
134 | |
135 | This lets you use the virtual files under |
136 | "/sys/kernel/uids/<uid>/cpu_rt_runtime_us" to control he CPU time reserved for |
137 | each user . |
138 | |
139 | The other option is: |
140 | |
141 | .o CONFIG_CGROUP_SCHED (aka "Basis for grouping tasks" = "Control groups") |
142 | |
143 | This uses the /cgroup virtual file system and |
144 | "/cgroup/<cgroup>/cpu.rt_runtime_us" to control the CPU time reserved for each |
145 | control group instead. |
146 | |
147 | For more information on working with control groups, you should read |
148 | Documentation/cgroups/cgroups.txt as well. |
149 | |
150 | Group settings are checked against the following limits in order to keep the |
151 | configuration schedulable: |
152 | |
153 | \Sum_{i} runtime_{i} / global_period <= global_runtime / global_period |
154 | |
155 | For now, this can be simplified to just the following (but see Future plans): |
156 | |
157 | \Sum_{i} runtime_{i} <= global_runtime |
158 | |
159 | |
160 | 3. Future plans |
161 | =============== |
162 | |
163 | There is work in progress to make the scheduling period for each group |
164 | ("/sys/kernel/uids/<uid>/cpu_rt_period_us" or |
165 | "/cgroup/<cgroup>/cpu.rt_period_us" respectively) configurable as well. |
166 | |
167 | The constraint on the period is that a subgroup must have a smaller or |
168 | equal period to its parent. But realistically its not very useful _yet_ |
169 | as its prone to starvation without deadline scheduling. |
170 | |
171 | Consider two sibling groups A and B; both have 50% bandwidth, but A's |
172 | period is twice the length of B's. |
173 | |
174 | * group A: period=100000us, runtime=10000us |
175 | - this runs for 0.01s once every 0.1s |
176 | |
177 | * group B: period= 50000us, runtime=10000us |
178 | - this runs for 0.01s twice every 0.1s (or once every 0.05 sec). |
179 | |
180 | This means that currently a while (1) loop in A will run for the full period of |
181 | B and can starve B's tasks (assuming they are of lower priority) for a whole |
182 | period. |
183 | |
184 | The next project will be SCHED_EDF (Earliest Deadline First scheduling) to bring |
185 | full deadline scheduling to the linux kernel. Deadline scheduling the above |
186 | groups and treating end of the period as a deadline will ensure that they both |
187 | get their allocated time. |
188 | |
189 | Implementing SCHED_EDF might take a while to complete. Priority Inheritance is |
190 | the biggest challenge as the current linux PI infrastructure is geared towards |
191 | the limited static priority levels 0-99. With deadline scheduling you need to |
192 | do deadline inheritance (since priority is inversely proportional to the |
193 | deadline delta (deadline - now)). |
194 | |
195 | This means the whole PI machinery will have to be reworked - and that is one of |
196 | the most complex pieces of code we have. |
197 |
Branches:
ben-wpan
ben-wpan-stefan
javiroman/ks7010
jz-2.6.34
jz-2.6.34-rc5
jz-2.6.34-rc6
jz-2.6.34-rc7
jz-2.6.35
jz-2.6.36
jz-2.6.37
jz-2.6.38
jz-2.6.39
jz-3.0
jz-3.1
jz-3.11
jz-3.12
jz-3.13
jz-3.15
jz-3.16
jz-3.18-dt
jz-3.2
jz-3.3
jz-3.4
jz-3.5
jz-3.6
jz-3.6-rc2-pwm
jz-3.9
jz-3.9-clk
jz-3.9-rc8
jz47xx
jz47xx-2.6.38
master
Tags:
od-2011-09-04
od-2011-09-18
v2.6.34-rc5
v2.6.34-rc6
v2.6.34-rc7
v3.9