1 | What is hwpoison? |
2 | |
3 | Upcoming Intel CPUs have support for recovering from some memory errors |
4 | (``MCA recovery''). This requires the OS to declare a page "poisoned", |
5 | kill the processes associated with it and avoid using it in the future. |
6 | |
7 | This patchkit implements the necessary infrastructure in the VM. |
8 | |
9 | To quote the overview comment: |
10 | |
11 | * High level machine check handler. Handles pages reported by the |
12 | * hardware as being corrupted usually due to a 2bit ECC memory or cache |
13 | * failure. |
14 | * |
15 | * This focusses on pages detected as corrupted in the background. |
16 | * When the current CPU tries to consume corruption the currently |
17 | * running process can just be killed directly instead. This implies |
18 | * that if the error cannot be handled for some reason it's safe to |
19 | * just ignore it because no corruption has been consumed yet. Instead |
20 | * when that happens another machine check will happen. |
21 | * |
22 | * Handles page cache pages in various states. The tricky part |
23 | * here is that we can access any page asynchronous to other VM |
24 | * users, because memory failures could happen anytime and anywhere, |
25 | * possibly violating some of their assumptions. This is why this code |
26 | * has to be extremely careful. Generally it tries to use normal locking |
27 | * rules, as in get the standard locks, even if that means the |
28 | * error handling takes potentially a long time. |
29 | * |
30 | * Some of the operations here are somewhat inefficient and have non |
31 | * linear algorithmic complexity, because the data structures have not |
32 | * been optimized for this case. This is in particular the case |
33 | * for the mapping from a vma to a process. Since this case is expected |
34 | * to be rare we hope we can get away with this. |
35 | |
36 | The code consists of the high level handler in mm/memory-failure.c, |
37 | a new page poison bit and various checks in the VM to handle poisoned |
38 | pages. |
39 | |
40 | The main target right now is KVM guests, but it works for all kinds |
41 | of applications. KVM support requires a recent qemu-kvm release. |
42 | |
43 | For the KVM use there was need for a new signal type so that |
44 | KVM can inject the machine check into the guest with the proper |
45 | address. This in theory allows other applications to handle |
46 | memory failures too. The expectation is that nearly all applications |
47 | won't do that, but some very specialized ones might. |
48 | |
49 | --- |
50 | |
51 | There are two (actually three) modes memory failure recovery can be in: |
52 | |
53 | vm.memory_failure_recovery sysctl set to zero: |
54 | All memory failures cause a panic. Do not attempt recovery. |
55 | (on x86 this can also be affected by the tolerant level of the |
56 | MCE subsystem) |
57 | |
58 | early kill |
59 | (can be controlled globally and per process) |
60 | Send SIGBUS to the application as soon as the error is detected. |
61 | This allows applications that can process memory errors gracefully |
62 | (e.g. by dropping the affected object) to do so. |
63 | This is the mode used by KVM qemu. |
64 | |
65 | late kill |
66 | Send SIGBUS when the application runs into the corrupted page. |
67 | This is best for applications unaware of memory errors and is the default. |
68 | Note some pages are always handled as late kill. |
69 | |
70 | --- |
71 | |
72 | User control: |
73 | |
74 | vm.memory_failure_recovery |
75 | See sysctl.txt |
76 | |
77 | vm.memory_failure_early_kill |
78 | Enable early kill mode globally |
79 | |
80 | PR_MCE_KILL |
81 | Set early/late kill mode/revert to system default |
82 | arg1: PR_MCE_KILL_CLEAR: Revert to system default |
83 | arg1: PR_MCE_KILL_SET: arg2 defines thread specific mode |
84 | PR_MCE_KILL_EARLY: Early kill |
85 | PR_MCE_KILL_LATE: Late kill |
86 | PR_MCE_KILL_DEFAULT: Use system global default |
87 | PR_MCE_KILL_GET |
88 | return current mode |
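
The prctl interface above could be used roughly as follows. This is a sketch
that assumes the constants are provided by <sys/prctl.h> and that unused
prctl arguments must be zero, as stated in prctl(2); the wrapper names are
hypothetical.

```c
#include <errno.h>
#include <sys/prctl.h>

/*
 * Set the policy for the calling thread. policy is one of
 * PR_MCE_KILL_EARLY, PR_MCE_KILL_LATE or PR_MCE_KILL_DEFAULT.
 * Returns 0 on success or a negative errno value.
 */
static int mce_kill_set(unsigned long policy)
{
	if (prctl(PR_MCE_KILL, PR_MCE_KILL_SET, policy, 0, 0) == -1)
		return -errno;
	return 0;
}

/* Revert to the system default. Returns 0 or a negative errno value. */
static int mce_kill_clear(void)
{
	if (prctl(PR_MCE_KILL, PR_MCE_KILL_CLEAR, 0, 0, 0) == -1)
		return -errno;
	return 0;
}

/* Return the current PR_MCE_KILL_* mode or a negative errno value. */
static int mce_kill_get(void)
{
	int mode = prctl(PR_MCE_KILL_GET, 0, 0, 0, 0);

	return mode == -1 ? -errno : mode;
}
```

A program that wants to be told about errors as early as possible would call
mce_kill_set(PR_MCE_KILL_EARLY) once at startup, before installing its SIGBUS
handler.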
89 | |
90 | |
91 | --- |
92 | |
93 | Testing: |
94 | |
95 | madvise(MADV_HWPOISON, ....) |
96 | (as root) |
97 | Poison a page in the process for testing |
98 | |
99 | |
100 | hwpoison-inject module through debugfs |
101 | |
102 | /sys/kernel/debug/hwpoison/ |
103 | |
104 | corrupt-pfn |
105 | |
106 | Inject hwpoison fault at PFN echoed into this file. This does |
107 | some early filtering to avoid corrupting unintended pages in test suites. |
108 | |
109 | unpoison-pfn |
110 | |
111 | Software-unpoison page at PFN echoed into this file. This |
112 | way a page can be reused again. |
113 | This only works for Linux injected failures, not for real |
114 | memory failures. |
115 | |
116 | Note these injection interfaces are not stable and might change between |
117 | kernel versions. |
118 | |
119 | corrupt-filter-dev-major |
120 | corrupt-filter-dev-minor |
121 | |
122 | Only handle memory failures to pages associated with the file system defined |
123 | by block device major/minor. -1U is the wildcard value. |
124 | This should be only used for testing with artificial injection. |
125 | |
126 | corrupt-filter-memcg |
127 | |
128 | Limit injection to pages owned by a memory cgroup (memcg), specified by |
129 | the inode number of the memcg. |
130 | |
131 | Example: |
132 | mkdir /sys/fs/cgroup/mem/hwpoison |
133 | |
134 | usemem -m 100 -s 1000 & |
135 | echo `jobs -p` > /sys/fs/cgroup/mem/hwpoison/tasks |
136 | |
137 | memcg_ino=$(ls -id /sys/fs/cgroup/mem/hwpoison | cut -f1 -d' ') |
138 | echo $memcg_ino > /sys/kernel/debug/hwpoison/corrupt-filter-memcg |
139 | |
140 | page-types -p `pidof init` --hwpoison # shall do nothing |
141 | page-types -p `pidof usemem` --hwpoison # poison its pages |
142 | |
143 | corrupt-filter-flags-mask |
144 | corrupt-filter-flags-value |
145 | |
146 | When specified, only poison pages if ((page_flags & mask) == value). |
147 | This allows stress testing of many kinds of pages. The page_flags |
148 | are the same as in /proc/kpageflags. The flag bits are defined in |
149 | include/linux/kernel-page-flags.h and documented in |
150 | Documentation/vm/pagemap.txt |
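
The mask/value test can be illustrated with a few lines of C. This is only a
sketch of the predicate quoted above, not the kernel's filter code; the
function name is hypothetical, and the KPF_* bit numbers are copied here on
the assumption that they match include/linux/kernel-page-flags.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit numbers as used in /proc/kpageflags (assumed, see lead-in). */
#define KPF_DIRTY	4
#define KPF_LRU		5
#define KPF_ANON	12

#define KPF_BIT(nr)	(UINT64_C(1) << (nr))

/* A page is eligible only if the masked flags equal the value exactly. */
static bool hwpoison_flags_match(uint64_t page_flags, uint64_t mask,
				 uint64_t value)
{
	return (page_flags & mask) == value;
}
```

For example, a mask of KPF_BIT(KPF_LRU) | KPF_BIT(KPF_DIRTY) |
KPF_BIT(KPF_ANON) with a value of KPF_BIT(KPF_LRU) | KPF_BIT(KPF_DIRTY)
selects dirty LRU pages that are not anonymous. Bits outside the mask are
ignored.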
151 | |
152 | Architecture specific MCE injector |
153 | |
154 | x86 has mce-inject, mce-test |
155 | |
156 | Some portable hwpoison test programs in mce-test, see below. |
157 | |
158 | --- |
159 | |
160 | References: |
161 | |
162 | http://halobates.de/mce-lc09-2.pdf |
163 | Overview presentation from LinuxCon 09 |
164 | |
165 | git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git |
166 | Test suite (hwpoison specific portable tests in tsrc) |
167 | |
168 | git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git |
169 | x86 specific injector |
170 | |
171 | |
172 | --- |
173 | |
174 | Limitations: |
175 | |
176 | - Not all page types are supported and never will be. Most kernel internal |
177 | objects cannot be recovered, only LRU pages for now. |
178 | - Right now hugepage support is missing. |
179 | |
180 | --- |
181 | Andi Kleen, Oct 2009 |
182 | |
183 |