----------------------------------------------------------------------
1. INTRODUCTION

Modern filesystems feature checksumming of data and metadata to
protect against data corruption.  However, the detection of the
corruption is done at read time, which could potentially be months
after the data was written.  At that point the original data that the
application tried to write is most likely lost.

The solution is to ensure that the disk is actually storing what the
application meant it to.  Recent additions to both the SCSI family of
protocols (SBC Data Integrity Field, SCC protection proposal) as well
as SATA/T13 (External Path Protection) try to remedy this by adding
support for appending integrity metadata to an I/O.  The integrity
metadata (or protection information in SCSI terminology) includes a
checksum for each sector as well as an incrementing counter that
ensures the individual sectors are written in the right order.  For
some protection schemes it also verifies that the I/O is written to
the right place on disk.

Current storage controllers and devices implement various protective
measures such as checksumming and scrubbing.  But these technologies
work in their own isolated domains or at best between adjacent nodes
in the I/O path.  The interesting thing about DIF and the other
integrity extensions is that the protection format is well defined,
so every node in the I/O path can verify the integrity of the I/O
and reject it if corruption is detected.  This allows not only
corruption prevention but also isolation of the point of failure.
----------------------------------------------------------------------
2. THE DATA INTEGRITY EXTENSIONS

As written, the protocol extensions only protect the path between
controller and storage device.  However, many controllers actually
allow the operating system to interact with the integrity metadata
(IMD).  We have been working with several FC/SAS HBA vendors to
enable the protection information to be transferred to and from
their controllers.

The SCSI Data Integrity Field works by appending 8 bytes of
protection information to each sector.  The data + integrity
metadata is stored in 520 byte sectors on disk.  Data + IMD are
interleaved when transferred between the controller and target.  The
T13 proposal is similar.

Because it is highly inconvenient for operating systems to deal with
520 (and 4104) byte sectors, we approached several HBA vendors and
encouraged them to allow separation of the data and integrity
metadata scatter-gather lists.

The controller will interleave the buffers on write and split them
on read.  This means that Linux can DMA the data buffers to and from
host memory without changes to the page cache.

Also, the 16-bit CRC checksum mandated by both the SCSI and SATA
specs is somewhat heavy to compute in software.  Benchmarks found
that calculating this checksum had a significant impact on system
performance for a number of workloads.  Some controllers allow a
lighter-weight checksum to be used when interfacing with the
operating system.  Emulex, for instance, supports the TCP/IP
checksum instead.  The IP checksum received from the OS is converted
to the 16-bit CRC when writing and vice versa.  This allows the
integrity metadata to be generated by Linux or the application at
very low cost (comparable to software RAID5).

The IP checksum is weaker than the CRC in terms of detecting bit
errors.  However, the strength is really in the separation of the
data buffers and the integrity metadata.  These two distinct buffers
must match up for an I/O to complete.

The separation of the data and integrity metadata buffers as well as
the choice in checksums is referred to as the Data Integrity
Extensions.  As these extensions are outside the scope of the
protocol bodies (T10, T13), Oracle and its partners are trying to
standardize them within the Storage Networking Industry Association.

----------------------------------------------------------------------
3. KERNEL CHANGES

The data integrity framework in Linux enables protection information
to be pinned to I/Os and sent to/received from controllers that
support it.

The advantage of the integrity extensions in SCSI and SATA is that
they enable us to protect the entire path from application to
storage device.  However, at the same time this is also the biggest
disadvantage.  It means that the protection information must be in a
format that can be understood by the disk.

Generally Linux/POSIX applications are agnostic to the intricacies
of the storage devices they are accessing.  The virtual filesystem
switch and the block layer make things like hardware sector size and
transport protocols completely transparent to the application.

However, this level of detail is required when preparing the
protection information to send to a disk.  Consequently, the very
concept of an end-to-end protection scheme is a layering violation.
It is completely unreasonable for an application to be aware of
whether it is accessing a SCSI or SATA disk.

The data integrity support implemented in Linux attempts to hide
this from the application.  As far as the application (and to some
extent the kernel) is concerned, the integrity metadata is opaque
information that's attached to the I/O.

The current implementation allows the block layer to automatically
generate the protection information for any I/O.  Eventually the
intent is to move the integrity metadata calculation to userspace
for user data.  Metadata and other I/O that originates within the
kernel will still use the automatic generation interface.

Some storage devices allow each hardware sector to be tagged with a
16-bit value.  The owner of this tag space is the owner of the block
device, i.e. the filesystem in most cases.  The filesystem can use
this extra space to tag sectors as it sees fit.  Because the tag
space is limited, the block interface allows tagging bigger chunks
by way of interleaving.  This way, 8*16 bits of information can be
attached to a typical 4KB filesystem block.

This also means that applications such as fsck and mkfs will need
access to manipulate the tags from user space.  A passthrough
interface for this is being worked on.


----------------------------------------------------------------------
4. BLOCK LAYER IMPLEMENTATION DETAILS

4.1 BIO

The data integrity patches add a new field to struct bio when
CONFIG_BLK_DEV_INTEGRITY is enabled.  bio->bi_integrity is a pointer
to a struct bip which contains the bio integrity payload.
Essentially a bip is a trimmed down struct bio which holds a bio_vec
containing the integrity metadata and the required housekeeping
information (bvec pool, vector count, etc.)

A kernel subsystem can enable data integrity protection on a bio by
calling bio_integrity_alloc(bio, gfp_mask, nr_pages).  This will
allocate and attach the bip to the bio.

Individual pages containing integrity metadata can subsequently be
attached using bio_integrity_add_page().

bio_free() will automatically free the bip.
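
Putting the calls above together, a WRITE with attached integrity
metadata might be set up roughly as follows.  This is a kernel-side
pseudocode sketch, not runnable standalone; bdev, sector, data_page,
imd_page and imd_len are assumed to exist and error handling is
elided:

```c
/* Sketch only: allocate a bio, attach one page of data and one
 * page of integrity metadata, and submit it. */
struct bio *bio = bio_alloc(GFP_NOIO, 1);

bio->bi_bdev = bdev;
bio->bi_sector = sector;
bio_add_page(bio, data_page, PAGE_SIZE, 0);

/* Allocate the integrity payload (one IMD page in this sketch)
 * and attach the page holding the protection information. */
bio_integrity_alloc(bio, GFP_NOIO, 1);
bio_integrity_add_page(bio, imd_page, imd_len, 0);

submit_bio(WRITE, bio);
/* The bip is freed automatically when the bio is freed. */
```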


4.2 BLOCK DEVICE

Because the format of the protection data is tied to the physical
disk, each block device has been extended with a block integrity
profile (struct blk_integrity).  This optional profile is registered
with the block layer using blk_integrity_register().

The profile contains callback functions for generating and verifying
the protection data, as well as getting and setting application
tags.  The profile also contains a few constants to aid in
completing, merging and splitting the integrity metadata.

Layered block devices will need to pick a profile that's appropriate
for all subdevices.  blk_integrity_compare() can help with that.  DM
and MD linear, RAID0 and RAID1 are currently supported.  RAID4/5/6
will require extra work due to the application tag.


165 | |
166 | ---------------------------------------------------------------------- |
167 | 5.0 BLOCK LAYER INTEGRITY API |
168 | |
169 | 5.1 NORMAL FILESYSTEM |
170 | |
171 | The normal filesystem is unaware that the underlying block device |
172 | is capable of sending/receiving integrity metadata. The IMD will |
173 | be automatically generated by the block layer at submit_bio() time |
174 | in case of a WRITE. A READ request will cause the I/O integrity |
175 | to be verified upon completion. |
176 | |
177 | IMD generation and verification can be toggled using the |
178 | |
179 | /sys/block/<bdev>/integrity/write_generate |
180 | |
181 | and |
182 | |
183 | /sys/block/<bdev>/integrity/read_verify |
184 | |
185 | flags. |


5.2 INTEGRITY-AWARE FILESYSTEM

A filesystem that is integrity-aware can prepare I/Os with IMD
attached.  It can also use the application tag space if this is
supported by the block device.


    int bdev_integrity_enabled(block_device, int rw);

      bdev_integrity_enabled() will return 1 if the block device
      supports integrity metadata transfer for the data direction
      specified in 'rw'.

      bdev_integrity_enabled() honors the write_generate and
      read_verify flags in sysfs and will respond accordingly.


    int bio_integrity_prep(bio);

      To generate IMD for WRITE and to set up buffers for READ, the
      filesystem must call bio_integrity_prep(bio).

      Prior to calling this function, the bio data direction and
      start sector must be set, and the bio should have all data
      pages added.  It is up to the caller to ensure that the bio
      does not change while I/O is in progress.

      bio_integrity_prep() should only be called if
      bdev_integrity_enabled() returned 1.


    int bio_integrity_tag_size(bio);

      If the filesystem wants to use the application tag space it
      will first have to find out how much storage space is
      available.  Because tag space is generally limited (usually
      2 bytes per sector regardless of sector size), the integrity
      framework supports interleaving the information between the
      sectors in an I/O.

      Filesystems can call bio_integrity_tag_size(bio) to find out
      how many bytes of storage are available for that particular
      bio.

      Another option is bdev_get_tag_size(block_device) which will
      return the number of available bytes per hardware sector.


    int bio_integrity_set_tag(bio, void *tag_buf, len);

      After a successful return from bio_integrity_prep(),
      bio_integrity_set_tag() can be used to attach an opaque tag
      buffer to a bio.  Obviously this only makes sense if the I/O
      is a WRITE.


    int bio_integrity_get_tag(bio, void *tag_buf, len);

      Similarly, at READ I/O completion time the filesystem can
      retrieve the tag buffer using bio_integrity_get_tag().


5.3 PASSING EXISTING INTEGRITY METADATA

Filesystems that either generate their own integrity metadata or
are capable of transferring IMD from user space can use the
following calls:


    struct bip * bio_integrity_alloc(bio, gfp_mask, nr_pages);

      Allocates the bio integrity payload and hangs it off of the
      bio.  nr_pages indicates how many pages of protection data
      need to be stored in the integrity bio_vec list (similar to
      bio_alloc()).

      The integrity payload will be freed at bio_free() time.


    int bio_integrity_add_page(bio, page, len, offset);

      Attaches a page containing integrity metadata to an existing
      bio.  The bio must have an existing bip,
      i.e. bio_integrity_alloc() must have been called.  For a
      WRITE, the integrity metadata in the pages must be in a format
      understood by the target device with the notable exception
      that the sector numbers will be remapped as the request
      traverses the I/O stack.  This implies that the pages added
      using this call will be modified during I/O!  The first
      reference tag in the integrity metadata must have a value of
      bip->bip_sector.

      Pages can be added using bio_integrity_add_page() as long as
      there is room in the bip bio_vec array (nr_pages).

      Upon completion of a READ operation, the attached pages will
      contain the integrity metadata received from the storage
      device.  It is up to the receiver to process them and verify
      data integrity upon completion.


5.4 REGISTERING A BLOCK DEVICE AS CAPABLE OF EXCHANGING INTEGRITY
    METADATA

To enable integrity exchange on a block device the gendisk must be
registered as capable:

    int blk_integrity_register(gendisk, blk_integrity);

      The blk_integrity struct is a template and should contain the
      following:

        static struct blk_integrity my_profile = {
            .name               = "STANDARDSBODY-TYPE-VARIANT-CSUM",
            .generate_fn        = my_generate_fn,
            .verify_fn          = my_verify_fn,
            .get_tag_fn         = my_get_tag_fn,
            .set_tag_fn         = my_set_tag_fn,
            .tuple_size         = sizeof(struct my_tuple_size),
            .tag_size           = <tag bytes per hw sector>,
        };

      'name' is a text string which will be visible in sysfs.  This
      is part of the userland API so choose it carefully and never
      change it.  The format is standards body-type-variant,
      e.g. T10-DIF-TYPE1-IP or T13-EPP-0-CRC.

      'generate_fn' generates appropriate integrity metadata (for
      WRITE).

      'verify_fn' verifies that the data buffer matches the
      integrity metadata.

      'tuple_size' must be set to match the size of the integrity
      metadata per sector, i.e. 8 for DIF and EPP.

      'tag_size' must be set to identify how many bytes of tag space
      are available per hardware sector.  For DIF this is either 2
      or 0 depending on the value of the Control Mode Page ATO bit.

      get_tag_fn and set_tag_fn are the callbacks behind the tag
      interface described in section 5.2.

----------------------------------------------------------------------
2007-12-24 Martin K. Petersen <martin.petersen@oracle.com>