I/O statistics fields
---------------------

Since 2.4.20 (and some versions before, with patches), and 2.5.45,
more extensive disk statistics have been introduced to help measure disk
activity. Tools such as sar and iostat typically interpret these and do
the work for you, but in case you are interested in creating your own
tools, the fields are explained here.

In 2.4, the information is found as additional fields in
/proc/partitions. In 2.6, the same information is found in two
places: one is in the file /proc/diskstats, and the other is within
the sysfs file system, which must be mounted in order to obtain
the information. Throughout this document we'll assume that sysfs
is mounted on /sys, although of course it may be mounted anywhere.
Both /proc/diskstats and sysfs use the same source for the information
and so should not differ.

Here are examples of these different formats:

2.4:
   3     0   39082680 hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
   3     1    9221278 hda1 35486 0 35496 38030 0 0 0 0 0 38030 38030

2.6 sysfs:
   446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
   35486 38030 38030 38030

2.6 diskstats:
   3    0   hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
   3    1   hda1 35486 38030 38030 38030

On 2.4 you might execute "grep 'hda ' /proc/partitions". On 2.6, you have
a choice of "cat /sys/block/hda/stat" or "grep 'hda ' /proc/diskstats".
The advantage of one over the other is that the sysfs choice works well
if you are watching a known, small set of disks. /proc/diskstats may
be a better choice if you are watching a large number of disks because
you'll avoid the overhead of 50, 100, or 500 or more opens/closes with
each snapshot of your disk statistics.

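For instance, a minimal Python sketch of both approaches (the device
name "hda" and the /sys mount point are assumptions; adjust them for
your system):

    def read_sysfs_stat(disk):
        # /sys/block/<disk>/stat holds just the eleven statistics fields.
        with open("/sys/block/%s/stat" % disk) as f:
            return [int(v) for v in f.read().split()]

    def read_diskstats(disk):
        # /proc/diskstats prefixes the fields with major, minor, and name.
        with open("/proc/diskstats") as f:
            for line in f:
                fields = line.split()
                if len(fields) > 2 and fields[2] == disk:
                    return [int(v) for v in fields[3:]]
        return None

    print(read_sysfs_stat("hda"))
    print(read_diskstats("hda"))
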
In 2.4, the statistics fields are those after the device name. In
the above example, the first field of statistics would be 446216.
By contrast, in 2.6 if you look at /sys/block/hda/stat, you'll
find just the eleven fields, beginning with 446216. If you look at
/proc/diskstats, the eleven fields will be preceded by the major and
minor device numbers, and device name. Each of these formats provides
eleven fields of statistics, each meaning exactly the same things.
All fields except field 9 are cumulative since boot. Field 9 should
go to zero as I/Os complete; all others only increase (unless they
overflow and wrap). Yes, these are (32-bit or 64-bit) unsigned long
(native word size) numbers, and on a very busy or long-lived system they
may wrap. Applications should be prepared to deal with that; unless
your observations are measured in large numbers of minutes or hours,
they should not wrap twice before you notice them.

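When computing rates from two snapshots, subtraction can be made
wrap-safe. A sketch, assuming 32-bit counters (substitute 2**64 where
the native unsigned long is 64 bits wide):

    def counter_delta(old, new, width=2**32):
        # Difference between two snapshots of a cumulative counter,
        # tolerating a single wrap between the snapshots.
        if new >= old:
            return new - old
        return new + width - old

    # A counter that wrapped from near the top of its range:
    print(counter_delta(2**32 - 10, 5))   # prints 15
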
Each set of stats only applies to the indicated device; if you want
system-wide stats you'll have to find all the devices and sum them all up.

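A sketch of such a sum over /proc/diskstats, using the column count to
keep whole disks and skip partitions (a heuristic based on the formats
shown above, where pre-2.6.25 partition lines carry only four
statistics fields):

    def total_reads_completed():
        # Sum field 1 (reads completed) across all whole-disk entries.
        total = 0
        with open("/proc/diskstats") as f:
            for line in f:
                fields = line.split()
                if len(fields) == 14:        # major, minor, name + 11 stats
                    total += int(fields[3])  # field 1: reads completed
        return total

    print(total_reads_completed())
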
Field  1 -- # of reads completed
    This is the total number of reads completed successfully.
Field  2 -- # of reads merged, field 6 -- # of writes merged
    Reads and writes which are adjacent to each other may be merged for
    efficiency. Thus two 4K reads may become one 8K read before it is
    ultimately handed to the disk, and so it will be counted (and queued)
    as only one I/O. This field lets you know how often this was done.
Field  3 -- # of sectors read
    This is the total number of sectors read successfully.
Field  4 -- # of milliseconds spent reading
    This is the total number of milliseconds spent by all reads (as
    measured from __make_request() to end_that_request_last()).
Field  5 -- # of writes completed
    This is the total number of writes completed successfully.
Field  7 -- # of sectors written
    This is the total number of sectors written successfully.
Field  8 -- # of milliseconds spent writing
    This is the total number of milliseconds spent by all writes (as
    measured from __make_request() to end_that_request_last()).
Field  9 -- # of I/Os currently in progress
    The only field that should go to zero. Incremented as requests are
    given to the appropriate struct request_queue and decremented as
    they finish.
Field 10 -- # of milliseconds spent doing I/Os
    This field increases so long as field 9 is nonzero.
Field 11 -- weighted # of milliseconds spent doing I/Os
    This field is incremented at each I/O start, I/O completion, I/O
    merge, or read of these stats by the number of I/Os in progress
    (field 9) times the number of milliseconds spent doing I/O since the
    last update of this field. This can provide an easy measure of both
    I/O completion time and the backlog that may be accumulating.

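As an illustration, here is a sketch of how an iostat-like tool might
derive utilization and average queue depth from fields 10 and 11; the
function names and the one-second interval are hypothetical:

    import time

    def snapshot(disk):
        with open("/sys/block/%s/stat" % disk) as f:
            return [int(v) for v in f.read().split()]

    def util_and_queue(disk, interval=1.0):
        s1 = snapshot(disk)
        time.sleep(interval)
        s2 = snapshot(disk)
        elapsed_ms = interval * 1000.0
        # Field 10 (index 9): ms doing I/O -> fraction of time busy.
        util = (s2[9] - s1[9]) / elapsed_ms
        # Field 11 (index 10): weighted ms -> time-averaged queue depth.
        avg_queue = (s2[10] - s1[10]) / elapsed_ms
        return util, avg_queue

    print(util_and_queue("hda"))
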
To avoid introducing performance bottlenecks, no locks are held while
modifying these counters. This implies that minor inaccuracies may be
introduced when changes collide, so (for instance) adding up all the
read I/Os issued per partition should equal those made to the disks ...
but due to the lack of locking it may only be very close.

In 2.6, there are counters for each CPU, which make the lack of locking
almost a non-issue. When the statistics are read, the per-CPU counters
are summed (possibly overflowing the unsigned long variable they are
summed to) and the result given to the user. There is no convenient
user interface for accessing the per-CPU counters themselves.

Disks vs Partitions
-------------------

There were significant changes between 2.4 and 2.6 in the I/O subsystem.
As a result, some statistics disappeared. The translation from
a disk address relative to a partition to the disk address relative to
the host disk happens much earlier. All merges and timings now happen
at the disk level rather than at both the disk and partition level as
in 2.4. Consequently, you'll see a different statistics output on 2.6 for
partitions from that for disks. There are only *four* fields available
for partitions on 2.6 machines. This is reflected in the examples above.

Field 1 -- # of reads issued
    This is the total number of reads issued to this partition.
Field 2 -- # of sectors read
    This is the total number of sectors requested to be read from this
    partition.
Field 3 -- # of writes issued
    This is the total number of writes issued to this partition.
Field 4 -- # of sectors written
    This is the total number of sectors requested to be written to
    this partition.

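A short sketch that fetches these four fields from /proc/diskstats (the
partition name "hda1" is an assumption, and the seven-column test
assumes the four-field partition format shown above):

    def partition_stats(part):
        with open("/proc/diskstats") as f:
            for line in f:
                fields = line.split()
                # major, minor, name + the four partition fields
                if len(fields) == 7 and fields[2] == part:
                    return dict(zip(("reads issued", "sectors read",
                                     "writes issued", "sectors written"),
                                    map(int, fields[3:7])))
        return None

    print(partition_stats("hda1"))
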
Note that since the address is translated to a disk-relative one, and no
record of the partition-relative address is kept, the subsequent success
or failure of the read cannot be attributed to the partition. In other
words, the number of reads for partitions is counted slightly before the
time of queuing for partitions, and at completion for whole disks. This is
a subtle distinction that is probably uninteresting for most cases.

More significant is the error induced by counting the numbers of
reads/writes before merges for partitions and after merges for disks.
Since a typical workload usually contains a lot of successive and
adjacent requests, the number of reads/writes issued can be several
times higher than the number of reads/writes completed.

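For example, if eight adjacent 4K reads issued to a partition are
merged into a single 32K request, the partition records eight reads
issued while the disk records only one read completed.
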
In 2.6.25, the full statistic set is again available for partitions,
and disk and partition statistics are once more consistent. Since we
still don't keep a record of the partition-relative address, an operation
is attributed to the partition which contains the first sector of the
request after the eventual merges. As requests can be merged across
partitions, this could lead to some (probably insignificant) inaccuracy.

Additional notes
----------------

In 2.6, sysfs is not mounted by default. If your distribution of
Linux hasn't added it already, here's the line you'll want to add to
your /etc/fstab:

    none /sys sysfs defaults 0 0

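If you prefer not to edit /etc/fstab, the same file system can be
mounted by hand with "mount -t sysfs none /sys".
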
In 2.6, all disk statistics were removed from /proc/stat. In 2.4, they
appear in both /proc/partitions and /proc/stat, although the ones in
/proc/stat take a very different format from those in /proc/partitions
(see proc(5), if your system has it.)

-- ricklind@us.ibm.com