Root/m1rc3/norruption/LOG

Source at commit 88fa76bc88246f1188daaaf2d77045e895311205 created 8 years 1 month ago.
By Werner Almesberger, m1rc3/norruption/LOG: finally got one more
1--- Tue 2011-09-06 ------------------------------------------------------------
2
3Running "loop": power-cycle, sleep 2 s, jtag-boot, sleep 70 seconds,
4which is enough to boot into FN and render "The Tunnel" for a moment,
5then power-cycle again (off-time is 5 s).
6
7Note that the test loop is "open-loop" and will cycle also past any
8problems. The first time a corrupt standby (or any other issue) is
9observed may therefore be well after the actual event.
10
111: started around 11:53 (M1 configuration is original, without locking)
12(around 500) visually checked boot process; standby was reached normally
13
14--- Wed 2011-09-07 ------------------------------------------------------------
15
16645: neocon stopped working (around 01:58)
17666: detected neocon failure at run 666: restarted neocon; urjtag failed
18     this cycle; back to normal at 667
19684: checked LEDs again (first time since ~500) and found that standby
20     may be failing. stopping test at 685 (around 02:50) for
21     investigation.
22
23Downloaded the standby bitstream:
24
25  wget https://raw.github.com/milkymist/scripts/master/scripts/reflash_m1.sh
26  chmod 755 reflash_m1.sh
27
28  ./reflash_m1.sh --read-flash
29
30Found two corruptions in the standby bitstream:
31
32  diff -u <(hexdump -C standby.fpg) <(hexdump -C /home/root/.qi/milkymist/read-flash/2011...)
33
34-00000080 00 00 4c 83 00 00 4c 87 00 00 cc 85 d8 47 cc 43 |..L...L......G.C|
35+00000080 00 00 4c 83 00 00 4c 87 00 00 c4 80 d8 47 cc 43 |..L...L......G.C|
36
37-00002840 00 08 cc 26 00 00 00 00 00 00 00 00 0c 44 00 98 |...&.........D..|
38+00002840 00 00 cc 26 00 00 00 00 00 00 00 00 0c 44 00 98 |...&.........D..|
39
40CRC-checked the partitions:
41
42  git clone git://github.com/milkymist/milkymist
43  cd milkymist/tools/
44  gcc -Wall -I. -o flterm flterm.c
45  wget http://milkymist.org/updates/current/for-rc3/boot.4e53273.bin
46  ./flterm --port /dev/ttyUSB0 --kernel boot.4e53273.bin
47
48  serialboot
49  a
50
51  only standby.fpg failed the CRC check
52
53Reflashed the standby bitstream:
54
55  wget http://milkymist.org/updates/2011-07-13/for-rc3/fjmem.bit
56  (or http://milkymist.org/updates/fjmem.bit.bz2)
57  wget http://milkymist.org/updates/current/standby.fpg
58
59  jtag
60
61  cable milkymist
62  detect
63  instruction CFG_OUT 000100 BYPASS
64  instruction CFG_IN 000101 BYPASS
65  pld load fjmem.bit
66  initbus fjmem opcode=000010
67  frequency 6000000
68  detectflash 0
69  endian big
70  flashmem 0 standby.fpg noverify
71
72M1 enters standby normally again.
73
74Running "loop2": power-cycle, sleep 2 s, jtag-boot, sleep 10 seconds,
75which is enough to begin (but not finish) booting RTEMS, then
76power-cycle again (off-time is 5 s).
77
781: started around 05:01. Observed until about 200-300 (06:00-06:30)
79that standby was okay.
80~730 (08:48): observed that standby didn't load anymore (note: due to
81a bug in labsw, power is not turned on in about 5-10% of the cycles,
82so the real cycle count should be around 650-700.)
83
84Standby bitstream difference:
85
86-00000080 00 00 4c 83 00 00 4c 87 00 00 cc 85 d8 47 cc 43 |..L...L......G.C|
87+00000080 00 00 00 00 00 00 4c 87 00 00 cc 85 d8 47 cc 43 |......L......G.C|
88
89Reflashed standby and locked the NOR. Testing with loop2 again.
90
911 (09:18): started
92... continuing through the night ...
93
94--- Thu 2011-09-08 ------------------------------------------------------------
95
963483 (03:18): standby is good so far
974325 (07:40): manually ended test. Standby is still good, but starting
98    with cycle 3704, booting RTEMS failed with
99
100    I: Booting from flash...
101    I: Loading 1889692 bytes from flash...
102    E: CRC failed (expected aa12a56a, got 68ec25e6)
103
104A CRC check yielded:
105
106Images CRC:
107  Checking : standby.fpg CRC passed (got c58e8905)
108  Checking : soc-rescue.fpg CRC passed (got 30dcc535)
109  Checking : bios-rescue.bin(CRC) CRC passed (got c78353fa)
110  Checking : splash-rescue.raw CRC passed (got e8ff824f)
111  Checking : flickernoise.fbi(rescue)(CRC) CRC passed (got aa12a56a)
112  Checking : soc.fpg CRC passed (got 3a31e737)
113  Checking : bios.bin(CRC) CRC passed (got 86e23684)
114  Checking : splash.raw CRC passed (got 978f860c)
115  Checking : flickernoise.fbi(CRC) CRC failed (expected aa12a56a, got 68ec25e6)
116
117Read back the FlickerNoise partition with
118
119  readmem 0x920000 0x0400000 fn.bin
120
121Compare with the original:
122
123  wget http://www.milkymist.org/updates/2011-07-13/flickernoise.fbi
124  md5sum flickernoise.fbi
125  5b7367e71bda306b080bde124615859b flickernoise.fbi
126
127  diff -u <(hexdump -C flickernoise.fbi) <(hexdump -C fn.bin)
128
129...
130-0008a380 28 43 00 00 34 64 00 01 58 44 00 00 5c 60 00 1e |(C..4d..XD..\`..|
131+0008a380 28 43 00 00 00 00 00 01 58 44 00 00 5c 60 00 1e |(C......XD..\`..|
132...
133
134Recovered the FN partition and unlocked the NOR:
135
136  flashmem 0x920000 flickernoise.fbi noverify
137  unlockflash 0 55
138
139New test series with script loop4. This differs from loop2 in that
140it uses "pld reconfigure" to return to standby, instead of
141power-cycling. If we still observe corruption with this test, then
142a software problem would be to blame.
143
1441 (09:11): started
1452509 (19:33): standby looks good
146
147All CRC checks pass. Verified that NOR was unlocked:
148
149  (load fjmem, etc.)
150  peek 0 # show old value
151  poke 0 0x40 0 0x0000 # Word Program
152  peek 0 # read back status (0x80 if okay, 0x92 if locked)
153  poke 0 0xff # Read Array (switch back to normal operation)
154
155Took labsw offline to analyze occasional failure to switch. Failure
156was difficult to reproduce. Also opened labsw to tighten a loose nut.
157Afterwards (Friday run), labsw showed much fewer switch failures.
158
159--- Fri 2011-09-09 ------------------------------------------------------------
160
161New test with script "loop5". This time, we only power cycle but don't
162try to boot out of standby. The purpose of this test is to confirm that
163NOR corruption does not occur when powering down while in standby.
164
1651 (11:04): started
166200 (11:28:): stopped to issue "unlockflash 0 105" to make sure all of
167  the NOR is unlocked, just in case
168
169Also checked CRCs. All is well.
170
1711 (11:31): started
1722637 (16:53): stopped. standby looks good.
173
174All partitions pass the CRC check.
175
176Repeating loop2 to make sure the NOR corruption hasn't disappeared for
177an unrelated reason. System is connected to oscilloscope monitoring the
178M1 DC in voltage. This connection provides grounding of DC in.
179
1801 (16:56): started
181
182--- Sat 2011-09-10 ------------------------------------------------------------
183
1842428 (04:57): standby still okay
1852440 (05::01): disconnected oscilloscope
1862463 (05:08): stopped test
187
188All partitions pass the CRC check. Read back the standby partition and
189also found no corruption in bitwise comparison. Furthermore, the unused
190area showed the expected 0xffff pattern.
191
1921 (05:14): restarted test, without oscilloscope.
1932213 (16:11): standby still okay
194
195All partitions pass the CRC check. Unused area of standby shows 0xffff.
196
197Prepared new test (loop7): like loop2, but make a "false start" of
198turning on both channels and immediately turn them off again, wait 16
199seconds, and only then power up properly. This would roughly correspond
200to labsw failing to turn on, as observed in the test runs in which NOR
201corruption occurred.
202
2031 (16:27): started loop7 test
204... continuing through the night ...
205
206--- Sun 2011-09-11 ------------------------------------------------------------
207
2082001 (11:58): standby okay
209
210All partitions pass the CRC check. Unused area of standby shows 0xffff.
211
212Confirmed writability of NOR at address 0x80000 and at address 0.
213Instructions used at address 0x80000:
214
215  jtag> peek 0x80000
216  URJ_BUS_READ(0x00080000) = 0xFFFF (65535)
217  jtag> poke 0x80000 0x40 0x80000 0xffee
218  jtag> peek 0x80000
219  URJ_BUS_READ(0x00080000) = 0x0080 (128)
220  jtag> poke 0 0xff
221  jtag> peek 0x80000
222  URJ_BUS_READ(0x00080000) = 0xFFEE (65518)
223
224--- Mon 2011-09-12 ------------------------------------------------------------
225
226loop8 is similar to loop7. It increases the "false on" period to 10 ms,
227which is enough to make the M1 power LED flash. It reduces the cool off
228period after the false on.
229
2301 (08:11): started loop8 test
2312120 (19:50): standby okay. All partitions pass CRC check.
232
233Going back to the beginning. Test loop (1) runs all the way to rendering.
234Maybe it is necessary after all ...
235
2361 (19:52): started loop (1) test (serial console logged in file log9)
23770 (21:21): standby okay
238
239--- Tue 2011-09-13 ------------------------------------------------------------
240
241223 (00:39): standby failure
242
243Several corruptions were found:
244
245-00000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
246+00000000 00 00 00 00 ff ff ff ff ff ff ff ff ff ff ff ff |................|
247...
248 00005510 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
249 *
250+00005550 ff ff ff ff ff ff d6 10 ff ff ff ff ff ff ff ff |................|
251+00005560 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
252+*
253
254No CRC errors in other partitions.
255
2561 (00:50): restored standby partition. started loop again (log9.1)
257613 (14:04): standby okay
258639 (14:38): standby failure. Other CRCs okay. Corruption:
259
260...
261-00000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
262+00000000 00 00 ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
263...
264-000014b0 00 c0 cc 26 00 00 00 00 00 00 00 00 0c 44 00 30 |...&.........D.0|
265+000014b0 00 00 cc 26 00 00 00 00 00 00 00 00 0c 44 00 30 |...&.........D.0|
266...
267
2681 (14:46): restored and verified standby partition. started loop again (log9.2)
26923 (15:15): ok
270117 (17:16): standby failure. Other CRCs okay. Corruption:
271
272...
273 000050e0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
274 *
275+00005140 00 00 ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
276...
277
2781 (17:23): restored and verified standby partition. started loop again (log9.3)
2791 (17:30): restarted after X crash
280... continuing through the night ...
281
282--- Wed 2011-09-14 ------------------------------------------------------------
283
284... continuing through the whole day ...
285
286--- Thu 2011-09-15 ------------------------------------------------------------
287
2881493 (01:46): standby okay
2891584 (03:44): standby failure. CRC failure in standby and soc-rescue.
290
291Corruption in standby:
292
293-00000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
294+00000000 ff ff ff ff 00 00 ff ff ff ff ff ff ff ff ff ff |................|
295...
296-00000a00 00 98 cc 26 00 00 00 00 00 00 00 00 0c 44 00 a0 |...&.........D..|
297+00000a00 00 00 cc 26 00 00 00 00 00 00 00 00 0c 44 00 a0 |...&.........D..|
298...
299
300Read back soc-rescue:
301
302jtag> readmem 0x0A0000 0x0180000 soc-rescue.bin
303
304Corruption:
305
306-00020460 50 00 50 00 00 0f 55 5f ff ff dd ee 00 00 00 00 |P.P...U_........|
307+00020460 00 00 50 00 00 0f 55 5f ff ff dd ee 00 00 00 00 |..P...U_........|
308
309Restore standby, then soc-rescue:
310
311wget http://milkymist.org/updates/current/soc-rescue.fpg
312jtag> flashmem 0xA0000 soc-rescue.fpg noverify
313
3141 (04:00): verified CRCs. started loop again (log9.4)
315
316Summary of recent experiments (all with "loop"):
317
318Last cycle observed First cycle observed Offsets of corrupted
319without corruption with corruption words in standby
320  70 ( 1h29) 223 ( 4h47) 0, 2, 0x5556
321 613 (13h14) 639 (13h48) 0, 0x14b0
322  23 ( 0h29) 117 ( 2h30) 0x5140
3231493 (32h16) 1584 (34h14) 4, 0xa00; soc-rescue 0x20460
324

Archive Download this file

Branches:
master



interactive