Werner's Miscellanea
Source at commit 1a2d591947f997e08abac3db1c0ab854a50cd6eb created 8 years 2 months ago. By Werner Almesberger, m1rc3/norruption/LOG: finally got one more corruption

--- Tue 2011-09-06 ------------------------------------------------------------

Running "loop": power-cycle, sleep 2 s, jtag-boot, sleep 70 seconds,
which is enough to boot into FN and render "The Tunnel" for a moment,
then power-cycle again (off-time is 5 s).

Note that the test loop is "open-loop" and will keep cycling past any
problems. The first time a corrupt standby (or any other issue) is
observed may therefore be well after the actual event.
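
The "loop" cycle described above can be sketched as follows. This is only an
illustration, not the actual script: the log does not show how labsw or
jtag-boot are invoked, so the four callables are hypothetical stand-ins that
the caller has to supply. The point is that nothing is checked between steps,
which is what makes the loop "open-loop".

```python
from typing import Callable


def run_open_loop(
    power_off: Callable[[], None],   # hypothetical: switch M1 power off
    power_on: Callable[[], None],    # hypothetical: switch M1 power on
    jtag_boot: Callable[[], None],   # hypothetical: boot out of standby via JTAG
    sleep: Callable[[float], None],  # e.g. time.sleep
    cycles: int,
    off_s: float = 5,     # off-time of the power cycle
    settle_s: float = 2,  # wait after power-on
    run_s: float = 70,    # time allowed for booting and rendering
) -> float:
    """Run `cycles` iterations without checking any result (open-loop).

    Returns the nominal total duration in seconds, ignoring the time
    the individual steps take themselves.
    """
    for _ in range(cycles):
        power_off()
        sleep(off_s)
        power_on()
        sleep(settle_s)
        jtag_boot()
        sleep(run_s)
    return cycles * (off_s + settle_s + run_s)
```

With the default timings, one cycle nominally takes 77 s, so 645 cycles come
to roughly 13.8 hours before any per-step overhead.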

1: started around 11:53 (M1 configuration is original, without locking)
(around 500) visually checked boot process; standby was reached normally

--- Wed 2011-09-07 ------------------------------------------------------------

645: neocon stopped working (around 01:58)
666: detected neocon failure at run 666: restarted neocon; urjtag failed
     this cycle; back to normal at 667
684: checked LEDs again (first time since ~500) and found that standby
     may be failing. stopping test at 685 (around 02:50) for
     investigation.

Downloaded the standby bitstream:

  wget https://raw.github.com/milkymist/scripts/master/scripts/reflash_m1.sh
  chmod 755 reflash_m1.sh

  ./reflash_m1.sh --read-flash

Found two corruptions in the standby bitstream:

  diff -u <(hexdump -C standby.fpg) <(hexdump -C /home/root/.qi/milkymist/read-flash/2011...)

  -00000080 00 00 4c 83 00 00 4c 87 00 00 cc 85 d8 47 cc 43 |..L...L......G.C|
  +00000080 00 00 4c 83 00 00 4c 87 00 00 c4 80 d8 47 cc 43 |..L...L......G.C|

  -00002840 00 08 cc 26 00 00 00 00 00 00 00 00 0c 44 00 98 |...&.........D..|
  +00002840 00 00 cc 26 00 00 00 00 00 00 00 00 0c 44 00 98 |...&.........D..|
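
One way to read the two rows above is to check in which direction the bits
changed. Programming a NOR flash cell can only turn 1 bits into 0; getting
back to 1 requires an erase. The sketch below classifies a change on that
basis. The helper name and the interpretation are not from the log; the byte
values are copied from the "-" (reference) and "+" (read-back) rows.

```python
def classify_change(old: bytes, new: bytes) -> str:
    """Describe how `new` differs from `old` at the bit level."""
    if len(old) != len(new):
        raise ValueError("inputs must have the same length")
    # Bits that were 1 in the reference and are 0 in the read-back.
    cleared = any(o & ~n & 0xFF for o, n in zip(old, new))
    # Bits that were 0 in the reference and are 1 in the read-back.
    raised = any(~o & n & 0xFF for o, n in zip(old, new))
    if cleared and raised:
        return "mixed"
    if cleared:
        return "cleared only"
    if raised:
        return "set only"
    return "unchanged"


# Differing bytes from the two hexdump rows (reference, read-back).
CORRUPTIONS = {
    0x0000008A: (bytes.fromhex("cc85"), bytes.fromhex("c480")),
    0x00002840: (bytes.fromhex("0008"), bytes.fromhex("0000")),
}

RESULTS = {addr: classify_change(old, new)
           for addr, (old, new) in CORRUPTIONS.items()}
```

Both entries come out as "cleared only", which would be compatible with
unintended program operations rather than an erase.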

CRC-checked the partitions:

  git clone git://github.com/milkymist/milkymist
  cd milkymist/tools/
  gcc -Wall -I. -o flterm flterm.c
  wget http://milkymist.org/updates/current/for-rc3/boot.4e53273.bin
  ./flterm --port /dev/ttyUSB0 --kernel boot.4e53273.bin

  serialboot
  a

Only standby.fpg failed the CRC check.

Reflashed the standby bitstream:

  wget http://milkymist.org/updates/2011-07-13/for-rc3/fjmem.bit
  (or http://milkymist.org/updates/fjmem.bit.bz2)
  wget http://milkymist.org/updates/current/standby.fpg

  jtag

  cable milkymist
  detect
  instruction CFG_OUT 000100 BYPASS
  instruction CFG_IN 000101 BYPASS
  pld load fjmem.bit
  initbus fjmem opcode=000010
  frequency 6000000
  detectflash 0
  endian big
  flashmem 0 standby.fpg noverify

M1 enters standby normally again.

Running "loop2": power-cycle, sleep 2 s, jtag-boot, sleep 10 seconds,
which is enough to begin (but not finish) booting RTEMS, then
power-cycle again (off-time is 5 s).

1: started around 05:01. Observed until about 200-300 (06:00-06:30)
   that standby was okay.
~730 (08:48): observed that standby didn't load anymore (note: due to
   a bug in labsw, power is not turned on in about 5-10% of the cycles,
   so the real cycle count should be around 650-700.)
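
The correction of the cycle count can be reproduced with a line of
arithmetic. A small sketch, assuming the 5-10% figure applies uniformly over
the whole run (the function name is made up for illustration):

```python
def effective_cycles(counted: int, fail_pct_low: int, fail_pct_high: int):
    """Return (min, max) number of cycles in which power really came on.

    `counted` is the loop counter; the failure rates are percentages of
    cycles in which the switch did not turn the power on.
    """
    lowest = counted * (100 - fail_pct_high) / 100
    highest = counted * (100 - fail_pct_low) / 100
    return lowest, highest


# About 730 counted cycles with 5-10% of them not powering up.
LOW, HIGH = effective_cycles(730, 5, 10)
```

This gives 657 to 693.5 real power cycles, matching the "around 650-700"
estimate in the entry.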

Standby bitstream difference:

  -00000080 00 00 4c 83 00 00 4c 87 00 00 cc 85 d8 47 cc 43 |..L...L......G.C|
  +00000080 00 00 00 00 00 00 4c 87 00 00 cc 85 d8 47 cc 43 |......L......G.C|

Reflashed standby and locked the NOR. Testing with loop2 again.

1 (09:18): started
... continuing through the night ...

--- Thu 2011-09-08 ------------------------------------------------------------

3483 (03:18): standby is good so far
4325 (07:40): manually ended test. Standby is still good, but starting
     with cycle 3704, booting RTEMS failed with

  I: Booting from flash...
  I: Loading 1889692 bytes from flash...
  E: CRC failed (expected aa12a56a, got 68ec25e6)

A CRC check yielded:

  Images CRC:
  Checking : standby.fpg                   CRC passed (got c58e8905)
  Checking : soc-rescue.fpg                CRC passed (got 30dcc535)
  Checking : bios-rescue.bin(CRC)          CRC passed (got c78353fa)
  Checking : splash-rescue.raw             CRC passed (got e8ff824f)
  Checking : flickernoise.fbi(rescue)(CRC) CRC passed (got aa12a56a)
  Checking : soc.fpg                       CRC passed (got 3a31e737)
  Checking : bios.bin(CRC)                 CRC passed (got 86e23684)
  Checking : splash.raw                    CRC passed (got 978f860c)
  Checking : flickernoise.fbi(CRC)         CRC failed (expected aa12a56a, got 68ec25e6)
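
When such a report is captured in a serial log, the failing images can be
picked out automatically. The sketch below is an assumption about how one
might do that; the regular expression is modelled only on the lines shown
above and has not been checked against other versions of the CRC tool.

```python
import re
from typing import List

# One report line: image name, verdict, optional expected CRC, actual CRC.
CRC_LINE = re.compile(
    r"Checking\s*:\s*(?P<name>\S+)\s+CRC\s+(?P<verdict>passed|failed)\s+"
    r"\((?:expected\s+(?P<expected>[0-9a-f]{8}),\s+)?got\s+(?P<got>[0-9a-f]{8})\)"
)


def failed_images(report: str) -> List[str]:
    """Return the names of all images whose CRC check failed."""
    return [
        m.group("name")
        for m in CRC_LINE.finditer(report)
        if m.group("verdict") == "failed"
    ]


# Three lines copied from the report above.
SAMPLE = """\
Checking : standby.fpg CRC passed (got c58e8905)
Checking : flickernoise.fbi(rescue)(CRC) CRC passed (got aa12a56a)
Checking : flickernoise.fbi(CRC) CRC failed (expected aa12a56a, got 68ec25e6)
"""

FAILED = failed_images(SAMPLE)
```

For the sample, only `flickernoise.fbi(CRC)` is reported as failed. Its
expected CRC equals the one the rescue copy of FlickerNoise passed with.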

Read back the FlickerNoise partition with:

  readmem 0x920000 0x0400000 fn.bin

Compared with the original:

  wget http://www.milkymist.org/updates/2011-07-13/flickernoise.fbi
  md5sum flickernoise.fbi
  5b7367e71bda306b080bde124615859b  flickernoise.fbi

  diff -u <(hexdump -C flickernoise.fbi) <(hexdump -C fn.bin)

  ...
  -0008a380 28 43 00 00 34 64 00 01 58 44 00 00 5c 60 00 1e |(C..4d..XD..\`..|
  +0008a380 28 43 00 00 00 00 00 01 58 44 00 00 5c 60 00 1e |(C......XD..\`..|
  ...
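
To relate the hexdump offset to a location in the NOR, the partition base
from the readmem command can be added to the file offset. A rough sketch,
assuming fn.bin offset 0 corresponds to flash byte address 0x920000 and that
the flash is accessed as big-endian 16-bit words (both are assumptions, as is
the helper name):

```python
from typing import List, Tuple


def differing_words(
    base: int, row_offset: int, good_row: str, bad_row: str
) -> List[Tuple[int, int, int]]:
    """Compare two hexdump rows word by word.

    Returns (flash_byte_address, good_word, bad_word) for every 16-bit
    word that differs between the two rows.
    """
    good = bytes.fromhex(good_row)
    bad = bytes.fromhex(bad_row)
    if len(good) != len(bad):
        raise ValueError("rows must have the same length")
    result = []
    for i in range(0, len(good) - 1, 2):
        g = int.from_bytes(good[i:i + 2], "big")
        b = int.from_bytes(bad[i:i + 2], "big")
        if g != b:
            result.append((base + row_offset + i, g, b))
    return result


# Rows copied from the diff above; 0x920000 is the readmem start address.
DIFFS = differing_words(
    0x920000,
    0x0008A380,
    "28 43 00 00 34 64 00 01 58 44 00 00 5c 60 00 1e",
    "28 43 00 00 00 00 00 01 58 44 00 00 5c 60 00 1e",
)
```

Under these assumptions, the single differing word sits at flash byte address
0x9aa384, where 0x3464 reads back as 0x0000.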

Recovered the FN partition and unlocked the NOR:

  flashmem 0x920000 flickernoise.fbi noverify
  unlockflash 0 55

New test series with script loop4. This differs from loop2 in that
it uses "pld reconfigure" to return to standby, instead of
power-cycling. If we still observe corruption with this test, then
a software problem would be to blame.

1 (09:11): started
2509 (19:33): standby looks good

All CRC checks pass. Verified that NOR was unlocked:

  (load fjmem, etc.)
  peek 0                  # show old value
  poke 0 0x40 0 0x0000    # Word Program
  peek 0                  # read back status (0x80 if okay, 0x92 if locked)
  poke 0 0xff             # Read Array (switch back to normal operation)
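
The two status values mentioned in the comments can be decoded bit by bit.
The bit names below follow the status register layout of Intel-style
(StrataFlash) NOR command sets; that the M1's NOR uses this layout is an
assumption here, suggested by the 0x40 (Word Program) and 0xff (Read Array)
commands but not stated in the log.

```python
from typing import List

# Status register bits, assuming an Intel-style command set (bit 0 unused here).
STATUS_BITS = {
    7: "ready",
    6: "erase suspended",
    5: "erase error",
    4: "program error",
    3: "programming voltage error",
    2: "program suspended",
    1: "block locked",
}


def decode_status(status: int) -> List[str]:
    """Return the names of all status bits set in `status`, MSB first."""
    return [
        name
        for bit, name in sorted(STATUS_BITS.items(), reverse=True)
        if status & (1 << bit)
    ]


OKAY = decode_status(0x80)    # value expected after a successful program
LOCKED = decode_status(0x92)  # value expected when the block is locked
```

Read this way, 0x80 means just "ready", while 0x92 combines "ready" with
"program error" and "block locked", consistent with the comment on the
second peek.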

Took labsw offline to analyze occasional failure to switch. Failure
was difficult to reproduce. Also opened labsw to tighten a loose nut.
Afterwards (Friday run), labsw showed far fewer switch failures.

--- Fri 2011-09-09 ------------------------------------------------------------

New test with script "loop5". This time, we only power cycle but don't
try to boot out of standby. The purpose of this test is to confirm that
NOR corruption does not occur when powering down while in standby.

1 (11:04): started
200 (11:28): stopped to issue "unlockflash 0 105" to make sure all of
    the NOR is unlocked, just in case

Also checked CRCs. All is well.

1 (11:31): started
2637 (16:53): stopped. standby looks good.

All partitions pass the CRC check.

Repeating loop2 to make sure the NOR corruption hasn't disappeared for
an unrelated reason. System is connected to oscilloscope monitoring the
M1 DC in voltage. This connection provides grounding of DC in.

1 (16:56): started

--- Sat 2011-09-10 ------------------------------------------------------------

2428 (04:57): standby still okay
2440 (05:01): disconnected oscilloscope
2463 (05:08): stopped test

All partitions pass the CRC check. Read back the standby partition and
also found no corruption in bitwise comparison. Furthermore, the unused
area showed the expected 0xffff pattern.

1 (05:14): restarted test, without oscilloscope.
2213 (16:11): standby still okay

All partitions pass the CRC check. Unused area of standby shows 0xffff.

Prepared new test (loop7): like loop2, but make a "false start" of
turning on both channels and immediately turn them off again, wait 16
seconds, and only then power up properly. This would roughly correspond
to labsw failing to turn on, as observed in the test runs in which NOR
corruption occurred.

1 (16:27): started loop7 test
... continuing through the night ...

--- Sun 2011-09-11 ------------------------------------------------------------

2001 (11:58): standby okay

All partitions pass the CRC check. Unused area of standby shows 0xffff.

Confirmed writability of NOR at address 0x80000 and at address 0.
Instructions used at address 0x80000:

  jtag> peek 0x80000
  URJ_BUS_READ(0x00080000) = 0xFFFF (65535)
  jtag> poke 0x80000 0x40 0x80000 0xffee
  jtag> peek 0x80000
  URJ_BUS_READ(0x00080000) = 0x0080 (128)
  jtag> poke 0 0xff
  jtag> peek 0x80000
  URJ_BUS_READ(0x00080000) = 0xFFEE (65518)

--- Mon 2011-09-12 ------------------------------------------------------------

loop8 is similar to loop7. It increases the "false on" period to 10 ms,
which is enough to make the M1 power LED flash. It reduces the cool off
period after the false on.

1 (08:11): started loop8 test
2120 (19:50): standby okay. All partitions pass CRC check.

Going back to the beginning. Test loop (1) runs all the way to rendering.
Maybe it is necessary after all ...

1 (19:52): started loop (1) test (serial console logged in file log9)
70 (21:21): standby okay

--- Tue 2011-09-13 ------------------------------------------------------------

223 (00:39): standby failure

Several corruptions were found:

  -00000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
  +00000000 00 00 00 00 ff ff ff ff ff ff ff ff ff ff ff ff |................|
   00000010 55 99 aa 66 0c 85 00 e0 04 00 8c 85 60 14 8c 82 |U..f........`...|
   00000020 bc 00 8c 86 90 77 8c 43 20 00 01 c9 0c 87 00 f3 |.....w.C .......|
   00000030 0c 83 00 81 04 00 04 00 04 00 04 00 04 00 04 00 |................|
  @@ -1153,6 +1153,9 @@
   00005500 00 00 00 00 00 00 00 00 ff ff ff ff ff ff ff ff |................|
   00005510 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
   *
  +00005550 ff ff ff ff ff ff d6 10 ff ff ff ff ff ff ff ff |................|
  +00005560 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
  +*

No CRC errors in other partitions.

1 (00:50): restored standby partition. started loop again.