Discussion:
ubifs_decompress: cannot decompress ...
Matthew L. Creech
2011-05-27 21:12:44 UTC
Hi,

We encountered a device which has the following error:

UBIFS error (pid 428): ubifs_decompress: cannot decompress 1010 bytes,
compressor lzo, error -22
UBIFS error (pid 428): read_block: bad data node (block 388, inode 556196)
UBIFS error (pid 428): do_readpage: cannot read page 388 of inode
556196, error -22
UBIFS error (pid 428): ubifs_decompress: cannot decompress 1010 bytes,
compressor lzo, error -22
UBIFS error (pid 428): read_block: bad data node (block 388, inode 556196)
UBIFS error (pid 428): do_readpage: cannot read page 388 of inode
556196, error -22

The device is still bootable (though not functional), so I can get
more debug info if needed. Any ideas on what could cause this?

- MPC8313 (using fsl_elbc_nand driver)
- 1GB NAND flash (partitioned into 5MB for U-Boot, remainder for UBI/UBIFS)
- 2.6.39 kernel

Thanks!
--
Matthew L. Creech
Ben Gardiner
2011-05-30 12:29:13 UTC
Hi Matthew,
Post by Matthew L. Creech
UBIFS error (pid 428): ubifs_decompress: cannot decompress 1010 bytes,
compressor lzo, error -22
UBIFS error (pid 428): read_block: bad data node (block 388, inode 556196)
UBIFS error (pid 428): do_readpage: cannot read page 388 of inode
556196, error -22
UBIFS error (pid 428): ubifs_decompress: cannot decompress 1010 bytes,
compressor lzo, error -22
UBIFS error (pid 428): read_block: bad data node (block 388, inode 556196)
UBIFS error (pid 428): do_readpage: cannot read page 388 of inode
556196, error -22
It looks like the lzo.c decompress function lzo_decompress() maps all
non-ok return codes from lzo1x_decompress_safe() to -EINVAL (-22). So
any one of the possible failures of the decompressor could cause the
error you are observing.

I don't see any debug statements in lzo1x_decompress_safe() that can
be enabled, so you might want to add some printing to
lzo1x_decompress_safe() or at least print the non-ok return code of
lzo1x_decompress_safe() in lzo_decompress() to get an idea of how the
decompressor is failing.
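
As a rough illustration of that suggestion, here is a user-space C sketch (not kernel code) that turns the decompressor's return value into its symbolic name, so it can be printed before it is collapsed to -EINVAL. The numeric values are assumed to mirror include/linux/lzo.h and should be checked against your tree; lzo_err_name() is a hypothetical helper, not an existing kernel function.

```c
#include <stddef.h>

/* Assumed to mirror include/linux/lzo.h; verify against your kernel tree. */
enum {
	LZO_E_OK                 =  0,
	LZO_E_ERROR              = -1,
	LZO_E_OUT_OF_MEMORY      = -2,
	LZO_E_NOT_COMPRESSIBLE   = -3,
	LZO_E_INPUT_OVERRUN      = -4,
	LZO_E_OUTPUT_OVERRUN     = -5,
	LZO_E_LOOKBEHIND_OVERRUN = -6,
	LZO_E_EOF_NOT_FOUND      = -7,
	LZO_E_INPUT_NOT_CONSUMED = -8
};

/* Hypothetical helper: symbolic name for an LZO return code. */
const char *lzo_err_name(int err)
{
	static const char *const names[] = {
		"LZO_E_OK",
		"LZO_E_ERROR",
		"LZO_E_OUT_OF_MEMORY",
		"LZO_E_NOT_COMPRESSIBLE",
		"LZO_E_INPUT_OVERRUN",
		"LZO_E_OUTPUT_OVERRUN",
		"LZO_E_LOOKBEHIND_OVERRUN",
		"LZO_E_EOF_NOT_FOUND",
		"LZO_E_INPUT_NOT_CONSUMED"
	};

	/* Anything outside 0..-8 is not a code this sketch knows about. */
	if (err > 0 || err < -8)
		return "LZO_E_UNKNOWN";
	return names[(size_t)(-err)];
}
```

In the kernel the equivalent would be a single printk of the raw return value (or of a name looked up this way) in lzo_decompress() just before it returns -EINVAL.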

Best Regards,
Ben Gardiner

---
Nanometrics Inc.
http://www.nanometrics.ca
Ben Gardiner
2011-05-31 16:10:47 UTC
Hi Matthew,
On Mon, May 30, 2011 at 8:29 AM, Ben Gardiner
Post by Ben Gardiner
I don't see any debug statements in lzo1x_decompress_safe() that can
be enabled, so you might want to add some printing to
lzo1x_decompress_safe() or at least print the non-ok return code of
lzo1x_decompress_safe() in lzo_decompress() to get an idea of how the
decompressor is failing.
Looks like it's returning LZO_E_LOOKBEHIND_OVERRUN. I don't know what
that indicates, but there is trailing 0xff data in the block to be
decompressed if that matters:
Interesting -- does the trailing 0xff have ECC set, or is it erased
pages of 0xff?
XXXX: LZO_E_LOOKBEHIND_OVERRUN
Can't say much more about this than lzo-2.05/doc/LZOAPI.TXT [1] does:

LZO_E_LOOKBEHIND_OVERRUN
Your data is corrupted.
UBIFS error (pid 428): ubifs_decompress: cannot decompress 1010 bytes,
compressor lzo, error -22
00000000: 00 0f 69 6e 3a 61 74 74 72 00 c2 38 1c 03 39 03 ..in:attr..8..9.
00000010: 2f 30 31 2f 6d 57 43 2f 2e 56 61 6c 75 65 4d 61 /01/mWC/.ValueMa
00000020: 78 3a 61 91 03 94 31 72 00 69 6e d0 03 00 01 37 x:a...1r.in....7
00000030: 1c 03 3d 01 2f 53 65 72 69 61 6c 4e 75 6d 62 65 ..=./SerialNumbe
00000040: 72 2f 98 07 98 03 00 0a 04 2f 03 63 01 2f 53 6f r/......./.c./So
00000050: 66 74 77 61 72 65 2f 43 6f 6d 6d 61 6e 64 73 2f ftware/Commands/
00000060: 6e 65 78 74 d2 01 49 44 29 bc 00 03 06 47 04 81 next..ID)....G..
00000070: 0d 03 28 c0 00 00 04 44 61 74 61 20 53 65 72 76 ..(....Data Serv
00000080: 65 72 73 2f 42 41 43 6e 65 74 2d 49 50 c4 01 0b ers/BACnet-IP...
00000090: 44 65 76 69 63 65 49 6e 73 74 61 6e 63 65 29 14 DeviceInstance).
000000a0: 01 05 00 f7 28 41 04 81 03 02 20 0c 1f 01 4d 41 ....(A.... ...MA
000000b0: 43 29 08 01 03 02 19 42 04 81 05 20 0d 07 01 4e C).....B... ...N
000000c0: 61 6d 2a 14 02 02 02 18 3c 03 7b 20 02 04 01 07 am*.....<.{ ....
000000d0: 42 4d 44 41 64 64 72 65 73 73 2a fc 01 02 1d 40 BMDAddress*....@
000000e0: 04 81 01 20 05 f4 00 06 54 69 6d 65 54 6f 4c 69 ... ....TimeToLi
000000f0: 76 2b f5 01 1e 20 05 f4 01 02 61 73 65 49 64 80 v+... ....aseId.
00000100: 36 2a f5 01 20 20 05 f4 03 07 44 65 66 61 75 6c 6*..  ....Defaul
00000110: 74 4e 65 74 2f e4 07 03 02 1a 46 04 81 0d 20 01 tNet/.....F... .
00000120: 00 03 00 02 45 6e 61 62 6c 65 42 61 73 65 46 6f ....EnableBaseFo
00000130: 72 47 61 74 65 77 61 79 2a 24 02 02 1f 45 04 81 rGateway*$...E..
00000140: 0b 20 0e 18 01 03 52 6f 75 74 65 64 2a 14 01 02 . ....Routed*...
00000150: 21 50 04 81 21 20 01 14 01 c2 07 30 31 3e 84 09 !P..! .....01>..
00000160: 03 01 f4 4b 04 81 17 20 17 40 01 2d a8 09 02 24 ...K... .@.-...$
00000170: 4c 04 81 19 20 17 2c 01 2e d0 09 01 23 3a 03 77 L... .,.....#:.w
00000180: 20 07 2c 01 2d 1d 02 1c 20 05 d0 0c b0 24 33 cc  .,.-... ....$3.
00000190: 07 02 1b 00 00 00 11 6c 00 00 3f 6a 68 2e 73 ec .......l..?jh.s.
000001a0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
000001b0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
000001c0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
000001d0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
000001e0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
000001f0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
00000200: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
00000210: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
00000220: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
00000230: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
00000240: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
00000250: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
00000260: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
00000270: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
00000280: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
00000290: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
000002a0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
000002b0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
000002c0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
000002d0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
000002e0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
000002f0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
00000300: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
00000310: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
00000320: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
00000330: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
00000340: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
00000350: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
00000360: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
00000370: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
00000380: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
00000390: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
000003a0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
000003b0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
000003c0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
000003d0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
000003e0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
000003f0: ff ff
UBIFS error (pid 428): read_block: bad data node (block 388, inode 556196)
UBIFS error (pid 428): do_readpage: cannot read page 388 of inode
556196, error -22
Could it be possible that writing the page was interrupted? I guess
the CRC checks above decompress would catch that though.
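
For reference, a per-node CRC check of the kind mentioned here might look roughly like the user-space sketch below. It assumes the on-flash layout as I understand it from fs/ubifs: a little-endian CRC stored in bytes 4..7 of the node header, computed over everything from byte 8 to the end of the node with an initial value of 0xFFFFFFFF and no final inversion. Treat those details as assumptions to verify; crc32_le_sketch() and node_crc_ok() are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Bitwise CRC-32 over the reflected polynomial 0xEDB88320, with a
 * caller-supplied seed and no final inversion.
 */
uint32_t crc32_le_sketch(uint32_t crc, const uint8_t *p, size_t len)
{
	while (len--) {
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ ((crc & 1) ? 0xEDB88320u : 0u);
	}
	return crc;
}

/* Read a little-endian 32-bit value from a byte buffer. */
uint32_t get_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Returns 1 if the CRC stored at bytes 4..7 matches the CRC computed
 * over bytes 8..len-1 (assumed layout, see above), 0 otherwise.
 */
int node_crc_ok(const uint8_t *node, size_t len)
{
	if (len < 8)
		return 0;
	return get_le32(node + 4) ==
	       crc32_le_sketch(0xFFFFFFFFu, node + 8, len - 8);
}
```

A node whose tail was replaced by 0xff would fail such a check, provided the check is actually run on the read path in question.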

Sorry I can't be of more help here.

Best Regards,
Ben Gardiner

[1] http://www.oberhumer.com/opensource/lzo/download/

---
Nanometrics Inc.
http://www.nanometrics.ca
Matthew L. Creech
2011-05-31 21:47:19 UTC
On Tue, May 31, 2011 at 12:10 PM, Ben Gardiner
Post by Ben Gardiner
Interesting -- does the trailing 0xff have ECC set, or is it erased
pages of 0xff?
...
Post by Ben Gardiner
Could it be possible that writing the page was interrupted? I guess
the CRC checks above decompress would catch that though.
I verified by adding a call to ubi->mtd->read_oob(): the 0xff data
starts on a page boundary (a whole multiple of 2k in my case). The
associated OOB area for that page is all 0xff as well.
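
The page-boundary check described above can be sketched in user-space C as follows. Both helper names are hypothetical. The caller has to supply the absolute flash offset of the buffer, which this thread does not give, so it is left as an input rather than guessed.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Index of the first byte of the trailing run of 0xff in buf.
 * Returns len if the buffer does not end in 0xff at all.
 */
size_t trailing_ff_start(const uint8_t *buf, size_t len)
{
	size_t i = len;

	while (i > 0 && buf[i - 1] == 0xff)
		i--;
	return i;
}

/* Nonzero if flash_off is a whole multiple of the NAND page size. */
int on_page_boundary(uint64_t flash_off, uint32_t page_size)
{
	return page_size != 0 && flash_off % page_size == 0;
}
```

With a 2 KiB page, the 0xff run looks like an erased page only if the buffer's flash offset plus trailing_ff_start() lands on a multiple of 2048 and the matching OOB area is also all 0xff.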

So I guess this is less about the original LZO error, and more about
how a page in the middle of a UBIFS node got erased out from
underneath it. Which seems suspiciously similar to the
ubifs_read_node() error which I reported last year:

http://lists.infradead.org/pipermail/linux-mtd/2010-July/031069.html

and which is still showing up from time to time on devices in the
field. In that case the erased page contained the node header and so
"type" was interpreted as 255; in this case the erased page is in the
middle of a data node, resulting in decompression failure instead.

Unfortunately it's not repeatable enough for us to capture adequate
debug output - we've had several devices logging extensive debug
output via netconsole for months now, but none have recreated this
problem so far under those conditions.
--
Matthew L. Creech
Artem Bityutskiy
2011-06-01 07:51:26 UTC
Post by Matthew L. Creech
Unfortunately it's not repeatable enough for us to capture adequate
debug output - we've had several devices logging extensive debug
output via netconsole for months now, but none have recreated this
problem so far under those conditions.
How does this happen? What do you do? Does this happen after mount, when you
first read your data? Or does it happen at some point while you are stress
testing your system? Or does it happen after a power cut?
--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
Matthew L. Creech
2011-06-02 04:30:15 UTC
Post by Artem Bityutskiy
How this happens? What do you do? Does this happen after mount when you
first read your data? Or this happens at some point while you stress
testing your system? Or this happens after a power cut?
So far there's no discernable pattern. Most of the failed units are
returns from the field, so we don't know what kind of conditions
they've been placed in. Some are from our test department, but we
haven't found anything that might "trigger" the problem in any way.

The device works fine for some period of time (usually weeks /
months), then we get complaints about various problems. The reported
symptoms eventually come down to one of these UBIFS errors. Depending
on the region which happens to go bad, it can result in breakage of a
minor feature (because a file we try to read/write after mount
triggers the error), all the way up to a completely non-functional
device. I'm not sure if we've ever seen it fail to mount altogether
(I'll check into that), but we've had several cases in which U-Boot
couldn't read the kernel image from UBIFS, so the device wouldn't boot
Linux at all.

Power cuts are probably not common, though. We have to expect them in
the product of course, but practically speaking, our service guy
assures me that a couple of the bad units he shipped me had stable
power and were rarely/never rebooted. But I can't rule it out with
certainty.

Aside from that, it's just normal operation. If the usage pattern
matters, the only files ever written to in the persistent (UBIFS)
filesystem are SQLite databases. It's generally light usage, logging
a variety of measurements once every 5 minutes. I've tried
stress-testing by running non-stop SQLite operations, recreating the
normal usage pattern but with a _much_ higher frequency of writes than
normal. It didn't seem to help reproduce the error - we've yet to
succeed in making this problem happen under controlled conditions.

As for this specific error (ubifs_decompress): tomorrow I'll try to
gather & post additional log data for this device. Thanks!
--
Matthew L. Creech
Matthew L. Creech
2011-06-02 18:59:34 UTC
Post by Matthew L. Creech
As for this specific error (ubifs_decompress): tomorrow I'll try to
gather & post additional log data for this device. Thanks!
Here is a console dump with more information enabled:

http://mcreech.com/work/ubifs-decompress-err.txt

I turned on mount & recovery debug messages, although it seems to
mount & recover correctly, so presumably any info from past recovery
actions is long gone. The error actually occurs later on, when our
main application accesses SQLite databases.

I dumped out the corrupted node data in 3 places:

1. In ubifs_decompress(), I dumped the data buffer, prefixed with
"compressed node" (redundant with #3, really)

2. In read_block(), I page-align the starting offset & size, re-fetch
the pertinent pages from flash, and dump those with corresponding OOB
info, prefixed with "data" and "oob" respectively

3. There's already a dbg_dump_node() call in read_block(), so I
enabled that as well
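
For anyone repeating step 2, the page-alignment arithmetic can be sketched like this in user-space C. The struct and function names are hypothetical, and the page size is assumed to be a power of two (2048 on this device).

```c
#include <stdint.h>

struct page_range {
	uint32_t start;	/* node offset rounded down to a page boundary */
	uint32_t len;	/* length to read, a whole number of pages */
};

/* page_size must be a power of two for the mask arithmetic to hold. */
struct page_range page_align_range(uint32_t offs, uint32_t len,
				   uint32_t page_size)
{
	struct page_range r;
	uint32_t mask = page_size - 1;
	uint32_t end = (offs + len + mask) & ~mask;	/* round end up */

	r.start = offs & ~mask;				/* round start down */
	r.len = end - r.start;
	return r;
}
```

For example, a node at offset 2000 that is 1058 bytes long straddles two 2 KiB pages, so the aligned read covers offsets 0 through 4095.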
--
Matthew L. Creech
Artem Bityutskiy
2011-06-06 09:58:36 UTC
Post by Matthew L. Creech
Post by Matthew L. Creech
As for this specific error (ubifs_decompress): tomorrow I'll try to
gather & post additional log data for this device. Thanks!
http://mcreech.com/work/ubifs-decompress-err.txt
I turned on mount & recovery debug messages, although it seems to
mount & recover correctly, so presumably any info from past recovery
actions is long gone. The error actually occurs later on, when our
main application accesses SQLite databases.
1. In ubifs_decompress(), I dumped the data buffer, prefixed with
"compressed node" (redundant with #3, really)
2. In read_block(), I page-align the starting offset & size, re-fetch
the pertinent pages from flash, and dump those with corresponding OOB
info, prefixed with "data" and "oob" respectively
3. There's already a dbg_dump_node() call in read_block(), so I
enabled that as well
So the corruption starts exactly at the NAND page boundary. This makes
me believe that the reason is most probably power cut recovery. But you
say your client ensures there were none...

Yes, you are right that info from past is gone... What I'd like to see
is a dump of the whole LEB. Could you please add 'dbg_dump_leb()' -
basically I want to look if this LEB was passed through GC.

Because my theory is:

1. You have LEB A which contains this data node, but it is not corrupted
yet. Let's call this data node X.
2. GC moves valid data from LEB A to LEB B (lnum 3479).
3. We get a power cut while moving the data. We end up with node X
corrupted in B.
4. UBIFS recovery has a bug and it decides that the copy of node X in
LEB B is OK, commits, and LEB A is erased at some point.
5. And we are in your situation...

But this is just a theory.

I actually worked on power cut emulation testing improvements lately,
and the current state is that "integck -p" fails sometimes. I need to
investigate it - might turn out to be a bug which causes the effect you
see.

Basically, I've improved UBIFS power cut testing and corrupt the buffer
with random data, not only with 0xFFs, and now integck -p starts
failing. See this commit:

http://git.infradead.org/ubifs-2.6.git/commit/96c32bb596c5a74362a6a825f66fde68b6c3487c

It contains several unrelated changes. Ignore the simple random part,
only the changes in 'cut_data()' are interesting.

But I'll split that change into several changes.

And I'll try to investigate the issue - it might turn out to be
integck's issue, will see.
--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
Matthew L. Creech
2011-06-06 16:04:42 UTC
Post by Artem Bityutskiy
So the corruption starts exactly at the NAND page boundary. This makes
me believe that the reason is most probably power cut recovery. But you
say your client ensures there were none...
Yes, you are right that info from past is gone... What I'd like to see
is a dump of the whole LEB. Could you please add 'dbg_dump_leb()' -
basically I want to look if this LEB was passed through GC.
[Repost, gmail mangled my node dump]

The dbg_dump_leb() call fails due to a bad CRC; I've posted its output
plus the corresponding raw LEB dump (obtained via mtd->read_oob())
here:

http://mcreech.com/work/ubifs-decompress-err-rawleb.txt
--
Matthew L. Creech
Artem Bityutskiy
2011-06-06 16:18:11 UTC
Post by Matthew L. Creech
Post by Artem Bityutskiy
So the corruption starts exactly at the NAND page boundary. This makes
me believe that the reason is most probably power cut recovery. But you
say your client ensures there were none...
Yes, you are right that info from past is gone... What I'd like to see
is a dump of the whole LEB. Could you please add 'dbg_dump_leb()' -
basically I want to look if this LEB was passed through GC.
[Repost, gmail mangled my node dump]
The dbg_dump_leb() call fails due to a bad CRC; I've posted its output
plus the corresponding raw LEB dump (obtained via mtd->read_oob())
I have to go home now - could you please improve dbg_dump_leb().
Currently it calls ubifs_scan(), which scans, finds corrupted node,
prints corruption information and returns -EUCLEAN and destroys the
scanned data.

Instead, we could do:

1. Add another parameter to ubifs_scan() which makes it _not_ free
scanned data on corruption.

2. In dbg_dump_leb() when you get -EUCLEAN - just go ahead and print the
scanned information instead of exiting. Should not be too difficult to
do.
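
The two steps above amount to "stop at the first bad node, but keep what was already parsed". A toy user-space sketch of that idea follows. It is not the UBIFS scanner: it only looks at the common header, and the magic value, the 24-byte header size, the position of the length field and the 8-byte node alignment are assumptions taken from my reading of the on-flash format. All names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define NODE_MAGIC	0x06101831u	/* assumed: UBIFS node magic */
#define CH_SZ		24		/* assumed: common header size */
#define MAX_NODES	64

struct scan_result {
	size_t node_offs[MAX_NODES];	/* offsets of nodes that parsed cleanly */
	size_t nnodes;
	size_t endpt;			/* offset where scanning stopped */
	int corrupted;			/* 1 if stopped at a bad node */
};

uint32_t rd_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Walk a LEB image; on corruption, stop but keep the results so far. */
void scan_keep_good(const uint8_t *leb, size_t leb_size,
		    struct scan_result *res)
{
	size_t offs = 0;

	res->nnodes = 0;
	res->corrupted = 0;
	while (offs + CH_SZ <= leb_size && res->nnodes < MAX_NODES) {
		uint32_t magic = rd_le32(leb + offs);
		uint32_t len = rd_le32(leb + offs + 16);

		if (magic == 0xFFFFFFFFu)
			break;		/* looks like erased space: normal end */
		if (magic != NODE_MAGIC || len < CH_SZ ||
		    len > leb_size - offs) {
			res->corrupted = 1;	/* do not discard the results */
			break;
		}
		res->node_offs[res->nnodes++] = offs;
		offs += (len + 7) & ~(size_t)7;	/* nodes assumed 8-byte aligned */
	}
	res->endpt = offs;
}
```

A header-only walk like this would not notice damage inside a node's payload, which is the situation in this thread. A real scan also verifies each node's CRC, and that is the check that fails here.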
--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
Matthew L. Creech
2011-06-06 19:52:06 UTC
Post by Artem Bityutskiy
I have to go home now - could you please improve dbg_dump_leb().
Currently it calls ubifs_scan(), which scans, finds corrupted node,
prints corruption information and returns -EUCLEAN and destroys the
scanned data.
Will something like this be okay? Or do you still want to dump the
partially-parsed data from the corrupt node as well (not just the raw contents
of the LEB)?


Currently an error in ubifs_scan() will cause dbg_dump_leb() to abort without
completing the dump. Instead, we should abandon parsing the data, but dump
the raw (uninterpreted) LEB contents instead.

Signed-off-by: Matthew L. Creech <mlcreech at gmail.com>
---
fs/ubifs/debug.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/fs/ubifs/debug.c b/fs/ubifs/debug.c
index 26d4c61..6ab43e4 100644
--- a/fs/ubifs/debug.c
+++ b/fs/ubifs/debug.c
@@ -901,6 +901,9 @@ void dbg_dump_leb(const struct ubifs_info *c, int lnum)
 	sleb = ubifs_scan(c, lnum, 0, buf, 0);
 	if (IS_ERR(sleb)) {
 		ubifs_err("scan error %d", (int)PTR_ERR(sleb));
+		printk(KERN_DEBUG "\tLEB data buffer:\n");
+		print_hex_dump(KERN_DEBUG, "\t", DUMP_PREFIX_OFFSET, 32, 1,
+			       buf, c->leb_size, 0);
 		goto out;
 	}
--
1.6.3.3
Artem Bityutskiy
2011-06-07 04:34:26 UTC
Post by Matthew L. Creech
ubifs_err("scan error %d", (int)PTR_ERR(sleb));
+ printk(KERN_DEBUG "\tLEB data buffer:\n");
+ print_hex_dump(KERN_DEBUG, "\t", DUMP_PREFIX_OFFSET, 32, 1,
+ buf, c->leb_size, 0);
No, I have difficulties reading hexdumps. You have a set of good nodes
followed by one broken node. I wanted to see a human-readable dump of
the good nodes at the beginning of the LEB.
--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
Matthew L. Creech
2011-06-07 20:41:16 UTC
Post by Artem Bityutskiy
No, I have difficulties reading hexdumps. You have set of good nodes
following by one broken node. I wanted to see a human-readable dump of
the good nodes at the beginning of the LEB.
Oh I see - sorry, I thought you wanted to debug the corrupted portion.

Here's the output for my corrupt flash:

http://mcreech.com/work/ubifs-2011-06-07.txt

I'll follow up with a patch.
--
Matthew L. Creech
Artem Bityutskiy
2011-06-08 14:11:05 UTC
Post by Matthew L. Creech
Post by Artem Bityutskiy
No, I have difficulties reading hexdumps. You have set of good nodes
following by one broken node. I wanted to see a human-readable dump of
the good nodes at the beginning of the LEB.
Oh I see - sorry, I thought you wanted to debug the corrupted portion.
http://mcreech.com/work/ubifs-2011-06-07.txt
I'll follow up with a patch.
Yes, it does look like this LEB might be garbage-collected. But it does
not have to be.

Anyway, what I can suggest you is to do several things.

1. If you have many occasions of such error, try to gather some
information about how the device was used, and if it was uncleanly
power-cut. Remember, I often saw that embedded devices have incorrect
reboot. When users reboot it "normally" - it does not try to unmount
the FS-es cleanly and just jumps to some HW reset function.

You can verify this by rebooting normally and checking if UBIFS says
"recovery needed" or not. If it does - the reboot was not normal.

2. This error may be due to memory corruptions in some driver (e.g.,
wireless or video), due to issues in the mtd driver, etc. Try to
stress your system with slub/slab full checks enabled, and other
debugging features which you can find in the "hacking" section of
make menuconfig.

3. If my theory is true, then what may help is adding a check in the
UBIFS recovery function. The recovery ends with an ubifs_leb_change()
call. You need to check the last node there - is it full and correct?
If not, you should print a loud warning and information like leb dump
_before_ the change, and dump of the buffer which we are going to
write with ubifs_leb_change().

You'd probably need to deploy this check to the field if this issue
is not easy to reproduce. If you then have this info, you may be able to
fix the bug.

4. Set-up power-cut emulation testing in your office.

P.S. I'm curious where you use UBIFS, if this is not a trade secret, of
course.
--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
Matthew L. Creech
2011-06-08 17:50:24 UTC
Post by Artem Bityutskiy
Yes, it does look like this LEB might be garbage-collected. But it does
not have to be.
Anyway, what I can suggest you is to do several things.
1. If you have many occasions of such error, try to gather some
information about how the device was used, and if it was uncleanly
power-cut. Remember, I often saw that embedded devices have incorrect
reboot. When users reboot it "normally" - it does not try to unmount
the FS-es cleanly and just jumps to some HW reset function.
You can verify this by rebooting normally and checking if UBIFS says
"recovery needed" or not. If it does - the reboot was not normal.
Yes, it currently reboots uncleanly (though it does do a "sync"
first). I noticed this a while back, and the next release firmware
will have it fixed. However, it doesn't make a huge difference to us,
because these devices are probably more likely to experience power
loss than a software reboot, in the field at least.
Post by Artem Bityutskiy
2. This error may be due to memory corruptions in some driver (e.g.,
wireless or video), due to issues in the mtd driver, etc. Try to
stress your system with slub/slab full checks enabled, and other
debugging features which you can find in the "hacking" section of
make menuconfig.
Will do.
Post by Artem Bityutskiy
3. If my theory is true, then what may help is adding a check in the
UBIFS recovery function. The recovery ends with an ubifs_leb_change()
call. You need to check the last node there - is it full and correct?
If not, you should print a loud warning and information like leb dump
_before_ the change, and dump of the buffer which we are going to
write with ubifs_leb_change().
You'd probably need to deploy this check to the field if this issue
is not easy to reproduce. If you then have this info, you may be able to
fix the bug.
Great, I'll add this check and see if we get any hits. Even if it
takes a while to hit it in the field, this would at least give us a
way to make some progress in finding the issue.
Post by Artem Bityutskiy
4. Set-up power-cut emulation testing in your office.
I did this at one point - I have a programmable UPS, so I was able to
automate a test to turn outlet power off & on repeatedly while having
the device do some work. It didn't seem to help reproduce the
problem, but it's worth trying again on a long-term basis (especially
with the change above to try & catch the corruption in the act).

Thanks again Artem.
--
Matthew L. Creech
Artem Bityutskiy
2011-06-09 12:10:34 UTC
Post by Matthew L. Creech
Post by Artem Bityutskiy
Yes, it does look like this LEB might be garbage-collected. But it does
not have to be.
Anyway, what I can suggest you is to do several things.
1. If you have many occasions of such error, try to gather some
information about how the device was used, and if it was uncleanly
power-cut. Remember, I often saw that embedded devices have incorrect
reboot. When users reboot it "normally" - it does not try to unmount
the FS-es cleanly and just jumps to some HW reset function.
You can verify this by rebooting normally and checking if UBIFS says
"recovery needed" or not. If it does - the reboot was not normal.
Yes, it currently reboots uncleanly (though it does do a "sync"
first). I noticed this a while back, and the next release firmware
will have it fixed. However, it doesn't make a huge difference to us,
because these devices are probably more likely to experience power
loss than a software reboot, in the field at least.
Post by Artem Bityutskiy
2. This error may be due to memory corruptions in some driver (e.g.,
wireless or video), due to issues in the mtd driver, etc. Try to
stress your system with slub/slab full checks enabled, and other
debugging features which you can find in the "hacking" section of
make menuconfig.
Will do.
Post by Artem Bityutskiy
3. If my theory is true, then what may help is adding a check in the
UBIFS recovery function. The recovery ends with an ubifs_leb_change()
call. You need to check the last node there - is it full and correct?
If not, you should print a loud warning and information like leb dump
_before_ the change, and dump of the buffer which we are going to
write with ubifs_leb_change().
You'd probably need to deploy this check to the field if this issue
is not easy to reproduce. If you then have this info, you may be able to
fix the bug.
Great, I'll add this check and see if we get any hits. Even if it
takes a while to hit it in the field, this would at least give us a
way to make some progress in finding the issue.
With my latest code-base, I am able to inject a hack into
ubifs_leb_change() - but this function does not exist in your code-base.
Anyway, I'm currently running power cut emulation testing with the
following hack:
Matthew L. Creech
2011-06-20 15:35:29 UTC
Post by Artem Bityutskiy
With my latest code-base, I am able to inject a hack into
ubifs_leb_change() - but this function does not exist in your code-base.
Anyway, I'm currently running power cut emulation testing with the
Hi Artem, just wanted to post an update to this:

I hacked this change into my kernel last week, and enabled the
in-kernel debugging options you suggested, then set up a power-cut
test:

- the device in question is powered by a UPS with controllable outlets
- its firmware boots and immediately starts writing large amounts of
data to flash (so we'll have a dirty FS)
- a separate script controlling the UPS waits for the device to boot,
then sleeps some random time between 0 and 10 minutes
- it then cuts power to the outlet and later restores it

The test has been running for the better part of a week with no issues
- it has now power-cycled the device over 1500 times, but I've gotten
no errors, and the filesystem is functioning fine. FYI this is way
more stress than was placed on the "bad" units we've gotten back from
the field - most of those rarely lose power, and when they do, they're
less likely to be in the middle of a write operation.

I'll continue testing, but based on this it seems like UBIFS handles
power cuts very well!
--
Matthew L. Creech
Artem Bityutskiy
2011-06-07 10:24:14 UTC
Post by Matthew L. Creech
Post by Artem Bityutskiy
I have to go home now - could you please improve dbg_dump_leb().
Currently it calls ubifs_scan(), which scans, finds corrupted node,
prints corruption information and returns -EUCLEAN and destroys the
scanned data.
Will something like this be okay? Or do you still want to dump the
partially-parsed data from the corrupt node as well (not just the raw contents
of the LEB)?
Probably a temporary hack like this should work.

diff --git a/fs/ubifs/scan.c b/fs/ubifs/scan.c
index 37383e8..c709045 100644
--- a/fs/ubifs/scan.c
+++ b/fs/ubifs/scan.c
@@ -352,8 +352,9 @@ corrupted:
 		ubifs_err("LEB %d scanning failed", lnum);
 	}
 	err = -EUCLEAN;
-	ubifs_scan_destroy(sleb);
-	return ERR_PTR(err);
+//	ubifs_scan_destroy(sleb);
+//	return ERR_PTR(err);
+	return sleb;

 error:
 	ubifs_err("LEB %d scanning failed, error %d", lnum, err);
--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
Artem Bityutskiy
2011-06-03 04:32:20 UTC
Post by Matthew L. Creech
Post by Artem Bityutskiy
How this happens? What do you do? Does this happen after mount when you
first read your data? Or this happens at some point while you stress
testing your system? Or this happens after a power cut?
So far there's no discernable pattern. Most of the failed units are
returns from the field, so we don't know what kind of conditions
they've been placed in. Some are from our test department, but we
haven't found anything that might "trigger" the problem in any way.
The device works fine for some period of time (usually weeks /
months), then we get complaints about various problems. The reported
symptoms eventually come down to one of these UBIFS errors. Depending
on the region which happens to go bad, it can result in breakage of a
minor feature (because a file we try to read/write after mount
triggers the error), all the way up to a completely non-functional
device. I'm not sure if we've ever seen it fail to mount altogether
(I'll check into that), but we've had several cases in which U-Boot
couldn't read the kernel image from UBIFS, so the device wouldn't boot
Linux at all.
Power cuts are probably not common, though. We have to expect them in
the product of course, but practically speaking, our service guy
assures me that a couple of the bad units he shipped me had stable
power and were rarely/never rebooted. But I can't rule it out with
certainty.
Aside from that, it's just normal operation. If the usage pattern
matters, the only files ever written to in the persistent (UBIFS)
filesystem are SQLite databases. It's generally light usage, logging
a variety of measurements once every 5 minutes. I've tried
stress-testing by running non-stop SQLite operations, recreating the
normal usage pattern but with a _much_ higher frequency of writes than
normal. It didn't seem to help reproduce the error - we've yet to
succeed in making this problem happen under controlled conditions.
As for this specific error (ubifs_decompress): tomorrow I'll try to
gather & post additional log data for this device. Thanks!
OK, then this is not about power cuts and unstable bits. First thing
coming to my mind is that your kernel may have some non-UBIFS bugs which
end up in memory corruptions, so UBIFS writes corrupted data to the
flash.

But the hexdump you sent shows that you have some non-0xFFs and then
many 0xFFs. Are those trailing 0xFFs part of the node data or not? If
yes, then it does not look like memory corruption, but more like some
driver/flash issues.

BTW, have you run the MTD tests? Would you mind setting up the torture test
on one of your boards and letting it run for several weeks? I remember we
found a rare DMA bug in our board by running the torture test for a long
time. Also, it might be interesting to see how your HW and SW behave when
you continuously wear out a few eraseblocks.
--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
Matthew L. Creech
2011-05-31 15:47:52 UTC
On Mon, May 30, 2011 at 8:29 AM, Ben Gardiner
Post by Ben Gardiner
I don't see any debug statements in lzo1x_decompress_safe() that can
be enabled, so you might want to add some printing to
lzo1x_decompress_safe() or at least print the non-ok return code of
lzo1x_decompress_safe() in lzo_decompress() to get an idea of how the
decompressor is failing.
Looks like it's returning LZO_E_LOOKBEHIND_OVERRUN. I don't know what
that indicates, but there is trailing 0xff data in the block to be
decompressed if that matters:

XXXX: LZO_E_LOOKBEHIND_OVERRUN
UBIFS error (pid 428): ubifs_decompress: cannot decompress 1010 bytes,
compressor lzo, error -22
00000000: 00 0f 69 6e 3a 61 74 74 72 00 c2 38 1c 03 39 03 ..in:attr..8..9.
00000010: 2f 30 31 2f 6d 57 43 2f 2e 56 61 6c 75 65 4d 61 /01/mWC/.ValueMa
00000020: 78 3a 61 91 03 94 31 72 00 69 6e d0 03 00 01 37 x:a...1r.in....7
00000030: 1c 03 3d 01 2f 53 65 72 69 61 6c 4e 75 6d 62 65 ..=./SerialNumbe
00000040: 72 2f 98 07 98 03 00 0a 04 2f 03 63 01 2f 53 6f r/......./.c./So
00000050: 66 74 77 61 72 65 2f 43 6f 6d 6d 61 6e 64 73 2f ftware/Commands/
00000060: 6e 65 78 74 d2 01 49 44 29 bc 00 03 06 47 04 81 next..ID)....G..
00000070: 0d 03 28 c0 00 00 04 44 61 74 61 20 53 65 72 76 ..(....Data Serv
00000080: 65 72 73 2f 42 41 43 6e 65 74 2d 49 50 c4 01 0b ers/BACnet-IP...
00000090: 44 65 76 69 63 65 49 6e 73 74 61 6e 63 65 29 14 DeviceInstance).
000000a0: 01 05 00 f7 28 41 04 81 03 02 20 0c 1f 01 4d 41 ....(A.... ...MA
000000b0: 43 29 08 01 03 02 19 42 04 81 05 20 0d 07 01 4e C).....B... ...N
000000c0: 61 6d 2a 14 02 02 02 18 3c 03 7b 20 02 04 01 07 am*.....<.{ ....
000000d0: 42 4d 44 41 64 64 72 65 73 73 2a fc 01 02 1d 40 BMDAddress*....@
000000e0: 04 81 01 20 05 f4 00 06 54 69 6d 65 54 6f 4c 69 ... ....TimeToLi
000000f0: 76 2b f5 01 1e 20 05 f4 01 02 61 73 65 49 64 80 v+... ....aseId.
00000100: 36 2a f5 01 20 20 05 f4 03 07 44 65 66 61 75 6c 6*.. ....Defaul
00000110: 74 4e 65 74 2f e4 07 03 02 1a 46 04 81 0d 20 01 tNet/.....F... .
00000120: 00 03 00 02 45 6e 61 62 6c 65 42 61 73 65 46 6f ....EnableBaseFo
00000130: 72 47 61 74 65 77 61 79 2a 24 02 02 1f 45 04 81 rGateway*$...E..
00000140: 0b 20 0e 18 01 03 52 6f 75 74 65 64 2a 14 01 02 . ....Routed*...
00000150: 21 50 04 81 21 20 01 14 01 c2 07 30 31 3e 84 09 !P..! .....01>..
00000160: 03 01 f4 4b 04 81 17 20 17 40 01 2d a8 09 02 24 ...K... .@.-...$
00000170: 4c 04 81 19 20 17 2c 01 2e d0 09 01 23 3a 03 77 L... .,.....#:.w
00000180: 20 07 2c 01 2d 1d 02 1c 20 05 d0 0c b0 24 33 cc .,.-... ....$3.
00000190: 07 02 1b 00 00 00 11 6c 00 00 3f 6a 68 2e 73 ec .......l..?jh.s.
000001a0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
000001b0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
000001c0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
000001d0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
000001e0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
000001f0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
00000200: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
00000210: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
00000220: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
00000230: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
00000240: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
00000250: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
00000260: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
00000270: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
00000280: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
00000290: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
000002a0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
000002b0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
000002c0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
000002d0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
000002e0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
000002f0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
00000300: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
00000310: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
00000320: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
00000330: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
00000340: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
00000350: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
00000360: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
00000370: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
00000380: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
00000390: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
000003a0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
000003b0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
000003c0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
000003d0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
000003e0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
000003f0: ff ff
UBIFS error (pid 428): read_block: bad data node (block 388, inode 556196)
UBIFS error (pid 428): do_readpage: cannot read page 388 of inode
556196, error -22
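For anyone repeating this step: the numeric return value of
lzo1x_decompress_safe() can be turned into a name with a small lookup like
the one below. This is only a sketch. The constants are the ones I believe
are defined in include/linux/lzo.h; lzo_strerror() and report_lzo_error()
are hypothetical names, not existing kernel functions, and in the kernel
the message would go through printk() rather than fprintf(). As far as I
can tell from the decompressor, LZO_E_LOOKBEHIND_OVERRUN means a match
refers back to a position before the start of the output buffer.

```c
#include <stdio.h>

/* Return codes as (to my knowledge) defined in include/linux/lzo.h. */
#define LZO_E_OK                   0
#define LZO_E_ERROR              (-1)
#define LZO_E_OUT_OF_MEMORY      (-2)
#define LZO_E_NOT_COMPRESSIBLE   (-3)
#define LZO_E_INPUT_OVERRUN      (-4)
#define LZO_E_OUTPUT_OVERRUN     (-5)
#define LZO_E_LOOKBEHIND_OVERRUN (-6)
#define LZO_E_EOF_NOT_FOUND      (-7)
#define LZO_E_INPUT_NOT_CONSUMED (-8)
#define LZO_E_NOT_YET_IMPLEMENTED (-9)

/* Hypothetical helper: map an LZO return code to its symbolic name. */
static const char *lzo_strerror(int err)
{
    switch (err) {
    case LZO_E_OK:                  return "LZO_E_OK";
    case LZO_E_ERROR:               return "LZO_E_ERROR";
    case LZO_E_OUT_OF_MEMORY:       return "LZO_E_OUT_OF_MEMORY";
    case LZO_E_NOT_COMPRESSIBLE:    return "LZO_E_NOT_COMPRESSIBLE";
    case LZO_E_INPUT_OVERRUN:       return "LZO_E_INPUT_OVERRUN";
    case LZO_E_OUTPUT_OVERRUN:      return "LZO_E_OUTPUT_OVERRUN";
    case LZO_E_LOOKBEHIND_OVERRUN:  return "LZO_E_LOOKBEHIND_OVERRUN";
    case LZO_E_EOF_NOT_FOUND:       return "LZO_E_EOF_NOT_FOUND";
    case LZO_E_INPUT_NOT_CONSUMED:  return "LZO_E_INPUT_NOT_CONSUMED";
    case LZO_E_NOT_YET_IMPLEMENTED: return "LZO_E_NOT_YET_IMPLEMENTED";
    default:                        return "unknown LZO error";
    }
}

/* Hypothetical stand-in for a printk() added to lzo_decompress(). */
static void report_lzo_error(int err)
{
    if (err != LZO_E_OK)
        fprintf(stderr, "lzo: decompress failed: %s (%d)\n",
                lzo_strerror(err), err);
}
```

Printing the name before the code is collapsed to -EINVAL is what makes
the different failure modes distinguishable in the log.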
--
Matthew L. Creech
Artem Bityutskiy
2011-06-01 08:02:08 UTC
On Mon, May 30, 2011 at 8:29 AM, Ben Gardiner
Post by Ben Gardiner
I don't see any debug statements in lzo1x_decompress_safe() that can
be enabled, so you might want to add some printing to
lzo1x_decompress_safe() or at least print the non-ok return code of
lzo1x_decompress_safe() in lzo_decompress() to get an idea of how the
decompressor is failing.
Looks like it's returning LZO_E_LOOKBEHIND_OVERRUN. I don't know what
that indicates, but there is trailing 0xff data in the block to be
decompressed if that matters:
XXXX: LZO_E_LOOKBEHIND_OVERRUN
UBIFS error (pid 428): ubifs_decompress: cannot decompress 1010 bytes,
compressor lzo, error -22
UBIFS error (pid 428): read_block: bad data node (block 388, inode 556196)
UBIFS error (pid 428): do_readpage: cannot read page 388 of inode
556196, error -22
I see the following possibilities:

1. The data was written like this, in which case the bug is on the writing
side. Check the data node: what is its length, and is the CRC correct? It
would be useful to dump the node which cannot be decompressed - I'd accept
such a patch with great delight.

2. You had power cuts while this piece of data was being written and
recovery did not work correctly. Enabling mount and recovery messages
would help.

3. I merged several changes to 2.6.39 which could in theory break
recovery. Try to reproduce this with 2.6.38.

4. The fixup feature might have broken this - we might for some reason
read less data than is there. Although I see the FFs start at offset 416,
which is strange.
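
The length/CRC check from point 1 can also be done offline on a raw dump
of the node. Below is a userspace sketch under a few assumptions: the
common header layout (magic at offset 0, CRC at 4, length at 16, 24 bytes
in total), the magic value and the 0xFFFFFFFF CRC seed are what I
understand ubifs-media.h to define, and the CRC covers everything after
the first 8 bytes without a final inversion. check_ubifs_node() and its
return codes are hypothetical, not an existing UBIFS function.

```c
#include <stddef.h>
#include <stdint.h>

#define UBIFS_NODE_MAGIC 0x06101831u  /* assumed, per ubifs-media.h */
#define UBIFS_CRC32_INIT 0xFFFFFFFFu  /* assumed CRC seed */
#define UBIFS_CH_SZ      24           /* assumed common header size */

/* Bitwise little-endian CRC-32 (poly 0xEDB88320), no final inversion. */
static uint32_t crc32_le(uint32_t crc, const uint8_t *p, size_t len)
{
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc;
}

/* Read a 32-bit little-endian value from an unaligned buffer. */
static uint32_t get_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Hypothetical checker for one raw node.
 * Returns 0 if OK, -1 on bad magic, -2 on bad length, -3 on CRC mismatch.
 */
static int check_ubifs_node(const uint8_t *buf, size_t buf_len)
{
    uint32_t node_len, crc;

    if (buf_len < UBIFS_CH_SZ)
        return -2;
    if (get_le32(buf) != UBIFS_NODE_MAGIC)
        return -1;

    node_len = get_le32(buf + 16);
    if (node_len < UBIFS_CH_SZ || node_len > buf_len)
        return -2;

    /* The CRC skips the magic and the CRC field itself (first 8 bytes). */
    crc = crc32_le(UBIFS_CRC32_INIT, buf + 8, node_len - 8);
    return crc == get_le32(buf + 4) ? 0 : -3;
}
```

A node whose CRC matches even though its payload ends in 0xFFs would point
at the writing side; a CRC mismatch would point at the flash or the read
path instead.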
--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
Artem Bityutskiy
2011-06-01 08:07:36 UTC
Post by Artem Bityutskiy
1. The data was written like this, in which case the bug is on the writing
side. Check the data node: what is its length, and is the CRC correct? It
would be useful to dump the node which cannot be decompressed - I'd accept
such a patch with great delight.
2. You had power cuts while this piece of data was being written and
recovery did not work correctly. Enabling mount and recovery messages
would help.
3. I merged several changes to 2.6.39 which could in theory break
recovery. Try to reproduce this with 2.6.38.
4. The fixup feature might have broken this - we might for some reason
read less data than is there. Although I see the FFs start at offset 416,
which is strange.
So Matthew, let's try to exclude some of these possibilities. Please
send me a description of what you do to reproduce this. Is this error
persistent, or does it go away after a remount or reboot? Please also
include any other information relevant to the above points.

Thanks!
--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
Artem Bityutskiy
2011-06-01 08:39:36 UTC
Post by Artem Bityutskiy
3. I merged several changes to 2.6.39 which could in theory break
recovery. Try to reproduce this with 2.6.38.
Err, sorry, those went to 2.6.40, so disregard this point please.
--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
Matthew L. Creech
2011-06-02 04:34:50 UTC
Post by Artem Bityutskiy
4. The fixup feature might have broken this - we might for some reason
read less data than is there. Although I see the FFs start at offset 416,
which is strange.
Also, I can rule this possibility out - although the device is now
running a recent kernel with the fix-up code present, it was upgraded
last week from an older version. So the UBIFS image it was originally
programmed with (and is still using) didn't have the fixup flag set.
--
Matthew L. Creech
Artem Bityutskiy
2011-06-01 07:48:34 UTC
Post by Matthew L. Creech
Hi,
UBIFS error (pid 428): ubifs_decompress: cannot decompress 1010 bytes,
compressor lzo, error -22
UBIFS error (pid 428): read_block: bad data node (block 388, inode 556196)
UBIFS error (pid 428): do_readpage: cannot read page 388 of inode
556196, error -22
UBIFS error (pid 428): ubifs_decompress: cannot decompress 1010 bytes,
compressor lzo, error -22
UBIFS error (pid 428): read_block: bad data node (block 388, inode 556196)
UBIFS error (pid 428): do_readpage: cannot read page 388 of inode
556196, error -22
The device is still bootable (though not functional), so I can get
more debug info if needed. Any ideas on what could cause this?
Yes, you can enable mount and recovery debugging messages and send me
the log.
Post by Matthew L. Creech
- MPC8313 (using fsl_elbc_nand driver)
- 1GB NAND flash (partitioned into 5MB for U-Boot, remainder for UBI/UBIFS)
- 2.6.39 kernel
Thanks!
--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)