UBIFS and hardware ECC of all FF pages of MLC NAND

Discussion:

Darwin Rambo

2009-09-18 21:31:05 UTC

I have a 512-byte-at-a-time hardware ECC generator that generates a particular non-FF code for 512 bytes of FF data. For a 4K MLC NAND flash, that means I have 8 ECCs per page in the OOB.

Ubinize creates large download files which in some cases have 64-byte headers followed by a page of FFs, and in other cases the entire page is FFs. (I noticed that mkfs.jffs2 doesn't appear to create large blocks of FFs in the image file. In fact, my 6MB jffs2 file became 14MB when I rebuilt it for UBIFS, much of the file being large blocks of FFs.)

When I download and program the flash I took the decision to program the ECC even for all FF pages.

I was getting ECC corruption on startup, and eventually traced it to UBIFS writing new data to these all-FF pages. Because the FS saw that the page was blank, it did not erase anything and simply wrote the data, even though the ECCs were already programmed and non-FF. The result is that the new ECC collided with the old ECC that was there, and I got corruption of a nearby page's ECC as well as the ECC in the page that was written.

When I changed the downloader to detect all FF pages and to leave the ECC area of the OOB at FF, then UBIFS works fine.

I had originally wanted the ECC even on all FF pages since this should help stuck bit problems, even for erased all FF pages.

So the questions are
1. should UBIFS use the FF pattern alone as an assumption of a writable page, or should it also check the ECC?
2. for initial downloading, should an ECC be programmed on all FF data pages? Is there any correction advantage?
3. for runtime page writes, should an all FF page leave the ECC at FF as well?

I apologize in advance if this particular issue has already been covered elsewhere. Thanks!

Adrian Hunter

2009-09-24 13:20:36 UTC

Post by Darwin Rambo
I have a 512-byte-at-a-time hardware ECC generator that generates a particular non-FF code for 512 bytes of FF data. For a 4K MLC NAND flash, that means I have 8 ECCs per page in the OOB.
Ubinize creates large download files which in some cases have 64-byte headers followed by a page of FFs, and in other cases the entire page is FFs. (I noticed that mkfs.jffs2 doesn't appear to create large blocks of FFs in the image file. In fact, my 6MB jffs2 file became 14MB when I rebuilt it for UBIFS, much of the file being large blocks of FFs.)
When I download and program the flash I took the decision to program the ECC even for all FF pages.
I was getting ECC corruption on startup, and eventually traced it to UBIFS writing new data to these all-FF pages. Because the FS saw that the page was blank, it did not erase anything and simply wrote the data, even though the ECCs were already programmed and non-FF. The result is that the new ECC collided with the old ECC that was there, and I got corruption of a nearby page's ECC as well as the ECC in the page that was written.
When I changed the downloader to detect all FF pages and to leave the ECC area of the OOB at FF, then UBIFS works fine.
I had originally wanted the ECC even on all FF pages since this should help stuck bit problems, even for erased all FF pages.
So the questions are
1. should UBIFS use the FF pattern alone as an assumption of a writable page, or should it also check the ECC?

Sorry for the slow reply.

UBIFS assumes FF pages at the end of eraseblocks are empty. UBI and UBIFS are
designed not to require OOB and will not read or write it.

Post by Darwin Rambo
2. for initial downloading, should an ECC be programmed on all FF data pages? Is there any correction advantage?

In your case, as you have discovered, you must not program ECC for FF pages at
the end of eraseblocks.

Post by Darwin Rambo
3. for runtime page writes, should an all FF page leave the ECC at FF as well?

No. The only time UBI or UBIFS will write an all FF page is if that is the
data to be stored - in which case, it should be given an ECC.

Post by Darwin Rambo
I apologize in advance if this particular issue has already been covered elsewhere. Thanks!

Artem Bityutskiy

2009-09-24 14:51:46 UTC

Post by Adrian Hunter
UBIFS assumes FF pages at the end of eraseblocks are empty. UBI and UBIFS are
designed not to require OOB and will not read or write it.

Post by Darwin Rambo
2. for initial downloading, should an ECC be programmed on all FF data pages? Is there any correction advantage?

In your case, as you have discovered, you must not program ECC for FF pages at
the end of eraseblocks.

Post by Darwin Rambo
3. for runtime page writes, should an all FF page leave the ECC at FF as well?

No. The only time UBI or UBIFS will write an all FF page is if that is the
data to be stored - in which case, it should be given an ECC.

I even wrote a doc about how UBI-aware flashing should be done:

http://www.linux-mtd.infradead.org/doc/ubi.html#L_format

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

Matthieu CASTET

2009-09-24 15:36:59 UTC

Post by Adrian Hunter

Post by Darwin Rambo
2. for initial downloading, should an ECC be programmed on all FF data pages? Is there any correction advantage?

In your case, as you have discovered, you must not program ECC for FF pages at
the end of eraseblocks.

The tricky part is when you read FF pages with ecc in mtd. You will get
an ecc error.

If the ecc writing is done on software you can always xor the ecc code
to make it "FF for FF data".
But if everything is done by hardware...
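
A minimal sketch of the XOR trick for the software ECC case. The `toy_ecc` function is only a stand-in (three bytes taken from a CRC-32), not a real NAND ECC, and all names here are hypothetical; the point is the fixed mask that makes an all-FF sector store an all-FF code.

```python
import zlib

SECTOR_SIZE = 512
ECC_BYTES = 3


def toy_ecc(sector: bytes) -> bytes:
    """Stand-in for a real ECC: low 3 bytes of CRC-32 (illustration only)."""
    return (zlib.crc32(sector) & 0xFFFFFF).to_bytes(ECC_BYTES, "big")


# Mask chosen so that the ECC of an all-0xFF sector is stored as all 0xFF.
ECC_OF_BLANK = toy_ecc(b"\xff" * SECTOR_SIZE)
MASK = bytes(b ^ 0xFF for b in ECC_OF_BLANK)


def stored_ecc(sector: bytes) -> bytes:
    """ECC as written to the OOB: raw ECC XORed with the fixed mask."""
    return bytes(e ^ m for e, m in zip(toy_ecc(sector), MASK))


def ecc_matches(sector: bytes, oob_ecc: bytes) -> bool:
    """Undo the mask on read and compare with the recomputed ECC."""
    raw = bytes(e ^ m for e, m in zip(oob_ecc, MASK))
    return raw == toy_ecc(sector)
```

With this mask, an erased sector and its erased OOB are consistent with each other, so reading a never-programmed page does not report an ECC mismatch.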

Matthieu

Artem Bityutskiy

2009-09-25 07:05:09 UTC

Post by Matthieu CASTET

Post by Adrian Hunter

Post by Darwin Rambo
2. for initial downloading, should an ECC be programmed on all FF data pages? Is there any correction advantage?

In your case, as you have discovered, you must not program ECC for FF pages at
the end of eraseblocks.

The tricky part is when you read FF pages with ecc in mtd. You will get
an ecc error.
If the ecc writing is done on software you can always xor the ecc code
to make it "FF for FF data".
But if everything is done by hardware...

Right, which means the UBI/UBIFS flasher should be smart and skip
0xFF-ed NAND pages at the end of eraseblocks. This adds some complexity
to the flasher, though. And here:

http://www.linux-mtd.infradead.org/doc/ubi.html#L_format_det

I even described the flashing algorithm in detail, with my limited
English vocabulary :-)
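
The skipping step could be sketched roughly as follows. This is only an illustration of dropping trailing all-0xFF pages from one eraseblock of an image; the function name and arguments are hypothetical, and the linked document remains the authoritative description of the full algorithm.

```python
def trim_trailing_ff_pages(eraseblock: bytes, page_size: int) -> bytes:
    """Return the eraseblock data with trailing all-0xFF pages removed.

    Only pages at the END of the block are dropped; an all-0xFF page
    followed by real data is kept, because it is data to be stored.
    """
    if len(eraseblock) % page_size != 0:
        raise ValueError("eraseblock length must be a multiple of the page size")
    blank = b"\xff" * page_size
    end = len(eraseblock)
    while end > 0 and eraseblock[end - page_size:end] == blank:
        end -= page_size
    return eraseblock[:end]
```

A flasher would then erase the block and program only the returned bytes, leaving the remaining pages (data and OOB) in their erased state.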

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

Darwin Rambo

2009-09-29 13:26:27 UTC

Artem,

One thing you might add is a paranoid check for the OOB being set to 0xFF before
programming a page. If someone programs trailing pages in a block of 0xFF by mistake,
and puts a non-0xFF ECC in the OOB, then the UBIFS code would write to an already
written ECC, which I have found to corrupt other blocks' ECCs on my part. It also gives
strange error messages and refuses to mount on reboot. The messages do not look like
they are related to the original ECC write problem, so it is harder to debug.

With this particular error, you can see messages like below:

UBIFS error (pid 245): ubifs_read_node: bad node type (255 but expected 2)
UBIFS error (pid 245): ubifs_read_node: bad node at LEB 73:456392
UBI error: ubi_io_read: error -74 while reading 64 bytes from PEB 3:0, read 64 bytes
UBI warning: ubi_eba_init_scan: cannot reserve enough PEBs for bad PEB handling, reserved 17, need 19
UBI warning: ubi_eba_copy_leb: error -74 while reading data from PEB 3
UBI error: wear_leveling_worker: error -74 while moving PEB 3 to PEB 2
UBI warning: ubi_ro_mode: switch to read-only mode
UBI error: do_work: work failed with error code -74
UBI error: ubi_thread: ubi_bgt0d: work failed with error code -74
UBI error: ubi_io_read: error -74 while reading 516096 bytes from PEB 3:8192, read 516096 bytes
UBIFS error (pid 1): ubifs_scan: corrupt empty space at LEB 1:8192
UBIFS error (pid 1): ubifs_scanned_corruption: corrupted data at LEB 1:8192
UBIFS error (pid 1): ubifs_scan: LEB 1 scanning failed
UBI error: ubi_io_read: error -74 while reading 516096 bytes from PEB 3:8192, read 516096 bytes
UBIFS error (pid 1): ubifs_recover_master_node: failed to recover master node
List of all partitions:
1f00 512 mtdblock0 (driver?)
1f01 2048 mtdblock1 (driver?)
1f02 2048 mtdblock2 (driver?)
1f03 8192 mtdblock3 (driver?)
1f04 2048 mtdblock4 (driver?)
1f05 2048 mtdblock5 (driver?)
1f06 1007616 mtdblock6 (driver?)
1f07 1006592 mtdblock7 (driver?)
1f08 32768 mtdblock8 (driver?)
1f09 1024 mtdblock9 (driver?)
1f0a 980280 mtdblock10 (driver?)
No filesystem could mount root, tried: ubifs

A better error message would say something like:
"UBI error: Data page incorrectly programmed to all 0xFFs with non-0xFF ECC."

Another suggestion: rather than creating large files stuffed with 0xFF padding at the
end of some of the blocks, have a ubinize option which creates a download header
in front of each block with the block length and valid data length. Then the 0xFFs
wouldn't have to be carried around and the user would be less likely to program
0xFFs by mistake. They would typically program only the useful data that is in
the file, and since they erased the block to program it, the trailing 0xFFs
would be taken care of automatically. Of course, this would require custom flasher
changes to accommodate. Thanks.
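
A rough sketch of what such a per-block download format might look like. No such ubinize option exists in this thread; the 8-byte header layout (block length, then valid data length, both little-endian 32-bit) and the function names are assumptions made purely for illustration.

```python
import struct

# Hypothetical header: block length, valid data length (little-endian u32 each).
HEADER = struct.Struct("<II")


def pack_block(eraseblock: bytes, page_size: int) -> bytes:
    """Emit the header plus only the pages before the trailing all-0xFF run."""
    blank = b"\xff" * page_size
    valid = len(eraseblock)
    while valid > 0 and eraseblock[valid - page_size:valid] == blank:
        valid -= page_size
    return HEADER.pack(len(eraseblock), valid) + eraseblock[:valid]


def unpack_blocks(image: bytes):
    """Yield (block_length, valid_data) for each record in a packed image."""
    offset = 0
    while offset < len(image):
        block_len, valid_len = HEADER.unpack_from(image, offset)
        offset += HEADER.size
        yield block_len, image[offset:offset + valid_len]
        offset += valid_len
```

A custom flasher reading this format would erase each block and program only `valid_data`, so the trailing pages are never written.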

Regards,
Darwin

Artem Bityutskiy

2009-09-29 15:42:16 UTC

Post by Darwin Rambo
Artem,
One thing you might add is a paranoid check for the OOB being set to 0xFF before
programming a page. If someone programs trailing pages in a block of 0xFF by mistake,
and puts a non-0xFF ECC in the OOB, then the UBIFS code would write to an already
written ECC, which I have found to corrupt other blocks' ECCs on my part. It also gives
strange error messages and refuses to mount on reboot. The messages do not look like
they are related to the original ECC write problem, so it is harder to debug.

Do you mean extending the 'ubi_dbg_check_all_ff()' check to also read
the OOB and make sure it contains only 0xFF bytes? Well, it might be
useful, but I would prefer to get a patch from someone, rather than
implementing this myself. :-)
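
For illustration only, the extended check could be modelled like this. It is a Python sketch of the idea, not the kernel's 'ubi_dbg_check_all_ff()'; the function names and the exception are hypothetical, and it assumes the caller can read the OOB bytes at all.

```python
def is_all_ff(buf: bytes) -> bool:
    """True if every byte in buf is 0xFF (an empty buffer counts as erased)."""
    return buf == b"\xff" * len(buf)


def check_page_writable(data: bytes, oob: bytes) -> None:
    """Raise if the data area looks erased but the OOB has been programmed."""
    if is_all_ff(data) and not is_all_ff(oob):
        raise ValueError(
            "data page is all 0xFF but OOB is not: page was already programmed"
        )
```

Calling `check_page_writable(page_data, page_oob)` before programming a page would turn the silent double write into an immediate, descriptive error.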

Post by Darwin Rambo
UBIFS error (pid 245): ubifs_read_node: bad node type (255 but expected 2)
UBIFS error (pid 245): ubifs_read_node: bad node at LEB 73:456392
UBI error: ubi_io_read: error -74 while reading 64 bytes from PEB 3:0, read 64 bytes

Well, here I can already see that the problem is at the driver level,
because the data cannot be read. Also, it would help if your driver
printed an error message on uncorrectable ECC errors.

Post by Darwin Rambo
UBI warning: ubi_eba_init_scan: cannot reserve enough PEBs for bad PEB handling, reserved 17, need 19
UBI warning: ubi_eba_copy_leb: error -74 while reading data from PEB 3
UBI error: wear_leveling_worker: error -74 while moving PEB 3 to PEB 2
UBI warning: ubi_ro_mode: switch to read-only mode
UBI error: do_work: work failed with error code -74
UBI error: ubi_thread: ubi_bgt0d: work failed with error code -74
UBI error: ubi_io_read: error -74 while reading 516096 bytes from PEB 3:8192, read 516096 bytes
UBIFS error (pid 1): ubifs_scan: corrupt empty space at LEB 1:8192
UBIFS error (pid 1): ubifs_scanned_corruption: corrupted data at LEB 1:8192
UBIFS error (pid 1): ubifs_scan: LEB 1 scanning failed
UBI error: ubi_io_read: error -74 while reading 516096 bytes from PEB 3:8192, read 516096 bytes
UBIFS error (pid 1): ubifs_recover_master_node: failed to recover master node

... snip ...

Post by Darwin Rambo
"UBI error: Data page incorrectly programmed to all 0xFFs with non-0xFF ECC."

Probably, but that would happen only if you have debugging checks
enabled, right?

Post by Darwin Rambo
Another suggestion: rather than creating large files stuffed with 0xFF padding at the
end of some of the blocks, have a ubinize option which creates a download header
in front of each block with the block length and valid data length. Then the 0xFFs
wouldn't have to be carried around and the user would be less likely to program
0xFFs by mistake. They would typically program only the useful data that is in
the file, and since they erased the block to program it, the trailing 0xFFs
would be taken care of automatically. Of course, this would require custom flasher
changes to accommodate. Thanks.

It is doable, but I can predict that other people will then complain and
ask why the heck they cannot use simple nandwrite when flashing UBI
images. And for the many people whose HW has no problems with writing
0xFFs, plain nandwrite is usable.

But how many 0xFFs are there? There should not be that many. We pad
special areas like the UBIFS log, the UBI volume table, the UBIFS lprops
area, and the UBIFS master area with 0xFF, but that is it. Your _data_,
i.e., the FS contents, is not "stuffed with 0xFFs"; it is only those
special UBIFS areas.

So, is it really worth doing what you have suggested? Skipping 0xFFed
pages works just fine. Will the images really be much smaller?

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

Darwin Rambo

2009-09-29 16:13:55 UTC

-----Original Message-----
From: Artem Bityutskiy [mailto:dedekind at infradead.org]
Sent: Tuesday, September 29, 2009 8:42 AM
To: Darwin Rambo
Cc: Matthieu CASTET; linux-mtd at lists.infradead.org; Adrian Hunter
Subject: RE: UBIFS and hardware ECC of all FF pages of MLC NAND

Post by Darwin Rambo
Artem,
One thing you might add is a paranoid check for the OOB being set to 0xFF before
programming a page. If someone programs trailing pages in a block of 0xFF by mistake,
and puts a non-0xFF ECC in the OOB, then the UBIFS code would write to an already
written ECC, which I have found to corrupt other blocks ECCs on my part. It also gives
strange error messages and refuses to mount on reboot. The messages do not look like
they are related to the original ECC write problem so it is harder to debug.

Do you mean extending the 'ubi_dbg_check_all_ff()' check and make it
also read OOB to make sure there are only 0xFF bytes? Well, it might be
useful, but I would prefer to get a patch from someone, rather than
implementing this myself. :-)

That's what I meant. I am not very patch-aware but will consider trying.

Post by Darwin Rambo
UBIFS error (pid 245): ubifs_read_node: bad node type (255 but expected 2)
UBIFS error (pid 245): ubifs_read_node: bad node at LEB 73:456392
UBI error: ubi_io_read: error -74 while reading 64 bytes from PEB 3:0, read 64 bytes

Well, here I already see that the problem is on driver level because I
cannot read data. Also, if your driver prints an error message in case
of an uncorrectable ECC errors, this could help.

That's probably the easiest solution.

Post by Darwin Rambo
UBI warning: ubi_eba_init_scan: cannot reserve enough PEBs for bad PEB handling, reserved 17, need 19
UBI warning: ubi_eba_copy_leb: error -74 while reading data from PEB 3
UBI error: wear_leveling_worker: error -74 while moving PEB 3 to PEB 2
UBI warning: ubi_ro_mode: switch to read-only mode
UBI error: do_work: work failed with error code -74
UBI error: ubi_thread: ubi_bgt0d: work failed with error code -74
UBI error: ubi_io_read: error -74 while reading 516096 bytes from PEB 3:8192, read 516096 bytes
UBIFS error (pid 1): ubifs_scan: corrupt empty space at LEB 1:8192
UBIFS error (pid 1): ubifs_scanned_corruption: corrupted data at LEB 1:8192
UBIFS error (pid 1): ubifs_scan: LEB 1 scanning failed
UBI error: ubi_io_read: error -74 while reading 516096 bytes from PEB 3:8192, read 516096 bytes
UBIFS error (pid 1): ubifs_recover_master_node: failed to recover master node

... snip ...

Post by Darwin Rambo
"UBI error: Data page incorrectly programmed to all 0xFFs with non-0xFF ECC."

Probably, but that would happen only if you have debugging checks
enabled, right?

Right.

Post by Darwin Rambo
Another suggestion is rather than creating large files stuffed with 0xFF pads the
end of some of the blocks, to have a ubinize option which creates a download header
in front of each block with block length and valid data length. Then the 0xFF's
wouldn't have to be carried around and the user would be less likely to program
0xFF's by mistake. They would typically only program the useful data that is in
the file instead, and since they erased the block to program, the trailing 0xFFs
would be taken care of automatically. Of course, this would require custom flasher
changes to accommodate. Thanks.

This is for an embedded system in which we serial download
initially, and then upgrade block by block over the network via ethernet
or wireless, so we don't use nandwrite at this time. I wasn't suggesting
changing the default behaviour of ubinize, just adding a
switch for embedded types and also to avoid accidental programming
of these regions. However, if it's too confusing, then it may not be worth it.

But how much 0xFFs are there are? There should not be that many. We pad
special areas like the UBIFS log, the UBI volume table, the UBIFS lprops
area, the UBIFS master area with 0xFF, but that is it. Your _data_,
i.e., the FS contents is not "stuffed with 0xFFs", it is only those
special UBIFS areas.

Yes it is only special UBIFS areas.

It is a bigger problem with 512K erase blocks. In this case, my
6MB jffs2 image grows to over 14MB ubifs image due to padding. There are about 12
partial blocks with little data in the first few pages, and about 4 partial
blocks at the end. 16 partial blocks is about 8 MB of overhead on 6MB of
real content.
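
The arithmetic behind that estimate, as a quick check. It treats each of the 16 partial blocks as almost entirely padding, which is an upper bound rather than an exact measurement of the image.

```python
ERASEBLOCK = 512 * 1024        # 512 KiB eraseblock, as stated above
PARTIAL_BLOCKS = 12 + 4        # mostly-empty blocks near the start and at the end
CONTENT = 6 * 1024 * 1024      # roughly 6 MB of real content

# Upper bound: count every partial block as a full block of padding.
padding = PARTIAL_BLOCKS * ERASEBLOCK
image = CONTENT + padding

MIB = 1024 * 1024
print(padding // MIB, image // MIB)  # prints "8 14"
```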

So, does it really worth doing what you have suggested? Skipping 0xFFed
works just fine. Will the images be really much smaller?

See above. Thanks.

Darwin

Artem Bityutskiy

2009-09-29 16:20:39 UTC

Post by Darwin Rambo
It is a bigger problem with 512K erase blocks. In this case, my
6MB jffs2 image grows to over 14MB ubifs image due to padding. There are about 12
partial blocks with little data in the first few pages, and about 4 partial
blocks at the end. 16 partial blocks is about 8 MB of overhead on 6MB of
real content.

Well, of course I do not object if someone implements the optimization
you mentioned, but I do not have time to do this.

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

Darwin Rambo

2009-09-29 17:03:20 UTC

-----Original Message-----
From: Artem Bityutskiy [mailto:dedekind1 at gmail.com]
Sent: Tuesday, September 29, 2009 9:21 AM
To: Darwin Rambo
Cc: dedekind at infradead.org; linux-mtd at lists.infradead.org;
Matthieu CASTET; Adrian Hunter
Subject: Re: UBIFS and hardware ECC of all FF pages of MLC NAND

Post by Darwin Rambo
It is a bigger problem with 512K erase blocks. In this case, my
6MB jffs2 image grows to over 14MB ubifs image due to padding. There are about 12
partial blocks with little data in the first few pages, and about 4 partial
blocks at the end. 16 partial blocks is about 8 MB of overhead on 6MB of
real content.

Well, of course I do not object if someone implements the optimization
you mentioned, but I do not have time to do this.
--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

I'm reluctant to implement something unless several people ask for it;
otherwise it may be wasted effort and a candidate for future deprecation.
Let's wait and see. Thanks for your help understanding this.

Regards,
Darwin

Artem Bityutskiy

2009-10-11 08:39:17 UTC

Post by Darwin Rambo
"UBI error: Data page incorrectly programmed to all 0xFFs with non-0xFF ECC."

Just FYI, I've created this FAQ section:

http://www.linux-mtd.infradead.org/faq/ubifs.html#L_why_ubiformat

Here is the full text in case someone would review:

Why do I have to use ubiformat?
The first obvious reason is that ubiformat preserves erase counters, so
you do not lose your wear-leveling information when flashing new images.

The other reason is more subtle, and specific to NAND flashes whose ECC
calculation algorithm produces an ECC code that is not all 0xFF bytes
when the NAND page contains only 0xFF bytes. Consider an example.

* We erase whole flash, so everything is 0xFF'ed now.
* We write an UBI/UBIFS image to flash using nandwrite.
* Some eraseblocks in the UBIFS image may contain several empty
NAND pages at the end, and UBIFS will write to them when it is
run.
* The nandwrite utility writes the whole image, and it explicitly
writes 0xFF bytes to those NAND pages.
* The ECC checksums are calculated for these 0xFF'ed NAND pages
and are stored in the OOB area. The ECC codes are not 0xFF'ed.
This is often the case for HW ECC calculation engines, and it is
difficult to fix this. Normally, ECC codes should be 0xFF'ed for
such pages.
* When later UBIFS runs, it writes data to these NAND pages, which
means that a new ECC code is calculated, and written on top of
the existing one (unsuccessfully, of course). This may trigger
an error straight away, but usually at this point no error is
triggered.
* At some point UBIFS tries to read from these pages, and gets
an ECC error (-EBADMSG = -74).

In fewer words, ubiformat makes sure that every NAND page is written
once and only once after the erasure. If you use nandwrite, some pages
are written twice - once by nandwrite, and once by UBIFS.
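
A toy model of the "written twice" case, under loud assumptions: the one-byte XOR parity below is only a stand-in for a real ECC (chosen because it gives a non-0xFF code, 0x00, for an all-0xFF page of even length), and the `Page` class is hypothetical. Programming is modelled as a bitwise AND, since NAND programming can only clear bits.

```python
from functools import reduce
from operator import xor


def toy_ecc(data: bytes) -> int:
    """Stand-in ECC: XOR of all data bytes (illustration only)."""
    return reduce(xor, data, 0)


class Page:
    """One NAND page plus a 1-byte OOB; programming can only clear bits."""

    def __init__(self, size: int):
        self.data = bytearray(b"\xff" * size)
        self.oob = 0xFF

    def program(self, data: bytes) -> None:
        for i, b in enumerate(data):
            self.data[i] &= b
        self.oob &= toy_ecc(data)  # new ECC lands on top of whatever is there

    def ecc_ok(self) -> bool:
        return self.oob == toy_ecc(bytes(self.data))


payload = b"\x31" + b"\x00" * 15

written_once = Page(16)            # trailing 0xFF page skipped by the flasher
written_once.program(payload)

written_twice = Page(16)           # 0xFF page programmed, then written again
written_twice.program(b"\xff" * 16)
written_twice.program(payload)

print(written_once.ecc_ok(), written_twice.ecc_ok())  # prints "True False"
```

The data area of the page written twice is intact; only the stored code is wrong, which matches the description of an error that shows up later, on read.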

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

Darwin Rambo

2009-10-11 14:38:00 UTC

Hi Artem,

Some feedback inline. Thanks.

Post by Artem Bityutskiy
The other reason is more subtle, and specific to NAND flashes which have
ECC calculation algorithm which produces ECC code not equivalent to all
0xFF bytes if the NAND page contains only 0xFF bytes. Consider an
example.
* We erase whole flash, so everything is 0xFF'ed now.
* We write an UBI/UBIFS image to flash using nandwrite.
* Some eraseblocks in the UBIFS image may contain several empty
NAND pages at the end, and UBIFS will write to them when it is
run.

I think it is dangerous for UBIFS to assume that FF data = FF oob, especially
as hardware ECCs are appearing more and more. It would be nice if there
were a standard that all-FF data must generate an all-FF ECC, but this isn't the case
(though it would solve some corruption issues). Perhaps we should leave a runtime
check in (not a paranoid check) for the next year or two that also checks the oob
if the data is all FF, just to catch these issues.

Post by Artem Bityutskiy
* When later UBIFS runs, it writes data to these NAND pages, which
means that a new ECC code is calculated, and written on top of
the existing one (unsuccessfully, of course). This may trigger
an error straight away, but usually at this point no error is
triggered.

When this happens, the old and new ECCs are effectively ANDed together, since programming
can only change 1's to 0's. For example, if the ECC for a 512-byte sector of all-FF data is
"10 ae d1 f6 12 6c 65 3d 68 86 1a db 4a"
and the intended ECC for a new sector of non-FF data is
"18 20 f1 91 87 d3 bd 30 a7 4f 3f 23 75"
then the resulting ECC I have seen is the bitwise AND of the two:
"10 20 d1 90 02 40 25 30 20 06 1a 03 40"
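
The three byte strings above can be checked directly. This short snippet uses only the values quoted in this message and confirms that the observed result is the byte-wise AND of the old and new ECC.

```python
old_ecc = bytes.fromhex("10 ae d1 f6 12 6c 65 3d 68 86 1a db 4a")   # ECC of all-FF sector
new_ecc = bytes.fromhex("18 20 f1 91 87 d3 bd 30 a7 4f 3f 23 75")   # ECC of the new data
observed = bytes.fromhex("10 20 d1 90 02 40 25 30 20 06 1a 03 40")  # what ended up in OOB

# Programming can only clear bits, so the stored value is old AND new.
anded = bytes(a & b for a, b in zip(old_ecc, new_ecc))
xored = bytes(a ^ b for a, b in zip(old_ecc, new_ecc))

print(anded.hex(" "))     # prints "10 20 d1 90 02 40 25 30 20 06 1a 03 40"
print(anded == observed)  # prints "True"
print(xored == observed)  # prints "False"
```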

Now readback validation if it were turned on would catch that the ECC correction could not be
performed and you could see an error right away in this case. Now an interesting thing
is that I have proven with my 4K page MLC flashes that _other_ blocks can have their ECCs
corrupted when this collision occurs - though this might be a local hardware issue. That
took a while to debug in case anyone is having similar problems.

Post by Artem Bityutskiy
* At some point UBIFS is trying to read from these pages, and gets
an ECC error (-EBADMSG = -74).

In fewer words, ubiformat makes sure that every NAND page is written
once and only once after the erasure. If you use nandwrite, some pages
are written twice - once by nandwrite, and once by UBIFS.

This may be all the more reason to leave a runtime check in on the oob being all FF
for a while on all FF data. Good defensive programming to not assume anything about
what happened earlier with previous flash operations.

Thanks!

Darwin

Artem Bityutskiy

2009-10-11 15:04:22 UTC

Hi,

Post by Darwin Rambo

Post by Artem Bityutskiy
The other reason is more subtle, and specific to NAND flashes which have
ECC calculation algorithm which produces ECC code not equivalent to all
0xFF bytes if the NAND page contains only 0xFF bytes. Consider an
example.
* We erase whole flash, so everything is 0xFF'ed now.
* We write an UBI/UBIFS image to flash using nandwrite.
* Some eraseblocks in the UBIFS image may contain several empty
NAND pages at the end, and UBIFS will write to them when it is
run.

I think this is dangerous for UBIFS to assume that FF data = FF oob, especially
as hardware ECCs appearing more and more.

UBIFS does not assume *anything* about ECC. UBI/UBIFS does not assume
anything about flash type even. E.g., it works on NOR.

All UBIFS assumes is that it may write more data to the end of
eraseblocks, nothing else. IMHO, this is a reasonable assumption.

Post by Darwin Rambo
It would be nice if there
was a standard that all FF data must generate all FF ECC but this isn't the case
(though it would solve some corruption issues). Perhaps we should leave a runtime
check in (not paranoid check) for the next year or two that checks the oob also
if the data is all FF just to catch these issues.

UBI/UBIFS is perfectly fine with any algorithm. All I ask to do is to
use ubiformat tool to flash UBI images, or any other tool which is able
to skip 0xFFed NAND pages.

I've documented how an UBI-aware flasher should work:
http://www.linux-mtd.infradead.org/doc/ubi.html#L_flasher_algo

Post by Darwin Rambo

Right, the result is anyway corrupted ECC.

Post by Darwin Rambo
Now readback validation if it were turned on would catch that the ECC correction could not be
performed and you could see an error right away in this case.

This is out of UBI/UBIFS scope. The MTD driver may do this if it is in
debug mode, or if you are OK with spending the time on reading back.

Post by Darwin Rambo
Now an interesting thing
is that I have proven with my 4K page MLC flashes that _other_ blocks can have their ECCs
corrupted when this collision occurs - though this might be a local hardware issue. That
took a while to debug in case anyone is having similar problems.

Wow, this is really nasty :-)

Post by Darwin Rambo

This may be all the more reason to leave a runtime check in on the oob being all FF
for a while on all FF data.

Again, this is out of UBI/UBIFS scope. At this level we do not care
about ECC at all. This may be done at the MTD level, and even then not
always. I believe there are controllers which will not even let you read
the ECC.

Post by Darwin Rambo
Good defensive programming to not assume anything about
what happened earlier with previous flash operations.

To be fast we should assume something. We cannot read after each write,
unless we are in debugging mode. Also, MTD already does have the "write
verify" option, so this defensive thing exists, actually.

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

Darwin Rambo

2009-10-11 17:36:45 UTC

Post by Artem Bityutskiy
UBIFS does not assume *anything* about ECC. UBI/UBIFS does not assume
anything about flash type even. E.g., it works on NOR.
All UBIFS assumes is that it may write more data to the end of
eraseblocks, nothing else. IMHO, this is a reasonable assumption.

Okay, I understand it is a lower level driver issue now and will look at
putting the checks there. Thanks.

Darwin