UBIFS question: Power-cuts after ubifs_leb

Post by Richard Weinberger
Artem, Adrian,
While playing with a new UBI/UBIFS test framework I managed to hit this error,
[ 2412.268964] UBIFS error (ubi0:0 pid 708): scan_check_cb: bad accounting of
LEB 11: free 0, dirty 118072 flags 0x1, should be free 126976, dirty 0
LEB 11 is unmapped but LPT still thinks that some data is used,
lp->free + lp >dirty < leb-size.
Even without lprobs self-checks, the same filesystem will later hit this
ubifs_assert(!list_empty(&sleb->nodes));
The assert makes sure that the LEB actually contains nodes.
ubifs_garbage_collect_leb() handles the special case lp->free + lp->dirty ==
c->leb_size.
But not lp->free + lp->dirty < leb-size.
Now I'm not sure where to fix that, maybe you can remember some design
decisions.
1. Shall we massage ubifs_garbage_collect_leb() to deal with this special case
too?
2. Is it already a bug when this case happens?

Post by Richard Weinberger
From reviewing the code, I think the said situation can arise when we face

power-cut
if (snod->type == UBIFS_IDX_NODE) {
...
} else {
...
err = ubifs_change_one_lp(c, lnum, c->leb_size, 0, 0, 0, 0);
...
err = ubifs_leb_unmap(c, lnum);
// POWER CUT
}
We mark the LEB as free and unmap it.
ubifs_change_one_lp() does not immediately write a new LPT, if we lose power
right after ubifs_leb_unmap() it can happen that the LEB already got erased
but the LPT has the old accounting information.

Doesn't GC copy the nodes into the journal, so after the journal is
replayed, the old nodes will be dirtied and lprops will be correct again.

Post by Richard Weinberger
UBIFS seems to expect such situations only for
lp->free + lp->dirty == c->leb_size
What do you think?
Thanks,
//richard

Richard Weinberger

2018-07-09 13:11:39 UTC

Post by Richard Weinberger
From reviewing the code, I think the said situation can arise when we face

Doesn't GC copy the nodes into the journal, so after the journal is
replayed, the old nodes will be dirtied and lprops will be correct again.

Yes, this is the theory. But the assert proves that something is not as we expect it. ;-\

Thanks,
//richard

Adrian Hunter

2018-07-10 06:58:22 UTC

If it is caused by a power-cut then the error would appear during the next
mount, probably due to a failure in the replay to correctly account for the
dirty space.

The downside of papering over the problem in GC is the possibility that that
then hides a real problem.

Post by Richard Weinberger
From reviewing the code, I think the said situation can arise when we face

Doesn't GC copy the nodes into the journal, so after the journal is
replayed, the old nodes will be dirtied and lprops will be correct again.

Yes, this is the theory. But the assert proves that something is not as we expect it. ;-\

Artem Bityutskiy

2018-07-24 08:01:30 UTC

Hi, I looked at this quickly and talked to Adrian. I cannot solve this
for you but here are some thoughts.

So it has 126976 - 118072 = 8904 of used space.

Post by Richard Weinberger
LEB 11 is unmapped but LPT still thinks that some data is used,
lp->free + lp >dirty < leb-size.

And the problem is this used space. We think that most probably the bug
is in replay and these 9804 bytes should have been marked as dirty
during replay, but they were not. If they were marked as dirty, we
probably would not have problem.

It is unlikely this is a bug in GC. The recovery path is the most
tricky thing, and recovery happens in slightly different order, and if
we were expecting a but, we'd suspect the recovery replay.

Post by Richard Weinberger
Even without lprobs self-checks, the same filesystem will later hit this
ubifs_assert(!list_empty(&sleb->nodes));
The assert makes sure that the LEB actually contains nodes.
ubifs_garbage_collect_leb() handles the special case lp->free + lp-

Post by Richard Weinberger
dirty ==

c->leb_size.
But not lp->free + lp->dirty < leb-size.
Now I'm not sure where to fix that, maybe you can remember some design
decisions.
1. Shall we massage ubifs_garbage_collect_leb() to deal with this special case
too?

lp->free + lp->dirty < leb-size means there is used data, and we GC by
moving. So this is not a special case.

Post by Richard Weinberger
2. Is it already a bug when this case happens?

I do not think so.

Post by Richard Weinberger
From reviewing the code, I think the said situation can arise when we face
power-cut
if (snod->type == UBIFS_IDX_NODE) {
...
} else {
...

OK, we are here. This means that lprops in memory belives this LEB has
no used data, only dirt and free space. It can be recycled.

Now, if we have power cut right here, next time we'll go to recovery
replay. If the replay is correct, we'll end up with same lprops in
memory, and we'll recycle the LEB anyway.

Post by Richard Weinberger
err = ubifs_change_one_lp(c, lnum, c->leb_size, 0, 0, 0, 0);

We change lprops in memory.

Post by Richard Weinberger
...
err = ubifs_leb_unmap(c, lnum);
// POWER CUT
}
We mark the LEB as free and unmap it.
ubifs_change_one_lp() does not immediately write a new LPT, if we lose power
right after ubifs_leb_unmap() it can happen that the LEB already got erased
but the LPT has the old accounting information.

Yes, and during recovery we'll notice that lprops says an LEB is
freeable (has no used data), and we'll adjust lprops and probably re-
unmap the LEB (I not remember 100%).

In the error you showed - there is 8K of used, and this is the problem.

Post by Richard Weinberger
What do you think?

1. If you believe some specific place like the one you pointed contains
a bug, then emulate a power cut in that place and try to reproduce the
bug.

2. Try to focus on recovery replay, this is the main suspect. Try to
figure out what is that used space in that LEB?

--
Best Regards,
Artem Bityutskiy

Richard Weinberger

2018-07-25 22:29:20 UTC

Artem, Adrian,

Post by Artem Bityutskiy
Hi, I looked at this quickly and talked to Adrian. I cannot solve this
for you but here are some thoughts.

Both of your thoughts are much appreciated!

So it has 126976 - 118072 = 8904 of used space.

[...]

Post by Artem Bityutskiy
1. If you believe some specific place like the one you pointed contains
a bug, then emulate a power cut in that place and try to reproduce the
bug.
2. Try to focus on recovery replay, this is the main suspect. Try to
figure out what is that used space in that LEB?

After spending the third night in a row staring at hexdumps, I think I know
what is going on.

I see that LEB 11 recently got unmapped, but the LPT still accounts 8904 bytes
as used.
To double check that value I've analyzed the index tree. To my surprise it was
referencing nodes at LEB 11!
Summing up the length of these nodes is exactly 8904. So, the LPT is correct.

All nodes on LEB 11 that are referenced by the index tree are of type
UBIFS_INO_NODE and have UBIFS_XATTR_FL set.
So, our problem is xattr related. We forget xattr inodes and the free space
accounting goes nuts sooner or later.

Inspecting the journal gave more details. Only buds of head type BASEHD and
DATAHD are present. No GCHD. Therefore GC was not interrupted.
With the IO transaction log from before the power-cut I was able to figure
where the orphaned xattr inodes belong to.
And indeed, the host inode of said xattr is in the journal for deletion.

In theory during replay, removal of the host inode should also remove all
xattr inodes and we are good.
So the question is, why fails UBIFS to remove the xattr inodes even if we have
the host inode in the journal?

Another round of digging though the index uncovered the problem. The
UBIFS_XENT_NODE nodes for all orphaned xattr inodes sat on LEB 11.
This means upon replay, when UBIFS tries to locate all xattr entries
of the host inode, it faces the dangling branch case and the loop in
ubifs_tnc_remove_ino() terminates with -ENOENT. The host inode will be
removed, but the xattr inodes stay.

Therefore the root of the problem is that UBIFS places only the host inode
into the journal upon removal of a file with xattrs and assumes that upon
replay we can discover all xattr inodes. That will fail if due to GC the LEB
got already unmapped.

As intermediate solution I suggest adding also xattr inodes to the journal
upon unlink. This will make unlink more expensive but for now it is IMHO ok
until a more sophisticated solution is found.
Files usually have only a few xattr, so the cost is not that bad.

What do you guys think? Does my analysis make sense to you?

Thanks,
//richard

Adrian Hunter

2018-07-30 06:55:24 UTC

Post by Richard Weinberger
Artem, Adrian,

Post by Artem Bityutskiy
Hi, I looked at this quickly and talked to Adrian. I cannot solve this
for you but here are some thoughts.

Both of your thoughts are much appreciated!

So it has 126976 - 118072 = 8904 of used space.

[...]

After spending the third night in a row staring at hexdumps, I think I know
what is going on.
I see that LEB 11 recently got unmapped, but the LPT still accounts 8904 bytes
as used.
To double check that value I've analyzed the index tree. To my surprise it was
referencing nodes at LEB 11!
Summing up the length of these nodes is exactly 8904. So, the LPT is correct.
All nodes on LEB 11 that are referenced by the index tree are of type
UBIFS_INO_NODE and have UBIFS_XATTR_FL set.
So, our problem is xattr related. We forget xattr inodes and the free space
accounting goes nuts sooner or later.
Inspecting the journal gave more details. Only buds of head type BASEHD and
DATAHD are present. No GCHD. Therefore GC was not interrupted.
With the IO transaction log from before the power-cut I was able to figure
where the orphaned xattr inodes belong to.
And indeed, the host inode of said xattr is in the journal for deletion.
In theory during replay, removal of the host inode should also remove all
xattr inodes and we are good.
So the question is, why fails UBIFS to remove the xattr inodes even if we have
the host inode in the journal?
Another round of digging though the index uncovered the problem. The
UBIFS_XENT_NODE nodes for all orphaned xattr inodes sat on LEB 11.
This means upon replay, when UBIFS tries to locate all xattr entries
of the host inode, it faces the dangling branch case and the loop in
ubifs_tnc_remove_ino() terminates with -ENOENT. The host inode will be
removed, but the xattr inodes stay.
Therefore the root of the problem is that UBIFS places only the host inode
into the journal upon removal of a file with xattrs and assumes that upon
replay we can discover all xattr inodes. That will fail if due to GC the LEB
got already unmapped.
As intermediate solution I suggest adding also xattr inodes to the journal
upon unlink. This will make unlink more expensive but for now it is IMHO ok
until a more sophisticated solution is found.
Files usually have only a few xattr, so the cost is not that bad.
What do you guys think? Does my analysis make sense to you?

It doesn't seem like there is a way to avoid adding more information to the
journal upon deletion, but you probably need to cater for the possibility
that there is a power cut while writing that information e.g. it wouldn't be
good to remove half the xattrs but leave the rest

Richard Weinberger

2018-07-30 07:28:27 UTC

Post by Richard Weinberger
Artem, Adrian,

Post by Artem Bityutskiy
Hi, I looked at this quickly and talked to Adrian. I cannot solve this
for you but here are some thoughts.

Both of your thoughts are much appreciated!

So it has 126976 - 118072 = 8904 of used space.

[...]

After spending the third night in a row staring at hexdumps, I think I know
what is going on.
I see that LEB 11 recently got unmapped, but the LPT still accounts 8904 bytes
as used.
To double check that value I've analyzed the index tree. To my surprise it was
referencing nodes at LEB 11!
Summing up the length of these nodes is exactly 8904. So, the LPT is correct.
All nodes on LEB 11 that are referenced by the index tree are of type
UBIFS_INO_NODE and have UBIFS_XATTR_FL set.
So, our problem is xattr related. We forget xattr inodes and the free space
accounting goes nuts sooner or later.
Inspecting the journal gave more details. Only buds of head type BASEHD and
DATAHD are present. No GCHD. Therefore GC was not interrupted.
With the IO transaction log from before the power-cut I was able to figure
where the orphaned xattr inodes belong to.
And indeed, the host inode of said xattr is in the journal for deletion.
In theory during replay, removal of the host inode should also remove all
xattr inodes and we are good.
So the question is, why fails UBIFS to remove the xattr inodes even if we have
the host inode in the journal?
Another round of digging though the index uncovered the problem. The
UBIFS_XENT_NODE nodes for all orphaned xattr inodes sat on LEB 11.
This means upon replay, when UBIFS tries to locate all xattr entries
of the host inode, it faces the dangling branch case and the loop in
ubifs_tnc_remove_ino() terminates with -ENOENT. The host inode will be
removed, but the xattr inodes stay.
Therefore the root of the problem is that UBIFS places only the host inode
into the journal upon removal of a file with xattrs and assumes that upon
replay we can discover all xattr inodes. That will fail if due to GC the LEB
got already unmapped.
As intermediate solution I suggest adding also xattr inodes to the journal
upon unlink. This will make unlink more expensive but for now it is IMHO ok
until a more sophisticated solution is found.
Files usually have only a few xattr, so the cost is not that bad.
What do you guys think? Does my analysis make sense to you?

With grouped nodes a power-cut is okay, the bigger problem is the number of xattrs.
We support up to 2^16. Do delete an inode via journal we need UBIFS_INO_NODE_SZ bytes
in the journal.
So, in worst case we need to write 2^16 times UBIFS_INO_NODE_SZ bytes to the journal.
I'm still looking into this option right now, just to be sure that this is really the
solution to the problem I see. Finding a stable reproducer is also not easy...

Another possibility is playing with ubifs_tnc_remove_ino(), upon unlink() it removes
all entries from the TNC and marks them as dirty.
The problem is that also all xattr inodes are marked as dirty.
If we manage to find a way that these get marked only as dirty when we are sure that
they are gone the GC will no longer unmap a "used" LEB.
But that all is very tricky.

Thanks,
//richard

Adrian Hunter

2018-07-30 08:07:25 UTC

Post by Richard Weinberger
Artem, Adrian,

Post by Artem Bityutskiy
Hi, I looked at this quickly and talked to Adrian. I cannot solve this
for you but here are some thoughts.

Both of your thoughts are much appreciated!

So it has 126976 - 118072 = 8904 of used space.

[...]

After spending the third night in a row staring at hexdumps, I think I know
what is going on.
I see that LEB 11 recently got unmapped, but the LPT still accounts 8904 bytes
as used.
To double check that value I've analyzed the index tree. To my surprise it was
referencing nodes at LEB 11!
Summing up the length of these nodes is exactly 8904. So, the LPT is correct.
All nodes on LEB 11 that are referenced by the index tree are of type
UBIFS_INO_NODE and have UBIFS_XATTR_FL set.
So, our problem is xattr related. We forget xattr inodes and the free space
accounting goes nuts sooner or later.
Inspecting the journal gave more details. Only buds of head type BASEHD and
DATAHD are present. No GCHD. Therefore GC was not interrupted.
With the IO transaction log from before the power-cut I was able to figure
where the orphaned xattr inodes belong to.
And indeed, the host inode of said xattr is in the journal for deletion.
In theory during replay, removal of the host inode should also remove all
xattr inodes and we are good.
So the question is, why fails UBIFS to remove the xattr inodes even if we have
the host inode in the journal?
Another round of digging though the index uncovered the problem. The
UBIFS_XENT_NODE nodes for all orphaned xattr inodes sat on LEB 11.
This means upon replay, when UBIFS tries to locate all xattr entries
of the host inode, it faces the dangling branch case and the loop in
ubifs_tnc_remove_ino() terminates with -ENOENT. The host inode will be
removed, but the xattr inodes stay.
Therefore the root of the problem is that UBIFS places only the host inode
into the journal upon removal of a file with xattrs and assumes that upon
replay we can discover all xattr inodes. That will fail if due to GC the LEB
got already unmapped.
As intermediate solution I suggest adding also xattr inodes to the journal
upon unlink. This will make unlink more expensive but for now it is IMHO ok
until a more sophisticated solution is found.
Files usually have only a few xattr, so the cost is not that bad.
What do you guys think? Does my analysis make sense to you?

With grouped nodes a power-cut is okay, the bigger problem is the number of xattrs.
We support up to 2^16. Do delete an inode via journal we need UBIFS_INO_NODE_SZ bytes
in the journal.
So, in worst case we need to write 2^16 times UBIFS_INO_NODE_SZ bytes to the journal.
I'm still looking into this option right now, just to be sure that this is really the
solution to the problem I see. Finding a stable reproducer is also not easy...
Another possibility is playing with ubifs_tnc_remove_ino(), upon unlink() it removes
all entries from the TNC and marks them as dirty.
The problem is that also all xattr inodes are marked as dirty.
If we manage to find a way that these get marked only as dirty when we are sure that
they are gone the GC will no longer unmap a "used" LEB.
But that all is very tricky.

What about: add the inode to be deleted to orphans, then delete the xattrs
one at a time, then when they are all gone, delete the inode and remove it
from orphans.

Richard Weinberger

2018-07-31 21:28:50 UTC