[Bugs] [Bug 1444892] When either killing or restarting a brick with performance.stat-prefetch on, stat sometimes returns a bad st_size value.

bugzilla at redhat.com bugzilla at redhat.com
Fri May 5 08:13:45 UTC 2017


https://bugzilla.redhat.com/show_bug.cgi?id=1444892



--- Comment #6 from miklos.fokin at appeartv.com ---
(In reply to Ravishankar N from comment #4)
> (In reply to miklos.fokin from comment #2)
> 
> > In afr_fsync_cbk, when receiving data from the bricks there are two times
> > when the fstat data can get updated: first is the initial update, and then
> > the one from the brick we selected to get the final data from.
> > During the debugging I found that if the initial update is coming from the
> > arbiter the size will be 0.
> > If the brick we selected to get the final data from is down, we get a struct
> > filled with zeroes, and an error value, thus we don't get a second update.
> 
> Hi Miklos, did you observe while debugging that in afr_fsync_cbk(), the
> 'read_subvol' is indeed the brick you brought down? I haven't tried your
> test yet but if a brick is brought down, then the read_subvol is updated on
> the next lookup/inode_refresh where the brick is not marked readable any
> more until heal is complete. So, if you brought down brick1, then the call
> to afr_data_subvol_get() in afr_fsync_cbk() should give you brick2.

Hello Ravishankar, I added logging to afr_fsync_cbk() to see which code paths
were taken (I had done the same in other functions first, and the debugging
eventually led me there).
I also logged post_buff on each call to the function.

The first log was in the "if (op_ret == 0) {" branch.
The second was in "if (local->op_ret == -1)".
The third was in "if (child_index == read_subvol)".
I am also attaching the complete diff with the logs and the fix, since it is
probably easier to follow the code from there.

When things went wrong, I received 3 calls:
  1 with a mostly correct structure, except that the number of blocks and the
size of the file were zero; here I got logs from the first and second places
  1 with a completely correct structure, but here I only got a log from the
first place
  1 with a completely zeroed structure and no logs from any of the branches

I am writing this from memory, since I moved on and didn't save the logs, but I
can reproduce it and send you the file if you need it.
These findings led me to believe that the third reply was the one from the
selected brick: the third branch was never taken, and because that reply
returned an error it was never processed.
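
To make the sequence above easier to follow, here is a small model of it in C.
It is only a sketch and not the GlusterFS source: struct reply_stat, struct
fsync_state, fsync_state_init() and on_reply() are hypothetical names made up
for illustration, and the only parts taken from afr_fsync_cbk() are the three
branch conditions quoted earlier. It assumes the first successful reply does
the initial update and that only a successful reply from read_subvol can
overwrite it.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the stat data returned by a brick. */
struct reply_stat {
    uint64_t size;
    uint64_t blocks;
};

/* Hypothetical per-call state, loosely modelled on the description above. */
struct fsync_state {
    int op_ret;               /* -1 until the first successful reply arrives */
    int read_subvol;          /* brick selected to provide the final data */
    struct reply_stat result; /* what the caller ends up seeing */
};

void fsync_state_init(struct fsync_state *st, int read_subvol)
{
    memset(st, 0, sizeof(*st));
    st->op_ret = -1;
    st->read_subvol = read_subvol;
}

/* Models one reply arriving from brick child_index. */
void on_reply(struct fsync_state *st, int child_index, int op_ret,
              const struct reply_stat *post_buff)
{
    if (op_ret == 0) {                        /* first logged branch */
        if (st->op_ret == -1) {               /* second logged branch */
            /* Initial update: taken from whichever brick answers first. */
            st->op_ret = 0;
            st->result = *post_buff;
        }
        if (child_index == st->read_subvol) { /* third logged branch */
            /* Final update: only a successful reply from read_subvol. */
            st->result = *post_buff;
        }
    }
    /* A failed reply is ignored, even if it comes from read_subvol. */
}
```

Feeding this model a successful arbiter reply with size 0, then a successful
reply with the real size from a brick that is not read_subvol, then a failed
reply from read_subvol leaves result.size at 0, which matches the three calls
listed above. If read_subvol is the brick that answered with the real size,
the result is overwritten with the correct value.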
