[Bugs] [Bug 1593078] [GSS][SAS library corruption on GlusterFS]

bugzilla at redhat.com bugzilla at redhat.com
Wed Jun 20 03:02:40 UTC 2018


https://bugzilla.redhat.com/show_bug.cgi?id=1593078

Raghavendra G <rgowdapp at redhat.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED



--- Comment #1 from Raghavendra G <rgowdapp at redhat.com> ---
--- Additional comment from Csaba Henk on 2018-05-31 10:10:03 EDT ---

This is what we find in the fusedump:

The file dm_errori.sas7bdat is managed via two names: this one ("the base
file") and dm_errori.sas7bdat.lck ("the lck file"). At the start, the base
file exists and some interactions occur with it. Then the lck file is looked
up, not found, and created:

 
{"Unique":479868,"op":"LOOKUP","Nodeid":139842860710128,"Pid":14012,"matches":["dm_errori.sas7bdat.lck"]}
 
{"Unique":479869,"op":"LOOKUP","Nodeid":139842860710128,"Pid":14012,"matches":["dm_errori.sas7bdat.lck"]}
 
{"Unique":479870,"op":"CREATE","Nodeid":139842860710128,"Pid":14012,"matches":["dm_errori.sas7bdat.lck"],"nodemap":{"dm_errori.sas7bdat.lck":139842869444064},"size":[0]}

(These are just summaries of the respective fuse messages. Occasionally, when
it's needed to make my point, I'll provide the complete message. Here we know
that the LOOKUPs fail from the absence of the "nodemap" field, which indicates
the association of a file name with a nodeid. That the lookup is doubled has
no significance.)
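
As an aside, failed lookups like these can be picked out of the dump
mechanically. A minimal sketch, assuming the summaries are stored one JSON
object per line in a file whose name ("fusedump-summary.jsonl") is just a
placeholder here:

# Minimal sketch: list LOOKUP summaries that failed, i.e. have no "nodemap"
# field. Assumes one JSON summary object per line; the file name is a
# placeholder.
import json

def failed_lookups(path="fusedump-summary.jsonl"):
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            rec = json.loads(line)
            if rec.get("op") == "LOOKUP" and "nodemap" not in rec:
                yield rec["Unique"], rec.get("matches")

if __name__ == "__main__":
    for unique, matches in failed_lookups():
        print(unique, matches)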

Data is written to the lck file and finally it replaces the previous base file
via RENAME:

 
{"Unique":480062,"op":"RENAME","Nodeid":139842860710128,"Pid":14012,"matches":["dm_errori.sas7bdat.lck","dm_errori.sas7bdat"]}

Then for a long while the base file is used in various ways, until at some
point the above scenario seems to be replayed.

 
{"Unique":822175,"op":"LOOKUP","Nodeid":139842860710128,"Pid":53795,"matches":["dm_errori.sas7bdat.lck",139843072187888],"nodemap":{"dm_errori.sas7bdat.lck":139843072187888},"size":[1481113600]}
 
{"Unique":822176,"op":"LOOKUP","Nodeid":139842860710128,"Pid":53795,"matches":["dm_errori.sas7bdat.lck",139843072187888],"nodemap":{"dm_errori.sas7bdat.lck":139843072187888},"size":[1481113600]}
 
{"Unique":822177,"op":"OPEN","Nodeid":139843072187888,"Pid":53795,"matches":[139843072187888]}
 
{"Unique":822178,"op":"SETATTR","Nodeid":139843072187888,"Pid":53795,"matches":[139843072187888]}

(While this data lacks temporal information, we say "long while" based on the
Unique values, which index the fuse messages. So this second window of
activity comes roughly 342,000 messages later (Unique 480062 vs. 822175).)

This time the lck file is found, but the process wants to start from scratch:
the file is immediately truncated to 0 via the SETATTR request (Unique
822178), which can be seen in the complete request, inferring from the values
of the Valid and Size fields:

 
{"Truncated":false,"Msg":[null,"SETATTR",{"Len":128,"Opcode":4,"Unique":822178,"Nodeid":139843072187888,"Uid":4447,"Gid":9003,"Pid":53795,"Padding":0},[{"Valid":584,"Padding":0,"Fh":139842872218944,"Size":0,"LockOwner":8496685147530397345,"Atime":0,"Mtime":0,"Unused2":0,"Atimensec":0,"Mtimensec":0,"Unused3":0,"Mode":0,"Unused4":0,"Uid":0,"Gid":0,"Unused5":0}]]}

However, there is something irregular if we go back a bit further.

 
{"Unique":822172,"op":"LOOKUP","Nodeid":139842860710128,"Pid":53795,"matches":["dm_errori.sas7bdat",139843072187888],"nodemap":{"dm_errori.sas7bdat":139843072187888},"size":[1481113600]}
 
{"Unique":822173,"op":"OPEN","Nodeid":139843072187888,"Pid":53795,"matches":[139843072187888]}
 
{"Unique":822174,"op":"SETLK","Nodeid":139843072187888,"Pid":53795,"matches":[139843072187888]}
 
{"Unique":822175,"op":"LOOKUP","Nodeid":139842860710128,"Pid":53795,"matches":["dm_errori.sas7bdat.lck",139843072187888],"nodemap":{"dm_errori.sas7bdat.lck":139843072187888},"size":[1481113600]}
 
{"Unique":822176,"op":"LOOKUP","Nodeid":139842860710128,"Pid":53795,"matches":["dm_errori.sas7bdat.lck",139843072187888],"nodemap":{"dm_errori.sas7bdat.lck":139843072187888},"size":[1481113600]}
 
{"Unique":822177,"op":"OPEN","Nodeid":139843072187888,"Pid":53795,"matches":[139843072187888]}
 
{"Unique":822178,"op":"SETATTR","Nodeid":139843072187888,"Pid":53795,"matches":[139843072187888]}

Just before the lck file was accessed, the base file was also looked up -- and
it resolved to the same nodeid as the lck file! That is, they are the same
file under different names (hardlinks of each other). This is almost certainly
against the intent of the application, as the fusedump does not contain any
LINK messages (the request through which hardlinking is done).
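
For reference, a minimal sketch of scanning the summaries for such aliasing
(distinct names mapped to the same nodeid), again assuming the hypothetical
JSON-lines file from the sketch above:

# Minimal sketch: find distinct names that the dump maps to the same nodeid.
# Assumes the same placeholder "fusedump-summary.jsonl" file as above.
import json
from collections import defaultdict

names_by_nodeid = defaultdict(set)

with open("fusedump-summary.jsonl") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        for name, nodeid in rec.get("nodemap", {}).items():
            names_by_nodeid[nodeid].add(name)

for nodeid, names in names_by_nodeid.items():
    if len(names) > 1:
        print(nodeid, sorted(names))   # e.g. 139843072187888 with both names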

*** Please confirm whether the applications accessing the volume do any hard
linking -- since we are looking at the fuse dump of a single client, it is
theoretically possible that another client created the hardlink. ***

So, most likely the application assumes that the base file and the lck file
are independent entities, and it's an oddity of the glusterfs backend that
these two files are the same. The application might want to reset only the lck
file -- but that being the same as the base file, it does away with all of the
content of the base file as well.

Hence the corruption.

How can the base file and the lck file get accidentally hardlinked to each
other on the gluster backend? Most likely it's an issue with dht.

One possibility is that the RENAME above was not properly executed. In that
case the situation of the base and the lck file being the same would persist
for a longer period, from the RENAME (Unique 480062) up to the eventual
removal of the lck file (Unique 822200).

The other possibility is that just at the time when the base and the lck files
get resolved to the same nodeid, some other client is performing a RENAME on
them that is not sufficiently synchronized, and the LOOKUPs on this side hit
an inconsistent intermediate state; in this case the hardlinked situation is
ephemeral.

We'd be happy to get stat/getfattr information for *both files* (base and lck
-- or a note on one's absence), on *all bricks* and *all clients*.

However, it would be much more useful than post festum file metadata to
monitor what's going on during the workload. Is there a period when the lck
file is observably present? Is it a hard link of the base file?
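
If monitoring is an option, a minimal sketch of what such a probe could look
like on a client mountpoint (the path and the poll interval are placeholders
to adapt):

# Minimal sketch: poll a client mountpoint and report when the lck file is
# present and whether it shares an inode with the base file.
# The path and the interval below are placeholders.
import os
import time

BASE = "/mnt/gluster/dm_errori.sas7bdat"        # placeholder path
LCK = BASE + ".lck"

while True:
    try:
        lck_st = os.stat(LCK)
    except FileNotFoundError:
        lck_st = None
    if lck_st is not None:
        try:
            base_st = os.stat(BASE)
            same = (base_st.st_ino == lck_st.st_ino)
        except FileNotFoundError:
            same = False
        print(time.strftime("%H:%M:%S"),
              "lck present, nlink=%d, same inode as base: %s"
              % (lck_st.st_nlink, same))
    time.sleep(1)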

--- Additional comment from Raghavendra G on 2018-06-01 08:56:20 EDT ---

(In reply to Csaba Henk from comment #23)
> The other possibility is that just at the time when the base and the lck
> files get resolved to the same nodeid, some other client is performing a
> RENAME on them that is not sufficiently synchronized, and the LOOKUPs on
> this side hit an inconsistent intermediate state; in this case the
> hardlinked situation is ephemeral.

There is one scenario in dht_rename where, for a brief period, lookups on src
and dst can both succeed and identify them as hardlinks of each other. For
this scenario to happen, the following conditions must hold for a rename (src,
dst):

(dst-cached != src-cached) and (dst-hashed == src-hashed) and (src-cached !=
dst-hashed)

In this scenario the control flow of dht_rename is:

1. link (src, dst) on src-cached.
2. rename (src, dst) on dst-hashed/src-hashed (note that dst-hashed ==
src-hashed).
3. the rest of the rename, which removes the hardlink src on src-cached.

Note that between steps 2 and 3, until the hardlink is removed:
* lookup (src) fails on src-hashed, resulting in lookup-everywhere. Since the
hardlink src exists on src-cached, the lookup succeeds, mapping src to the
inode with src-gfid.
* lookup (dst) finds a linkto file on dst-hashed. The linkto file points to
src-cached, where we then find the hardlink dst. Lookup (dst) succeeds,
mapping dst to the inode with src-gfid.

Both src and dst are thus identified as hardlinks of the file with src-gfid in
the client's inode table, and the same result is conveyed back to the
application.
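
To make the window concrete, here is a toy model of the above flow -- plain
Python dictionaries standing in for the subvolumes, all names illustrative,
not gluster code:

# Toy model (not gluster code) of the dht_rename window described above.
# Each subvolume is a dict mapping name -> entry; an entry is either a data
# file ({"gfid": ...}) or a linkto pointer ({"linkto": subvol}).
src_cached = {"f.lck": {"gfid": "gfid-src"}}           # src's data file
src_hashed = {"f.lck": {"linkto": "src_cached"},       # src's linkto
              "f": {"linkto": "dst_cached"}}           # dst's linkto (dst-hashed == src-hashed)
dst_cached = {"f": {"gfid": "gfid-dst"}}               # dst's data file
subvols = {"src_cached": src_cached, "src_hashed": src_hashed,
           "dst_cached": dst_cached}

def lookup(name, hashed):
    """Roughly mimic dht lookup: try the hashed subvol, follow a linkto,
    otherwise search all subvols (lookup-everywhere)."""
    entry = hashed.get(name)
    if entry and "linkto" in entry:
        entry = subvols[entry["linkto"]].get(name)
    if entry is None:
        for sv in subvols.values():                    # lookup-everywhere
            if name in sv and "linkto" not in sv[name]:
                entry = sv[name]
                break
    return entry and entry["gfid"]

print(lookup("f.lck", src_hashed), lookup("f", src_hashed))  # gfid-src gfid-dst

# rename("f.lck", "f"), steps 1 and 2 of the flow above:
src_cached["f"] = src_cached["f.lck"]                  # 1. link(src, dst) on src-cached
src_hashed["f"] = src_hashed.pop("f.lck")              # 2. rename on dst-hashed/src-hashed

# Window between steps 2 and 3: both names resolve to the same gfid.
print(lookup("f.lck", src_hashed), lookup("f", src_hashed))  # gfid-src gfid-src

del src_cached["f.lck"]                                # 3. remove hardlink src
print(lookup("f.lck", src_hashed), lookup("f", src_hashed))  # None gfid-src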

If we've hit this scenario, we would see lookup (src) failing on src-hashed
and eventually being found on src-cached through lookup-everywhere. We need to
set diagnostics.client-log-level to DEBUG to see these logs.

The other workaround (if the hypothesis is true) is to set
cluster.lookup-optimize to on. When lookup-optimize is turned on, dht-lookup
does not resort to lookup-everywhere when src is not found on src-hashed;
instead it just conveys a failure to the application. Since the lookup won't
reach src-cached, it won't find the hardlink.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.

