[Bugs] [Bug 1223758] New: Read operation on a file which is in split-brain condition is successful

bugzilla at redhat.com bugzilla at redhat.com
Thu May 21 12:06:18 UTC 2015


https://bugzilla.redhat.com/show_bug.cgi?id=1223758

            Bug ID: 1223758
           Summary: Read operation on a file which is in split-brain
                    condition is successful
           Product: Red Hat Gluster Storage
           Version: 3.1
         Component: gluster-afr
          Keywords: Triaged
          Severity: high
          Assignee: pkarampu at redhat.com
          Reporter: ssampat at redhat.com
        QA Contact: storage-qa-internal at redhat.com
                CC: bugs at gluster.org, gluster-bugs at redhat.com,
                    ravishankar at redhat.com, rtalur at redhat.com
        Depends On: 1220347
            Blocks: 1223636
             Group: redhat



+++ This bug was initially created as a clone of Bug #1220347 +++

Description of problem:
------------------------

`cat' on a file that is in split-brain condition succeeds. This should
ideally fail with `Input/output error'.

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
glusterfs-3.7.0beta1-0.69.git1a32479.el6.x86_64

How reproducible:
------------------
Always

Steps to Reproduce:
--------------------

1. Create a distributed-replicate volume and mount it via fuse.
2. Create a file `1' on the mount point -
# touch 1
3. Bring down one brick in the replica pair where `1' resides.
# kill -9 <pid-of-brick-process>
4. Write to the file -
# echo "Hello" > 1
5. Start the volume with the force option.
6. Bring down the other brick in the replica pair and write to the file again -
# echo "World" > 1
7. `cat' the file -
# cat 1
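
For illustration, a consolidated sketch of the steps above as a single shell
session (the volume name is taken from the volume info below; the mount
point, server name and brick PIDs are assumed placeholders):

# mount -t glusterfs dhcp37-126.lab.eng.blr.redhat.com:/2-test /mnt/2-test
# touch /mnt/2-test/1
# gluster volume status 2-test                 <-- note the PIDs of the replica pair holding `1'
# kill -9 <pid-of-first-brick>
# echo "Hello" > /mnt/2-test/1
# gluster volume start 2-test force            <-- brings the killed brick back up
# kill -9 <pid-of-other-brick>
# echo "World" > /mnt/2-test/1
# cat /mnt/2-test/1                            <-- should fail with EIO; returns "World" instead
# gluster volume heal 2-test info split-brain  <-- should list `1' as being in split-brain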

Actual results:
----------------

# cat 1
World

Expected results:
------------------

`cat' should fail with `Input/output error'.

Additional info:
-----------------

The volume configuration -

# gluster volume info 2-test

Volume Name: 2-test
Type: Distributed-Replicate
Volume ID: 0e312bd3-0473-4fdc-ba2f-7df53b9e9683
Status: Started
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: dhcp37-126.lab.eng.blr.redhat.com:/rhs/brick4/b1
Brick2: dhcp37-123.lab.eng.blr.redhat.com:/rhs/brick4/b1
Brick3: dhcp37-98.lab.eng.blr.redhat.com:/rhs/brick4/b1
Brick4: dhcp37-54.lab.eng.blr.redhat.com:/rhs/brick4/b1
Brick5: dhcp37-210.lab.eng.blr.redhat.com:/rhs/brick4/b1
Brick6: dhcp37-59.lab.eng.blr.redhat.com:/rhs/brick4/b1
Brick7: dhcp37-126.lab.eng.blr.redhat.com:/rhs/brick5/b1
Brick8: dhcp37-123.lab.eng.blr.redhat.com:/rhs/brick5/b1
Brick9: dhcp37-98.lab.eng.blr.redhat.com:/rhs/brick5/b1
Brick10: dhcp37-54.lab.eng.blr.redhat.com:/rhs/brick5/b1
Brick11: dhcp37-210.lab.eng.blr.redhat.com:/rhs/brick5/b1
Brick12: dhcp37-59.lab.eng.blr.redhat.com:/rhs/brick5/b1
Options Reconfigured:
performance.readdir-ahead: on
cluster.self-heal-daemon: off
cluster.entry-self-heal: off
cluster.data-self-heal: off
cluster.metadata-self-heal: off
features.uss: enable
features.quota: on
performance.write-behind: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
performance.quick-read: off
performance.open-behind: off
features.bitrot: on
features.scrub: Active
diagnostics.client-log-level: DEBUG

--- Additional comment from Ravishankar N on 2015-05-11 08:44:51 EDT ---

Observations from debugging the setup.

While debugging the mount process with gdb, it was observed that
afr_lookup_done() calls afr_inode_read_subvol_reset(); consequently, when
afr_read_txn()/afr_read_txn_refresh_done() is called, we bail out because
there are no readable subvolumes and the client gets EIO.

When gdb was not attached, the client again began reading stale data. On
further examination, it was observed that FUSE sends the following FOPs when
'cat' is performed on the mount:

1) fuse_fop_resume --> fuse_lookup_resume
2) fuse_fop_resume --> fuse_open_resume
3) fuse_fop_resume --> fuse_getattr_resume --> afr_fstat --> afr_read_txn -->
   bail out with EIO
4) fuse_fop_resume --> fuse_flush_resume


However, when 'cat' was run in rapid succession, (3) was not called, i.e.
only fuse_lookup_resume, fuse_open_resume and fuse_flush_resume were issued.
Since FUSE did not send the getattr, the client did not get the EIO and data
was served from the kernel cache. It was noted that the data returned was
always the most recently written copy, "World" in this case.

I don't think we should hit the issue if we 1) drop_caches on the existing
mount, or 2) do a remount, or 3) mount with the options attribute-timeout and
entry-timeout set to zero to begin with.
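
For illustration, the three workarounds as commands (the mount point and
server name are assumed placeholders; drop_caches and the attribute/entry
timeout mount options are standard kernel and FUSE-mount knobs):

# echo 3 > /proc/sys/vm/drop_caches            <-- 1) drop kernel caches on the client
# umount /mnt/2-test && mount -t glusterfs dhcp37-126.lab.eng.blr.redhat.com:/2-test /mnt/2-test   <-- 2) remount
# mount -t glusterfs -o attribute-timeout=0,entry-timeout=0 \
    dhcp37-126.lab.eng.blr.redhat.com:/2-test /mnt/2-test  <-- 3) mount with caching disabled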

--- Additional comment from Shruti Sampat on 2015-05-11 11:06:39 EDT ---


> 
> I don't think we should hit the issue if we 1) drop_caches on the existing
> mount, or 2) do a remount, or 3) mount with the options attribute-timeout
> and entry-timeout set to zero to begin with.

Tried each of the above 3 and did not hit the issue.

--- Additional comment from Raghavendra Talur on 2015-05-19 09:45:32 EDT ---

Can this be closed now that it is proven to be the kernel cache in action? Or
can this be taken as a feature?

Ravi, I guess you can decide.

--- Additional comment from Ravishankar N on 2015-05-19 10:09:32 EDT ---

Raghavendra G has suggested a fix where we set the attribute timeout to zero
for files that are in split-brain, forcing FUSE to send a
fuse_getattr_resume(). I'll send a patch for it; let us see if it is
acceptable. Keeping the bug open until then.
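
For illustration, a quick check once such a patch is in place (the mount
point is an assumed placeholder): repeated reads in quick succession should
no longer be served from the kernel cache.

# cat /mnt/2-test/1; cat /mnt/2-test/1         <-- with the fix, both reads should fail with `Input/output error'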


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1220347
[Bug 1220347] Read operation on a file which is in split-brain condition is
successful
https://bugzilla.redhat.com/show_bug.cgi?id=1223636
[Bug 1223636] 3.1 QE Tracker