[Bugs] [Bug 1521041] New: rpc: fix the timedout tests

bugzilla at redhat.com bugzilla at redhat.com
Tue Dec 5 16:55:30 UTC 2017


https://bugzilla.redhat.com/show_bug.cgi?id=1521041

            Bug ID: 1521041
           Summary: rpc: fix the timedout tests
           Product: GlusterFS
           Version: mainline
         Component: rpc
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: atumball at redhat.com
                CC: bugs at gluster.org



Description of problem:

rpc: Prevent frame-timeouts from hanging syncops

Summary:
While testing the SHD threading code, it was observed that under high load
SHD/AFR-related SyncOps and SyncTasks can hang or deadlock, because the
transport-disconnected event is never bubbled up correctly for frame
timeouts.  Various tests indicated that ping timeouts worked fine, while
"frame timeouts" did not.  The only difference: a ping timeout actually
disconnects the transport, while a frame timeout does not.  At a high level,
we know that disconnecting on a frame timeout prevents the deadlock, since
subsequent tests showed the deadlocks no longer occurred after this change.
That said, there may be a more elegant solution.  For now, though, forcing a
reconnect is preferable to hanging clients or deadlocking the SHD.
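
The difference described above can be illustrated with a toy model.  This is
only a sketch, not GlusterFS code: it assumes the hang comes down to a waiter
that is released only by a disconnect event, and the names ToyTransport,
syncop_wait and run are hypothetical.

```python
import threading


class ToyTransport:
    """Toy stand-in for an RPC transport (hypothetical, not GlusterFS code)."""

    def __init__(self, disconnect_on_frame_timeout):
        self.disconnect_on_frame_timeout = disconnect_on_frame_timeout
        self.disconnected = threading.Event()

    def on_frame_timeout(self):
        # Patched behaviour as described in the report: a frame timeout
        # drops the transport, so a disconnect reaches every waiter.
        # Unpatched behaviour: nothing is signalled.
        if self.disconnect_on_frame_timeout:
            self.disconnected.set()


def syncop_wait(transport, patience):
    """Block like a waiting syncop.

    Returns True if a disconnect woke the waiter, False if it was still
    blocked after `patience` seconds (our stand-in for "hung").
    """
    return transport.disconnected.wait(timeout=patience)


def run(disconnect_on_frame_timeout, frame_timeout=0.05, patience=0.5):
    transport = ToyTransport(disconnect_on_frame_timeout)
    # The timer plays the role of the frame-timeout handler firing.
    timer = threading.Timer(frame_timeout, transport.on_frame_timeout)
    timer.start()
    woken = syncop_wait(transport, patience)
    timer.join()
    return woken


if __name__ == "__main__":
    print("frame timeout disconnects:", run(True))
    print("frame timeout does not disconnect:", run(False))
```

In this model run(True) returns True (the waiter is released) and run(False)
returns False (the waiter is still blocked when its patience runs out).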

Test Plan:
It's fairly difficult to write a good prove test for this since it requires
human eyes to observe if the SHD is deadlocked (I'm open to ideas).  Here's the
repro though:
1. Create a 3x replicated cluster on a host.
2. Set the frame-timeout low (say, 2 seconds).
3. Down a brick, and write a pile of files (maybe 2000).
4. Bring up the downed brick and let the SHD begin healing files.
5. During the heal process, hang one of the bricks with
   kill -STOP <pid of brick>.

Without this patch the SHD will be deadlocked, even though the frame timed out
after 2 seconds.  With the patch, the plug is pulled on the transport, a
disconnect is bubbled up to the syncop, and the SHD resumes.
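
The repro steps above could be written down as a command list, for example
with the dry-run helper below.  It only prints commands and runs nothing.
Everything specific is an assumption rather than part of the report: the
volume name, host, brick paths and mount point are hypothetical, the
frame-timeout is assumed to be the network.frame-timeout volume option,
kill -TERM is just one possible way to down a brick, and the brick pid is
left as a placeholder.

```python
import shlex


def repro_commands(volume="testvol", host="localhost", brick_root="/bricks",
                   mount="/mnt/testvol", frame_timeout=2, file_count=2000):
    """Return (step title, list of argv lists) pairs for the repro steps.

    All default names and paths are hypothetical placeholders.
    """
    bricks = [f"{host}:{brick_root}/{volume}-{i}" for i in range(1, 4)]
    write_files = (f"for i in $(seq 1 {file_count}); "
                   f"do echo data > {mount}/file-$i; done")
    return [
        ("1. create, start and mount a 3x replicated volume on one host",
         [["gluster", "volume", "create", volume, "replica", "3",
           *bricks, "force"],
          ["gluster", "volume", "start", volume],
          ["mount", "-t", "glusterfs", f"{host}:/{volume}", mount]]),
        ("2. set the frame-timeout low (assumed option name)",
         [["gluster", "volume", "set", volume,
           "network.frame-timeout", str(frame_timeout)]]),
        ("3. down a brick (assumed: kill -TERM), then write a pile of files",
         [["kill", "-TERM", "<pid of brick>"],
          ["sh", "-c", write_files]]),
        ("4. bring the downed brick back so the SHD begins healing",
         [["gluster", "volume", "start", volume, "force"]]),
        ("5. during the heal, hang one of the bricks",
         [["kill", "-STOP", "<pid of brick>"]]),
    ]


def render(steps):
    """Render the steps as a commented, shell-quoted command listing."""
    lines = []
    for title, commands in steps:
        lines.append(f"# {title}")
        lines.extend(shlex.join(argv) for argv in commands)
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(repro_commands()))
```

Whether the SHD is deadlocked afterwards still has to be judged by hand, as
noted in the test plan.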


