[Bugs] [Bug 1810842] New: frequent heal observed when file opened while one brick is down

bugzilla at redhat.com
Fri Mar 6 02:00:10 UTC 2020


https://bugzilla.redhat.com/show_bug.cgi?id=1810842

            Bug ID: 1810842
           Summary: frequent heal observed when file opened while one
                    brick is down
           Product: GlusterFS
           Version: 7
          Hardware: x86_64
                OS: Linux
            Status: NEW
         Component: protocol
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: zz.sh.cynthia at gmail.com
                CC: bugs at gluster.org
  Target Milestone: ---
    Classification: Community



Description of problem:
Frequent file heals are observed when a file is opened while one brick is down.

Version-Release number of selected component (if applicable):

glusterfs 7

How reproducible:
Consistently; the EBADFD warnings recur on every flush fop until the
glusterfs client process is restarted.

Steps to Reproduce:
1>      Open a file from client0 (sn0) while the brick on sn1 is down.
2>      Bring the sn1 brick back up.
3>      Raise/cancel an alarm on client0 or client1 with: fsclish -c "set alarm
raise specific-problem 70025 managed-object cluster application-id node:/mn-X"
4>      Check heal info with "gluster v heal services info".


Actual results:
[root@mn-1:/home/robot]
# gluster v heal services info
Brick mn-0.local:/mnt/bricks/services/brick
/SS_AlLightProcessor/AlarmFileSystem/AlarmHistory/alarm-event-history.0002
Status: Connected
Number of entries: 1

Brick mn-1.local:/mnt/bricks/services/brick
Status: Connected
Number of entries: 0

Brick dbm-0.local:/mnt/bricks/services/brick
/SS_AlLightProcessor/AlarmFileSystem/AlarmHistory/alarm-event-history.0002
Status: Connected
Number of entries: 1


Expected results:

No pending heal entries should be reported.


Additional info:
The following is the mail discussion with a GlusterFS expert:

Hi GlusterFS expert,
 Good day!
  While testing glusterfs7, I often see the following warning logs on the
glusterfs client. Without restarting the glusterfs client process, these logs
reappear every time the affected files perform a flush fop; this is a
permanent issue.

[2020-03-04 06:13:50.044046] W [MSGID: 114061]
[client-common.c:2625:client_pre_flush_v2] 0-services-client-1: 
(1f074c5e-7442-4044-9663-5c30be6ae59d) remote_fd is -1. EBADFD [File descriptor
in bad state]
[2020-03-04 06:13:50.045122] W [MSGID: 114061]
[client-common.c:2625:client_pre_flush_v2] 0-services-client-1: 
(690697bf-2f95-44fb-b4d7-bd26de32aae2) remote_fd is -1. EBADFD [File descriptor
in bad state]
[2020-03-04 06:13:50.045677] W [MSGID: 114061]
[client-common.c:2625:client_pre_flush_v2] 0-services-client-1: 
(75ac50c2-a7ba-4317-8763-d726eac4eeb1) remote_fd is -1. EBADFD [File descriptor
in bad state]
[2020-03-04 06:13:50.046181] W [MSGID: 114061]
[client-common.c:2625:client_pre_flush_v2] 0-services-client-1: 
(392449a5-cf6c-4891-9402-6c3891c01b05) remote_fd is -1. EBADFD [File descriptor
in bad state]
[2020-03-04 06:13:50.047041] W [MSGID: 114061]
[client-common.c:2625:client_pre_flush_v2] 0-services-client-1: 
(fe314cbc-96b0-4ade-9b0f-a3084e7c1a64) remote_fd is -1. EBADFD [File descriptor
in bad state]
[2020-03-04 06:13:50.049349] W [MSGID: 114061]
[client-common.c:2644:client_pre_fsync_v2] 0-services-client-1: 
(690697bf-2f95-44fb-b4d7-bd26de32aae2) remote_fd is -1. EBADFD [File descriptor
in bad state]



  I compared the glusterfs7 and glusterfs 3.12 source code, and I believe this
was introduced by the following commit:
SHA-1: 92ae26ae8039847e38c738ef98835a14be9d4296

* protocol/client: Do not fallback to anon-fd if fd is not open

[Analysis:]
  Starting from the commit message, I checked the source code and found that,
without restarting the glusterfs client process, every flush operation on the
affected files fails (also confirmed by my local test), because each time
client_pre_flush_v2 aborts the fop without actually sending a flush request to
the remote brick process.
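
To make the fop-path difference concrete, here is a minimal, self-contained
sketch of the remote-fd lookup pattern described above. It is not the actual
GlusterFS source: the struct and helper are invented for illustration, and the
flag/sentinel names (FALLBACK_TO_ANON_FD, GF_ANON_FD_NO) are assumptions
mirroring the identifiers discussed in this thread.

/* sketch.c: remote-fd lookup with and without the anon-fd fallback */
#include <errno.h>
#include <stdio.h>

#define DEFAULT_REMOTE_FD 0   /* no fallback: fail if the fd is not open */
#define FALLBACK_TO_ANON_FD 1 /* substitute an anonymous fd on failure   */
#define GF_ANON_FD_NO (-2)    /* sentinel the brick resolves by GFID     */

struct fake_fd {
    long long remote_fd; /* -1 when the brick-side open never happened */
};

/* Resolve the brick-side fd. Returns 0 on success; returns -1 with errno
 * set to EBADFD when the fd is not open and no fallback is allowed. */
static int get_remote_fd(struct fake_fd *fd, int flags, long long *remote_fd)
{
    if (fd->remote_fd != -1) {
        *remote_fd = fd->remote_fd;
        return 0;
    }
    if (flags == FALLBACK_TO_ANON_FD) {
        *remote_fd = GF_ANON_FD_NO; /* readv/writev-style fallback */
        return 0;
    }
    errno = EBADFD; /* flush-style: abort the fop and log the warning */
    return -1;
}

int main(void)
{
    /* fd opened while the brick was down: no brick-side fd exists */
    struct fake_fd stale = { .remote_fd = -1 };
    long long rfd;

    /* flush path: no fallback, so it fails on every call until the
     * client process is restarted (the permanent EBADFD above) */
    if (get_remote_fd(&stale, DEFAULT_REMOTE_FD, &rfd) < 0)
        printf("flush: remote_fd is -1. EBADFD (errno=%d)\n", errno);

    /* readv/writev path: the anon-fd fallback keeps the fop going */
    if (get_remote_fd(&stale, FALLBACK_TO_ANON_FD, &rfd) == 0)
        printf("readv: falls back to anon fd %lld\n", rfd);
    return 0;
}

Compiled and run, the first call fails with EBADFD just like the
client_pre_flush_v2 warnings above, while the second proceeds via the
anonymous fd.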
[Question:]
  1> Is this expected behavior of the glusterfs client? Why do client_pre_readv/
client_pre_writev/client_pre_finodelk... take FALLBACK_TO_ANON_FD to enable an
anonymous fd, but the flush fop does not?

The flush fop is not defined on an anon-fd. Flush is supposed to clean up the
resources held on an fd that was opened, such as locks, so it doesn't make
sense to have fallback-to-anon-fd for flush.

  2> This issue also has a side effect: each time a flush fop is executed from
client0 (sn0), glustershd on sn1 performs a heal, since the related files
always appear in the volume heal info output. Is this heal necessary?

A flush shouldn't result in any pending data/metadata heals. However, I see
the following in the logs you sent:
[2020-03-04 06:13:50.049349] W [MSGID: 114061]
[client-common.c:2644:client_pre_fsync_v2] 0-services-client-1: 
(690697bf-2f95-44fb-b4d7-bd26de32aae2) remote_fd is -1. EBADFD [File descriptor
in bad state]

fsync can lead to pending flags. fsync is an inode operation, so for fsync we
can add a fall-back-to-anon-fd. Could you check if that fixes the issue you are
facing? If yes, could you send that patch?
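
In terms of the sketch above, the suggested change would amount to switching
the fsync pre-op from the no-fallback flag to the fallback flag. A
hypothetical one-line sketch (assuming the client-side pre-ops resolve the
brick fd through a CLIENT_GET_REMOTE_FD-style macro that takes such a flag;
the macro name is an assumption, not verified against the tree):

/* in client_pre_fsync_v2(), hypothetical sketch of the suggested patch */
-    CLIENT_GET_REMOTE_FD(this, fd, DEFAULT_REMOTE_FD, remote_fd, op_errno, out);
+    CLIENT_GET_REMOTE_FD(this, fd, FALLBACK_TO_ANON_FD, remote_fd, op_errno, out);

With the fallback in place, the fsync should reach the brick through an
anonymous fd instead of failing with EBADFD, so it would no longer leave
pending flags behind to show up in heal info.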


Cynthia
