[Bugs] [Bug 1810842] New: frequent heal observed when file opened during one brick is down
bugzilla at redhat.com
Fri Mar 6 02:00:10 UTC 2020
https://bugzilla.redhat.com/show_bug.cgi?id=1810842
Bug ID: 1810842
Summary: frequent heal observed when file opened during one
brick is down
Product: GlusterFS
Version: 7
Hardware: x86_64
OS: Linux
Status: NEW
Component: protocol
Severity: high
Assignee: bugs at gluster.org
Reporter: zz.sh.cynthia at gmail.com
CC: bugs at gluster.org
Target Milestone: ---
Classification: Community
Description of problem:
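Frequent file heals are observed: a file that was opened while one brick was
down keeps showing up in the volume heal info output.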
Version-Release number of selected component (if applicable):
gluster7
How reproducible:
Steps to Reproduce:
1> Open a file from client0 on sn0 while the sn1-related brick is down.
2> Start up brick1.
3> Raise/cancel an alarm on client0 or client1 with “fsclish -c "set alarm
raise specific-problem 70025 managed-object cluster application-id node:/mn-X"”.
4> Show the volume heal info with “gluster v heal services info”.
Actual results:
[root@mn-1:/home/robot]
# gluster v heal services info
Brick mn-0.local:/mnt/bricks/services/brick
/SS_AlLightProcessor/AlarmFileSystem/AlarmHistory/alarm-event-history.0002
Status: Connected
Number of entries: 1
Brick mn-1.local:/mnt/bricks/services/brick
Status: Connected
Number of entries: 0
Brick dbm-0.local:/mnt/bricks/services/brick
/SS_AlLightProcessor/AlarmFileSystem/AlarmHistory/alarm-event-history.0002
Status: Connected
Number of entries: 1
Expected results:
no heal info
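The "no heal info" expectation can be checked mechanically. The following is a minimal sketch, not part of the original report: it assumes the output layout shown under "Actual results" (a "Brick" line, optional entry lines, then "Status:" and "Number of entries:"), and the helper names parse_heal_info and needs_heal are hypothetical.

```python
def parse_heal_info(text):
    """Map each brick to its list of pending heal entries.

    Assumes the layout shown in this report: a "Brick ..." line, zero or
    more entry lines, then "Status:" and "Number of entries:" lines.
    """
    bricks = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Brick "):
            current = line[len("Brick "):]
            bricks[current] = []
        elif current is None or not line:
            continue  # skip prompts, blank lines, anything before a brick
        elif line.startswith(("Status:", "Number of entries:")):
            continue  # bookkeeping lines, not heal entries
        else:
            bricks[current].append(line)
    return bricks


def needs_heal(text):
    """True if any brick lists at least one pending entry."""
    return any(parse_heal_info(text).values())


# Output copied from the "Actual results" section of this report.
SAMPLE = """\
Brick mn-0.local:/mnt/bricks/services/brick
/SS_AlLightProcessor/AlarmFileSystem/AlarmHistory/alarm-event-history.0002
Status: Connected
Number of entries: 1
Brick mn-1.local:/mnt/bricks/services/brick
Status: Connected
Number of entries: 0
Brick dbm-0.local:/mnt/bricks/services/brick
/SS_AlLightProcessor/AlarmFileSystem/AlarmHistory/alarm-event-history.0002
Status: Connected
Number of entries: 1
"""

if __name__ == "__main__":
    for brick, entries in parse_heal_info(SAMPLE).items():
        print(brick, len(entries))
```

Running needs_heal on the output captured after each alarm raise/cancel would show whether the file reappears, which is the behavior this bug describes.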
Additional info:
The following is a mail discussion with a GlusterFS expert:
Hi Glusterfs expert,
Good day!
While testing glusterfs7, I often find the following warning logs in the
glusterfs client. Unless the glusterfs client process is restarted, these logs
appear every time a flush fop is done on those files. This is a permanent
issue.
[2020-03-04 06:13:50.044046] W [MSGID: 114061]
[client-common.c:2625:client_pre_flush_v2] 0-services-client-1:
(1f074c5e-7442-4044-9663-5c30be6ae59d) remote_fd is -1. EBADFD [File descriptor
in bad state]
[2020-03-04 06:13:50.045122] W [MSGID: 114061]
[client-common.c:2625:client_pre_flush_v2] 0-services-client-1:
(690697bf-2f95-44fb-b4d7-bd26de32aae2) remote_fd is -1. EBADFD [File descriptor
in bad state]
[2020-03-04 06:13:50.045677] W [MSGID: 114061]
[client-common.c:2625:client_pre_flush_v2] 0-services-client-1:
(75ac50c2-a7ba-4317-8763-d726eac4eeb1) remote_fd is -1. EBADFD [File descriptor
in bad state]
[2020-03-04 06:13:50.046181] W [MSGID: 114061]
[client-common.c:2625:client_pre_flush_v2] 0-services-client-1:
(392449a5-cf6c-4891-9402-6c3891c01b05) remote_fd is -1. EBADFD [File descriptor
in bad state]
[2020-03-04 06:13:50.047041] W [MSGID: 114061]
[client-common.c:2625:client_pre_flush_v2] 0-services-client-1:
(fe314cbc-96b0-4ade-9b0f-a3084e7c1a64) remote_fd is -1. EBADFD [File descriptor
in bad state]
[2020-03-04 06:13:50.049349] W [MSGID: 114061]
[client-common.c:2644:client_pre_fsync_v2] 0-services-client-1:
(690697bf-2f95-44fb-b4d7-bd26de32aae2) remote_fd is -1. EBADFD [File descriptor
in bad state]
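To see which files and fops are affected, the warnings above can be grouped by client and fop. This is a rough sketch that assumes the message layout shown in these logs (including the line wrapping); the regular expression and the function name bad_fd_by_fop are illustrative, not part of GlusterFS.

```python
import re
from collections import defaultdict

# Matches the "remote_fd is -1" warnings as they appear in this report.
# \s+ is used between fields because the entries are wrapped across lines.
PATTERN = re.compile(
    r"\[client-common\.c:\d+:client_pre_(\w+?)_v2\]\s+"  # fop name, e.g. flush
    r"(\S+):\s+"                                         # client translator
    r"\(([0-9a-f-]{36})\)\s+remote_fd\s+is\s+-1"         # gfid of the file
)


def bad_fd_by_fop(log_text):
    """Return {(client, fop): set of gfids} for remote_fd == -1 warnings."""
    result = defaultdict(set)
    for fop, client, gfid in PATTERN.findall(log_text):
        result[(client, fop)].add(gfid)
    return dict(result)


# Three entries copied from the logs in this report.
SAMPLE_LOG = """\
[2020-03-04 06:13:50.044046] W [MSGID: 114061]
[client-common.c:2625:client_pre_flush_v2] 0-services-client-1:
(1f074c5e-7442-4044-9663-5c30be6ae59d) remote_fd is -1. EBADFD [File descriptor
in bad state]
[2020-03-04 06:13:50.045122] W [MSGID: 114061]
[client-common.c:2625:client_pre_flush_v2] 0-services-client-1:
(690697bf-2f95-44fb-b4d7-bd26de32aae2) remote_fd is -1. EBADFD [File descriptor
in bad state]
[2020-03-04 06:13:50.049349] W [MSGID: 114061]
[client-common.c:2644:client_pre_fsync_v2] 0-services-client-1:
(690697bf-2f95-44fb-b4d7-bd26de32aae2) remote_fd is -1. EBADFD [File descriptor
in bad state]
"""

if __name__ == "__main__":
    for (client, fop), gfids in sorted(bad_fd_by_fop(SAMPLE_LOG).items()):
        print(client, fop, sorted(gfids))
```

In this sample, gfid 690697bf-2f95-44fb-b4d7-bd26de32aae2 shows up under both flush and fsync.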
I compared the glusterfs7 and glusterfs3.12 source code, and I think this was
introduced by the following commit.
SHA-1: 92ae26ae8039847e38c738ef98835a14be9d4296
* protocol/client: Do not fallback to anon-fd if fd is not open
[Analysis:]
Based on the commit message, I checked the source code and found that, unless
the glusterfs client process is restarted, every flush operation on these
files fails (also confirmed by my local test), because client_pre_flush_v2
aborts the fop each time without actually sending the flush request to the
remote brick process.
[Question:]
1>Is this the expected behavior of the glusterfs client? Why do
client_pre_readv/ client_pre_writev/ client_pre_finodelk…. have
FALLBACK_TO_ANON_FD to enable an anonymous fd, while the flush fop does not?
[Answer:]
Flush fop is not defined on an anon-fd. Flush fop is supposed to do cleanup of
the resources on an fd that was opened, like locks etc. So it doesn't make
sense to have fallback-to-anon-fd for flush.
2>This issue also has a side effect: each time a flush fop is executed from
client0 (sn0), the sn1 glustershd does a heal, since the related files always
appear in the volume heal info command output. Is this heal necessary?
[Answer:]
Flush shouldn't result in any pending data/metadata heals. In the logs you
sent, I see the following:
[2020-03-04 06:13:50.049349] W [MSGID: 114061]
[client-common.c:2644:client_pre_fsync_v2] 0-services-client-1:
(690697bf-2f95-44fb-b4d7-bd26de32aae2) remote_fd is -1. EBADFD [File descriptor
in bad state]
fsync can lead to pending flags. fsync is an inode operation, so for fsync we
can add a fall-back-to-anon-fd. Could you check if that fixes the issue you are
facing? If yes, could you send that patch?
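The distinction drawn in the answers above can be illustrated with a toy model. This is only a sketch of the decision described in this thread, not GlusterFS code: the policy table, the ANON_FD placeholder value, and resolve_remote_fd are all hypothetical names, and the list of fops that fall back is taken from the question above rather than from the source tree.

```python
import errno
from enum import Enum


class FdPolicy(Enum):
    REQUIRE_OPEN_FD = 1  # fd-scoped cleanup (e.g. flush): needs a real fd
    ALLOW_ANON_FD = 2    # inode-level work: an anonymous fd is acceptable


REMOTE_FD_UNOPENED = -1  # "remote_fd is -1" in the client logs
ANON_FD = -2             # placeholder value chosen for this sketch

# Behavior as described in this report for glusterfs7.
REPORTED_POLICY = {
    "readv": FdPolicy.ALLOW_ANON_FD,
    "writev": FdPolicy.ALLOW_ANON_FD,
    "finodelk": FdPolicy.ALLOW_ANON_FD,
    "flush": FdPolicy.REQUIRE_OPEN_FD,
    "fsync": FdPolicy.REQUIRE_OPEN_FD,
}

# Change suggested in the reply: let fsync fall back, keep flush as is.
SUGGESTED_POLICY = dict(REPORTED_POLICY, fsync=FdPolicy.ALLOW_ANON_FD)

# EBADFD is Linux-specific; fall back to EBADF elsewhere.
EBADFD = getattr(errno, "EBADFD", errno.EBADF)


def resolve_remote_fd(fop, remote_fd, policy):
    """Pick the fd to send to the brick, or fail with EBADFD."""
    if remote_fd != REMOTE_FD_UNOPENED:
        return remote_fd  # fd was opened on this brick, use it
    if policy[fop] is FdPolicy.ALLOW_ANON_FD:
        return ANON_FD    # brick can serve the fop via the inode
    raise OSError(EBADFD, "remote_fd is -1", fop)


if __name__ == "__main__":
    for name, policy in (("reported", REPORTED_POLICY),
                         ("suggested", SUGGESTED_POLICY)):
        try:
            print(name, "fsync ->", resolve_remote_fd("fsync", -1, policy))
        except OSError as exc:
            print(name, "fsync -> error", exc.errno)
```

Under the suggested policy, fsync on a file opened while the brick was down would reach the brick instead of failing on the client; flush would still fail with EBADFD, which the reply above describes as expected.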
[root@mn-1:/home/robot]
# gluster v heal services info
Brick mn-0.local:/mnt/bricks/services/brick
/SS_AlLightProcessor/AlarmFileSystem/AlarmHistory/alarm-event-history.0002
Status: Connected
Number of entries: 1
Brick mn-1.local:/mnt/bricks/services/brick
Status: Connected
Number of entries: 0
Brick dbm-0.local:/mnt/bricks/services/brick
/SS_AlLightProcessor/AlarmFileSystem/AlarmHistory/alarm-event-history.0002
Status: Connected
Number of entries: 1
Cynthia