[Bugs] [Bug 1634664] Inconsistent quorum checks during open and fd based operations

bugzilla at redhat.com bugzilla at redhat.com
Mon Oct 1 11:27:43 UTC 2018


https://bugzilla.redhat.com/show_bug.cgi?id=1634664



--- Comment #1 from Raghavendra G <rgowdapp at redhat.com> ---
I seem to be running into fstat failures on a 1x3 replicated volume while
running pgbench on it. I see corruption warnings even when all perf xlators are
turned off. 

pgbench has the pattern of,

1. Application 1, creating a tmp file, writing to it and renaming it to "file"
2. Application 2, Keeps opening "file" and does reads and fstats and closes the
fd

I see fstat on an fd failing with ENOTCONN from strace of application 2. On
delving deeper,

[2018-07-12 12:55:11.241793] T [fuse-bridge.c:2015:fuse_rename_resume]
0-glusterfs-fuse: 14245087: RENAME
`<gfid:ff5d66b2-7fde-45bd-b873-d688ae604990>/pg_stat_tmp/pgstat.tmp ()' ->
`<gfid:ff5d66b2-7fde-45bd-b873
-d688ae604990>/pg_stat_tmp/pgstat.stat ()'
[2018-07-12 12:55:11.244548] T [fuse-bridge.c:2347:fuse_open_resume]
0-glusterfs-fuse: 14245088: OPEN
<gfid:ff5d66b2-7fde-45bd-b873-d688ae604990>/pg_stat_tmp/pgstat.stat
[2018-07-12 12:55:11.246429] T [MSGID: 0]
[afr-lk-common.c:248:afr_unlock_common_cbk] 8-db-replicate-0: All internal
locks unlocked
[2018-07-12 12:55:11.247258] D [MSGID: 114031]
[client-rpc-fops_v2.c:265:client4_0_open_cbk] 8-db-client-0: remote operation
failed. Path:
<gfid:ff5d66b2-7fde-45bd-b873-d688ae604990>/pg_stat_tmp/pgstat.stat
(02e0f34b-f9a5-46ff-83fe-abca95955e8b) [No such file or directory]
[2018-07-12 12:55:11.247414] D [MSGID: 114031]
[client-rpc-fops_v2.c:265:client4_0_open_cbk] 8-db-client-1: remote operation
failed. Path:
<gfid:ff5d66b2-7fde-45bd-b873-d688ae604990>/pg_stat_tmp/pgstat.stat
(02e0f34b-f9a5-46ff-83fe-abca95955e8b) [No such file or directory]
[2018-07-12 12:55:11.247597] T [fuse-bridge.c:1111:fuse_fd_cbk]
0-glusterfs-fuse: 14245088: OPEN()
<gfid:ff5d66b2-7fde-45bd-b873-d688ae604990>/pg_stat_tmp/pgstat.stat =>
0x7fb32bd8ed98
[2018-07-12 12:55:11.247759] T [fuse-bridge.c:1945:fuse_rename_cbk]
0-glusterfs-fuse: 14245087:
<gfid:ff5d66b2-7fde-45bd-b873-d688ae604990>/pg_stat_tmp/pgstat.tmp ->
<gfid:ff5d66b2-7fde-45bd-b873-d688ae604990>/pg_stat_tmp/pgstat.stat => 0
(buf->ia_ino=10066098376112020953)

Note that open failed on two subvolumes of afr, but open fop itself is
successful as can be seen from fuse_fd_cbk output.

Later when fstat comes on the same fd 0x7fb32bd8ed98

[2018-07-12 12:55:11.247963] T [fuse-bridge.c:955:fuse_getattr_resume]
0-glusterfs-fuse: 14245089: FGETATTR 140407508526648
(<gfid:02e0f34b-f9a5-46ff-83fe-abca95955e8b>/0x7fb32bd8ed98)
[2018-07-12 12:55:11.248224] D [MSGID: 0] [afr-open.c:332:afr_fix_open]
8-db-replicate-0: need open count: 2
[2018-07-12 12:55:11.248253] D [MSGID: 0] [afr-open.c:352:afr_fix_open]
8-db-replicate-0: opening fd for file
<gfid:02e0f34b-f9a5-46ff-83fe-abca95955e8b> on subvolume db-client-0
[2018-07-12 12:55:11.248424] D [MSGID: 0] [afr-open.c:352:afr_fix_open]
8-db-replicate-0: opening fd for file
<gfid:02e0f34b-f9a5-46ff-83fe-abca95955e8b> on subvolume db-client-1
[2018-07-12 12:55:11.248562] D [MSGID: 0] [afr-read-txn.c:252:afr_read_txn]
8-db-replicate-0: 02e0f34b-f9a5-46ff-83fe-abca95955e8b: generation now vs
cached: 3, 3
[2018-07-12 12:55:11.248742] D [MSGID: 0]
[afr-lk-common.c:694:afr_lock_blocking] 8-db-replicate-0: we're done locking
[2018-07-12 12:55:11.249756] T [MSGID: 0]
[afr-lk-common.c:248:afr_unlock_common_cbk] 8-db-replicate-0: All internal
locks unlocked
[2018-07-12 12:55:11.250458] D [MSGID: 114031]
[client-rpc-fops_v2.c:265:client4_0_open_cbk] 8-db-client-1: remote operation
failed. Path: <gfid:02e0f34b-f9a5-46ff-83fe-abca95955e8b>
(02e0f34b-f9a5-46ff-83fe-abc
a95955e8b) [Stale file handle]
[2018-07-12 12:55:11.250513] D [MSGID: 108009]
[afr-open.c:229:afr_openfd_fix_open_cbk] 8-db-replicate-0: Failed to open
<gfid:02e0f34b-f9a5-46ff-83fe-abca95955e8b> on subvolume db-client-1 [Stale
file handle]
[2018-07-12 12:55:11.251015] D [MSGID: 114031]
[client-rpc-fops_v2.c:265:client4_0_open_cbk] 8-db-client-0: remote operation
failed. Path: <gfid:02e0f34b-f9a5-46ff-83fe-abca95955e8b>
(02e0f34b-f9a5-46ff-83fe-abc
a95955e8b) [Stale file handle]
[2018-07-12 12:55:11.251066] D [MSGID: 108009]
[afr-open.c:229:afr_openfd_fix_open_cbk] 8-db-replicate-0: Failed to open
<gfid:02e0f34b-f9a5-46ff-83fe-abca95955e8b> on subvolume db-client-0 [Stale
file handle]
[2018-07-12 12:55:11.251462] W [MSGID: 114031]
[client-rpc-fops_v2.c:1258:client4_0_fstat_cbk] 8-db-client-0: remote operation
failed [No such file or directory]
[2018-07-12 12:55:11.253450] D [MSGID: 0]
[dht-inode-read.c:182:dht_file_attr_cbk] 8-db-dht: subvolume db-replicate-0
returned -1 [Transport endpoint is not connected]
[2018-07-12 12:55:11.253574] W [fuse-bridge.c:899:fuse_attr_cbk]
0-glusterfs-fuse: 14245089: FSTAT() <gfid:02e0f34b-f9a5-46ff-83fe-abca95955e8b>
=> -1 (Transport endpoint is not connected)


[2018-07-12 12:55:11.253739] T [fuse-bridge.c:955:fuse_getattr_resume]
0-glusterfs-fuse: 14245090: FGETATTR 140407508526648 ((null)/0x7fb32bd8ed98)
[2018-07-12 12:55:11.253959] D [MSGID: 0] [afr-open.c:332:afr_fix_open]
8-db-replicate-0: need open count: 2
[2018-07-12 12:55:11.253984] D [MSGID: 0] [afr-open.c:352:afr_fix_open]
8-db-replicate-0: opening fd for file
<gfid:02e0f34b-f9a5-46ff-83fe-abca95955e8b> on subvolume db-client-0
[2018-07-12 12:55:11.254178] D [MSGID: 0] [afr-open.c:352:afr_fix_open]
8-db-replicate-0: opening fd for file
<gfid:02e0f34b-f9a5-46ff-83fe-abca95955e8b> on subvolume db-client-1
[2018-07-12 12:55:11.254300] D [MSGID: 0] [afr-read-txn.c:252:afr_read_txn]
8-db-replicate-0: 02e0f34b-f9a5-46ff-83fe-abca95955e8b: generation now vs
cached: 3, 3
[2018-07-12 12:55:11.254544] T [fuse-bridge.c:2525:fuse_write_resume]
0-glusterfs-fuse: 14245092: WRITE (0x7fb31ed28a98, size=8192, offset=138829824)
[2018-07-12 12:55:11.254696] T [MSGID: 0] [afr-lk-common.c:60:afr_set_lk_owner]
8-db-replicate-0: Setting lk-owner=140407222969688
[2018-07-12 12:55:11.256316] D [MSGID: 114031]
[client-rpc-fops_v2.c:1350:client4_0_finodelk_cbk] 8-db-client-2: remote
operation failed [Resource temporarily unavailable]
[2018-07-12 12:55:11.256506] D [MSGID: 114031]
[client-rpc-fops_v2.c:265:client4_0_open_cbk] 8-db-client-1: remote operation
failed. Path: <gfid:02e0f34b-f9a5-46ff-83fe-abca95955e8b>
(02e0f34b-f9a5-46ff-83fe-abc
a95955e8b) [Stale file handle]
[2018-07-12 12:55:11.256584] D [MSGID: 108009]
[afr-open.c:229:afr_openfd_fix_open_cbk] 8-db-replicate-0: Failed to open
<gfid:02e0f34b-f9a5-46ff-83fe-abca95955e8b> on subvolume db-client-1 [Stale
file handle]
[2018-07-12 12:55:11.256685] D [MSGID: 114031]
[client-rpc-fops_v2.c:265:client4_0_open_cbk] 8-db-client-0: remote operation
failed. Path: <gfid:02e0f34b-f9a5-46ff-83fe-abca95955e8b>
(02e0f34b-f9a5-46ff-83fe-abc
a95955e8b) [Stale file handle]
[2018-07-12 12:55:11.256737] D [MSGID: 108009]
[afr-open.c:229:afr_openfd_fix_open_cbk] 8-db-replicate-0: Failed to open
<gfid:02e0f34b-f9a5-46ff-83fe-abca95955e8b> on subvolume db-client-0 [Stale
file handle]
[2018-07-12 12:55:11.256781] D [MSGID: 114031]
[client-rpc-fops_v2.c:1350:client4_0_finodelk_cbk] 8-db-client-1: remote
operation failed [Resource temporarily unavailable]
[2018-07-12 12:55:11.257089] W [MSGID: 114031]
[client-rpc-fops_v2.c:1258:client4_0_fstat_cbk] 8-db-client-0: remote operation
failed [No such file or directory]
[2018-07-12 12:55:11.257190] D [MSGID: 114031]
[client-rpc-fops_v2.c:1350:client4_0_finodelk_cbk] 8-db-client-0: remote
operation failed [Resource temporarily unavailable]
[2018-07-12 12:55:11.257305] T [MSGID: 0]
[afr-lk-common.c:1006:afr_nonblocking_inodelk_cbk] 8-db-replicate-0: Last inode
locking reply received
[2018-07-12 12:55:11.257326] T [MSGID: 0]
[afr-lk-common.c:1020:afr_nonblocking_inodelk_cbk] 8-db-replicate-0: 0 servers
locked. Trying again with blocking calls
[2018-07-12 12:55:11.257381] T [MSGID: 0]
[afr-lk-common.c:321:afr_unlock_inodelk] 8-db-replicate-0: No internal locks
unlocked
[2018-07-12 12:55:11.257400] D [MSGID: 0]
[afr-transaction.c:1631:afr_post_nonblocking_inodelk_cbk] 8-db-replicate-0: Non
blocking inodelks failed. Proceeding to blocking
[2018-07-12 12:55:11.259009] D [MSGID: 0]
[dht-inode-read.c:182:dht_file_attr_cbk] 8-db-dht: subvolume db-replicate-0
returned -1 [Transport endpoint is not connected]
[2018-07-12 12:55:11.259079] W [fuse-bridge.c:899:fuse_attr_cbk]
0-glusterfs-fuse: 14245090: FSTAT() ERR => -1 (Transport endpoint is not
connected)

afr tries to "fix" fd by trying to open it on client-0 and client-1, which
fails as rename had overwritten the gfid. Then afr gives up and sends fstat to
client-0 which fails and the same error is communicated back to application.
Some questions,

* Why didn't afr send fstat to client-2 which has a valid fd opened?
* Does afr require Quorum on fstat? If yes, why did open succeeded even when
there was no quorum (open on client-0 and client-1 had failed at the time of
application sent open)?
* How to make sure either open from kernel fails or if succeeded all future
operations on it succeeds?

-- 
You are receiving this mail because:
You are on the CC list for the bug.
Unsubscribe from this bug https://bugzilla.redhat.com/token.cgi?t=sMBgqseEIT&a=cc_unsubscribe


More information about the Bugs mailing list