[Gluster-users] Stale locks on shards

Samuli Heinonen samppah at neutraali.net
Mon Jan 29 05:20:11 UTC 2018


Hi!

Yes, thank you for asking. I found this line in the production 
environment:
lgetxattr("/tmp/zone2-ssd1-vmstor1.s6jvPu//.shard/f349ffbd-a423-4fb2-b83c-2d1d5e78e1fb.32", 
"glusterfs.clrlk.tinode.kblocked", 0x7f2d7c4379f0, 4096) = -1 EPERM 
(Operation not permitted)

And this one in the test environment (with posix locks):
lgetxattr("/tmp/g1.gHj4Bw//file38", "glusterfs.clrlk.tposix.kblocked", 
"box1:/gluster/1/export/: posix blocked locks=1 granted locks=0", 4096) = 77

In the test environment I ran the following command, which seemed to 
release the Gluster locks:

getfattr -n glusterfs.clrlk.tposix.kblocked file38
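As far as I can tell, the virtual xattr key is composed as 
glusterfs.clrlk.t<type>.k<kind>, matching the type/kind arguments of 
"gluster vol clear-locks" (this is my reading of the keys seen in the 
strace output above, so treat the decomposition as an assumption):

```shell
#!/bin/sh
# Sketch: compose the clear-locks virtual xattr key from the CLI arguments.
# type: posix | inode | entry   kind: blocked | granted | all
lock_type="posix"
lock_kind="blocked"
key="glusterfs.clrlk.t${lock_type}.k${lock_kind}"
echo "$key"   # -> glusterfs.clrlk.tposix.kblocked
# getfattr -n "$key" file38   # run against a file on the mounted volume
```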

So I think it would go like this in the production environment with locks 
on shards (using the aux-gfid-mount mount option):
getfattr -n glusterfs.clrlk.tinode.kall 
.shard/f349ffbd-a423-4fb2-b83c-2d1d5e78e1fb.32

I haven't been able to try this out in the production environment yet.
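For the record, the full sequence I have in mind would be roughly the 
following (a sketch only: the mount point is a placeholder, and I haven't 
verified on a live system whether the .shard path is reachable this way 
on an aux-gfid mount):

```shell
#!/bin/sh
# Sketch of the planned production procedure (untested; placeholders marked).
MNT=/mnt/vmstor1                                  # placeholder mount point
SHARD=f349ffbd-a423-4fb2-b83c-2d1d5e78e1fb.32     # shard from the strace above
# mount -t glusterfs -o aux-gfid-mount sto1z2.xxx:/zone2-ssd1-vmstor1 "$MNT"
target="$MNT/.shard/$SHARD"
echo getfattr -n glusterfs.clrlk.tinode.kall "$target"
# umount "$MNT"
```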

Is there perhaps something else I should take into account?

Would you be able to tell me more about bricks crashing after releasing 
locks? Under what circumstances does that happen? Is it only the process 
exporting the brick that crashes, or is there a possibility of data corruption?

Best regards,
Samuli Heinonen


Pranith Kumar Karampuri wrote:
> Hi,
>       Did you find the command from strace?
>
> On 25 Jan 2018 1:52 pm, "Pranith Kumar Karampuri" <pkarampu at redhat.com> wrote:
>
>
>
>     On Thu, Jan 25, 2018 at 1:49 PM, Samuli Heinonen
>     <samppah at neutraali.net> wrote:
>
>         Pranith Kumar Karampuri wrote on 25.01.2018 07:09:
>
>             On Thu, Jan 25, 2018 at 2:27 AM, Samuli Heinonen
>             <samppah at neutraali.net> wrote:
>
>                 Hi!
>
>                 Thank you very much for your help so far. Could you
>                 please give an example command showing how to use
>                 aux-gfid-mount to remove locks? "gluster vol
>                 clear-locks" seems to mount the volume by itself.
>
>
>             You are correct, sorry, this was implemented around 7 years
>             back and I
>             forgot that bit about it :-(. Essentially it becomes a getxattr
>             syscall on the file.
>             Could you give me the clear-locks command you were trying to
>             execute
>             and I can probably convert it to the getfattr command?
>
>
>         I have been testing this in test environment and with command:
>         gluster vol clear-locks g1
>         /.gfid/14341ccb-df7b-4f92-90d5-7814431c5a1c kind all inode
>
>
>     Could you do an strace of glusterd when this happens? It will have a
>     getxattr with "glusterfs.clrlk" in the key. You need to execute that
>     on the aux-gfid mount.
>
>
>
>
>                 Best regards,
>                 Samuli Heinonen
>
>                     Pranith Kumar Karampuri <pkarampu at redhat.com>
>                     23 January 2018 at 10.30
>
>                     On Tue, Jan 23, 2018 at 1:38 PM, Samuli Heinonen
>                     <samppah at neutraali.net> wrote:
>
>                     Pranith Kumar Karampuri wrote on 23.01.2018 09:34:
>
>                     On Mon, Jan 22, 2018 at 12:33 AM, Samuli Heinonen
>                     <samppah at neutraali.net> wrote:
>
>                     Hi again,
>
>                     here is more information regarding issue described
>                     earlier
>
>                     It looks like self-healing is stuck. According to
>                     "heal statistics" the crawl began at Sat Jan 20
>                     12:56:19 2018 and it's still going on (it's around
>                     Sun Jan 21 20:30 as I write this). However,
>                     glustershd.log says that the last heal was completed
>                     at "2018-01-20 11:00:13.090697" (which is 13:00
>                     UTC+2). Also, "heal info" has been running for over
>                     16 hours now without producing any output. In the
>                     statedump
>                     I can see that storage nodes have locks on files
>                     and some of those are blocked. I.e. here again it
>                     says that ovirt8z2 is holding an active lock even
>                     though ovirt8z2 crashed after the lock was granted:
>
>                     [xlator.features.locks.zone2-ssd1-vmstor1-locks.inode]
>                     path=/.shard/3d55f8cc-cda9-489a-b0a3-fd0f43d67876.27
>                     mandatory=0
>                     inodelk-count=3
>
>                     lock-dump.domain.domain=zone2-ssd1-vmstor1-replicate-0:self-heal
>                     inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0,
>                     start=0,
>                     len=0, pid
>                     = 18446744073709551610, owner=d0c6d857a87f0000,
>                     client=0x7f885845efa0,
>
>
>
>
>             connection-id=sto2z2.xxx-10975-2018/01/20-10:56:14:649541-zone2-ssd1-vmstor1-client-0-0-0,
>
>
>                     granted at 2018-01-20 10:59:52
>
>                     lock-dump.domain.domain=zone2-ssd1-vmstor1-replicate-0:metadata
>                     lock-dump.domain.domain=zone2-ssd1-vmstor1-replicate-0
>                     inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0,
>                     start=0,
>                     len=0, pid
>                     = 3420, owner=d8b9372c397f0000, client=0x7f8858410be0,
>
>                     connection-id=ovirt8z2.xxx.com-5652-2017/12/27-09:49:02:946825-zone2-ssd1-vmstor1-client-0-7-0,
>
>
>                     granted at 2018-01-20 08:57:23
>                     inodelk.inodelk[1](BLOCKED)=type=WRITE, whence=0,
>                     start=0,
>                     len=0,
>                     pid = 18446744073709551610, owner=d0c6d857a87f0000,
>                     client=0x7f885845efa0,
>
>
>
>
>             connection-id=sto2z2.xxx-10975-2018/01/20-10:56:14:649541-zone2-ssd1-vmstor1-client-0-0-0,
>
>
>                     blocked at 2018-01-20 10:59:52
>
>                     I'd also like to add that the volume had an arbiter
>                     brick before the crash happened. We decided to
>                     remove it because we thought that it was causing
>                     issues. However, now I think that this was
>                     unnecessary. After the crash the arbiter logs had
>                     lots of messages like this:
>                     [2018-01-20 10:19:36.515717] I [MSGID: 115072]
>                     [server-rpc-fops.c:1640:server_setattr_cbk]
>                     0-zone2-ssd1-vmstor1-server: 37374187: SETATTR
>                     <gfid:a52055bd-e2e9-42dd-92a3-e96b693bcafe>
>                     (a52055bd-e2e9-42dd-92a3-e96b693bcafe) ==> (Operation
>                     not
>                     permitted)
>                     [Operation not permitted]
>
>                     Is there any way to force self-heal to stop? Any
>                     help would be very much appreciated :)
>
>                     Exposing .shard to a normal mount is opening a can
>                     of worms. You should probably look at mounting the
>                     volume with the aux-gfid-mount option, where you
>                     can access a file as
>                     <path-to-mount>/.gfid/<gfid-string> to clear
>                     locks on it.
>
>                     Mount command:  mount -t glusterfs -o aux-gfid-mount
>                     vm1:test
>                     /mnt/testvol
>
>                     A gfid string will have some hyphens like:
>                     11118443-1894-4273-9340-4b212fa1c0e4
>
>                     That said, the next disconnect on the brick where
>                     you successfully did the clear-locks will crash the
>                     brick. There was a bug in the 3.8.x series with
>                     clear-locks which was fixed in 3.9.0 as part of a
>                     feature. The self-heal deadlock that you witnessed
>                     is also fixed in the 3.10 release.
>
>                     Thank you for the answer. Could you please tell me
>                     more about the crash? What will actually happen,
>                     and is there a bug report about it? We just want to
>                     make sure that we do everything we can to secure
>                     the data on the bricks. We will look into an
>                     upgrade, but we have to make sure that the new
>                     version works for us, and of course get
>                     self-healing working before doing anything :)
>
>                     The locks xlator/module maintains a list of locks
>                     that are granted to a client. Clear-locks had an
>                     issue where it forgot to remove the lock from this
>                     list, so the connection list ended up pointing to
>                     freed data after a clear-lock. When a disconnect
>                     happens, all the locks granted to a client need to
>                     be unlocked, so the process starts traversing this
>                     list, and when it tries to access the freed data it
>                     crashes. I found this while reviewing a feature
>                     patch sent by the Facebook folks to the locks xlator
>                     (http://review.gluster.org/14816) for 3.9.0, and
>                     they also fixed this bug as part of that feature
>                     patch.
>
>                     Br,
>                     Samuli
>
>                     3.8.x is EOLed, so I recommend you upgrade to a
>                     supported version soon.
>
>                     Best regards,
>                     Samuli Heinonen
>
>                     Samuli Heinonen
>                     20 January 2018 at 21.57
>
>                     Hi all!
>
>                     One hypervisor in our virtualization environment
>                     crashed and now some of the VM images cannot be
>                     accessed. After investigation we found out that
>                     there were lots of images that still had an active
>                     lock held by the crashed hypervisor. We were able
>                     to remove locks from "regular files", but it
>                     doesn't seem possible to remove locks from shards.
>
>                     We are running GlusterFS 3.8.15 on all nodes.
>
>                     Here is the part of the statedump that shows a
>                     shard having an active lock from the crashed node:
>
>                     [xlator.features.locks.zone2-ssd1-vmstor1-locks.inode]
>
>                     path=/.shard/75353c17-d6b8-485d-9baf-fd6c700e39a1.21
>                     mandatory=0
>                     inodelk-count=1
>
>                     lock-dump.domain.domain=zone2-ssd1-vmstor1-replicate-0:metadata
>
>                     lock-dump.domain.domain=zone2-ssd1-vmstor1-replicate-0:self-heal
>
>                     lock-dump.domain.domain=zone2-ssd1-vmstor1-replicate-0
>                     inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0,
>                     start=0, len=0,
>                     pid = 3568, owner=14ce372c397f0000,
>                     client=0x7f3198388770,
>                     connection-id
>
>
>
>
>             ovirt8z2.xxx-5652-2017/12/27-09:49:02:946825-zone2-ssd1-vmstor1-client-1-7-0,
>
>
>                     granted at 2018-01-20 08:57:24
>
>                     If we try to run clear-locks we get the following
>                     error message:
>                     # gluster volume clear-locks zone2-ssd1-vmstor1
>                     /.shard/75353c17-d6b8-485d-9baf-fd6c700e39a1.21
>                     kind
>                     all inode
>                     Volume clear-locks unsuccessful
>                     clear-locks getxattr command failed. Reason:
>                     Operation not
>                     permitted
>
>                     Gluster vol info if needed:
>                     Volume Name: zone2-ssd1-vmstor1
>                     Type: Replicate
>                     Volume ID: b6319968-690b-4060-8fff-b212d2295208
>                     Status: Started
>                     Snapshot Count: 0
>                     Number of Bricks: 1 x 2 = 2
>                     Transport-type: rdma
>                     Bricks:
>                     Brick1: sto1z2.xxx:/ssd1/zone2-vmstor1/export
>                     Brick2: sto2z2.xxx:/ssd1/zone2-vmstor1/export
>                     Options Reconfigured:
>                     cluster.shd-wait-qlength: 10000
>                     cluster.shd-max-threads: 8
>                     cluster.locking-scheme: granular
>                     performance.low-prio-threads: 32
>                     cluster.data-self-heal-algorithm: full
>                     performance.client-io-threads: off
>                     storage.linux-aio: off
>                     performance.readdir-ahead: on
>                     client.event-threads: 16
>                     server.event-threads: 16
>                     performance.strict-write-ordering: off
>                     performance.quick-read: off
>                     performance.read-ahead: on
>                     performance.io-cache: off
>                     performance.stat-prefetch: off
>                     cluster.eager-lock: enable
>                     network.remote-dio: on
>                     cluster.quorum-type: none
>                     network.ping-timeout: 22
>                     performance.write-behind: off
>                     nfs.disable: on
>                     features.shard: on
>                     features.shard-block-size: 512MB
>                     storage.owner-uid: 36
>                     storage.owner-gid: 36
>                     performance.io-thread-count: 64
>                     performance.cache-size: 2048MB
>                     performance.write-behind-window-size: 256MB
>                     server.allow-insecure: on
>                     cluster.ensure-durability: off
>                     config.transport: rdma
>                     server.outstanding-rpc-limit: 512
>                     diagnostics.brick-log-level: INFO
>
>                     Any recommendations on how to proceed from here?
>
>                     Best regards,
>                     Samuli Heinonen
>
>                     _______________________________________________
>                     Gluster-users mailing list
>                     Gluster-users at gluster.org
>                     http://lists.gluster.org/mailman/listinfo/gluster-users
>
>                     --
>                     Pranith
>
>             --
>
>             Pranith
>
>
>
>
>     --
>     Pranith
>

