[Gluster-users] client coherence problem with locks and truncate
Robert L. Millner
rmillner at webappvm.com
Fri Sep 4 19:01:09 UTC 2009
Hi,
We're observing a coherence issue with GlusterFS 2.0.6. One client
opens a file, takes a write lock, truncates it and writes. Another
client waiting on a read lock may see a zero-length file after the
read lock is granted. If the two nodes read and write in a loop, this
tends to happen within a few hundred tries. The same code runs for
10000 iterations without a problem if both programs run on the same
node, either on GlusterFS or on a local ext3 file system.
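For reference, the two test loops boil down to something like the
sketch below. This is an illustrative reconstruction from the straces,
not the exact test source; the file name, fill pattern and error
handling are placeholders.

#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/select.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <unistd.h>

static void lock_whole_file(int fd, short type)
{
    struct flock fl;

    memset(&fl, 0, sizeof(fl));
    fl.l_type = type;        /* F_WRLCK or F_RDLCK */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;            /* 0 = whole file */
    if (fcntl(fd, F_SETLKW, &fl) < 0) {   /* blocking lock request */
        perror("fcntl");
        exit(1);
    }
}

int main(int argc, char **argv)
{
    int writer = (argc > 1 && strcmp(argv[1], "write") == 0);
    char buf[4096];
    int i;

    for (i = 0; i < 10000; i++) {
        if (writer) {
            /* Node1: exclusive lock, truncate, rewrite, unlock via close. */
            int fd = open("testfile", O_RDWR | O_CREAT, 0644);
            struct timeval tv = { 0, 0 };    /* delay; see timing note below */

            lock_whole_file(fd, F_WRLCK);
            select(0, NULL, NULL, NULL, &tv);
            lseek(fd, 0, SEEK_SET);
            ftruncate(fd, 0);
            memset(buf, '0', 900);           /* placeholder fill pattern */
            write(fd, buf, 900);
            close(fd);                       /* close drops the lock */
        } else {
            /* Node2: shared lock, then stat and read. */
            int fd = open("testfile", O_RDONLY | O_CREAT, 0644);
            struct stat st;
            ssize_t n;

            lock_whole_file(fd, F_RDLCK);    /* blocks until Node1 is done */
            fstat(fd, &st);                  /* occasionally reports st_size == 0 */
            n = read(fd, buf, sizeof(buf));  /* ... and read() returns 0 bytes */
            if (st.st_size == 0 || n == 0)
                fprintf(stderr, "iteration %d: saw an empty file\n", i);
            close(fd);
        }
    }
    return 0;
}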
Node1 does the following (strace):
2206 1252031615.509555 open("testfile", O_RDWR|O_CREAT|O_LARGEFILE, 0644) = 3
2206 1252031615.514886 fcntl64(3, F_SETLKW64, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}, 0xbfcaee78) = 0
2206 1252031615.517742 select(0, NULL, NULL, NULL, {0, 0}) = 0 (Timeout)
2206 1252031615.517788 _llseek(3, 0, [0], SEEK_SET) = 0
2206 1252031615.517829 ftruncate64(3, 0) = 0
2206 1252031615.520632 write(3, "01234567890123456789012345678901"..., 900) = 900
2206 1252031615.599782 close(3) = 0
2206 1252031615.604731 open("testfile", O_RDONLY|O_CREAT|O_LARGEFILE, 0644) = 3
2206 1252031615.615158 fcntl64(3, F_SETLKW64, {type=F_RDLCK, whence=SEEK_SET, start=0, len=0}, 0xbfcaee78) = 0
2206 1252031615.624680 fstat64(3, {st_dev=makedev(0, 13), st_ino=182932, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=16, st_size=900, st_atime=2009/09/03-19:33:35, st_mtime=2009/09/03-19:33:35, st_ctime=2009/09/03-19:33:35}) = 0
2206 1252031615.624787 _llseek(3, 0, [0], SEEK_SET) = 0
2206 1252031615.624851 read(3, "01234567890123456789012345678901"..., 4096) = 900
2206 1252031615.625126 close(3) = 0
Node2 does the following (strace):
2126 1252031615.504350 open("testfile", O_RDONLY|O_CREAT|O_LARGEFILE, 0644) = 3
2126 1252031615.509004 fcntl64(3, F_SETLKW64, {type=F_RDLCK, whence=SEEK_SET, start=0, len=0}, 0xbfc05dc8) = 0
2126 1252031615.587697 fstat64(3, {st_dev=makedev(0, 13), st_ino=182932, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=0, st_atime=2009/09/03-19:33:35, st_mtime=2009/09/03-19:33:35, st_ctime=2009/09/03-19:33:35}) = 0
2126 1252031615.588027 _llseek(3, 0, [0], SEEK_SET) = 0
2126 1252031615.588089 read(3, "", 4096) = 0
2126 1252031615.588228 close(3) = 0
Both node clocks are NTP disciplined. Since these are virtual machines
the clock dispersion is higher than usual, but I believe the timestamps
can be correlated to within about 0.1 s.
Node2 waits for the write lock to clear before getting its read lock.
Node1 then reads the file back, and its fstat agrees with Node2's on
every field except st_size (and st_blocks). Node2's read of the file
returns no data.
This is on 32-bit CentOS 5 with a 2.6.27 kernel and fuse 2.7.4, running
on VMware. The problem is also observed on Amazon EC2 with their 2.6.21
fc8xen kernel.
I can make the problem unrepeatable in 10000 tries by changing the
select on Node1 to time out in 0.1 seconds. The problem still repeats
in under 5000 tries if the select is set to time out in 0.01 seconds.
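In case it's unclear, the select in the Node1 trace is only being used
as a sub-second sleep between locking and truncating; the 0.1 s variant
is simply:

    struct timeval tv = { 0, 100000 };   /* 0.1 s */
    select(0, NULL, NULL, NULL, &tv);    /* sub-second sleep, no fds */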
This happens whether or not gluster is run with
--disable-direct-io-mode.
The volume is mirrored across four servers. The server configuration is
below; the export directory is on ext3.
volume posix
  type storage/posix
  option directory /var/data/export
end-volume

volume locks
  type features/locks
  option mandatory-locks on
  subvolumes posix
end-volume

volume brick
  type performance/io-threads
  option thread-count 8
  subvolumes locks
end-volume

volume server
  type protocol/server
  option transport-type tcp
  option auth.addr.brick.allow *
  subvolumes brick
end-volume
And the client configuration:
volume remote1
  type protocol/client
  option transport-type tcp
  option remote-host 10.10.10.145
  option remote-subvolume brick
end-volume

volume remote2
  type protocol/client
  option transport-type tcp
  option remote-host 10.10.10.130
  option remote-subvolume brick
end-volume

volume remote3
  type protocol/client
  option transport-type tcp
  option remote-host 10.10.10.221
  option remote-subvolume brick
end-volume

volume remote4
  type protocol/client
  option transport-type tcp
  option remote-host 10.10.10.104
  option remote-subvolume brick
end-volume

volume replicated
  type cluster/replicate
  subvolumes remote1 remote2 remote3 remote4
end-volume

volume writebehind
  type performance/write-behind
  subvolumes replicated
end-volume

volume cache
  type performance/io-cache
  subvolumes writebehind
end-volume
The problem persists with the configuration above, and also if any or
all of the following tweaks are made:
1. Remove the replicated volume and just use remote1.
2. Get rid of threads on the server.
3. Get rid of io-cache and writebehind on the clients.
4. Use mandatory locking on the test file (see the note below).
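For reference on tweak 4: in addition to the server-side "option
mandatory-locks on" shown above, the usual way to mark a file for
kernel-enforced mandatory locking on Linux is to set the setgid bit and
clear group execute, on a file system mounted with the mand option. The
sketch below is illustrative only (the helper name is made up):

#include <sys/stat.h>

/* Illustrative only: mark a file for kernel mandatory locking by
 * setting the setgid bit and clearing group execute; the file system
 * must also be mounted with the "mand" option. */
int mark_mandatory(const char *path)
{
    return chmod(path, S_ISGID | 0644);   /* i.e. mode 02644 */
}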
Please let me know if any more information is needed to debug this
further, or if you have any guidance on how to avoid it.
Thank you!
Cheers,
Rob