[Gluster-users] flush-behind behaving synchronously
Zak Wilcox
zakwilcox at gmail.com
Tue Jun 30 10:19:23 UTC 2009
[Apologies if this gets posted twice]
Hi,
I can't seem to get flush-behind to flush behind and I was wondering
if anyone would be kind enough to lend some brain cells to help me
see what I'm missing.
I'm using glusterfs 2.0.1 on Debian/x86. I've got a "cluster/replicate" volume replicated between two servers that are connected over a VPN across a slooooow (ADSL) link. Synchronous creation of large files over such a link is very slow, so I added the "performance/write-behind" layer with "option flush-behind on". When I write a file to the volume using cp, I see the write()s returning implausibly quickly for the speed of the link, as expected, but the final close() still suffers a long delay, as though I'd never specified flush-behind. The following strace shows what happens when a 100kB file gets copied into the gluster mount (which is on /mnt/tmp); strace -ttt prints a timestamp at each syscall entry.
============
quince:/mnt# strace -ttt cp ~/typically-sized-files/hundredkay /mnt/tmp/100k
1246310229.679320 execve("/bin/cp", ["cp", "/root/typically-sized-files/hund"..., "/mnt/tmp/100k"], [/* 17 vars */]) = 0
[...]
1246310230.343767 open("/root/typically-sized-files/hundredkay", O_RDONLY|O_LARGEFILE) = 3
1246310230.343818 fstat64(3, {st_mode=S_IFREG|0644, st_size=102400, ...}) = 0
1246310230.343861 open("/mnt/tmp/100k", O_WRONLY|O_CREAT|O_EXCL|O_LARGEFILE, 0644) = 4
1246310230.563629 fstat64(4, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
1246310230.563694 read(3, "MZ\220\0\3\0\0\0\4\0\0\0\377\377\0\0\270\0\0\0\0\0\0\0@\0\0\0\0\0\0\0\0"..., 8192) = 8192
1246310230.563758 write(4, "MZ\220\0\3\0\0\0\4\0\0\0\377\377\0\0\270\0\0\0\0\0\0\0@\0\0\0\0\0\0\0\0"..., 8192) = 8192
1246310231.480604 read(3, "M\364_^d\211\r\0\0\0\0\311\303\213\3013\311\211H\4\211H\10\211H\f\307@\20\1\0\0\0"..., 8192) = 8192
1246310231.480667 write(4, "M\364_^d\211\r\0\0\0\0\311\303\213\3013\311\211H\4\211H\10\211H\f\307@\20\1\0\0\0"..., 8192) = 8192
1246310231.480940 read(3, "\377\377C;_\10|\360\213E\f\211u\0109p\10\17\216\220\0\0\0\213E\f\213M\10\213@\f\213"..., 8192) = 8192
1246310231.480990 write(4, "\377\377C;_\10|\360\213E\f\211u\0109p\10\17\216\220\0\0\0\213E\f\213M\10\213@\f\213"..., 8192) = 8192
[...]
1246310231.481945 read(3, "\312u\1\0\336u\1\0\352u\1\0\374u\1\0\16v\1\0\36v\1\0,v\1\0:v\1\0H"..., 8192) = 8192
1246310231.481986 write(4, "\312u\1\0\336u\1\0\352u\1\0\374u\1\0\16v\1\0\36v\1\0,v\1\0:v\1\0H"..., 8192) = 8192
1246310231.482063 read(3, "\377\271^\0\377\302v\0\322\247m\0\364\313\226\0\377\300h\0\374\277h\0\372\304x\0\340\261p\0\341"..., 8192) = 4096
1246310231.482102 write(4, "\377\271^\0\377\302v\0\322\247m\0\364\313\226\0\377\300h\0\374\277h\0\372\304x\0\340\261p\0\341"..., 4096) = 4096
1246310231.482174 close(4) = 0
1246310235.602280 close(3) = 0
[...]
1246310235.602419 exit_group(0) = ?
============
This delay at close (about 4.1 seconds here: close(4) is entered at 1246310231.48 and the next syscall not until 1246310235.60) gets longer as the file gets bigger, exactly as you'd expect if write()ing were asynchronous but close()ing/flush()ing were synchronous. Here are the (slightly more abbreviated) timings for a 1MB file:
===========
quince:/mnt# strace -ttt cp ~/typically-sized-files/onemeg /mnt/tmp/
1246315526.483531 execve("/bin/cp", ["cp", "/root/typically-sized-files/onem"..., "/mnt/tmp/"], [/* 17 vars */]) = 0
[...]
1246315527.034448 read(3, "MZ\220\0\3\0\0\0\4\0\0\0\377\377\0\0\270\0\0\0\0\0\0\0@\0\0\0\0\0\0\0\0"..., 8192) = 8192
1246315527.034515 write(4, "MZ\220\0\3\0\0\0\4\0\0\0\377\377\0\0\270\0\0\0\0\0\0\0@\0\0\0\0\0\0\0\0"..., 8192) = 8192
1246315527.500059 read(3, "M\364_^d\211\r\0\0\0\0\311\303\213\3013\311\211H\4\211H\10\211H\f\307@\20\1\0\0\0"..., 8192) = 8192
1246315527.500113 write(4, "M\364_^d\211\r\0\0\0\0\311\303\213\3013\311\211H\4\211H\10\211H\f\307@\20\1\0\0\0"..., 8192) = 8192
[...]
1246315527.522606 write(4, "\362\303\34:?/\5+\351\262\277\35\2413\5(eP\0\357\223\226w\373\311r4\325\347\215\30\r\270"..., 8192) = 8192
1246315527.522733 read(3, ""..., 8192) = 0
1246315527.522770 close(4) = 0
1246315563.780185 close(3) = 0
===========
That close() took 36.26 seconds (1246315563.78 - 1246315527.52), which for a 1MB file works out to roughly 28kB/s. That's typical of my upload speed, strongly suggesting glusterfs is copying the whole file over the link before
returning from close().
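If it helps, I can re-run the same test with strace's -T option, which reports the time spent inside each syscall rather than leaving it to be inferred from the entry timestamps:
=====
quince:/mnt# strace -ttt -T cp ~/typically-sized-files/onemeg /mnt/tmp/
=====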
Each of the two slow-connected machines is a gluster server with a
localhost gluster client. I'm using server-based replication (only
out of desperation when client-based flush-behind didn't work). This
latest config does the write-behind on the server, but I've also tried
doing it on the client, with the same problem. I'll include my most
recent server-replication-based/server-write-delaying config because
it's where those straces came from.
Client config (identical both ends):
========
volume loopback
  type protocol/client
  option transport-type tcp
  option remote-host 127.0.0.1
  option remote-subvolume delayedwrite  # name of the remote volume
end-volume
=========
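For reference, the client-side variant I mentioned stacks performance/write-behind directly on top of the protocol/client volume in the client volfile - something along these lines (the remote-subvolume name here is just illustrative; it would be whatever the server exports when write-behind isn't done server-side):
========
volume loopback
  type protocol/client
  option transport-type tcp
  option remote-host 127.0.0.1
  option remote-subvolume replicated  # illustrative: whatever volume the server exports
end-volume

volume writebehind
  type performance/write-behind
  option flush-behind on
  option cache-size 4MB
  subvolumes loopback
end-volume
========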
Server config (I'm just giving one side because they differ only by IP address and hostname):
=========
volume shared-raw
  type storage/posix  # POSIX FS translator
  option directory /mnt/big/fileserving/gluster-test/shared  # Export this directory
end-volume

volume quince-shared
  type features/locks
  subvolumes shared-raw
end-volume

volume remote
  type protocol/client
  option transport-type tcp
  option remote-host 10.8.0.3  # aka kumquat
  option remote-subvolume kumquat-shared  # name of the remote volume
end-volume

volume replicated
  type cluster/replicate
  subvolumes quince-shared remote
  option read-subvolume quince-shared
end-volume

volume threads
  type performance/io-threads
  subvolumes replicated
end-volume

volume delayedwrite
  type performance/write-behind
  option flush-behind on
  option cache-size 4MB
  subvolumes threads
end-volume

volume server
  type protocol/server
  option transport-type tcp
  subvolumes delayedwrite quince-shared
  option auth.addr.delayedwrite.allow 127.0.0.1  # Allow access to replicated volume from just the local client
  option auth.addr.quince-shared.allow 10.8.0.3  # Allow access to lockified tree from kumquat
end-volume
==========
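To make the layering explicit, that volfile builds the following stack on this server, top to bottom (quince-shared is also exported directly so kumquat's replicate can reach it below the write-behind layer):
==========
# protocol/server           "server"        (exports delayedwrite and quince-shared)
# performance/write-behind  "delayedwrite"  (flush-behind on, cache-size 4MB)
# performance/io-threads    "threads"
# cluster/replicate         "replicated"    (quince-shared + remote kumquat-shared)
# features/locks            "quince-shared"
# storage/posix             "shared-raw"
==========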
Logs show nothing unusual, and small files replicate between sites
just fine with acceptable delay. I've unmounted/stopped/started/
remounted everything every time I've changed the configs, so these
straces definitely belong to these configs.
So obviously the question is: why am I seeing delays on close() even
though I've got "option flush-behind on"? Is there something else I
need to have in my config? What's the best way to debug this if I've
done nothing obviously wrong?
For what it's worth, a little more about my setup: got two offices
full of secretaries, connected to t'Internet by ADSL links
(Infiniband it certainly ain't!). The secretaries create new Word
docs, read each other's docs and move files around, but never edit
anything. One office is currently accessing the other over Samba
over ADSL, which is predictably unusably slow. I'm trying to use
GlusterFS to replicate the tree between the two offices to provide
both fast local reads and some level of link death resilience, and
I'm hoping to layer Samba on top either end. I'll resolve conflicts
manually because they're so rare. I seriously considered Coda first,
but Gluster just looks so much tidier and more lively and current as
a project, and - except for this one problem - seems to do exactly
what I want.
Thanks in advance for any light shed,
Zak