[Gluster-users] flush-behind behaving synchronously
Zak Wilcox
zakwilcox at gmail.com
Tue Jun 30 10:19:23 UTC 2009
[Apologies if this gets posted twice]
Hi,
I can't seem to get flush-behind to flush behind and I was wondering
if anyone would be kind enough to lend some brain cells to help me
see what I'm missing.
I'm using glusterfs 2.0.1 on Debian/x86. I've got a "cluster/replicate" volume replicated between two servers that are connected over a VPN across a slooooow (ADSL) link. Synchronous creation of large files over such a link is very slow, so I added the "performance/write-behind" layer with "option flush-behind on". When I write a file to the volume using cp, I see the write()s returning implausibly quickly for the speed of the link, as expected, but the final close() still suffers a long delay, as though I'd never specified flush-behind. The following strace shows what happens when a 100kB file gets copied into the gluster mount (which is on /mnt/tmp); strace -ttt prints a timestamp at each syscall entry.
============
quince:/mnt# strace -ttt cp ~/typically-sized-files/hundredkay /mnt/tmp/100k
1246310229.679320 execve("/bin/cp", ["cp", "/root/typically-sized-files/hund"..., "/mnt/tmp/100k"], [/* 17 vars */]) = 0
[...]
1246310230.343767 open("/root/typically-sized-files/hundredkay", O_RDONLY|O_LARGEFILE) = 3
1246310230.343818 fstat64(3, {st_mode=S_IFREG|0644, st_size=102400, ...}) = 0
1246310230.343861 open("/mnt/tmp/100k", O_WRONLY|O_CREAT|O_EXCL|O_LARGEFILE, 0644) = 4
1246310230.563629 fstat64(4, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
1246310230.563694 read(3, "MZ\220\0\3\0\0\0\4\0\0\0\377\377\0\0\270\0\0\0\0\0\0\0@\0\0\0\0\0\0\0\0"..., 8192) = 8192
1246310230.563758 write(4, "MZ\220\0\3\0\0\0\4\0\0\0\377\377\0\0\270\0\0\0\0\0\0\0@\0\0\0\0\0\0\0\0"..., 8192) = 8192
1246310231.480604 read(3, "M\364_^d\211\r\0\0\0\0\311\303\213\3013\311\211H\4\211H\10\211H\f\307@\20\1\0\0\0"..., 8192) = 8192
1246310231.480667 write(4, "M\364_^d\211\r\0\0\0\0\311\303\213\3013\311\211H\4\211H\10\211H\f\307@\20\1\0\0\0"..., 8192) = 8192
1246310231.480940 read(3, "\377\377C;_\10|\360\213E\f\211u\0109p\10\17\216\220\0\0\0\213E\f\213M\10\213@\f\213"..., 8192) = 8192
1246310231.480990 write(4, "\377\377C;_\10|\360\213E\f\211u\0109p\10\17\216\220\0\0\0\213E\f\213M\10\213@\f\213"..., 8192) = 8192
[...]
1246310231.481945 read(3, "\312u\1\0\336u\1\0\352u\1\0\374u\1\0\16v\1\0\36v\1\0,v\1\0:v\1\0H"..., 8192) = 8192
1246310231.481986 write(4, "\312u\1\0\336u\1\0\352u\1\0\374u\1\0\16v\1\0\36v\1\0,v\1\0:v\1\0H"..., 8192) = 8192
1246310231.482063 read(3, "\377\271^\0\377\302v\0\322\247m\0\364\313\226\0\377\300h\0\374\277h\0\372\304x\0\340\261p\0\341"..., 8192) = 4096
1246310231.482102 write(4, "\377\271^\0\377\302v\0\322\247m\0\364\313\226\0\377\300h\0\374\277h\0\372\304x\0\340\261p\0\341"..., 4096) = 4096
1246310231.482174 close(4) = 0
1246310235.602280 close(3) = 0
[...]
1246310235.602419 exit_group(0) = ?
============
This delay at close (about 4.1 seconds here: close(4) is entered at 1246310231.48 and the next syscall not until 1246310235.60) gets longer as the file gets bigger, exactly as you'd expect if write()ing were asynchronous but close()ing/flush()ing were synchronous. Here are the (slightly more abbreviated) timings for a 1MB file:
===========
quince:/mnt# strace -ttt cp ~/typically-sized-files/onemeg /mnt/tmp/
1246315526.483531 execve("/bin/cp", ["cp", "/root/typically-sized-files/onem"..., "/mnt/tmp/"], [/* 17 vars */]) = 0
[...]
1246315527.034448 read(3, "MZ\220\0\3\0\0\0\4\0\0\0\377\377\0\0\270\0\0\0\0\0\0\0@\0\0\0\0\0\0\0\0"..., 8192) = 8192
1246315527.034515 write(4, "MZ\220\0\3\0\0\0\4\0\0\0\377\377\0\0\270\0\0\0\0\0\0\0@\0\0\0\0\0\0\0\0"..., 8192) = 8192
1246315527.500059 read(3, "M\364_^d\211\r\0\0\0\0\311\303\213\3013\311\211H\4\211H\10\211H\f\307@\20\1\0\0\0"..., 8192) = 8192
1246315527.500113 write(4, "M\364_^d\211\r\0\0\0\0\311\303\213\3013\311\211H\4\211H\10\211H\f\307@\20\1\0\0\0"..., 8192) = 8192
[...]
1246315527.522606 write(4, "\362\303\34:?/\5+\351\262\277\35\2413\5(eP\0\357\223\226w\373\311r4\325\347\215\30\r\270"..., 8192) = 8192
1246315527.522733 read(3, ""..., 8192) = 0
1246315527.522770 close(4) = 0
1246315563.780185 close(3) = 0
===========
That close() took 36.26 seconds (1246315563.78 - 1246315527.52), which for a 1MB file works out to roughly 28kB/s. That's typical of my upload speed, strongly suggesting glusterfs is copying the whole file over the link before
returning from close().
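If it helps, I can re-run the same test with strace's -T option, which reports the time spent inside each syscall rather than leaving it to be inferred from the entry timestamps:
=====
quince:/mnt# strace -ttt -T cp ~/typically-sized-files/onemeg /mnt/tmp/
=====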
Each of the two slow-connected machines is a gluster server with a
localhost gluster client. I'm using server-based replication (only
out of desperation when client-based flush-behind didn't work). This
latest config does the write-behind on the server, but I've also tried
doing it on the client, with the same problem. I'll include my most
recent server-replication-based/server-write-delaying config because
it's where those straces came from.
Client config (identical both ends):
========
volume loopback
  type protocol/client
  option transport-type tcp
  option remote-host 127.0.0.1
  option remote-subvolume delayedwrite  # name of the remote volume
end-volume
=========
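For reference, the client-side variant I mentioned stacks performance/write-behind directly on top of the protocol/client volume in the client volfile - something along these lines (the remote-subvolume name here is just illustrative; it would be whatever the server exports when write-behind isn't done server-side):
========
volume loopback
  type protocol/client
  option transport-type tcp
  option remote-host 127.0.0.1
  option remote-subvolume replicated  # illustrative: whatever volume the server exports
end-volume

volume writebehind
  type performance/write-behind
  option flush-behind on
  option cache-size 4MB
  subvolumes loopback
end-volume
========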
Server config (I'm just giving one side because they differ only by IP address and hostname):
=========
volume shared-raw
  type storage/posix  # POSIX FS translator
  option directory /mnt/big/fileserving/gluster-test/shared  # Export this directory
end-volume

volume quince-shared
  type features/locks
  subvolumes shared-raw
end-volume

volume remote
  type protocol/client
  option transport-type tcp
  option remote-host 10.8.0.3  # aka kumquat
  option remote-subvolume kumquat-shared  # name of the remote volume
end-volume

volume replicated
  type cluster/replicate
  subvolumes quince-shared remote
  option read-subvolume quince-shared
end-volume

volume threads
  type performance/io-threads
  subvolumes replicated
end-volume

volume delayedwrite
  type performance/write-behind
  option flush-behind on
  option cache-size 4MB
  subvolumes threads
end-volume

volume server
  type protocol/server
  option transport-type tcp
  subvolumes delayedwrite quince-shared
  option auth.addr.delayedwrite.allow 127.0.0.1  # Allow access to replicated volume from just the local client
  option auth.addr.quince-shared.allow 10.8.0.3  # Allow access to lockified tree from kumquat
end-volume
==========
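To make the layering explicit, that volfile builds the following stack on this server, top to bottom (quince-shared is also exported directly so kumquat's replicate can reach it below the write-behind layer):
==========
# protocol/server           "server"        (exports delayedwrite and quince-shared)
# performance/write-behind  "delayedwrite"  (flush-behind on, cache-size 4MB)
# performance/io-threads    "threads"
# cluster/replicate         "replicated"    (quince-shared + remote kumquat-shared)
# features/locks            "quince-shared"
# storage/posix             "shared-raw"
==========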
Logs show nothing unusual, and small files replicate between sites
just fine with acceptable delay. I've unmounted/stopped/started/
remounted everything every time I've changed the configs, so these
straces definitely belong to these configs.
So obviously the question is: why am I seeing delays on close() even
though I've got "option flush-behind on"? Is there something else I
need to have in my config? What's the best way to debug this if I've
done nothing obviously wrong?
For what it's worth, a little more about my setup: got two offices
full of secretaries, connected to t'Internet by ADSL links
(Infiniband it certainly ain't!). The secretaries create new Word
docs, read each other's docs and move files around, but never edit
anything. One office is currently accessing the other over Samba
over ADSL, which is predictably unusably slow. I'm trying to use
GlusterFS to replicate the tree between the two offices to provide
both fast local reads and some level of link death resilience, and
I'm hoping to layer Samba on top either end. I'll resolve conflicts
manually because they're so rare. I seriously considered Coda first,
but Gluster just looks so much tidier and more lively and current as
a project, and - except for this one problem - seems to do exactly
what I want.
Thanks in advance for any light shed,
Zak