[Gluster-users] Simultaneous reads and writes from specific apps to IPoIB volume seem to conflict and kill performance.

Harry Mangalam hjmangalam at gmail.com
Tue Jul 24 05:19:33 UTC 2012


Some more info..

I think the problem is the way bedtools is writing its output -
it's not getting buffered correctly.
Using some more useful strace flags to follow the forked child
(invocation sketched below), you can see that the output is being
written, just very slowly, due to the awful, horrible, skeezy, skanky,
lazy, wanky way that biologists (me included) tend to write code.

After the data is read in and processed, you get gigantic amounts of
this kind of output written to the file:
[pid 17021] 21:56:21 write(1, "U\t137095\t43\n", 12) = 12 <0.000120>
[pid 17021] 21:56:21 write(1, "U\t137096\t40\n", 12) = 12 <0.000119>
[pid 17021] 21:56:21 write(1, "U\t137097\t40\n", 12) = 12 <0.000119>
[pid 17021] 21:56:21 write(1, "U\t137098\t40\n", 12) = 12 <0.000119>
[pid 17021] 21:56:21 write(1, "U\t137099\t38\n", 12) = 12 <0.000116>
[pid 17021] 21:56:21 write(1, "U\t137100\t38\n", 12) = 12 <0.000119>
[pid 17021] 21:56:21 write(1, "U\t137101\t38\n", 12) = 12 <0.000117>

i.e. (the file itself):
...
 137098 U       137098  40
 137099 U       137099  38
 137100 U       137100  38
 137101 U       137101  38
 137102 U       137102  36
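
For the record, here's a sketch of the kind of strace invocation that
produces the per-pid, per-call timing shown above; the bedtools command
line is only a placeholder, not the user's actual arguments:

  # -f follows forked children, -t timestamps each call,
  # -T shows the time spent inside each syscall, -o saves the trace
  strace -f -t -T -o /tmp/bedtools.trace \
      genomeCoverageBed -ibam sample.bam -g genome.txt > sample.cov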

It looks like the current gluster config isn't set up to buffer
this particular output, so it's hitting the volume on a
write-by-write basis.
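
On the client side, one workaround that might be worth testing,
assuming bedtools writes through stdio and isn't forcing its own
per-line flushes, is to wrap the run in coreutils' stdbuf so stdout
gets a large block buffer and the 12-byte write()s are coalesced
before they ever reach gluster.  The bedtools arguments here are
placeholders again, not the user's actual command:

  # give stdout a 4 MB buffer instead of per-line writes
  stdbuf -o4M genomeCoverageBed -ibam sample.bam -g genome.txt > sample.cov

  # or batch the stream into 1 MB output blocks with dd
  genomeCoverageBed -ibam sample.bam -g genome.txt | dd obs=1M of=sample.cov

That's just an experiment, though; a volume-level fix would be cleaner.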

As noted below, my gluster performance options are:

> performance.cache-size: 268435456
> performance.io-cache: on
> performance.quick-read: on
> performance.io-thread-count: 64

Is there an option to address this extremely slow write perf?

These options (p. 38 of the 'Gluster File System 3.3.0 Administration
Guide') sound like they may help, but without knowing what they
actually do, I'm hesitant to apply them to what is now a live fs
(possible commands sketched below).

performance.flush-behind:
        If this option is set ON, it instructs the write-behind
        translator to perform flush in the background, returning
        success (or any error, if a previous write failed) to the
        application even before the flush is sent to the backend
        filesystem.

performance.write-behind-window-size:
        Size of the per-file write-behind buffer.
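
If someone can confirm these are safe to change on a live volume, I
assume they'd be applied with the usual 'gluster volume set' syntax,
something like the following - the 1MB window size is just a guess on
my part, not a tested or recommended value:

  gluster volume set gl performance.flush-behind on
  gluster volume set gl performance.write-behind-window-size 1MB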

Advice?

hjm


On Mon, Jul 23, 2012 at 4:59 PM, Harry Mangalam <hjmangalam at gmail.com> wrote:
> I have a fairly new gluster fs of 4 nodes with 2 RAID6 bricks on each
> node, connected to a cluster via IPoIB on QDR IB.
> The servers are all SL6.2, running gluster 3.3-1; the clients are
> running the gluster-released glusterfs-fuse-3.3.0qa42-1 &
> glusterfs-3.3.0qa42-1.
>
> The volume seems normal:
> $ gluster volume info
> Volume Name: gl
> Type: Distribute
> Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
> Status: Started
> Number of Bricks: 8
> Transport-type: tcp,rdma
> Bricks:
> Brick1: bs2:/raid1
> Brick2: bs2:/raid2
> Brick3: bs3:/raid1
> Brick4: bs3:/raid2
> Brick5: bs4:/raid1
> Brick6: bs4:/raid2
> Brick7: bs1:/raid1
> Brick8: bs1:/raid2
> Options Reconfigured:
> performance.cache-size: 268435456
> nfs.disable: on
> performance.io-cache: on
> performance.quick-read: on
> performance.io-thread-count: 64
> auth.allow: 10.2.*.*,10.1.*.*
>
> The logs on both the server and client are remarkable in their lack of
> anything amiss (the server has the previously reported string,
> repeated a zillion times:
>
>     I [socket.c:1798:socket_event_handler] 0-transport: disconnecting now
>
> which seems to be correlated with turning the NFS server off; this
> has been mentioned before).
>
> The gluster volume log, stripped of that line is here:
> <http://pastie.org/4309225>
>
> Individual large-file reads and writes are in the >300MB/s range, which
> is not magnificent but tolerable. However, we've recently detected
> what appears to be a conflict between reading and writing for some
> applications.  When such applications are reading and writing to the
> gluster fs, the client /usr/sbin/glusterfs increases its CPU
> consumption to >100% and IO goes to almost zero.
>
> When the inputs are on the gluster fs and the output is on another fs,
> performance is as good as on a local RAID.
> This seems to be specific to a particular application (bedtools,
> perhaps some other OpenMP genomics apps - still checking).  Other
> utilities (cp, perl, tar, etc.) that read and write to the gluster
> filesystem seem to be able to push and pull fairly large amounts of
> data to/from it.
>
> The client is running a genomics utility (bedtools) which reads very
> large chunks of data from the gluster fs, then aligns them to a
> reference genome.  Stracing the run yields this stanza, after which it
> hangs until I kill it.  The user has said that it does complete, but at
> a speed hundreds of times slower (maybe timing out at each step?).
>
> open("/data/users/tdlong/bin/genomeCoverageBed", O_RDONLY) = 3
> ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fffcf0e5bb0) = -1 ENOTTY
> (Inappropriate ioctl for device)
> lseek(3, 0, SEEK_CUR)                   = 0
> read(3, "#!/bin/sh\n${0%/*}/bedtools genom"..., 80) = 42
> lseek(3, 0, SEEK_SET)                   = 0
> getrlimit(RLIMIT_NOFILE, {rlim_cur=4*1024, rlim_max=4*1024}) = 0
> dup2(3, 255)                            = 255
> close(3)                                = 0
> fcntl(255, F_SETFD, FD_CLOEXEC)         = 0
> fcntl(255, F_GETFL)                     = 0x8000 (flags O_RDONLY|O_LARGEFILE)
> fstat(255, {st_mode=S_IFREG|0755, st_size=42, ...}) = 0
> lseek(255, 0, SEEK_CUR)                 = 0
> rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
> read(255, "#!/bin/sh\n${0%/*}/bedtools genom"..., 42) = 42
> rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
> rt_sigprocmask(SIG_BLOCK, [INT CHLD], [], 8) = 0
> clone(child_stack=0,
> flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
> child_tidptr=0x2ae9318729e0) = 8229
> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
> rt_sigaction(SIGINT, {0x436f40, [], SA_RESTORER, 0x3cb64302d0},
> {SIG_DFL, [], SA_RESTORER, 0x3cb64302d0}, 8) = 0
> wait4(-1,
>
> Does this indicate any optional tuning or operational parameters that
> we should be using?
>
> hjm
>
> --
> Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
> [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
> 415 South Circle View Dr, Irvine, CA, 92697 [shipping]
> MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)



-- 
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)


