[Gluster-users] Simultaneous reads and writes from specific apps to IPoIB volume seem to conflict and kill performance.
Harry Mangalam
hjmangalam at gmail.com
Tue Jul 24 05:19:33 UTC 2012
Some more info..
I think the problem is the way bedtools is writing the output -
it's not getting buffered correctly.
Using some more useful strace flags to force strace into the fork'ed
child, you can see that the output is being written, just very slowly,
due to the awful, horrible, skeezy, skanky, lazy, wanky way that
biologists (me included) tend to write code. i.e.:
after the data is read in and processed, you get gigantic amounts of
this kind of output being written to the file:
[pid 17021] 21:56:21 write(1, "U\t137095\t43\n", 12) = 12 <0.000120>
[pid 17021] 21:56:21 write(1, "U\t137096\t40\n", 12) = 12 <0.000119>
[pid 17021] 21:56:21 write(1, "U\t137097\t40\n", 12) = 12 <0.000119>
[pid 17021] 21:56:21 write(1, "U\t137098\t40\n", 12) = 12 <0.000119>
[pid 17021] 21:56:21 write(1, "U\t137099\t38\n", 12) = 12 <0.000116>
[pid 17021] 21:56:21 write(1, "U\t137100\t38\n", 12) = 12 <0.000119>
[pid 17021] 21:56:21 write(1, "U\t137101\t38\n", 12) = 12 <0.000117>
i.e. (the file itself):
...
137098 U 137098 40
137099 U 137099 38
137100 U 137100 38
137101 U 137101 38
137102 U 137102 36
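For the record, the strace invocation that produced the per-write
timings above was along these lines (roughly reconstructed, with the
real command line replaced by a placeholder):

  # -f follows into the fork'ed child, -t timestamps each call,
  # -T reports the time spent inside each syscall, and
  # -e trace=write restricts the log to the write() calls
  strace -f -t -T -e trace=write -o bedtools.strace \
      genomeCoverageBed [usual args] > output.bed
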
It looks like the current gluster config isn't buffering this
particular output, so it's hitting the volume on a write-by-write
basis.
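One client-side workaround I'm considering (just a sketch, untested
with bedtools, and the paths below are made up) is to batch those tiny
writes before they reach the gluster mount - either by enlarging the
stdio buffer with coreutils' stdbuf, which only helps if the tool
writes through stdio and doesn't set its own buffering, or by piping
the output through dd so it lands in big blocks:

  # ask for a 4MB fully-buffered stdout (no effect if the app
  # overrides its own buffering)
  stdbuf -o4M genomeCoverageBed [usual args] > /gl/out/coverage.bed

  # or aggregate the line-at-a-time output into 1MB writes with dd
  genomeCoverageBed [usual args] | dd obs=1M of=/gl/out/coverage.bed
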
As noted below, my gluster performance options are:
> performance.cache-size: 268435456
> performance.io-cache: on
> performance.quick-read: on
> performance.io-thread-count: 64
Is there an option to address this extremely slow write perf?
These options (p 38 of the 'Gluster File System 3.3.0 Administration
Guide') sound like they may help, but without knowing what they
actually do, I'm hesitant to apply them to what is now a live fs.
(A sketch of how I'd set them follows the descriptions below.)
performance.flush-behind:
    If this option is set ON, instructs the write-behind translator to
    perform flush in the background, by returning success (or any
    errors, if any of the previous writes failed) to the application
    even before the flush is sent to the backend filesystem.

performance.write-behind-window-size:
    Size of the per-file write-behind buffer.
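If I do end up trying them, I assume it would be something like the
following (untested on my side, and to be applied to what is a live
volume, hence the hesitation):

  # turn on background flush in the write-behind translator
  gluster volume set gl performance.flush-behind on
  # enlarge the per-file write-behind buffer (the default is 1MB)
  gluster volume set gl performance.write-behind-window-size 4MB
  # confirm that the options took
  gluster volume info gl
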
Advice?
hjm
On Mon, Jul 23, 2012 at 4:59 PM, Harry Mangalam <hjmangalam at gmail.com> wrote:
> I have a fairly new gluster fs of 4 nodes with 2 RAID6 bricks on each
> node, connected to a cluster via IPoIB on QDR IB.
> The servers are all SL6.2, running gluster 3.3-1; the clients are
> running the gluster-released glusterfs-fuse-3.3.0qa42-1 &
> glusterfs-3.3.0qa42-1.
>
> The volume seems normal:
> $ gluster volume info
> Volume Name: gl
> Type: Distribute
> Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
> Status: Started
> Number of Bricks: 8
> Transport-type: tcp,rdma
> Bricks:
> Brick1: bs2:/raid1
> Brick2: bs2:/raid2
> Brick3: bs3:/raid1
> Brick4: bs3:/raid2
> Brick5: bs4:/raid1
> Brick6: bs4:/raid2
> Brick7: bs1:/raid1
> Brick8: bs1:/raid2
> Options Reconfigured:
> performance.cache-size: 268435456
> nfs.disable: on
> performance.io-cache: on
> performance.quick-read: on
> performance.io-thread-count: 64
> auth.allow: 10.2.*.*,10.1.*.*
>
> The logs on both the server and client are remarkable in their lack
> of anything amiss; the server does show the previously reported,
> endlessly repeating line:
>
> I [socket.c:1798:socket_event_handler] 0-transport: disconnecting now
>
> which seems to be correlated with turning the NFS server off and has
> been mentioned before.
>
> The gluster volume log, stripped of that line is here:
> <http://pastie.org/4309225>
>
> Individual large-file reads and writes are in the >300MB/s range,
> which is not magnificent but tolerable. However, we've recently
> detected what appears to be a conflict between reading and writing
> for some applications. When such an application is reading from and
> writing to the gluster fs at the same time, the client-side
> /usr/sbin/glusterfs process increases its CPU consumption to >100%
> and the IO drops to almost zero.
>
> When the inputs are on the gluster fs and the output is on another fs,
> performance is as good as on a local RAID.
> This seems to be specific to particular applications (bedtools,
> perhaps some other OpenMP genomics apps - still checking). Other
> utilities (cp, perl, tar, etc.) that read and write to the gluster
> filesystem seem to be able to push and pull fairly large amounts of
> data to/from it.
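>
> A crude way to check whether this is app-specific (just a sketch,
> with made-up paths) is to run a large sequential read and a stream
> of tiny line-sized writes against the mount at the same time and
> watch the glusterfs client's CPU in top:
>
>   # large sequential read from the gluster mount, in the background
>   dd if=/gl/some_big_input.bam of=/dev/null bs=1M &
>   # a million tiny writes to a single open fd, like bedtools makes
>   i=0; while [ $i -lt 1000000 ]; do echo "U $i 40"; i=$((i+1)); done \
>       > /gl/tiny_writes.txt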
>
> The client is running a genomics utility (bedtools) which reads very
> large chunks of data from the gluster fs, then aligns them to a
> reference genome. Stracing the run yields this stanza, after which it
> hangs until I kill it. The user has said that it does complete, but
> at a speed hundreds of times slower (maybe timing out at each
> step..?)
>
> open("/data/users/tdlong/bin/genomeCoverageBed", O_RDONLY) = 3
> ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fffcf0e5bb0) = -1 ENOTTY
> (Inappropriate ioctl for device)
> lseek(3, 0, SEEK_CUR) = 0
> read(3, "#!/bin/sh\n${0%/*}/bedtools genom"..., 80) = 42
> lseek(3, 0, SEEK_SET) = 0
> getrlimit(RLIMIT_NOFILE, {rlim_cur=4*1024, rlim_max=4*1024}) = 0
> dup2(3, 255) = 255
> close(3) = 0
> fcntl(255, F_SETFD, FD_CLOEXEC) = 0
> fcntl(255, F_GETFL) = 0x8000 (flags O_RDONLY|O_LARGEFILE)
> fstat(255, {st_mode=S_IFREG|0755, st_size=42, ...}) = 0
> lseek(255, 0, SEEK_CUR) = 0
> rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
> read(255, "#!/bin/sh\n${0%/*}/bedtools genom"..., 42) = 42
> rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
> rt_sigprocmask(SIG_BLOCK, [INT CHLD], [], 8) = 0
> clone(child_stack=0,
> flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
> child_tidptr=0x2ae9318729e0) = 8229
> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
> rt_sigaction(SIGINT, {0x436f40, [], SA_RESTORER, 0x3cb64302d0},
> {SIG_DFL, [], SA_RESTORER, 0x3cb64302d0}, 8) = 0
> wait4(-1,
>
> Does this indicate any optional tuning or operational parameters that
> we should be using?
>
> hjm
>
> --
> Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
> [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
> 415 South Circle View Dr, Irvine, CA, 92697 [shipping]
> MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)