[Gluster-users] Performance in VM guests when hosting VM images on Gluster

Brian Foster bfoster at redhat.com
Thu Feb 28 15:54:06 UTC 2013

On 02/28/2013 05:58 AM, Torbjørn Thorsen wrote:
> On Wed, Feb 27, 2013 at 9:46 PM, Brian Foster <bfoster at redhat.com> wrote:
>> On 02/27/2013 10:14 AM, Torbjørn Thorsen wrote:
>>> I'm seeing less-than-stellar performance on my Gluster deployment when
>>> hosting VM images on the FUSE mount.
> I'm not familiar with the profiling feature, but I think I'm seeing
> the same thing,
> requests being fractured in smaller ones.

gluster profiling is pretty straightforward. Start profiling on the
volume, then you can dump some stats on the workload the volume is seeing:
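Assuming a volume named `myvol` (substitute your own volume name), the sequence looks like:

```shell
# start collecting per-brick stats on the volume
gluster volume profile myvol start

# dump the stats gathered since the last 'info' invocation
gluster volume profile myvol info

# stop collecting when done
gluster volume profile myvol stop
```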


The 'info' command will print the stats since the last info invocation,
so you can easily compare results between different workloads provided
the volume is otherwise idle.

> However, by chance I found something which seems to impact the
> performance even more.
> I wanted to retry the dd-to-loop-device-with-sync today, the same one
> I pasted yesterday.
> However, today it was quite different.
> torbjorn at xen01:/srv/ganeti/shared-file-storage/tmp$ sudo dd
> if=/dev/zero of=/dev/loop1 bs=1024k count=2000 oflag=sync
> 303038464 bytes (303 MB) copied, 123.95 s, 2.4 MB/s
> ^C

I started testing on a slightly more up-to-date VM. I'm seeing fairly
consistent 10MB/s with sync I/O. This is with a loop device over a
file on a locally mounted gluster volume.

> So I unmounted the loop device and mounted it again, and re-ran the test.
> torbjorn at xen01:/srv/ganeti/shared-file-storage/tmp$ sudo losetup -d /dev/loop1
> torbjorn at xen01:/srv/ganeti/shared-file-storage/tmp$ sudo losetup -f
> loopback.img
> torbjorn at xen01:/srv/ganeti/shared-file-storage/tmp$ sudo dd
> if=/dev/zero of=/dev/loop1 bs=1024k count=2000 oflag=sync
> 2097152000 bytes (2.1 GB) copied, 55.9117 s, 37.5 MB/s

I can reproduce something like this when dealing with non-sync I/O.
Smaller overall writes (relative to available cache) run much faster, and
larger writes tend to normalize to a lower value. Using xfs_io instead of
dd shows that writes are in fact hitting cache (e.g., smaller writes
complete at 1.5GB/s, larger writes normalize to 35MB/s once we've
dirtied enough memory and flushing/reclaim kicks in). It also appears
that a close() on the loop device aggressively flushes whatever data
hasn't been flushed yet (something fuse also does on open()). My
non-sync dd results tend to jump around, so perhaps that is why.
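As a sanity check that this buffered-vs-sync contrast is independent of gluster, the same effect can be seen with dd against a plain local file (sizes and paths here are arbitrary):

```shell
f=$(mktemp)

# buffered: dd returns as soon as the data is in the page cache
dd if=/dev/zero of="$f" bs=1M count=8 2>/dev/null

# sync: each 1 MiB write waits on a flush before returning
dd if=/dev/zero of="$f" bs=1M count=8 oflag=sync 2>/dev/null

stat -c %s "$f"   # 8388608 bytes either way; only the timing differs
rm -f "$f"
```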

> The situation inside the Xen instance was similar, although with
> different numbers.
> After being on, but mostly idle, for ~5 days.:
> torbjorn at hennec:~$ sudo dd if=/dev/zero of=bigfile bs=1024k count=2000
> oflag=direct
> 28311552 bytes (28 MB) copied, 35.1314 s, 806 kB/s
> ^C
> After reboot and a fresh loop device:
> torbjorn at hennec:~$ sudo dd if=/dev/zero of=bigfile bs=1024k count=2000
> oflag=direct
> 814743552 bytes (815 MB) copied, 34.7441 s, 23.4 MB/s
> ^C
> These numbers might indicate that loop device performance degrades over time.
> However, I haven't seen this on local filesystems, so is this possibly
> only with files on Gluster or FUSE ?

I would expect this kind of behavior when caching is involved, as
described above, but I'm not quite sure what would cause it with sync I/O.

> I'm on Debian stable, so things aren't exactly box fresh.
> torbjorn at xen01:~$ dpkg -l | grep "^ii  linux-image-$(uname -r)"
> ii  linux-image-2.6.32-5-xen-amd64      2.6.32-46
> Linux 2.6.32 for 64-bit PCs, Xen dom0 support
> I'm not sure how to debug the Gluster -> FUSE -> loop device interaction,
> but I might try a newer kernel on the client.

From skimming through the code and watching a writepage tracepoint, I
think the high-level situation is as follows:

- write()s to the loop (block) device hit the page cache as buffers. This
data is subject to caching/writeback behavior similar to a local
filesystem's (e.g., write returns once the data is cached; if the write
is sync, it waits on a flush before returning).
- Flushing eventually kicks in, which is page based and results in a
bunch of writepage requests. The block/buffer handling code converts
these writepage requests into 4k I/O (bio) requests.
- These 4k I/O requests hit loop. In the file backed case, it issues
write() requests to the underlying file.
- In the local filesystem case, I believe these would result in further
caching in the filesystem's mapping. In the case of fuse, requests
to userspace are submitted immediately, so gluster now receives 4k
write requests rather than the 128k requests it sees when writing to
the file directly via dd with a 1MB buffer size.
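The arithmetic above can be made concrete. Using the sizes from this thread (4k pages, 128k fuse writes), flushing the same 1 MiB costs 32x more round trips through the loop path:

```shell
# bytes moved per request on each path (sizes from the discussion above)
mib=$((1024 * 1024))       # one dd buffer
page=4096                  # writepage -> one 4k bio per page via loop
fuse_max=$((128 * 1024))   # write size gluster sees for a direct dd write

# requests needed to move 1 MiB on each path
echo "direct:   $((mib / fuse_max)) requests"   # 8
echo "via loop: $((mib / page)) requests"       # 256
```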

Given that you can reproduce the sync write variance without Xen, I
would rule that out for the time being and suggest the following:

- Give the profile tool a try to compare the local loop case when
throughput is higher vs. lower. It would be interesting to see if
anything jumps out that could help explain what is happening differently
between the runs.
- See what caching/performance translators are enabled in your gluster
client graph (the volname-fuse.vol volfile) and see about disabling some
of those one at a time, e.g.:

	gluster volume set myvol io-cache disable
	(repeat for write-behind, read-ahead, quick-read, etc.)

... and see if you get any more consistent results (good or bad).
- Out of curiosity (and if you're running a recent enough gluster), try
the fopen-keep-cache mount option on your gluster mount and see if it
changes any behavior, particularly with a cleanly mapped loop dev.
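For reference, fopen-keep-cache is a client-side fuse mount option, so it goes on the mount command rather than in the volume settings; the server and mount point below are placeholders:

```shell
# remount the gluster client with fopen-keep-cache
# (server1, myvol, and /mnt/gluster are hypothetical names)
umount /mnt/gluster
mount -t glusterfs -o fopen-keep-cache server1:/myvol /mnt/gluster
```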

