[Gluster-users] Quick performance check?

Fri Feb 3 11:28:04 UTC 2017

Hi.  I'm looking for a clustered filesystem for a very simple
scenario.  I've set up Gluster but my tests have shown quite a
performance penalty when compared to using a local XFS filesystem.
This no doubt reflects the reality of moving to a proper distributed
filesystem, but I'd like to quickly check that I haven't missed
something obvious that might improve performance.

I plan to have two Amazon AWS EC2 instances (virtual machines) both
accessing the same filesystem for read/writes.  Access will be almost
entirely reads, with the occasional modification, deletion or creation
of files.  Ideally I wanted all those reads going straight to the
local XFS filesystem and just the writes incurring a distributed
performance penalty.  :-)

So I've set up two VMs with CentOS 7.2 and Gluster 3.8.8, each machine
running as a combined Gluster server and client.  One brick on each
machine, one volume in a 1 x 2 replica configuration.

Everything works, it's just the performance penalty which is a surprise.  :-)

My test directory has 9,066 files and directories; 7,987 actual files.
Total size is 63MB data, 85MB allocated; an average size of 8KB data
per file.  The brick's files have a total of 117MB allocated, with the
extra 32MB working out pretty much to be exactly the sum of the extra
4KB extents that would have been allocated for the XFS attributes per
file - the VMs were installed with the default 256 byte inode size for
the local filesystem, and from what I've read Gluster will force the
filesystem to allocate an extent for its attributes.  'xfs_bmap' on a
few files shows this is the case.
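
As a rough check of that arithmetic, here is a small sketch.  It
assumes exactly one extra 4KB extent per regular file, ignores
directories, and treats MB as 1,000,000 bytes; the variable names are
just for illustration.

```python
# Back-of-envelope check: does one extra 4 KB extent per file explain the
# gap between the native layout (85 MB) and the brick (117 MB)?
FILES = 7_987              # regular files in the test directory
EXTENT_BYTES = 4 * 1024    # assumed size of the extra attribute extent
NATIVE_MB = 85             # allocated size of the native copy
BRICK_MB = 117             # allocated size of the same files on the brick

observed_extra_mb = BRICK_MB - NATIVE_MB
predicted_extra_mb = FILES * EXTENT_BYTES / 1_000_000

print(f"observed extra:  {observed_extra_mb} MB")
print(f"predicted extra: {predicted_extra_mb:.1f} MB")
```

The predicted figure comes out at roughly 32.7MB against the observed
32MB, which is close enough to support the one-extent-per-file
explanation.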

A simple 'cat' of every file when laid out in 'native' directories on
the XFS filesystem takes about 3 seconds.  A cat of all the files in
the brick's directory on the same filesystem takes about 6.4 seconds,
which I figure is due to the extra I/O for the inode metadata extents
(although not quite certain; the additional extents added about 40%
extra to the disk block allocation, so I'm unsure as to why the time
increase was 100%).

Doing the same test through the glusterfs mount takes about 25
seconds; roughly four times longer than reading those same files
directly from the brick itself.
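
For anyone wanting to repeat the comparison, the 'cat of every file'
test can be approximated with a short script like the one below.  This
is only a sketch of the idea, not the exact commands I ran; the
function names are made up for illustration, and the directory to
read is whatever path you pass in.

```python
import os
import time


def read_tree(root):
    """Read every regular file under root; return (file_count, total_bytes)."""
    count = 0
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):  # skip broken symlinks, sockets, etc.
                continue
            with open(path, "rb") as fh:
                total += len(fh.read())
            count += 1
    return count, total


def timed_read(root):
    """Run read_tree on root; return (file_count, total_bytes, seconds)."""
    start = time.perf_counter()
    count, total = read_tree(root)
    return count, total, time.perf_counter() - start
```

Calling timed_read() once on the native directory, once on the copy
under the brick and once on the same directory through the glusterfs
mount gives the three numbers to compare.  For a cold run the Linux
cache needs to be cleared before each call.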

It took 30 seconds until I applied the 'md-cache' settings (for those
variables that still exist in 3.8.8) mentioned in this very helpful
article:

  http://blog.gluster.org/category/performance/

So use of the md-cache in a 'cold run' shaved off 5 seconds - due to
common directory LOOKUP operations being cached I guess.

Output of a 'volume info' is as follows:

Volume Name: g1
Type: Replicate
Volume ID: bac6cd70-ca0d-4173-9122-644051444fe5
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: serverA:/data/brick1
Brick2: serverC:/data/brick1
Options Reconfigured:
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
cluster.self-heal-daemon: enable
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.stat-prefetch: on
performance.md-cache-timeout: 60
network.inode-lru-limit: 90000

The article suggests a value of 600 for
performance.md-cache-timeout but my Gluster version only
permits a maximum value of 60.
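
For reference, a minimal sketch of how the cache-related options in
the listing above could be applied with 'gluster volume set'.  It only
prints the commands rather than running them, and which of the listed
options count as md-cache settings is an assumption on my part.

```python
# Option values copied from the "Options Reconfigured" listing above.
MD_CACHE_OPTIONS = {
    "features.cache-invalidation": "on",
    "features.cache-invalidation-timeout": "600",
    "performance.stat-prefetch": "on",
    "performance.md-cache-timeout": "60",
    "network.inode-lru-limit": "90000",
}


def volume_set_commands(volume, options):
    """Build one 'gluster volume set' argument list per option."""
    return [
        ["gluster", "volume", "set", volume, key, value]
        for key, value in options.items()
    ]


for cmd in volume_set_commands("g1", MD_CACHE_OPTIONS):
    print(" ".join(cmd))
```

To actually apply them, each argument list could be passed to
subprocess.run(cmd, check=True) as root on one of the servers.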

Network speed between the two VMs is about 120 MBytes/sec - the two
VMs inhabit the same Amazon Virtual Private Cloud - so I don't think
bandwidth is a factor.

The roughly fourfold slowdown is no doubt the penalty incurred in moving to a
proper distributed filesystem.  That article and other web pages I've
read all say that each open of a file results in synchronous LOOKUP
operations on all the replicas, so I'm guessing it just takes that
much time for everything to happen before a file can be opened.
Gluster profiling shows that there are 11,198 LOOKUP operations on the
test cat of the 7,987 files.
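
To put those numbers in per-file terms, here is a quick calculation.
It assumes the cost is spread evenly over the files, which may well
not be the case, so treat the results as rough averages only.

```python
FILES = 7_987          # regular files read by the test
LOOKUPS = 11_198       # LOOKUP operations reported by Gluster profiling
MOUNT_SECONDS = 25.0   # cold read through the glusterfs mount
BRICK_SECONDS = 6.4    # cold read directly from the brick directory

lookups_per_file = LOOKUPS / FILES
ms_per_file_mount = MOUNT_SECONDS / FILES * 1000
ms_per_file_brick = BRICK_SECONDS / FILES * 1000
extra_ms_per_file = ms_per_file_mount - ms_per_file_brick

print(f"LOOKUPs per file:        {lookups_per_file:.2f}")
print(f"ms per file via mount:   {ms_per_file_mount:.2f}")
print(f"ms per file via brick:   {ms_per_file_brick:.2f}")
print(f"extra ms per file:       {extra_ms_per_file:.2f}")
```

That works out to about 1.4 LOOKUPs and a little over 2 ms of extra
time per file when going through the mount.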

As a Gluster newbie I'd appreciate some quick advice if possible -

1.  Is this sort of performance hit - on directories of small files -
typical for such a simple Gluster configuration?

2.  Is there anything I can do to speed things up?  :-)

3.  Repeating the 'cat' test immediately after the first test run saw
the time drop from 25 seconds to 4 seconds.  Before I'd set those
md-cache variables the repeat run had taken 17 seconds (down from 30),
which I assume was due to the actual file data being cached in the
Linux buffer cache.  So those md-cache settings really did make a
difference - taking off another 13 seconds - once everything was
cached.

Flushing/invalidating the Linux memory cache made the next test go
back to the 25 seconds.  So it seems to me that the md-cache must hold
its contents in the Linux buffer cache ... which surprised me,
because I thought a user-space system like Gluster would keep the
cache within the daemons or maybe a shared memory segment, nothing
that would be affected by clearing the Linux buffer cache.  I was
expecting a run after invalidating the Linux cache to take
something between 4 seconds and 25 seconds, with the md-cache still
primed but the file data expired.

Just out of curiosity in how the md-cache is implemented ... why does
clearing the Linux buffers seem to affect it?

4.  The documentation says that Gluster geo-replication does
'asynchronous replication', which is something that would really help,
but that it's 'master/slave', so I'm assuming that geo-replication
won't fulfill my requirement of both servers being able to
occasionally write/modify/delete files?

5.  In my brick directory I have a '.trashcan' subdirectory - which is
documented - but also a '.glusterfs' directory, which seems to have
lots of magical files in some sort of housekeeping structure.
Surprisingly the total amount of data under .glusterfs is greater than
the total size of the actual files in my test directory.  I haven't
seen a description of what .glusterfs is used for ... are its contents
vital to the operation of Gluster, or can they be deleted?  Just curious.
At one stage I had 1.1 GB of files in my volume, which expanded to be
1.5GB in the brick (due to the metadata extents) and a whopping 1.6GB
of extra data materialized under the .glusterfs directory!

6.  Since I'm using CentOS I try to stick with things that are
available through the Red Hat repository channel ... so in my looking
for distributed filesystems I saw mention of Ceph.  Because I wanted
only a simple replicated filesystem it seemed to me that Ceph - being
based/focused on 'object' storage? - wouldn't be as good a fit as
Gluster.  Evil question to a Gluster mailing list - will Ceph give me
any significantly better performance in reading small files?

I've tried to investigate and find out what I can but I could be
missing something really obvious in my ignorance, so I would
appreciate any quick tips/answers from the experts.  Thanks!