<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><br class=""><div><blockquote type="cite" class=""><div class="">On 3 Feb 2017, at 13:48, Gambit15 <<a href="mailto:dougti+gluster@gmail.com" class="">dougti+gluster@gmail.com</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div dir="ltr" class=""><div class=""><div class=""><div class=""><div class="">Hi Alex,<br class=""></div> I don't use Gluster for storing large numbers of small files; however, from what I've read, that does appear to be its big Achilles heel.<br class=""></div></div></div></div></div></blockquote><div><br class=""></div><div>I am not an expert, but I agree: due to its distributed nature, the per-file access latency it induces plays a big role when you have to deal with lots of small files. It seems there are some tuning options available, though; a good place to start could be: <a href="https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3/html/Administration_Guide/Small_File_Performance_Enhancements.html" class="">https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3/html/Administration_Guide/Small_File_Performance_Enhancements.html</a></div><div><br class=""></div><blockquote type="cite" class=""><div class=""><div dir="ltr" class=""><div class=""><div class="">Personally, if you're not looking to scale out to a lot more servers, I'd go with Ceph or DRBD. Gluster's best features are in its scalability.<br class=""></div></div></div></div></blockquote><div><br class=""></div><div>AFAIK Ceph needs at least 3 monitors (i.e. a quorum) to be fully highly available, so the entry ticket is pretty high and, from my point of view, overkill for such needs, unless you plan to scale out too. 
DRBD seems a more reasonable approach.</div><div><br class=""></div><div>Cheers </div><br class=""><blockquote type="cite" class=""><div class=""><div dir="ltr" class=""><div class="">Also, it's worth pointing out that in any setup, you've got to be careful with two-node configurations, as they're highly vulnerable to split-brain scenarios.<br class=""><br class=""></div><div class="">Given the relatively small size of your data, caching tweaks & an arbiter may well save you here; however, I don't use its caching features enough to be able to give advice on them.<br class=""></div><div class=""><br class=""></div>D<br class=""></div><div class="gmail_extra"><br class=""><div class="gmail_quote">On 3 February 2017 at 08:28, Alex Sudakar <span dir="ltr" class=""><<a href="mailto:alex.sudakar@gmail.com" target="_blank" class="">alex.sudakar@gmail.com</a>></span> wrote:<br class=""><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi. I'm looking for a clustered filesystem for a very simple<br class="">
scenario. I've set up Gluster but my tests have shown quite a<br class="">
performance penalty when compared to using a local XFS filesystem.<br class="">
This no doubt reflects the reality of moving to a proper distributed<br class="">
filesystem, but I'd like to quickly check that I haven't missed<br class="">
something obvious that might improve performance.<br class="">
<br class="">
I plan to have two Amazon AWS EC2 instances (virtual machines) both<br class="">
accessing the same filesystem for read/writes. Access will be almost<br class="">
entirely reads, with the occasional modification, deletion or creation<br class="">
of files. Ideally I wanted all those reads going straight to the<br class="">
local XFS filesystem and just the writes incurring a distributed<br class="">
performance penalty. :-)<br class="">
<br class="">
So I've set up two VMs with Centos 7.2 and Gluster 3.8.8, each machine<br class="">
running as a combined Gluster server and client. One brick on each<br class="">
machine, one volume in a 1 x 2 replica configuration.<br class="">
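From memory, the setup boiled down to commands along these lines (a sketch; server and brick names as in the volume info further down, exact invocations may have differed):<br class="">

```shell
# Sketch of the two-node replica setup (names match the volume info;
# reconstructed from memory, so exact invocations may differ).
# On serverA, after starting glusterd on both nodes:
gluster peer probe serverC
gluster volume create g1 replica 2 \
    serverA:/data/brick1 serverC:/data/brick1
gluster volume start g1

# On each node, mount through the native FUSE client:
mount -t glusterfs localhost:/g1 /mnt/g1
```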
<br class="">
Everything works; it's just the performance penalty that's a surprise. :-)<br class="">
<br class="">
My test directory has 9,066 files and directories; 7,987 actual files.<br class="">
Total size is 63MB of data, 85MB allocated; an average of 8KB of data<br class="">
per file. The brick's files have a total of 117MB allocated; the extra<br class="">
32MB works out almost exactly to the sum of the extra 4KB extents<br class="">
allocated for the XFS attributes per file. The VMs were installed with<br class="">
the default 256-byte inode size for the local filesystem, and from what<br class="">
I've read Gluster will force the filesystem to allocate an extent for<br class="">
its attributes when they don't fit inside the inode. 'xfs_bmap' on a<br class="">
few files shows this is the case.<br class="">
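The checks themselves were nothing fancy; something like this (the file path is just an example):<br class="">

```shell
# Inode size the brick filesystem was formatted with (256 bytes by
# default here; a larger "-i size=512" at mkfs time lets Gluster's
# xattrs stay inside the inode). /data is the brick's mount point.
xfs_info /data | grep isize

# Gluster's extended attributes on a brick file (example path):
getfattr -d -m . -e hex /data/brick1/some/file

# Attribute-fork extent map; the extra 4KB extent shows up here:
xfs_bmap -a -v /data/brick1/some/file
```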
<br class="">
A simple 'cat' of every file when laid out in 'native' directories on<br class="">
the XFS filesystem takes about 3 seconds. A cat of all the files in<br class="">
the brick's directory on the same filesystem takes about 6.4 seconds,<br class="">
which I figure is due to the extra I/O for the inode metadata extents<br class="">
(although I'm not quite certain; the additional extents added about 40%<br class="">
extra to the disk block allocation, so I'm unsure why the time<br class="">
increase was over 100%).<br class="">
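(The timing method was essentially this, pointed at each location in turn; the scratch directory here is a stand-in for the real tree:)<br class="">

```shell
# Recreate a comparable tree of small files in a scratch directory
# (a stand-in for the real test data), then time reading them all.
DIR=/tmp/smallfile-test
mkdir -p "$DIR"
for i in $(seq 1 200); do
    head -c 8192 /dev/zero > "$DIR/f$i"    # ~8KB per file, like the test set
done

# Read every file and discard the contents; elapsed time is the metric.
time find "$DIR" -type f -exec cat {} + > /dev/null
```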
<br class="">
Doing the same test through the glusterfs mount takes about 25<br class="">
seconds; roughly four times longer than reading those same files<br class="">
directly from the brick itself.<br class="">
<br class="">
The test took 30 seconds before I applied the 'md-cache' settings (for<br class="">
those variables that still exist in 3.8.8) mentioned in this very<br class="">
helpful article:<br class="">
<br class="">
<a href="http://blog.gluster.org/category/performance/" rel="noreferrer" target="_blank" class="">http://blog.gluster.org/category/performance/</a><br class="">
<br class="">
So use of the md-cache in a 'cold run' shaved off 5 seconds - due to<br class="">
common directory LOOKUP operations being cached I guess.<br class="">
<br class="">
Output of a 'volume info' is as follows:<br class="">
<br class="">
Volume Name: g1<br class="">
Type: Replicate<br class="">
Volume ID: bac6cd70-ca0d-4173-9122-644051444fe5<br class="">
Status: Started<br class="">
Snapshot Count: 0<br class="">
Number of Bricks: 1 x 2 = 2<br class="">
Transport-type: tcp<br class="">
Bricks:<br class="">
Brick1: serverA:/data/brick1<br class="">
Brick2: serverC:/data/brick1<br class="">
Options Reconfigured:<br class="">
transport.address-family: inet<br class="">
performance.readdir-ahead: on<br class="">
nfs.disable: on<br class="">
cluster.self-heal-daemon: enable<br class="">
features.cache-invalidation: on<br class="">
features.cache-invalidation-timeout: 600<br class="">
performance.stat-prefetch: on<br class="">
performance.md-cache-timeout: 60<br class="">
network.inode-lru-limit: 90000<br class="">
<br class="">
The article suggests a value of 600 for<br class="">
features.cache-invalidation-timeout but my Gluster version only<br class="">
permits a maximum value of 60.<br class="">
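(For completeness, those options were applied with 'gluster volume set':)<br class="">

```shell
# md-cache settings from the article, applied to volume g1
# (60 is the maximum my 3.8.8 accepts for the invalidation timeout):
gluster volume set g1 features.cache-invalidation on
gluster volume set g1 features.cache-invalidation-timeout 60
gluster volume set g1 performance.stat-prefetch on
gluster volume set g1 performance.md-cache-timeout 60
gluster volume set g1 network.inode-lru-limit 90000
```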
<br class="">
Network speed between the two VMs is about 120 MBytes/sec - the two<br class="">
VMs inhabit the same Amazon Virtual Private Cloud - so I don't think<br class="">
bandwidth is a factor.<br class="">
<br class="">
The fourfold slowdown is no doubt the penalty incurred in moving to a<br class="">
proper distributed filesystem. That article and other web pages I've<br class="">
read all say that each open of a file results in synchronous LOOKUP<br class="">
operations on all the replicas, so I'm guessing it just takes that<br class="">
much time for everything to happen before a file can be opened.<br class="">
Gluster profiling shows that there are 11,198 LOOKUP operations on the<br class="">
test cat of the 7,987 files.<br class="">
<br class="">
As a Gluster newbie I'd appreciate some quick advice if possible -<br class="">
<br class="">
1. Is this sort of performance hit - on directories of small files -<br class="">
typical for such a simple Gluster configuration?<br class="">
<br class="">
2. Is there anything I can do to speed things up? :-)<br class="">
<br class="">
3. Repeating the 'cat' test immediately after the first test run saw<br class="">
the time dive from 25 seconds down to 4 seconds. Before I'd set those<br class="">
md-cache variables it had taken 17 seconds, due, I assume, to the<br class="">
actual file data being cached in the Linux buffer cache. So those<br class="">
md-cache settings really did make a change - taking off another 13<br class="">
seconds - once everything was cached.<br class="">
<br class="">
Flushing/invalidating the Linux memory cache made the next test go<br class="">
back to 25 seconds. So it seems to me that the md-cache must hold<br class="">
its contents in the Linux buffer cache ... which surprised me,<br class="">
because I thought a user-space system like Gluster would have the<br class="">
cache within the daemons or maybe a shared memory segment, nothing<br class="">
that would be affected by clearing the Linux buffer cache. I was<br class="">
expecting a run after invalidating the linux cache would take<br class="">
something between 4 seconds and 25 seconds, with the md-cache still<br class="">
primed but the file data expired.<br class="">
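(By 'flushing the Linux memory cache' I mean the usual drop_caches knob, i.e.:)<br class="">

```shell
# Write out dirty pages, then drop the page cache plus dentries and
# inodes before the next timed run (needs root):
sync
echo 3 > /proc/sys/vm/drop_caches
```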
<br class="">
Just out of curiosity in how the md-cache is implemented ... why does<br class="">
clearing the Linux buffers seem to affect it?<br class="">
<br class="">
4. The documentation says that Gluster geo-replication does<br class="">
'asynchronous replication', which is something that would really help,<br class="">
but that it's 'master/slave', so I'm assuming geo-replication won't<br class="">
fulfill my requirement that both servers be able to occasionally<br class="">
write/modify/delete files?<br class="">
<br class="">
5. In my brick directory I have a '.trashcan' subdirectory - which is<br class="">
documented - but also a '.glusterfs' directory, which seems to have<br class="">
lots of magical files in some sort of housekeeping structure.<br class="">
Surprisingly the total amount of data under .glusterfs is greater than<br class="">
the total size of the actual files in my test directory. I haven't<br class="">
seen a description of what .glusterfs is used for ... are its files<br class="">
vital to the operation of Gluster, or can they be deleted? Just curious.<br class="">
At one stage I had 1.1 GB of files in my volume, which expanded to<br class="">
1.5GB in the brick (due to the metadata extents), and a whopping 1.6GB<br class="">
of extra data materialized under the .glusterfs directory!<br class="">
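(One clue I noticed: the entries under .glusterfs seem to be hard links to the brick's real files, named by GFID, e.g.:)<br class="">

```shell
# Check whether a brick file's .glusterfs entry is a hard link to it
# (example path; link count 2 = the file plus its GFID-named link):
F=/data/brick1/some/file
stat -c %h "$F"
find /data/brick1/.glusterfs -samefile "$F"
```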
<br class="">
6. Since I'm using Centos I try to stick with things that are<br class="">
available through the Red Hat repository channel ... so in my looking<br class="">
for distributed filesystems I saw mention of Ceph. Because I wanted<br class="">
only a simple replicated filesystem it seemed to me that Ceph - being<br class="">
based/focused on 'object' storage? - wouldn't be as good a fit as<br class="">
Gluster. Evil question to a Gluster mailing list - will Ceph give me<br class="">
any significantly better performance in reading small files?<br class="">
<br class="">
I've tried to investigate and find out what I can but I could be<br class="">
missing something really obvious in my ignorance, so I would<br class="">
appreciate any quick tips/answers from the experts. Thanks!<br class="">
_______________________________________________<br class="">
Gluster-users mailing list<br class="">
<a href="mailto:Gluster-users@gluster.org" class="">Gluster-users@gluster.org</a><br class="">
<a href="http://lists.gluster.org/mailman/listinfo/gluster-users" rel="noreferrer" target="_blank" class="">http://lists.gluster.org/mailman/listinfo/gluster-users</a><br class="">
</blockquote></div><br class=""></div>
</div></blockquote></div><br class=""></body></html>