[Gluster-users] BTRFS as a GlusterFS storage back-end, and what I've learned from using it as such.

Ravishankar N ravishankar at redhat.com
Wed Apr 12 01:43:41 UTC 2017


Adding gluster-users list. I think there are a few users out there 
running gluster on top of btrfs, so this might benefit a broader audience.

On 04/11/2017 09:10 PM, Austin S. Hemmelgarn wrote:
> About a year ago now, I decided to set up a small storage cluster to 
> store backups (and partially replace Dropbox for my usage, but that's 
> a separate story).  I ended up using GlusterFS as the clustering 
> software itself, and BTRFS as the back-end storage.
>
> GlusterFS itself is actually a pretty easy workload as far as cluster 
> software goes.  It does some processing prior to actually storing the 
> data (a significant amount in fact), but the actual on-device storage 
> on any given node is pretty simple.  You have the full directory 
> structure for the whole volume, and whatever files happen to be on 
> that node are located within that tree exactly like they are in the 
> GlusterFS volume.  Beyond the basic data, gluster only stores 2-4 
> xattrs per file (which are used to track synchronization and also for 
> its internal data scrubbing), and a directory called .glusterfs at 
> the top of the back-end storage location for the volume, which contains 
> the data required to figure out which node a file is on.  Overall, the 
> access patterns mostly mirror whatever is using the Gluster volume, or 
> are reduced to slow streaming writes (when writing files and the 
> back-end nodes are computationally limited instead of I/O limited), 
> with the addition of some serious metadata operations in the 
> .glusterfs directory (lots of stat calls there, together with large 
> numbers of small files).
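>
> (For illustration only: these xattrs and the .glusterfs tree can be 
> inspected directly on a brick with standard tools.  The attribute 
> names shown are typical for a replicated volume and the paths are 
> placeholders, so treat this as a sketch rather than an exact listing:)
>
>     # dump gluster's xattrs for one file stored on the brick
>     getfattr -d -m . -e hex /srv/gluster/brick1/some-file
>     # typically: trusted.gfid, trusted.afr.* (sync/heal state),
>     # trusted.bit-rot.* (scrubbing)
>
>     # the .glusterfs directory maps gfids back to files
>     # (hard links for regular files, symlinks for directories)
>     ls /srv/gluster/brick1/.glusterfs/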
>
> As far as overall performance goes, BTRFS is on par with both ext4 
> and XFS for this usage (at least on my hardware), and I see more 
> SSD-friendly access patterns when using BTRFS in this case than with 
> any other FS I tried.
>
> After some serious experimentation with various configurations for 
> this during the past few months, I've noticed a handful of other things:
>
> 1. The 'ssd' mount option does not actually improve performance on 
> these SSDs.  This surprised me at first, but having seen Hans' e-mail 
> and what he found about this option, it makes sense: erase blocks on 
> these devices are 4MB, not 2MB, and the drives have a very good FTL 
> (so they will aggregate all the small writes properly).
>
> Given this, I'm beginning to wonder whether it makes sense to not 
> automatically enable this option on mount for certain types of 
> storage (for example, most SATA and SAS SSDs have reasonably good 
> FTLs, so I would expect them to behave similarly).  Extrapolating 
> further, it might instead make sense to never enable it automatically 
> and to expose the value this option manipulates as a mount option 
> itself, since there are other circumstances where setting a specific 
> value could improve performance (for example, on hardware RAID6, 
> setting it to the stripe size would probably improve performance on 
> many cheaper controllers).
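>
> (As a point of reference, a minimal sketch of how the current 
> behaviour can be checked and overridden from userspace; the device 
> and mount point names are just placeholders:)
>
>     # btrfs turns 'ssd' on automatically when this reads 0
>     cat /sys/block/sdb/queue/rotational
>
>     # force the behaviour off (or on) regardless of detection
>     mount -o nossd /dev/sdb1 /srv/gluster/brick1
>     mount -o ssd   /dev/sdb1 /srv/gluster/brick1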
>
> 2. Up to a certain point, running a single larger BTRFS volume with 
> multiple sub-volumes is more computationally efficient than running 
> multiple smaller BTRFS volumes.  More specifically, there is lower 
> load on the system and lower CPU utilization by BTRFS itself, with 
> little noticeable difference in performance (about 0.5-1% in my 
> tests, YMMV).  To a certain extent this makes sense, but the turnover 
> point was a lot higher than I expected (with this workload it was 
> around half a terabyte).
>
> I believe this to be a side-effect of how we use per-filesystem 
> worker-pools.  In essence, we can schedule parallel access better when 
> it's all through the same worker pool than we can when using multiple 
> worker pools.  Having realized this, I think it might be interesting 
> to see if using a worker-pool per physical device (or at least what 
> the system sees as a physical device) might make more sense in terms 
> of performance than our current method of using a pool per-filesystem.
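>
> (For concreteness, a minimal sketch of the single-filesystem, 
> subvolume-per-brick layout described above; device and path names 
> are placeholders:)
>
>     mkfs.btrfs /dev/sdb1
>     mount /dev/sdb1 /srv/btrfs
>     # one subvolume per gluster brick instead of a separate
>     # filesystem for each
>     btrfs subvolume create /srv/btrfs/brick1
>     btrfs subvolume create /srv/btrfs/brick2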
>
> 3. On these SSDs, running a single partition in dup mode is actually 
> marginally more efficient than running 2 partitions in raid1 mode.  I 
> was somewhat surprised by this, and I haven't been able to find a 
> clear explanation as to why (I suspect caching may have something to 
> do with it, but I'm not 100% certain), but some limited testing with 
> other SSDs seems to indicate that it's the case for most SSDs, with 
> the difference being smaller on smaller and faster devices.  On a 
> traditional hard disk, the single-partition dup setup is 
> significantly more efficient, but that's generally to be expected.
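>
> (For illustration, the two layouts being compared; the device names 
> are placeholders, and 'dup' for data on a single device needs a 
> reasonably recent btrfs-progs:)
>
>     # one partition, both data and metadata duplicated
>     mkfs.btrfs -d dup -m dup /dev/sdb1
>
>     # two partitions, raid1 across them
>     mkfs.btrfs -d raid1 -m raid1 /dev/sdb1 /dev/sdb2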
>
> 4. Depending on other factors, compression can actually slow you 
> down pretty significantly.  In the particular case where I saw this 
> happen (all cores completely utilized by userspace software), LZO 
> compression caused around 5-10% performance degradation compared to 
> no compression.  This is somewhat obvious once it's explained, but 
> it's not exactly intuitive, and as such it's probably worth 
> documenting in the man pages that compression won't always make 
> things better.  I may send a patch to add this at some point in the 
> near future.
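>
> (For reference, the compression knobs involved; device and paths are 
> placeholders, and compression can also be enabled per directory or 
> per file rather than filesystem-wide:)
>
>     # filesystem-wide LZO compression
>     mount -o compress=lzo /dev/sdb1 /srv/gluster/brick1
>
>     # or opt in for a single directory/file only
>     btrfs property set /srv/gluster/brick1/some-dir compression lzo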