[Gluster-users] BTRFS as a GlusterFS storage back-end, and what I've learned from using it as such.
Ravishankar N
ravishankar at redhat.com
Wed Apr 12 01:43:41 UTC 2017
Adding gluster-users list. I think there are a few users out there
running gluster on top of btrfs, so this might benefit a broader audience.
On 04/11/2017 09:10 PM, Austin S. Hemmelgarn wrote:
> About a year ago now, I decided to set up a small storage cluster to
> store backups (and partially replace Dropbox for my usage, but that's
> a separate story). I ended up using GlusterFS as the clustering
> software itself, and BTRFS as the back-end storage.
>
> GlusterFS itself is actually a pretty easy workload as far as cluster
> software goes. It does some processing prior to actually storing the
> data (a significant amount in fact), but the actual on-device storage
> on any given node is pretty simple. You have the full directory
> structure for the whole volume, and whatever files happen to be on
> that node are located within that tree exactly like they are in the
> GlusterFS volume. Beyond the basic data, gluster only stores 2-4
> xattrs per-file (which are used to track synchronization, and also for
> its internal data scrubbing), and a directory called .glusterfs in
> the top of the back-end storage location for the volume which contains
> the data required to figure out which node a file is on. Overall, the
> access patterns mostly mirror whatever is using the Gluster volume, or
> are reduced to slow streaming writes (when writing files and the
> back-end nodes are computationally limited instead of I/O limited),
> with the addition of some serious metadata operations in the
> .glusterfs directory (lots of stat calls there, together with large
> numbers of small files).
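> 
> If you want to see that bookkeeping for yourself, a minimal Python
> sketch along these lines works on a brick (the path is a placeholder,
> the trusted.* xattrs are only readable by root, and the exact names
> depend on the volume type):
> 
>     import os
> 
>     # Path to a file on the brick (back-end) filesystem -- adjust to
>     # your own brick layout.
>     path = "/bricks/brick1/some/file"
> 
>     # Gluster keeps its bookkeeping (gfid, replication/heal state,
>     # scrub data) in the trusted.* xattr namespace.
>     for name in os.listxattr(path):
>         if name.startswith("trusted."):
>             print(name, os.getxattr(path, name).hex())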
>
> As far as overall performance goes, BTRFS is on par with both ext4 and
> XFS for this usage (at least on my hardware), and I actually see more
> SSD-friendly access patterns when using BTRFS in this case than with
> any other FS I tried.
>
> After some serious experimentation with various configurations for
> this during the past few months, I've noticed a handful of other things:
>
> 1. The 'ssd' mount option does not actually improve performance on
> these SSDs. This surprised me at first, but having seen Hans' e-mail
> and what he found about this option, it makes sense: the erase-blocks
> on these devices are 4MB, not 2MB, and the drives have a very good FTL
> (so they will aggregate all the little writes properly on their own).
>
> Given this, I'm beginning to wonder whether it makes sense to stop
> automatically enabling this option at mount time for certain types of
> storage (for example, most SATA and SAS SSDs have reasonably good
> FTLs, so I would expect them to behave similarly). Extrapolating
> further, it might instead make sense to never enable it automatically
> and to expose the value it manipulates as a mount option, since there
> are other circumstances where setting a specific value could improve
> performance (for example, on hardware RAID6, setting it to the stripe
> size would probably help on many cheaper controllers).
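> 
> For reference, here's a rough Python sketch for checking what a device
> reports about itself (the device name is a placeholder; btrfs keys its
> 'ssd' detection off the rotational flag, and the reported sizes come
> straight from the drive's firmware, so they usually say nothing about
> the real erase-block size):
> 
>     # Print a few of the queue attributes the kernel exposes for a
>     # block device.  Adjust "sda" to whatever backs your filesystem.
>     dev = "sda"
>     for attr in ("rotational", "discard_granularity",
>                  "minimum_io_size", "optimal_io_size"):
>         with open("/sys/block/%s/queue/%s" % (dev, attr)) as f:
>             print(attr, f.read().strip())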
>
> 2. Up to a certain point, running a single larger BTRFS volume with
> multiple sub-volumes is more computationally efficient than running
> multiple smaller BTRFS volumes. More specifically, there is lower
> load on the system and lower CPU utilization by BTRFS itself, without
> much noticeable difference in performance (in my tests it was about a
> 0.5-1% performance difference, YMMV). To a certain extent this makes
> sense, but the crossover point was a lot higher than I expected (with
> this workload it was around half a terabyte).
>
> I believe this to be a side-effect of how we use per-filesystem
> worker-pools. In essence, we can schedule parallel access better when
> it's all through the same worker pool than we can when using multiple
> worker pools. Having realized this, I think it might be interesting
> to see if using a worker-pool per physical device (or at least what
> the system sees as a physical device) might make more sense in terms
> of performance than our current method of using a pool per-filesystem.
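> 
> To make it concrete, this is roughly the difference between the two
> setups I compared, as a Python sketch (all device names and paths here
> are made up):
> 
>     import subprocess
> 
>     def run(*cmd):
>         subprocess.run(cmd, check=True)
> 
>     # Setup A: one larger filesystem, one subvolume per Gluster brick.
>     run("mkfs.btrfs", "-f", "/dev/sdb1")
>     run("mount", "/dev/sdb1", "/srv/btrfs")
>     for brick in ("backups", "shared"):
>         run("btrfs", "subvolume", "create", "/srv/btrfs/" + brick)
> 
>     # Setup B: a separate, smaller filesystem per brick (commented out
>     # since it conflicts with setup A on the same disk).
>     # run("mkfs.btrfs", "-f", "/dev/sdb1")
>     # run("mount", "/dev/sdb1", "/srv/brick-backups")
>     # run("mkfs.btrfs", "-f", "/dev/sdb2")
>     # run("mount", "/dev/sdb2", "/srv/brick-shared")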
>
> 3. On these SSDs, running a single partition in dup mode is
> marginally more efficient than running two partitions in raid1 mode.
> I was somewhat surprised by this, and I haven't been able to find a
> clear explanation as to why (I suspect caching may have something to
> do with it, but I'm not 100% certain about that). Some limited
> testing with other SSDs seems to indicate that it's the case for most
> of them, with the difference being smaller on smaller and faster
> devices. On a traditional hard disk, dup is significantly more
> efficient, but that's generally to be expected.
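> 
> In case it's not clear what I'm comparing, these are the two layouts
> as a sketch (placeholder device names; -d dup on a single device needs
> a reasonably recent btrfs-progs):
> 
>     import subprocess
> 
>     # Layout 1: one partition, dup profile for both data and metadata.
>     dup_single = ["mkfs.btrfs", "-f", "-m", "dup", "-d", "dup",
>                   "/dev/sda2"]
> 
>     # Layout 2: two partitions on the same SSD, raid1 across them.
>     raid1_pair = ["mkfs.btrfs", "-f", "-m", "raid1", "-d", "raid1",
>                   "/dev/sda2", "/dev/sda3"]
> 
>     subprocess.run(dup_single, check=True)   # pick one of the two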
>
> 4. Depending on other factors, compression can actually slow you down
> pretty significantly. In the particular case where I saw this happen
> (all cores completely utilized by userspace software), LZO compression
> caused around a 5-10% performance degradation compared to no
> compression. This is somewhat obvious once it's explained, but it's
> not exactly intuitive, so it's probably worth documenting in the man
> pages that compression won't always make things better. I may send a
> patch to add this at some point in the near future.
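> 
> If anyone wants to check this on their own hardware, a rough sketch
> like this is what I'd use (device and mount point are placeholders, it
> needs root, and to reproduce the case I hit you'd want every core busy
> with something else while it runs):
> 
>     import os, subprocess, time
> 
>     dev, mnt = "/dev/sdb1", "/mnt/test"      # placeholders
>     # ~16MiB of repetitive (and therefore compressible) data.
>     chunk = os.urandom(64) * (1 << 18)
> 
>     for opts in ("noatime", "noatime,compress=lzo"):
>         subprocess.run(["mount", "-o", opts, dev, mnt], check=True)
>         start = time.monotonic()
>         with open(os.path.join(mnt, "testfile"), "wb") as f:
>             for _ in range(64):              # ~1GiB total
>                 f.write(chunk)
>             os.fsync(f.fileno())
>         print(opts, round(time.monotonic() - start, 2), "seconds")
>         os.unlink(os.path.join(mnt, "testfile"))
>         subprocess.run(["umount", mnt], check=True)
> 
> Nothing scientific, but it makes the CPU cost of the compression step
> easy to see once the cores are already saturated.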