[Gluster-users] Would there be a use for cluster-specific filesystem tools?
John Brunelle
john_brunelle at harvard.edu
Wed Apr 16 23:09:21 UTC 2014
This is great stuff! I think there is a huge need for this. It's amazing
how much faster basic coreutils operations can be, even when just using one
client as you show.
I've had a lot of success using dftw[1,2] to speed up these recursive
operations on distributed filesystems (conditional finds, Lustre OST
retirement, etc.). It's MPI-based (hybrid, actually), and many such tasks
scale really well, and keep scaling when you spread the work over multiple
client nodes (especially with gluster, it seems). It's really fantastic.
I've done things like combine it with mrmpi[3] to create a general
mapreduce[4] for these situations.
Take du, for example: standard /usr/bin/du (or just a find that prints
sizes) on our 10-node distributed gluster filesystem takes well over a
couple of hours for trees with tens of thousands of files. We've cut that
down by over an order of magnitude, to about 10 minutes, with a simple
parallel du calculation using the above[5] (running with 16 procs across 4
nodes, at ~85% strong-scaling efficiency).
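To make that concrete, below is a minimal, serial sketch of the per-file
callback such a du calculation needs, written against plain POSIX nftw() --
this is just an illustration, not the fsmr code from [5]. My understanding
is that with an ftw()-style distributed walker like libdftw, each MPI rank
runs a callback of roughly this shape over its share of the tree, and the
per-rank totals are then combined with something like
MPI_Reduce(..., MPI_SUM, ...).

/* du_sketch.c -- serial illustration; compile with: cc -o du_sketch du_sketch.c */
#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>

static uint64_t total_bytes = 0;          /* running total for this process */

/* Called once per object visited by nftw(). */
static int add_size(const char *path, const struct stat *sb,
                    int typeflag, struct FTW *ftwbuf)
{
    (void)path; (void)ftwbuf;
    if (typeflag == FTW_F)                /* count regular files only */
        total_bytes += (uint64_t)sb->st_size;
    return 0;                             /* 0 = keep walking */
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <directory>\n", argv[0]);
        return 1;
    }
    if (nftw(argv[1], add_size, 64, FTW_PHYS) == -1) {
        perror("nftw");
        return 1;
    }
    printf("%llu bytes\n", (unsigned long long)total_bytes);
    return 0;
}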
Like you, I have hopes of making a package of such utilities. Your threaded
model will probably make this much more approachable, though, and more
elastic, too. I'll happily try out your tools when you're ready to post
them, and I bet a lot of others will, too.
Best,
John
[1] https://github.com/hpc/libdftw
[2] http://dl.acm.org/citation.cfm?id=2389114
[3] http://mapreduce.sandia.gov/
[4] https://github.com/jabrcx/fsmr
[5]
https://github.com/jabrcx/fsmr/blob/master/examples/fsmr.du_by_owner/example.c
On Wed, Apr 16, 2014 at 10:31 AM, Joe Julian <joe at julianfamily.org> wrote:
> Excellent! I've been toying with the same concept in the back of my mind
> for a long while now. I'm sure there is an unrealized desire for such tools.
>
> When you're ready, please put such a toolset on forge.gluster.org.
>
>
> On April 16, 2014 6:50:48 AM PDT, Michael Peek <peek at nimbios.org> wrote:
>
>> Hi guys,
>>
>> (I'm new to this, so pardon me if my shenanigans turn out to be a waste
>> of your time.)
>>
>> I have been experimenting with Gluster by copying and deleting large
>> numbers of files of all sizes. What I found was that deleting a large
>> number of small files takes a good chunk of my time -- in some cases a
>> significant percentage of the time it took to copy the files to the
>> cluster in the first place. I'm guessing the reason is a combination of
>> find and rm -fr processing files serially and waiting for packets to
>> travel back and forth over the network. But with a clustered filesystem,
>> processing files serially and waiting on network round trips becomes the
>> bottleneck, even though it doesn't have to be.
>>
>> So I decided to try an experiment. Instead of using /bin/rm to delete
>> files serially, I wrote my own quick-and-dirty recursive rm (and recursive
>> ls) that uses pthreads (listed as "cluster-rm" and "cluster-ls" in the
>> table below):
>>
>> Methods:
>>
>> 1) This was done on a Linux system. I suspect that Linux (or any modern
>> OS) caches filesystem information. For example, after setting up a
>> directory, rm -fr on that directory completes faster if I first run find
>> on the same directory. So to avoid this caching effect, each command was
>> run on its own test directory (i.e. find was never run on the same
>> directory as rm -fr or cluster-rm). This avoided inconsistencies from
>> caching and gave more consistent run times.
>>
>> 2) For each test run, the test directories for the four commands tested
>> (find, cluster-ls, rm -fr, cluster-rm) contained exactly the same data.
>>
>> 3) All commands were run on a client machine and not one of the cluster
>> nodes.
>>
>> Results:
>>
>> Data Size  Command      Test #1           Test #2           Test #3           Test #4
>> ---------  -----------  ----------------  ----------------  ----------------  ----------------
>> 49GB       find -print  real 6m45.066s    real 6m18.524s    real 5m45.301s    real 5m58.577s
>>                         user 0m0.172s     user 0m0.140s     user 0m0.156s     user 0m0.132s
>>                         sys 0m0.748s      sys 0m0.508s      sys 0m0.484s      sys 0m0.480s
>>
>> 49GB       cluster-ls   real 2m32.770s    real 2m21.376s    real 2m40.511s    real 2m36.202s
>>                         user 0m0.208s     user 0m0.164s     user 0m0.184s     user 0m0.172s
>>                         sys 0m1.876s      sys 0m1.568s      sys 0m1.488s      sys 0m1.412s
>>
>> 49GB       rm -fr       real 16m36.264s   real 16m16.795s   real 15m54.503s   real 16m10.037s
>>                         user 0m0.232s     user 0m0.248s     user 0m0.204s     user 0m0.168s
>>                         sys 0m1.724s      sys 0m1.528s      sys 0m1.396s      sys 0m1.448s
>>
>> 49GB       cluster-rm   real 1m50.717s    real 1m44.803s    real 2m6.250s     real 2m6.367s
>>                         user 0m0.236s     user 0m0.192s     user 0m0.224s     user 0m0.224s
>>                         sys 0m1.820s      sys 0m2.100s      sys 0m2.200s      sys 0m2.316s
>>
>> 97GB       find -print  real 11m39.990s   real 11m21.018s   real 11m33.257s   real 11m4.867s
>>                         user 0m0.380s     user 0m0.380s     user 0m0.288s     user 0m0.332s
>>                         sys 0m1.428s      sys 0m1.224s      sys 0m0.924s      sys 0m1.244s
>>
>> 97GB       cluster-ls   real 4m46.829s    real 5m15.538s    real 4m52.075s    real 4m43.134s
>>                         user 0m0.504s     user 0m0.408s     user 0m0.364s     user 0m0.452s
>>                         sys 0m3.228s      sys 0m3.736s      sys 0m3.004s      sys 0m3.140s
>>
>> 97GB       rm -fr       real 29m34.138s   real 28m11.000s   real 28m37.154s   real 28m41.724s
>>                         user 0m0.520s     user 0m0.556s     user 0m0.412s     user 0m0.380s
>>                         sys 0m3.908s      sys 0m3.480s      sys 0m2.756s      sys 0m4.184s
>>
>> 97GB       cluster-rm   real 3m30.750s    real 4m20.195s    real 4m45.206s    real 4m26.894s
>>                         user 0m0.524s     user 0m0.456s     user 0m0.444s     user 0m0.436s
>>                         sys 0m4.932s      sys 0m5.316s      sys 0m4.584s      sys 0m4.732s
>>
>> 145GB      find -print  real 16m26.498s   real 16m53.047s   real 15m10.704s   real 15m53.943s
>>                         user 0m0.520s     user 0m0.596s     user 0m0.364s     user 0m0.456s
>>                         sys 0m2.244s      sys 0m1.740s      sys 0m1.748s      sys 0m1.764s
>>
>> 145GB      cluster-ls   real 6m52.006s    real 7m7.361s     real 7m4.109s     real 6m37.229s
>>                         user 0m0.644s     user 0m0.804s     user 0m0.652s     user 0m0.656s
>>                         sys 0m5.664s      sys 0m5.432s      sys 0m4.800s      sys 0m4.652s
>>
>> 145GB      rm -fr       real 40m10.396s   real 42m17.851s   real 39m6.493s    real 39m52.047s
>>                         user 0m0.624s     user 0m0.844s     user 0m0.484s     user 0m0.496s
>>                         sys 0m5.492s      sys 0m4.872s      sys 0m4.868s      sys 0m4.980s
>>
>> 145GB      cluster-rm   real 6m49.769s    real 8m34.644s    real 6m3.563s     real 6m31.808s
>>                         user 0m0.708s     user 0m0.852s     user 0m0.636s     user 0m0.664s
>>                         sys 0m6.440s      sys 0m8.345s      sys 0m5.844s      sys 0m5.996s
>>
>> 1.1TB      find -print  real 62m4.043s    real 61m11.584s   real 65m37.389s   real 63m51.822s
>>                         user 0m1.300s     user 0m1.204s     user 0m1.708s     user 0m3.096s
>>                         sys 0m5.448s      sys 0m5.172s      sys 0m4.276s      sys 0m9.869s
>>
>> 1.1TB      cluster-ls   real 73m12.463s   real 68m37.846s   real 72m56.417s   real 69m3.575s
>>                         user 0m2.472s     user 0m2.080s     user 0m2.516s     user 0m4.316s
>>                         sys 0m19.289s     sys 0m18.625s     sys 0m18.601s     sys 0m35.986s
>>
>> 1.1TB      rm -fr       real 188m1.925s   real 190m21.850s  real 200m25.712s  real 196m12.686s
>>                         user 0m2.240s     user 0m2.372s     user 0m5.840s     user 0m4.916s
>>                         sys 0m21.705s     sys 0m18.885s     sys 0m46.363s     sys 0m41.519s
>>
>> 1.1TB      cluster-rm   real 85m46.463s   real 90m29.055s   real 88m16.063s   real 77m42.096s
>>                         user 0m2.512s     user 0m2.600s     user 0m4.456s     user 0m2.464s
>>                         sys 0m30.478s     sys 0m30.382s     sys 0m51.667s     sys 0m31.638s
>>
>>
>> Conclusions:
>>
>> 1) Once I had a threaded version of rm, a threaded version of ls was easy
>> to make, so I included it in the tests (listed above as cluster-ls).
>> Performance looked spiffy up until the 1.1TB range, when cluster-ls
>> started taking more time than find. Right now I can't explain why. 1.1TB
>> takes a long time to set up and process (about a day for each set of four
>> commands), so it could be that regular nightly backups were interfering
>> with performance. If that's the case, then it calls into question the
>> usefulness of my threaded approach. Also, the output from cluster-ls is
>> naturally out of order, so grep and sed would most likely be used in
>> conjunction with it, and I haven't yet time-tested
>> 'cluster-ls | some-other-command' against plain old find by itself.
>>
>> 2) Results from cluster-rm look pretty good to me across the board.
>> Again, performance falls off in the 1.1TB tests, and the reasons are not
>> clear to me at this time, but the run time is still roughly half that of
>> rm -fr. Run times fluctuate more than in the previous tests, but I
>> suppose that's to be expected. Since performance does drop, though, it
>> makes me wonder how well this approach scales on larger sets of data.
>>
>> 3) My threaded cluster-rm/ls commands are not clever. While traversing a
>> directory, each subdirectory found spawns a new thread to process it,
>> until a hard-coded limit is reached (100 threads for the results above).
>> Once the thread limit is reached, subdirectories are processed with plain
>> old recursion until a thread exits, freeing up a slot to process another
>> subdirectory.
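>>
>> To give a rough idea of the structure, here is a simplified sketch of
>> that spawn-up-to-a-cap-then-recurse scheme. It is an illustration only,
>> not the actual tool; error handling is minimal and names are made up.
>>
>> /* crm.c -- sketch of a threaded recursive rm; compile: cc -pthread -o crm crm.c */
>> #define _XOPEN_SOURCE 700
>> #include <dirent.h>
>> #include <pthread.h>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <string.h>
>> #include <sys/stat.h>
>> #include <unistd.h>
>>
>> #define MAX_THREADS 100            /* hard-coded cap, as in the tests */
>>
>> static int active_threads = 0;
>> static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
>>
>> static void remove_tree(const char *path);
>>
>> static void *thread_main(void *arg)
>> {
>>     char *path = arg;
>>     remove_tree(path);
>>     free(path);
>>     pthread_mutex_lock(&lock);
>>     active_threads--;              /* free a slot for another subdir */
>>     pthread_mutex_unlock(&lock);
>>     return NULL;
>> }
>>
>> static void remove_tree(const char *path)
>> {
>>     DIR *dir = opendir(path);
>>     if (!dir)
>>         return;
>>
>>     pthread_t kids[64];            /* children handed to helper threads */
>>     int nkids = 0;
>>     struct dirent *ent;
>>
>>     while ((ent = readdir(dir)) != NULL) {
>>         if (!strcmp(ent->d_name, ".") || !strcmp(ent->d_name, ".."))
>>             continue;
>>
>>         char child[4096];
>>         snprintf(child, sizeof child, "%s/%s", path, ent->d_name);
>>
>>         struct stat sb;
>>         if (lstat(child, &sb) != 0)
>>             continue;
>>
>>         if (!S_ISDIR(sb.st_mode)) {
>>             unlink(child);         /* files, symlinks, etc. */
>>             continue;
>>         }
>>
>>         int spawned = 0;
>>         pthread_mutex_lock(&lock);
>>         if (active_threads < MAX_THREADS && nkids < 64) {
>>             active_threads++;      /* reserve a slot before spawning */
>>             spawned = 1;
>>         }
>>         pthread_mutex_unlock(&lock);
>>
>>         if (spawned) {
>>             char *arg = strdup(child);
>>             if (pthread_create(&kids[nkids], NULL, thread_main, arg) == 0) {
>>                 nkids++;
>>                 continue;
>>             }
>>             free(arg);             /* spawn failed: release the slot */
>>             pthread_mutex_lock(&lock);
>>             active_threads--;
>>             pthread_mutex_unlock(&lock);
>>         }
>>         remove_tree(child);        /* over the cap: plain old recursion */
>>     }
>>     closedir(dir);
>>
>>     /* wait for helpers so the directory is empty before rmdir() */
>>     for (int i = 0; i < nkids; i++)
>>         pthread_join(kids[i], NULL);
>>     rmdir(path);
>> }
>>
>> int main(int argc, char **argv)
>> {
>>     if (argc != 2) {
>>         fprintf(stderr, "usage: %s <directory>\n", argv[0]);
>>         return 1;
>>     }
>>     remove_tree(argv[1]);
>>     return 0;
>> }
>>
>> (Drop the unlink()/rmdir() calls and print the paths instead and you
>> essentially have the ls variant.)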
>>
>> Further Research:
>>
>> A) I would like to test further with larger data sets.
>>
>> B) I would like to implement a smarter algorithm for determining how many
>> threads to use to maximize performance. Rather than a hard-coded maximum,
>> a better approach might be to use some metric such as inodes processed per
>> second to judge the effectiveness of adding more threads, and keep adding
>> them until a local maximum is reached. (A rough sketch of this idea
>> appears after this list.)
>>
>> C) How do these numbers change if the commands are run on one of the
>> cluster nodes instead of a client?
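>>
>> For (B), what I have in mind is purely hypothetical at this point, but it
>> might look something like a small monitor thread that samples inodes
>> processed per second and only keeps raising the cap while the rate
>> improves (the names below are made up for illustration):
>>
>> #include <pthread.h>
>> #include <stdatomic.h>
>> #include <unistd.h>
>>
>> atomic_long inodes_done;           /* workers increment per unlink()/rmdir() */
>> atomic_int  max_threads = 4;       /* the cap workers consult before spawning */
>>
>> /* Started once with pthread_create() before the walk begins. */
>> static void *tune_threads(void *unused)
>> {
>>     (void)unused;
>>     long prev = 0, prev_rate = 0;
>>     for (;;) {
>>         sleep(1);
>>         long now  = atomic_load(&inodes_done);
>>         long rate = now - prev;                 /* inodes per second */
>>         prev = now;
>>         if (rate > prev_rate)
>>             atomic_fetch_add(&max_threads, 4);  /* still helping: raise cap */
>>         else if (atomic_load(&max_threads) > 4)
>>             atomic_fetch_sub(&max_threads, 4);  /* past the knee: back off */
>>         prev_rate = rate;
>>     }
>>     return NULL;
>> }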
>>
>> I have some ideas of smarter things to try, but I am at best an
>> inexperienced (if enthusiastic) dabbler in the programming arts. A
>> professional would likely do a much better job.
>>
>> But if this data looks at all interesting or useful, then maybe there
>> would be a call for a handful of cluster-specific filesystem tools?
>>
>> Michael Peek
>>
> --
> Sent from my Android device with K-9 Mail. Please excuse my brevity.
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://supercolony.gluster.org/mailman/listinfo/gluster-users
>