<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Jul 26, 2018 at 2:41 PM, Hu Bert <span dir="ltr"><<a href="mailto:revirii@googlemail.com" target="_blank">revirii@googlemail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span class="gmail-">> Sorry, bad copy/paste :-(.<br>
<br>
np :-)<br>
<br>
The question regarding version 4.1 was meant more generally: does<br>
gluster v4.0 etc. have a better performance than version 3.12 etc.?<br>
Just curious :-) Sooner or later we have to upgrade anyway.</span></blockquote><div><br></div>You can check what changed in the performance sections of the release notes:</div><div class="gmail_quote"><a href="https://github.com/gluster/glusterfs/blob/release-4.0/doc/release-notes/4.0.0.md#performance">https://github.com/gluster/glusterfs/blob/release-4.0/doc/release-notes/4.0.0.md#performance</a><br></div><div class="gmail_quote"><a href="https://github.com/gluster/glusterfs/blob/release-4.1/doc/release-notes/4.1.0.md#performance">https://github.com/gluster/glusterfs/blob/release-4.1/doc/release-notes/4.1.0.md#performance</a></div><div class="gmail_quote"> <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span class="gmail-">
<br>
btw.: gluster12 was the node with the failed brick, and i started the<br>
full heal on this node (has the biggest uuid as well). Is it normal<br>
that the glustershd.log on this node is rather empty (some hundred<br>
entries), but the glustershd.log files on the 2 other nodes have<br>
hundreds of thousands of entries?<br></span></blockquote><div><br></div><div>Heals happen on the nodes with the good bricks, so this is expected.</div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span class="gmail-">
<br>
</span>(sry, mail twice, didn't go to the list, but maybe others are<br>
interested... :-) )<br>
<div class="gmail-HOEnZb"><div class="gmail-h5"><br>
2018-07-26 10:17 GMT+02:00 Pranith Kumar Karampuri <<a href="mailto:pkarampu@redhat.com">pkarampu@redhat.com</a>>:<br>
><br>
><br>
> On Thu, Jul 26, 2018 at 12:59 PM, Hu Bert <<a href="mailto:revirii@googlemail.com">revirii@googlemail.com</a>> wrote:<br>
>><br>
>> Hi Pranith,<br>
>><br>
>> thanks a lot for your efforts and for tracking "my" problem with an issue.<br>
>> :-)<br>
>><br>
>><br>
>> I've set these params on the gluster volume and will start the<br>
>> 'find...' command shortly. I'll probably add another<br>
>> answer to the list to document the progress.<br>
>><br>
>> btw. - you had some typos:<br>
>> gluster volume set &lt;volname&gt; cluster.cluster.heal-wait-queue-length<br>
>> 10000 => cluster is doubled<br>
>> gluster volume set &lt;volname&gt; cluster.data-self-heal-window-size 16 =><br>
>> it's actually cluster.self-heal-window-size<br>
>><br>
>> but actually no problem :-)<br>
><br>
><br>
> Sorry, bad copy/paste :-(.<br>
><br>
>><br>
>><br>
>> Just curious: would gluster 4.1 improve the performance for healing<br>
>> and in general for "my" scenario?<br>
><br>
><br>
> No, this issue is present in all the existing releases. But it is solvable.<br>
> You can follow that issue to see progress and when it is fixed etc.<br>
><br>
>><br>
>><br>
>> 2018-07-26 8:56 GMT+02:00 Pranith Kumar Karampuri <<a href="mailto:pkarampu@redhat.com">pkarampu@redhat.com</a>>:<br>
>> > Thanks a lot for the detailed write-up, this helps find the bottlenecks<br>
>> > easily.<br>
>> > At a high level, to handle this directory hierarchy, i.e. lots of<br>
>> > directories with files, we need to improve the healing<br>
>> > algorithms. Based on the data you provided, we need to make the<br>
>> > following enhancements:<br>
>> ><br>
>> > 1) At the moment directories are healed one at a time, but up to 64 files<br>
>> > can be healed in parallel per replica subvolume.<br>
>> > So if you have n x 2 or n x 3 distributed subvolumes, it can heal 64n<br>
>> > files in parallel.<br>
>> ><br>
>> > I raised <a href="https://github.com/gluster/glusterfs/issues/477" rel="noreferrer" target="_blank">https://github.com/gluster/glusterfs/issues/477</a> to track this.<br>
>> > In the meantime you can use the following work-around:<br>
>> > a) Increase background heals on the mount:<br>
>> > gluster volume set &lt;volname&gt; cluster.background-self-heal-count 256<br>
>> > gluster volume set &lt;volname&gt; cluster.cluster.heal-wait-queue-length<br>
>> > 10000<br>
>> > find &lt;mnt&gt; -type d | xargs stat<br>
>> ><br>
>> > One 'find' will trigger heals for 10256 directories (256 background heals<br>
>> > plus 10000 queued). So you may have to do this<br>
>> > periodically until all directories are healed.<br>
>> ><br>
>> > 2) Self-heal heals a file 128KB at a time (data-self-heal-window-size). I<br>
>> > think for your environment bumping it up to MBs is better. Say 2MB, i.e.<br>
>> > 16*128KB?<br>
>> ><br>
>> > Command to do that is:<br>
>> > gluster volume set &lt;volname&gt; cluster.data-self-heal-window-size 16<br>
>> ><br>
>> ><br>
>> > On Thu, Jul 26, 2018 at 10:40 AM, Hu Bert <<a href="mailto:revirii@googlemail.com">revirii@googlemail.com</a>><br>
>> > wrote:<br>
>> >><br>
>> >> Hi Pranith,<br>
>> >><br>
>> >> Sorry, it took a while to count the directories. I'll try to answer your<br>
>> >> questions as well as possible.<br>
>> >><br>
>> >> > What kind of data do you have?<br>
>> >> > How many directories in the filesystem?<br>
>> >> > On average how many files per directory?<br>
>> >> > What is the depth of your directory hierarchy on average?<br>
>> >> > What is average filesize?<br>
>> >><br>
>> >> We have mostly images (more than 95% of disk usage, 90% of file<br>
>> >> count), some text files (like css, jsp, gpx etc.) and some binaries.<br>
>> >><br>
>> >> There are about 190,000 directories in the file system; maybe there<br>
>> >> are some more because we're hit by bug 1512371 (parallel-readdir =<br>
>> >> TRUE prevents directory listing). But the number of directories<br>
>> >> could/will rise in the future (maybe millions).<br>
>> >><br>
>> >> files per directory: ranges from 0 to 100, on average it should be 20<br>
>> >> files per directory (well, at least in the deepest dirs, see<br>
>> >> explanation below).<br>
>> >><br>
>> >> Average filesize: ranges from a few hundred bytes up to 30 MB, on<br>
>> >> average it should be 2-3 MB.<br>
>> >><br>
>> >> Directory hierarchy: maximum depth as seen from within the volume is<br>
>> >> 6, the average should be 3.<br>
>> >><br>
>> >> volume name: shared<br>
>> >> mount point on clients: /data/repository/shared/<br>
>> >> below /shared/ there are 2 directories:<br>
>> >> - public/: mainly calculated images (file sizes from a few KB up to<br>
>> >> max 1 MB) and some resources (small PNGs with a size of a few hundred<br>
>> >> bytes).<br>
>> >> - private/: mainly source images; file sizes from 50 KB up to 30MB<br>
>> >><br>
>> >> We migrated from a NFS server (SPOF) to glusterfs and simply copied<br>
>> >> our files. The images (which have an ID) are stored in the deepest<br>
>> >> directories of the dir tree. I'd better explain it :-)<br>
>> >><br>
>> >> directory structure for the images (i'll omit some other miscellaneous<br>
>> >> stuff, but it looks quite similar):<br>
>> >> - ID of an image has 7 or 8 digits<br>
>> >> - /shared/private/: /(first 3 digits of ID)/(next 3 digits of<br>
>> >> ID)/$ID.jpg<br>
>> >> - /shared/public/: /(first 3 digits of ID)/(next 3 digits of<br>
>> >> ID)/$ID/$misc_formats.jpg<br>
>> >><br>
>> >> That's why we have that many (sub-)directories. Files are only stored<br>
>> >> at the lowest level of the directory hierarchy. I hope I could make our<br>
>> >> structure at least a bit more transparent.<br>
>> >><br>
>> >> i hope there's something we can do to raise performance a bit. thx in<br>
>> >> advance :-)<br>
>> >><br>
>> >><br>
>> >> 2018-07-24 10:40 GMT+02:00 Pranith Kumar Karampuri<br>
>> >> <<a href="mailto:pkarampu@redhat.com">pkarampu@redhat.com</a>>:<br>
>> >> ><br>
>> >> ><br>
>> >> > On Mon, Jul 23, 2018 at 4:16 PM, Hu Bert <<a href="mailto:revirii@googlemail.com">revirii@googlemail.com</a>><br>
>> >> > wrote:<br>
>> >> >><br>
>> >> >> Well, over the weekend about 200GB were copied, so now there are<br>
>> >> >> ~400GB copied to the brick. That's far below a speed of 10GB per<br>
>> >> >> hour. If I copied the 1.6 TB directly, that would be done within<br>
>> >> >> 2 days at most. But with the self heal this will take at least<br>
>> >> >> 20 days.<br>
>> >> >><br>
>> >> >> Why is the performance that bad? No chance of speeding this up?<br>
>> >> ><br>
>> >> ><br>
>> >> > What kind of data do you have?<br>
>> >> > How many directories in the filesystem?<br>
>> >> > On average how many files per directory?<br>
>> >> > What is the depth of your directory hierarchy on average?<br>
>> >> > What is average filesize?<br>
>> >> ><br>
>> >> > Based on this data we can see if anything can be improved, or if there<br>
>> >> > are some enhancements that need to be implemented in gluster to address<br>
>> >> > this kind of data layout.<br>
>> >> >><br>
>> >> >><br>
>> >> >> 2018-07-20 9:41 GMT+02:00 Hu Bert <<a href="mailto:revirii@googlemail.com">revirii@googlemail.com</a>>:<br>
>> >> >> > hmm... no one any idea?<br>
>> >> >> ><br>
>> >> >> > Additional question: the hdd on server gluster12 was changed, so far<br>
>> >> >> > ~220 GB were copied. On the other 2 servers I see a lot of entries in<br>
>> >> >> > glustershd.log, about 312,000 and 336,000 entries respectively as of<br>
>> >> >> > yesterday, most of them (current log output) looking like this:<br>
>> >> >> ><br>
>> >> >> > [2018-07-20 07:30:49.757595] I [MSGID: 108026]<br>
>> >> >> > [afr-self-heal-common.c:1724:<wbr>afr_log_selfheal]<br>
>> >> >> > 0-shared-replicate-3:<br>
>> >> >> > Completed data selfheal on 0d863a62-0dd8-401c-b699-<wbr>2b642d9fd2b6.<br>
>> >> >> > sources=0 [2] sinks=1<br>
>> >> >> > [2018-07-20 07:30:49.992398] I [MSGID: 108026]<br>
>> >> >> > [afr-self-heal-metadata.c:52:_<wbr>_afr_selfheal_metadata_do]<br>
>> >> >> > 0-shared-replicate-3: performing metadata selfheal on<br>
>> >> >> > 0d863a62-0dd8-401c-b699-<wbr>2b642d9fd2b6<br>
>> >> >> > [2018-07-20 07:30:50.243551] I [MSGID: 108026]<br>
>> >> >> > [afr-self-heal-common.c:1724:<wbr>afr_log_selfheal]<br>
>> >> >> > 0-shared-replicate-3:<br>
>> >> >> > Completed metadata selfheal on<br>
>> >> >> > 0d863a62-0dd8-401c-b699-<wbr>2b642d9fd2b6.<br>
>> >> >> > sources=0 [2] sinks=1<br>
>> >> >> ><br>
>> >> >> > or like this:<br>
>> >> >> ><br>
>> >> >> > [2018-07-20 07:38:41.726943] I [MSGID: 108026]<br>
>> >> >> > [afr-self-heal-metadata.c:52:_<wbr>_afr_selfheal_metadata_do]<br>
>> >> >> > 0-shared-replicate-3: performing metadata selfheal on<br>
>> >> >> > 9276097a-cdac-4d12-9dc6-<wbr>04b1ea4458ba<br>
>> >> >> > [2018-07-20 07:38:41.855737] I [MSGID: 108026]<br>
>> >> >> > [afr-self-heal-common.c:1724:<wbr>afr_log_selfheal]<br>
>> >> >> > 0-shared-replicate-3:<br>
>> >> >> > Completed metadata selfheal on<br>
>> >> >> > 9276097a-cdac-4d12-9dc6-<wbr>04b1ea4458ba.<br>
>> >> >> > sources=[0] 2 sinks=1<br>
>> >> >> > [2018-07-20 07:38:44.755800] I [MSGID: 108026]<br>
>> >> >> > [afr-self-heal-entry.c:887:<wbr>afr_selfheal_entry_do]<br>
>> >> >> > 0-shared-replicate-3: performing entry selfheal on<br>
>> >> >> > 9276097a-cdac-4d12-9dc6-<wbr>04b1ea4458ba<br>
>> >> >> ><br>
>> >> >> > Is this behaviour normal? I'd expect these messages on the server with<br>
>> >> >> > the failed brick, not on the other ones.<br>
>> >> >> ><br>
>> >> >> > 2018-07-19 8:31 GMT+02:00 Hu Bert <<a href="mailto:revirii@googlemail.com">revirii@googlemail.com</a>>:<br>
>> >> >> >> Hi there,<br>
>> >> >> >><br>
>> >> >> >> I sent this mail yesterday, but somehow it didn't work? It wasn't archived,<br>
>> >> >> >> so please be indulgent if you receive this mail again :-)<br>
>> >> >> >><br>
>> >> >> >> We are currently running a replicate setup and are experiencing a<br>
>> >> >> >> quite poor performance. It got even worse when within a couple of<br>
>> >> >> >> weeks 2 bricks (disks) crashed. Maybe some general information of<br>
>> >> >> >> our<br>
>> >> >> >> setup:<br>
>> >> >> >><br>
>> >> >> >> 3 Dell PowerEdge R530 (Xeon E5-1650 v3 Hexa-Core, 64 GB DDR4, OS<br>
>> >> >> >> on<br>
>> >> >> >> separate disks); each server has 4 10TB disks -> each is a brick;<br>
>> >> >> >> replica 3 setup (see gluster volume status below). Debian<br>
>> >> >> >> stretch,<br>
>> >> >> >> kernel 4.9.0, gluster version 3.12.12. Servers and clients are<br>
>> >> >> >> connected via 10 GBit ethernet.<br>
>> >> >> >><br>
>> >> >> >> About a month ago and 2 days ago a disk died (on different servers);<br>
>> >> >> >> the disks were replaced, brought back into the volume, and a full self<br>
>> >> >> >> heal was started. But the speed for this is quite... disappointing. Each<br>
>> >> >> >> brick has ~1.6TB of data on it (mostly the infamous small files). The<br>
>> >> >> >> full heal I started yesterday copied only ~50GB within 24 hours (48<br>
>> >> >> >> hours: about 100GB) - at this rate it would take weeks until the self<br>
>> >> >> >> heal finishes.<br>
>> >> >> >><br>
>> >> >> >> After the first heal (started on gluster13 about a month ago, took<br>
>> >> >> >> about 3 weeks) finished we had terrible performance; CPU on one or<br>
>> >> >> >> two of the nodes (gluster11, gluster12) was up to 1200%, consumed by<br>
>> >> >> >> the brick process of the formerly crashed brick (bricksdd1),<br>
>> >> >> >> interestingly not on the server with the failed disk, but on the other<br>
>> >> >> >> 2...<br>
>> >> >> >><br>
>> >> >> >> Well... am i doing something wrong? Some options wrongly<br>
>> >> >> >> configured?<br>
>> >> >> >> Terrible setup? Anyone got an idea? Any additional information<br>
>> >> >> >> needed?<br>
>> >> >> >><br>
>> >> >> >><br>
>> >> >> >> Thx in advance :-)<br>
>> >> >> >><br>
>> >> >> >> gluster volume status<br>
>> >> >> >><br>
>> >> >> >> Volume Name: shared<br>
>> >> >> >> Type: Distributed-Replicate<br>
>> >> >> >> Volume ID: e879d208-1d8c-4089-85f3-<wbr>ef1b3aa45d36<br>
>> >> >> >> Status: Started<br>
>> >> >> >> Snapshot Count: 0<br>
>> >> >> >> Number of Bricks: 4 x 3 = 12<br>
>> >> >> >> Transport-type: tcp<br>
>> >> >> >> Bricks:<br>
>> >> >> >> Brick1: gluster11:/gluster/bricksda1/<wbr>shared<br>
>> >> >> >> Brick2: gluster12:/gluster/bricksda1/<wbr>shared<br>
>> >> >> >> Brick3: gluster13:/gluster/bricksda1/<wbr>shared<br>
>> >> >> >> Brick4: gluster11:/gluster/bricksdb1/<wbr>shared<br>
>> >> >> >> Brick5: gluster12:/gluster/bricksdb1/<wbr>shared<br>
>> >> >> >> Brick6: gluster13:/gluster/bricksdb1/<wbr>shared<br>
>> >> >> >> Brick7: gluster11:/gluster/bricksdc1/<wbr>shared<br>
>> >> >> >> Brick8: gluster12:/gluster/bricksdc1/<wbr>shared<br>
>> >> >> >> Brick9: gluster13:/gluster/bricksdc1/<wbr>shared<br>
>> >> >> >> Brick10: gluster11:/gluster/bricksdd1/<wbr>shared<br>
>> >> >> >> Brick11: gluster12:/gluster/bricksdd1_<wbr>new/shared<br>
>> >> >> >> Brick12: gluster13:/gluster/bricksdd1_<wbr>new/shared<br>
>> >> >> >> Options Reconfigured:<br>
>> >> >> >> cluster.shd-max-threads: 4<br>
>> >> >> >> performance.md-cache-timeout: 60<br>
>> >> >> >> cluster.lookup-optimize: on<br>
>> >> >> >> cluster.readdir-optimize: on<br>
>> >> >> >> performance.cache-refresh-<wbr>timeout: 4<br>
>> >> >> >> performance.parallel-readdir: on<br>
>> >> >> >> server.event-threads: 8<br>
>> >> >> >> client.event-threads: 8<br>
>> >> >> >> performance.cache-max-file-<wbr>size: 128MB<br>
>> >> >> >> performance.write-behind-<wbr>window-size: 16MB<br>
>> >> >> >> performance.io-thread-count: 64<br>
>> >> >> >> cluster.min-free-disk: 1%<br>
>> >> >> >> performance.cache-size: 24GB<br>
>> >> >> >> nfs.disable: on<br>
>> >> >> >> transport.address-family: inet<br>
>> >> >> >> performance.high-prio-threads: 32<br>
>> >> >> >> performance.normal-prio-<wbr>threads: 32<br>
>> >> >> >> performance.low-prio-threads: 32<br>
>> >> >> >> performance.least-prio-<wbr>threads: 8<br>
>> >> >> >> performance.io-cache: on<br>
>> >> >> >> server.allow-insecure: on<br>
>> >> >> >> performance.strict-o-direct: off<br>
>> >> >> >> transport.listen-backlog: 100<br>
>> >> >> >> server.outstanding-rpc-limit: 128<br>
>> >> >> ______________________________<wbr>_________________<br>
>> >> >> Gluster-users mailing list<br>
>> >> >> <a href="mailto:Gluster-users@gluster.org">Gluster-users@gluster.org</a><br>
>> >> >> <a href="https://lists.gluster.org/mailman/listinfo/gluster-users" rel="noreferrer" target="_blank">https://lists.gluster.org/<wbr>mailman/listinfo/gluster-users</a><br>
>> >> ><br>
>> >> ><br>
>> >> ><br>
>> >> ><br>
>> >> > --<br>
>> >> > Pranith<br>
>> ><br>
>> ><br>
>> ><br>
>> ><br>
>> > --<br>
>> > Pranith<br>
><br>
><br>
><br>
><br>
> --<br>
> Pranith<br>
</div></div></blockquote></div><br><br clear="all"><br>-- <br><div class="gmail_signature"><div dir="ltr">Pranith<br></div></div>
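<div><br></div><div>For anyone following along, the numbers in the work-around quoted above can be sanity-checked with a small Python sketch. The function names are made up for illustration; the constants are the figures given in this thread (4 x 3 volume, 64 parallel file heals per replica subvolume, background-self-heal-count 256, heal-wait-queue-length 10000, roughly 190,000 directories, a 128KB window unit). The number of 'find' passes is only a lower bound: it assumes every pass triggers heals for 10256 directories that still need healing, which may not hold in practice.</div>
<pre>
```python
import math

# Figures taken from this thread; adjust them for your own volume.
REPLICA_SUBVOLUMES = 4           # "Number of Bricks: 4 x 3 = 12"
FILE_HEALS_PER_SUBVOLUME = 64    # parallel file heals per replica subvolume
BACKGROUND_HEAL_COUNT = 256      # cluster.background-self-heal-count
HEAL_WAIT_QUEUE_LENGTH = 10000   # cluster.heal-wait-queue-length
DIRECTORY_COUNT = 190000         # approximate count reported for the volume
WINDOW_UNIT_KB = 128             # amount healed per window unit
WINDOW_SIZE = 16                 # value set for the self-heal window size


def parallel_file_heals(subvolumes, per_subvolume):
    """Files that can be healed in parallel across all replica subvolumes."""
    return subvolumes * per_subvolume


def directories_per_find(background_count, queue_length):
    """Directory heals one 'find' pass can trigger (running plus queued)."""
    return background_count + queue_length


def find_passes_needed(directories, per_pass):
    """Lower bound on 'find' passes, assuming each pass is fully used."""
    return math.ceil(directories / per_pass)


def heal_window_kb(window_size, unit_kb):
    """Data healed per step for a given window size, in KB."""
    return window_size * unit_kb


if __name__ == "__main__":
    per_pass = directories_per_find(BACKGROUND_HEAL_COUNT, HEAL_WAIT_QUEUE_LENGTH)
    print("parallel file heals:",
          parallel_file_heals(REPLICA_SUBVOLUMES, FILE_HEALS_PER_SUBVOLUME))
    print("directories per find:", per_pass)
    print("find passes (lower bound):",
          find_passes_needed(DIRECTORY_COUNT, per_pass))
    print("heal window in KB:", heal_window_kb(WINDOW_SIZE, WINDOW_UNIT_KB))
```
</pre>
<div>With the values from this thread that works out to 256 files healed in parallel, 10256 directories per 'find', at least 19 'find' passes for about 190,000 directories, and a 2048KB (2MB) heal window.</div>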
</div></div>