<div dir="ltr"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">@Artem what is the average size of the files for your apaches ?<br></blockquote><div><br></div><div>The average size is probably 15-20MB, but the files are as large as 100MB+. The files are a combination of Android APK files and image files.</div><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><br>Sincerely,<br>Artem<br><br>--<br>Founder, <a href="http://www.androidpolice.com" target="_blank">Android Police</a>, <a href="http://www.apkmirror.com/" style="font-size:12.8px" target="_blank">APK Mirror</a><span style="font-size:12.8px">, Illogical Robot LLC</span></div><div dir="ltr"><a href="http://beerpla.net/" target="_blank">beerpla.net</a> | <a href="http://twitter.com/ArtemR" target="_blank">@ArtemR</a><br></div></div></div></div></div></div></div></div></div></div></div></div></div><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, May 15, 2020 at 11:20 PM Strahil Nikolov <<a href="mailto:hunter86_bg@yahoo.com">hunter86_bg@yahoo.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On May 16, 2020 1:14:17 AM GMT+03:00, Artem Russakovskii <<a href="mailto:archon810@gmail.com" target="_blank">archon810@gmail.com</a>> wrote:<br>
> Hi Hari,
>
> Thanks for the analysis.
>
> 1. I understand why the rebooted node would show 0 heal-pending files
> while the count on the other nodes would be going up. The problem with
> 5.11 was that there was a bug that prevented the heal from completing,
> which, as I mentioned, was fixed in 5.13.
>
> 2. If the number of files to heal is known, why are operations like
> md5sum needed to perform the heal at all? Why doesn't the auto-heal
> agent just go through the list one file at a time, perform the heal,
> and then mark the files as healed? From what I've seen, even requesting
> a manual heal didn't do anything until those files were touched in some
> way (like md5sum).
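>
> (For what it's worth, the workaround I keep falling back on boils down
> to roughly the following sketch - VOLNAME and /mnt/VOLNAME are
> placeholders for the volume name and its FUSE mount, and gfid-only
> entries would still need to be resolved to paths first:)
>
> # List heal-pending paths and stat each one through the FUSE mount so
> # the client-side heal is triggered; md5sum can be used instead of stat
> # if reading the full contents is preferred.
> gluster volume heal VOLNAME info | grep '^/' | sort -u | \
>   while read -r entry; do
>     stat "/mnt/VOLNAME${entry}" > /dev/null
>   done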
>
> 3. cscope? I am unable to find what you're talking about -
> http://cscope.sourceforge.net/ seems to be a code search tool?
>
> 4. What's the best way to analyze and present the data about what's
> raising the load to 100+ on all nodes after reboots? If there's some
> monitoring tool I could run, reproduce the issue, then save the output
> and send it here, that'd be great.
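>
> (In case a specific tool is preferred, I was thinking of capturing
> something along these lines while reproducing the spike - batch-mode
> top and iostat for the system side, plus the volume profiler for the
> gluster side; VOLNAME is a placeholder:)
>
> gluster volume profile VOLNAME start
> top -b -d 5 -n 12 > top-during-spike.txt       # ~1 minute of CPU samples
> iostat -x 5 12 > iostat-during-spike.txt       # per-device IO latency
> gluster volume profile VOLNAME info > profile-during-spike.txt
> gluster volume profile VOLNAME stop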
>
> 5. Based on what I've seen when I straced apache processes, they would
> all hang for a long time whenever they ran across some of the gluster
> data. Specifically, one of the directories (with only several files,
> nothing special), which may be queried in some way by a lot of page
> loads (for context, we have 2 busy WordPress sites), would come up a
> lot in the straces and hang. I even tried moving this directory out of
> gluster and adding a symlink, but I'm not sure that helped. I wonder
> whether some combination of conditions causes frequent read access to a
> certain resource in gluster to spike out of control and go haywire.
> Here, the gluster root directory is SITE/public/wp-content/uploads, and
> the dirs I saw the most are symlinked as follows:
> lrwxrwxrwx 1 archon810 users 73 Apr 26 15:47 wp-security-audit-log -> SITE/public/wp-content/wp-security-audit-log/
> (belongs to https://wordpress.org/plugins/wp-security-audit-log/, but the Pro version)
>
> A sample strace log of requests to this dir:
> [2020-05-15 15:10:15 PDT] stat("SITE/public/wp-content/uploads/wp-security-audit-log", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [2020-05-15 15:10:15 PDT] access("SITE/public/wp-content/uploads/wp-security-audit-log", R_OK) = 0
> [2020-05-15 15:10:15 PDT] access("SITE/public/wp-content/uploads/wp-security-audit-log/custom-alerts.php", F_OK) = -1 ENOENT (No such file or directory)
> [2020-05-15 15:10:15 PDT] stat("SITE/public/wp-content/uploads/wp-security-audit-log/custom-sensors/", 0x7ffeff14c4d0) = -1 ENOENT (No such file or directory)
> [2020-05-15 15:10:18 PDT] stat("SITE/public/wp-content/uploads/wp-security-audit-log", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
> [2020-05-15 15:10:18 PDT] access("SITE/public/wp-content/uploads/wp-security-audit-log", R_OK) = 0
> [2020-05-15 15:10:18 PDT] access("SITE/public/wp-content/uploads/wp-security-audit-log/custom-alerts.php", F_OK) = -1 ENOENT (No such file or directory)
> [2020-05-15 15:10:18 PDT] stat("SITE/public/wp-content/uploads/wp-security-audit-log/custom-sensors/", 0x7ffeff14c4d0) = -1 ENOENT (No such file or directory)
> [2020-05-15 15:10:19 PDT] stat("SITE/public/wp-content/uploads/wp-security-audit-log", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [2020-05-15 15:10:19 PDT] access("SITE/public/wp-content/uploads/wp-security-audit-log", R_OK) = 0
> [2020-05-15 15:10:19 PDT] access("SITE/public/wp-content/uploads/wp-security-audit-log/custom-alerts.php", F_OK) = -1 ENOENT (No such file or directory)
> [2020-05-15 15:10:19 PDT] stat("SITE/public/wp-content/uploads/wp-security-audit-log/custom-sensors/", 0x7ffeff14c4d0) = -1 ENOENT (No such file or directory)
> [2020-05-15 15:10:21 PDT] stat("SITE/public/wp-content/uploads/wp-security-audit-log", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
> [2020-05-15 15:10:21 PDT] access("SITE/public/wp-content/uploads/wp-security-audit-log", R_OK) = 0
> [2020-05-15 15:10:21 PDT] access("SITE/public/wp-content/uploads/wp-security-audit-log/custom-alerts.php", F_OK) = -1 ENOENT (No such file or directory)
> [2020-05-15 15:10:21 PDT] stat("SITE/public/wp-content/uploads/wp-security-audit-log/custom-sensors/", 0x7ffeff14c4d0) = -1 ENOENT (No such file or directory)
> [2020-05-15 15:10:23 PDT] stat("SITE/public/wp-content/uploads/wp-security-audit-log", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [2020-05-15 15:10:23 PDT] access("SITE/public/wp-content/uploads/wp-security-audit-log", R_OK) = 0
> [2020-05-15 15:10:23 PDT] access("SITE/public/wp-content/uploads/wp-security-audit-log/custom-alerts.php", F_OK) = -1 ENOENT (No such file or directory)
> [2020-05-15 15:10:23 PDT] stat("SITE/public/wp-content/uploads/wp-security-audit-log/custom-sensors/", 0x7ffeff14c4d0) = -1 ENOENT (No such file or directory)
> [2020-05-15 15:10:25 PDT] stat("SITE/public/wp-content/uploads/wp-security-audit-log", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [2020-05-15 15:10:25 PDT] access("SITE/public/wp-content/uploads/wp-security-audit-log", R_OK) = 0
> [2020-05-15 15:10:25 PDT] access("SITE/public/wp-content/uploads/wp-security-audit-log/custom-alerts.php", F_OK) = -1 ENOENT (No such file or directory)
> [2020-05-15 15:10:25 PDT] stat("SITE/public/wp-content/uploads/wp-security-audit-log/custom-sensors/", 0x7ffeff14c4d0) = -1 ENOENT (No such file or directory)
> [2020-05-15 15:10:27 PDT] stat("SITE/public/wp-content/uploads/wp-security-audit-log", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
> [2020-05-15 15:10:27 PDT] access("SITE/public/wp-content/uploads/wp-security-audit-log", R_OK) = 0
> [2020-05-15 15:10:27 PDT] access("SITE/public/wp-content/uploads/wp-security-audit-log/custom-alerts.php", F_OK) = -1 ENOENT (No such file or directory)
> [2020-05-15 15:10:27 PDT] stat("SITE/public/wp-content/uploads/wp-security-audit-log/custom-sensors/", 0x7ffeff14c4d0) = -1 ENOENT (No such file or directory)
>
> The load spikes and everything hanging are seriously stressing me out,
> because at any point they could cause an outage for us.
>
> Sincerely,
> Artem
>
> --
> Founder, Android Police <http://www.androidpolice.com>, APK Mirror
> <http://www.apkmirror.com/>, Illogical Robot LLC
> beerpla.net | @ArtemR <http://twitter.com/ArtemR>
>
>
> On Thu, May 7, 2020 at 11:32 PM Hari Gowtham <hgowtham@redhat.com> wrote:
>
>> The heal info output is working as intended. Here is what's happening:
>> when a node goes down, writes can no longer reach it. The nodes that
>> stayed up take the changes and keep track of the fact that these files
>> were modified (note: those changes have not been replicated to the node
>> that was down). When the downed node comes back up, it has no record of
>> what happened while it was offline, but the other nodes know that some
>> changes never made it to the rebooted node. So the rebooted node is
>> "blamed" by the other nodes, and that is what heal info shows. The
>> rebooted node has no changes that landed on it alone, so it reports 0
>> files to be healed, while the other nodes, which hold the newer data,
>> list the files that still need healing. This is the expected behaviour,
>> so as far as the rebooted node is concerned, heal info is working fine.
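>>
>> (In case it helps to see the blame directly: it is recorded as AFR
>> changelog extended attributes on the bricks, so something like the
>> following, run against a brick that stayed up, should show non-zero
>> trusted.afr.<volume>-client-N counters for the rebooted brick - the
>> brick path and file below are placeholders:)
>>
>> getfattr -d -m . -e hex /mnt/SNIP_block1/SNIP_data1/path/to/heal-pending-file
>> # Non-zero trusted.afr.SNIP_data1-client-N values mean pending
>> # data/metadata/entry operations are blamed on brick N.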
>>
>> About healing the file itself: by design, performing an operation on a
>> file from a client triggers a client-side heal, which is why these
>> files get corrected after the md5sum (I assume this was done through
>> the client mount, not directly on the backend bricks). So that part is
>> expected. As for the heals not happening on their own for a long time,
>> there may be a real issue there; @Karthik Subrahmanya
>> <ksubrahm@redhat.com> is the better person to help you with that.
>>
>> About the CPU usage going higher: we need to know what exactly is
>> consuming the extra CPU. Glusterd has to do some handshaking and
>> reconnecting after a reboot, and a bit of data is transferred during
>> that as well; the more nodes there are, the more this contributes to
>> the spike. Similarly, if a heal is running, that can increase the
>> usage too. So we need information about what is consuming the CPU to
>> know whether the spike is expected or not. If the spike is expected,
>> you can try using cscope to restrict the CPU usage of that particular
>> process.
>>
>> On Tue, Apr 28, 2020 at 3:02 AM Artem Russakovskii <archon810@gmail.com> wrote:
>>
>>> Good news, after upgrading to 5.13 and running this scenario again,
>>> the self heal actually succeeded without my intervention following a
>>> server reboot.
>>>
>>> The load was still high during this process, but at least the endless
>>> heal issue is resolved.
>>>
>>> I'd still love to hear from the team on managing heal load spikes.
>>>
>>> Sincerely,
>>> Artem
>>>
>>> --
>>> Founder, Android Police <http://www.androidpolice.com>, APK Mirror
>>> <http://www.apkmirror.com/>, Illogical Robot LLC
>>> beerpla.net | @ArtemR <http://twitter.com/ArtemR>
>>>
>>>
>>> On Sun, Apr 26, 2020 at 3:13 PM Artem Russakovskii <archon810@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I've been observing this problem for a long time now, and it's time
>>>> to finally figure out what's going on.
>>>>
>>>> We're running gluster 5.11 and have a 10TB 1 x 4 = 4 replicate
>>>> volume. I'll include its slightly redacted config below.
>>>>
>>>> When I reboot one of the servers and it goes offline for a bit, heal
>>>> info tells me after it comes back that there are some files and dirs
>>>> that are "heal pending". "Split-brain" and "possibly healing" are 0 -
>>>> only "heal pending" is >0.
>>>>
>>>> 1. For some reason, the server that was rebooted shows "heal
>>>> pending" 0. All other servers show "heal pending" with some number,
>>>> say 65.
>>>> 2. We have cluster.self-heal-daemon enabled.
>>>> 3. The logs are full of "performing entry selfheal" and "completed
>>>> entry selfheal" messages that continue to print endlessly.
>>>> 4. This "heal pending" number never goes down by itself, but it does
>>>> if I run some operation on the files, like md5sum.
>>>> 5. When the server goes down for reboot, and especially when it
>>>> comes back, the load on ALL servers shoots up through the roof (load
>>>> of 100+) and ends up bringing everything down, including apache and
>>>> nginx. My theory is that self-heal kicks in so hard that it kills IO
>>>> on these attached Linode block devices. However, after some time -
>>>> say 10 minutes - the load subsides, but the "heal pending" entries
>>>> remain and the gluster logs continue to output "performing entry
>>>> selfheal" and "completed entry selfheal" messages. This load spike
>>>> has become a huge issue for us because it brings down the whole site
>>>> for entire minutes.
>>>> 6. At this point in my investigation, I noticed that the selfheal
>>>> messages actually repeat for the same gfids over and over:
>>>> [2020-04-26 21:32:29.877987] I [MSGID: 108026] [afr-self-heal-entry.c:897:afr_selfheal_entry_do] 0-SNIP_data1-replicate-0: performing entry selfheal on 96d282cf-402f-455c-9add-5f03c088a1bc
>>>> [2020-04-26 21:32:29.901246] I [MSGID: 108026] [afr-self-heal-common.c:1729:afr_log_selfheal] 0-SNIP_data1-replicate-0: Completed entry selfheal on 96d282cf-402f-455c-9add-5f03c088a1bc. sources= sinks=0 1 2
>>>> [2020-04-26 21:32:32.171959] I [MSGID: 108026] [afr-self-heal-entry.c:897:afr_selfheal_entry_do] 0-SNIP_data1-replicate-0: performing entry selfheal on 96d282cf-402f-455c-9add-5f03c088a1bc
>>>> [2020-04-26 21:32:32.225828] I [MSGID: 108026] [afr-self-heal-common.c:1729:afr_log_selfheal] 0-SNIP_data1-replicate-0: Completed entry selfheal on 96d282cf-402f-455c-9add-5f03c088a1bc. sources= sinks=0 1 2
>>>> [2020-04-26 21:32:33.346990] I [MSGID: 108026] [afr-self-heal-entry.c:897:afr_selfheal_entry_do] 0-SNIP_data1-replicate-0: performing entry selfheal on 96d282cf-402f-455c-9add-5f03c088a1bc
>>>> [2020-04-26 21:32:33.374413] I [MSGID: 108026] [afr-self-heal-common.c:1729:afr_log_selfheal] 0-SNIP_data1-replicate-0: Completed entry selfheal on 96d282cf-402f-455c-9add-5f03c088a1bc. sources= sinks=0 1 2
>>>> 7. I used gfid-resolver.sh from https://gist.github.com/4392640.git
>>>> to resolve this gfid to the real location (a rough sketch of how that
>>>> resolution works is included after the log below), and yup - it was
>>>> one of the entries (a dir, actually) listed as "heal pending" in heal
>>>> info. As soon as I ran md5sum on the file inside (which was also
>>>> listed in "heal pending"), the log messages stopped repeating for
>>>> this entry and it disappeared from the "heal pending" heal info.
>>>> These were the final log lines:
>>>> [2020-04-26 21:32:35.642662] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-SNIP_data1-replicate-0: performing metadata selfheal on 96d282cf-402f-455c-9add-5f03c088a1bc
>>>> [2020-04-26 21:32:35.658714] I [MSGID: 108026] [afr-self-heal-common.c:1729:afr_log_selfheal] 0-SNIP_data1-replicate-0: Completed metadata selfheal on 96d282cf-402f-455c-9add-5f03c088a1bc. sources=0 [1] 2 sinks=3
>>>> [2020-04-26 21:32:35.686509] I [MSGID: 108026] [afr-self-heal-entry.c:897:afr_selfheal_entry_do] 0-SNIP_data1-replicate-0: performing entry selfheal on 96d282cf-402f-455c-9add-5f03c088a1bc
>>>> [2020-04-26 21:32:35.720387] I [MSGID: 108026] [afr-self-heal-common.c:1729:afr_log_selfheal] 0-SNIP_data1-replicate-0: Completed entry selfheal on 96d282cf-402f-455c-9add-5f03c088a1bc. sources=0 [1] 2 sinks=3
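>>>>
>>>> (For anyone curious, the resolver script boils down to roughly the
>>>> following; it relies on the .glusterfs layout on a brick, so the
>>>> brick path and gfid below are just the ones from this example, and
>>>> the directory-vs-file distinction is my understanding of how the
>>>> script works rather than anything official:)
>>>>
>>>> # Every gfid has an entry on each brick under
>>>> # .glusterfs/<first 2 hex chars>/<next 2 hex chars>/<full gfid>.
>>>> # For a directory it is a symlink, so listing it reveals the parent
>>>> # and name; for a regular file it is a hardlink, so the real path
>>>> # can be found by matching inodes.
>>>> BRICK=/mnt/SNIP_block1/SNIP_data1
>>>> GFID=96d282cf-402f-455c-9add-5f03c088a1bc
>>>> ls -l "$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID"
>>>> find "$BRICK" -samefile "$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID" \
>>>>   -not -path '*/.glusterfs/*'    # for regular files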
>>>>
>>>> I have to repeat this song and dance every time I reboot servers and
>>>> run md5sum on each "heal pending" file, or else the messages will
>>>> presumably continue indefinitely. In the meantime, the files seem to
>>>> be fine when accessed.
>>>>
>>>> What I don't understand is:
>>>>
>>>> 1. Why doesn't gluster just heal them properly instead of getting
>>>> stuck? Or maybe this was fixed in v6 or v7, which I haven't upgraded
>>>> to because I'm waiting for another unrelated issue to be fixed?
>>>> 2. Why does heal info show 0 "heal pending" files on the server that
>>>> was rebooted, while all other servers show the same number of "heal
>>>> pending" entries >0?
>>>> 3. Why are there these insane load spikes when a server goes down and
>>>> especially when it comes back online? Is it related to the issue
>>>> here? I'm pretty sure it didn't happen in previous versions of
>>>> gluster, when this issue didn't manifest - I could easily bring down
>>>> one of the servers without it creating havoc when it came back
>>>> online.
>>>>
>>>> Here's the volume info:
>>>>
>>>> Volume Name: SNIP_data1
>>>> Type: Replicate
>>>> Volume ID: 11ecee7e-d4f8-497a-9994-ceb144d6841e
>>>> Status: Started
>>>> Snapshot Count: 0
>>>> Number of Bricks: 1 x 4 = 4
>>>> Transport-type: tcp
>>>> Bricks:
>>>> Brick1: SNIP:/mnt/SNIP_block1/SNIP_data1
>>>> Brick2: SNIP:/mnt/SNIP_block1/SNIP_data1
>>>> Brick3: SNIP:/mnt/SNIP_block1/SNIP_data1
>>>> Brick4: SNIP:/mnt/SNIP_block1/SNIP_data1
>>>> Options Reconfigured:
>>>> performance.client-io-threads: on
>>>> nfs.disable: on
>>>> transport.address-family: inet
>>>> cluster.self-heal-daemon: enable
>>>> performance.cache-size: 1GB
>>>> cluster.lookup-optimize: on
>>>> performance.read-ahead: off
>>>> client.event-threads: 4
>>>> server.event-threads: 4
>>>> performance.io-thread-count: 32
>>>> cluster.readdir-optimize: on
>>>> features.cache-invalidation: on
>>>> features.cache-invalidation-timeout: 600
>>>> performance.stat-prefetch: on
>>>> performance.cache-invalidation: on
>>>> performance.md-cache-timeout: 600
>>>> network.inode-lru-limit: 500000
>>>> performance.parallel-readdir: on
>>>> performance.readdir-ahead: on
>>>> performance.rda-cache-limit: 256MB
>>>> network.remote-dio: enable
>>>> network.ping-timeout: 5
>>>> cluster.quorum-type: fixed
>>>> cluster.quorum-count: 1
>>>> cluster.granular-entry-heal: enable
>>>> cluster.data-self-heal-algorithm: full
>>>>
>>>> Appreciate any insight. Thank you.
>>>>
>>>> Sincerely,
>>>> Artem
>>>>
>>>> --
>>>> Founder, Android Police <http://www.androidpolice.com>, APK Mirror
>>>> <http://www.apkmirror.com/>, Illogical Robot LLC
>>>> beerpla.net | @ArtemR <http://twitter.com/ArtemR>
>>>>
>>
>> --
>> Regards,
>> Hari Gowtham.
>>

Hi Hari,

I can confirm that I have observed what Artem has described, on v7.X.
When one of my Gluster nodes was down for some time (so there was more data to heal) and then powered back up, the node was barely responsive over ssh (on a separate network from the gluster one) and the system stayed seriously loaded until the heal was over.
I'm using the defaults for the healing options.
Does the healing process require some checksumming on the blamed entries?
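
(A quick way to check which data-heal algorithm a given volume is using - VOLNAME is a placeholder; as far as I know, "diff" compares block checksums while "full" simply copies the whole file from the good copy:)

gluster volume get VOLNAME cluster.data-self-heal-algorithm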

@Artem, Gluster has 2 kinds of healing:

A) FUSE clients can trigger the healing of a file that is not the same on all bricks. This is why md5sum causes a heal. Usually even a simple 'stat' will trigger it, but I have noticed that with sharding, Gluster might require reading the file at an offset that falls within the affected shard for this type of heal to work.

B) There is a heal daemon that runs every 15 minutes or thereabouts, which crawls over the blamed entries and triggers healing.
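
(To kick that crawl off manually instead of waiting for the timer, and to check what the interval is actually set to - VOLNAME is a placeholder:)

gluster volume heal VOLNAME
gluster volume get VOLNAME cluster.heal-timeout    # crawl interval, in seconds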

Also, as far as I know, each file that is being healed is locked for the duration of the heal. That is the reason oVirt uses sharding: instead of the whole disk image being locked, only a small piece (a shard) is locked until it is healed.

@Artem, what is the average size of the files for your apaches?

Best Regards,
Strahil Nikolov