[Gluster-users] not healing one file

Richard Neuboeck hawk at tbi.univie.ac.at
Thu Oct 26 07:37:39 UTC 2017


Hi Amar,

thanks for the information! I tried this tool on all machines.

# gluster-health-report

Loaded reports: glusterd-op-version, georep, gfid-mismatch-dht-report,
glusterd-peer-disconnect, disk_usage, errors_in_logs, coredump,
glusterd, glusterd_volume_version_cksum_errors, kernel_issues,
errors_in_logs, ifconfig, nic-health, process_status

[     OK] Disk used percentage  path=/  percentage=4
[     OK] Disk used percentage  path=/var  percentage=4
[     OK] Disk used percentage  path=/tmp  percentage=4
[     OK] All peers are in connected state  connected_count=2  total_peer_count=2
[     OK] no gfid mismatch
[  ERROR] Report failure  report=report_check_glusterd_op_version
[ NOT OK] The maximum size of core files created is NOT set to unlimited.
[  ERROR] Report failure  report=report_check_worker_restarts
[  ERROR] Report failure  report=report_non_participating_bricks
[WARNING] Glusterd uptime is less than 24 hours  uptime_sec=72798
[WARNING] Errors in Glusterd log file  num_errors=35
[WARNING] Warnings in Glusterd log file  num_warning=37
[ NOT OK] Recieve errors in "ifconfig bond0" output
[ NOT OK] Errors seen in "cat /proc/net/dev -- bond0" output
High CPU usage by Self-heal
[WARNING] Errors in Glusterd log file num_errors=77
[WARNING] Warnings in Glusterd log file num_warnings=61

Basically it's the same message on all of them with varying error and
warning counts.
Glusterd has not been up for long because I updated and then rebooted the
machines yesterday. That also explains some of the errors and warnings,
including the network errors, since it always takes some time until the
bonded device (4x1Gbit, balance-alb) is fully functional.
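
To compare the reports from the different machines more easily, the status
tags can be tallied per run. The following is only a rough sketch in Python.
It assumes the line format shown in the output above ("[STATUS] message"),
and the function name tally_statuses is made up for illustration; it is not
part of gluster-health-report.

```python
import re
from collections import Counter

# Matches lines such as "[ NOT OK] message" or "[     OK] message  key=value".
# The tag may contain a space ("NOT OK") and is padded with leading blanks.
STATUS_LINE = re.compile(r"^\[\s*([A-Z][A-Z ]*?)\s*\]\s*(.*)$")


def tally_statuses(report_text):
    """Count how many report lines carry each status tag."""
    counts = Counter()
    for line in report_text.splitlines():
        match = STATUS_LINE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


# A few lines copied from the report above, used as sample input.
sample = """\
[     OK] no gfid mismatch
[  ERROR] Report failure  report=report_check_glusterd_op_version
[ NOT OK] The maximum size of core files created is NOT set to unlimited.
[WARNING] Errors in Glusterd log file  num_errors=35
[WARNING] Warnings in Glusterd log file  num_warning=37
"""

print(dict(tally_statuses(sample)))
```

Lines without a status tag (for example "High CPU usage by Self-heal") are
skipped by this sketch rather than guessed at.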

According to the getfattr output Karthik asked me to collect, the GFIDs of
the file in question differ between bricks, even though the report says
there is no mismatch.
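
The GFID comparison itself could be scripted along these lines. This is a
minimal sketch in Python, assuming the text of "getfattr -d -m . -e hex
<file>" has been collected from each brick and that the GFID shows up as a
"trusted.gfid=0x..." line with 32 hex digits. The brick labels and GFID
values below are invented for illustration and are not the real values from
this volume.

```python
import re

# With "-e hex", the 16-byte GFID is printed as 0x followed by 32 hex digits.
GFID_LINE = re.compile(r"^trusted\.gfid=0x([0-9a-fA-F]{32})$", re.MULTILINE)


def extract_gfid(getfattr_output):
    """Return the GFID as a lowercase hex string, or None if it is absent."""
    match = GFID_LINE.search(getfattr_output)
    return match.group(1).lower() if match else None


def group_bricks_by_gfid(outputs):
    """Map each GFID (or None) to the list of bricks that report it."""
    groups = {}
    for brick, text in outputs.items():
        groups.setdefault(extract_gfid(text), []).append(brick)
    return groups


def has_gfid_mismatch(outputs):
    """True if the bricks do not all report the same GFID."""
    return len(group_bricks_by_gfid(outputs)) > 1


# Hypothetical input: two bricks agree, one differs.
outputs = {
    "brick-a": "# file: some/path\ntrusted.gfid=0x" + "1" * 32 + "\n",
    "brick-b": "# file: some/path\ntrusted.gfid=0x" + "1" * 32 + "\n",
    "brick-c": "# file: some/path\ntrusted.gfid=0x" + "2" * 32 + "\n",
}

for gfid, bricks in group_bricks_by_gfid(outputs).items():
    print(gfid, sorted(bricks))
print("mismatch:", has_gfid_mismatch(outputs))
```

Grouping by GFID rather than only returning a yes/no answer shows which
bricks agree with each other, which helps when deciding which copy is the
odd one out.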

So is this a split-brain situation gluster is not aware of?

Cheers
Richard

On 26.10.17 06:51, Amar Tumballi wrote:
> On a side note, try the recently released health report tool and see if it
> diagnoses any issues in the setup. Currently you may have to run it on
> all three machines.
> 
> 
> 
> On 26-Oct-2017 6:50 AM, "Amar Tumballi" <atumball at redhat.com> wrote:
> 
>     Thanks for this report. This week many of the developers are at
>     Gluster Summit in Prague; we will check this and respond next
>     week. Hope that's fine.
> 
>     Thanks,
>     Amar
> 
> 
>     On 25-Oct-2017 3:07 PM, "Richard Neuboeck" <hawk at tbi.univie.ac.at> wrote:
> 
>         Hi Gluster Gurus,
> 
>         I'm using a gluster volume as home for our users. The volume is
>         replica 3, running on CentOS 7, gluster version 3.10
>         (3.10.6-1.el7.x86_64). Clients are running Fedora 26 and also
>         gluster 3.10 (3.10.6-3.fc26.x86_64).
> 
>         During the data backup I got an I/O error on one file. Manually
>         checking for this file on a client confirms this:
> 
>         ls -l
>         romanoch/.mozilla/firefox/vzzqqxrm.default-1396429081309/sessionstore-backups/
>         ls: cannot access
>         'romanoch/.mozilla/firefox/vzzqqxrm.default-1396429081309/sessionstore-backups/recovery.baklz4':
>         Input/output error
>         total 2015
>         -rw-------. 1 romanoch tbi 998211 Sep 15 18:44 previous.js
>         -rw-------. 1 romanoch tbi  65222 Oct 17 17:57 previous.jsonlz4
>         -rw-------. 1 romanoch tbi 149161 Oct  1 13:46 recovery.bak
>         -?????????? ? ?        ?        ?            ? recovery.baklz4
> 
>         Out of curiosity I checked all the bricks for this file. It is
>         present on each of them. Computing a checksum shows that the file
>         differs on one of the three replica servers.
> 
>         Querying healing information shows that the file should be healed:
>         # gluster volume heal home info
>         Brick sphere-six:/srv/gluster_home/brick
>         /romanoch/.mozilla/firefox/vzzqqxrm.default-1396429081309/sessionstore-backups/recovery.baklz4
> 
>         Status: Connected
>         Number of entries: 1
> 
>         Brick sphere-five:/srv/gluster_home/brick
>         /romanoch/.mozilla/firefox/vzzqqxrm.default-1396429081309/sessionstore-backups/recovery.baklz4
> 
>         Status: Connected
>         Number of entries: 1
> 
>         Brick sphere-four:/srv/gluster_home/brick
>         Status: Connected
>         Number of entries: 0
> 
>         Manually triggering heal doesn't report an error but also does not
>         heal the file.
>         # gluster volume heal home
>         Launching heal operation to perform index self heal on volume home
>         has been successful
> 
>         Same with a full heal
>         # gluster volume heal home full
>         Launching heal operation to perform full self heal on volume home
>         has been successful
> 
>         According to the split brain query that's not the problem:
>         # gluster volume heal home info split-brain
>         Brick sphere-six:/srv/gluster_home/brick
>         Status: Connected
>         Number of entries in split-brain: 0
> 
>         Brick sphere-five:/srv/gluster_home/brick
>         Status: Connected
>         Number of entries in split-brain: 0
> 
>         Brick sphere-four:/srv/gluster_home/brick
>         Status: Connected
>         Number of entries in split-brain: 0
> 
> 
>         I have no idea why this situation arose in the first place and also
>         no idea how to solve this problem. I would highly appreciate any
>         helpful feedback I can get.
> 
>         The only mention in the logs matching this file is a rename
>         operation:
>         /var/log/glusterfs/bricks/srv-gluster_home-brick.log:[2017-10-23
>         09:19:11.561661] I [MSGID: 115061]
>         [server-rpc-fops.c:1022:server_rename_cbk] 0-home-server: 5266153:
>         RENAME
>         /romanoch/.mozilla/firefox/vzzqqxrm.default-1396429081309/sessionstore-backups/recovery.jsonlz4
>         (48e9eea6-cda6-4e53-bb4a-72059debf4c2/recovery.jsonlz4) ->
>         /romanoch/.mozilla/firefox/vzzqqxrm.default-1396429081309/sessionstore-backups/recovery.baklz4
>         (48e9eea6-cda6-4e53-bb4a-72059debf4c2/recovery.baklz4), client:
>         romulus.tbi.univie.ac.at-11894-2017/10/18-07:06:07:206366-home-client-3-0-0,
>         error-xlator: home-posix [No data available]
> 
>         I enabled directory quotas the same day this problem showed up but
>         I'm not sure how quotas could have an effect like this (maybe unless
>         the limit is reached but that's also not the case).
> 
>         Thanks again if anyone has an idea.
>         Cheers
>         Richard
>         --
>         /dev/null
> 
> 

