[Gluster-users] Self-heal Problems with gluster and nfs

Pranith Kumar Karampuri pkarampu at redhat.com
Tue Jul 8 12:34:59 UTC 2014


On 07/08/2014 05:23 PM, Norman Mähler wrote:
>
> On 08.07.2014 13:24, Pranith Kumar Karampuri wrote:
>> On 07/08/2014 04:49 PM, Norman Mähler wrote:
>>
>>
>> On 08.07.2014 13:02, Pranith Kumar Karampuri wrote:
>>>>> On 07/08/2014 04:23 PM, Norman Mähler wrote:
>>>>> Of course:
>>>>>
>>>>> The configuration is:
>>>>>
>>>>> Volume Name: gluster_dateisystem
>>>>> Type: Replicate
>>>>> Volume ID: 2766695c-b8aa-46fd-b84d-4793b7ce847a
>>>>> Status: Started
>>>>> Number of Bricks: 1 x 2 = 2
>>>>> Transport-type: tcp
>>>>> Bricks:
>>>>> Brick1: filecluster1:/mnt/raid
>>>>> Brick2: filecluster2:/mnt/raid
>>>>> Options Reconfigured:
>>>>> nfs.enable-ino32: on
>>>>> performance.cache-size: 512MB
>>>>> diagnostics.brick-log-level: WARNING
>>>>> diagnostics.client-log-level: WARNING
>>>>> nfs.addr-namelookup: off
>>>>> performance.cache-refresh-timeout: 60
>>>>> performance.cache-max-file-size: 100MB
>>>>> performance.write-behind-window-size: 10MB
>>>>> performance.io-thread-count: 18
>>>>> performance.stat-prefetch: off
>>>>>
>>>>>
>>>>> The file count in xattrop is
>>>>>> Do "gluster volume set gluster_dateisystem
>>>>>> cluster.self-heal-daemon off". This should stop all the
>>>>>> entry self-heals and should also bring the CPU usage down.
>>>>>> When you don't have a lot of activity you can enable it
>>>>>> again using "gluster volume set gluster_dateisystem
>>>>>> cluster.self-heal-daemon on". If that doesn't bring the CPU
>>>>>> down, execute "gluster volume set gluster_dateisystem
>>>>>> cluster.entry-self-heal off". Let me know how it goes.
>>>>>> Pranith
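
For reference, the sequence above boils down to something like the
following sketch (run on one of the gluster servers; the volume name
and options are exactly the ones quoted above):

    # throttle healing while the system is under heavy load
    gluster volume set gluster_dateisystem cluster.self-heal-daemon off
    # only if the CPU usage stays high after that
    gluster volume set gluster_dateisystem cluster.entry-self-heal off
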
>> Thanks for your help so far, but stopping the self-heal daemon and
>> the self-heal mechanism itself did not improve the situation.
>>
>> Do you have further suggestions? Is it simply the load on the
>> system? NFS could handle it easily before...
>>> Is it at least a little better or no improvement at all?
> After waiting half an hour more, the system load is falling steadily.
> At the moment it is around 10, which is not good but a lot better
> than before.
> There are no messages in the nfs.log and the glusterfshd.log anymore.
> In the brick log there are still "inode not found - anonymous fd
> creation failed" messages.
They should go away once the heal is complete and the system is back to 
normal. I believe you have directories with lots of files?
When can you start the healing process again (i.e. a window where there 
won't be a lot of activity and you can afford the high CPU usage) so 
that things can get back to normal?

Pranith
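
Once you have such a window, re-enabling healing and watching its
progress could look roughly like this (a sketch; "gluster volume heal
gluster_dateisystem info" lists the entries that are still pending heal):

    # re-enable the self-heal daemon during the quiet window
    gluster volume set gluster_dateisystem cluster.self-heal-daemon on
    # re-enable entry self-heal too, if it was switched off earlier
    gluster volume set gluster_dateisystem cluster.entry-self-heal on
    # check how many entries are still waiting to be healed
    gluster volume heal gluster_dateisystem info
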
>
>
>
> Norman
>
>>> Pranith
>> Norman
>>
>>>>> Brick 1: 2706
>>>>> Brick 2: 2687
>>>>>
>>>>> Norman
>>>>>
>>>>> On 08.07.2014 12:28, Pranith Kumar Karampuri wrote:
>>>>>>>> It seems like entry self-heal is happening. What is
>>>>>>>> the volume configuration? Could you give the output of
>>>>>>>> "ls <brick-path>/.glusterfs/indices/xattrop | wc -l"
>>>>>>>> for all the bricks?
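
A sketch of that check, assuming the brick path /mnt/raid from the
volume info above and ssh access to both servers:

    ssh filecluster1 'ls /mnt/raid/.glusterfs/indices/xattrop | wc -l'
    ssh filecluster2 'ls /mnt/raid/.glusterfs/indices/xattrop | wc -l'

Each number is roughly the count of entries still queued for
self-heal on that brick.
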
>>>>>>>>
>>>>>>>> Pranith
>>>>>>>> On 07/08/2014 03:36 PM, Norman Mähler wrote:
>>>>>>>>> Hello Pranith,
>>>>>>>>>
>>>>>>>>> here are the logs. I am only giving you the last 3000
>>>>>>>>> lines, because the nfs.log from today is already 550
>>>>>>>>> MB.
>>>>>>>>>
>>>>>>>>> These are the standard files from a user home on the
>>>>>>>>> gluster system, everything you normally find in a user
>>>>>>>>> home: config files, Firefox and Thunderbird files,
>>>>>>>>> etc.
>>>>>>>>>
>>>>>>>>> Thanks in advance Norman
>>>>>>>>>
>>>>>>>>> On 08.07.2014 11:46, Pranith Kumar Karampuri wrote:
>>>>>>>>>> On 07/08/2014 02:46 PM, Norman Mähler wrote:
>>>>>>>>>> Hello again,
>>>>>>>>>>
>>>>>>>>>> I could resolve the self-heal problems with the
>>>>>>>>>> missing gfid files on one of the servers by
>>>>>>>>>> deleting the gfid files on the other server.
>>>>>>>>>>
>>>>>>>>>> They had a link count of 1, which means that the
>>>>>>>>>> file the gfid pointed to was already deleted.
>>>>>>>>>>
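
A sketch of that link-count check, assuming the brick path /mnt/raid
from your volume info; gfid files live under
<brick>/.glusterfs/<first two hex chars>/<next two>/<full gfid>, and a
hard-link count of 1 means the file in the normal namespace is gone
(the gfid below is just an example taken from the logs):

    GFID=b0c4f78a-249f-4db7-9d5b-0902c7d8f6cc
    # print hard-link count and path of the gfid file (bash syntax)
    stat -c '%h %n' /mnt/raid/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID
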
>>>>>>>>>>
>>>>>>>>>> We still have these errors
>>>>>>>>>>
>>>>>>>>>> [2014-07-08 09:09:43.564488] W
>>>>>>>>>> [client-rpc-fops.c:2469:client3_3_link_cbk]
>>>>>>>>>> 0-gluster_dateisystem-client-0: remote operation
>>>>>>>>>> failed: File exists
>>>>>>>>>> (00000000-0000-0000-0000-000000000000 ->
>>>>>>>>>> <gfid:b338b09e-2577-45b3-82bd-032f954dd083>/lock)
>>>>>>>>>>
>>>>>>>>>> which appear in the glusterfshd.log and these
>>>>>>>>>>
>>>>>>>>>> [2014-07-08 09:13:31.198462] E
>>>>>>>>>> [client-rpc-fops.c:5179:client3_3_inodelk]
>>>>>>>>>> (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.4.4/xlator/cluster/replicate.so(+0x466b8)
>>>>>>>>>> [0x7f5d29d4e6b8]
>>>>>>>>>> (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.4.4/xlator/cluster/replicate.so(afr_lock_blocking+0x844)
>>>>>>>>>> [0x7f5d29d4e2e4]
>>>>>>>>>> (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.4.4/xlator/protocol/client.so(client_inodelk+0x99)
>>>>>>>>>> [0x7f5d29f8b3c9]))) 0-: Assertion failed: 0
>>>>>>>>>> from the nfs.log.
>>>>>>>>>>> Could you attach mount (nfs.log) and brick logs
>>>>>>>>>>> please. Do you have files with lots of
>>>>>>>>>>> hard-links? Pranith
>>>>>>>>>> I think the error messages belong together but I
>>>>>>>>>> don't have any idea how to solve them.
>>>>>>>>>>
>>>>>>>>>> We still have a very bad performance issue.
>>>>>>>>>> The system load on the servers is above 20 and
>>>>>>>>>> nearly no one is able to work here on a
>>>>>>>>>> client...
>>>>>>>>>>
>>>>>>>>>> Hoping for help, Norman
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 07.07.2014 15:39, Pranith Kumar Karampuri wrote:
>>>>>>>>>>>>> On 07/07/2014 06:58 PM, Norman Mähler wrote:
>>>>>>>>>>>>> Dear community,
>>>>>>>>>>>>>
>>>>>>>>>>>>> We have some serious problems with our
>>>>>>>>>>>>> Gluster installation.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here is the setting:
>>>>>>>>>>>>>
>>>>>>>>>>>>> We have 2 bricks (version 3.4.4) on
>>>>>>>>>>>>> Debian 7.5, one of them with an NFS export.
>>>>>>>>>>>>> There are about 120 clients connecting to the
>>>>>>>>>>>>> exported NFS. These clients are thin clients
>>>>>>>>>>>>> reading and writing their Linux home
>>>>>>>>>>>>> directories from the exported NFS.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We want to switch these clients, one by one,
>>>>>>>>>>>>> to access via the gluster client.
>>>>>>>>>>>>>> I did not understand what you meant by
>>>>>>>>>>>>>> this. Are you moving to glusterfs-fuse
>>>>>>>>>>>>>> based mounts?
>>>>>>>>>>>>> Here are our problems:
>>>>>>>>>>>>>
>>>>>>>>>>>>> At the moment we have two types of error
>>>>>>>>>>>>> messages which come in bursts to our
>>>>>>>>>>>>> glusterfshd.log
>>>>>>>>>>>>>
>>>>>>>>>>>>> [2014-07-07 13:10:21.572487] W
>>>>>>>>>>>>> [client-rpc-fops.c:1538:client3_3_inodelk_cbk]
>>>>>>>>>>>>> 0-gluster_dateisystem-client-1: remote operation
>>>>>>>>>>>>> failed: No such file or directory
>>>>>>>>>>>>> [2014-07-07 13:10:21.573448] W
>>>>>>>>>>>>> [client-rpc-fops.c:471:client3_3_open_cbk]
>>>>>>>>>>>>> 0-gluster_dateisystem-client-1: remote
>>>>>>>>>>>>> operation failed: No such file or directory.
>>>>>>>>>>>>> Path:
>>>>>>>>>>>>> <gfid:b0c4f78a-249f-4db7-9d5b-0902c7d8f6cc>
>>>>>>>>>>>>> (00000000-0000-0000-0000-000000000000)
>>>>>>>>>>>>> [2014-07-07 13:10:21.573468] E
>>>>>>>>>>>>> [afr-self-heal-data.c:1270:afr_sh_data_open_cbk]
>>>>>>>>>>>>> 0-gluster_dateisystem-replicate-0: open of
>>>>>>>>>>>>> <gfid:b0c4f78a-249f-4db7-9d5b-0902c7d8f6cc>
>>>>>>>>>>>>> failed on child gluster_dateisystem-client-1
>>>>>>>>>>>>> (No such file or directory)
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> This looks like a missing gfid file on one of
>>>>>>>>>>>>> the bricks. I looked it up and yes the file
>>>>>>>>>>>>> is missing on the second brick.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We got these messages the other way round,
>>>>>>>>>>>>> too (missing on client-0 and the first
>>>>>>>>>>>>> brick).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is it possible to repair this one by copying
>>>>>>>>>>>>> the gfid file to the brick where it was
>>>>>>>>>>>>> missing? Or is there another way to repair
>>>>>>>>>>>>> it?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> The second message is
>>>>>>>>>>>>>
>>>>>>>>>>>>> [2014-07-07 13:06:35.948738] W
>>>>>>>>>>>>> [client-rpc-fops.c:2469:client3_3_link_cbk]
>>>>>>>>>>>>> 0-gluster_dateisystem-client-1: remote
>>>>>>>>>>>>> operation failed: File exists
>>>>>>>>>>>>> (00000000-0000-0000-0000-000000000000 ->
>>>>>>>>>>>>> <gfid:aae47250-8f69-480c-ac75-2da2f4d21d7a>/lock)
>>>>>>>>>>>>>
>>>>>>>>>>>>> and I really do not know what to do with this
>>>>>>>>>>>>> one...
>>>>>>>>>>>>>> Did any of the bricks go offline and come
>>>>>>>>>>>>>> back online? Pranith
>>>>>>>>>>>>> I am really looking forward to your help
>>>>>>>>>>>>> because this is an active system and the
>>>>>>>>>>>>> system load on the nfs brick is about 25
>>>>>>>>>>>>> (!!)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks in advance! Norman Maehler
>>>>>>>>>>>>>
>>>>>>>>>>>>>
> -- 
> Kind regards,
>
> Norman Mähler
>
> Head of IT-Hochschulservice
> uni-assist e. V.
> Geneststr. 5
> Aufgang H, 3. Etage
> 10829 Berlin
>
> Tel.: 030-66644382
> n.maehler at uni-assist.de



