[Gluster-users] Self-heal Problems with gluster and nfs

Norman Mähler n.maehler at uni-assist.de
Tue Jul 8 12:44:57 UTC 2014





On 08.07.2014 14:34, Pranith Kumar Karampuri wrote:
> 
> On 07/08/2014 05:23 PM, Norman Mähler wrote:
> 
> 
> On 08.07.2014 13:24, Pranith Kumar Karampuri wrote:
>>>> On 07/08/2014 04:49 PM, Norman Mähler wrote:
>>>> 
>>>> 
>>>> On 08.07.2014 13:02, Pranith Kumar Karampuri wrote:
>>>>>>> On 07/08/2014 04:23 PM, Norman Mähler wrote:
>>>>>>> 
>>>>>>> Of course:
>>>>>>> 
>>>>>>> The configuration is:
>>>>>>> 
>>>>>>> Volume Name: gluster_dateisystem
>>>>>>> Type: Replicate
>>>>>>> Volume ID: 2766695c-b8aa-46fd-b84d-4793b7ce847a
>>>>>>> Status: Started
>>>>>>> Number of Bricks: 1 x 2 = 2
>>>>>>> Transport-type: tcp
>>>>>>> Bricks:
>>>>>>> Brick1: filecluster1:/mnt/raid
>>>>>>> Brick2: filecluster2:/mnt/raid
>>>>>>> Options Reconfigured:
>>>>>>> nfs.enable-ino32: on
>>>>>>> performance.cache-size: 512MB
>>>>>>> diagnostics.brick-log-level: WARNING
>>>>>>> diagnostics.client-log-level: WARNING
>>>>>>> nfs.addr-namelookup: off
>>>>>>> performance.cache-refresh-timeout: 60
>>>>>>> performance.cache-max-file-size: 100MB
>>>>>>> performance.write-behind-window-size: 10MB
>>>>>>> performance.io-thread-count: 18
>>>>>>> performance.stat-prefetch: off
>>>>>>> 
>>>>>>> The file count in xattrop is
>>>>>>>> Do "gluster volume set gluster_dateisystem
>>>>>>>> cluster.self-heal-daemon off". This should stop all
>>>>>>>> the entry self-heals and should also bring the CPU
>>>>>>>> usage down. When you don't have a lot of activity you
>>>>>>>> can enable it again using "gluster volume set
>>>>>>>> gluster_dateisystem cluster.self-heal-daemon on". If
>>>>>>>> it doesn't bring the CPU down, execute "gluster volume
>>>>>>>> set gluster_dateisystem cluster.entry-self-heal off".
>>>>>>>> Let me know how it goes.
>>>>>>>> 
>>>>>>>> Pranith
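For reference, the suggested commands as plain shell lines (volume name and option names exactly as quoted above; to be run on one of the gluster servers):

    # temporarily stop the self-heal daemon
    gluster volume set gluster_dateisystem cluster.self-heal-daemon off

    # re-enable it later, during a quiet window
    gluster volume set gluster_dateisystem cluster.self-heal-daemon on

    # if the CPU usage stays high, additionally disable entry self-heal
    gluster volume set gluster_dateisystem cluster.entry-self-heal off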
>>>> Thanks for your help so far, but stopping the self-heal daemon
>>>> and the self-heal mechanism itself did not improve the
>>>> situation.
>>>> 
>>>> Do you have further suggestions? Is it simply the load on
>>>> the system? NFS could handle it easily before...
>>>>> Is it at least a little better or no improvement at all?
> After waiting another half an hour, the system load is falling
> steadily. At the moment it is around 10, which is not good but a lot
> better than before. There are no messages in the nfs.log and the
> glusterfshd.log anymore. In the brick log there are still "inode
> not found - anonymous fd creation failed" messages.
>> They should go away once the heal is complete and the system is
>> back to normal. I believe you have directories with lots of
>> files? When can you start the healing process again (i.e. window
>> where there won't be a lot of activity and you can afford the
>> high CPU usage) so that things will be back to normal?
> 

We have got a window at night, but by now our admin has decided to copy
the files back to an NFS system, because even with disabled self-heal
our colleagues cannot do their work with such a slow system.

After that we may be able to start again with a new system.
We are considering taking another network cluster system, but we are
not quite sure what to do.

There are a lot of small files and lock files in these directories.

Norman


>> Pranith
> 
> 
> 
> Norman
> 
>>>>> Pranith
>>>> Norman
>>>> 
>>>>>>> Brick 1: 2706
>>>>>>> Brick 2: 2687
>>>>>>> 
>>>>>>> Norman
>>>>>>> 
>>>>>>> On 08.07.2014 12:28, Pranith Kumar Karampuri wrote:
>>>>>>>>>> It seems like entry self-heal is happening. What
>>>>>>>>>> is the volume configuration? Could you give ls 
>>>>>>>>>> <brick-path>/.glusterfs/indices/xattrop | wc -l
>>>>>>>>>> Count for all the bricks
>>>>>>>>>> 
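A sketch of that check, assuming the brick path /mnt/raid from the volume info above (run on filecluster1 and filecluster2):

    # count pending self-heal index entries on a brick
    ls /mnt/raid/.glusterfs/indices/xattrop | wc -l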
>>>>>>>>>> Pranith
>>>>>>>>>> 
>>>>>>>>>> On 07/08/2014 03:36 PM, Norman Mähler wrote:
>>>>>>>>>>> Hello Pranith,
>>>>>>>>>>> 
>>>>>>>>>>> here are the logs. I am only giving you the last
>>>>>>>>>>> 3000 lines, because the nfs.log from today is
>>>>>>>>>>> already 550 MB.
>>>>>>>>>>> 
>>>>>>>>>>> These are the standard files from a user home
>>>>>>>>>>> on the gluster system: everything you normally
>>>>>>>>>>> find in a user home, such as config files,
>>>>>>>>>>> Firefox and Thunderbird files, etc.
>>>>>>>>>>> 
>>>>>>>>>>> Thanks in advance Norman
>>>>>>>>>>> 
>>>>>>>>>>> On 08.07.2014 11:46, Pranith Kumar Karampuri wrote:
>>>>>>>>>>>> On 07/08/2014 02:46 PM, Norman Mähler wrote:
>>>>>>>>>>>> Hello again,
>>>>>>>>>>>> 
>>>>>>>>>>>> I could resolve the self-heal problems with
>>>>>>>>>>>> the missing gfid files on one of the servers
>>>>>>>>>>>> by deleting the gfid files on the other
>>>>>>>>>>>> server.
>>>>>>>>>>>> 
>>>>>>>>>>>> They had a link count of 1, which means that
>>>>>>>>>>>> the file the gfid pointed to had already been
>>>>>>>>>>>> deleted.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
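A minimal sketch of how such orphaned gfid entries can be located; the find invocation is an assumption on my part and not from this thread, and the brick path /mnt/raid is taken from the volume info above (gfid entries for regular files are hard links and normally have a link count of at least 2):

    # gfid files whose only remaining link is the .glusterfs entry itself
    find /mnt/raid/.glusterfs -type f -links 1 -print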
>>>>>>>>>>>> We still have these errors
>>>>>>>>>>>> 
>>>>>>>>>>>> [2014-07-08 09:09:43.564488] W
>>>>>>>>>>>> [client-rpc-fops.c:2469:client3_3_link_cbk]
>>>>>>>>>>>> 0-gluster_dateisystem-client-0: remote operation
>>>>>>>>>>>> failed: File exists
>>>>>>>>>>>> (00000000-0000-0000-0000-000000000000 ->
>>>>>>>>>>>> <gfid:b338b09e-2577-45b3-82bd-032f954dd083>/lock)
>>>>>>>>>>>> 
>>>>>>>>>>>> which appear in the glusterfshd.log, and these
>>>>>>>>>>>> 
>>>>>>>>>>>> [2014-07-08 09:13:31.198462] E
>>>>>>>>>>>> [client-rpc-fops.c:5179:client3_3_inodelk]
>>>>>>>>>>>> (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.4.4/xlator/cluster/replicate.so(+0x466b8)
>>>>>>>>>>>> [0x7f5d29d4e6b8]
>>>>>>>>>>>> (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.4.4/xlator/cluster/replicate.so(afr_lock_blocking+0x844)
>>>>>>>>>>>> [0x7f5d29d4e2e4]
>>>>>>>>>>>> (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.4.4/xlator/protocol/client.so(client_inodelk+0x99)
>>>>>>>>>>>> [0x7f5d29f8b3c9]))) 0-: Assertion failed: 0
>>>>>>>>>>>> 
>>>>>>>>>>>> from the nfs.log.
>>>>>>>>>>>>> Could you attach mount (nfs.log) and brick
>>>>>>>>>>>>> logs please. Do you have files with lots
>>>>>>>>>>>>> of hard-links? Pranith
>>>>>>>>>>>> I think the error messages belong together
>>>>>>>>>>>> but I don't have any idea how to solve them.
>>>>>>>>>>>> 
>>>>>>>>>>>> We still have a very bad performance
>>>>>>>>>>>> issue. The system load on the servers is
>>>>>>>>>>>> above 20 and hardly anyone here is able
>>>>>>>>>>>> to work on a client...
>>>>>>>>>>>> 
>>>>>>>>>>>> Hope for help Norman
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On 07.07.2014 15:39, Pranith Kumar Karampuri wrote:
>>>>>>>>>>>>>>> On 07/07/2014 06:58 PM, Norman Mähler wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Dear community,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> we have got some serious problems with
>>>>>>>>>>>>>>> our Gluster installation.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Here is the setting:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> We have got 2 bricks (version 3.4.4) on
>>>>>>>>>>>>>>> Debian 7.5, one of them with an NFS
>>>>>>>>>>>>>>> export. There are about 120 clients
>>>>>>>>>>>>>>> connecting to the exported NFS. These
>>>>>>>>>>>>>>> clients are thin clients reading and
>>>>>>>>>>>>>>> writing their Linux home directories
>>>>>>>>>>>>>>> from the exported NFS.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> We want to change the access of these
>>>>>>>>>>>>>>> clients one by one to access via
>>>>>>>>>>>>>>> gluster client.
>>>>>>>>>>>>>>>> I did not understand what you meant
>>>>>>>>>>>>>>>> by this. Are you moving to
>>>>>>>>>>>>>>>> glusterfs-fuse based mounts?
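As a rough sketch of the difference being discussed, with an assumed mount point and the volume/server names from this thread (the mount options are illustrative, not taken from the actual setup):

    # current access: gluster's built-in NFS server (NFSv3)
    mount -t nfs -o vers=3,nolock filecluster1:/gluster_dateisystem /mnt/home

    # planned access: native glusterfs (FUSE) client
    mount -t glusterfs filecluster1:/gluster_dateisystem /mnt/home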
>>>>>>>>>>>>>>> Here are our problems:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> At the moment we have two types of
>>>>>>>>>>>>>>> error messages which come in bursts in
>>>>>>>>>>>>>>> our glusterfshd.log
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> [2014-07-07 13:10:21.572487] W
>>>>>>>>>>>>>>> [client-rpc-fops.c:1538:client3_3_inodelk_cbk]
>>>>>>>>>>>>>>> 0-gluster_dateisystem-client-1: remote operation
>>>>>>>>>>>>>>> failed: No such file or directory
>>>>>>>>>>>>>>> [2014-07-07 13:10:21.573448] W
>>>>>>>>>>>>>>> [client-rpc-fops.c:471:client3_3_open_cbk]
>>>>>>>>>>>>>>> 0-gluster_dateisystem-client-1: remote operation
>>>>>>>>>>>>>>> failed: No such file or directory. Path:
>>>>>>>>>>>>>>> <gfid:b0c4f78a-249f-4db7-9d5b-0902c7d8f6cc>
>>>>>>>>>>>>>>> (00000000-0000-0000-0000-000000000000)
>>>>>>>>>>>>>>> [2014-07-07 13:10:21.573468] E
>>>>>>>>>>>>>>> [afr-self-heal-data.c:1270:afr_sh_data_open_cbk]
>>>>>>>>>>>>>>> 0-gluster_dateisystem-replicate-0: open of
>>>>>>>>>>>>>>> <gfid:b0c4f78a-249f-4db7-9d5b-0902c7d8f6cc>
>>>>>>>>>>>>>>> failed on child gluster_dateisystem-client-1
>>>>>>>>>>>>>>> (No such file or directory)
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> This looks like a missing gfid file on
>>>>>>>>>>>>>>> one of the bricks. I looked it up and
>>>>>>>>>>>>>>> yes the file is missing on the second
>>>>>>>>>>>>>>> brick.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> We got these messages the other way
>>>>>>>>>>>>>>> round, too (missing on client-0 and the
>>>>>>>>>>>>>>> first brick).
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Is it possible to repair this one by
>>>>>>>>>>>>>>> copying the gfid file to the brick
>>>>>>>>>>>>>>> where it was missing? Or is there
>>>>>>>>>>>>>>> another way to repair it?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
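One way to check which brick is missing such an entry, as a sketch: gfid files live under <brick>/.glusterfs/<first two hex characters of the gfid>/<next two>/<full gfid>, so for the gfid from the log above (brick path taken from the volume info; run on both servers):

    ls -l /mnt/raid/.glusterfs/b0/c4/b0c4f78a-249f-4db7-9d5b-0902c7d8f6cc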
>>>>>>>>>>>>>>> The second message is
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> [2014-07-07 13:06:35.948738] W
>>>>>>>>>>>>>>> [client-rpc-fops.c:2469:client3_3_link_cbk]
>>>>>>>>>>>>>>> 0-gluster_dateisystem-client-1: remote operation
>>>>>>>>>>>>>>> failed: File exists
>>>>>>>>>>>>>>> (00000000-0000-0000-0000-000000000000 ->
>>>>>>>>>>>>>>> <gfid:aae47250-8f69-480c-ac75-2da2f4d21d7a>/lock)
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> and I really do not know what to do with this
>>>>>>>>>>>>>>> one...
>>>>>>>>>>>>>>>> Did any of the bricks go offline
>>>>>>>>>>>>>>>> and come back online? Pranith
>>>>>>>>>>>>>>> I am really looking forward to your
>>>>>>>>>>>>>>> help because this is an active system
>>>>>>>>>>>>>>> and the system load on the nfs brick is
>>>>>>>>>>>>>>> about 25 (!!)
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks in advance! Norman Maehler
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 

--
Kind regards,

Norman Mähler

Head of IT University Services
uni-assist e. V.
Geneststr. 5
Aufgang H, 3. Etage
10829 Berlin

Tel.: 030-66644382
n.maehler at uni-assist.de


