[Gluster-users] Gluster errors create zombie processes [LOGS ATTACHED]

Sun Mar 8 13:36:31 UTC 2015

I don't have the volfiles; they are not on our machines. As I said previously,
we have no control over the gluster servers.

I saw a graph in the logs that looks similar to a volume file. I will paste
it here, but we have no control over it either. We only use the
client to connect to gluster servers that we do not control.

volume drslk-prod-client-0
    type protocol/client
    option ping-timeout 20
    option remote-host brick13.gluster.iadm
    option remote-subvolume /GLUSTERFS/drslk-prod
    option transport-type socket
    option frame-timeout 60
    option send-gids true
end-volume

volume drslk-prod-client-1
    type protocol/client
    option ping-timeout 20
    option remote-host brick14.gluster.iadm
    option remote-subvolume /GLUSTERFS/drslk-prod
    option transport-type socket
    option frame-timeout 60
    option send-gids true
end-volume

volume drslk-prod-client-2
    type protocol/client
    option ping-timeout 20
    option remote-host brick15.gluster.iadm
    option remote-subvolume /GLUSTERFS/drslk-prod
    option transport-type socket
    option frame-timeout 60
    option send-gids true
end-volume

volume drslk-prod-replicate-0
    type cluster/replicate
    option read-hash-mode 2
    option data-self-heal-window-size 128
    option quorum-type auto
    subvolumes drslk-prod-client-0 drslk-prod-client-1 drslk-prod-client-2
end-volume

volume drslk-prod-client-3
    type protocol/client
    option ping-timeout 20
    option remote-host brick16.gluster.iadm
    option remote-subvolume /GLUSTERFS/drslk-prod
    option transport-type socket
    option frame-timeout 60
    option send-gids true
end-volume

volume drslk-prod-client-4
    type protocol/client
    option ping-timeout 20
    option remote-host brick17.gluster.iadm
    option remote-subvolume /GLUSTERFS/drslk-prod
    option transport-type socket
    option frame-timeout 60
    option send-gids true
end-volume

volume drslk-prod-client-5
    type protocol/client
    option ping-timeout 20
    option remote-host brick18.gluster.iadm
    option remote-subvolume /GLUSTERFS/drslk-prod
    option transport-type socket
    option frame-timeout 60
    option send-gids true
end-volume

volume drslk-prod-replicate-1
    type cluster/replicate
    option read-hash-mode 2
    option data-self-heal-window-size 128
    option quorum-type auto
    subvolumes drslk-prod-client-3 drslk-prod-client-4 drslk-prod-client-5
end-volume

volume drslk-prod-client-6
    type protocol/client
    option ping-timeout 20
    option remote-host brick19.gluster.iadm
    option remote-subvolume /GLUSTERFS/drslk-prod
    option transport-type socket
    option frame-timeout 60
    option send-gids true
end-volume

volume drslk-prod-client-7
    type protocol/client
    option ping-timeout 20
    option remote-host brick20.gluster.iadm
    option remote-subvolume /GLUSTERFS/drslk-prod
    option transport-type socket
    option frame-timeout 60
    option send-gids true
end-volume

volume drslk-prod-client-8
    type protocol/client
    option ping-timeout 20
    option remote-host brick21.gluster.iadm
    option remote-subvolume /GLUSTERFS/drslk-prod
    option transport-type socket
    option frame-timeout 60
    option send-gids true
end-volume

volume drslk-prod-replicate-2
    type cluster/replicate
    option read-hash-mode 2
    option data-self-heal-window-size 128
    option quorum-type auto
    subvolumes drslk-prod-client-6 drslk-prod-client-7 drslk-prod-client-8
end-volume

volume drslk-prod-client-9
    type protocol/client
    option ping-timeout 20
    option remote-host brick22.gluster.iadm
    option remote-subvolume /GLUSTERFS/drslk-prod
    option transport-type socket
    option frame-timeout 60
    option send-gids true
end-volume

volume drslk-prod-client-10
    type protocol/client
    option ping-timeout 20
    option remote-host brick23.gluster.iadm
    option remote-subvolume /GLUSTERFS/drslk-prod
    option transport-type socket
    option frame-timeout 60
    option send-gids true
end-volume

volume drslk-prod-client-11
    type protocol/client
    option ping-timeout 20
    option remote-host brick24.gluster.iadm
    option remote-subvolume /GLUSTERFS/drslk-prod
    option transport-type socket
    option frame-timeout 60
    option send-gids true
end-volume

volume drslk-prod-replicate-3
    type cluster/replicate
    option read-hash-mode 2
    option data-self-heal-window-size 128
    option quorum-type auto
    subvolumes drslk-prod-client-9 drslk-prod-client-10 drslk-prod-client-11
end-volume

volume drslk-prod-dht
    type cluster/distribute
    option min-free-disk 10%
    option readdir-optimize on
    subvolumes drslk-prod-replicate-0 drslk-prod-replicate-1 drslk-prod-replicate-2 drslk-prod-replicate-3
end-volume

volume drslk-prod-write-behind
    type performance/write-behind
    option cache-size 1MB
    subvolumes drslk-prod-dht
end-volume

volume drslk-prod-read-ahead
    type performance/read-ahead
    subvolumes drslk-prod-write-behind
end-volume

volume drslk-prod-readdir-ahead
    type performance/readdir-ahead
    subvolumes drslk-prod-read-ahead
end-volume

volume drslk-prod-io-cache
    type performance/io-cache
    option cache-timeout 60
    option cache-size 512MB
    subvolumes drslk-prod-readdir-ahead
end-volume

volume drslk-prod-quick-read
    type performance/quick-read
    option cache-size 512MB
    subvolumes drslk-prod-io-cache
end-volume

volume drslk-prod-md-cache
    type performance/md-cache
    subvolumes drslk-prod-quick-read
end-volume

volume drslk-prod
    type debug/io-stats
    option latency-measurement off
    option count-fop-hits off
    subvolumes drslk-prod-md-cache
end-volume

volume meta-autoload
    type meta
    subvolumes drslk-prod
end-volume
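
In case it helps anyone reading a graph like this, here is a minimal Python
sketch that parses the volume graph and lists the protocol/client options that
differ from the usual defaults. The names parse_graph,
non_default_client_options and SAMPLE are made up for this illustration. The
default values (frame-timeout 1800, ping-timeout 42) are what I recall from the
upstream documentation for network.frame-timeout and network.ping-timeout, so
please verify them against the installed version. The sketch assumes each
statement sits on a single line.

```python
import re

# Two volumes copied from the graph in this mail; a real run would read
# the full graph from the client log instead.
SAMPLE = """\
volume drslk-prod-client-10
    type protocol/client
    option ping-timeout 20
    option remote-host brick23.gluster.iadm
    option remote-subvolume /GLUSTERFS/drslk-prod
    option transport-type socket
    option frame-timeout 60
    option send-gids true
end-volume

volume drslk-prod-replicate-3
    type cluster/replicate
    option read-hash-mode 2
    option data-self-heal-window-size 128
    option quorum-type auto
    subvolumes drslk-prod-client-9 drslk-prod-client-10 drslk-prod-client-11
end-volume
"""

# Optional decoration in front of a statement: "*" and a "123:" line number.
PREFIX = re.compile(r"^\*?\s*\d+:\s*")

# Assumed defaults (from memory, not taken from this thread).
DEFAULTS = {"frame-timeout": "1800", "ping-timeout": "42"}


def parse_graph(text):
    """Return {volume: {"type": str, "options": dict, "subvolumes": list}}."""
    volumes, current = {}, None
    for raw in text.splitlines():
        line = PREFIX.sub("", raw).strip().strip("*").strip()
        if not line:
            continue
        words = line.split()
        if words[0] == "volume" and len(words) == 2:
            current = {"type": None, "options": {}, "subvolumes": []}
            volumes[words[1]] = current
        elif current is None:
            continue  # text outside any volume block is ignored
        elif words[0] == "type" and len(words) == 2:
            current["type"] = words[1]
        elif words[0] == "option" and len(words) >= 3:
            current["options"][words[1]] = " ".join(words[2:])
        elif words[0] == "subvolumes":
            current["subvolumes"].extend(words[1:])
        elif words[0] == "end-volume":
            current = None
    return volumes


def non_default_client_options(volumes):
    """Map each protocol/client volume to its options that differ from DEFAULTS."""
    found = {}
    for name, vol in volumes.items():
        if vol["type"] != "protocol/client":
            continue
        diff = {k: v for k, v in vol["options"].items()
                if k in DEFAULTS and v != DEFAULTS[k]}
        if diff:
            found[name] = diff
    return found


graph = parse_graph(SAMPLE)
print(non_default_client_options(graph))
print(graph["drslk-prod-replicate-3"]["subvolumes"])
```

Run against the full graph, this should report the same ping-timeout 20 and
frame-timeout 60 on every client volume, since all twelve carry identical
options.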

Btw, do you think that different versions of gluster client and gluster
server could be an issue here?

2015-03-08 1:29 GMT+01:00 Vijay Bellur <vbellur at redhat.com>:

> On 03/07/2015 06:20 PM, Przemysław Mroczek wrote:
>
>> Hi guys,
>>
>> We have a Rails app that uses gluster as its distributed file
>> system. The gluster servers are hosted independently as part of a deal
>> with another company; we have no control over them and connect to them
>> using the gluster native client.
>>
>> We tried to resolve this issue with help from the admins of the company
>> that hosts our gluster servers, but they say it is a client
>> issue, and we have run out of ideas about how that is possible, since we
>> are not doing anything special here.
>>
>> Information about the independent gluster servers:
>> - Version: 3.6.0.42.1
>> - They are using Red Hat
>> - They are an enterprise, so they always use older versions
>>
>> Our servers:
>> System version: Ubuntu 14.04
>> Our gluster client version: 3.6.2
>>
>> The exact problem is that often (a couple of times a week)
>> errors in gluster cause processes to become zombies. It happens with our
>> application server (unicorn), nginx, and our crawling script, which runs
>> as a daemon.
>>
>> Our fstab file:
>>
>> 10.10.11.17:/drslk-prod     /mnt/storage          glusterfs defaults,_netdev,nobootwait,fetch-attempts=10 0 0
>> 10.10.11.17:/drslk-backup     /mnt/backup          glusterfs defaults,_netdev,nobootwait,fetch-attempts=10 0 0
>>
>> Logs from gluster:
>>
>> [2015-02-18 12:36:12.375695] E [rpc-clnt.c:362:saved_frames_unwind] (-->
>> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x186)[0x7fb41ddeada6]
>> (-->
>> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fb41dbc1c7e]
>> (-->
>> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fb41dbc1d8e]
>> (-->
>> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x82)[0x7fb41dbc3602]
>> (-->
>> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fb41dbc3d98]
>> ))))) 0-drslk-prod-client-10: forced
>> unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2015-02-18
>> 12:36:12.361489 (xid=0x5d475da)
>> [2015-02-18 12:36:12.375765] W
>> [client-rpc-fops.c:2766:client3_3_lookup_cbk] 0-drslk-prod-client-10:
>> remote operation failed: Transport endpoint is not connected. Path:
>> /system/posts/00/00/71/77/59.jpg (2ad81c2b-a141-478d-9dd4-253345edbceb)
>> [2015-02-18 12:36:12.376288] E [rpc-clnt.c:362:saved_frames_unwind] (-->
>> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x186)[0x7fb41ddeada6]
>> (-->
>> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fb41dbc1c7e]
>> (-->
>> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fb41dbc1d8e]
>> (-->
>> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x82)[0x7fb41dbc3602]
>> (-->
>> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fb41dbc3d98]
>> ))))) 0-drslk-prod-client-10: forced
>> unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2015-02-18
>> 12:36:12.361858 (xid=0x5d475db)
>> [2015-02-18 12:36:12.376355] W
>> [client-rpc-fops.c:2766:client3_3_lookup_cbk] 0-drslk-prod-client-10:
>> remote operation failed: Transport endpoint is not connected. Path:
>> /system/posts/00/00/08 (f5c33a99-719e-4ea2-ad1f-33b893af103d)
>> [2015-02-18 12:36:12.376711] I [socket.c:3292:socket_submit_request]
>> 0-drslk-prod-client-10: not connected (priv->connected = 0)
>> [2015-02-18 12:36:12.376749] W [rpc-clnt.c:1562:rpc_clnt_submit]
>> 0-drslk-prod-client-10: failed to submit rpc-request (XID: 0x5d475dc
>> Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport
>> (drslk-prod-client-10)
>> [2015-02-18 12:36:12.376814] W
>> [client-rpc-fops.c:2766:client3_3_lookup_cbk] 0-drslk-prod-client-10:
>> remote operation failed: Transport endpoint is not connected. Path:
>> (null) (00000000-0000-0000-0000-000000000000)
>> [2015-02-18 12:36:12.376829] I [client.c:2215:client_rpc_notify]
>> 0-drslk-prod-client-10: disconnected from drslk-prod-client-10. Client
>> process will keep trying to connect to glusterd until brick's port is
>> available
>> [2015-02-18 12:36:12.376834] W [rpc-clnt.c:1562:rpc_clnt_submit]
>> 0-drslk-prod-client-10: failed to submit rpc-request (XID: 0x5d475dd
>> Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport
>> (drslk-prod-client-10)
>> [2015-02-18 12:36:12.376906] W
>> [client-rpc-fops.c:2766:client3_3_lookup_cbk] 0-drslk-prod-client-10:
>> remote operation failed: Transport endpoint is not connected. Path:
>> (null) (00000000-0000-0000-0000-000000000000)
>> [2015-02-18 12:36:12.376931] E [socket.c:2267:socket_connect_finish]
>> 0-drslk-prod-client-10: connection to 10.10.11.23:24007 failed
>> (Connection refused)
>>
>> [2015-02-18 12:36:12.379296] W
>> [client-rpc-fops.c:2766:client3_3_lookup_cbk] 0-drslk-prod-client-10:
>> remote operation failed: Transport endpoint is not connected. Path:
>> (null) (00000000-0000-0000-0000-000000000000)
>> [2015-02-18 12:36:12.379700] W
>> [client-rpc-fops.c:2766:client3_3_lookup_cbk] 0-drslk-prod-client-10:
>> remote operation failed: Transport endpoint is not connected. Path:
>> (null) (00000000-0000-0000-0000-000000000000)
>> [2015-02-18 13:10:52.759736] E
>> [client-handshake.c:1496:client_query_portmap_cbk]
>> 0-drslk-prod-client-10: failed to get the port number for remote
>> subvolume. Please run 'gluster volume status' on server to see if brick
>> process is running.
>> [2015-02-18 13:10:52.759796] I [client.c:2215:client_rpc_notify]
>> 0-drslk-prod-client-10: disconnected from drslk-prod-client-10. Client
>> process will keep trying to connect to glusterd until brick's port is
>> available
>> [2015-02-18 13:11:02.897307] I [rpc-clnt.c:1761:rpc_clnt_reconfig]
>> 0-drslk-prod-client-10: changing port to 49349 (from 0)
>> [2015-02-18 13:11:02.898097] I
>> [client-handshake.c:1413:select_server_supported_programs]
>> 0-drslk-prod-client-10: Using Program GlusterFS 3.3, Num (1298437),
>> Version (330)
>> [2015-02-18 13:11:02.898446] I
>> [client-handshake.c:1200:client_setvolume_cbk] 0-drslk-prod-client-10:
>> Connected to drslk-prod-client-10, attached to remote volume
>> '/GLUSTERFS/drslk-prod'.
>> [2015-02-18 13:11:02.898460] I
>> [client-handshake.c:1210:client_setvolume_cbk] 0-drslk-prod-client-10:
>> Server and Client lk-version numbers are not same, reopening the fds
>>
>>
> Can you provide the gluster volume configuration details?
>
> It does look like frame-timeout for the volume has been set to 60. Is
> there any specific reason? Normally altering the frame-timeout is not
> recommended.
>
> -Vijay
>
>
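
To put a number on how long the brick was unreachable in the log quoted above,
here is a rough Python sketch that pairs the first "disconnected from" message
of each client translator with the next "Connected to" message. The function
name outages and the LOG sample are made up for this illustration, and the
sketch assumes one log entry per line, as in the original log file rather than
the wrapped form in this mail.

```python
import re
from datetime import datetime

# Three entries taken from the quoted log, rejoined onto single lines.
LOG = """\
[2015-02-18 12:36:12.376829] I [client.c:2215:client_rpc_notify] 0-drslk-prod-client-10: disconnected from drslk-prod-client-10. Client process will keep trying to connect to glusterd until brick's port is available
[2015-02-18 13:10:52.759796] I [client.c:2215:client_rpc_notify] 0-drslk-prod-client-10: disconnected from drslk-prod-client-10. Client process will keep trying to connect to glusterd until brick's port is available
[2015-02-18 13:11:02.898446] I [client-handshake.c:1200:client_setvolume_cbk] 0-drslk-prod-client-10: Connected to drslk-prod-client-10, attached to remote volume '/GLUSTERFS/drslk-prod'.
"""

# [timestamp] LEVEL [file:line:function] translator: message
ENTRY = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\] \w \[[^\]]+\] (\S+): (.*)$"
)


def outages(log_text):
    """Return a list of (translator, down_at, up_at) tuples."""
    down_since = {}
    result = []
    for line in log_text.splitlines():
        match = ENTRY.match(line)
        if not match:
            continue  # skip backtraces and anything else that does not fit
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
        xlator, message = match.group(2), match.group(3)
        if message.startswith("disconnected from"):
            # Keep the earliest disconnect; repeats while down are ignored.
            down_since.setdefault(xlator, stamp)
        elif message.startswith("Connected to") and xlator in down_since:
            result.append((xlator, down_since.pop(xlator), stamp))
    return result


for xlator, down_at, up_at in outages(LOG):
    print(xlator, "down for", up_at - down_at)
```

For the three entries above, the gap between the first disconnect and the
reconnect is a little under 35 minutes. That is far longer than the configured
frame-timeout of 60, if that value is in seconds as I believe.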