[Gluster-users] Gluster errors create zombie processes [LOGS ATTACHED]
Przemysław Mroczek
przemek at durszlak.pl
Sun Mar 8 13:36:31 UTC 2015
I don't have the volfiles. They are not on our machines and, as I said
previously, we have no control over the Gluster servers.
I did find a graph in the client logs that looks similar to a volume file,
so I am pasting it here. We only use the native client to connect to
Gluster servers that we do not administer.
volume drslk-prod-client-0
    type protocol/client
    option ping-timeout 20
    option remote-host brick13.gluster.iadm
    option remote-subvolume /GLUSTERFS/drslk-prod
    option transport-type socket
    option frame-timeout 60
    option send-gids true
end-volume

volume drslk-prod-client-1
    type protocol/client
    option ping-timeout 20
    option remote-host brick14.gluster.iadm
    option remote-subvolume /GLUSTERFS/drslk-prod
    option transport-type socket
    option frame-timeout 60
    option send-gids true
end-volume

volume drslk-prod-client-2
    type protocol/client
    option ping-timeout 20
    option remote-host brick15.gluster.iadm
    option remote-subvolume /GLUSTERFS/drslk-prod
    option transport-type socket
    option frame-timeout 60
    option send-gids true
end-volume

volume drslk-prod-replicate-0
    type cluster/replicate
    option read-hash-mode 2
    option data-self-heal-window-size 128
    option quorum-type auto
    subvolumes drslk-prod-client-0 drslk-prod-client-1 drslk-prod-client-2
end-volume

volume drslk-prod-client-3
    type protocol/client
    option ping-timeout 20
    option remote-host brick16.gluster.iadm
    option remote-subvolume /GLUSTERFS/drslk-prod
    option transport-type socket
    option frame-timeout 60
    option send-gids true
end-volume

volume drslk-prod-client-4
    type protocol/client
    option ping-timeout 20
    option remote-host brick17.gluster.iadm
    option remote-subvolume /GLUSTERFS/drslk-prod
    option transport-type socket
    option frame-timeout 60
    option send-gids true
end-volume

volume drslk-prod-client-5
    type protocol/client
    option ping-timeout 20
    option remote-host brick18.gluster.iadm
    option remote-subvolume /GLUSTERFS/drslk-prod
    option transport-type socket
    option frame-timeout 60
    option send-gids true
end-volume

volume drslk-prod-replicate-1
    type cluster/replicate
    option read-hash-mode 2
    option data-self-heal-window-size 128
    option quorum-type auto
    subvolumes drslk-prod-client-3 drslk-prod-client-4 drslk-prod-client-5
end-volume

volume drslk-prod-client-6
    type protocol/client
    option ping-timeout 20
    option remote-host brick19.gluster.iadm
    option remote-subvolume /GLUSTERFS/drslk-prod
    option transport-type socket
    option frame-timeout 60
    option send-gids true
end-volume

volume drslk-prod-client-7
    type protocol/client
    option ping-timeout 20
    option remote-host brick20.gluster.iadm
    option remote-subvolume /GLUSTERFS/drslk-prod
    option transport-type socket
    option frame-timeout 60
    option send-gids true
end-volume

volume drslk-prod-client-8
    type protocol/client
    option ping-timeout 20
    option remote-host brick21.gluster.iadm
    option remote-subvolume /GLUSTERFS/drslk-prod
    option transport-type socket
    option frame-timeout 60
    option send-gids true
end-volume

volume drslk-prod-replicate-2
    type cluster/replicate
    option read-hash-mode 2
    option data-self-heal-window-size 128
    option quorum-type auto
    subvolumes drslk-prod-client-6 drslk-prod-client-7 drslk-prod-client-8
end-volume

volume drslk-prod-client-9
    type protocol/client
    option ping-timeout 20
    option remote-host brick22.gluster.iadm
    option remote-subvolume /GLUSTERFS/drslk-prod
    option transport-type socket
    option frame-timeout 60
    option send-gids true
end-volume

volume drslk-prod-client-10
    type protocol/client
    option ping-timeout 20
    option remote-host brick23.gluster.iadm
    option remote-subvolume /GLUSTERFS/drslk-prod
    option transport-type socket
    option frame-timeout 60
    option send-gids true
end-volume

volume drslk-prod-client-11
    type protocol/client
    option ping-timeout 20
    option remote-host brick24.gluster.iadm
    option remote-subvolume /GLUSTERFS/drslk-prod
    option transport-type socket
    option frame-timeout 60
    option send-gids true
end-volume

volume drslk-prod-replicate-3
    type cluster/replicate
    option read-hash-mode 2
    option data-self-heal-window-size 128
    option quorum-type auto
    subvolumes drslk-prod-client-9 drslk-prod-client-10 drslk-prod-client-11
end-volume

volume drslk-prod-dht
    type cluster/distribute
    option min-free-disk 10%
    option readdir-optimize on
    subvolumes drslk-prod-replicate-0 drslk-prod-replicate-1 drslk-prod-replicate-2 drslk-prod-replicate-3
end-volume

volume drslk-prod-write-behind
    type performance/write-behind
    option cache-size 1MB
    subvolumes drslk-prod-dht
end-volume

volume drslk-prod-read-ahead
    type performance/read-ahead
    subvolumes drslk-prod-write-behind
end-volume

volume drslk-prod-readdir-ahead
    type performance/readdir-ahead
    subvolumes drslk-prod-read-ahead
end-volume

volume drslk-prod-io-cache
    type performance/io-cache
    option cache-timeout 60
    option cache-size 512MB
    subvolumes drslk-prod-readdir-ahead
end-volume

volume drslk-prod-quick-read
    type performance/quick-read
    option cache-size 512MB
    subvolumes drslk-prod-io-cache
end-volume

volume drslk-prod-md-cache
    type performance/md-cache
    subvolumes drslk-prod-quick-read
end-volume

volume drslk-prod
    type debug/io-stats
    option latency-measurement off
    option count-fop-hits off
    subvolumes drslk-prod-md-cache
end-volume

volume meta-autoload
    type meta
    subvolumes drslk-prod
end-volume
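In case it helps anyone reading the graph, here is a rough Python sketch that turns such a dump into a brick layout per replica set. It is only an illustration: the function names `parse_volfile` and `replica_sets` are made up for this mail, and the embedded sample is an abbreviated copy of the first replica set from the graph above (only the `remote-host` option is kept). The parser tolerates the leading line numbers that the client log adds, stray `*` markers from mail clients, and a `subvolumes` line wrapped onto the next line.

```python
import re

KEYWORDS = {"volume", "type", "option", "subvolumes", "end-volume"}


def parse_volfile(text):
    """Return {name: {"type": str, "options": dict, "subvolumes": list}}."""
    graph = {}
    current = None
    in_subvolumes = False  # True while reading a (possibly wrapped) subvolumes list
    for raw in text.splitlines():
        # Strip mail emphasis markers and the "N:" prefix added by the log.
        line = raw.strip().strip("*")
        line = re.sub(r"^\s*\d+:\s*", "", line).strip()
        if not line:
            continue
        words = line.split()
        key = words[0]
        if key == "volume" and len(words) == 2:
            current = {"type": None, "options": {}, "subvolumes": []}
            graph[words[1]] = current
        elif current is None:
            continue
        elif key == "type" and len(words) == 2:
            current["type"] = words[1]
        elif key == "option" and len(words) >= 3:
            current["options"][words[1]] = " ".join(words[2:])
        elif key == "subvolumes":
            current["subvolumes"].extend(words[1:])
        elif key == "end-volume":
            current = None
        elif in_subvolumes:
            # Continuation of a subvolumes line that was wrapped.
            current["subvolumes"].extend(words)
        in_subvolumes = key == "subvolumes" or (in_subvolumes and key not in KEYWORDS)
    return graph


def replica_sets(graph):
    """Map each cluster/replicate translator to the remote hosts of its bricks."""
    result = {}
    for name, vol in graph.items():
        if vol["type"] == "cluster/replicate":
            result[name] = [
                graph[sub]["options"].get("remote-host")
                for sub in vol["subvolumes"]
                if sub in graph
            ]
    return result


# Abbreviated sample taken from the graph in this mail.
SAMPLE = """
  1: volume drslk-prod-client-0
  2:     type protocol/client
  3:     option remote-host brick13.gluster.iadm
  4: end-volume
  5: volume drslk-prod-client-1
  6:     type protocol/client
  7:     option remote-host brick14.gluster.iadm
  8: end-volume
  9: volume drslk-prod-client-2
 10:     type protocol/client
 11:     option remote-host brick15.gluster.iadm
 12: end-volume
 13: volume drslk-prod-replicate-0
 14:     type cluster/replicate
 15:     option quorum-type auto
 16:     subvolumes drslk-prod-client-0 drslk-prod-client-1 drslk-prod-client-2
 17: end-volume
"""

for name, hosts in replica_sets(parse_volfile(SAMPLE)).items():
    print(name, "->", ", ".join(h or "?" for h in hosts))
```

Run against the full graph, this should show four replica sets of three bricks each under one distribute translator, which makes it easier to see that `drslk-prod-client-10` (the one in the error logs) is the brick on `brick23.gluster.iadm` in `drslk-prod-replicate-3`.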
By the way, do you think the different versions of the Gluster client
(3.6.2) and the Gluster servers (3.6.0.42.1) could be an issue here?
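To check whether the disconnects are limited to a single brick (every error in the log excerpt quoted below names `drslk-prod-client-10`), the client log could be tallied per translator with something like the following. This is a sketch under assumptions: the function name `tally` is invented for this mail, the line layout is inferred from the excerpt rather than from a format specification, and the leading `0-` in names such as `0-drslk-prod-client-10` is treated as a graph id and stripped. The sample lines are copied from the excerpt with the mail line-wrapping undone.

```python
import re
from collections import Counter

# "[timestamp] LEVEL [file:line:function] rest-of-line", as seen in the excerpt.
PREFIX = re.compile(r"^\[(\d{4}-\d{2}-\d{2} [\d:.]+)\] ([A-Z]) \[[^\]]+\] (.*)$")
# First "<graph-id>-<translator-name>: " token in the rest of the line.
XLATOR = re.compile(r"(?:^|\s)(\d+-[A-Za-z0-9_.-]+): ")


def tally(lines, levels=("E", "W")):
    """Count log lines per (translator, level) for the given levels."""
    counts = Counter()
    for line in lines:
        prefix = PREFIX.match(line.strip())
        if not prefix or prefix.group(2) not in levels:
            continue
        xlator = XLATOR.search(prefix.group(3))
        if not xlator:
            continue
        # Drop the leading graph id, e.g. "0-drslk-prod-client-10".
        name = xlator.group(1).split("-", 1)[1]
        counts[(name, prefix.group(2))] += 1
    return counts


SAMPLE_LINES = [
    "[2015-02-18 12:36:12.376749] W [rpc-clnt.c:1562:rpc_clnt_submit] "
    "0-drslk-prod-client-10: failed to submit rpc-request (XID: 0x5d475dc "
    "Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport "
    "(drslk-prod-client-10)",
    "[2015-02-18 12:36:12.376931] E [socket.c:2267:socket_connect_finish] "
    "0-drslk-prod-client-10: connection to 10.10.11.23:24007 failed "
    "(Connection refused)",
    "[2015-02-18 13:11:02.897307] I [rpc-clnt.c:1761:rpc_clnt_reconfig] "
    "0-drslk-prod-client-10: changing port to 49349 (from 0)",
]

for (name, level), count in sorted(tally(SAMPLE_LINES).items()):
    print(name, level, count)
```

To run it on a real log, the list of sample lines would be replaced by an open file handle. I would expect the log for our mount to be under `/var/log/glusterfs/` with a name derived from the mount point `/mnt/storage`, but that path is an assumption on my side.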
2015-03-08 1:29 GMT+01:00 Vijay Bellur <vbellur at redhat.com>:
> On 03/07/2015 06:20 PM, Przemysław Mroczek wrote:
>
>> Hi guys,
>>
>> We have a Rails app that uses Gluster as its distributed file
>> system. The Gluster servers are hosted independently as part of a deal
>> with another company. We have no control over them; we connect to them
>> using the Gluster native client.
>>
>> We tried to resolve this issue with help from the admins of the company
>> that hosts our Gluster servers, but they say it is a client issue, and
>> we have run out of ideas about how that is possible, since we are not
>> doing anything special here.
>>
>> Information about the independent Gluster servers:
>> - Version: 3.6.0.42.1
>> - They are using Red Hat
>> - They are an enterprise setup, so they always use older versions
>>
>> Our servers:
>> System version: Ubuntu 14.04
>> Our gluster client version: 3.6.2
>>
>> The exact problem is that quite often (a couple of times a week) errors
>> in Gluster cause processes to become zombies. It happens with our
>> application server (unicorn), nginx, and our crawling script, which runs
>> as a daemon.
>>
>> Our fstab file:
>>
>> 10.10.11.17:/drslk-prod /mnt/storage glusterfs defaults,_netdev,nobootwait,fetch-attempts=10 0 0
>> 10.10.11.17:/drslk-backup /mnt/backup glusterfs defaults,_netdev,nobootwait,fetch-attempts=10 0 0
>>
>> Logs from gluster:
>>
>> [2015-02-18 12:36:12.375695] E [rpc-clnt.c:362:saved_frames_unwind]
>> (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x186)[0x7fb41ddeada6]
>> (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fb41dbc1c7e]
>> (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fb41dbc1d8e]
>> (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x82)[0x7fb41dbc3602]
>> (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fb41dbc3d98] )))))
>> 0-drslk-prod-client-10: forced unwinding frame type(GlusterFS 3.3)
>> op(LOOKUP(27)) called at 2015-02-18 12:36:12.361489 (xid=0x5d475da)
>> [2015-02-18 12:36:12.375765] W
>> [client-rpc-fops.c:2766:client3_3_lookup_cbk] 0-drslk-prod-client-10:
>> remote operation failed: Transport endpoint is not connected. Path:
>> /system/posts/00/00/71/77/59.jpg (2ad81c2b-a141-478d-9dd4-253345edbceb)
>> [2015-02-18 12:36:12.376288] E [rpc-clnt.c:362:saved_frames_unwind]
>> (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x186)[0x7fb41ddeada6]
>> (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fb41dbc1c7e]
>> (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fb41dbc1d8e]
>> (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x82)[0x7fb41dbc3602]
>> (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fb41dbc3d98] )))))
>> 0-drslk-prod-client-10: forced unwinding frame type(GlusterFS 3.3)
>> op(LOOKUP(27)) called at 2015-02-18 12:36:12.361858 (xid=0x5d475db)
>> [2015-02-18 12:36:12.376355] W
>> [client-rpc-fops.c:2766:client3_3_lookup_cbk] 0-drslk-prod-client-10:
>> remote operation failed: Transport endpoint is not connected. Path:
>> /system/posts/00/00/08 (f5c33a99-719e-4ea2-ad1f-33b893af103d)
>> [2015-02-18 12:36:12.376711] I [socket.c:3292:socket_submit_request]
>> 0-drslk-prod-client-10: not connected (priv->connected = 0)
>> [2015-02-18 12:36:12.376749] W [rpc-clnt.c:1562:rpc_clnt_submit]
>> 0-drslk-prod-client-10: failed to submit rpc-request (XID: 0x5d475dc
>> Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport
>> (drslk-prod-client-10)
>> [2015-02-18 12:36:12.376814] W
>> [client-rpc-fops.c:2766:client3_3_lookup_cbk] 0-drslk-prod-client-10:
>> remote operation failed: Transport endpoint is not connected. Path:
>> (null) (00000000-0000-0000-0000-000000000000)
>> [2015-02-18 12:36:12.376829] I [client.c:2215:client_rpc_notify]
>> 0-drslk-prod-client-10: disconnected from drslk-prod-client-10. Client
>> process will keep trying to connect to glusterd until brick's port is
>> available
>> [2015-02-18 12:36:12.376834] W [rpc-clnt.c:1562:rpc_clnt_submit]
>> 0-drslk-prod-client-10: failed to submit rpc-request (XID: 0x5d475dd
>> Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport
>> (drslk-prod-client-10)
>> [2015-02-18 12:36:12.376906] W
>> [client-rpc-fops.c:2766:client3_3_lookup_cbk] 0-drslk-prod-client-10:
>> remote operation failed: Transport endpoint is not connected. Path:
>> (null) (00000000-0000-0000-0000-000000000000)
>> [2015-02-18 12:36:12.376931] E [socket.c:2267:socket_connect_finish]
>> 0-drslk-prod-client-10: connection to 10.10.11.23:24007 failed
>> (Connection refused)
>>
>> [2015-02-18 12:36:12.379296] W
>> [client-rpc-fops.c:2766:client3_3_lookup_cbk] 0-drslk-prod-client-10:
>> remote operation failed: Transport endpoint is not connected. Path:
>> (null) (00000000-0000-0000-0000-000000000000)
>> [2015-02-18 12:36:12.379700] W
>> [client-rpc-fops.c:2766:client3_3_lookup_cbk] 0-drslk-prod-client-10:
>> remote operation failed: Transport endpoint is not connected. Path:
>> (null) (00000000-0000-0000-0000-000000000000)
>> [2015-02-18 13:10:52.759736] E
>> [client-handshake.c:1496:client_query_portmap_cbk]
>> 0-drslk-prod-client-10: failed to get the port number for remote
>> subvolume. Please run 'gluster volume status' on server to see if brick
>> process is running.
>> [2015-02-18 13:10:52.759796] I [client.c:2215:client_rpc_notify]
>> 0-drslk-prod-client-10: disconnected from drslk-prod-client-10. Client
>> process will keep trying to connect to glusterd until brick's port is
>> available
>> [2015-02-18 13:11:02.897307] I [rpc-clnt.c:1761:rpc_clnt_reconfig]
>> 0-drslk-prod-client-10: changing port to 49349 (from 0)
>> [2015-02-18 13:11:02.898097] I
>> [client-handshake.c:1413:select_server_supported_programs]
>> 0-drslk-prod-client-10: Using Program GlusterFS 3.3, Num (1298437),
>> Version (330)
>> [2015-02-18 13:11:02.898446] I
>> [client-handshake.c:1200:client_setvolume_cbk] 0-drslk-prod-client-10:
>> Connected to drslk-prod-client-10, attached to remote volume
>> '/GLUSTERFS/drslk-prod'.
>> [2015-02-18 13:11:02.898460] I
>> [client-handshake.c:1210:client_setvolume_cbk] 0-drslk-prod-client-10:
>> Server and Client lk-version numbers are not same, reopening the fds
>>
>>
> Can you provide the gluster volume configuration details?
>
> It does look like frame-timeout for the volume has been set to 60. Is
> there any specific reason? Normally altering the frame-timeout is not
> recommended.
>
> -Vijay
>
>