[Bugs] [Bug 1387960] New: Sequential volume start&stop is failing with SSL enabled setup.

bugzilla at redhat.com bugzilla at redhat.com
Mon Oct 24 04:14:51 UTC 2016


https://bugzilla.redhat.com/show_bug.cgi?id=1387960

            Bug ID: 1387960
           Summary: Sequential volume start&stop  is failing  with SSL
                    enabled setup.
           Product: GlusterFS
           Version: 3.9
         Component: rpc
          Keywords: Triaged
          Severity: high
          Assignee: kaushal at redhat.com
          Reporter: kaushal at redhat.com
                CC: amukherj at redhat.com, bsrirama at redhat.com,
                    bugs at gluster.org, kaushal at redhat.com
        Depends On: 1336371, 1336508
            Blocks: 1336339, 1336369, 1336376, 1381822



+++ This bug was initially created as a clone of Bug #1336371 +++

+++ This bug was initially created as a clone of Bug #1336369 +++

+++ This bug was initially created as a clone of Bug #1336339 +++

Description of problem:
=======================
I ran 400 sequential volume start and stop operations; after some time they
began failing with the following CLI error messages
"""
volume start: rep: success
volume stop: rep: success
volume start: rep: success
volume stop: rep: failed: Commit failed on localhost. Please check the log file
for more details.
volume start: rep: failed: Pre Validation failed on
dhcp43-215.lab.eng.blr.redhat.com. Volume rep already started
volume stop: rep: failed: Volume rep is not in the started state
volume start: rep: failed: Pre-validation failed on localhost. Please check log
file for details
volume stop: rep: failed: Volume rep is not in the started state
volume start: rep: failed: Pre-validation failed on localhost. Please check lo
"""

The glusterd logs are posted in the comments below.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.7.9-4.


How reproducible:
=================
Always


Steps to Reproduce:
===================
1. Create a 16 x 2 = 32 type volume using two nodes.
2. FUSE mount and unmount the volume 300 times.
3. Start and stop the volume 400 times (it fails after some number of start/stop operations).
4. Check the glusterd logs for error messages and warnings (a small fd-count helper, sketched below, can also be used to watch the leak grow).
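
While the start/stop loop runs, the growing open-fd count of the glusterd
process is the most direct symptom of this bug. The helper below is a sketch
added for illustration (it is not part of the original report); the name
fdcount and its usage, e.g. "fdcount $(pidof glusterd)", are assumptions.

/* fdcount.c - count the open file descriptors of a process by listing
 * /proc/<pid>/fd. Usage: ./fdcount <pid>  (illustrative sketch only) */
#include <dirent.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char path[64];
    DIR *dir;
    struct dirent *entry;
    int count = 0;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }

    snprintf(path, sizeof(path), "/proc/%s/fd", argv[1]);
    dir = opendir(path);
    if (!dir) {
        perror("opendir");
        return 1;
    }

    /* every entry other than "." and ".." is one open fd */
    while ((entry = readdir(dir)) != NULL) {
        if (entry->d_name[0] != '.')
            count++;
    }
    closedir(dir);

    printf("%d open fds\n", count);
    return 0;
}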

Actual results:
===============
Sequential volume start and stop fails on an SSL-enabled setup.


Expected results:
=================
Sequential volume start and stop should work on an SSL-enabled setup.



Additional info:

--- Additional comment from Red Hat Bugzilla Rules Engine on 2016-05-16
13:58:53 IST ---

This bug is automatically being proposed for the current z-stream release of
Red Hat Gluster Storage 3 by setting the release flag 'rhgs-3.1.z' to '?'.

If this bug should be proposed for a different release, please manually change
the proposed release flag.

--- Additional comment from Byreddy on 2016-05-16 13:59:59 IST ---

[2016-05-16 06:52:45.028639] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028646] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028654] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028661] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028674] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028679] I [socket.c:459:ssl_setup_connection]
0-socket.management: peer CN = dhcp43-215.lab.eng.blr.redhat.com
[2016-05-16 06:52:45.028689] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028750] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028760] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028768] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028776] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028784] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028792] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028806] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028815] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028823] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028831] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028838] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028864] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028872] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028880] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028888] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028895] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028903] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028911] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028918] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028926] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028934] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028941] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028949] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028956] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028965] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028973] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.029132] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.029152] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.029168] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.029183] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.029191] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.029198] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.029206] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.029213] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.029221] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.029228] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.029236] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.029244] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.029251] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.029258] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.030909] W [socket.c:2719:socket_server_event_handler]
0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.031075] E [socket.c:2868:socket_server_event_handler]
0-socket.management: could not create pipe
[2016-05-16 06:52:45.041646] I [socket.c:459:ssl_setup_connection]
0-socket.management: peer CN = dhcp43-215.lab.eng.blr.redhat.com
[2016-05-16 06:52:45.043605] E [socket.c:2616:socket_poller]
0-socket.management: error in polling loop
The message "W [MSGID: 101088] [common-utils.c:3662:gf_backtrace_save]
0-management: Failed to save the backtrace." repeated 2 times between
[2016-05-16 06:52:44.061901] and [2016-05-16 06:52:47.161342]
[2016-05-16 06:52:47.163334] E [MSGID: 106280]
[glusterd-volume-ops.c:1653:glusterd_op_stage_stop_volume] 0-management: Volume
rep is not in the started state
[2016-05-16 06:52:47.163376] E [MSGID: 106301]
[glusterd-syncop.c:1281:gd_stage_op_phase] 0-management: Staging of operation
'Volume Stop' failed on localhost : Volume rep is not in the started state
[2016-05-16 06:52:47.163551] W [MSGID: 101088]
[common-utils.c:3662:gf_backtrace_save] 0-management: Failed to save the
backtrace.
The message "W [MSGID: 101088] [common-utils.c:3662:gf_backtrace_save]
0-management: Failed to save the backtrace." repeated 2 times between
[2016-05-16 06:52:47.163551] and [2016-05-16 06:52:47.252524]
[2016-05-16 06:52:47.254692] E [MSGID: 106408]
[glusterd-peer-utils.c:120:glusterd_peerinfo_find_by_hostname] 0-management:
error in getaddrinfo: System error
 [Unknown error -11]
[2016-05-16 06:52:47.254771] E [MSGID: 101075]
[common-utils.c:3137:gf_is_local_addr] 0-management: error in getaddrinfo:
System error

[2016-05-16 06:52:47.254784] E [MSGID: 106187]
[glusterd-volume-ops.c:1486:glusterd_op_stage_start_volume] 0-management: Could
not find peer on which brick
dhcp43-214.lab.eng.blr.redhat.com:/bricks/brick0/brr0 resides
[2016-05-16 06:52:47.254793] W [MSGID: 106122]
[glusterd-mgmt.c:198:gd_mgmt_v3_pre_validate_fn] 0-management: Volume start
prevalidation failed.
[2016-05-16 06:52:47.254800] E [MSGID: 106122]
[glusterd-mgmt.c:879:glusterd_mgmt_v3_pre_validate] 0-management: Pre
Validation failed for operation Start on local node
[2016-05-16 06:52:47.254809] E [MSGID: 106122]
[glusterd-mgmt.c:1991:glusterd_mgmt_v3_initiate_all_phases] 0-management: Pre
Validation Failed
[2016-05-16 06:52:47.255009] W [MSGID: 101088]
[common-utils.c:3662:gf_backtrace_save] 0-management: Failed to save the
backtrace.
The message "W [MSGID: 101088] [common-utils.c:3662:gf_backtrace_save]
0-management: Failed to save the backtrace." repeated 2 times between
[2016-05-16 06:52:47.255009] and [2016-05-16 06:52:50.350609]
[2016-05-16 06:52:50.352683] E [MSGID: 106280]
[glusterd-volume-ops.c:1653:glusterd_op_stage_stop_volume] 0-management: Volume
rep is not in the started state
[2016-05-16 06:52:50.352756] E [MSGID: 106301]
[glusterd-syncop.c:1281:gd_stage_op_phase] 0-management: Staging of operation
'Volume Stop' failed on localhost : Volume rep is not in the started state
[2016-05-16 06:52:50.352909] W [MSGID: 101088]
[common-utils.c:3662:gf_backtrace_save] 0-management: Failed to save the
backtrace.
The message "W [MSGID: 101088] [common-utils.c:3662:gf_backtrace_save]
0-management: Failed to save the backtrace." repeated 2 times between
[2016-05-16 06:52:50.352909] and [2016-05-16 06:52:50.448995]
[2016-05-16 06:52:50.451250] E [MSGID: 106408]
[glusterd-peer-utils.c:120:glusterd_peerinfo_find_by_hostname] 0-management:
error in getaddrinfo: System error
 [Unknown error -11]
[2016-05-16 06:52:50.451338] E [MSGID: 101075]
[common-utils.c:3137:gf_is_local_addr] 0-management: error in getaddrinfo:
System error

[2016-05-16 06:52:50.451353] E [MSGID: 106187]
[glusterd-volume-ops.c:1486:glusterd_op_stage_start_volume] 0-management: Could
not find peer on which brick
dhcp43-214.lab.eng.blr.redhat.com:/bricks/brick0/brr0 resides
[2016-05-16 06:52:50.451363] W [MSGID: 106122]
[glusterd-mgmt.c:198:gd_mgmt_v3_pre_validate_fn] 0-management: Volume start
prevalidation failed.
[2016-05-16 06:52:50.451371] E [MSGID: 106122]
[glusterd-mgmt.c:879:glusterd_mgmt_v3_pre_validate] 0-management: Pre
Validation failed for operation Start on local node
[2016-05-16 06:52:50.451381] E [MSGID: 106122]
[glusterd-mgmt.c:1991:glusterd_mgmt_v3_initiate_all_phases] 0-management: Pre
Validation Failed
[2016-05-16 06:52:50.451505] W [MSGID: 101088]
[common-utils.c:3662:gf_backtrace_save] 0-management: Failed to save the
backtrace.
The message "W [MSGID: 101088] [common-utils.c:3662:gf_backtrace_save]
0-management: Failed to save the backtrace." repeated 2 times between
[2016-05-16 06:52:50.451505] and [2016-05-16 06:52:53.542390]
[2016-05-16 06:52:53.544425] E [MSGID: 106280]
[glusterd-volume-ops.c:1653:glusterd_op_stage_stop_volume] 0-management: Volume
rep is not in the started state
[2016-05-16 06:52:53.544467] E [MSGID: 106301]
[glusterd-syncop.c:1281:gd_stage_op_phase] 0-management: Staging of operation
'Volume Stop' failed on localhost : Volume rep is not in the star:

[2016-05-16 06:53:00.020201] E [MSGID: 101075]
[common-utils.c:3137:gf_is_local_addr] 0-management: error in getaddrinfo:
System error

[2016-05-16 06:53:00.020216] E [MSGID: 106187]
[glusterd-volume-ops.c:1486:glusterd_op_stage_start_volume] 0-management: Could
not find peer on which brick
dhcp43-214.lab.eng.blr.redhat.com:/bricks/brick0/brr0 resides
[2016-05-16 06:53:00.020225] W [MSGID: 106122]
[glusterd-mgmt.c:198:gd_mgmt_v3_pre_validate_fn] 0-management: Volume start
prevalidation failed.
[2016-05-16 06:53:00.020233] E [MSGID: 106122]
[glusterd-mgmt.c:879:glusterd_mgmt_v3_pre_validate] 0-management: Pre
Validation failed for operation Start on local node
[2016-05-16 06:53:00.020243] E [MSGID: 106122]
[glusterd-mgmt.c:1991:glusterd_mgmt_v3_initiate_all_phases] 0-management: Pre
Validation Failed
[2016-05-16 06:53:00.020359] W [MSGID: 101088]
[common-utils.c:3662:gf_backtrace_save] 0-management: Failed to save the
backtrace.
[2016-05-16 06:56:30.764560] I [MSGID: 106487]
[glusterd-handler.c:1472:__glusterd_handle_cli_list_friends] 0-glusterd:
Received cli list req
[2016-05-16 07:27:02.476621] E [MSGID: 101012] [logging.c:376:gf_log_rotate]
0-logrotate: failed to open logfile [Too many open files]
[2016-05-16 07:27:02.476615] W [MSGID: 101095]
[xlator.c:148:xlator_volopt_dynload] 0-xlator:
/usr/lib64/glusterfs/3.7.9/xlator/cluster/distribute.so: cannot open shared
object file: Too many open files
[2016-05-16 07:27:02.476753] W [MSGID: 101095]
[xlator.c:148:xlator_volopt_dynload] 0-xlator:
/usr/lib64/glusterfs/3.7.9/xlator/cluster/replicate.so: cannot open shared
object file: Too many open files
[2016-05-16 07:27:02.476902] W [MSGID: 101095]
[xlator.c:148:xlator_volopt_dynload] 0-xlator:
/usr/lib64/glusterfs/3.7.9/xlator/cluster/disperse.so: cannot open shared
object file: Too many open files
[2016-05-16 07:27:02.476983] W [MSGID: 101095]
[xlator.c:148:xlator_volopt_dynload] 0-xlator:
/usr/lib64/glusterfs/3.7.9/xlator/cluster/stripe.so: cannot open shared object
file: Too many open files
[2016-05-16 07:27:02.477002] W [MSGID: 101095]
[xlator.c:148:xlator_volopt_dynload] 0-xlator:
/usr/lib64/glusterfs/3.7.9/xlator/cluster/stripe.so: cannot open shared object
file: Too many open files
[2016-05-16 07:27:02.477010] W [MSGID: 101095]
[xlator.c:148:xlator_volopt_dynload] 0-xlator:
/usr/lib64/glusterfs/3.7.9/xlator/debug/io-stats.so: cannot open shared object
file: Too many open files
The message "W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator:
/usr/lib64/glusterfs/3.7.9/xlator/debug/io-stats.so: cannot open shared object
file: Too many open files" repeated 13 times between [2016-05-16
07:27:02.477010] and [2016-05-16 07:27:02.477115]
[2016-05-16 07:27:02.477124] W [MSGID: 101095]
[xlator.c:148:xlator_volopt_dynload] 0-xlator:
/usr/lib64/glusterfs/3.7.9/xlator/performance/io-cache.so: cannot open shared
object file: Too many open files
The message "W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator:
/usr/lib64/glusterfs/3.7.9/xlator/performance/io-cache.so: cannot open shared
object file: Too many open files" repeated 4 times between [2016-05-16
07:27:02.477124] and [2016-05-16 07:27:02.477170]
[2016-05-16 07:27:02.477177] W [MSGID: 101095]
[xlator.c:148:xlator_volopt_dynload] 0-xlator:
/usr/lib64/glusterfs/3.7.9/xlator/performance/io-threads.so: cannot open shared
object file: Too many open files
The message "W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator:
/usr/lib64/glusterfs/3.7.9/xlator/performance/io-threads.so: cannot open shared
object file: Too many open files" repeated 6 times between [2016-05-16
07:27:02.477177] and [2016-05-16 07:27:02.477235]
[2016-05-16 07:27:02.477242] W [MSGID: 101095]
[xlator.c:148:xlator_volopt_dynload] 0-xlator:
/usr/lib64/glusterfs/3.7.9/xlator/performance/write-behind.so: cannot open
shared object file: Too many open files
The message "W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator:
/usr/lib64/glusterfs/3.7.9/xlator/performance/write-behind.so: cannot open
shared object file: Too many open files" repeated 7 times between [2016-05-16
07:27:02.477242] and [2016-05-16 07:27:02.477312]
[2016-05-16 07:27:02.477319] W [MSGID: 101095]
[xlator.c:148:xlator_volopt_dynload] 0-xlator:
/usr/lib64/glusterfs/3.7.9/xlator/performance/open-behind.so: cannot open
shared object file: Too many open files
[2016-05-16 07:27:02.477343] W [MSGID: 101095]
[xlator.c:148:xlator_volopt_dynload] 0-xlator:
/usr/lib64/glusterfs/3.7.9/xlator/performance/open-behind.so: cannot open
shared object file: Too many open files
[2016-05-16 07:27:02.477350] W [MSGID: 101095]
[xlator.c:148:xlator_volopt_dynload] 0-xlator:
/usr/lib64/glusterfs/3.7.9/xlator/performance/read-ahead.so: cannot open shared
object file: Too many open files
[2016-05-16 07:27:02.477371] W [MSGID: 101095]
[xlator.c:148:xlator_volopt_dynload] 0-xlator:
/usr/lib64/glusterfs/3.7.9/xlator/performance/md-cache.so: cannot open shared
object file: Too many open files
[2016-05-16 07:27:02.477390] W [MSGID: 101095]
[xlator.c:148:xlator_volopt_dynload] 0-xlator:
/usr/lib64/glusterfs/3.7.9/xlator/encryption/crypt.so: cannot open shared
object file: Too many open files
The message "W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator:
/usr/lib64/glusterfs/3.7.9/xlator/encryption/crypt.so: cannot open shared
object file: Too many open files" repeated 2 times between [2016-05-16
07:27:02.477390] and [2016-05-16 07:27:02.477412]
The message "W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator:
/usr/lib64/glusterfs/3.7.9/xlator/cluster/distribute.so: cannot open shared
object file: Too many open files" repeated 9 times between [2016-05-16
07:27:02.476615] and [2016-05-16 07:27:02.478514]
The message "W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator:
/usr/lib64/glusterfs/3.7.9/xlator/cluster/replicate.so: cannot open shared
object file: Too many open files" repeated 2 times between [2016-05-16
07:27:02.478527] and [2016-05-16 07:27:02.478556]
[2016-05-16 08:04:49.794095] W [MSGID: 101088]
[common-utils.c:3662:gf_backtrace_save] 0-management: Failed to save the
backtrace.
[2016-05-16 08:04:49.849469] I [MSGID: 106499]
[glusterd-handler.c:4330:__glusterd_handle_status_volume] 0-management:
Received status volume req for volume rep
[2016-05-16 08:04:49.851449] E [MSGID: 106301]
[glusterd-syncop.c:1281:gd_stage_op_phase] 0-management: Staging of operation
'Volume Status' failed on localhost : Volume rep is not started
[2016-05-16 08:04:55.999928] E [MSGID: 106408]
[glusterd-peer-utils.c:120:glusterd_peerinfo_find_by_hostname] 0-management:
error in getaddrinfo: System error
 [Unknown error -11]
[2016-05-16 08:04:56.000034] E [MSGID: 101075]
[common-utils.c:3137:gf_is_local_addr] 0-management: error in getaddrinfo:
System error

[2016-05-16 08:04:56.000059] E [MSGID: 106187]
[glusterd-volume-ops.c:1486:glusterd_op_stage_start_volume] 0-management: Could
not find peer on which brick
dhcp43-214.lab.eng.blr.redhat.com:/bricks/brick0/brr0 resides
[2016-05-16 08:04:56.000096] W [MSGID: 106122]
[glusterd-mgmt.c:198:gd_mgmt_v3_pre_validate_fn] 0-management: Volume start
prevalidation failed.
[2016-05-16 08:04:56.000107] E [MSGID: 106122]
[glusterd-mgmt.c:879:glusterd_mgmt_v3_pre_validate] 0-management: Pre
Validation failed for operation Start on local node
[2016-05-16 08:04:56.000117] E [MSGID: 106122]
[glusterd-mgmt.c:1991:glusterd_mgmt_v3_initiate_all_phases] 0-management: Pre
Validation Failed
[2016-05-16 08:05:27.990060] E [MSGID: 106408]
[glusterd-peer-utils.c:120:glusterd_peerinfo_find_by_hostname] 0-management:
error in getaddrinfo: System error
 [Unknown error -11]
[2016-05-16 08:05:27.990090] E [MSGID: 101075]
[common-utils.c:3137:gf_is_local_addr] 0-management: error in getaddrinfo:
System error

[2016-05-16 08:05:27.990157] E [MSGID: 106187]
[glusterd-volume-ops.c:1486:glusterd_op_stage_start_volume] 0-management: Could
not find peer on which brick
dhcp43-214.lab.eng.blr.redhat.com:/bricks/brick0/brr0 resides
[2016-05-16 08:05:27.990177] W [MSGID: 106122]
[glusterd-mgmt.c:198:gd_mgmt_v3_pre_validate_fn] 0-management: Volume start
prevalidation failed.
[2016-05-16 08:05:27.990246] E [MSGID: 106122]
[glusterd-mgmt.c:879:glusterd_mgmt_v3_pre_validate] 0-management: Pre
Validation failed for operation Start on local node
[2016-05-16 08:05:27.990272] E [MSGID: 106122]
[glusterd-mgmt.c:1991:glusterd_mgmt_v3_initiate_all_phases] 0-management: Pre
Validation Failed
The message "W [MSGID: 101088] [common-utils.c:3662:gf_backtrace_save]
0-management: Failed to save the backtrace." repeated 10 times between
[2016-05-16 08:04:49.794095] and [2016-05-16 08:05:27.990523]
[2016-05-16 08:13:05.184001] W [MSGID: 101088]
[common-utils.c:3662:gf_backtrace_save] 0-management: Failed to save the
backtrace.
[2016-05-16 08:13:05.186580] E [MSGID: 106408]
[glusterd-peer-utils.c:120:glusterd_peerinfo_find_by_hostname] 0-management:
error in getaddrinfo: System error
 [Unknown error -11]
[2016-05-16 08:13:05.186639] E [MSGID: 101075]
[common-utils.c:3137:gf_is_local_addr] 0-management: error in getaddrinfo:
System error

[2016-05-16 08:13:05.186654] E [MSGID: 106187]
[glusterd-volume-ops.c:1486:glusterd_op_stage_start_volume] 0-management: Could
not find peer on which brick
dhcp43-214.lab.eng.blr.redhat.com:/bricks/brick0/brr0 resides
[2016-05-16 08:13:05.186665] W [MSGID: 106122]
[glusterd-mgmt.c:198:gd_mgmt_v3_pre_validate_fn] 0-management: Volume start
prevalidation failed.
[2016-05-16 08:13:05.186675] E [MSGID: 106122]
[glusterd-mgmt.c:879:glusterd_mgmt_v3_pre_validate] 0-management: Pre
Validation failed for operation Start on local node
[2016-05-16 08:13:05.186690] E [MSGID: 106122]
[glusterd-mgmt.c:1991:glusterd_mgmt_v3_initiate_all_phases] 0-management: Pre
Validation Failed
[2016-05-16 08:13:06.581765] E [MSGID: 106187]
[glusterd-volume-ops.c:1486:glusterd_op_stage_start_volume] 0-management: Could
not find peer on which brick
dhcp43-214.lab.eng.blr.redhat.com:/bricks/brick0/brr0 resides
[2016-05-16 08:13:06.581767] W [MSGID: 106122]
[glusterd-mgmt.c:198:gd_mgmt_v3_pre_validate_fn] 0-management: Volume start
prevalidation failed.
[2016-05-16 08:13:06.581833] E [MSGID: 106122]
[glusterd-mgmt.c:879:glusterd_mgmt_v3_pre_validate] 0-management: Pre
Validation failed for operation Start on local node
[2016-05-16 08:13:06.581722] E [MSGID: 106408]
[glusterd-peer-utils.c:120:glusterd_peerinfo_find_by_hostname] 0-management:
error in getaddrinfo: System error
 [Unknown error -11]
[2016-05-16 08:13:06.581761] E [MSGID: 101075]
[common-utils.c:3137:gf_is_local_addr] 0-management: error in getaddrinfo:
System error

[2016-05-16 08:13:06.581873] E [MSGID: 106122]
[glusterd-mgmt.c:1991:glusterd_mgmt_v3_initiate_all_phases] 0-management: Pre
Validation Failed
The message "W [MSGID: 101088] [common-utils.c:3662:gf_backtrace_save]
0-management: Failed to save the backtrace." repeated 5 times between
[2016-05-16 08:13:05.184001] and [2016-05-16 08:13:06.581980]
[2016-05-16 08:19:40.223288] I [MSGID: 106488]
[glusterd-handler.c:1533:__glusterd_handle_cli_get_volume] 0-glusterd: Received
get vol req
[2016-05-16 08:19:40.239294] I [MSGID: 106488]
[glusterd-handler.c:1533:__glusterd_handle_cli_get_volume] 0-glusterd: Received
get vol req

--- Additional comment from Kaushal on 2016-05-16 15:12:08 IST ---

Each encrypted connection creates a pipe (which uses 2 fds) that is not closed
after disconnection. This is an fd leak, which can lead to resource starvation
and cause other operations to fail afterwards.
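
A minimal sketch of the leak pattern described above; the struct and function
names here (conn, conn_setup, conn_teardown_leaky) are illustrative
assumptions, not the real rpc/socket.c symbols.

/* Hypothetical sketch of the leak, illustrative names only. */
#include <unistd.h>

struct conn {
    int sock;       /* the encrypted management connection's socket */
    int pipe_fd[2]; /* pipe created for the connection's polling thread */
};

int conn_setup(struct conn *c)
{
    /* every encrypted connection creates a pipe: two fds per connection */
    return pipe(c->pipe_fd);
}

void conn_teardown_leaky(struct conn *c)
{
    close(c->sock);
    /* pipe_fd[0] and pipe_fd[1] are never closed here, so each
     * connect/disconnect cycle leaks two fds until the process hits
     * EMFILE ("Too many open files") */
}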

--- Additional comment from Vijay Bellur on 2016-05-16 15:18:56 IST ---

REVIEW: http://review.gluster.org/14356 (rpc/socket: Close pipe on
disconnection) posted (#1) for review on master by Kaushal M
(kaushal at redhat.com)

--- Additional comment from Vijay Bellur on 2016-05-16 15:53:41 IST ---

REVIEW: http://review.gluster.org/14356 (rpc/socket: Close pipe on
disconnection) posted (#2) for review on master by Kaushal M
(kaushal at redhat.com)

--- Additional comment from Vijay Bellur on 2016-05-18 15:58:03 IST ---

REVIEW: http://review.gluster.org/14356 (rpc/socket: Close pipe on
disconnection) posted (#3) for review on master by Kaushal M
(kaushal at redhat.com)

--- Additional comment from Vijay Bellur on 2016-06-15 13:36:26 IST ---

REVIEW: http://review.gluster.org/14356 (rpc/socket: Close pipe on
disconnection) posted (#4) for review on master by Kaushal M
(kaushal at redhat.com)

--- Additional comment from Atin Mukherjee on 2016-08-10 12:34:15 IST ---

http://review.gluster.org/#/c/14694/ actually fixes this problem.

--- Additional comment from Kaushal on 2016-10-20 14:13:19 IST ---

http://review.gluster.org/#/c/14694/ doesn't fix the issue completely.

With that change, pipes are only closed when spawning the own-thread
(socket_spawn) fails. They are still left open when a working connection
disconnects.

So the fix is only partial. It still requires http://review.gluster.org/14356,
which closes pipes after an SSL connection disconnects.
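
Roughly, the difference between the two reviews can be pictured as follows.
This is a hypothetical sketch with illustrative names (conn, conn_start,
conn_disconnect, spawn_own_thread), not the real socket.c code: 14694 only
releases the pipe on the spawn-failure path, while 14356 also releases it on
the normal disconnect path.

/* Hypothetical sketch, illustrative names only. */
#include <unistd.h>

struct conn {
    int sock;
    int pipe_fd[2];
};

/* stand-in for socket_spawn(); assume it can fail */
static int spawn_own_thread(struct conn *c) { (void)c; return 0; }

int conn_start(struct conn *c)
{
    if (pipe(c->pipe_fd) != 0)
        return -1;

    if (spawn_own_thread(c) != 0) {
        /* the path review 14694 covers: release the pipe when the
         * own-thread cannot be spawned */
        close(c->pipe_fd[0]);
        close(c->pipe_fd[1]);
        return -1;
    }
    return 0;
}

void conn_disconnect(struct conn *c)
{
    close(c->sock);
    /* the path review 14356 adds: release the pipe when a working
     * connection disconnects; without it, every SSL connect/disconnect
     * cycle leaks two fds */
    close(c->pipe_fd[0]);
    close(c->pipe_fd[1]);
}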

--- Additional comment from Worker Ant on 2016-10-20 14:15:18 IST ---

REVIEW: http://review.gluster.org/14356 (rpc/socket: Close pipe on
disconnection) posted (#5) for review on master by Kaushal M
(kaushal at redhat.com)

--- Additional comment from Worker Ant on 2016-10-20 20:08:23 IST ---

REVIEW: http://review.gluster.org/14356 (rpc/socket: Close pipe on
disconnection) posted (#6) for review on master by Kaushal M
(kaushal at redhat.com)

--- Additional comment from Worker Ant on 2016-10-20 20:22:14 IST ---

REVIEW: http://review.gluster.org/14356 (rpc/socket: Close pipe on
disconnection) posted (#7) for review on master by Kaushal M
(kaushal at redhat.com)

--- Additional comment from Worker Ant on 2016-10-21 11:34:59 IST ---

REVIEW: http://review.gluster.org/14356 (rpc/socket: Close pipe on
disconnection) posted (#8) for review on master by Kaushal M
(kaushal at redhat.com)

--- Additional comment from Worker Ant on 2016-10-24 09:40:54 IST ---

COMMIT: http://review.gluster.org/14356 committed in master by Raghavendra G
(rgowdapp at redhat.com) 
------
commit d4b993610c1ee64ab7c2f41cf457e63a3da7f521
Author: Kaushal M <kaushal at redhat.com>
Date:   Mon May 16 15:15:49 2016 +0530

    rpc/socket: Close pipe on disconnection

    Encrypted connections create a pipe, which isn't closed when the
    connection disconnects. This leaks fds, and gluster eventually ends up
    in a situation with fd starvation which leads to operation failures.

    Change-Id: I144e1f767cec8c6fc1aa46b00cd234129d2a4adc
    BUG: 1336371
    Signed-off-by: Kaushal M <kaushal at redhat.com>
    Reviewed-on: http://review.gluster.org/14356
    Tested-by: MOHIT AGRAWAL <moagrawa at redhat.com>
    Smoke: Gluster Build System <jenkins at build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins at build.gluster.org>
    NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>
    Reviewed-by: Raghavendra G <rgowdapp at redhat.com>


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1336339
[Bug 1336339] Sequential volume start&stop  is failing  with SSL  enabled
setup.
https://bugzilla.redhat.com/show_bug.cgi?id=1336369
[Bug 1336369] Sequential volume start&stop  is failing  with SSL  enabled
setup.
https://bugzilla.redhat.com/show_bug.cgi?id=1336371
[Bug 1336371] Sequential volume start&stop  is failing  with SSL  enabled
setup.
https://bugzilla.redhat.com/show_bug.cgi?id=1336376
[Bug 1336376] Sequential volume start&stop  is failing  with SSL  enabled
setup.
https://bugzilla.redhat.com/show_bug.cgi?id=1336508
[Bug 1336508] rpc-transport: compiler warning format string
https://bugzilla.redhat.com/show_bug.cgi?id=1381822
[Bug 1381822] glusterd.log is flooded with socket.management: accept on 11
failed (Too many open files) and glusterd service stops