[Gluster-users] Initial mount problem - all subvolumes are down
Rumen Telbizov
telbizov at gmail.com
Wed Apr 1 16:51:07 UTC 2015
Any update here? Can I hope to see a fix incorporated into the 3.6.3
release?
On Tue, Mar 31, 2015 at 10:53 AM, Pranith Kumar Karampuri <
pkarampu at redhat.com> wrote:
>
> On 03/31/2015 10:47 PM, Rumen Telbizov wrote:
>
> Pranith and Atin,
>
> Thank you for looking into this and confirming it's a bug. Please log
> the bug yourself since I am not familiar with the project's bug-tracking
> system.
>
> Given its severity and the fact that this effectively stops the
> cluster from functioning properly after boot, what do you think the
> timeline for fixing this issue would be? What version do you expect the
> fix to land in?
>
> In the meantime, is there another workaround you might suggest
> besides running a second mount later, after boot is complete?
>
> Adding glusterd maintainers to the thread: +kaushal, +krishnan
> I will let them answer your questions.
>
> Pranith
>
>
> Thank you again for your help,
> Rumen Telbizov
>
>
>
> On Tue, Mar 31, 2015 at 2:53 AM, Pranith Kumar Karampuri <
> pkarampu at redhat.com> wrote:
>
>>
>> On 03/31/2015 01:55 PM, Atin Mukherjee wrote:
>>
>>>
>>> On 03/31/2015 01:03 PM, Pranith Kumar Karampuri wrote:
>>>
>>>> On 03/31/2015 12:53 PM, Atin Mukherjee wrote:
>>>>
>>>>> On 03/31/2015 12:27 PM, Pranith Kumar Karampuri wrote:
>>>>>
>>>>>> Atin,
>>>>>> Could it be because bricks are started with
>>>>>> PROC_START_NO_WAIT?
>>>>>>
>>>>> That's the correct analysis, Pranith. The mount was attempted before the
>>>>> bricks were started. If we can have a time lag of a few seconds between
>>>>> the mount and the volume start, the problem will go away.
>>>>>
>>>> Atin,
>>>> I think one way to solve this issue is to still start the bricks with
>>>> NO_WAIT so that we can handle pmap-signin, but wait for the pmap-signins
>>>> to complete before responding to the CLI/completing 'init'?
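>>>>
>>>> Purely as a schematic illustration of that ordering (hypothetical helper
>>>> names, not glusterd code): spawn the brick processes without blocking,
>>>> then hold off the reply until every brick has signed in.

```shell
# Schematic sketch only -- NOT glusterd code. fake_brick and brick_signed_in
# are hypothetical stand-ins for "spawn brick process" and "pmap sign-in done".
start_bricks_then_wait() {
    # phase 1: start every brick without waiting (the NO_WAIT part)
    for port in "$@"; do
        fake_brick "$port" &
    done
    # phase 2: block until each brick has signed in; only then answer the CLI
    for port in "$@"; do
        until brick_signed_in "$port"; do
            sleep 1
        done
    done
    echo "volume start: success"
}
```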
>>>>
>>> Logically it should solve the problem. We need to think about it more
>>> from the existing design perspective.
>>>
>> Rumen,
>> Feel free to log a bug. This should be fixed in a later release. We
>> can raise the bug and work on it as well if you prefer it that way.
>>
>> Pranith
>>
>>
>>> ~Atin
>>>
>>>> Pranith
>>>>
>>>>>
>>>>> Pranith
>>>>>> On 03/31/2015 04:41 AM, Rumen Telbizov wrote:
>>>>>>
>>>>>>> Hello everyone,
>>>>>>>
>>>>>>> I have a problem that I am trying to resolve and I am not sure which
>>>>>>> way to go, so I am asking for your advice.
>>>>>>>
>>>>>>> What it comes down to is that upon initial boot of all my GlusterFS
>>>>>>> machines, the shared volume doesn't get mounted. Nevertheless, the
>>>>>>> volume is successfully created and started, and further attempts to
>>>>>>> mount it manually succeed. I suspect what's happening is that the
>>>>>>> gluster processes/bricks/etc. haven't fully started at the time the
>>>>>>> /etc/fstab entry is read and the initial mount attempt is made.
>>>>>>> Again, by the time I log in and run mount -a, the volume mounts
>>>>>>> without any issues.
>>>>>>>
>>>>>>> _Details from the logs:_
>>>>>>>
>>>>>>> [2015-03-30 22:29:04.381918] I [MSGID: 100030]
>>>>>>> [glusterfsd.c:2018:main] 0-/usr/sbin/glusterfs: Started running
>>>>>>> /usr/sbin/glusterfs version 3.6.2 (args: /usr/sbin/glusterfs
>>>>>>> --log-file=/var/log/glusterfs/glusterfs.log --attribute-timeout=0
>>>>>>> --entry-timeout=0 --volfile-server=localhost
>>>>>>> --volfile-server=10.12.130.21 --volfile-server=10.12.130.22
>>>>>>> --volfile-server=10.12.130.23 --volfile-id=/myvolume /opt/shared)
>>>>>>> [2015-03-30 22:29:04.394913] E [socket.c:2267:socket_connect_finish]
>>>>>>> 0-glusterfs: connection to 127.0.0.1:24007 failed (Connection refused)
>>>>>>> [2015-03-30 22:29:04.394950] E
>>>>>>> [glusterfsd-mgmt.c:1811:mgmt_rpc_notify] 0-glusterfsd-mgmt: failed to
>>>>>>> connect with remote-host: localhost (Transport endpoint is not
>>>>>>> connected)
>>>>>>> [2015-03-30 22:29:04.394964] I
>>>>>>> [glusterfsd-mgmt.c:1838:mgmt_rpc_notify] 0-glusterfsd-mgmt:
>>>>>>> connecting
>>>>>>> to next volfile server 10.12.130.21
>>>>>>> [2015-03-30 22:29:08.390687] E
>>>>>>> [glusterfsd-mgmt.c:1811:mgmt_rpc_notify] 0-glusterfsd-mgmt: failed to
>>>>>>> connect with remote-host: 10.12.130.21 (Transport endpoint is not
>>>>>>> connected)
>>>>>>> [2015-03-30 22:29:08.390720] I
>>>>>>> [glusterfsd-mgmt.c:1838:mgmt_rpc_notify] 0-glusterfsd-mgmt:
>>>>>>> connecting
>>>>>>> to next volfile server 10.12.130.22
>>>>>>> [2015-03-30 22:29:11.392015] E
>>>>>>> [glusterfsd-mgmt.c:1811:mgmt_rpc_notify] 0-glusterfsd-mgmt: failed to
>>>>>>> connect with remote-host: 10.12.130.22 (Transport endpoint is not
>>>>>>> connected)
>>>>>>> [2015-03-30 22:29:11.392050] I
>>>>>>> [glusterfsd-mgmt.c:1838:mgmt_rpc_notify] 0-glusterfsd-mgmt:
>>>>>>> connecting
>>>>>>> to next volfile server 10.12.130.23
>>>>>>> [2015-03-30 22:29:14.406429] I [dht-shared.c:337:dht_init_regex]
>>>>>>> 0-brain-dht: using regex rsync-hash-regex = ^\.(.+)\.[^.]+$
>>>>>>> [2015-03-30 22:29:14.408964] I
>>>>>>> [rpc-clnt.c:969:rpc_clnt_connection_init] 0-host-client-2: setting
>>>>>>> frame-timeout to 60
>>>>>>> [2015-03-30 22:29:14.409183] I
>>>>>>> [rpc-clnt.c:969:rpc_clnt_connection_init] 0-host-client-1: setting
>>>>>>> frame-timeout to 60
>>>>>>> [2015-03-30 22:29:14.409388] I
>>>>>>> [rpc-clnt.c:969:rpc_clnt_connection_init] 0-host-client-0: setting
>>>>>>> frame-timeout to 60
>>>>>>> [2015-03-30 22:29:14.409430] I [client.c:2280:notify]
>>>>>>> 0-host-client-0:
>>>>>>> parent translators are ready, attempting connect on transport
>>>>>>> [2015-03-30 22:29:14.409658] I [client.c:2280:notify]
>>>>>>> 0-host-client-1:
>>>>>>> parent translators are ready, attempting connect on transport
>>>>>>> [2015-03-30 22:29:14.409844] I [client.c:2280:notify]
>>>>>>> 0-host-client-2:
>>>>>>> parent translators are ready, attempting connect on transport
>>>>>>> Final graph:
>>>>>>>
>>>>>>> ....
>>>>>>>
>>>>>>> [2015-03-30 22:29:14.411045] I [client.c:2215:client_rpc_notify]
>>>>>>> 0-host-client-2: disconnected from host-client-2. Client process will
>>>>>>> keep trying to connect to glusterd until brick's port is available
>>>>>>> [2015-03-30 22:29:14.411063] E [MSGID: 108006]
>>>>>>> [afr-common.c:3591:afr_notify] 0-myvolume-replicate-0: All subvolumes
>>>>>>> are down. Going offline until atleast one of them comes back up.
>>>>>>> [2015-03-30 22:29:14.414871] I [fuse-bridge.c:5080:fuse_graph_setup]
>>>>>>> 0-fuse: switched to graph 0
>>>>>>> [2015-03-30 22:29:14.415003] I [fuse-bridge.c:4009:fuse_init]
>>>>>>> 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.22
>>>>>>> kernel 7.17
>>>>>>> [2015-03-30 22:29:14.415101] I [afr-common.c:3722:afr_local_init]
>>>>>>> 0-myvolume-replicate-0: no subvolumes up
>>>>>>> [2015-03-30 22:29:14.415215] I [afr-common.c:3722:afr_local_init]
>>>>>>> 0-myvolume-replicate-0: no subvolumes up
>>>>>>> [2015-03-30 22:29:14.415236] W [fuse-bridge.c:779:fuse_attr_cbk]
>>>>>>> 0-glusterfs-fuse: 2: LOOKUP() / => -1 (Transport endpoint is not
>>>>>>> connected)
>>>>>>> [2015-03-30 22:29:14.419007] I [fuse-bridge.c:4921:fuse_thread_proc]
>>>>>>> 0-fuse: unmounting /opt/shared
>>>>>>> [2015-03-30 22:29:14.420176] W [glusterfsd.c:1194:cleanup_and_exit]
>>>>>>> (--> 0-: received signum (15), shutting down
>>>>>>> [2015-03-30 22:29:14.420192] I [fuse-bridge.c:5599:fini] 0-fuse:
>>>>>>> Unmounting '/opt/shared'.
>>>>>>>
>>>>>>>
>>>>>>> _Relevant /etc/fstab entries are:_
>>>>>>>
>>>>>>> /dev/xvdb /opt/local xfs defaults,noatime,nodiratime 0 0
>>>>>>>
>>>>>>> localhost:/myvolume /opt/shared glusterfs
>>>>>>> defaults,_netdev,attribute-timeout=0,entry-timeout=0,log-file=/var/log/glusterfs/glusterfs.log,backup-volfile-servers=10.12.130.21:10.12.130.22:10.12.130.23
>>>>>>> 0 0
>>>>>>>
>>>>>>>
>>>>>>> _Volume configuration is:_
>>>>>>>
>>>>>>> Volume Name: myvolume
>>>>>>> Type: Replicate
>>>>>>> Volume ID: xxxx
>>>>>>> Status: Started
>>>>>>> Number of Bricks: 1 x 3 = 3
>>>>>>> Transport-type: tcp
>>>>>>> Bricks:
>>>>>>> Brick1: host1:/opt/local/brick
>>>>>>> Brick2: host2:/opt/local/brick
>>>>>>> Brick3: host3:/opt/local/brick
>>>>>>> Options Reconfigured:
>>>>>>> storage.health-check-interval: 5
>>>>>>> network.ping-timeout: 5
>>>>>>> nfs.disable: on
>>>>>>> auth.allow: 10.12.130.21,10.12.130.22,10.12.130.23
>>>>>>> cluster.quorum-type: auto
>>>>>>> network.frame-timeout: 60
>>>>>>>
>>>>>>>
>>>>>>> I run Debian 7 with GlusterFS version 3.6.2-2.
>>>>>>>
>>>>>>> While I could put together some rc.local-type script that retries
>>>>>>> mounting the volume until it succeeds or times out, I was wondering
>>>>>>> whether there's a better way to solve this problem.
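>>>>>>>
>>>>>>> For reference, the kind of retry wrapper I had in mind is roughly the
>>>>>>> following (a sketch only; the mount point, attempt limit, and retry
>>>>>>> interval are illustrative values):

```shell
# Sketch of an rc.local-style retry wrapper (illustrative values throughout):
# re-issue mount(8) until the mount point is live or attempts run out.
retry_mount() {
    mp=$1; max_tries=$2; interval=$3
    try=0
    while ! mountpoint -q "$mp"; do
        try=$((try + 1))
        # stop once the attempt budget is spent
        if [ "$try" -gt "$max_tries" ]; then
            return 1
        fi
        mount "$mp" 2>/dev/null || true
        sleep "$interval"
    done
    return 0
}
```

>>>>>>> called from rc.local as, say, "retry_mount /opt/shared 12 5".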
>>>>>>>
>>>>>>> Thank you for your help.
>>>>>>>
>>>>>>> Regards,
>>>>>>> --
>>>>>>> Rumen Telbizov
>>>>>>> Unix Systems Administrator <http://telbizov.com>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Gluster-users mailing list
>>>>>>> Gluster-users at gluster.org
>>>>>>> http://www.gluster.org/mailman/listinfo/gluster-users
>>>>>>>
>>>>>>
>>>>
>>>>
>>
>
>
> --
> Rumen Telbizov
> Unix Systems Administrator <http://telbizov.com>
>
>
>
--
Rumen Telbizov
Unix Systems Administrator <http://telbizov.com>