[Gluster-users] Gluster and NFS-Ganesha - cluster is down after reboot
Soumya Koduri
skoduri at redhat.com
Mon May 15 10:56:01 UTC 2017
On 05/12/2017 06:27 PM, Adam Ru wrote:
> Hi Soumya,
>
> Thank you very much for your last response – very useful.
>
> I apologize for the delay; I had to find time for another round of testing.
>
> I updated the instructions that I provided in the previous e-mail. ***
> means that the step was added.
>
> Instructions:
> - Clean installation of CentOS 7.3 with all updates, 3x node,
> resolvable IPs and VIPs
> - Stopped firewalld (just for testing)
> - *** SELinux in permissive mode (I had to; I will explain below)
> - Install "centos-release-gluster" to get "centos-gluster310" repo
> and install following (nothing else):
> --- glusterfs-server
> --- glusterfs-ganesha
> - Passwordless SSH between all nodes
> (/var/lib/glusterd/nfs/secret.pem and secret.pem.pub on all nodes)
> - systemctl enable and start glusterd
> - gluster peer probe <other nodes>
> - gluster volume set all cluster.enable-shared-storage enable
> - systemctl enable and start pcsd.service
> - systemctl enable pacemaker.service (cannot be started at this moment)
> - Set password for hacluster user on all nodes
> - pcs cluster auth <node 1> <node 2> <node 3> -u hacluster -p blabla
> - mkdir /var/run/gluster/shared_storage/nfs-ganesha/
> - touch /var/run/gluster/shared_storage/nfs-ganesha/ganesha.conf (not
> sure if needed)
> - vi /var/run/gluster/shared_storage/nfs-ganesha/ganesha-ha.conf and
> insert configuration
> - Try listing files on the other nodes: ls
> /var/run/gluster/shared_storage/nfs-ganesha/
> - gluster nfs-ganesha enable
> - *** systemctl enable pacemaker.service (again, since pacemaker was
> disabled at this point)
> - *** Check owner of "state", "statd", "sm" and "sm.bak" in
> /var/lib/nfs/ (I had to: chown rpcuser:rpcuser
> /var/lib/nfs/statd/state)
> - Check on other nodes that nfs-ganesha.service is running and "pcs
> status" shows started resources
> - gluster volume create mynewshare replica 3 transport tcp
> node1:/<dir> node2:/<dir> node3:/<dir>
> - gluster volume start mynewshare
> - gluster vol set mynewshare ganesha.enable on
>
> At this moment, this is the status of the important (I think) services:
>
> -- corosync.service disabled
> -- corosync-notifyd.service disabled
> -- glusterd.service enabled
> -- glusterfsd.service disabled
> -- pacemaker.service enabled
> -- pcsd.service enabled
> -- nfs-ganesha.service disabled
> -- nfs-ganesha-config.service static
> -- nfs-ganesha-lock.service static
>
> -- corosync.service active (running)
> -- corosync-notifyd.service inactive (dead)
> -- glusterd.service active (running)
> -- glusterfsd.service inactive (dead)
> -- pacemaker.service active (running)
> -- pcsd.service active (running)
> -- nfs-ganesha.service active (running)
> -- nfs-ganesha-config.service inactive (dead)
> -- nfs-ganesha-lock.service active (running)
>
> May I ask you a few questions please?
>
> 1. Could you please confirm that the services above have the correct status/state?
Looks good to the best of my knowledge.
>
> 2. When I restart a node then nfs-ganesha is not running. Of course I
> cannot enable it, since it needs to start only after the shared
> storage is mounted. What is the best practice to start it
> automatically so I don’t have to worry about restarting a node? Should
> I create a script that checks whether the shared storage is mounted
> and then starts nfs-ganesha? How do you do this in production?
That's right. We have plans to address this in the near future (probably
by adding a new .service unit which mounts the shared storage before
starting nfs-ganesha). But until then, yes, a custom script is the only
way to automate it.
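
For example, a minimal wait-and-start script could look like this (a
sketch, assuming the shared-storage mount point used earlier in this
thread; it could be run from rc.local or a simple systemd unit):

#!/bin/bash
# Wait until the Gluster shared storage is mounted, then start
# nfs-ganesha. 'mountpoint -q' exits 0 once the path is a mount point.
MOUNT=/var/run/gluster/shared_storage
until mountpoint -q "$MOUNT"; do
    sleep 5
done
systemctl start nfs-ganesha.service
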
>
> 3. SELinux is an issue – is that a known bug?
>
> When I restart a node and start nfs-ganesha.service with SELinux in
> permissive mode:
>
> sudo grep 'statd' /var/log/messages
> May 12 12:05:46 mynode1 rpc.statd[2415]: Version 1.3.0 starting
> May 12 12:05:46 mynode1 rpc.statd[2415]: Flags: TI-RPC
> May 12 12:05:46 mynode1 rpc.statd[2415]: Failed to read
> /var/lib/nfs/statd/state: Success
> May 12 12:05:46 mynode1 rpc.statd[2415]: Initializing NSM state
> May 12 12:05:52 mynode1 rpc.statd[2415]: Received SM_UNMON_ALL request
> from mynode1.localdomain while not monitoring any hosts
>
> systemctl status nfs-ganesha-lock.service --full
> ● nfs-ganesha-lock.service - NFS status monitor for NFSv2/3 locking.
> Loaded: loaded (/usr/lib/systemd/system/nfs-ganesha-lock.service;
> static; vendor preset: disabled)
> Active: active (running) since Fri 2017-05-12 12:05:46 UTC; 1min 43s ago
> Process: 2414 ExecStart=/usr/sbin/rpc.statd --no-notify $STATDARGS
> (code=exited, status=0/SUCCESS)
> Main PID: 2415 (rpc.statd)
> CGroup: /system.slice/nfs-ganesha-lock.service
> └─2415 /usr/sbin/rpc.statd --no-notify
>
> May 12 12:05:46 mynode1.localdomain systemd[1]: Starting NFS status
> monitor for NFSv2/3 locking....
> May 12 12:05:46 mynode1.localdomain rpc.statd[2415]: Version 1.3.0 starting
> May 12 12:05:46 mynode1.localdomain rpc.statd[2415]: Flags: TI-RPC
> May 12 12:05:46 mynode1.localdomain rpc.statd[2415]: Failed to read
> /var/lib/nfs/statd/state: Success
> May 12 12:05:46 mynode1.localdomain rpc.statd[2415]: Initializing NSM state
> May 12 12:05:46 mynode1.localdomain systemd[1]: Started NFS status
> monitor for NFSv2/3 locking..
> May 12 12:05:52 mynode1.localdomain rpc.statd[2415]: Received
> SM_UNMON_ALL request from mynode1.localdomain while not monitoring any
> hosts
>
>
> When I restart a node and start nfs-ganesha.service with SELinux in
> enforcing mode:
>
>
> sudo grep 'statd' /var/log/messages
> May 12 12:14:01 mynode1 rpc.statd[1743]: Version 1.3.0 starting
> May 12 12:14:01 mynode1 rpc.statd[1743]: Flags: TI-RPC
> May 12 12:14:01 mynode1 rpc.statd[1743]: Failed to open directory sm:
> Permission denied
> May 12 12:14:01 mynode1 rpc.statd[1743]: Failed to open
> /var/lib/nfs/statd/state: Permission denied
>
> systemctl status nfs-ganesha-lock.service --full
> ● nfs-ganesha-lock.service - NFS status monitor for NFSv2/3 locking.
> Loaded: loaded (/usr/lib/systemd/system/nfs-ganesha-lock.service;
> static; vendor preset: disabled)
> Active: failed (Result: exit-code) since Fri 2017-05-12 12:14:01
> UTC; 1min 21s ago
> Process: 1742 ExecStart=/usr/sbin/rpc.statd --no-notify $STATDARGS
> (code=exited, status=1/FAILURE)
>
> May 12 12:14:01 mynode1.localdomain systemd[1]: Starting NFS status
> monitor for NFSv2/3 locking....
> May 12 12:14:01 mynode1.localdomain rpc.statd[1743]: Version 1.3.0 starting
> May 12 12:14:01 mynode1.localdomain rpc.statd[1743]: Flags: TI-RPC
> May 12 12:14:01 mynode1.localdomain rpc.statd[1743]: Failed to open
> directory sm: Permission denied
> May 12 12:14:01 mynode1.localdomain systemd[1]:
> nfs-ganesha-lock.service: control process exited, code=exited status=1
> May 12 12:14:01 mynode1.localdomain systemd[1]: Failed to start NFS
> status monitor for NFSv2/3 locking..
> May 12 12:14:01 mynode1.localdomain systemd[1]: Unit
> nfs-ganesha-lock.service entered failed state.
> May 12 12:14:01 mynode1.localdomain systemd[1]: nfs-ganesha-lock.service failed.
I can't remember right now. Could you please paste the AVCs you get and
the SELinux package versions? Or, preferably, please file a bug. We can
get the details verified by the SELinux folks.
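
For example, the following commands should gather that information (a
sketch; ausearch is part of the audit package):

# recent AVC denials from the audit log
ausearch -m avc -ts recent
# SELinux policy package versions
rpm -q selinux-policy selinux-policy-targeted libselinux
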
Thanks,
Soumya
>
> On Fri, May 5, 2017 at 8:10 PM, Soumya Koduri <skoduri at redhat.com> wrote:
>>
>>
>> On 05/05/2017 08:04 PM, Adam Ru wrote:
>>>
>>> Hi Soumya,
>>>
>>> Thank you for the answer.
>>>
>>> Enabling Pacemaker? Yes, you’re completely right, I didn’t do it. Thank
>>> you.
>>>
>>> I spent some time testing and I have some results. This is what I did:
>>>
>>> - Clean installation of CentOS 7.3 with all updates, 3x node,
>>> resolvable IPs and VIPs
>>> - Stopped firewalld (just for testing)
>>> - Install "centos-release-gluster" to get "centos-gluster310" repo and
>>> install following (nothing else):
>>> --- glusterfs-server
>>> --- glusterfs-ganesha
>>> - Passwordless SSH between all nodes (/var/lib/glusterd/nfs/secret.pem
>>> and secret.pem.pub on all nodes)
>>> - systemctl enable and start glusterd
>>> - gluster peer probe <other nodes>
>>> - gluster volume set all cluster.enable-shared-storage enable
>>> - systemctl enable and start pcsd.service
>>> - systemctl enable pacemaker.service (cannot be started at this moment)
>>> - Set password for hacluster user on all nodes
>>> - pcs cluster auth <node 1> <node 2> <node 3> -u hacluster -p blabla
>>> - mkdir /var/run/gluster/shared_storage/nfs-ganesha/
>>> - touch /var/run/gluster/shared_storage/nfs-ganesha/ganesha.conf (not
>>> sure if needed)
>>> - vi /var/run/gluster/shared_storage/nfs-ganesha/ganesha-ha.conf and
>>> insert configuration
>>> - Try listing files on the other nodes: ls
>>> /var/run/gluster/shared_storage/nfs-ganesha/
>>> - gluster nfs-ganesha enable
>>> - Check on other nodes that nfs-ganesha.service is running and "pcs
>>> status" shows started resources
>>> - gluster volume create mynewshare replica 3 transport tcp node1:/<dir>
>>> node2:/<dir> node3:/<dir>
>>> - gluster volume start mynewshare
>>> - gluster vol set mynewshare ganesha.enable on
>>>
>>> After these steps, all VIPs are pingable and I can mount node1:/mynewshare
>>>
>>> The funny thing is that pacemaker.service is disabled again (something
>>> disabled it). This is the status of the important (I think) services:
>>
>>
>> Yeah, we observed this recently too. We suspect that the 'pcs cluster
>> setup' command first destroys the existing cluster (if any), which may
>> disable pacemaker as well.
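>>
>> For example, a quick guard to re-check this on all nodes after running
>> 'gluster nfs-ganesha enable' (a sketch):
>>
>> systemctl is-enabled pacemaker || systemctl enable pacemaker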
>>
>>>
>>> systemctl list-units --all
>>> # corosync.service loaded active running
>>> # glusterd.service loaded active running
>>> # nfs-config.service loaded inactive dead
>>> # nfs-ganesha-config.service loaded inactive dead
>>> # nfs-ganesha-lock.service loaded active running
>>> # nfs-ganesha.service loaded active running
>>> # nfs-idmapd.service loaded inactive dead
>>> # nfs-mountd.service loaded inactive dead
>>> # nfs-server.service loaded inactive dead
>>> # nfs-utils.service loaded inactive dead
>>> # pacemaker.service loaded active running
>>> # pcsd.service loaded active running
>>>
>>> systemctl list-unit-files --all
>>> # corosync-notifyd.service disabled
>>> # corosync.service disabled
>>> # glusterd.service enabled
>>> # glusterfsd.service disabled
>>> # nfs-blkmap.service disabled
>>> # nfs-config.service static
>>> # nfs-ganesha-config.service static
>>> # nfs-ganesha-lock.service static
>>> # nfs-ganesha.service disabled
>>> # nfs-idmap.service static
>>> # nfs-idmapd.service static
>>> # nfs-lock.service static
>>> # nfs-mountd.service static
>>> # nfs-rquotad.service disabled
>>> # nfs-secure-server.service static
>>> # nfs-secure.service static
>>> # nfs-server.service disabled
>>> # nfs-utils.service static
>>> # nfs.service disabled
>>> # nfslock.service static
>>> # pacemaker.service disabled
>>> # pcsd.service enabled
>>>
>>> I enabled pacemaker again on all nodes and restarted all nodes one by one.
>>>
>>> After a reboot all VIPs are gone and I can see that nfs-ganesha.service
>>> isn’t running. When I start it on at least two nodes, the VIPs are
>>> pingable again and I can mount NFS again. But there is still some issue
>>> in the setup, because when I check nfs-ganesha-lock.service I get:
>>>
>>> systemctl -l status nfs-ganesha-lock.service
>>> ● nfs-ganesha-lock.service - NFS status monitor for NFSv2/3 locking.
>>> Loaded: loaded (/usr/lib/systemd/system/nfs-ganesha-lock.service;
>>> static; vendor preset: disabled)
>>> Active: failed (Result: exit-code) since Fri 2017-05-05 13:43:37 UTC;
>>> 31min ago
>>> Process: 6203 ExecStart=/usr/sbin/rpc.statd --no-notify $STATDARGS
>>> (code=exited, status=1/FAILURE)
>>>
>>> May 05 13:43:37 node0.localdomain systemd[1]: Starting NFS status
>>> monitor for NFSv2/3 locking....
>>> May 05 13:43:37 node0.localdomain rpc.statd[6205]: Version 1.3.0 starting
>>> May 05 13:43:37 node0.localdomain rpc.statd[6205]: Flags: TI-RPC
>>> May 05 13:43:37 node0.localdomain rpc.statd[6205]: Failed to open
>>> directory sm: Permission denied
>>
>>
>> Okay, this issue was fixed and the fix should be present in 3.10 too:
>> https://review.gluster.org/#/c/16433/
>>
>> Please check '/var/log/messages' for statd-related errors and cross-check
>> the permissions of that directory. For now, you could manually chown the
>> /var/lib/nfs/statd/sm directory and then restart the nfs-ganesha*
>> services.
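>>
>> For example (a sketch; rpcuser:rpcuser is the owner rpc.statd typically
>> expects on CentOS 7; verify with 'ls -ld /var/lib/nfs/statd'):
>>
>> chown -R rpcuser:rpcuser /var/lib/nfs/statd
>> systemctl restart nfs-ganesha-lock.service nfs-ganesha.service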
>>
>> Thanks,
>> Soumya
>>
>>> May 05 13:43:37 node0.localdomain rpc.statd[6205]: Failed to open
>>> /var/lib/nfs/statd/state: Permission denied
>>> May 05 13:43:37 node0.localdomain systemd[1]: nfs-ganesha-lock.service:
>>> control process exited, code=exited status=1
>>> May 05 13:43:37 node0.localdomain systemd[1]: Failed to start NFS status
>>> monitor for NFSv2/3 locking..
>>> May 05 13:43:37 node0.localdomain systemd[1]: Unit
>>> nfs-ganesha-lock.service entered failed state.
>>> May 05 13:43:37 node0.localdomain systemd[1]: nfs-ganesha-lock.service
>>> failed.
>>>
>>> Thank you,
>>>
>>> Kind regards,
>>>
>>> Adam
>>>
>>> On Wed, May 3, 2017 at 10:32 AM, Mahdi Adnan
>>> <mahdi.adnan at outlook.com> wrote:
>>>
>>> Hi,
>>>
>>>
>>> Same here: when I reboot the node I have to manually execute "pcs
>>> cluster start gluster01", even though pcsd is already enabled and
>>> started.
>>>
>>> Gluster 3.8.11
>>>
>>> Centos 7.3 latest
>>>
>>> Installed using CentOS Storage SIG repository
>>>
>>>
>>>
>>> --
>>> Respectfully,
>>> Mahdi A. Mahdi
>>>
>>>
>>> ------------------------------------------------------------------------
>>> *From:* gluster-users-bounces at gluster.org
>>> <gluster-users-bounces at gluster.org> on behalf of Adam Ru
>>> <ad.ruckel at gmail.com>
>>> *Sent:* Wednesday, May 3, 2017 12:09:58 PM
>>> *To:* Soumya Koduri
>>> *Cc:* gluster-users at gluster.org
>>> *Subject:* Re: [Gluster-users] Gluster and NFS-Ganesha - cluster is down after reboot
>>>
>>> Hi Soumya,
>>>
>>> thank you very much for your reply.
>>>
>>> I enabled pcsd during setup. After the reboot, while troubleshooting,
>>> I manually started it and checked the resources (pcs status); they
>>> were not running. I didn’t find what was wrong, but I’m going to try
>>> it again.
>>>
>>> I’ve thoroughly checked
>>>
>>> http://gluster.readthedocs.io/en/latest/Administrator%20Guide/NFS-Ganesha%20GlusterFS%20Integration/
>>>
>>> and I can confirm that I followed all steps with one exception. I
>>> installed the following RPMs:
>>> glusterfs-server
>>> glusterfs-fuse
>>> glusterfs-cli
>>> glusterfs-ganesha
>>> nfs-ganesha-xfs
>>>
>>> and the guide referenced above specifies:
>>> glusterfs-server
>>> glusterfs-api
>>> glusterfs-ganesha
>>>
>>> glusterfs-api is a dependency of one of the RPMs that I installed, so
>>> this is not a problem. But I cannot find any mention of installing
>>> nfs-ganesha-xfs.
>>>
>>> I’ll try to set up the whole environment again without installing
>>> nfs-ganesha-xfs (I assume glusterfs-ganesha has all required
>>> binaries).
>>>
>>> Again, thank you for your time answering my previous message.
>>>
>>> Kind regards,
>>> Adam
>>>
>>> On Tue, May 2, 2017 at 8:49 AM, Soumya Koduri
>>> <skoduri at redhat.com> wrote:
>>>
>>> Hi,
>>>
>>> On 05/02/2017 01:34 AM, Rudolf wrote:
>>>
>>> Hi Gluster users,
>>>
>>> First, I'd like to thank you all for this amazing open-source
>>> project! Thank you!
>>>
>>> I'm working on a home project – three servers with Gluster and
>>> NFS-Ganesha. My goal is to create an HA NFS share with three copies
>>> of each file, one on each server.
>>>
>>> My systems are CentOS 7.3 Minimal installs with the latest updates
>>> and the most current RPMs from the "centos-gluster310" repository.
>>>
>>> I followed this tutorial:
>>>
>>> http://blog.gluster.org/2015/10/linux-scale-out-nfsv4-using-nfs-ganesha-and-glusterfs-one-step-at-a-time/
>>> (the second half, which describes the multi-node HA setup)
>>>
>>> with a few exceptions:
>>>
>>> 1. All RPMs are from the "centos-gluster310" repo, which is
>>> installed by "yum -y install centos-release-gluster"
>>> 2. I have three nodes (not four) with a "replica 3" volume.
>>> 3. I created an empty ganesha.conf and a non-empty ganesha-ha.conf in
>>> "/var/run/gluster/shared_storage/nfs-ganesha/" (the referenced blog
>>> post is outdated; this is now a requirement)
>>> 4. ganesha-ha.conf doesn't have "HA_VOL_SERVER" since this isn't
>>> needed anymore.
>>>
>>>
>>> Please refer to
>>>
>>> http://gluster.readthedocs.io/en/latest/Administrator%20Guide/NFS-Ganesha%20GlusterFS%20Integration/
>>>
>>> It is being updated with the latest changes to the setup procedure.
>>>
>>> When I finish the configuration, all is good: nfs-ganesha.service is
>>> active and running, and from the client I can ping all three VIPs and
>>> mount NFS. Copied files are replicated to all nodes.
>>>
>>> But when I restart the nodes (one by one, with a 5 min. delay between
>>> them) I cannot ping or mount (I assume that all VIPs are down). So my
>>> setup definitely isn't HA.
>>>
>>> I found that:
>>> # pcs status
>>> Error: cluster is not currently running on this node
>>>
>>>
>>> This means the pcsd service is not up. Did you enable it (systemctl
>>> enable pcsd) so that it comes up automatically after reboot? If not,
>>> please start it manually.
>>>
>>>
>>> and nfs-ganesha.service is in an inactive state. Btw, I didn't run
>>> "systemctl enable nfs-ganesha" since I assumed that this is something
>>> that Gluster does.
>>>
>>>
>>> Please check /var/log/ganesha.log for any errors/warnings.
>>>
>>> We recommend not enabling nfs-ganesha.service by default, as the
>>> shared storage (where the ganesha.conf file now resides) should be up
>>> and running before nfs-ganesha gets started. If it were enabled by
>>> default, the shared_storage mount point might not be up yet, which
>>> would result in nfs-ganesha service failure. If you would like to
>>> address this, you could have a cron job which keeps checking the
>>> mount point health and then starts the nfs-ganesha service, as
>>> sketched below.
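>>>
>>> For example, a cron entry along these lines could do it (a sketch,
>>> assuming the default shared-storage mount point; 'systemctl start'
>>> is a no-op if the service is already running):
>>>
>>> # /etc/cron.d/start-nfs-ganesha -- check every minute as root
>>> * * * * * root mountpoint -q /var/run/gluster/shared_storage && systemctl start nfs-ganesha.service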
>>>
>>> Thanks,
>>> Soumya
>>>
>>>
>>> I assume that my issue is that I followed the instructions in a blog
>>> post from 2015/10 that are outdated. Unfortunately I cannot find
>>> anything better – I spent a whole day googling.
>>>
>>> Would you be so kind as to check the instructions in the blog post
>>> and let me know which steps are wrong or outdated? Or do you have
>>> more current instructions for a Gluster+Ganesha setup?
>>>
>>> Thank you.
>>>
>>> Kind regards,
>>> Adam
>>>
>>>
>>>
>>> _______________________________________________
>>> Gluster-users mailing list
>>> Gluster-users at gluster.org
>>> http://lists.gluster.org/mailman/listinfo/gluster-users
>>>
>>> --
>>> Adam