[Gluster-users] Gluster and NFS-Ganesha - cluster is down after reboot
Adam Ru
ad.ruckel at gmail.com
Fri May 12 12:57:18 UTC 2017
Hi Soumya,
Thank you very much for your last response – it was very useful.
I apologize for the delay; I had to find time for another round of testing.
I updated the instructions that I provided in my previous e-mail. ***
marks a step that was added.
Instructions:
- Clean installation of CentOS 7.3 with all updates, 3x node,
resolvable IPs and VIPs
- Stopped firewalld (just for testing)
- *** SELinux in permissive mode (I had to; I will explain below)
- Install "centos-release-gluster" to get "centos-gluster310" repo
and install following (nothing else):
--- glusterfs-server
--- glusterfs-ganesha
- Passwordless SSH between all nodes
(/var/lib/glusterd/nfs/secret.pem and secret.pem.pub on all nodes)
- systemctl enable and start glusterd
- gluster peer probe <other nodes>
- gluster volume set all cluster.enable-shared-storage enable
- systemctl enable and start pcsd.service
- systemctl enable pacemaker.service (cannot be started at this moment)
- Set password for hacluster user on all nodes
- pcs cluster auth <node 1> <node 2> <node 3> -u hacluster -p blabla
- mkdir /var/run/gluster/shared_storage/nfs-ganesha/
- touch /var/run/gluster/shared_storage/nfs-ganesha/ganesha.conf (not
sure if needed)
- vi /var/run/gluster/shared_storage/nfs-ganesha/ganesha-ha.conf and
insert the configuration (a sample is shown right after this list)
- Try to list the files on the other nodes: ls
/var/run/gluster/shared_storage/nfs-ganesha/
- gluster nfs-ganesha enable
- *** systemctl enable pacemaker.service (again, since pacemaker was
disabled at this point)
- *** Check owner of "state", "statd", "sm" and "sm.bak" in
/var/lib/nfs/ (I had to: chown rpcuser:rpcuser
/var/lib/nfs/statd/state)
- Check on other nodes that nfs-ganesha.service is running and "pcs
status" shows started resources
- gluster volume create mynewshare replica 3 transport tcp
node1:/<dir> node2:/<dir> node3:/<dir>
- gluster volume start mynewshare
- gluster vol set mynewshare ganesha.enable on
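
For reference, the configuration I insert into ganesha-ha.conf in the step
above is roughly the following – a minimal sketch with placeholder node
names and VIPs (HA_VOL_SERVER is not needed anymore):

HA_NAME="my-ganesha-ha"
HA_CLUSTER_NODES="node1,node2,node3"
VIP_node1="192.168.1.101"
VIP_node2="192.168.1.102"
VIP_node3="192.168.1.103"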
At this moment, this is the status of the important (I think) services:

Unit file state (systemctl list-unit-files):
-- corosync.service disabled
-- corosync-notifyd.service disabled
-- glusterd.service enabled
-- glusterfsd.service disabled
-- pacemaker.service enabled
-- pcsd.service enabled
-- nfs-ganesha.service disabled
-- nfs-ganesha-config.service static
-- nfs-ganesha-lock.service static

Runtime state (systemctl list-units):
-- corosync.service active (running)
-- corosync-notifyd.service inactive (dead)
-- glusterd.service active (running)
-- glusterfsd.service inactive (dead)
-- pacemaker.service active (running)
-- pcsd.service active (running)
-- nfs-ganesha.service active (running)
-- nfs-ganesha-config.service inactive (dead)
-- nfs-ganesha-lock.service active (running)
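
(For reference, a quick way to pull both states per unit is a small shell
loop along these lines – just a sketch, using the unit names from the
lists above:)

for u in corosync corosync-notifyd glusterd glusterfsd pacemaker pcsd \
         nfs-ganesha nfs-ganesha-config nfs-ganesha-lock; do
  printf -- '-- %-32s %s / %s\n' "$u.service" \
    "$(systemctl is-enabled "$u.service" 2>/dev/null)" \
    "$(systemctl is-active "$u.service")"
done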
May I ask you a few questions, please?
1. Could you please confirm that the services above have the correct status/state?
2. When I restart a node, nfs-ganesha is not running afterwards. Of course
I cannot simply enable it, since it must be started only after the shared
storage is mounted. What is the best practice for starting it
automatically, so I don’t have to worry about restarting a node? Should I
create a script that checks whether the shared storage is mounted and then
starts nfs-ganesha? How do you do this in production?
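
What I have in mind is something like this – a sketch of a check that cron
could run every minute (the mount point is the shared-storage path used
above; the script name is just an example):

#!/bin/bash
# start-ganesha-when-ready.sh (example name)
# Start nfs-ganesha only once the Gluster shared storage is mounted.
MOUNT=/var/run/gluster/shared_storage

if mountpoint -q "$MOUNT"; then
    systemctl is-active --quiet nfs-ganesha || systemctl start nfs-ganesha
fi

# Example crontab entry:
# * * * * * /usr/local/sbin/start-ganesha-when-ready.sh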
3. SELinux is an issue; is that a known bug?
When I restart a node and start nfs-ganesha.service with SELinux in
permissive mode:
sudo grep 'statd' /var/log/messages
May 12 12:05:46 mynode1 rpc.statd[2415]: Version 1.3.0 starting
May 12 12:05:46 mynode1 rpc.statd[2415]: Flags: TI-RPC
May 12 12:05:46 mynode1 rpc.statd[2415]: Failed to read
/var/lib/nfs/statd/state: Success
May 12 12:05:46 mynode1 rpc.statd[2415]: Initializing NSM state
May 12 12:05:52 mynode1 rpc.statd[2415]: Received SM_UNMON_ALL request
from mynode1.localdomain while not monitoring any hosts
systemctl status nfs-ganesha-lock.service --full
● nfs-ganesha-lock.service - NFS status monitor for NFSv2/3 locking.
Loaded: loaded (/usr/lib/systemd/system/nfs-ganesha-lock.service;
static; vendor preset: disabled)
Active: active (running) since Fri 2017-05-12 12:05:46 UTC; 1min 43s ago
Process: 2414 ExecStart=/usr/sbin/rpc.statd --no-notify $STATDARGS
(code=exited, status=0/SUCCESS)
Main PID: 2415 (rpc.statd)
CGroup: /system.slice/nfs-ganesha-lock.service
└─2415 /usr/sbin/rpc.statd --no-notify
May 12 12:05:46 mynode1.localdomain systemd[1]: Starting NFS status
monitor for NFSv2/3 locking....
May 12 12:05:46 mynode1.localdomain rpc.statd[2415]: Version 1.3.0 starting
May 12 12:05:46 mynode1.localdomain rpc.statd[2415]: Flags: TI-RPC
May 12 12:05:46 mynode1.localdomain rpc.statd[2415]: Failed to read
/var/lib/nfs/statd/state: Success
May 12 12:05:46 mynode1.localdomain rpc.statd[2415]: Initializing NSM state
May 12 12:05:46 mynode1.localdomain systemd[1]: Started NFS status
monitor for NFSv2/3 locking..
May 12 12:05:52 mynode1.localdomain rpc.statd[2415]: Received
SM_UNMON_ALL request from mynode1.localdomain while not monitoring any
hosts
When I restart a node and start nfs-ganesha.service with SELinux in
enforcing mode:
sudo grep 'statd' /var/log/messages
May 12 12:14:01 mynode1 rpc.statd[1743]: Version 1.3.0 starting
May 12 12:14:01 mynode1 rpc.statd[1743]: Flags: TI-RPC
May 12 12:14:01 mynode1 rpc.statd[1743]: Failed to open directory sm:
Permission denied
May 12 12:14:01 mynode1 rpc.statd[1743]: Failed to open
/var/lib/nfs/statd/state: Permission denied
systemctl status nfs-ganesha-lock.service --full
● nfs-ganesha-lock.service - NFS status monitor for NFSv2/3 locking.
Loaded: loaded (/usr/lib/systemd/system/nfs-ganesha-lock.service;
static; vendor preset: disabled)
Active: failed (Result: exit-code) since Fri 2017-05-12 12:14:01
UTC; 1min 21s ago
Process: 1742 ExecStart=/usr/sbin/rpc.statd --no-notify $STATDARGS
(code=exited, status=1/FAILURE)
May 12 12:14:01 mynode1.localdomain systemd[1]: Starting NFS status
monitor for NFSv2/3 locking....
May 12 12:14:01 mynode1.localdomain rpc.statd[1743]: Version 1.3.0 starting
May 12 12:14:01 mynode1.localdomain rpc.statd[1743]: Flags: TI-RPC
May 12 12:14:01 mynode1.localdomain rpc.statd[1743]: Failed to open
directory sm: Permission denied
May 12 12:14:01 mynode1.localdomain systemd[1]:
nfs-ganesha-lock.service: control process exited, code=exited status=1
May 12 12:14:01 mynode1.localdomain systemd[1]: Failed to start NFS
status monitor for NFSv2/3 locking..
May 12 12:14:01 mynode1.localdomain systemd[1]: Unit
nfs-ganesha-lock.service entered failed state.
May 12 12:14:01 mynode1.localdomain systemd[1]: nfs-ganesha-lock.service failed.
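
For completeness, this is how I have been inspecting the denials – just a
sketch, assuming the audit and policycoreutils tools are installed:

# show recent AVC denials involving rpc.statd
ausearch -m avc -ts recent | grep statd
# reset SELinux labels under /var/lib/nfs (in case the chown above changed them)
restorecon -Rv /var/lib/nfs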
On Fri, May 5, 2017 at 8:10 PM, Soumya Koduri <skoduri at redhat.com> wrote:
>
>
> On 05/05/2017 08:04 PM, Adam Ru wrote:
>>
>> Hi Soumya,
>>
>> Thank you for the answer.
>>
>> Enabling Pacemaker? Yes, you’re completely right, I didn’t do it. Thank
>> you.
>>
>> I spent some time testing and I have some results. This is what I did:
>>
>> - Clean installation of CentOS 7.3 with all updates, 3x node,
>> resolvable IPs and VIPs
>> - Stopped firewalld (just for testing)
>> - Install "centos-release-gluster" to get "centos-gluster310" repo and
>> install following (nothing else):
>> --- glusterfs-server
>> --- glusterfs-ganesha
>> - Passwordless SSH between all nodes (/var/lib/glusterd/nfs/secret.pem
>> and secret.pem.pub on all nodes)
>> - systemctl enable and start glusterd
>> - gluster peer probe <other nodes>
>> - gluster volume set all cluster.enable-shared-storage enable
>> - systemctl enable and start pcsd.service
>> - systemctl enable pacemaker.service (cannot be started at this moment)
>> - Set password for hacluster user on all nodes
>> - pcs cluster auth <node 1> <node 2> <node 3> -u hacluster -p blabla
>> - mkdir /var/run/gluster/shared_storage/nfs-ganesha/
>> - touch /var/run/gluster/shared_storage/nfs-ganesha/ganesha.conf (not
>> sure if needed)
>> - vi /var/run/gluster/shared_storage/nfs-ganesha/ganesha-ha.conf and
>> insert configuration
>> - Try list files on other nodes: ls
>> /var/run/gluster/shared_storage/nfs-ganesha/
>> - gluster nfs-ganesha enable
>> - Check on other nodes that nfs-ganesha.service is running and "pcs
>> status" shows started resources
>> - gluster volume create mynewshare replica 3 transport tcp node1:/<dir>
>> node2:/<dir> node3:/<dir>
>> - gluster volume start mynewshare
>> - gluster vol set mynewshare ganesha.enable on
>>
>> After these steps, all VIPs are pingable and I can mount node1:/mynewshare
>>
>> The funny thing is that pacemaker.service is disabled again (something
>> disabled it). This is the status of the important (I think) services:
>
>
> Yeah. We too observed this recently. We guess the pcs cluster setup
> command probably first destroys the existing cluster (if any), which may
> be disabling pacemaker too.
>
>>
>> systemctl list-units --all
>> # corosync.service loaded active running
>> # glusterd.service loaded active running
>> # nfs-config.service loaded inactive dead
>> # nfs-ganesha-config.service loaded inactive dead
>> # nfs-ganesha-lock.service loaded active running
>> # nfs-ganesha.service loaded active running
>> # nfs-idmapd.service loaded inactive dead
>> # nfs-mountd.service loaded inactive dead
>> # nfs-server.service loaded inactive dead
>> # nfs-utils.service loaded inactive dead
>> # pacemaker.service loaded active running
>> # pcsd.service loaded active running
>>
>> systemctl list-unit-files --all
>> # corosync-notifyd.service disabled
>> # corosync.service disabled
>> # glusterd.service enabled
>> # glusterfsd.service disabled
>> # nfs-blkmap.service disabled
>> # nfs-config.service static
>> # nfs-ganesha-config.service static
>> # nfs-ganesha-lock.service static
>> # nfs-ganesha.service disabled
>> # nfs-idmap.service static
>> # nfs-idmapd.service static
>> # nfs-lock.service static
>> # nfs-mountd.service static
>> # nfs-rquotad.service disabled
>> # nfs-secure-server.service static
>> # nfs-secure.service static
>> # nfs-server.service disabled
>> # nfs-utils.service static
>> # nfs.service disabled
>> # nfslock.service static
>> # pacemaker.service disabled
>> # pcsd.service enabled
>>
>> I enabled pacemaker again on all nodes and restarted all nodes one by one.
>>
>> After the reboot all VIPs are gone and I can see that nfs-ganesha.service
>> isn’t running. When I start it on at least two nodes, the VIPs are
>> pingable again and I can mount NFS again. But there is still some issue
>> in the setup, because when I check nfs-ganesha-lock.service I get:
>>
>> systemctl -l status nfs-ganesha-lock.service
>> ● nfs-ganesha-lock.service - NFS status monitor for NFSv2/3 locking.
>> Loaded: loaded (/usr/lib/systemd/system/nfs-ganesha-lock.service;
>> static; vendor preset: disabled)
>> Active: failed (Result: exit-code) since Fri 2017-05-05 13:43:37 UTC;
>> 31min ago
>> Process: 6203 ExecStart=/usr/sbin/rpc.statd --no-notify $STATDARGS
>> (code=exited, status=1/FAILURE)
>>
>> May 05 13:43:37 node0.localdomain systemd[1]: Starting NFS status
>> monitor for NFSv2/3 locking....
>> May 05 13:43:37 node0.localdomain rpc.statd[6205]: Version 1.3.0 starting
>> May 05 13:43:37 node0.localdomain rpc.statd[6205]: Flags: TI-RPC
>> May 05 13:43:37 node0.localdomain rpc.statd[6205]: Failed to open
>> directory sm: Permission denied
>
>
> Okay this issue was fixed and the fix should be present in 3.10 too -
> https://review.gluster.org/#/c/16433/
>
> Please check '/var/log/messages' for statd-related errors and cross-check
> the permissions of that directory. You could manually chown the owner:group
> of the /var/lib/nfs/statd/sm directory for now and then restart the
> nfs-ganesha* services.
>
> Thanks,
> Soumya
>
>> May 05 13:43:37 node0.localdomain rpc.statd[6205]: Failed to open
>> /var/lib/nfs/statd/state: Permission denied
>> May 05 13:43:37 node0.localdomain systemd[1]: nfs-ganesha-lock.service:
>> control process exited, code=exited status=1
>> May 05 13:43:37 node0.localdomain systemd[1]: Failed to start NFS status
>> monitor for NFSv2/3 locking..
>> May 05 13:43:37 node0.localdomain systemd[1]: Unit
>> nfs-ganesha-lock.service entered failed state.
>> May 05 13:43:37 node0.localdomain systemd[1]: nfs-ganesha-lock.service
>> failed.
>>
>> Thank you,
>>
>> Kind regards,
>>
>> Adam
>>
>> On Wed, May 3, 2017 at 10:32 AM, Mahdi Adnan
>> <mahdi.adnan at outlook.com> wrote:
>>
>> Hi,
>>
>>
>> Same here: when I reboot the node I have to manually execute "pcs
>> cluster start gluster01", even though pcsd is already enabled and started.
>>
>> Gluster 3.8.11
>>
>> Centos 7.3 latest
>>
>> Installed using CentOS Storage SIG repository
>>
>>
>>
>> --
>>
>> Respectfully,
>> Mahdi A. Mahdi
>>
>>
>> ------------------------------------------------------------------------
>> *From:* gluster-users-bounces at gluster.org on behalf of Adam Ru
>> <ad.ruckel at gmail.com>
>> *Sent:* Wednesday, May 3, 2017 12:09:58 PM
>> *To:* Soumya Koduri
>> *Cc:* gluster-users at gluster.org
>> *Subject:* Re: [Gluster-users] Gluster and NFS-Ganesha - cluster is
>> down after reboot
>>
>> Hi Soumya,
>>
>> thank you very much for your reply.
>>
>> I enabled pcsd during setup, and after the reboot (during
>> troubleshooting) I manually started it and checked the resources
>> (pcs status). They were not running. I didn’t find what was wrong,
>> but I’m going to try it again.
>>
>> I’ve thoroughly checked
>>
>> http://gluster.readthedocs.io/en/latest/Administrator%20Guide/NFS-Ganesha%20GlusterFS%20Integration/
>>
>> and I can confirm that I followed all the steps with one exception. I
>> installed the following RPMs:
>> glusterfs-server
>> glusterfs-fuse
>> glusterfs-cli
>> glusterfs-ganesha
>> nfs-ganesha-xfs
>>
>> and the guide referenced above specifies:
>> glusterfs-server
>> glusterfs-api
>> glusterfs-ganesha
>>
>> glusterfs-api is a dependency of one of the RPMs that I installed, so
>> this is not a problem. But I cannot find any mention of installing
>> nfs-ganesha-xfs.
>>
>> I’ll try to set up the whole environment again without installing
>> nfs-ganesha-xfs (I assume glusterfs-ganesha has all required
>> binaries).
>>
>> Again, thank you for your time answering my previous message.
>>
>> Kind regards,
>> Adam
>>
>> On Tue, May 2, 2017 at 8:49 AM, Soumya Koduri
>> <skoduri at redhat.com> wrote:
>>
>> Hi,
>>
>> On 05/02/2017 01:34 AM, Rudolf wrote:
>>
>> Hi Gluster users,
>>
>> First, I'd like to thank you all for this amazing
>> open-source! Thank you!
>>
>> I'm working on a home project – three servers with Gluster and
>> NFS-Ganesha. My goal is to create an HA NFS share with three copies
>> of each file on each server.
>>
>> My systems are CentOS 7.3 Minimal installs with the latest updates
>> and the most current RPMs from the "centos-gluster310" repository.
>>
>> I followed this tutorial:
>>
>> http://blog.gluster.org/2015/10/linux-scale-out-nfsv4-using-nfs-ganesha-and-glusterfs-one-step-at-a-time/
>>
>> (the second half, which describes the multi-node HA setup)
>>
>> with a few exceptions:
>>
>> 1. All RPMs are from the "centos-gluster310" repo that is installed
>> by "yum -y install centos-release-gluster"
>> 2. I have three nodes (not four) with a "replica 3" volume.
>> 3. I created an empty ganesha.conf and a non-empty ganesha-ha.conf in
>> "/var/run/gluster/shared_storage/nfs-ganesha/" (the referenced blog
>> post is outdated; this is now a requirement)
>> 4. ganesha-ha.conf doesn't have "HA_VOL_SERVER" since this isn't
>> needed anymore.
>>
>>
>> Please refer to
>>
>> http://gluster.readthedocs.io/en/latest/Administrator%20Guide/NFS-Ganesha%20GlusterFS%20Integration/
>>
>> It is being updated with the latest changes to the setup procedure.
>>
>> When I finish the configuration, all is good. nfs-ganesha.service
>> is active and running, and from the client I can ping all three
>> VIPs and mount NFS. Copied files are replicated to all nodes.
>>
>> But when I restart the nodes (one by one, with a 5 min. delay
>> between them) I cannot ping or mount (I assume that all VIPs are
>> down). So my setup definitely isn't HA.
>>
>> I found that:
>> # pcs status
>> Error: cluster is not currently running on this node
>>
>>
>> This means the pcsd service is not up. Did you enable the pcsd
>> service (systemctl enable pcsd) so that it comes up automatically
>> after reboot? If not, please start it manually.
>>
>>
>> and nfs-ganesha.service is in an inactive state. Btw, I didn't run
>> "systemctl enable nfs-ganesha" since I assumed that this is something
>> that Gluster does.
>>
>>
>> Please check /var/log/ganesha.log for any errors/warnings.
>>
>> We recommend not enabling nfs-ganesha.service (by default), as
>> the shared storage (where the ganesha.conf file now resides)
>> should be up and running before nfs-ganesha gets started.
>> If it is enabled by default, it could happen that the shared_storage
>> mount point is not yet up, resulting in an nfs-ganesha service
>> failure. If you would like to address this, you could have a cron
>> job which keeps checking the mount point health and then starts
>> the nfs-ganesha service.
>>
>> Thanks,
>> Soumya
>>
>>
>> I assume that my issue is that I followed the instructions in the
>> blog post from 2015/10, which are outdated. Unfortunately I cannot
>> find anything better – I spent the whole day googling.
>>
>> Would you be so kind as to check the instructions in the blog post
>> and let me know which steps are wrong / outdated? Or do you have
>> more current instructions for a Gluster+Ganesha setup?
>>
>> Thank you.
>>
>> Kind regards,
>> Adam
>>
>>
>>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> http://lists.gluster.org/mailman/listinfo/gluster-users
>>
>>
>>
>>
>> --
>> Adam
>>
>>
>>
>>
>> --
>> Adam
--
Adam