[Gluster-users] glusterfs 4.1.6 error in starting glusterd service

Wed Jan 16 11:32:54 UTC 2019

Atin,
I have copied the contents of 'gfs-tst' from the vols folder on another node.
Starting the service still fails, with the following errors in the glusterd.log file.

[2019-01-15 20:16:59.513023] I [MSGID: 100030] [glusterfsd.c:2741:main]
0-/usr/local/sbin/glusterd: Started running /usr/local/sbin/glusterd
version 4.1.6 (args: /usr/local/sbin/glusterd -p /var/run/glusterd.pid)
[2019-01-15 20:16:59.517164] I [MSGID: 106478] [glusterd.c:1423:init]
0-management: Maximum allowed open file descriptors set to 65536
[2019-01-15 20:16:59.517264] I [MSGID: 106479] [glusterd.c:1481:init]
0-management: Using /var/lib/glusterd as working directory
[2019-01-15 20:16:59.517283] I [MSGID: 106479] [glusterd.c:1486:init]
0-management: Using /var/run/gluster as pid file working directory
[2019-01-15 20:16:59.521508] W [MSGID: 103071]
[rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event
channel creation failed [No such device]
[2019-01-15 20:16:59.521544] W [MSGID: 103055] [rdma.c:4938:init]
0-rdma.management: Failed to initialize IB Device
[2019-01-15 20:16:59.521562] W [rpc-transport.c:351:rpc_transport_load]
0-rpc-transport: 'rdma' initialization failed
[2019-01-15 20:16:59.521629] W [rpcsvc.c:1781:rpcsvc_create_listener]
0-rpc-service: cannot create listener, initing the transport failed
[2019-01-15 20:16:59.521648] E [MSGID: 106244] [glusterd.c:1764:init]
0-management: creation of 1 listeners failed, continuing with succeeded
transport
[2019-01-15 20:17:00.529390] I [MSGID: 106513]
[glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved
op-version: 40100
[2019-01-15 20:17:00.608354] I [MSGID: 106544]
[glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID:
d6bf51a7-c296-492f-8dac-e81efa9dd22d
[2019-01-15 20:17:00.650911] W [MSGID: 106425]
[glusterd-store.c:2643:glusterd_store_retrieve_bricks] 0-management: failed
to get statfs() call on brick /media/disk4/brick4 [No such file or
directory]
[2019-01-15 20:17:00.691240] I [MSGID: 106498]
[glusterd-handler.c:3614:glusterd_friend_add_from_peerinfo] 0-management:
connect returned 0
[2019-01-15 20:17:00.691307] W [MSGID: 106061]
[glusterd-handler.c:3408:glusterd_transport_inet_options_build] 0-glusterd:
Failed to get tcp-user-timeout
[2019-01-15 20:17:00.691331] I [rpc-clnt.c:1059:rpc_clnt_connection_init]
0-management: setting frame-timeout to 600
[2019-01-15 20:17:00.692547] E [MSGID: 106187]
[glusterd-store.c:4662:glusterd_resolve_all_bricks] 0-glusterd: resolve
brick failed in restore
[2019-01-15 20:17:00.692582] E [MSGID: 101019] [xlator.c:720:xlator_init]
0-management: Initialization of volume 'management' failed, review your
volfile again
[2019-01-15 20:17:00.692597] E [MSGID: 101066]
[graph.c:367:glusterfs_graph_init] 0-management: initializing translator
failed
[2019-01-15 20:17:00.692607] E [MSGID: 101176]
[graph.c:738:glusterfs_graph_activate] 0-graph: init failed
[2019-01-15 20:17:00.693004] W [glusterfsd.c:1514:cleanup_and_exit]
(-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52]
-->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41]
-->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-:
received signum (-1), shutting down
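
In case it helps anyone reading a wrapped log like the one above, here is a rough Python sketch that glues wrapped lines back into whole entries and keeps only the warning/error ones. The entry pattern (a bracketed timestamp followed by a one-letter level) is only inferred from the lines pasted here, and the function names are made up for illustration; nothing in it is part of glusterfs.

```python
import re

# Assumption: every entry starts with "[YYYY-MM-DD HH:MM:SS.ffffff] L ",
# where L is a one-letter level (I, W, E, ...), as in the pasted log.
ENTRY_START = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\] ([A-Z]) ")


def join_entries(lines):
    """Rebuild one string per log entry from lines wrapped by a mail client."""
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if ENTRY_START.match(line) or not entries:
            entries.append(line)          # a new entry begins here
        else:
            entries[-1] += " " + line     # continuation of the previous entry
    return entries


def entries_at_level(lines, levels=("E",)):
    """Return (timestamp, level, rest) for entries whose level is in `levels`."""
    result = []
    for entry in join_entries(lines):
        m = ENTRY_START.match(entry)
        if m and m.group(2) in levels:
            result.append((m.group(1), m.group(2), entry[m.end():]))
    return result


# A few lines copied from the log above, still wrapped.
SAMPLE = """\
[2019-01-15 20:17:00.650911] W [MSGID: 106425]
[glusterd-store.c:2643:glusterd_store_retrieve_bricks] 0-management: failed
to get statfs() call on brick /media/disk4/brick4 [No such file or
directory]
[2019-01-15 20:17:00.691331] I [rpc-clnt.c:1059:rpc_clnt_connection_init]
0-management: setting frame-timeout to 600
[2019-01-15 20:17:00.692547] E [MSGID: 106187]
[glusterd-store.c:4662:glusterd_resolve_all_bricks] 0-glusterd: resolve
brick failed in restore
""".splitlines()

if __name__ == "__main__":
    for ts, level, rest in entries_at_level(SAMPLE, levels=("W", "E")):
        print(ts, level, rest)
```

To run it against a real file, pass `open(path).read().splitlines()` instead of `SAMPLE`; the path of the log file is not assumed here.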

On Wed, Jan 16, 2019 at 4:34 PM Atin Mukherjee <amukherj at redhat.com> wrote:

> This is a case of a partially written transaction. The host ran out of
> space on the root partition, where all glusterd-related configuration is
> persisted, so the transaction could not be written and the new (replaced)
> brick's information was not persisted in the configuration. The workaround
> is to copy the contents of /var/lib/glusterd/vols/gfs-tst/ from one of the
> nodes in the trusted storage pool to the node where the glusterd service
> fails to come up. After that, restarting the glusterd service should make
> peer status report all nodes as healthy and connected.
>
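
A rough sketch of how one could check, after copying the directory, that each brick has a store file under vols/gfs-tst/bricks. The naming rule (host, a colon, then the brick path with '/' replaced by '-') is only inferred from the path glusterd reports as missing in the original log (IP.3:-media-disk3-brick3). The helper names and the brick list are assumptions for illustration, not a glusterd API.

```python
import os
import tempfile


def brick_store_name(host, brick_path):
    """Assumed store file name: '<host>:' + brick path with '/' turned into '-'."""
    return host + ":" + brick_path.replace("/", "-")


def missing_brick_files(vol_dir, bricks):
    """Return expected store file names that are absent from <vol_dir>/bricks."""
    bricks_dir = os.path.join(vol_dir, "bricks")
    present = set(os.listdir(bricks_dir)) if os.path.isdir(bricks_dir) else set()
    expected = [brick_store_name(host, path) for host, path in bricks]
    return [name for name in expected if name not in present]


if __name__ == "__main__":
    # Hypothetical brick list; a temporary directory stands in for the real
    # volume directory so the sketch can run anywhere.
    bricks = [("IP.3", "/media/disk3/brick3"), ("IP.3", "/media/disk4/brick4")]
    with tempfile.TemporaryDirectory() as vol_dir:
        os.mkdir(os.path.join(vol_dir, "bricks"))
        existing = os.path.join(vol_dir, "bricks", brick_store_name(*bricks[1]))
        open(existing, "w").close()
        print(missing_brick_files(vol_dir, bricks))
```

On the affected node, `vol_dir` would be /var/lib/glusterd/vols/gfs-tst, the directory named in the workaround.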
> On Wed, Jan 16, 2019 at 3:49 PM Amudhan P <amudhan83 at gmail.com> wrote:
>
>> Hi,
>>
>> In short: when I start the glusterd service on one server, I get the
>> following errors in the glusterd.log file.
>> What needs to be done?
>>
>> Errors logged in glusterd.log:
>>
>> [2019-01-15 17:50:13.956053] I [MSGID: 100030] [glusterfsd.c:2741:main]
>> 0-/usr/local/sbin/glusterd: Started running /usr/local/sbin/glusterd
>> version 4.1.6 (args: /usr/local/sbin/glusterd -p /var/run/glusterd.pid)
>> [2019-01-15 17:50:13.960131] I [MSGID: 106478] [glusterd.c:1423:init]
>> 0-management: Maximum allowed open file descriptors set to 65536
>> [2019-01-15 17:50:13.960193] I [MSGID: 106479] [glusterd.c:1481:init]
>> 0-management: Using /var/lib/glusterd as working directory
>> [2019-01-15 17:50:13.960212] I [MSGID: 106479] [glusterd.c:1486:init]
>> 0-management: Using /var/run/gluster as pid file working directory
>> [2019-01-15 17:50:13.964437] W [MSGID: 103071]
>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event
>> channel creation failed [No such device]
>> [2019-01-15 17:50:13.964474] W [MSGID: 103055] [rdma.c:4938:init]
>> 0-rdma.management: Failed to initialize IB Device
>> [2019-01-15 17:50:13.964491] W [rpc-transport.c:351:rpc_transport_load]
>> 0-rpc-transport: 'rdma' initialization failed
>> [2019-01-15 17:50:13.964560] W [rpcsvc.c:1781:rpcsvc_create_listener]
>> 0-rpc-service: cannot create listener, initing the transport failed
>> [2019-01-15 17:50:13.964579] E [MSGID: 106244] [glusterd.c:1764:init]
>> 0-management: creation of 1 listeners failed, continuing with succeeded
>> transport
>> [2019-01-15 17:50:14.967681] I [MSGID: 106513]
>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved
>> op-version: 40100
>> [2019-01-15 17:50:14.973931] I [MSGID: 106544]
>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID:
>> d6bf51a7-c296-492f-8dac-e81efa9dd22d
>> [2019-01-15 17:50:15.046620] E [MSGID: 101032]
>> [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to
>> /var/lib/glusterd/vols/gfs-tst/bricks/IP.3:-media-disk3-brick3. [No such
>> file or directory]
>> [2019-01-15 17:50:15.046685] E [MSGID: 106201]
>> [glusterd-store.c:3384:glusterd_store_retrieve_volumes] 0-management:
>> Unable to restore volume: gfs-tst
>> [2019-01-15 17:50:15.046718] E [MSGID: 101019] [xlator.c:720:xlator_init]
>> 0-management: Initialization of volume 'management' failed, review your
>> volfile again
>> [2019-01-15 17:50:15.046732] E [MSGID: 101066]
>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator
>> failed
>> [2019-01-15 17:50:15.046741] E [MSGID: 101176]
>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed
>> [2019-01-15 17:50:15.047171] W [glusterfsd.c:1514:cleanup_and_exit]
>> (-->/usr/local/sbin/glusterd(glusterfs_volumes
>>
>>
>>
>> In detail: I am trying to simulate a situation where the volume stopped
>> abnormally and the entire cluster was restarted with some disks missing.
>>
>> My test cluster has 3 nodes with four disks each, and I have set up a
>> volume with disperse 4+2.
>> On node-3, 2 disks failed. To replace them, I shut down all systems.
>>
>> Below are the steps I followed.
>>
>> 1. Unmounted the volume from the client machine.
>> 2. Shut down all systems by running `shutdown -h now` (without stopping
>> the volume or the glusterd service).
>> 3. Replaced the faulty disks in node-3.
>> 4. Powered on all systems.
>> 5. Formatted the replaced drives and mounted all drives.
>> 6. Started the glusterd service on all nodes (success).
>> 7. Ran the `volume status` command from node-3.
>> output : [2019-01-15 16:52:17.718422]  : v status : FAILED : Staging
>> failed on 0083ec0c-40bf-472a-a128-458924e56c96. Please check log file for
>> details.
>> 8. Ran the `volume start gfs-tst` command from node-3.
>> output : [2019-01-15 16:53:19.410252]  : v start gfs-tst : FAILED :
>> Volume gfs-tst already started
>>
>> 9. Ran `gluster v status` on another node. It shows all bricks as
>> available, but the 'self-heal daemon' as not running.
>> @gfstst-node2:~$ sudo gluster v status
>> Status of volume: gfs-tst
>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>> ------------------------------------------------------------------------------
>> Brick IP.2:/media/disk1/brick1          49152     0          Y       1517
>> Brick IP.4:/media/disk1/brick1          49152     0          Y       1668
>> Brick IP.2:/media/disk2/brick2          49153     0          Y       1522
>> Brick IP.4:/media/disk2/brick2          49153     0          Y       1678
>> Brick IP.2:/media/disk3/brick3          49154     0          Y       1527
>> Brick IP.4:/media/disk3/brick3          49154     0          Y       1677
>> Brick IP.2:/media/disk4/brick4          49155     0          Y       1541
>> Brick IP.4:/media/disk4/brick4          49155     0          Y       1683
>> Self-heal Daemon on localhost           N/A       N/A        Y       2662
>> Self-heal Daemon on IP.4                N/A       N/A        Y       2786
>>
>> 10. The output above says 'volume already started', so I ran the
>> `reset-brick` command:
>>    v reset-brick gfs-tst IP.3:/media/disk3/brick3 IP.3:/media/disk3/brick3 commit force
>>
>> output : [2019-01-15 16:57:37.916942]  : v reset-brick gfs-tst
>> IP.3:/media/disk3/brick3 IP.3:/media/disk3/brick3 commit force : FAILED :
>> /media/disk3/brick3 is already part of a volume
>>
>> 11. The reset-brick command did not work, so I tried stopping the volume
>> and starting it with the force option.
>> output : [2019-01-15 17:01:04.570794]  : v start gfs-tst force : FAILED :
>> Pre-validation failed on localhost. Please check log file for details
>>
>> 12. I then stopped the service on all nodes and tried starting it again.
>> The service started successfully on every node except node-3.
>>
>> On node-3 I get the following message:
>>
>> sudo service glusterd start
>> * Starting glusterd service glusterd                              [fail]
>> /usr/local/sbin/glusterd: option requires an argument -- 'f'
>> Try `glusterd --help' or `glusterd --usage' for more information.
>>
>> 13. Checking the glusterd log file, I found that the OS drive had run out
>> of space.
>> output : [2019-01-15 16:51:37.210792] W [MSGID: 101012]
>> [store.c:372:gf_store_save_value] 0-management: fflush failed. [No space
>> left on device]
>> [2019-01-15 16:51:37.210874] E [MSGID: 106190]
>> [glusterd-store.c:1058:glusterd_volume_exclude_options_write] 0-management:
>> Unable to write volume values for gfs-tst
>>
>> 14. I cleared some space on the OS drive, but the service still does not
>> start. Below are the errors logged in glusterd.log:
>>
>> [2019-01-15 17:50:13.956053] I [MSGID: 100030] [glusterfsd.c:2741:main]
>> 0-/usr/local/sbin/glusterd: Started running /usr/local/sbin/glusterd
>> version 4.1.6 (args: /usr/local/sbin/glusterd -p /var/run/glusterd.pid)
>> [2019-01-15 17:50:13.960131] I [MSGID: 106478] [glusterd.c:1423:init]
>> 0-management: Maximum allowed open file descriptors set to 65536
>> [2019-01-15 17:50:13.960193] I [MSGID: 106479] [glusterd.c:1481:init]
>> 0-management: Using /var/lib/glusterd as working directory
>> [2019-01-15 17:50:13.960212] I [MSGID: 106479] [glusterd.c:1486:init]
>> 0-management: Using /var/run/gluster as pid file working directory
>> [2019-01-15 17:50:13.964437] W [MSGID: 103071]
>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event
>> channel creation failed [No such device]
>> [2019-01-15 17:50:13.964474] W [MSGID: 103055] [rdma.c:4938:init]
>> 0-rdma.management: Failed to initialize IB Device
>> [2019-01-15 17:50:13.964491] W [rpc-transport.c:351:rpc_transport_load]
>> 0-rpc-transport: 'rdma' initialization failed
>> [2019-01-15 17:50:13.964560] W [rpcsvc.c:1781:rpcsvc_create_listener]
>> 0-rpc-service: cannot create listener, initing the transport failed
>> [2019-01-15 17:50:13.964579] E [MSGID: 106244] [glusterd.c:1764:init]
>> 0-management: creation of 1 listeners failed, continuing with succeeded
>> transport
>> [2019-01-15 17:50:14.967681] I [MSGID: 106513]
>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved
>> op-version: 40100
>> [2019-01-15 17:50:14.973931] I [MSGID: 106544]
>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID:
>> d6bf51a7-c296-492f-8dac-e81efa9dd22d
>> [2019-01-15 17:50:15.046620] E [MSGID: 101032]
>> [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to
>> /var/lib/glusterd/vols/gfs-tst/bricks/IP.3:-media-disk3-brick3. [No such
>> file or directory]
>> [2019-01-15 17:50:15.046685] E [MSGID: 106201]
>> [glusterd-store.c:3384:glusterd_store_retrieve_volumes] 0-management:
>> Unable to restore volume: gfs-tst
>> [2019-01-15 17:50:15.046718] E [MSGID: 101019] [xlator.c:720:xlator_init]
>> 0-management: Initialization of volume 'management' failed, review your
>> volfile again
>> [2019-01-15 17:50:15.046732] E [MSGID: 101066]
>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator
>> failed
>> [2019-01-15 17:50:15.046741] E [MSGID: 101176]
>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed
>> [2019-01-15 17:50:15.047171] W [glusterfsd.c:1514:cleanup_and_exit]
>> (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52]
>> -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41]
>> -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-:
>> received signum (-1), shutting down
>>
>>
>> 15. On another node, `volume status` still shows the node-3 bricks as
>> live, but `peer status` shows node-3 as disconnected.
>>
>> @gfstst-node2:~$ sudo gluster v status
>> Status of volume: gfs-tst
>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>> ------------------------------------------------------------------------------
>> Brick IP.2:/media/disk1/brick1          49152     0          Y       1517
>> Brick IP.4:/media/disk1/brick1          49152     0          Y       1668
>> Brick IP.2:/media/disk2/brick2          49153     0          Y       1522
>> Brick IP.4:/media/disk2/brick2          49153     0          Y       1678
>> Brick IP.2:/media/disk3/brick3          49154     0          Y       1527
>> Brick IP.4:/media/disk3/brick3          49154     0          Y       1677
>> Brick IP.2:/media/disk4/brick4          49155     0          Y       1541
>> Brick IP.4:/media/disk4/brick4          49155     0          Y       1683
>> Self-heal Daemon on localhost           N/A       N/A        Y       2662
>> Self-heal Daemon on IP.4                N/A       N/A        Y       2786
>>
>> Task Status of Volume gfs-tst
>>
>> ------------------------------------------------------------------------------
>> There are no active volume tasks
>>
>>
>> root@gfstst-node2:~$ sudo gluster pool list
>> UUID                                    Hostname        State
>> d6bf51a7-c296-492f-8dac-e81efa9dd22d    IP.3        Disconnected
>> c1cbb58e-3ceb-4637-9ba3-3d28ef20b143    IP.4        Connected
>> 0083ec0c-40bf-472a-a128-458924e56c96    localhost       Connected
>>
>> root@gfstst-node2:~$ sudo gluster peer status
>> Number of Peers: 2
>>
>> Hostname: IP.3
>> Uuid: d6bf51a7-c296-492f-8dac-e81efa9dd22d
>> State: Peer in Cluster (Disconnected)
>>
>> Hostname: IP.4
>> Uuid: c1cbb58e-3ceb-4637-9ba3-3d28ef20b143
>> State: Peer in Cluster (Connected)
>>
>>
>> regards
>> Amudhan