[Gluster-users] glusterfs 4.1.6 error in starting glusterd service

Thu Jan 31 06:20:19 UTC 2019

Hi Atin,

These are the exact steps I performed that led to the failure. In addition,
the OS drive on node-3 was running out of space when the service failed. I
have since cleared some space on the OS drive, but the service still fails
to start.

I was trying to simulate a situation where the volume stopped abnormally
and the entire cluster was restarted with some disks missing.

My test cluster is set up with 3 nodes, each with four disks, and I have
set up a volume with disperse 4+2.
On node-3, 2 disks have failed; to replace them I shut down all systems.

Below are the steps I performed.

1. Unmounted the volume on the client machine
2. Shut down all systems by running `shutdown -h now` (without stopping
the volume or the service)
3. Replaced the faulty disk in node-3
4. Powered on all systems
5. Formatted the replaced drives and mounted all drives
6. Started the glusterd service on all nodes (success)
7. Ran the `volume status` command from node-3
output : [2019-01-15 16:52:17.718422]  : v status : FAILED : Staging failed
on 0083ec0c-40bf-472a-a128-458924e56c96. Please check log file for details.
8. Ran the `volume start gfs-tst` command from node-3
output : [2019-01-15 16:53:19.410252]  : v start gfs-tst : FAILED : Volume
gfs-tst already started

9. Ran `gluster v status` on another node; it shows all bricks available
but the 'self-heal daemon' not running
@gfstst-node2:~$ sudo gluster v status
Status of volume: gfs-tst
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick IP.2:/media/disk1/brick1          49152     0          Y       1517
Brick IP.4:/media/disk1/brick1          49152     0          Y       1668
Brick IP.2:/media/disk2/brick2          49153     0          Y       1522
Brick IP.4:/media/disk2/brick2          49153     0          Y       1678
Brick IP.2:/media/disk3/brick3          49154     0          Y       1527
Brick IP.4:/media/disk3/brick3          49154     0          Y       1677
Brick IP.2:/media/disk4/brick4          49155     0          Y       1541
Brick IP.4:/media/disk4/brick4          49155     0          Y       1683
Self-heal Daemon on localhost               N/A       N/A        Y       2662
Self-heal Daemon on IP.4                N/A       N/A        Y       2786

10. The output above says 'volume already started', so I ran the
`reset-brick` command:
   v reset-brick gfs-tst IP.3:/media/disk3/brick3 IP.3:/media/disk3/brick3 commit force

output : [2019-01-15 16:57:37.916942]  : v reset-brick gfs-tst
IP.3:/media/disk3/brick3 IP.3:/media/disk3/brick3 commit force : FAILED :
/media/disk3/brick3 is already part of a volume

11. The reset-brick command did not work, so I tried stopping the volume and
starting it with the force option
output : [2019-01-15 17:01:04.570794]  : v start gfs-tst force : FAILED :
Pre-validation failed on localhost. Please check log file for details

12. I then stopped the service on all nodes and tried starting it again. On
every node except node-3 the service started successfully without any issues.

On node-3 I receive the following message:

sudo service glusterd start
 * Starting glusterd service glusterd          [fail]
/usr/local/sbin/glusterd: option requires an argument -- 'f'
Try `glusterd --help' or `glusterd --usage' for more information.

13. Checking the glusterd log file, I found that the OS drive was running out of space
output : [2019-01-15 16:51:37.210792] W [MSGID: 101012]
[store.c:372:gf_store_save_value] 0-management: fflush failed. [No space
left on device]
 [2019-01-15 16:51:37.210874] E [MSGID: 106190]
[glusterd-store.c:1058:glusterd_volume_exclude_options_write] 0-management:
Unable to write volume values for gfs-tst

14. I cleared some space on the OS drive, but the service still does not
start. Below is the error logged in glusterd.log:

[2019-01-15 17:50:13.956053] I [MSGID: 100030] [glusterfsd.c:2741:main]
0-/usr/local/sbin/glusterd: Started running /usr/local/sbin/glusterd
version 4.1.6 (args: /usr/local/sbin/glusterd -p /var/run/glusterd.pid)
[2019-01-15 17:50:13.960131] I [MSGID: 106478] [glusterd.c:1423:init]
0-management: Maximum allowed open file descriptors set to 65536
[2019-01-15 17:50:13.960193] I [MSGID: 106479] [glusterd.c:1481:init]
0-management: Using /var/lib/glusterd as working directory
[2019-01-15 17:50:13.960212] I [MSGID: 106479] [glusterd.c:1486:init]
0-management: Using /var/run/gluster as pid file working directory
[2019-01-15 17:50:13.964437] W [MSGID: 103071]
[rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event
channel creation failed [No such device]
[2019-01-15 17:50:13.964474] W [MSGID: 103055] [rdma.c:4938:init]
0-rdma.management: Failed to initialize IB Device
[2019-01-15 17:50:13.964491] W [rpc-transport.c:351:rpc_transport_load]
0-rpc-transport: 'rdma' initialization failed
[2019-01-15 17:50:13.964560] W [rpcsvc.c:1781:rpcsvc_create_listener]
0-rpc-service: cannot create listener, initing the transport failed
[2019-01-15 17:50:13.964579] E [MSGID: 106244] [glusterd.c:1764:init]
0-management: creation of 1 listeners failed, continuing with succeeded
transport
[2019-01-15 17:50:14.967681] I [MSGID: 106513]
[glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved
op-version: 40100
[2019-01-15 17:50:14.973931] I [MSGID: 106544]
[glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID:
d6bf51a7-c296-492f-8dac-e81efa9dd22d
[2019-01-15 17:50:15.046620] E [MSGID: 101032]
[store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to
/var/lib/glusterd/vols/gfs-tst/bricks/IP.3:-media-disk3-brick3. [No such
file or directory]
[2019-01-15 17:50:15.046685] E [MSGID: 106201]
[glusterd-store.c:3384:glusterd_store_retrieve_volumes] 0-management:
Unable to restore volume: gfs-tst
[2019-01-15 17:50:15.046718] E [MSGID: 101019] [xlator.c:720:xlator_init]
0-management: Initialization of volume 'management' failed, review your
volfile again
[2019-01-15 17:50:15.046732] E [MSGID: 101066]
[graph.c:367:glusterfs_graph_init] 0-management: initializing translator
failed
[2019-01-15 17:50:15.046741] E [MSGID: 101176]
[graph.c:738:glusterfs_graph_activate] 0-graph: init failed
[2019-01-15 17:50:15.047171] W [glusterfsd.c:1514:cleanup_and_exit]
(-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52]
-->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41]
-->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-:
received signum (-1), shutting down

15. Running `volume status` on another node:

@gfstst-node2:~$ sudo gluster v status
Status of volume: gfs-tst
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick IP.2:/media/disk1/brick1          49152     0          Y       1517
Brick IP.4:/media/disk1/brick1          49152     0          Y       1668
Brick IP.2:/media/disk2/brick2          49153     0          Y       1522
Brick IP.4:/media/disk2/brick2          49153     0          Y       1678
Brick IP.2:/media/disk3/brick3          49154     0          Y       1527
Brick IP.4:/media/disk3/brick3          49154     0          Y       1677
Brick IP.2:/media/disk4/brick4          49155     0          Y       1541
Brick IP.4:/media/disk4/brick4          49155     0          Y       1683
Self-heal Daemon on localhost           N/A       N/A        Y       2662
Self-heal Daemon on IP.4                N/A       N/A        Y       2786

Task Status of Volume gfs-tst
------------------------------------------------------------------------------
There are no active volume tasks

16. The 'peer status' command shows node-3 as disconnected:

root@gfstst-node2:~$ sudo gluster pool list
UUID                                    Hostname        State
d6bf51a7-c296-492f-8dac-e81efa9dd22d    IP.3        Disconnected
c1cbb58e-3ceb-4637-9ba3-3d28ef20b143    IP.4        Connected
0083ec0c-40bf-472a-a128-458924e56c96    localhost       Connected

root@gfstst-node2:~$ sudo gluster peer status
Number of Peers: 2

Hostname: IP.3
Uuid: d6bf51a7-c296-492f-8dac-e81efa9dd22d
State: Peer in Cluster (Disconnected)

Hostname: IP.4
Uuid: c1cbb58e-3ceb-4637-9ba3-3d28ef20b143
State: Peer in Cluster (Connected)
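
For reference, the two conditions that surfaced in this thread (the OS drive
running out of space, and a peer file missing from /var/lib/glusterd/peers/
on node-3, as Atin diagnosed in the quoted mail below) can be checked before
restarting glusterd. The following is only a minimal Python sketch under
stated assumptions: the helper names `free_mib` and `missing_peer_files` are
hypothetical and not part of gluster, and it assumes, as in Atin's
instructions, that each peer is stored as a file named after its UUID.

```python
import os
import shutil

# Working directory reported in glusterd.log ("Using /var/lib/glusterd as
# working directory").
GLUSTERD_DIR = "/var/lib/glusterd"


def free_mib(path):
    """Return the free space, in MiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free // (1024 * 1024)


def missing_peer_files(peers_dir, expected_uuids):
    """Return the expected peer UUIDs that have no file in `peers_dir`.

    Hypothetical helper: it only compares file names, it does not validate
    the contents of the peer files.
    """
    present = set(os.listdir(peers_dir)) if os.path.isdir(peers_dir) else set()
    return sorted(u for u in expected_uuids if u not in present)


if __name__ == "__main__":
    # UUIDs of the two *other* nodes as seen from node-3, taken from the
    # `gluster pool list` output above.
    expected = [
        "0083ec0c-40bf-472a-a128-458924e56c96",
        "c1cbb58e-3ceb-4637-9ba3-3d28ef20b143",
    ]
    if os.path.isdir(GLUSTERD_DIR):
        print("free MiB on glusterd partition:", free_mib(GLUSTERD_DIR))
    peers_dir = os.path.join(GLUSTERD_DIR, "peers")
    print("missing peer files:", missing_peer_files(peers_dir, expected))
```

Run it as root on the affected node. An empty "missing peer files" list
means every expected peer file is present; a listed UUID is a candidate for
copying from a healthy node, as described in the quoted mail.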

regards
Amudhan

On Thu, Jan 31, 2019 at 8:54 AM Atin Mukherjee <amukherj at redhat.com> wrote:

> I'm not sure how you ended up in a state where one of the nodes lost the
> information of one peer in the cluster. I suspect that while doing a
> replace-node operation you somehow landed in this situation through an
> incorrect step. Unless you can elaborate on all the steps you performed in
> the cluster, it will be difficult to figure out the exact cause.
>
> On Wed, Jan 30, 2019 at 7:25 PM Amudhan P <amudhan83 at gmail.com> wrote:
>
>> Hi Atin,
>>
>> Yes, it worked, thank you.
>>
>> What could be the cause of this issue?
>>
>>
>>
>> On Fri, Jan 25, 2019 at 1:56 PM Atin Mukherjee <amukherj at redhat.com>
>> wrote:
>>
>>> Amudhan,
>>>
>>> So here's the issue:
>>>
>>> On node3, 'cat /var/lib/glusterd/peers/* ' doesn't show node2's
>>> details, and that's why glusterd wasn't able to resolve the brick(s) hosted
>>> on node2.
>>>
>>> Can you please pick up 0083ec0c-40bf-472a-a128-458924e56c96 file from
>>> /var/lib/glusterd/peers/ from node 4 and place it in the same location in
>>> node 3 and then restart glusterd service on node 3?
>>>
>>>
>>> On Thu, Jan 24, 2019 at 11:57 AM Amudhan P <amudhan83 at gmail.com> wrote:
>>>
>>>> Atin,
>>>>
>>>> Sorry, I missed sending the entire `glusterd` folder. The attached zip
>>>> now contains the `glusterd` folder from all nodes.
>>>>
>>>> The problem node is node3, IP 10.1.2.3; the `glusterd` log file is inside
>>>> the node3 folder.
>>>>
>>>> regards
>>>> Amudhan
>>>>
>>>> On Wed, Jan 23, 2019 at 11:02 PM Atin Mukherjee <amukherj at redhat.com>
>>>> wrote:
>>>>
>>>>> Amudhan,
>>>>>
>>>>> I see that you have provided the content of the configuration of the
>>>>> volume gfs-tst where the request was to share the dump of
>>>>> /var/lib/glusterd/* . I can not debug this further until you share the
>>>>> correct dump.
>>>>>
>>>>> On Thu, Jan 17, 2019 at 3:43 PM Atin Mukherjee <amukherj at redhat.com>
>>>>> wrote:
>>>>>
>>>>>> Can you please run 'glusterd -LDEBUG' and share the glusterd.log?
>>>>>> To avoid too much back and forth, I suggest you share the content of
>>>>>> /var/lib/glusterd from all the nodes. Also mention on which
>>>>>> particular node the glusterd service is unable to come up.
>>>>>>
>>>>>> On Thu, Jan 17, 2019 at 11:34 AM Amudhan P <amudhan83 at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I have created the folder in the path as you said, but the service
>>>>>>> still failed to start. Below is the error message in glusterd.log:
>>>>>>>
>>>>>>> [2019-01-16 14:50:14.555742] I [MSGID: 100030]
>>>>>>> [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running
>>>>>>> /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p
>>>>>>> /var/run/glusterd.pid)
>>>>>>> [2019-01-16 14:50:14.559835] I [MSGID: 106478]
>>>>>>> [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors
>>>>>>> set to 65536
>>>>>>> [2019-01-16 14:50:14.559894] I [MSGID: 106479]
>>>>>>> [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working
>>>>>>> directory
>>>>>>> [2019-01-16 14:50:14.559912] I [MSGID: 106479]
>>>>>>> [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file
>>>>>>> working directory
>>>>>>> [2019-01-16 14:50:14.563834] W [MSGID: 103071]
>>>>>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event
>>>>>>> channel creation failed [No such device]
>>>>>>> [2019-01-16 14:50:14.563867] W [MSGID: 103055] [rdma.c:4938:init]
>>>>>>> 0-rdma.management: Failed to initialize IB Device
>>>>>>> [2019-01-16 14:50:14.563882] W
>>>>>>> [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma'
>>>>>>> initialization failed
>>>>>>> [2019-01-16 14:50:14.563957] W
>>>>>>> [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create
>>>>>>> listener, initing the transport failed
>>>>>>> [2019-01-16 14:50:14.563974] E [MSGID: 106244]
>>>>>>> [glusterd.c:1764:init] 0-management: creation of 1 listeners failed,
>>>>>>> continuing with succeeded transport
>>>>>>> [2019-01-16 14:50:15.565868] I [MSGID: 106513]
>>>>>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved
>>>>>>> op-version: 40100
>>>>>>> [2019-01-16 14:50:15.642532] I [MSGID: 106544]
>>>>>>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID:
>>>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d
>>>>>>> [2019-01-16 14:50:15.675333] I [MSGID: 106498]
>>>>>>> [glusterd-handler.c:3614:glusterd_friend_add_from_peerinfo] 0-management:
>>>>>>> connect returned 0
>>>>>>> [2019-01-16 14:50:15.675421] W [MSGID: 106061]
>>>>>>> [glusterd-handler.c:3408:glusterd_transport_inet_options_build] 0-glusterd:
>>>>>>> Failed to get tcp-user-timeout
>>>>>>> [2019-01-16 14:50:15.675451] I
>>>>>>> [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-management: setting
>>>>>>> frame-timeout to 600
>>>>>>> *[2019-01-16 14:50:15.676912] E [MSGID: 106187]
>>>>>>> [glusterd-store.c:4662:glusterd_resolve_all_bricks] 0-glusterd: resolve
>>>>>>> brick failed in restore*
>>>>>>> *[2019-01-16 14:50:15.676956] E [MSGID: 101019]
>>>>>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume
>>>>>>> 'management' failed, review your volfile again*
>>>>>>> [2019-01-16 14:50:15.676973] E [MSGID: 101066]
>>>>>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator
>>>>>>> failed
>>>>>>> [2019-01-16 14:50:15.676986] E [MSGID: 101176]
>>>>>>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed
>>>>>>> [2019-01-16 14:50:15.677479] W [glusterfsd.c:1514:cleanup_and_exit]
>>>>>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52]
>>>>>>> -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41]
>>>>>>> -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-:
>>>>>>> received signum (-1), shutting down
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jan 17, 2019 at 8:06 AM Atin Mukherjee <amukherj at redhat.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> If gluster volume info/status shows the brick to be
>>>>>>>> /media/disk4/brick4 then you'd need to mount the same path and hence you'd
>>>>>>>> need to create the brick4 directory explicitly. I fail to understand the
>>>>>>>> rationale how only /media/disk4 can be used as the mount path for the
>>>>>>>> brick.
>>>>>>>>
>>>>>>>> On Wed, Jan 16, 2019 at 5:24 PM Amudhan P <amudhan83 at gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Yes, I did mount the bricks, but the folder 'brick4' was still not
>>>>>>>>> created inside the brick.
>>>>>>>>> Do I need to create this folder? When I run replace-brick it
>>>>>>>>> creates the folder inside the brick; I have seen this behavior
>>>>>>>>> before when running replace-brick or when heal begins.
>>>>>>>>>
>>>>>>>>> On Wed, Jan 16, 2019 at 5:05 PM Atin Mukherjee <
>>>>>>>>> amukherj at redhat.com> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Jan 16, 2019 at 5:02 PM Amudhan P <amudhan83 at gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Atin,
>>>>>>>>>>> I have copied the content of 'gfs-tst' from the vols folder on
>>>>>>>>>>> another node. Starting the service still fails, with this error
>>>>>>>>>>> message in the glusterd.log file:
>>>>>>>>>>>
>>>>>>>>>>> [2019-01-15 20:16:59.513023] I [MSGID: 100030]
>>>>>>>>>>> [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running
>>>>>>>>>>> /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p
>>>>>>>>>>> /var/run/glusterd.pid)
>>>>>>>>>>> [2019-01-15 20:16:59.517164] I [MSGID: 106478]
>>>>>>>>>>> [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors
>>>>>>>>>>> set to 65536
>>>>>>>>>>> [2019-01-15 20:16:59.517264] I [MSGID: 106479]
>>>>>>>>>>> [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working
>>>>>>>>>>> directory
>>>>>>>>>>> [2019-01-15 20:16:59.517283] I [MSGID: 106479]
>>>>>>>>>>> [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file
>>>>>>>>>>> working directory
>>>>>>>>>>> [2019-01-15 20:16:59.521508] W [MSGID: 103071]
>>>>>>>>>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event
>>>>>>>>>>> channel creation failed [No such device]
>>>>>>>>>>> [2019-01-15 20:16:59.521544] W [MSGID: 103055]
>>>>>>>>>>> [rdma.c:4938:init] 0-rdma.management: Failed to initialize IB Device
>>>>>>>>>>> [2019-01-15 20:16:59.521562] W
>>>>>>>>>>> [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma'
>>>>>>>>>>> initialization failed
>>>>>>>>>>> [2019-01-15 20:16:59.521629] W
>>>>>>>>>>> [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create
>>>>>>>>>>> listener, initing the transport failed
>>>>>>>>>>> [2019-01-15 20:16:59.521648] E [MSGID: 106244]
>>>>>>>>>>> [glusterd.c:1764:init] 0-management: creation of 1 listeners failed,
>>>>>>>>>>> continuing with succeeded transport
>>>>>>>>>>> [2019-01-15 20:17:00.529390] I [MSGID: 106513]
>>>>>>>>>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved
>>>>>>>>>>> op-version: 40100
>>>>>>>>>>> [2019-01-15 20:17:00.608354] I [MSGID: 106544]
>>>>>>>>>>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID:
>>>>>>>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d
>>>>>>>>>>> [2019-01-15 20:17:00.650911] W [MSGID: 106425]
>>>>>>>>>>> [glusterd-store.c:2643:glusterd_store_retrieve_bricks] 0-management: failed
>>>>>>>>>>> to get statfs() call on brick /media/disk4/brick4 [No such file or
>>>>>>>>>>> directory]
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> This means that underlying brick /media/disk4/brick4 doesn't
>>>>>>>>>> exist. You already mentioned that you had replaced the faulty disk, but
>>>>>>>>>> have you not mounted it yet?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> [2019-01-15 20:17:00.691240] I [MSGID: 106498]
>>>>>>>>>>> [glusterd-handler.c:3614:glusterd_friend_add_from_peerinfo] 0-management:
>>>>>>>>>>> connect returned 0
>>>>>>>>>>> [2019-01-15 20:17:00.691307] W [MSGID: 106061]
>>>>>>>>>>> [glusterd-handler.c:3408:glusterd_transport_inet_options_build] 0-glusterd:
>>>>>>>>>>> Failed to get tcp-user-timeout
>>>>>>>>>>> [2019-01-15 20:17:00.691331] I
>>>>>>>>>>> [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-management: setting
>>>>>>>>>>> frame-timeout to 600
>>>>>>>>>>> [2019-01-15 20:17:00.692547] E [MSGID: 106187]
>>>>>>>>>>> [glusterd-store.c:4662:glusterd_resolve_all_bricks] 0-glusterd: resolve
>>>>>>>>>>> brick failed in restore
>>>>>>>>>>> [2019-01-15 20:17:00.692582] E [MSGID: 101019]
>>>>>>>>>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume
>>>>>>>>>>> 'management' failed, review your volfile again
>>>>>>>>>>> [2019-01-15 20:17:00.692597] E [MSGID: 101066]
>>>>>>>>>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator
>>>>>>>>>>> failed
>>>>>>>>>>> [2019-01-15 20:17:00.692607] E [MSGID: 101176]
>>>>>>>>>>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed
>>>>>>>>>>> [2019-01-15 20:17:00.693004] W
>>>>>>>>>>> [glusterfsd.c:1514:cleanup_and_exit]
>>>>>>>>>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52]
>>>>>>>>>>> -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41]
>>>>>>>>>>> -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-:
>>>>>>>>>>> received signum (-1), shutting down
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jan 16, 2019 at 4:34 PM Atin Mukherjee <
>>>>>>>>>>> amukherj at redhat.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> This is a case of a partially written transaction. Because the
>>>>>>>>>>>> host ran out of space on the root partition, where all
>>>>>>>>>>>> glusterd-related configuration is persisted, the transaction
>>>>>>>>>>>> couldn't be written, and hence the new (replaced) brick's
>>>>>>>>>>>> information wasn't persisted in the configuration. The
>>>>>>>>>>>> workaround is to copy the content of
>>>>>>>>>>>> /var/lib/glusterd/vols/gfs-tst/ from one of the nodes in the
>>>>>>>>>>>> trusted storage pool to the node where the glusterd service fails
>>>>>>>>>>>> to come up. After that, restarting the glusterd service should
>>>>>>>>>>>> make peer status report all nodes as healthy and connected.
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jan 16, 2019 at 3:49 PM Amudhan P <amudhan83 at gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> In short, when I start the glusterd service I get the
>>>>>>>>>>>>> following error message in the glusterd.log file on one server.
>>>>>>>>>>>>> What needs to be done?
>>>>>>>>>>>>>
>>>>>>>>>>>>> error logged in glusterd.log
>>>>>>>>>>>>>
>>>>>>>>>>>>> [2019-01-15 17:50:13.956053] I [MSGID: 100030]
>>>>>>>>>>>>> [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running
>>>>>>>>>>>>> /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p
>>>>>>>>>>>>> /var/run/glusterd.pid)
>>>>>>>>>>>>> [2019-01-15 17:50:13.960131] I [MSGID: 106478]
>>>>>>>>>>>>> [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors
>>>>>>>>>>>>> set to 65536
>>>>>>>>>>>>> [2019-01-15 17:50:13.960193] I [MSGID: 106479]
>>>>>>>>>>>>> [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working
>>>>>>>>>>>>> directory
>>>>>>>>>>>>> [2019-01-15 17:50:13.960212] I [MSGID: 106479]
>>>>>>>>>>>>> [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file
>>>>>>>>>>>>> working directory
>>>>>>>>>>>>> [2019-01-15 17:50:13.964437] W [MSGID: 103071]
>>>>>>>>>>>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event
>>>>>>>>>>>>> channel creation failed [No such device]
>>>>>>>>>>>>> [2019-01-15 17:50:13.964474] W [MSGID: 103055]
>>>>>>>>>>>>> [rdma.c:4938:init] 0-rdma.management: Failed to initialize IB Device
>>>>>>>>>>>>> [2019-01-15 17:50:13.964491] W
>>>>>>>>>>>>> [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma'
>>>>>>>>>>>>> initialization failed
>>>>>>>>>>>>> [2019-01-15 17:50:13.964560] W
>>>>>>>>>>>>> [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create
>>>>>>>>>>>>> listener, initing the transport failed
>>>>>>>>>>>>> [2019-01-15 17:50:13.964579] E [MSGID: 106244]
>>>>>>>>>>>>> [glusterd.c:1764:init] 0-management: creation of 1 listeners failed,
>>>>>>>>>>>>> continuing with succeeded transport
>>>>>>>>>>>>> [2019-01-15 17:50:14.967681] I [MSGID: 106513]
>>>>>>>>>>>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved
>>>>>>>>>>>>> op-version: 40100
>>>>>>>>>>>>> [2019-01-15 17:50:14.973931] I [MSGID: 106544]
>>>>>>>>>>>>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID:
>>>>>>>>>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d
>>>>>>>>>>>>> [2019-01-15 17:50:15.046620] E [MSGID: 101032]
>>>>>>>>>>>>> [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to
>>>>>>>>>>>>> /var/lib/glusterd/vols/gfs-tst/bricks/IP.3:-media-disk3-brick3. [No such
>>>>>>>>>>>>> file or directory]
>>>>>>>>>>>>> [2019-01-15 17:50:15.046685] E [MSGID: 106201]
>>>>>>>>>>>>> [glusterd-store.c:3384:glusterd_store_retrieve_volumes] 0-management:
>>>>>>>>>>>>> Unable to restore volume: gfs-tst
>>>>>>>>>>>>> [2019-01-15 17:50:15.046718] E [MSGID: 101019]
>>>>>>>>>>>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume
>>>>>>>>>>>>> 'management' failed, review your volfile again
>>>>>>>>>>>>> [2019-01-15 17:50:15.046732] E [MSGID: 101066]
>>>>>>>>>>>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator
>>>>>>>>>>>>> failed
>>>>>>>>>>>>> [2019-01-15 17:50:15.046741] E [MSGID: 101176]
>>>>>>>>>>>>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed
>>>>>>>>>>>>> [2019-01-15 17:50:15.047171] W
>>>>>>>>>>>>> [glusterfsd.c:1514:cleanup_and_exit]
>>>>>>>>>>>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> In detail, I am trying to simulate a situation where the volume
>>>>>>>>>>>>> stopped abnormally and the entire cluster was restarted with
>>>>>>>>>>>>> some disks missing.
>>>>>>>>>>>>>
>>>>>>>>>>>>> My test cluster is set up with 3 nodes, each with four disks,
>>>>>>>>>>>>> and I have set up a volume with disperse 4+2.
>>>>>>>>>>>>> On node-3, 2 disks have failed; to replace them I shut down
>>>>>>>>>>>>> all systems.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Below are the steps I performed.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. Unmounted the volume on the client machine
>>>>>>>>>>>>> 2. Shut down all systems by running `shutdown -h now` (without
>>>>>>>>>>>>> stopping the volume or the service)
>>>>>>>>>>>>> 3. Replaced the faulty disk in node-3
>>>>>>>>>>>>> 4. Powered on all systems
>>>>>>>>>>>>> 5. Formatted the replaced drives and mounted all drives
>>>>>>>>>>>>> 6. Started the glusterd service on all nodes (success)
>>>>>>>>>>>>> 7. Ran the `volume status` command from node-3
>>>>>>>>>>>>> output : [2019-01-15 16:52:17.718422]  : v status : FAILED :
>>>>>>>>>>>>> Staging failed on 0083ec0c-40bf-472a-a128-458924e56c96. Please check log
>>>>>>>>>>>>> file for details.
>>>>>>>>>>>>> 8. Ran the `volume start gfs-tst` command from node-3
>>>>>>>>>>>>> output : [2019-01-15 16:53:19.410252]  : v start gfs-tst :
>>>>>>>>>>>>> FAILED : Volume gfs-tst already started
>>>>>>>>>>>>>
>>>>>>>>>>>>> 9. Ran `gluster v status` on another node; it shows all bricks
>>>>>>>>>>>>> available but the 'self-heal daemon' not running
>>>>>>>>>>>>> @gfstst-node2:~$ sudo gluster v status
>>>>>>>>>>>>> Status of volume: gfs-tst
>>>>>>>>>>>>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>>>>>>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>>>>>>> Brick IP.2:/media/disk1/brick1          49152     0          Y       1517
>>>>>>>>>>>>> Brick IP.4:/media/disk1/brick1          49152     0          Y       1668
>>>>>>>>>>>>> Brick IP.2:/media/disk2/brick2          49153     0          Y       1522
>>>>>>>>>>>>> Brick IP.4:/media/disk2/brick2          49153     0          Y       1678
>>>>>>>>>>>>> Brick IP.2:/media/disk3/brick3          49154     0          Y       1527
>>>>>>>>>>>>> Brick IP.4:/media/disk3/brick3          49154     0          Y       1677
>>>>>>>>>>>>> Brick IP.2:/media/disk4/brick4          49155     0          Y       1541
>>>>>>>>>>>>> Brick IP.4:/media/disk4/brick4          49155     0          Y       1683
>>>>>>>>>>>>> Self-heal Daemon on localhost               N/A       N/A        Y       2662
>>>>>>>>>>>>> Self-heal Daemon on IP.4                N/A       N/A        Y       2786
>>>>>>>>>>>>>
>>>>>>>>>>>>> 10. The output above says 'volume already started', so I ran
>>>>>>>>>>>>> the `reset-brick` command:
>>>>>>>>>>>>>    v reset-brick gfs-tst IP.3:/media/disk3/brick3 IP.3:/media/disk3/brick3 commit force
>>>>>>>>>>>>>
>>>>>>>>>>>>> output : [2019-01-15 16:57:37.916942]  : v reset-brick gfs-tst
>>>>>>>>>>>>> IP.3:/media/disk3/brick3 IP.3:/media/disk3/brick3 commit force : FAILED :
>>>>>>>>>>>>> /media/disk3/brick3 is already part of a volume
>>>>>>>>>>>>>
>>>>>>>>>>>>> 11. The reset-brick command did not work, so I tried stopping
>>>>>>>>>>>>> the volume and starting it with the force option
>>>>>>>>>>>>> output : [2019-01-15 17:01:04.570794]  : v start gfs-tst force
>>>>>>>>>>>>> : FAILED : Pre-validation failed on localhost. Please check log file for
>>>>>>>>>>>>> details
>>>>>>>>>>>>>
>>>>>>>>>>>>> 12. I then stopped the service on all nodes and tried starting
>>>>>>>>>>>>> it again. On every node except node-3 the service started
>>>>>>>>>>>>> successfully without any issues.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On node-3 I receive the following message:
>>>>>>>>>>>>>
>>>>>>>>>>>>> sudo service glusterd start
>>>>>>>>>>>>> * Starting glusterd service glusterd          [fail]
>>>>>>>>>>>>> /usr/local/sbin/glusterd: option requires an argument -- 'f'
>>>>>>>>>>>>> Try `glusterd --help' or `glusterd --usage' for more
>>>>>>>>>>>>> information.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 13. Checking the glusterd log file, I found that the OS drive
>>>>>>>>>>>>> was running out of space
>>>>>>>>>>>>> output : [2019-01-15 16:51:37.210792] W [MSGID: 101012]
>>>>>>>>>>>>> [store.c:372:gf_store_save_value] 0-management: fflush failed. [No space
>>>>>>>>>>>>> left on device]
>>>>>>>>>>>>> [2019-01-15 16:51:37.210874] E [MSGID: 106190]
>>>>>>>>>>>>> [glusterd-store.c:1058:glusterd_volume_exclude_options_write] 0-management:
>>>>>>>>>>>>> Unable to write volume values for gfs-tst
>>>>>>>>>>>>>
>>>>>>>>>>>>> 14. I cleared some space on the OS drive, but the service still
>>>>>>>>>>>>> does not start. Below is the error logged in glusterd.log:
>>>>>>>>>>>>>
>>>>>>>>>>>>> [2019-01-15 17:50:13.956053] I [MSGID: 100030]
>>>>>>>>>>>>> [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running
>>>>>>>>>>>>> /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p
>>>>>>>>>>>>> /var/run/glusterd.pid)
>>>>>>>>>>>>> [2019-01-15 17:50:13.960131] I [MSGID: 106478]
>>>>>>>>>>>>> [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors
>>>>>>>>>>>>> set to 65536
>>>>>>>>>>>>> [2019-01-15 17:50:13.960193] I [MSGID: 106479]
>>>>>>>>>>>>> [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working
>>>>>>>>>>>>> directory
>>>>>>>>>>>>> [2019-01-15 17:50:13.960212] I [MSGID: 106479]
>>>>>>>>>>>>> [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file
>>>>>>>>>>>>> working directory
>>>>>>>>>>>>> [2019-01-15 17:50:13.964437] W [MSGID: 103071]
>>>>>>>>>>>>> [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event
>>>>>>>>>>>>> channel creation failed [No such device]
>>>>>>>>>>>>> [2019-01-15 17:50:13.964474] W [MSGID: 103055]
>>>>>>>>>>>>> [rdma.c:4938:init] 0-rdma.management: Failed to initialize IB Device
>>>>>>>>>>>>> [2019-01-15 17:50:13.964491] W
>>>>>>>>>>>>> [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma'
>>>>>>>>>>>>> initialization failed
>>>>>>>>>>>>> [2019-01-15 17:50:13.964560] W
>>>>>>>>>>>>> [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create
>>>>>>>>>>>>> listener, initing the transport failed
>>>>>>>>>>>>> [2019-01-15 17:50:13.964579] E [MSGID: 106244]
>>>>>>>>>>>>> [glusterd.c:1764:init] 0-management: creation of 1 listeners failed,
>>>>>>>>>>>>> continuing with succeeded transport
>>>>>>>>>>>>> [2019-01-15 17:50:14.967681] I [MSGID: 106513]
>>>>>>>>>>>>> [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved
>>>>>>>>>>>>> op-version: 40100
>>>>>>>>>>>>> [2019-01-15 17:50:14.973931] I [MSGID: 106544]
>>>>>>>>>>>>> [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID:
>>>>>>>>>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d
>>>>>>>>>>>>> [2019-01-15 17:50:15.046620] E [MSGID: 101032]
>>>>>>>>>>>>> [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to
>>>>>>>>>>>>> /var/lib/glusterd/vols/gfs-tst/bricks/IP.3:-media-disk3-brick3. [No such
>>>>>>>>>>>>> file or directory]
>>>>>>>>>>>>> [2019-01-15 17:50:15.046685] E [MSGID: 106201]
>>>>>>>>>>>>> [glusterd-store.c:3384:glusterd_store_retrieve_volumes] 0-management:
>>>>>>>>>>>>> Unable to restore volume: gfs-tst
>>>>>>>>>>>>> [2019-01-15 17:50:15.046718] E [MSGID: 101019]
>>>>>>>>>>>>> [xlator.c:720:xlator_init] 0-management: Initialization of volume
>>>>>>>>>>>>> 'management' failed, review your volfile again
>>>>>>>>>>>>> [2019-01-15 17:50:15.046732] E [MSGID: 101066]
>>>>>>>>>>>>> [graph.c:367:glusterfs_graph_init] 0-management: initializing translator
>>>>>>>>>>>>> failed
>>>>>>>>>>>>> [2019-01-15 17:50:15.046741] E [MSGID: 101176]
>>>>>>>>>>>>> [graph.c:738:glusterfs_graph_activate] 0-graph: init failed
>>>>>>>>>>>>> [2019-01-15 17:50:15.047171] W
>>>>>>>>>>>>> [glusterfsd.c:1514:cleanup_and_exit]
>>>>>>>>>>>>> (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52]
>>>>>>>>>>>>> -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41]
>>>>>>>>>>>>> -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-:
>>>>>>>>>>>>> received signum (-1), shutting down
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> 15. On another node, `volume status` still shows the node3
>>>>>>>>>>>>> bricks as live, but 'peer status' shows node-3 as disconnected
>>>>>>>>>>>>>
>>>>>>>>>>>>> @gfstst-node2:~$ sudo gluster v status
>>>>>>>>>>>>> Status of volume: gfs-tst
>>>>>>>>>>>>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>>>>>>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>>>>>>> Brick IP.2:/media/disk1/brick1          49152     0          Y       1517
>>>>>>>>>>>>> Brick IP.4:/media/disk1/brick1          49152     0          Y       1668
>>>>>>>>>>>>> Brick IP.2:/media/disk2/brick2          49153     0          Y       1522
>>>>>>>>>>>>> Brick IP.4:/media/disk2/brick2          49153     0          Y       1678
>>>>>>>>>>>>> Brick IP.2:/media/disk3/brick3          49154     0          Y       1527
>>>>>>>>>>>>> Brick IP.4:/media/disk3/brick3          49154     0          Y       1677
>>>>>>>>>>>>> Brick IP.2:/media/disk4/brick4          49155     0          Y       1541
>>>>>>>>>>>>> Brick IP.4:/media/disk4/brick4          49155     0          Y       1683
>>>>>>>>>>>>> Self-heal Daemon on localhost           N/A       N/A        Y       2662
>>>>>>>>>>>>> Self-heal Daemon on IP.4                N/A       N/A        Y       2786
>>>>>>>>>>>>>
>>>>>>>>>>>>> Task Status of Volume gfs-tst
>>>>>>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>>>>>>> There are no active volume tasks
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> root@gfstst-node2:~$ sudo gluster pool list
>>>>>>>>>>>>> UUID                                    Hostname        State
>>>>>>>>>>>>> d6bf51a7-c296-492f-8dac-e81efa9dd22d    IP.3        Disconnected
>>>>>>>>>>>>> c1cbb58e-3ceb-4637-9ba3-3d28ef20b143    IP.4        Connected
>>>>>>>>>>>>> 0083ec0c-40bf-472a-a128-458924e56c96    localhost       Connected
>>>>>>>>>>>>>
>>>>>>>>>>>>> root@gfstst-node2:~$ sudo gluster peer status
>>>>>>>>>>>>> Number of Peers: 2
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hostname: IP.3
>>>>>>>>>>>>> Uuid: d6bf51a7-c296-492f-8dac-e81efa9dd22d
>>>>>>>>>>>>> State: Peer in Cluster (Disconnected)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hostname: IP.4
>>>>>>>>>>>>> Uuid: c1cbb58e-3ceb-4637-9ba3-3d28ef20b143
>>>>>>>>>>>>> State: Peer in Cluster (Connected)
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> regards
>>>>>>>>>>>>> Amudhan