<div dir="ltr"><div dir="ltr"><div>Hi,</div><div><br></div><div>In short, when I start the glusterd service on one server, I get the following error messages in its glusterd.log file.</div><div>What needs to be done?</div><div><br></div><div>Errors logged in glusterd.log:</div><div><br></div><div><span style="white-space:pre">        </span>[2019-01-15 17:50:13.956053] I [MSGID: 100030] [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p /var/run/glusterd.pid)</div><div><span style="white-space:pre">        </span>[2019-01-15 17:50:13.960131] I [MSGID: 106478] [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors set to 65536</div><div><span style="white-space:pre">        </span>[2019-01-15 17:50:13.960193] I [MSGID: 106479] [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working directory</div><div><span style="white-space:pre">        </span>[2019-01-15 17:50:13.960212] I [MSGID: 106479] [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file working directory</div><div><span style="white-space:pre">        </span>[2019-01-15 17:50:13.964437] W [MSGID: 103071] [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event channel creation failed [No such device]</div><div><span style="white-space:pre">        </span>[2019-01-15 17:50:13.964474] W [MSGID: 103055] [rdma.c:4938:init] 0-rdma.management: Failed to initialize IB Device</div><div><span style="white-space:pre">        </span>[2019-01-15 17:50:13.964491] W [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: &#39;rdma&#39; initialization failed</div><div><span style="white-space:pre">        </span>[2019-01-15 17:50:13.964560] W [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create listener, initing the transport failed</div><div><span style="white-space:pre">        </span>[2019-01-15 17:50:13.964579] E [MSGID: 106244] [glusterd.c:1764:init] 
0-management: creation of 1 listeners failed, continuing with succeeded transport</div><div><span style="white-space:pre">        </span>[2019-01-15 17:50:14.967681] I [MSGID: 106513] [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 40100</div><div><span style="white-space:pre">        </span>[2019-01-15 17:50:14.973931] I [MSGID: 106544] [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: d6bf51a7-c296-492f-8dac-e81efa9dd22d</div><div><span style="white-space:pre">        </span>[2019-01-15 17:50:15.046620] E [MSGID: 101032] [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/vols/gfs-tst/bricks/IP.3:-media-disk3-brick3. [No such file or directory]</div><div><span style="white-space:pre">        </span>[2019-01-15 17:50:15.046685] E [MSGID: 106201] [glusterd-store.c:3384:glusterd_store_retrieve_volumes] 0-management: Unable to restore volume: gfs-tst</div><div><span style="white-space:pre">        </span>[2019-01-15 17:50:15.046718] E [MSGID: 101019] [xlator.c:720:xlator_init] 0-management: Initialization of volume &#39;management&#39; failed, review your volfile again</div><div><span style="white-space:pre">        </span>[2019-01-15 17:50:15.046732] E [MSGID: 101066] [graph.c:367:glusterfs_graph_init] 0-management: initializing translator failed</div><div><span style="white-space:pre">        </span>[2019-01-15 17:50:15.046741] E [MSGID: 101176] [graph.c:738:glusterfs_graph_activate] 0-graph: init failed</div><div><span style="white-space:pre">        </span>[2019-01-15 17:50:15.047171] W [glusterfsd.c:1514:cleanup_and_exit] (--&gt;/usr/local/sbin/glusterd(glusterfs_volumes</div><div><br></div><div><br></div><div><br></div><div>In long, I am trying to simulate a situation. 
where a volume stopped abnormally and the </div><div>entire cluster was restarted with some disks missing.</div><div><br></div><div>My test cluster has 3 nodes with four disks each, and I have set up a volume with disperse 4+2. </div><div>On Node-3, 2 disks failed; to replace them I shut down all the systems.</div><div><br></div><div>Below are the steps I performed.</div><div><br></div><div>1. Unmounted the volume from the client machine.</div><div>2. Shut down all systems by running `shutdown -h now` (without stopping the volume or the glusterd service).</div><div>3. Replaced the faulty disks in Node-3.</div><div>4. Powered on all systems.</div><div>5. Formatted the replaced drives and mounted all drives.</div><div>6. Started the glusterd service on all nodes (success).</div><div>7. Ran the `volume status` command from node-3.</div><div><span style="white-space:pre">        </span>output : [2019-01-15 16:52:17.718422]  : v status : FAILED : Staging failed on 0083ec0c-40bf-472a-a128-458924e56c96. Please check log file for details.</div><div>8. Ran the `volume start gfs-tst` command from node-3.</div><div><span style="white-space:pre">        </span>output : [2019-01-15 16:53:19.410252]  : v start gfs-tst : FAILED : Volume gfs-tst already started</div><div><br></div><div>9. Running `gluster v status` on another node. 
showing all brick available but &#39;self-heal daemon&#39; not running</div><div><span style="white-space:pre">        </span>@gfstst-node2:~$ sudo gluster v status</div><div><span style="white-space:pre">        </span>Status of volume: gfs-tst</div><div><span style="white-space:pre">        </span>Gluster process                             TCP Port  RDMA Port  Online  Pid</div><div><span style="white-space:pre">        </span>------------------------------------------------------------------------------</div><div><span style="white-space:pre">        </span>Brick IP.2:/media/disk1/brick1          49152     0          Y       1517</div><div><span style="white-space:pre">        </span>Brick IP.4:/media/disk1/brick1          49152     0          Y       1668</div><div><span style="white-space:pre">        </span>Brick IP.2:/media/disk2/brick2          49153     0          Y       1522</div><div><span style="white-space:pre">        </span>Brick IP.4:/media/disk2/brick2          49153     0          Y       1678</div><div><span style="white-space:pre">        </span>Brick IP.2:/media/disk3/brick3          49154     0          Y       1527</div><div><span style="white-space:pre">        </span>Brick IP.4:/media/disk3/brick3          49154     0          Y       1677</div><div><span style="white-space:pre">        </span>Brick IP.2:/media/disk4/brick4          49155     0          Y       1541</div><div><span style="white-space:pre">        </span>Brick IP.4:/media/disk4/brick4          49155     0          Y       1683</div><div><span style="white-space:pre">        </span>Self-heal Daemon on localhost               N/A       N/A        Y       2662</div><div><span style="white-space:pre">        </span>Self-heal Daemon on IP.4                N/A       N/A        Y       2786</div><div><br></div><div>10. in the above output &#39;volume already started&#39;. 
So I ran the `reset-brick` command:</div><div>   v reset-brick gfs-tst IP.3:/media/disk3/brick3 IP.3:/media/disk3/brick3 commit force</div><div><br></div><div><span style="white-space:pre">        </span>output : [2019-01-15 16:57:37.916942]  : v reset-brick gfs-tst IP.3:/media/disk3/brick3 IP.3:/media/disk3/brick3 commit force : FAILED : /media/disk3/brick3 is already part of a volume </div><div><br></div><div>11. The reset-brick command did not work, so I tried stopping the volume and starting it with the force option. </div><div><span style="white-space:pre">        </span>output : [2019-01-15 17:01:04.570794]  : v start gfs-tst force : FAILED : Pre-validation failed on localhost. Please check log file for details</div><div><br></div><div>12. Stopped the service on all nodes and tried starting it again. On every node except node-3, the service started successfully without any issues.</div><div><br></div><div><span style="white-space:pre">        </span>On node-3 I receive the following message.</div><div><br></div><div><span style="white-space:pre">        </span>sudo service glusterd start</div><div><span style="white-space:pre">        </span> * Starting glusterd service glusterd                                                                                                                            [fail]</div><div><span style="white-space:pre">        </span>/usr/local/sbin/glusterd: option requires an argument -- &#39;f&#39;</div><div><span style="white-space:pre">        </span>Try `glusterd --help&#39; or `glusterd --usage&#39; for more information.</div><div><br></div><div>13. Checking the glusterd log file, I found that the OS drive was running out of space.</div><div><span style="white-space:pre">        </span>output : [2019-01-15 16:51:37.210792] W [MSGID: 101012] [store.c:372:gf_store_save_value] 0-management: fflush failed. 
[No space left on device]</div><div><span style="white-space:pre">                </span> [2019-01-15 16:51:37.210874] E [MSGID: 106190] [glusterd-store.c:1058:glusterd_volume_exclude_options_write] 0-management: Unable to write volume values for gfs-tst</div><div><br></div><div>14. Cleared some space on the OS drive, but the service still does not start. Below are the errors logged in glusterd.log:</div><div><br></div><div><span style="white-space:pre">        </span>[2019-01-15 17:50:13.956053] I [MSGID: 100030] [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p /var/run/glusterd.pid)</div><div><span style="white-space:pre">        </span>[2019-01-15 17:50:13.960131] I [MSGID: 106478] [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors set to 65536</div><div><span style="white-space:pre">        </span>[2019-01-15 17:50:13.960193] I [MSGID: 106479] [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working directory</div><div><span style="white-space:pre">        </span>[2019-01-15 17:50:13.960212] I [MSGID: 106479] [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file working directory</div><div><span style="white-space:pre">        </span>[2019-01-15 17:50:13.964437] W [MSGID: 103071] [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event channel creation failed [No such device]</div><div><span style="white-space:pre">        </span>[2019-01-15 17:50:13.964474] W [MSGID: 103055] [rdma.c:4938:init] 0-rdma.management: Failed to initialize IB Device</div><div><span style="white-space:pre">        </span>[2019-01-15 17:50:13.964491] W [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: &#39;rdma&#39; initialization failed</div><div><span style="white-space:pre">        </span>[2019-01-15 17:50:13.964560] W [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create listener, initing the transport 
failed</div><div><span style="white-space:pre">        </span>[2019-01-15 17:50:13.964579] E [MSGID: 106244] [glusterd.c:1764:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport</div><div><span style="white-space:pre">        </span>[2019-01-15 17:50:14.967681] I [MSGID: 106513] [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 40100</div><div><span style="white-space:pre">        </span>[2019-01-15 17:50:14.973931] I [MSGID: 106544] [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: d6bf51a7-c296-492f-8dac-e81efa9dd22d</div><div><span style="white-space:pre">        </span>[2019-01-15 17:50:15.046620] E [MSGID: 101032] [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/vols/gfs-tst/bricks/IP.3:-media-disk3-brick3. [No such file or directory]</div><div><span style="white-space:pre">        </span>[2019-01-15 17:50:15.046685] E [MSGID: 106201] [glusterd-store.c:3384:glusterd_store_retrieve_volumes] 0-management: Unable to restore volume: gfs-tst</div><div><span style="white-space:pre">        </span>[2019-01-15 17:50:15.046718] E [MSGID: 101019] [xlator.c:720:xlator_init] 0-management: Initialization of volume &#39;management&#39; failed, review your volfile again</div><div><span style="white-space:pre">        </span>[2019-01-15 17:50:15.046732] E [MSGID: 101066] [graph.c:367:glusterfs_graph_init] 0-management: initializing translator failed</div><div><span style="white-space:pre">        </span>[2019-01-15 17:50:15.046741] E [MSGID: 101176] [graph.c:738:glusterfs_graph_activate] 0-graph: init failed</div><div><span style="white-space:pre">        </span>[2019-01-15 17:50:15.047171] W [glusterfsd.c:1514:cleanup_and_exit] (--&gt;/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] --&gt;/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] --&gt;/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: received signum 
(-1), shutting down</div><div><br></div><div><br></div><div>15. On another node, `volume status` still shows the node-3 bricks as live, </div><div>     but `peer status` shows node-3 as disconnected.</div><div><br></div><div>@gfstst-node2:~$ sudo gluster v status</div><div>Status of volume: gfs-tst</div><div>Gluster process                             TCP Port  RDMA Port  Online  Pid</div><div>------------------------------------------------------------------------------</div><div>Brick IP.2:/media/disk1/brick1          49152     0          Y       1517</div><div>Brick IP.4:/media/disk1/brick1          49152     0          Y       1668</div><div>Brick IP.2:/media/disk2/brick2          49153     0          Y       1522</div><div>Brick IP.4:/media/disk2/brick2          49153     0          Y       1678</div><div>Brick IP.2:/media/disk3/brick3          49154     0          Y       1527</div><div>Brick IP.4:/media/disk3/brick3          49154     0          Y       1677</div><div>Brick IP.2:/media/disk4/brick4          49155     0          Y       1541</div><div>Brick IP.4:/media/disk4/brick4          49155     0          Y       1683</div><div>Self-heal Daemon on localhost           N/A       N/A        Y       2662</div><div>Self-heal Daemon on IP.4                N/A       N/A        Y       2786</div><div><br></div><div>Task Status of Volume gfs-tst</div><div>------------------------------------------------------------------------------</div><div>There are no active volume tasks</div><div><br></div><div><br></div><div>root@gfstst-node2:~$ sudo gluster pool list</div><div>UUID                                    Hostname        State</div><div>d6bf51a7-c296-492f-8dac-e81efa9dd22d    IP.3        Disconnected</div><div>c1cbb58e-3ceb-4637-9ba3-3d28ef20b143    IP.4        Connected</div><div>0083ec0c-40bf-472a-a128-458924e56c96    localhost       Connected</div><div><br></div><div>root@gfstst-node2:~$ sudo gluster peer status</div><div>Number of Peers: 
2</div><div><br></div><div>Hostname: IP.3</div><div>Uuid: d6bf51a7-c296-492f-8dac-e81efa9dd22d</div><div>State: Peer in Cluster (Disconnected)</div><div><br></div><div>Hostname: IP.4</div><div>Uuid: c1cbb58e-3ceb-4637-9ba3-3d28ef20b143</div><div>State: Peer in Cluster (Connected)</div><div><br></div><div><br></div><div>regards</div><div>Amudhan</div></div></div>