<div dir="ltr"><div>Filed a bug report. I was not able to reproduce the issue on x86 hardware.<br></div><div dir="ltr"><br></div><div dir="ltr"><a href="https://bugzilla.redhat.com/show_bug.cgi?id=1811373" target="_blank">https://bugzilla.redhat.com/show_bug.cgi?id=1811373</a></div><div><br></div><div><br></div><div><br></div><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Mar 2, 2020 at 1:58 AM Strahil Nikolov <<a href="mailto:hunter86_bg@yahoo.com" target="_blank">hunter86_bg@yahoo.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On March 2, 2020 3:29:06 AM GMT+02:00, Fox <<a href="mailto:foxxz.net@gmail.com" target="_blank">foxxz.net@gmail.com</a>> wrote:<br>
>The brick is mounted. However, glusterfsd crashes shortly after startup.<br>
>This happens on any host that needs to heal a dispersed volume.<br>
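><br>
>(For what it's worth, the backtrace that the crashing glusterfsd leaves behind should be in the brick log; a minimal way to pull it out, assuming the default log location and a log file name derived from the brick path:)<br>
><br>
># the crash handler writes "signal received" plus a backtrace into the brick log<br>
>grep -A 30 'signal received' /var/log/glusterfs/bricks/exports-sda-brick1-disp1.log<br>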
><br>
>I spent today doing a clean rebuild of the cluster: a clean install of Ubuntu 18 and Gluster 7.2. Create a dispersed volume, then reboot one of the cluster members while the volume is up and online. When that cluster member comes back, it cannot heal.<br>
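><br>
>(Roughly, the reproduction looks like this; the disperse/redundancy counts are just for illustration, I'm assuming a 12-brick 8+4 layout here:)<br>
><br>
>gluster volume create disp1 disperse 12 redundancy 4 gluster{01..12}:/exports/sda/brick1/disp1<br>
>gluster volume start disp1<br>
># mount the volume from a client and write some files, then reboot one member (e.g. gluster12)<br>
># once it is back, the dispersed volume never finishes healing:<br>
>gluster volume heal disp1 info summary<br>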
><br>
>I was able to replicate this behavior with Raspberry Pis running Raspbian and Gluster 5, so it looks like it's not limited to the specific hardware or version of Gluster I'm using, but perhaps to the ARM architecture as a whole.<br>
><br>
>Thank you for your help. Aside from not using dispersed volumes, I don't think there is much more I can do. Submit a bug report, I guess :)<br>
><br>
><br>
><br>
><br>
><br>
>On Sun, Mar 1, 2020 at 12:02 PM Strahil Nikolov <<a href="mailto:hunter86_bg@yahoo.com" target="_blank">hunter86_bg@yahoo.com</a>> wrote:<br>
><br>
>> On March 1, 2020 6:22:59 PM GMT+02:00, Fox <<a href="mailto:foxxz.net@gmail.com" target="_blank">foxxz.net@gmail.com</a>> wrote:<br>
>> >Yes, the brick was up and running, and I can see files on the brick created by connected clients up until the node was rebooted.<br>
>> ><br>
>> >This is what the volume status looks like after gluster12 was rebooted. Prior to the reboot it showed as online and was otherwise operational.<br>
>> ><br>
>> >root@gluster01:~# gluster volume status<br>
>> >Status of volume: disp1<br>
>> >Gluster process                             TCP Port  RDMA Port  Online  Pid<br>
>> >------------------------------------------------------------------------------<br>
>> >Brick gluster01:/exports/sda/brick1/disp1   49152     0          Y       3931<br>
>> >Brick gluster02:/exports/sda/brick1/disp1   49152     0          Y       2755<br>
>> >Brick gluster03:/exports/sda/brick1/disp1   49152     0          Y       2787<br>
>> >Brick gluster04:/exports/sda/brick1/disp1   49152     0          Y       2780<br>
>> >Brick gluster05:/exports/sda/brick1/disp1   49152     0          Y       2764<br>
>> >Brick gluster06:/exports/sda/brick1/disp1   49152     0          Y       2760<br>
>> >Brick gluster07:/exports/sda/brick1/disp1   49152     0          Y       2740<br>
>> >Brick gluster08:/exports/sda/brick1/disp1   49152     0          Y       2729<br>
>> >Brick gluster09:/exports/sda/brick1/disp1   49152     0          Y       2772<br>
>> >Brick gluster10:/exports/sda/brick1/disp1   49152     0          Y       2791<br>
>> >Brick gluster11:/exports/sda/brick1/disp1   49152     0          Y       2026<br>
>> >Brick gluster12:/exports/sda/brick1/disp1   N/A       N/A        N       N/A<br>
>> >Self-heal Daemon on localhost               N/A       N/A        Y       3952<br>
>> >Self-heal Daemon on gluster03               N/A       N/A        Y       2808<br>
>> >Self-heal Daemon on gluster02               N/A       N/A        Y       2776<br>
>> >Self-heal Daemon on gluster06               N/A       N/A        Y       2781<br>
>> >Self-heal Daemon on gluster07               N/A       N/A        Y       2761<br>
>> >Self-heal Daemon on gluster05               N/A       N/A        Y       2785<br>
>> >Self-heal Daemon on gluster08               N/A       N/A        Y       2750<br>
>> >Self-heal Daemon on gluster04               N/A       N/A        Y       2801<br>
>> >Self-heal Daemon on gluster09               N/A       N/A        Y       2793<br>
>> >Self-heal Daemon on gluster11               N/A       N/A        Y       2047<br>
>> >Self-heal Daemon on gluster10               N/A       N/A        Y       2812<br>
>> >Self-heal Daemon on gluster12               N/A       N/A        Y       542<br>
>> ><br>
>> >Task Status of Volume disp1<br>
>> >------------------------------------------------------------------------------<br>
>> >There are no active volume tasks<br>
>> ><br>
>> >On Sun, Mar 1, 2020 at 2:01 AM Strahil Nikolov <<a href="mailto:hunter86_bg@yahoo.com" target="_blank">hunter86_bg@yahoo.com</a>> wrote:<br>
>> ><br>
>> >> On March 1, 2020 6:08:31 AM GMT+02:00, Fox <<a href="mailto:foxxz.net@gmail.com" target="_blank">foxxz.net@gmail.com</a>> wrote:<br>
>> >> >I am using a dozen ODROID HC2 ARM systems, each with a single HD/brick, running Ubuntu 18 and GlusterFS 7.2 installed from the Gluster PPA.<br>
>> >> ><br>
>> >> >I can create a dispersed volume and use it. But if one of the cluster members ducks out, say gluster12 reboots, when it comes back online it shows as connected in the peer list, but using<br>
>> >> >gluster volume heal <volname> info summary<br>
>> >> ><br>
>> >> >It shows up as<br>
>> >> >Brick gluster12:/exports/sda/brick1/disp1<br>
>> >> >Status: Transport endpoint is not connected<br>
>> >> >Total Number of entries: -<br>
>> >> >Number of entries in heal pending: -<br>
>> >> >Number of entries in split-brain: -<br>
>> >> >Number of entries possibly healing: -<br>
>> >> ><br>
>> >> >Trying to force a full heal doesn't fix it. The cluster member otherwise works and heals for other non-disperse volumes, even while showing up as disconnected for the dispersed volume.<br>
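>> >> ><br>
>> >> >(By "force a full heal" I mean something along these lines, with the volume name as above:)<br>
>> >> >gluster volume heal disp1 full<br>
>> >> >gluster volume heal disp1 info summary<br>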
>> >> ><br>
>> >> >I have attached a terminal log of the volume creation and diagnostic output. Could this be an ARM-specific problem?<br>
>> >> ><br>
>> >> >I tested a similar setup on x86 virtual machines. They were able to heal a dispersed volume with no problem. One thing I see in the ARM logs that I don't see in the x86 logs is lots of this:<br>
>> >> >[2020-03-01 03:54:45.856769] W [MSGID: 122035] [ec-common.c:668:ec_child_select] 0-disp1-disperse-0: Executing operation with some subvolumes unavailable. (800). FOP : 'LOOKUP' failed on '(null)' with gfid 0d3c4cf3-e09c-4b9a-87d3-cdfc4f49b692<br>
>> >> >[2020-03-01 03:54:45.910203] W [MSGID: 122035] [ec-common.c:668:ec_child_select] 0-disp1-disperse-0: Executing operation with some subvolumes unavailable. (800). FOP : 'LOOKUP' failed on '(null)' with gfid 0d806805-81e4-47ee-a331-1808b34949bf<br>
>> >> >[2020-03-01 03:54:45.932734] I [rpc-clnt.c:1963:rpc_clnt_reconfig] 0-disp1-client-11: changing port to 49152 (from 0)<br>
>> >> >[2020-03-01 03:54:45.956803] W [MSGID: 122035] [ec-common.c:668:ec_child_select] 0-disp1-disperse-0: Executing operation with some subvolumes unavailable. (800). FOP : 'LOOKUP' failed on '(null)' with gfid d5768bad-7409-40f4-af98-4aef391d7ae4<br>
>> >> >[2020-03-01 03:54:46.000102] W [MSGID: 122035] [ec-common.c:668:ec_child_select] 0-disp1-disperse-0: Executing operation with some subvolumes unavailable. (800). FOP : 'LOOKUP' failed on '(null)' with gfid 216f5583-e1b4-49cf-bef9-8cd34617beaf<br>
>> >> >[2020-03-01 03:54:46.044184] W [MSGID: 122035] [ec-common.c:668:ec_child_select] 0-disp1-disperse-0: Executing operation with some subvolumes unavailable. (800). FOP : 'LOOKUP' failed on '(null)' with gfid 1b610b49-2d69-4ee6-a440-5d3edd6693d1<br>
>> >><br>
>> >> Hi,<br>
>> >><br>
>> >> Are you sure that the gluster brick on this node is up and running?<br>
>> >> What is the output of 'gluster volume status' on this system?<br>
>> >><br>
>> >> Best Regards,<br>
>> >> Strahil Nikolov<br>
>> >><br>
>><br>
>> This seems like the brick is down.<br>
>> Check with 'ps aux | grep glusterfsd | grep disp1' on 'gluster12'.<br>
>> Most probably it is down, and you need to verify that the brick is properly mounted.<br>
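>><br>
>> For example (assuming the brick filesystem is mounted under /exports/sda; adjust to your layout):<br>
>><br>
>> findmnt /exports/sda                  # is the brick filesystem mounted?<br>
>> gluster volume status disp1           # does glusterd report the brick as online?<br>
>> tail -n 50 /var/log/glusterfs/bricks/exports-sda-brick1-disp1.log   # why did it go down?<br>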
>><br>
>> Best Regards,<br>
>> Strahil Nikolov<br>
>><br>
<br>
Hi Fox,<br>
<br>
<br>
Submit a bug and provide a link on the mailing list (add gluster-devel in CC once you register for that).<br>
Most probably it's a small thing that can be easily fixed.<br>
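<br>
For the report, something along these lines is useful to attach (a sketch; 'disp1' is taken from your output, and statedumps land in /var/run/gluster by default):<br>
<br>
gluster volume info disp1<br>
gluster volume statedump disp1<br>
# plus the brick log with the crash backtrace from the affected node<br>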
<br>
Have you tried:<br>
gluster volume start <VOLNAME> force<br>
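<br>
For example (volume name taken from your output; on an already-started volume, 'start ... force' only respawns brick processes that are down and does not touch data):<br>
<br>
gluster volume start disp1 force<br>
gluster volume status disp1<br>
gluster volume heal disp1 info summary<br>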
<br>
Best Regards,<br>
Strahil Nikolov<br>
</blockquote></div></div>