Hi,

Have you checked for any file system errors on the brick mount point?

I once ran into weird I/O errors, and xfs_repair fixed the issue.

What about the heal? Does it report any pending heals?

On Feb 15, 2018 14:20, "Dave Sherohman" <dave@sherohman.org> wrote:

Well, it looks like I've stumped the list, so I did a bit of additional
digging myself:

azathoth replicates with yog-sothoth, so I compared their brick
directories. `ls -R /var/local/brick0/data | md5sum` gives the same
result on both servers, so the filenames are identical in both bricks.
However, `du -s /var/local/brick0/data` shows that azathoth has about 3G
more data (445G vs 442G) than yog.
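
To narrow down where that ~3G lives, one quick sketch (assuming GNU find
on both servers; /tmp/brick-sizes.txt is just an arbitrary output path)
is to dump per-file sizes on each brick and then diff the two lists:

    find /var/local/brick0/data -type f -printf '%s %p\n' | sort -k2 > /tmp/brick-sizes.txt

Keep in mind that `du` counts allocated blocks, so sparse VM images can
show different usage on the two bricks even when their contents match.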

This seems consistent with my assumption that the problem is on
yog-sothoth (everything is fine with only azathoth; there are problems
with only yog-sothoth) and I am reminded that a few weeks ago,
yog-sothoth was offline for 4-5 days, although it should have been
brought back up-to-date once it came back online.

So, assuming that the issue is stale/missing data on yog-sothoth, is
there a way to force gluster to do a full refresh of the data from
azathoth's brick to yog-sothoth's brick? I would have expected running
heal and/or rebalance to do that sort of thing, but I've run them both
(with and without fix-layout on the rebalance) and the problem persists.
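
If it wasn't the full variant that was run, it may be worth trying the
explicit full self-heal, which crawls everything rather than only the
entries already marked as needing heal (volume name taken from your
status output):

    gluster volume heal palantir full
    gluster volume heal palantir info

The second command should then show whether any entries are still
pending per brick.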

If there isn't a way to force a refresh, how risky would it be to kill
gluster on yog-sothoth, wipe everything from /var/local/brick0, and then
re-add it to the cluster as if I were replacing a physically failed
disk? Seems like that should work in principle, but it feels dangerous
to wipe the partition and rebuild, regardless.
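
For what it's worth, if it does come to rebuilding that brick, the usual
route is replace-brick onto a fresh, empty directory rather than wiping
the existing path in place. A sketch only (the new directory name here
is arbitrary); on yog-sothoth, something like:

    mkdir -p /var/local/brick0/data-new
    gluster volume replace-brick palantir \
        yog-sothoth:/var/local/brick0/data \
        yog-sothoth:/var/local/brick0/data-new commit force
    gluster volume heal palantir full

Self-heal should then repopulate the new brick from azathoth.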

On Tue, Feb 13, 2018 at 07:33:44AM -0600, Dave Sherohman wrote:
> I'm using gluster for a virt-store with 3x2 distributed/replicated
> servers for 16 qemu/kvm/libvirt virtual machines using image files
> stored in gluster and accessed via libgfapi. Eight of these disk images
> are standalone, while the other eight are qcow2 images which all share a
> single backing file.
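
Since all eight of the shared-backing-image VMs fail together, it could
be worth checking which backing file the overlays point at and which
replica pair actually holds it. A sketch, with placeholder paths on the
fuse mount:

    qemu-img info --backing-chain /mnt/palantir/images/some-overlay.qcow2
    getfattr -n trusted.glusterfs.pathinfo /mnt/palantir/images/the-backing-file.qcow2

The pathinfo xattr lists the bricks that hold a given file, which would
show whether the shared backing file sits on the azathoth/yog-sothoth
pair.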
>
> For the most part, this is all working very well. However, one of the
> gluster servers (azathoth) causes three of the standalone VMs and all 8
> of the shared-backing-image VMs to fail if it goes down. Any of the
> other gluster servers can go down with no problems; only azathoth causes
> issues.
>
> In addition, the kvm hosts have the gluster volume fuse mounted and one
> of them (out of five) detects an error on the gluster volume and puts
> the fuse mount into read-only mode if azathoth goes down. libgfapi
> connections to the VM images continue to work normally from this host
> despite this, and the other four kvm hosts are unaffected.
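
As far as I know, the fuse client only uses the server named in the
mount source to fetch the volume layout at mount time and then talks to
all of the bricks directly, so the mount source itself shouldn't create
a hard dependency on azathoth. It can also be given fallback volfile
servers; a sketch, with a placeholder mount point:

    mount -t glusterfs -o backup-volfile-servers=yog-sothoth:cthulhu \
        azathoth:/palantir /mnt/palantir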
>
> It initially seemed relevant that I have the libgfapi URIs specified as
> gluster://azathoth/..., but I've tried changing them to make the initial
> connection via other gluster hosts and it had no effect on the problem.
> Losing azathoth still took them out.
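
One way to sanity-check access through a different server, independently
of libvirt, is to point qemu-img at the same image over gluster
(assuming qemu-img on the KVM hosts is built with gluster support; the
path after the volume name is a placeholder):

    qemu-img info gluster://yog-sothoth/palantir/images/some-image.qcow2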
>
> In addition to changing the mount URI, I've also manually run a heal and
> rebalance on the volume, enabled the bitrot daemons (then turned them
> back off a week later, since they reported no activity in that time),
> and copied one of the standalone images to a new file in case it was a
> problem with the file itself. As far as I can tell, none of these
> attempts changed anything.
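
Two more heal views that are sometimes useful here, in case anything is
stuck in split-brain on one replica pair:

    gluster volume heal palantir info split-brain
    gluster volume heal palantir statistics heal-count

The first lists entries gluster considers split-brain, and the second
shows how many entries are still waiting to be healed on each brick.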
>
> So I'm at a loss. Is this a known type of problem? If so, how do I fix
> it? If not, what's the next step to troubleshoot it?
>
>
> # gluster --version
> glusterfs 3.8.8 built on Jan 11 2017 14:07:11
> Repository revision: git://git.gluster.com/glusterfs.git
>
> # gluster volume status
> Status of volume: palantir
> Gluster process                              TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick saruman:/var/local/brick0/data         49154     0          Y       10690
> Brick gandalf:/var/local/brick0/data         49155     0          Y       18732
> Brick azathoth:/var/local/brick0/data        49155     0          Y       9507
> Brick yog-sothoth:/var/local/brick0/data     49153     0          Y       39559
> Brick cthulhu:/var/local/brick0/data         49152     0          Y       2682
> Brick mordiggian:/var/local/brick0/data      49152     0          Y       39479
> Self-heal Daemon on localhost                N/A       N/A        Y       9614
> Self-heal Daemon on saruman.lub.lu.se        N/A       N/A        Y       15016
> Self-heal Daemon on cthulhu.lub.lu.se        N/A       N/A        Y       9756
> Self-heal Daemon on gandalf.lub.lu.se        N/A       N/A        Y       5962
> Self-heal Daemon on mordiggian.lub.lu.se     N/A       N/A        Y       8295
> Self-heal Daemon on yog-sothoth.lub.lu.se    N/A       N/A        Y       7588
>
> Task Status of Volume palantir
> ------------------------------------------------------------------------------
> Task                 : Rebalance
> ID                   : c38e11fe-fe1b-464d-b9f5-1398441cc229
> Status               : completed
>
>
> --
> Dave Sherohman

--
Dave Sherohman
_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users