<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div class="moz-cite-prefix">Am 12.08.2022 um 17:12 schrieb Ilias
Chasapakis forumZFD:<br>
</div>
<blockquote type="cite"
cite="mid:41bf4390-aa77-ee49-d1b6-61944abd84ea@forumZFD.de">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<p>Dear fellow gluster users,</p>
<p>we are facing a problem with our replica 3 setup. Glusterfs
version is 9.2.<br>
</p>
<p>We have a problem with a directory that is in split-brain and
we cannot manage to heal with:</p>
<blockquote type="cite">
<p>gluster volume heal gfsVol split-brain latest-mtime /folder</p>
</blockquote>
<p>The command throws the following error: "failed:Transport
endpoint is not connected." <br>
</p>
<p>So the split-brain directory entry remains, the whole
healing process does not complete, and other entries get stuck.<br>
</p>
<p>I saw there is a python script available <a target="_blank"
class="c-link moz-txt-link-freetext"
data-stringify-link="https://github.com/joejulian/glusterfs-splitbrain"
data-sk="tooltip_parent"
href="https://github.com/joejulian/glusterfs-splitbrain"
rel="noopener noreferrer" tabindex="-1"
data-remove-tab-index="true" moz-do-not-send="true">https://github.com/joejulian/glusterfs-splitbrain</a>
Would that be a good solution to try? To be honest we are a bit
concerned with deleting the gfid and the files from the brick
manually as it seems it can create inconsistencies and break
things... I can of course give you more information about our
setup and situation, but if you already have some tip, that
would be fantastic.</p>
</blockquote>
<p>You could at least verify what's going on: go to your brick roots
and list /folder on each. You have 3n bricks in n replica
sets. Find the replica set where you can spot a difference; it's
most likely a file or directory that's missing or different. If
it's a file, run ls -ain on that file on each brick in the replica
set. It reports an inode number. Then run find .glusterfs -inum
with that number from the brick root. You'll likely see that you
have different gfid-files on the bricks.</p>
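<p>A rough sketch of that check, with placeholder paths (assume a
brick root of /data/brick1/gfsVol and a differing file called
suspect.file; substitute your own layout):</p>
<pre># on every brick of the affected replica set, from the brick root
cd /data/brick1/gfsVol
ls -la folder/                 # compare the listings across the bricks

# for a file that differs, note its inode number on each brick
ls -ain folder/suspect.file

# find the gfid hard link pointing at that inode (replace 123456
# with the inode number reported above)
find .glusterfs -inum 123456</pre>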
<p>To fix the problem, you have to help gluster along by cleaning up
the mess. This is completely "do it at your own risk, it worked
for me, ymmv": make a copy (cp, not mv!) of the version of the file
you want to keep, somewhere outside the volume. On each brick in
the replica set, delete the gfid-file and the data file. Trigger a
heal on the volume and verify that you can access the path in
question through the glusterfs mount. Then copy your salvaged file
back in through the glusterfs mount.</p>
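<p>Again purely as a sketch and at your own risk, with the same
placeholder paths as above and /mnt/gfsVol standing in for your
glusterfs mount point:</p>
<pre># 1. save the copy you want to keep somewhere outside the volume (cp, not mv!)
cp /data/brick1/gfsVol/folder/suspect.file /root/salvage/

# 2. on EACH brick of the replica set, remove the data file and its gfid link
rm /data/brick1/gfsVol/folder/suspect.file
rm /data/brick1/gfsVol/.glusterfs/xx/yy/...    # the path reported by find above

# 3. trigger a heal and check that the path is reachable again
gluster volume heal gfsVol
ls /mnt/gfsVol/folder/

# 4. copy the salvaged file back in through the glusterfs mount
cp /root/salvage/suspect.file /mnt/gfsVol/folder/</pre>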
<p>We had this happening quite often on a heavily loaded glusterfs
shared filesystem that held a mail spool. There would be parallel
accesses trying to mv files, and sometimes we'd end up with
mismatched data on the bricks of the replica set. I've reported
this on GitHub, but apparently it wasn't seen as a serious
problem. We've moved on to CephFS now. That sure has bugs, too,
but hopefully not as aggravating.<br>
</p>
<pre class="moz-signature" cols="72">MfG,
i.A. Thomas Bätzler
--
BRINGE Informationstechnik GmbH
Zur Seeplatte 12
D-76228 Karlsruhe
Germany
Phone: +49 721 94246-0
Phone: +49 171 5438457
Fax: +49 721 94246-66
Web: <a class="moz-txt-link-freetext" href="http://www.bringe.de/">http://www.bringe.de/</a>
Managing Director: Dipl.-Ing. (FH) Martin Bringe
VAT ID: DE812936645, HRB 108943 Mannheim</pre>
</body>
</html>