[Gluster-users] Added bricks with wrong name and now need to remove them without destroying volume.

Wed Feb 27 17:37:12 UTC 2019

Yes, I broke it. Now I need help fixing it.

I have an existing Gluster volume spread over 16 bricks on 4 servers, with
1.5P of space, 49% of it currently used.  I added 4 more bricks on an
additional server, as we expect a large influx of data in the next 4 to 6
months.  The system had been established by my predecessor, who is no
longer here.

First solo addition of bricks to gluster.

Everything went smoothly until “gluster volume add-brick Volume
newserver:/bricks/dataX/vol.name”

(I don’t have the exact response, as I worked on this for almost 5 hours
last night.) Unable to add-brick as “it is already mounted”, or something
to that effect.

I double-checked my instructions and the names of the bricks.  Everything
seemed correct.  I tried to add again, adding “force”.  Again, “unable to
add-brick”.

Because of the keyword (in my mind) “mounted” in the error, I checked
/etc/fstab, where the name of the mount point is simply /bricks/dataX.

This convention was the same across all servers, so I thought I had
discovered an error in my notes and changed the name to
newserver:/bricks/dataX.

Still had to use force, but the bricks were added.
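
For reference, the naming convention of the existing bricks can be read
from the output of gluster volume info, where each brick appears on a line
of the form "BrickN: host:/path". Below is a minimal sketch that pulls out
just those entries so old and new bricks can be compared side by side. The
sample text and the names server1 and data1 are hypothetical stand-ins, not
output from this volume.

```shell
# Extract the host:/path entries from `gluster volume info` output on stdin.
list_bricks() {
    awk '/^Brick[0-9]+:/ { print $2 }'
}

# Hypothetical sample input, NOT real output from this volume.
sample='Volume Name: vol.name
Bricks:
Brick1: server1:/bricks/data1/vol.name
Brick2: newserver:/bricks/dataX'

printf '%s\n' "$sample" | list_bricks
```

On a server this would be used as: gluster volume info vol.name | list_bricks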

Restarted the gluster volume vol.name. No errors.

Rebooted, but /vol.name did not mount on reboot as /etc/fstab instructs.
So I attempted to mount manually and discovered I had a big mess on my
hands: “Transport endpoint not connected”, in addition to other messages.

I discovered an issue between the certificates and the auth.ssl-allow list
because of the hostname of the new server.  I made the correction and
/vol.name mounted.

However, df -h indicated the 4 new bricks were not being seen, as 400T was
missing from what should have been available.

Thankfully, I could add something to vol.name on one machine and see it on
another machine and I wrongly assumed the volume was operational, even if
the new bricks were not recognized.

So I tried to correct the main issue with:

    gluster volume remove vol.name newserver/bricks/dataX/

I received a prompt saying data will be migrated before the brick is
removed, continue? (or something to that effect), and I started the
process, thinking this wouldn’t take long because there was no data.

After 10 minutes and no apparent progress, I did panic, thinking of the
worst-case scenario: it is writing zeros over my data.

I executed the stop command and there was still no progress. I assume this
was because there was no data on the brick to be removed, causing the
program to hang.

I found the process ID and killed it.
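
For reference, a remove-brick operation can be queried instead of watched:
gluster volume remove-brick takes start, status, stop or commit as its
final argument. The sketch below only prints the command for review rather
than running it. VOL and BRICK are copied from this post and are
assumptions about the real layout, not verified values.

```shell
# Print (do not run) a remove-brick command so it can be reviewed first.
# VOL and BRICK are placeholders taken from this post, not verified values.
VOL="vol.name"
BRICK="newserver:/bricks/dataX"

remove_brick_cmd() {
    # $1 = action: start, status, stop, or commit
    case "$1" in
        start|status|stop|commit)
            printf 'gluster volume remove-brick %s %s %s\n' "$VOL" "$BRICK" "$1" ;;
        *)
            printf 'unknown action: %s\n' "$1" >&2
            return 1 ;;
    esac
}

remove_brick_cmd status
```

Printing first keeps a typo in the brick name from reaching the cluster.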

This morning, while all clients and servers can access /vol.name, not all
of the data is present.  I can find it under cluster, but users cannot
reach it.  I am, again, assuming it is because of the 4 bricks that have
been added but aren't really a part of the volume because of their
incorrect name.

So, how do I proceed from here?

1. Remove the 4 empty bricks from the volume without damaging data.

2. Correctly clear any metadata about these 4 bricks ONLY so they may be
added correctly.
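
A sketch of what item 2 might look like, based on the commonly described
approach of clearing a brick directory's Gluster extended attributes and
its .glusterfs directory so the path can be reused. This is an assumption,
not something verified on this volume, and it would apply only to the four
new bricks after they have been removed from the volume and confirmed
empty. The helper prints the commands rather than executing them, and
BRICK_DIR is a placeholder based on the path in this post.

```shell
# Print (do not run) commands to inspect and clear Gluster metadata on one
# brick directory. BRICK_DIR is a placeholder based on the path in this post.
BRICK_DIR="/bricks/dataX"

brick_cleanup_cmds() {
    dir="$1"
    # Inspect the extended attributes first.
    printf 'getfattr -d -m . -e hex %s\n' "$dir"
    # Remove the volume markers and the internal .glusterfs directory.
    printf 'setfattr -x trusted.glusterfs.volume-id %s\n' "$dir"
    printf 'setfattr -x trusted.gfid %s\n' "$dir"
    printf 'rm -rf %s/.glusterfs\n' "$dir"
}

brick_cleanup_cmds "$BRICK_DIR"
```

The getfattr line is read-only and is the one worth running first, to see
which volume-id, if any, the brick directory currently carries.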

If this doesn't restore the volume to full functionality, I'll write
another post if I cannot find an answer in the notes or online.

Tami--