[Gluster-users] Gluster extra large file on brick

Tue Jul 6 16:28:43 UTC 2021

Hi gluster users,

I'm hoping to get some help with an issue on a dispersed volume
(EC: 2x(4+2)) that's causing me some headaches. This is on a cluster
running Gluster 6.9 on CentOS 7.

At some point in the last week, writes to one of my bricks started
failing with a "No space left on device" error:

[2021-07-06 16:08:57.261307] E [MSGID: 115067] [server-rpc-fops_v2.c:1373:server4_writev_cbk] 0-gluster-01-server: 1853436561: WRITEV -2 (f2d6f2f8-4fd7-4692-bd60-23124897be54), client: CTX_ID:648a7383-46c8-4ed7-a921-acafc90bec1a-GRAPH_ID:4-PID:19471-HOST:rhevh08.mgmt.triumf.ca-PC_NAME:gluster-01-client-5-RECON_NO:-5, error-xlator: gluster-01-posix [No space left on device]

The disk is quite full (listed as 100% on the server), but does have
some writable room left:

/dev/mapper/vg--brick1-brick1   11T   11T   97G 100% /data/glusterfs/gluster-01/brick1

However, I'm not sure that the amount of disk space used on the physical
drive is the true cause of the "No space left on device" errors anyway.
I can still manually write to this brick outside of Gluster, so it seems
like the operating system isn't preventing the writes from happening.
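
For anyone who wants to look at the same numbers, here is a minimal
sketch of how the free space and free inodes on the brick filesystem
could be checked. It assumes Python 3 is available on the brick host;
the function name fs_usage is made up for illustration and is not part
of Gluster. Going by the rounded df figures above, 97G out of 11T is a
little under 1% free, and a filesystem can also return "No space left
on device" when it runs out of inodes, so the sketch reports both.

```python
import os
import sys


def fs_usage(path):
    """Return free-space and free-inode figures for the filesystem holding path.

    fs_usage is a hypothetical helper name; it only wraps os.statvfs().
    """
    st = os.statvfs(path)
    total_bytes = st.f_blocks * st.f_frsize
    # f_bavail counts blocks available to unprivileged users
    avail_bytes = st.f_bavail * st.f_frsize
    return {
        "total_bytes": total_bytes,
        "avail_bytes": avail_bytes,
        "avail_pct": 100.0 * avail_bytes / total_bytes if total_bytes else 0.0,
        "inodes_total": st.f_files,
        "inodes_free": st.f_ffree,
        # some filesystems report 0 total inodes, so guard the division
        "inodes_free_pct": 100.0 * st.f_ffree / st.f_files if st.f_files else 0.0,
    }


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    for key, value in fs_usage(target).items():
        print(f"{key}: {value}")
```

Run with the brick mount point as the only argument, for example
/data/glusterfs/gluster-01/brick1.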

During my investigation, I noticed that one of the .glusterfs paths on the
problem server is using much more space than the same path on the other
servers. I can't quite figure out why that might be, or how it happened.
I'm wondering if anyone has advice on what the cause might have been.

I had done some package updates on the server with the issue, but not on the
other servers. These included the kernel, but not the Gluster packages. So
it's possible that this, or the reboot to load the new kernel, caused the
problem. I have scripts on my gluster machines to cleanly kill all of the
brick processes before rebooting, so I'm not leaning towards an abrupt
shutdown being the cause, but it's a possibility.

I'm also looking for advice on how to safely remove the problem file and
rebuild it from the other Gluster peers. I've seen some documentation on
this, but I'm a little nervous about corrupting the volume if I
misunderstand the process. I'm not free to take the volume or cluster down and
do maintenance at this point, but that might be something I'll have to consider
if it's my only option.

For reference, here's the comparison of the same path that seems to be
taking up extra space on one of the hosts:

1: 26G     /data/gluster-01/brick1/vol/.glusterfs/99/56
2: 26G     /data/gluster-01/brick1/vol/.glusterfs/99/56
3: 26G     /data/gluster-01/brick1/vol/.glusterfs/99/56
4: 26G     /data/gluster-01/brick1/vol/.glusterfs/99/56
5: 26G     /data/gluster-01/brick1/vol/.glusterfs/99/56
6: 3.0T    /data/gluster-01/brick1/vol/.glusterfs/99/56
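
To narrow down which file under that directory accounts for the
difference, something like the following could be run on each host. It
is a rough sketch, assuming Python 3; largest_files and top_n are names
made up for illustration. It ranks files by allocated blocks, which is
what du reports by default, and also returns the apparent size so that
sparse files stand out.

```python
import heapq
import os


def largest_files(root, top_n=10):
    """Return up to top_n (allocated_bytes, apparent_bytes, path) tuples under root.

    largest_files is a hypothetical helper name, not a Gluster tool.
    Results are sorted from largest to smallest allocated size.
    """
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                st = os.lstat(full)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            allocated = st.st_blocks * 512  # st_blocks is in 512-byte units
            found.append((allocated, st.st_size, full))
    return heapq.nlargest(top_n, found)


if __name__ == "__main__":
    import sys

    for allocated, apparent, path in largest_files(sys.argv[1]):
        print(f"{allocated:>16} {apparent:>16} {path}")
```

Run with the directory from the listing above as the only argument, for
example /data/gluster-01/brick1/vol/.glusterfs/99/56. The script only
reads metadata and does not modify anything on the brick.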

Any and all advice is appreciated.

Thanks!
--

Daniel Thomson
DevOps Engineer
t +1 604 222 7428
dthomson at triumf.ca
TRIUMF Canada's particle accelerator centre
www.triumf.ca @TRIUMFLab
4004 Wesbrook Mall
Vancouver BC V6T 2A3 Canada