[Gluster-users] problem with missing bricks

Todd Pfaff pfaff at rhpcs.mcmaster.ca
Sat Dec 31 16:50:27 UTC 2011


Gluster-user folks,

I'm trying to use gluster in a way that may be considered an unusual use
case.  Feel free to let me know if you think what I'm doing is dumb; it
just feels very comfortable to do this with gluster.  I have been using
gluster in other, more orthodox configurations for several years.

I have a single system with 45 inexpensive SATA drives - it's a self-built
Backblaze-style storage pod, similar to the one documented at this URL but
with some upgrades and substitutions:

   http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/

We use this system for disk-to-disk backups only, no primary storage,
nothing mission critical.

For the past two years I have been using this system with Linux software
RAID, with the drives organized as multiple RAID 5/6/10 sets of 5 drives
per set.  This has worked OK, but I have suffered enough simultaneous
multi-drive failures to prompt me to explore alternatives to RAID.
Yes, I know, that's what I get for using cheap SATA drives.

What I'm experimenting with now is creating gluster distributed-replicated
volumes on some of these drives, and perhaps on all of them eventually if
this works reasonably well.

At this point I am using 10 of the drives configured as shown here:

   Volume Name: volume1
   Type: Distributed-Replicate
   Status: Started
   Number of Bricks: 5 x 2 = 10
   Transport-type: tcp
   Bricks:
   Brick1: host:/gluster/brick01
   Brick2: host:/gluster/brick06
   Brick3: host:/gluster/brick02
   Brick4: host:/gluster/brick07
   Brick5: host:/gluster/brick03
   Brick6: host:/gluster/brick08
   Brick7: host:/gluster/brick04
   Brick8: host:/gluster/brick09
   Brick9: host:/gluster/brick05
   Brick10: host:/gluster/brick10
   Options Reconfigured:
   auth.allow: 127.0.0.1,10.10.10.10
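
For context, a volume laid out like the above can be created with
something along the following lines (adjacent bricks in the argument list
become each replica pair; "host" stands in for the real hostname, as in
the volume info above):

   gluster volume create volume1 replica 2 transport tcp \
       host:/gluster/brick01 host:/gluster/brick06 \
       host:/gluster/brick02 host:/gluster/brick07 \
       host:/gluster/brick03 host:/gluster/brick08 \
       host:/gluster/brick04 host:/gluster/brick09 \
       host:/gluster/brick05 host:/gluster/brick10
   gluster volume set volume1 auth.allow 127.0.0.1,10.10.10.10
   gluster volume start volume1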


For the most part this is working fine so far.  The problem I have run
into several times now is that when a drive fails and the system is
rebooted, the volume comes up without that brick's filesystem mounted.
Gluster happily writes to the missing brick's (now empty) mount point,
thereby eventually filling up the root filesystem.  Once the root
filesystem is full and the processes writing to gluster space are hung,
I can never recover from this state without rebooting.

Is there any way to avoid this problem of gluster writing to a brick
path that isn't actually backed by the intended brick filesystem?
Does gluster not create any sort of signature or metadata that
indicates whether or not a path really is a gluster brick?
How do others deal with missing bricks?
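
The best stopgap I've come up with so far is a boot-time guard along
these lines -- entirely untested, just a sketch of what I have in mind,
and obviously my own convention rather than anything gluster provides:

   #!/bin/sh
   # Refuse to start glusterd unless every brick filesystem is mounted,
   # so gluster can't silently write into an empty mount-point directory
   # on the root filesystem.  Brick paths are from my setup above.
   for b in /gluster/brick01 /gluster/brick02 /gluster/brick03 \
            /gluster/brick04 /gluster/brick05 /gluster/brick06 \
            /gluster/brick07 /gluster/brick08 /gluster/brick09 \
            /gluster/brick10; do
       if ! mountpoint -q "$b"; then
           echo "$b is not a mounted filesystem, not starting glusterd" >&2
           exit 1
       fi
   done
   exec /etc/init.d/glusterd start   # or however glusterd is started here

That would keep the root filesystem safe, but it also keeps the whole
volume down whenever a single brick is missing, which rather defeats the
point of running degraded.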

I realize that ultimately I should replace failed bricks as soon as
possible, but there may be times when I want to continue running for a
while with a "degraded" volume, if you will.

Any and all ideas, suggestions, comments, criticisms are welcome.

Cheers and Happy New Year,
Todd


