[Gluster-users] Problems after upgrade/volume expansion

Mon Feb 3 21:35:13 UTC 2014

Hello,
   I'm experiencing some major problems with my GlusterFS filesystem 
after an upgrade/expansion, and I'm hoping I can get pointed in the 
right direction for troubleshooting it.

I had a 5-server, 5-brick distributed volume on 3.3.1.  I brought the 
volume offline, stopped glusterd and glusterfsd on all servers, then 
upgraded to 3.4.2 and brought glusterd and glusterfsd back online.  So 
far so good.

Once the volume was back online and healthy, I added a new server to the 
trusted storage pool and added two bricks on that server to the volume.  
Everything looked fine so far; "gluster volume status" showed all six 
servers and seven bricks as online.

The problem came next, when I tried to rebalance.  I ran "gluster volume 
rebalance <volname> start force", then, once it returned, ran "gluster 
volume rebalance <volname> status" and saw that the rebalance had failed 
on all but one node, which showed as in progress.  The node it was 
running successfully on was a pre-existing server, not the new server 
with the new bricks.  The other five servers report "1 subvolume(s) are 
down. Skipping fix layout."  Somebody in the IRC channel suggested this 
means that one of my bricks is down, but "gluster volume status 
<volname>" reports all servers and bricks as being online.  Full 
pastebin of the rebalance log (essentially the same on all five failing 
servers) here: http://fpaste.org/74082/14615971/
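
To compare the failing servers, the rebalance logs could be scanned for 
the message quoted above.  The following is only a minimal sketch: the 
function name is hypothetical, the log file paths are whatever is passed 
on the command line, and nothing is assumed about the log line format 
beyond the message text itself.

```python
import re
import sys

# Message text as quoted from the rebalance log; the surrounding log line
# format (timestamps, log level, etc.) is deliberately not assumed.
PATTERN = re.compile(r"(\d+) subvolume\(s\) are down\. Skipping fix layout")


def find_down_subvolume_reports(lines):
    """Return (line_number, down_count) for each line with the message."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = PATTERN.search(line)
        if match:
            hits.append((number, int(match.group(1))))
    return hits


if __name__ == "__main__":
    # Usage (script name and paths are placeholders):
    #   python scan_log.py <rebalance-log> [...]
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            for number, count in find_down_subvolume_reports(handle):
                print(f"{path}:{number}: {count} subvolume(s) reported down")
```

Running it against the log from each server shows where and how often 
the message appears, which may help narrow down which subvolume the 
rebalance process thinks is down.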

Currently, I have both missing files and files that report "Transport 
endpoint not connected" when they are accessed.  This seems to be 
related to the rebalance failures, and the layout appears to be 
incorrect as well.  I'm really hoping somebody can point me in the right 
direction on where to look next.  Thanks in advance for any help.

-Branden