[Gluster-users] Failed rebalance resulting in major problems

Shawn Heisey gluster at elyograg.org
Mon Nov 11 20:34:44 UTC 2013


On 11/11/2013 12:33 PM, Jeff Darcy wrote:
> There's nothing about a split-network configuration like yours that
> would cause something like this *by itself*, but anything that creates
> greater complexity also creates new possibilities for something to go
> wrong.  Just to be safe, if I were you, I'd double- and triple-check the
> DNS and /etc/hosts configurations on all machines to make sure some tiny
> error didn't creep in.  If your bricks are at the same paths on each
> machine, it would be possible for a machine to think it's connecting to
> one brick and actually end up connecting to another.  I haven't even
> been able to think through all of the ramifications, but just thinking
> about how that might affect rebalance makes me a bit queasy.

As far as I can tell, the /etc/hosts files and DNS are configured 
correctly.  All four of the hosts with bricks have identical /etc/hosts 
files.  I was very careful to double-check everything I'm including 
below before beginning anything, and I also checked it just now.

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
10.108.0.21     slc01dfs001a-pub.REDACTED.com slc01dfs001a-pub
10.108.0.22     slc01dfs001b-pub.REDACTED.com slc01dfs001b-pub
10.108.0.23     slc01dfs002a-pub.REDACTED.com slc01dfs002a-pub
10.108.0.24     slc01dfs002b-pub.REDACTED.com slc01dfs002b-pub
10.116.0.21     slc01dfs001a.REDACTED.com slc01dfs001a
10.116.0.22     slc01dfs001b.REDACTED.com slc01dfs001b
10.116.0.23     slc01dfs002a.REDACTED.com slc01dfs002a
10.116.0.24     slc01dfs002b.REDACTED.com slc01dfs002b

The addresses assigned to the -pub entries here are the same addresses that DNS
returns for the names without -pub; the -pub names themselves do not exist in DNS.
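To rule out resolution drift systematically rather than by eye, here is a quick
check I can run on each host. It is only a sketch: the hostnames and expected
private IPs are copied from the /etc/hosts listing above, and getent consults
/etc/hosts and DNS in nsswitch.conf order, so it resolves names the same way
glusterd would.

```shell
#!/bin/sh
# Resolve a name the way the system would (hosts file + DNS, per
# nsswitch.conf) and print the first IPv4 address found.
resolve4() {
    getent ahostsv4 "$1" | awk '{print $1; exit}'
}

# Compare the resolved address against the expected one.
check() {
    actual=$(resolve4 "$1")
    if [ "$actual" = "$2" ]; then
        echo "OK   $1 -> $actual"
    else
        echo "FAIL $1 -> ${actual:-<unresolved>} (expected $2)"
    fi
}

# Expected private addresses, taken from the /etc/hosts listing above.
check slc01dfs001a 10.116.0.21
check slc01dfs001b 10.116.0.22
check slc01dfs002a 10.116.0.23
check slc01dfs002b 10.116.0.24
```

Any FAIL line on any host would point at exactly the kind of tiny error Jeff
described.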

Here is what 'gluster volume status mdfs' says:

Status of volume: mdfs
Gluster process Port    Online  Pid
------------------------------------------------------------------------------
Brick slc01dfs001a:/bricks/d00v00/mdfs 24025   Y       7739
Brick slc01dfs001b:/bricks/d00v00/mdfs 24025   Y       2547
Brick slc01dfs001a:/bricks/d00v01/mdfs 24026   Y       7744
Brick slc01dfs001b:/bricks/d00v01/mdfs 24026   Y       2552
Brick slc01dfs001a:/bricks/d00v02/mdfs 24027   Y       7750
Brick slc01dfs001b:/bricks/d00v02/mdfs 24027   Y       2558
Brick slc01dfs001a:/bricks/d00v03/mdfs 24028   Y       7756
Brick slc01dfs001b:/bricks/d00v03/mdfs 24028   Y       2564
Brick slc01dfs001a:/bricks/d01v00/mdfs 24029   Y       7762
Brick slc01dfs001b:/bricks/d01v00/mdfs 24029   Y       2570
Brick slc01dfs001a:/bricks/d01v01/mdfs 24030   Y       7768
Brick slc01dfs001b:/bricks/d01v01/mdfs 24030   Y       2576
Brick slc01dfs001a:/bricks/d01v02/mdfs 24031   Y       7774
Brick slc01dfs001b:/bricks/d01v02/mdfs 24031   Y       2582
Brick slc01dfs001a:/bricks/d01v03/mdfs 24032   Y       7780
Brick slc01dfs001b:/bricks/d01v03/mdfs 24032   Y       2588
Brick slc01dfs002a:/bricks/d00v00/mdfs 24017   Y       23691
Brick slc01dfs002b:/bricks/d00v00/mdfs 24017   Y       23802
Brick slc01dfs002a:/bricks/d00v01/mdfs 24018   Y       23696
Brick slc01dfs002b:/bricks/d00v01/mdfs 24018   Y       23807
Brick slc01dfs002a:/bricks/d00v02/mdfs 24019   Y       23702
Brick slc01dfs002b:/bricks/d00v02/mdfs 24019   Y       23813
Brick slc01dfs002a:/bricks/d00v03/mdfs 24020   Y       23708
Brick slc01dfs002b:/bricks/d00v03/mdfs 24020   Y       23819
Brick slc01dfs002a:/bricks/d01v00/mdfs 24021   Y       23714
Brick slc01dfs002b:/bricks/d01v00/mdfs 24021   Y       23825
Brick slc01dfs002a:/bricks/d01v01/mdfs 24022   Y       23720
Brick slc01dfs002b:/bricks/d01v01/mdfs 24022   Y       23831
Brick slc01dfs002a:/bricks/d01v02/mdfs 24023   Y       23726
Brick slc01dfs002b:/bricks/d01v02/mdfs 24023   Y       23837
Brick slc01dfs002a:/bricks/d01v03/mdfs 24024   Y       23732
Brick slc01dfs002b:/bricks/d01v03/mdfs 24024   Y       23843
NFS Server on localhost 38467   Y       21318
Self-heal Daemon on localhost N/A     Y       21324
NFS Server on slc01nas2 38467   Y       49120
Self-heal Daemon on slc01nas2 N/A     Y       49126
NFS Server on slc01nas1 38467   Y       12335
Self-heal Daemon on slc01nas1 N/A     Y       12341
NFS Server on slc01dfs001b 38467   Y       5390
Self-heal Daemon on slc01dfs001b N/A     Y       5396
NFS Server on slc01dfs002a 38467   Y       23740
Self-heal Daemon on slc01dfs002a N/A     Y       23746
NFS Server on slc01dfs002b 38467   Y       23850
Self-heal Daemon on slc01dfs002b N/A     Y       23856
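With 32 bricks, scanning that output by eye is error-prone, so I also filter it.
A small sketch (plain awk, assuming the four-column "Brick <host:path> <port>
<Online> <pid>" layout shown above) that prints only bricks whose Online column
is not Y:

```shell
# Print bricks that are not online, given `gluster volume status` output
# on stdin. Assumes the "Brick <host:path> <port> <Y/N> <pid>" layout above.
offline_bricks() {
    awk '$1 == "Brick" && $4 != "Y" { print $2 }'
}

# Example usage:
#   gluster volume status mdfs | offline_bricks
```

It currently prints nothing, consistent with every brick showing Y above.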

The two hosts without bricks (slc01nas1 and slc01nas2) have only 
localhost entries in /etc/hosts.

Here's gluster peer status from slc01dfs001a.  The output from the other five
hosts looks similar, and every peer shows Connected.

Number of Peers: 5

Hostname: slc01nas2
Uuid: 4bb5b123-7420-4b6c-a542-3b15fc2104f8
State: Peer in Cluster (Connected)

Hostname: slc01nas1
Uuid: 1d087f2c-08b0-4de3-a547-c9e8f1255049
State: Peer in Cluster (Connected)

Hostname: slc01dfs001b
Uuid: 766a490a-132f-4baa-bf4c-193f49af3274
State: Peer in Cluster (Connected)

Hostname: slc01dfs002a
Uuid: 18a6936c-a721-49e2-82aa-fbe525986e25
State: Peer in Cluster (Connected)

Hostname: slc01dfs002b
Uuid: 5fd3e39d-dbb4-4f24-a3f7-3e0629839b2b
State: Peer in Cluster (Connected)
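Rather than reading five blocks of peer output on six hosts, I count the healthy
state lines. A sketch (the expected count of 5 comes from the output above; the
grep pattern matches the exact state string Gluster prints):

```shell
# Count peers in the healthy state, given `gluster peer status` on stdin.
connected_peers() {
    grep -c 'State: Peer in Cluster (Connected)'
}

# Example, run on each node (5 peers expected per the output above):
#   n=$(gluster peer status | connected_peers)
#   [ "$n" -eq 5 ] || echo "only $n of 5 peers connected on $(hostname)"
```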

The iptables firewall and selinux are disabled on every host.

Thanks,
Shawn


