[Gluster-users] Failed rebalance resulting in major problems
Shawn Heisey
gluster at elyograg.org
Mon Nov 11 20:34:44 UTC 2013
On 11/11/2013 12:33 PM, Jeff Darcy wrote:
> There's nothing about a split-network configuration like yours that
> would cause something like this *by itself*, but anything that creates
> greater complexity also creates new possibilities for something to go
> wrong. Just to be safe, if I were you, I'd double- and triple-check the
> DNS and /etc/hosts configurations on all machines to make sure some tiny
> error didn't creep in. If your bricks are at the same paths on each
> machine, it would be possible for a machine to think it's connecting to
> one brick and actually end up connecting to another. I haven't even
> been able to think through all of the ramifications, but just thinking
> about how that might affect rebalance makes me a bit queasy.
As far as I can tell, the /etc/hosts files and DNS are configured
correctly. All four of the hosts with bricks have identical /etc/hosts
files. I was very careful to double-check everything I'm including
below before beginning anything, and I also checked it just now.
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
10.108.0.21 slc01dfs001a-pub.REDACTED.com slc01dfs001a-pub
10.108.0.22 slc01dfs001b-pub.REDACTED.com slc01dfs001b-pub
10.108.0.23 slc01dfs002a-pub.REDACTED.com slc01dfs002a-pub
10.108.0.24 slc01dfs002b-pub.REDACTED.com slc01dfs002b-pub
10.116.0.21 slc01dfs001a.REDACTED.com slc01dfs001a
10.116.0.22 slc01dfs001b.REDACTED.com slc01dfs001b
10.116.0.23 slc01dfs002a.REDACTED.com slc01dfs002a
10.116.0.24 slc01dfs002b.REDACTED.com slc01dfs002b
The addresses listed for the -pub entries here are the same addresses that
DNS returns for the names without -pub; the -pub names themselves do not
exist in DNS.
Here is what 'gluster volume status mdfs' says:
Status of volume: mdfs
Gluster process Port Online Pid
------------------------------------------------------------------------------
Brick slc01dfs001a:/bricks/d00v00/mdfs 24025 Y 7739
Brick slc01dfs001b:/bricks/d00v00/mdfs 24025 Y 2547
Brick slc01dfs001a:/bricks/d00v01/mdfs 24026 Y 7744
Brick slc01dfs001b:/bricks/d00v01/mdfs 24026 Y 2552
Brick slc01dfs001a:/bricks/d00v02/mdfs 24027 Y 7750
Brick slc01dfs001b:/bricks/d00v02/mdfs 24027 Y 2558
Brick slc01dfs001a:/bricks/d00v03/mdfs 24028 Y 7756
Brick slc01dfs001b:/bricks/d00v03/mdfs 24028 Y 2564
Brick slc01dfs001a:/bricks/d01v00/mdfs 24029 Y 7762
Brick slc01dfs001b:/bricks/d01v00/mdfs 24029 Y 2570
Brick slc01dfs001a:/bricks/d01v01/mdfs 24030 Y 7768
Brick slc01dfs001b:/bricks/d01v01/mdfs 24030 Y 2576
Brick slc01dfs001a:/bricks/d01v02/mdfs 24031 Y 7774
Brick slc01dfs001b:/bricks/d01v02/mdfs 24031 Y 2582
Brick slc01dfs001a:/bricks/d01v03/mdfs 24032 Y 7780
Brick slc01dfs001b:/bricks/d01v03/mdfs 24032 Y 2588
Brick slc01dfs002a:/bricks/d00v00/mdfs 24017 Y 23691
Brick slc01dfs002b:/bricks/d00v00/mdfs 24017 Y 23802
Brick slc01dfs002a:/bricks/d00v01/mdfs 24018 Y 23696
Brick slc01dfs002b:/bricks/d00v01/mdfs 24018 Y 23807
Brick slc01dfs002a:/bricks/d00v02/mdfs 24019 Y 23702
Brick slc01dfs002b:/bricks/d00v02/mdfs 24019 Y 23813
Brick slc01dfs002a:/bricks/d00v03/mdfs 24020 Y 23708
Brick slc01dfs002b:/bricks/d00v03/mdfs 24020 Y 23819
Brick slc01dfs002a:/bricks/d01v00/mdfs 24021 Y 23714
Brick slc01dfs002b:/bricks/d01v00/mdfs 24021 Y 23825
Brick slc01dfs002a:/bricks/d01v01/mdfs 24022 Y 23720
Brick slc01dfs002b:/bricks/d01v01/mdfs 24022 Y 23831
Brick slc01dfs002a:/bricks/d01v02/mdfs 24023 Y 23726
Brick slc01dfs002b:/bricks/d01v02/mdfs 24023 Y 23837
Brick slc01dfs002a:/bricks/d01v03/mdfs 24024 Y 23732
Brick slc01dfs002b:/bricks/d01v03/mdfs 24024 Y 23843
NFS Server on localhost 38467 Y 21318
Self-heal Daemon on localhost N/A Y 21324
NFS Server on slc01nas2 38467 Y 49120
Self-heal Daemon on slc01nas2 N/A Y 49126
NFS Server on slc01nas1 38467 Y 12335
Self-heal Daemon on slc01nas1 N/A Y 12341
NFS Server on slc01dfs001b 38467 Y 5390
Self-heal Daemon on slc01dfs001b N/A Y 5396
NFS Server on slc01dfs002a 38467 Y 23740
Self-heal Daemon on slc01dfs002a N/A Y 23746
NFS Server on slc01dfs002b 38467 Y 23850
Self-heal Daemon on slc01dfs002b N/A Y 23856
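Every brick above shows Online = Y. For anyone wanting to check that quickly
rather than by eye, a filter like this over saved `gluster volume status`
output would surface any brick that isn't online (the sample lines below are
made up to show the format; the second one is deliberately offline):

```shell
# Sketch: print any brick whose Online column (4th field) is not "Y"
# in saved `gluster volume status` output.
cat > /tmp/status.sample <<'EOF'
Brick slc01dfs001a:/bricks/d00v00/mdfs    24025   Y   7739
Brick slc01dfs001b:/bricks/d00v00/mdfs    24025   N   2547
EOF

awk '$1 == "Brick" && $4 != "Y" { print $2 " is offline" }' /tmp/status.sample
```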
The two hosts without bricks (slc01nas1 and slc01nas2) have only
localhost entries in /etc/hosts.
Here's gluster peer status from slc01dfs001a. The output from the other
five hosts looks similar, and every peer shows Connected.
Number of Peers: 5
Hostname: slc01nas2
Uuid: 4bb5b123-7420-4b6c-a542-3b15fc2104f8
State: Peer in Cluster (Connected)
Hostname: slc01nas1
Uuid: 1d087f2c-08b0-4de3-a547-c9e8f1255049
State: Peer in Cluster (Connected)
Hostname: slc01dfs001b
Uuid: 766a490a-132f-4baa-bf4c-193f49af3274
State: Peer in Cluster (Connected)
Hostname: slc01dfs002a
Uuid: 18a6936c-a721-49e2-82aa-fbe525986e25
State: Peer in Cluster (Connected)
Hostname: slc01dfs002b
Uuid: 5fd3e39d-dbb4-4f24-a3f7-3e0629839b2b
State: Peer in Cluster (Connected)
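The same kind of quick check works for peer state; this sketch counts any
State line in saved `gluster peer status` output that isn't Connected (sample
lines again stand in for the real output):

```shell
# Sketch: count peers whose State line does not say "Connected"
# in saved `gluster peer status` output.
cat > /tmp/peers.sample <<'EOF'
Hostname: slc01nas2
State: Peer in Cluster (Connected)
Hostname: slc01nas1
State: Peer in Cluster (Connected)
EOF

awk '/^State:/ && !/Connected/ { bad++ } END { print bad + 0 " disconnected peer(s)" }' /tmp/peers.sample
```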
The iptables firewall and selinux are disabled on every host.
Thanks,
Shawn