[Gluster-devel] [Gluster-users] When will 3.6 be considered stable? (was: Replace brick 3.4.2 with 3.6.2?)
David Robinson
david.robinson at corvidtec.com
Wed May 6 04:29:12 UTC 2015
Sorry for the delay... Long day of flights... OK. Here goes my attempt
to explain what was happening:
First, my setup. I am using a replica-2 setup with four nodes:
gfsib01a
gfsib01b
gfsib02a
gfsib02b
where the 1a/1b and 2a/2b pairs are the replica pairs.
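For reference, a replica-2 volume with this layout would have been
created with something along these lines (a sketch only; the brick
paths below are placeholders, not our actual ones):

    # Sketch; /data/brick/homegfs is a placeholder brick path.
    gluster volume create homegfs replica 2 \
        gfsib01a:/data/brick/homegfs gfsib01b:/data/brick/homegfs \
        gfsib02a:/data/brick/homegfs gfsib02b:/data/brick/homegfs
    gluster volume start homegfs

With replica 2 and the bricks listed in that order, 01a/01b and
02a/02b form the two replica pairs.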
I am using a number of segregated networks:
gfsib01a 10.200.70.1
gfsib01b 10.200.71.1
gfsib02a 10.200.70.2
gfsib02b 10.200.71.2
where 10.200.x.x is my InfiniBand network. Gluster is also connected to
my super-computer nodes on a 10.214.x.x network through the gigabit
interface.
Our DNS resolves gfsib01a to the 10.200.x.x network. When our initial
system was set up and we were accessing gluster from a non-InfiniBand
network space (i.e. from a machine with no InfiniBand card, and
therefore no access to the 10.200 network), we overrode the DNS entries
by placing the following in the /etc/hosts file on the machine:
/etc/hosts [only done on machines without access to 10.200 IB network]:
10.214.70.1 gfsib01a
10.214.71.1 gfsib01b
10.214.70.2 gfsib02a
10.214.71.2 gfsib02b
This setup was recommended by the Red Hat guys who came out to demo
gluster for us a year or two ago. This is how we were instructed to
set up multiple-network access with gluster. Basically, it tricked
name resolution so that gfsib01a.corvidtec.com resolved to an address
that could be reached from a given node that didn't have access to the
10.200 network.
10.200 traffic would be routed through ib0 on nodes where there was an
IB card.
10.214 traffic would be routed through eth0 on nodes where there was no
IB card, and hence, no access to the 10.200 network.
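A quick way to sanity-check which path a given node will take (both
commands are standard iproute2/glibc tools, nothing gluster-specific):

    # On an IB node this should pick ib0; on an ethernet-only node the
    # /etc/hosts override makes the name resolve to 10.214.x.x instead.
    ip route get 10.200.70.1
    getent hosts gfsib01a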
This worked for us until we upgraded to 3.6.3. At that point, we ran
into issues where some of the nodes would mount /homegfs and some would
fail with timeouts. For those that did actually mount (430 of the 1500
nodes completed the mount; the rest timed out), /homegfs was
accessible. However, when I tried to switch to a user whose home
directory was on /homegfs, it would sit there for roughly 20-30 seconds
before completing; something in the ssh login was taking a very long
time. Once you were connected, everything behaved normally and operated
fine without any performance issues.
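For reference, the compute nodes were doing a plain native (FUSE)
mount, along the lines of:

    # Native glusterfs mount as used on the clients; the server
    # address, volume name, and mount point are as described above.
    mount -t glusterfs 10.214.70.1:/homegfs /homegfs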
Now begins my best guess as to what happened, with my fully admitted
novice-level understanding of how this works. Let the speculation
begin... It looks like something changed in 3.6.3 with the name
resolution/IP handling. My best guess is that FUSE needs to "see" all
of the nodes to be able to write to them. When I mounted gfsib01a,
effectively using "10.214.70.1:/homegfs /homegfs", it found gfsib01a
without any issues. However, it looks like 3.6.3 now returns the
10.200.x.x address space back to the FUSE mount for the other nodes in
the volume (gfsib01b, gfsib02a, gfsib02b), at which point the route
fails because the node doesn't have access to the 10.200 network
space. I fixed this by adding a route on those nodes so that 10.200
traffic goes out the 10.214 ethernet port, and by removing the DNS
adjustments in /etc/hosts.
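The route I added was along these lines (a sketch; 10.214.70.254 below
is a placeholder for a gateway that can reach the IB network, not our
actual router):

    # Steer 10.200.x.x traffic out the ethernet interface via a
    # gateway that can reach the IB network (placeholder address).
    ip route add 10.200.0.0/16 via 10.214.70.254 dev eth0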
Again, I am guessing here, but do you know if the name resolution that
is passed back changed in 3.6.3? Did it send back the machine names
(gfsib01a, gfsib01b, gfsib02a, gfsib02b) prior to 3.6.3, and now it
sends back IP addresses? Or something along those lines?
Once I added the routes and eliminated the "spoofing" in the /etc/hosts
file, everything worked fine.
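One way to check what addresses the clients are being handed is to look
at the remote-host entries in the client volfile on one of the servers
(the path below assumes the volume is named homegfs):

    # Each protocol/client section carries one remote-host per brick;
    # raw 10.200.x.x addresses here would explain the behaviour.
    grep "option remote-host" /var/lib/glusterd/vols/homegfs/homegfs-fuse.vol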
On a more positive note, it does seem to be behaving well. The previous
heal-fails have been cleaned up and it no longer continually shows
failed heals. The only thing I have noticed is that I am getting a lot
of these in the logs:
[2015-05-06 04:25:15.293175] D [registry.c:408:cli_cmd_register] 0-cli:
Returning 0
[2015-05-06 04:25:15.293184] D [registry.c:408:cli_cmd_register] 0-cli:
Returning 0
[2015-05-06 04:25:15.293192] D [registry.c:408:cli_cmd_register] 0-cli:
Returning 0
[2015-05-06 04:25:15.293200] D [registry.c:408:cli_cmd_register] 0-cli:
Returning 0
[2015-05-06 04:25:15.375447] D
[cli-cmd-volume.c:1825:cli_check_gsync_present] 0-cli: Returning 0
[2015-05-06 04:25:15.375511] D [registry.c:408:cli_cmd_register] 0-cli:
Returning 0
[snip: ~38 more identical cli_cmd_register "Returning 0" lines]
[2015-05-06 04:25:15.375879] T [cli.c:264:cli_rpc_notify] 0-glusterfs:
got RPC_CLNT_CONNECT
[2015-05-06 04:25:15.375896] T
[cli-quotad-client.c:94:cli_quotad_notify] 0-glusterfs: got
RPC_CLNT_CONNECT
[2015-05-06 04:25:15.375911] I [socket.c:2353:socket_event_handler]
0-transport: disconnecting now
[2015-05-06 04:25:15.375938] T
[cli-quotad-client.c:100:cli_quotad_notify] 0-glusterfs: got
RPC_CLNT_DISCONNECT
[2015-05-06 04:25:15.376003] T [rpc-clnt.c:1381:rpc_clnt_record]
0-glusterfs: Auth Info: pid: 0, uid: 0, gid: 0, owner:
[2015-05-06 04:25:15.376036] T
[rpc-clnt.c:1238:rpc_clnt_record_build_header] 0-rpc-clnt: Request
fraglen 152, payload: 88, rpc hdr: 64
[2015-05-06 04:25:15.376252] T [socket.c:2872:socket_connect] (-->
/usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x30fb620550] (-->
/usr/lib64/glusterfs/3.6.3/rpc-transport/socket.so(+0x72d3)[0x7f95db4232d3]
(--> /usr/lib64/libgfrpc.so.0(rpc_clnt_submit+0x468)[0x30fbe0efe8] (-->
gluster(cli_submit_request+0xdb)[0x40a8fb] (-->
gluster(cli_cmd_submit+0x8e)[0x40b6fe] ))))) 0-glusterfs: connect ()
called on transport already connected
[2015-05-06 04:25:15.376275] T [rpc-clnt.c:1573:rpc_clnt_submit]
0-rpc-clnt: submitted request (XID: 0x1 Program: Gluster CLI, ProgVers:
2, Proc: 27) to rpc-transport (glusterfs)
[2015-05-06 04:25:15.376297] D [rpc-clnt-ping.c:231:rpc_clnt_start_ping]
0-glusterfs: ping timeout is 0, returning
[2015-05-06 04:25:15.381486] T [rpc-clnt.c:660:rpc_clnt_reply_init]
0-glusterfs: received rpc message (RPC XID: 0x1 Program: Gluster CLI,
ProgVers: 2, Proc: 27) from rpc-transport (glusterfs)
[2015-05-06 04:25:15.381524] D [cli-rpc-ops.c:6649:gf_cli_status_cbk]
0-cli: Received response to status cmd
[2015-05-06 04:25:15.381712] D [cli-cmd.c:384:cli_cmd_submit] 0-cli:
Returning 0
[2015-05-06 04:25:15.381731] D [cli-rpc-ops.c:6912:gf_cli_status_volume]
0-cli: Returning: 0
[2015-05-06 04:25:15.381739] D
[cli-cmd-volume.c:1930:cli_cmd_volume_status_cbk] 0-cli: frame->local is
not NULL (0x7f95cc0009c0)
[2015-05-06 04:25:15.381764] I [input.c:36:cli_batch] 0-: Exiting with:
0
David
------ Original Message ------
From: "Justin Clift" <justin at gluster.org>
To: "Kingsley Tart - Barritel" <kingsley.tart at barritel.com>
Cc: "David F. Robinson" <david.robinson at corvidtec.com>
Sent: 5/5/2015 10:11:50 AM
Subject: Re: [Gluster-users] When will 3.6 be considered stable? (was:
Replace brick 3.4.2 with 3.6.2?)
>On 5 May 2015, at 14:39, Kingsley Tart - Barritel
><kingsley.tart at barritel.com> wrote:
>> On Thu, 2015-02-26 at 21:24 +0000, Justin Clift wrote:
>>>> When will 3.6 be considered stable? I'm waiting to deploy a
>>>> cluster into a production environment. I've built a 3.6.2 cluster
>>>> and have put a live copy of the data onto it, which took rsync a
>>>> solid 2 weeks to do. I don't really want to go through that again
>>>> if I can help it.
>>>
>>> We thought it was - including getting tested by some places fairly
>>> intensively beforehand - until bugs started showing up when people
>>> deployed it to production.
>>>
>>> We're actively working on a 3.6.3 release, fixing the reported bugs,
>>> and should have a beta out in the near-ish future. (3.6.3beta1 came
>>> out on 11th Feb, we're still working on a few more patches)
>>
>> Hi Justin,
>>
>> apologies for emailing directly, but it seems better than emailing
>> the whole list on this particular issue, especially as we've already
>> spoken about it.
>>
>> I've seen that 3.6.3 has been out a little while now, but I've been
>> holding off in case any other issues came to light. How is it all
>> going with 3.6.3? Ideally I'd like to upgrade our 3.6.2 setup if
>> 3.6.3 is considered good, but I don't know whether it's still early
>> days.
>
>3.6.3 *should* be better than 3.6.2. David Robinson (CC'd) mentioned
>he saw a change in client connectivity behaviour though, which is
>worth knowing about beforehand. I don't know the details, though David
>mentioned he'll send info about it through to the mailing list.
>
>I'd wait until then, read through that when it arrives, and then
>proceed with planning and upgrading things.
>
>Hope that helps. :)
>
>+ Justin
>
>
>> --
>> Cheers,
>> Kingsley.
>>
>
>--
>GlusterFS - http://www.gluster.org
>
>An open source, distributed file system scaling to several
>petabytes, and handling thousands of clients.
>
>My personal twitter: twitter.com/realjustinclift
>