[Gluster-devel] [Gluster-users] When will 3.6 be considered stable? (was: Replace brick 3.4.2 with 3.6.2?)
David Robinson
david.robinson at corvidtec.com
Wed May 6 04:29:12 UTC 2015
Sorry for the delay... Long day of flights... OK. Here goes my attempt
to explain what was happening:
First, my setup. I am using a replica-2 setup with four nodes:
gfsib01a
gfsib01b
gfsib02a
gfsib02b
where the 1a/1b and 2a/2b pairs are the replica pairs.
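For reference, a replica-2 volume with this layout would have been
created with something along these lines (a sketch only; the brick
paths below are placeholders, not our actual ones):

    # Sketch; /data/brick/homegfs is a placeholder brick path.
    gluster volume create homegfs replica 2 \
        gfsib01a:/data/brick/homegfs gfsib01b:/data/brick/homegfs \
        gfsib02a:/data/brick/homegfs gfsib02b:/data/brick/homegfs
    gluster volume start homegfs

With replica 2 and the bricks listed in that order, 01a/01b and
02a/02b form the two replica pairs.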
I am using a number of segregated networks:
gfsib01a 10.200.70.1
gfsib01b 10.200.71.1
gfsib02a 10.200.70.2
gfsib02b 10.200.71.2
where 10.200.x.x is my InfiniBand network. Gluster is also connected to
my super-computer nodes on a 10.214.x.x network through the gigabit
interface.
Our DNS resolves gfsib01a to the 10.200.x.x network. When our initial
system was set up and we were accessing gluster from a non-InfiniBand
network space (i.e. from a machine with no InfiniBand card, and
therefore no access to the 10.200 network), we overrode the DNS entries
by placing the following in the /etc/hosts file on the machine:
/etc/hosts [only done on machines without access to 10.200 IB network]:
10.214.70.1 gfsib01a
10.214.71.1 gfsib01b
10.214.70.2 gfsib02a
10.214.71.2 gfsib02b
This setup was recommended by the Red Hat guys who came out to demo
gluster for us a year or two ago. This is how we were instructed to
set up multiple-network access with gluster. Basically, it tricked
name resolution so that gfsib01a.corvidtec.com resolved to an address
that could be reached from a given node that didn't have access to the
10.200 network.
10.200 traffic would be routed through ib0 on nodes where there was an
IB card.
10.214 traffic would be routed through eth0 on nodes where there was no
IB card, and hence, no access to the 10.200 network.
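A quick way to sanity-check which path a given node will take (both
commands are standard iproute2/glibc tools, nothing gluster-specific):

    # On an IB node this should pick ib0; on an ethernet-only node the
    # /etc/hosts override makes the name resolve to 10.214.x.x instead.
    ip route get 10.200.70.1
    getent hosts gfsib01a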
This worked for us until we upgraded to 3.6.3. At that point, we ran
into issues where some of the nodes would mount /homegfs and some would
fail with timeouts. For those that did actually mount (430 of the 1500
nodes completed the mount; the rest timed out), /homegfs was
accessible. However, when I tried to switch to a user whose home
directory was on /homegfs, it would sit there for roughly 20-30 seconds
before completing; something in the ssh login was taking a very long
time. Once you were connected, everything behaved normally and operated
fine without any performance issues.
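For reference, the compute nodes were doing a plain native (FUSE)
mount, along the lines of:

    # Native glusterfs mount as used on the clients; the server
    # address, volume name, and mount point are as described above.
    mount -t glusterfs 10.214.70.1:/homegfs /homegfs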
Now begins my best guess as to what happened, with my fully admitted
novice-level understanding of how this works. Let the speculation
begin... It looks like something changed in 3.6.3 with the name
resolution/IP handling. My best guess is that FUSE needs to "see" all
of the nodes to be able to write to them. When I mounted gfsib01a,
effectively using "10.214.70.1:/homegfs /homegfs", it found gfsib01a
without any issues. However, it looks like 3.6.3 now returns the
10.200.x.x address space back to the FUSE mount for the other nodes in
the volume (gfsib01b, gfsib02a, gfsib02b), at which point the route
fails because the node doesn't have access to the 10.200 network
space. I fixed this by adding a route on those nodes so that 10.200
traffic goes out the 10.214 ethernet port, and by removing the DNS
adjustments in /etc/hosts.
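The route I added was along these lines (a sketch; 10.214.70.254 below
is a placeholder for a gateway that can reach the IB network, not our
actual router):

    # Steer 10.200.x.x traffic out the ethernet interface via a
    # gateway that can reach the IB network (placeholder address).
    ip route add 10.200.0.0/16 via 10.214.70.254 dev eth0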
Again, I am guessing here, but do you know if the name resolution that
is passed back changed in 3.6.3? Did it send back the machine names
(gfsib01a, gfsib01b, gfsib02a, gfsib02b) prior to 3.6.3, and now it
sends back IP addresses? Or something along those lines?
Once I added the routes and eliminated the "spoofing" in the /etc/hosts
file, everything worked fine.
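One way to check what addresses the clients are being handed is to look
at the remote-host entries in the client volfile on one of the servers
(the path below assumes the volume is named homegfs):

    # Each protocol/client section carries one remote-host per brick;
    # raw 10.200.x.x addresses here would explain the behaviour.
    grep "option remote-host" /var/lib/glusterd/vols/homegfs/homegfs-fuse.vol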
On a more positive note, it does seem to be behaving well. The previous
heal-fails have been cleaned up and it no longer continually shows
failed heals. The only thing I have noticed is that I am getting a lot
of these in the logs:
[2015-05-06 04:25:15.293175] D [registry.c:408:cli_cmd_register] 0-cli:
Returning 0
[2015-05-06 04:25:15.293184] D [registry.c:408:cli_cmd_register] 0-cli:
Returning 0
[2015-05-06 04:25:15.293192] D [registry.c:408:cli_cmd_register] 0-cli:
Returning 0
[2015-05-06 04:25:15.293200] D [registry.c:408:cli_cmd_register] 0-cli:
Returning 0
[2015-05-06 04:25:15.375447] D
[cli-cmd-volume.c:1825:cli_check_gsync_present] 0-cli: Returning 0
[2015-05-06 04:25:15.375511] D [registry.c:408:cli_cmd_register] 0-cli:
Returning 0
[snip: ~38 more identical cli_cmd_register "Returning 0" lines]
[2015-05-06 04:25:15.375879] T [cli.c:264:cli_rpc_notify] 0-glusterfs:
got RPC_CLNT_CONNECT
[2015-05-06 04:25:15.375896] T
[cli-quotad-client.c:94:cli_quotad_notify] 0-glusterfs: got
RPC_CLNT_CONNECT
[2015-05-06 04:25:15.375911] I [socket.c:2353:socket_event_handler]
0-transport: disconnecting now
[2015-05-06 04:25:15.375938] T
[cli-quotad-client.c:100:cli_quotad_notify] 0-glusterfs: got
RPC_CLNT_DISCONNECT
[2015-05-06 04:25:15.376003] T [rpc-clnt.c:1381:rpc_clnt_record]
0-glusterfs: Auth Info: pid: 0, uid: 0, gid: 0, owner:
[2015-05-06 04:25:15.376036] T
[rpc-clnt.c:1238:rpc_clnt_record_build_header] 0-rpc-clnt: Request
fraglen 152, payload: 88, rpc hdr: 64
[2015-05-06 04:25:15.376252] T [socket.c:2872:socket_connect] (-->
/usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x30fb620550] (-->
/usr/lib64/glusterfs/3.6.3/rpc-transport/socket.so(+0x72d3)[0x7f95db4232d3]
(--> /usr/lib64/libgfrpc.so.0(rpc_clnt_submit+0x468)[0x30fbe0efe8] (-->
gluster(cli_submit_request+0xdb)[0x40a8fb] (-->
gluster(cli_cmd_submit+0x8e)[0x40b6fe] ))))) 0-glusterfs: connect ()
called on transport already connected
[2015-05-06 04:25:15.376275] T [rpc-clnt.c:1573:rpc_clnt_submit]
0-rpc-clnt: submitted request (XID: 0x1 Program: Gluster CLI, ProgVers:
2, Proc: 27) to rpc-transport (glusterfs)
[2015-05-06 04:25:15.376297] D [rpc-clnt-ping.c:231:rpc_clnt_start_ping]
0-glusterfs: ping timeout is 0, returning
[2015-05-06 04:25:15.381486] T [rpc-clnt.c:660:rpc_clnt_reply_init]
0-glusterfs: received rpc message (RPC XID: 0x1 Program: Gluster CLI,
ProgVers: 2, Proc: 27) from rpc-transport (glusterfs)
[2015-05-06 04:25:15.381524] D [cli-rpc-ops.c:6649:gf_cli_status_cbk]
0-cli: Received response to status cmd
[2015-05-06 04:25:15.381712] D [cli-cmd.c:384:cli_cmd_submit] 0-cli:
Returning 0
[2015-05-06 04:25:15.381731] D [cli-rpc-ops.c:6912:gf_cli_status_volume]
0-cli: Returning: 0
[2015-05-06 04:25:15.381739] D
[cli-cmd-volume.c:1930:cli_cmd_volume_status_cbk] 0-cli: frame->local is
not NULL (0x7f95cc0009c0)
[2015-05-06 04:25:15.381764] I [input.c:36:cli_batch] 0-: Exiting with:
0
David
------ Original Message ------
From: "Justin Clift" <justin at gluster.org>
To: "Kingsley Tart - Barritel" <kingsley.tart at barritel.com>
Cc: "David F. Robinson" <david.robinson at corvidtec.com>
Sent: 5/5/2015 10:11:50 AM
Subject: Re: [Gluster-users] When will 3.6 be considered stable? (was:
Replace brick 3.4.2 with 3.6.2?)
>On 5 May 2015, at 14:39, Kingsley Tart - Barritel
><kingsley.tart at barritel.com> wrote:
>> On Thu, 2015-02-26 at 21:24 +0000, Justin Clift wrote:
>>>> When will 3.6 be considered stable? I'm waiting to deploy a
>>>> cluster into a production environment. I've built a 3.6.2 cluster
>>>> and have put a live copy of the data onto it, which took rsync a
>>>> solid 2 weeks to do. I don't really want to go through that again
>>>> if I can help it.
>>>
>>> We thought it was - including getting tested by some places fairly
>>> intensively beforehand - until bugs started showing up when people
>>> deployed it to production.
>>>
>>> We're actively working on a 3.6.3 release, fixing the reported bugs,
>>> and should have a beta out in the near-ish future. (3.6.3beta1 came
>>> out on 11th Feb, we're still working on a few more patches)
>>
>> Hi Justin,
>>
>> apologies for emailing directly, but it seems better than emailing
>> the whole list on this particular issue, especially as we've already
>> spoken about it.
>>
>> I've seen that 3.6.3 has been out a little while now, but I've been
>> holding off in case any other issues came to light. How is it all
>> going with 3.6.3? Ideally I'd like to upgrade our 3.6.2 setup if
>> 3.6.3 is considered good, but I don't know whether it's still early
>> days.
>
>3.6.3 *should* be better than 3.6.2. David Robinson (CC'd) mentioned
>he saw a change in client connectivity behaviour though, which is
>worth knowing about beforehand. I don't know the details, though David
>mentioned he'll send info about it through to the mailing list.
>
>I'd wait until then, read through that when it arrives, and then
>proceed with planning and upgrading things.
>
>Hope that helps. :)
>
>+ Justin
>
>
>> --
>> Cheers,
>> Kingsley.
>>
>
>--
>GlusterFS - http://www.gluster.org
>
>An open source, distributed file system scaling to several
>petabytes, and handling thousands of clients.
>
>My personal twitter: twitter.com/realjustinclift
>