<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Mar 2, 2018 at 11:01 AM, Ravishankar N <span dir="ltr"><<a href="mailto:ravishankar@redhat.com" target="_blank">ravishankar@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class=""><br>
On 03/02/2018 10:11 AM, Ravishankar N wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
+ Anoop.<br>
<br>
It looks like clients on the old (3.12) nodes are not able to talk to the upgraded (4.0) node. I see messages like these on the old clients:<br>
<br>
[2018-03-02 03:49:13.483458] W [MSGID: 114007] [client-handshake.c:1197:clien<wbr>t_setvolume_cbk] 0-testvol-client-2: failed to find key 'clnt-lk-version' in the options<br>
<br>
</blockquote></span>
I see this in a 2x1 plain distribute also. I see ENOTCONN for the upgraded brick on the old client:<br>
<br>
[2018-03-02 04:58:54.559446] E [MSGID: 114058] [client-handshake.c:1571:clien<wbr>t_query_portmap_cbk] 0-testvol-client-1: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.<br>
[2018-03-02 04:58:54.559618] I [MSGID: 114018] [client.c:2285:client_rpc_noti<wbr>fy] 0-testvol-client-1: disconnected from testvol-client-1. Client process will keep trying to connect to glusterd until brick's port is available<br>
[2018-03-02 04:58:56.973199] I [rpc-clnt.c:1994:rpc_clnt_reco<wbr>nfig] 0-testvol-client-1: changing port to 49152 (from 0)<br>
[2018-03-02 04:58:56.975844] I [MSGID: 114057] [client-handshake.c:1484:selec<wbr>t_server_supported_programs] 0-testvol-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)<br>
[2018-03-02 04:58:56.978114] W [MSGID: 114007] [client-handshake.c:1197:clien<wbr>t_setvolume_cbk] 0-testvol-client-1: failed to find key 'clnt-lk-version' in the options<br>
[2018-03-02 04:58:46.618036] E [MSGID: 114031] [client-rpc-fops.c:2768:client<wbr>3_3_opendir_cbk] 0-testvol-client-1: remote operation failed. Path: / (00000000-0000-0000-0000-00000<wbr>0000001) [Transport endpoint is not connected]<br>
The message "W [MSGID: 114031] [client-rpc-fops.c:2577:client<wbr>3_3_readdirp_cbk] 0-testvol-client-1: remote operation failed [Transport endpoint is not connected]" repeated 3 times between [2018-03-02 04:58:46.609529] and [2018-03-02 04:58:46.618683]<br>
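<br>
As a rough aid for reading logs like the ones above, here is a minimal sketch that filters a client log for warning (W) and error (E) lines carrying a given MSGID. The function name filter_msgid is hypothetical, and the field layout (timestamp, severity letter, then the MSGID tag) is assumed from the log lines quoted above:<br>
<pre>
```shell
# Hypothetical helper: print W/E lines from a glusterfs log on stdin
# that carry the MSGID given as the first argument.
filter_msgid() {
    awk -v id="$1" '
        # $3 is the severity letter in "[date time] W [MSGID: n] ..."
        ($3 == "W" || $3 == "E") {
            if (index($0, "[MSGID: " id "]") != 0) print
        }'
}
```
</pre>
For example, piping the old client's log into 'filter_msgid 114007' would keep only the 'clnt-lk-version' handshake warnings, which makes it easier to see which client subvolume they belong to.<br>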
<br>
Also, mkdir fails on the old mount with EIO, though physically succeeding on both bricks. Can the rpc folks offer a helping hand?<br></blockquote><div><br></div><div> <div><div>Sometimes glusterfs returns wrong ia_type (IA_IFIFO to be
precise) in response to mkdir. This is the reason for the failure. Note that
the mkdir response from glusterfs says it is successful, but carries a wrong
iatt. That's the reason why we see directories created on bricks.<br></div><div><br></div><div>On debugging further, in dht_selfheal_dir_xattr_cbk, which gets executed as part of mkdir in dht,</div><div><br></div><div>(gdb) <br>677 ret = dict_get_bin (xdata, DHT_IATT_IN_XDATA_KEY, (void **) &stbuf);<br>(gdb) <br>692 LOCK (&frame->lock);<br>(gdb) <br>694 dht_iatt_merge (this, &local->stbuf, stbuf, subvol);<br>(gdb) p stbuf<br>$16 = (struct iatt *) 0x7f84e405aaf0<br>(gdb) p *stbuf<br>$17 = {ia_ino = 6143, ia_gfid = "\222\064\301\225~6v\242\021\<wbr>b\000\000\000\000\000",
ia_dev = 0, ia_type = IA_IFIFO, ia_prot = {suid = 0 '\000', sgid = 0
'\000', sticky = 0 '\000', owner = {read = 0 '\000', <br> write = 0
'\000', exec = 0 '\000'}, group = {read = 0 '\000', write = 0 '\000',
exec = 0 '\000'}, other = {read = 0 '\000', write = 0 '\000', exec = 0
'\000'}}, ia_nlink = 2, ia_uid = 0, ia_gid = 0, <br> ia_rdev = 0,
ia_size = 1520570685, ia_blksize = 1520570529, ia_blocks = 1520570714,
ia_atime = 0, ia_atime_nsec = 0, ia_mtime = 172390349, ia_mtime_nsec =
475585538, ia_ctime = 626110118, ia_ctime_nsec = 0}<br>(gdb) p local->stbuf<br>$18 = {ia_ino = 11706604198702429330, ia_gfid = "e\223\246pH\005F\226\242v6~\<wbr>225\301\064\222", ia_dev = 2065, ia_type = IA_IFDIR, ia_prot = {suid = 0 '\000', sgid = 0 '\000', sticky = 0 '\000', owner = {<br>
read = 1 '\001', write = 1 '\001', exec = 1 '\001'}, group = {read = 1
'\001', write = 0 '\000', exec = 1 '\001'}, other = {read = 1 '\001',
write = 0 '\000', exec = 1 '\001'}}, ia_nlink = 2, ia_uid = 0, <br>
ia_gid = 0, ia_rdev = 0, ia_size = 4096, ia_blksize = 4096, ia_blocks =
8, ia_atime = 1520570529, ia_atime_nsec = 475585538, ia_mtime =
1520570529, ia_mtime_nsec = 475585538, ia_ctime = 1520570529, <br> ia_ctime_nsec = 475585538}<br>(gdb) n<br>696 UNLOCK (&frame->lock);<br>(gdb) p local->stbuf<br>$19 = {ia_ino = 6143, ia_gfid = "\222\064\301\225~6v\242\021\<wbr>b\000\000\000\000\000",
ia_dev = 0, ia_type = IA_IFIFO, ia_prot = {suid = 0 '\000', sgid = 0
'\000', sticky = 0 '\000', owner = {read = 0 '\000',<br> write = 0
'\000', exec = 0 '\000'}, group = {read = 0 '\000', write = 0 '\000',
exec = 0 '\000'}, other = {read = 0 '\000', write = 0 '\000', exec = 0
'\000'}}, ia_nlink = 2, ia_uid = 0, ia_gid = 0,<br> ia_rdev = 0,
ia_size = 1520574781, ia_blksize = 1520570529, ia_blocks = 1520570722,
ia_atime = 1520570529, ia_atime_nsec = 475585538, ia_mtime = 1520570529,
ia_mtime_nsec = 475585538, ia_ctime = 1520570529,<br> ia_ctime_nsec = 475585538}<br></div><div><br></div><div>So, we got the correct iatt during mkdir, but a wrong one while trying to set the layout on the directory.</div><div><br></div>Debugging further,<br><br></div><br><div><div>(gdb) p *stbuf<br>$26 = {ia_ino = 6143, ia_gfid = "L\rk\212\367\275\"\256\021\b\<wbr>000\000\000\000\000",
ia_dev = 0, ia_type = IA_IFIFO, ia_prot = {suid = 0 '\000', sgid = 0
'\000', sticky = 0 '\000', owner = {read = 0 '\000', <br> write = 0
'\000', exec = 0 '\000'}, group = {read = 0 '\000', write = 0 '\000',
exec = 0 '\000'}, other = {read = 0 '\000', write = 0 '\000', exec = 0
'\000'}}, ia_nlink = 2, ia_uid = 0, ia_gid = 0, <br> ia_rdev = 0,
ia_size = 1520571192, ia_blksize = 1520571192, ia_blocks = 1520571192,
ia_atime = 0, ia_atime_nsec = 0, ia_mtime = 87784021, ia_mtime_nsec =
87784021, ia_ctime = 92784143, ia_ctime_nsec = 0}<br>(gdb) up<br>#1
0x00007f84eae8ead1 in client3_3_setxattr_cbk (req=0x7f84e0008130,
iov=0x7f84e0008170, count=1, myframe=0x7f84e0008d80) at
client-rpc-fops.c:1013<br>1013 CLIENT_STACK_UNWIND (setxattr, frame, rsp.op_ret, op_errno, xdata);<br>(gdb) p this->name<br>$27 = 0x7f84e4009190 "testvol-client-1"<br></div><div><br></div><div>Breakpoint
12, dht_selfheal_dir_xattr_cbk (frame=0x7f84dc006a00,
cookie=0x7f84e4007c50, this=0x7f84e400ce80, op_ret=0, op_errno=0,
xdata=0x7f84e00017a0) at dht-selfheal.c:685<br>685 for (i = 0; i < layout->cnt; i++) {<br>(gdb) p *stbuf<br>$28 = {ia_ino = 12547800382684466508, ia_gfid = "\020{mk\200\067Kq\256\"\275\<wbr>367\212k\rL", ia_dev = 2065, ia_type = IA_IFDIR, ia_prot = {suid = 0 '\000', sgid = 0 '\000', sticky = 0 '\000', owner = {<br>
read = 1 '\001', write = 1 '\001', exec = 1 '\001'}, group = {read = 1
'\001', write = 0 '\000', exec = 1 '\001'}, other = {read = 1 '\001',
write = 0 '\000', exec = 1 '\001'}}, ia_nlink = 2, ia_uid = 0, <br>
ia_gid = 0, ia_rdev = 0, ia_size = 6, ia_blksize = 4096, ia_blocks = 0,
ia_atime = 1520571192, ia_atime_nsec = 90026323, ia_mtime = 1520571192,
ia_mtime_nsec = 90026323, ia_ctime = 1520571192, <br> ia_ctime_nsec = 94026420}<br>(gdb) up<br>#1
0x00007f84eae8ead1 in client3_3_setxattr_cbk (req=0x7f84e000a5f0,
iov=0x7f84e000a630, count=1, myframe=0x7f84e000aa00) at
client-rpc-fops.c:1013<br>1013 CLIENT_STACK_UNWIND (setxattr, frame, rsp.op_ret, op_errno, xdata);<br>(gdb) p this->name<br>$29 = 0x7f84e4008810 "testvol-client-0"<br></div><div><br></div><div>As
can be seen above, it's always the new brick (testvol-client-1) that returns a
wrong iatt with ia_type IA_IFIFO. The old brick (testvol-client-0) returns the
correct iatt.</div><div><br></div><div>We need to debug further on what
in client-1 (which is running 4.0) resulted in the wrong iatt. Note that the
iatt is obtained from the dictionary, so the dictionary changes in 4.0 are one
suspect.</div><div><br></div><div>Thanks to Ravi for providing a live setup, which made my life easy :).</div></div></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
-Ravi<div class="HOEnZb"><div class="h5"><br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Is there something more to be done on BZ 1544366?<br>
<br>
-Ravi<br>
On 03/02/2018 08:44 AM, Ravishankar N wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
On 03/02/2018 07:26 AM, Shyam Ranganathan wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Hi Pranith/Ravi,<br>
<br>
So, to keep a long story short, after upgrading 1 node in a 3-node 3.13<br>
cluster, self-heal is not able to catch up on the heal backlog. This is a<br>
very simple synthetic test, but the end result is that upgrade<br>
testing is failing.<br>
</blockquote>
<br>
Let me try this now and get back. I had done something similar when testing the FIPS patch and the rolling upgrade had worked.<br>
Thanks,<br>
Ravi<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
Here are the details,<br>
<br>
- Using<br>
<a href="https://hackmd.io/GYIwTADCDsDMCGBaArAUxAY0QFhBAbIgJwCMySIwJmAJvGMBvNEA#" rel="noreferrer" target="_blank">https://hackmd.io/GYIwTADCDsDM<wbr>CGBaArAUxAY0QFhBAbIgJwCMySIwJm<wbr>AJvGMBvNEA#</a><br>
I set up 3 server containers to install 3.13 first as follows (within the<br>
containers)<br>
<br>
(inside the 3 server containers)<br>
yum -y update; yum -y install centos-release-gluster313; yum install<br>
glusterfs-server; glusterd<br>
<br>
(inside centos-glfs-server1)<br>
gluster peer probe centos-glfs-server2<br>
gluster peer probe centos-glfs-server3<br>
gluster peer status<br>
gluster v create patchy replica 3 centos-glfs-server1:/d/brick1<br>
centos-glfs-server2:/d/brick2 centos-glfs-server3:/d/brick3<br>
centos-glfs-server1:/d/brick4 centos-glfs-server2:/d/brick5<br>
centos-glfs-server3:/d/brick6 force<br>
gluster v start patchy<br>
gluster v status<br>
<br>
Create a client container as per the document above, and mount the above<br>
volume and create 1 file, 1 directory and a file within that directory.<br>
<br>
Now we start the upgrade process (as laid out for 3.13 here<br>
<a href="http://docs.gluster.org/en/latest/Upgrade-Guide/upgrade_to_3.13/" rel="noreferrer" target="_blank">http://docs.gluster.org/en/lat<wbr>est/Upgrade-Guide/upgrade_to_<wbr>3.13/</a> ):<br>
- killall glusterfs glusterfsd glusterd<br>
- yum install<br>
<a href="http://cbs.centos.org/kojifiles/work/tasks/1548/311548/centos-release-gluster40-0.9-1.el7.centos.x86_64.rpm" rel="noreferrer" target="_blank">http://cbs.centos.org/kojifile<wbr>s/work/tasks/1548/311548/<wbr>centos-release-gluster40-0.9-<wbr>1.el7.centos.x86_64.rpm</a> <br>
- yum upgrade --enablerepo=centos-gluster40-<wbr>test glusterfs-server<br>
<br>
< Go back to the client and edit the contents of one of the files and<br>
change the permissions of a directory, so that there are things to heal<br>
when we bring up the newly upgraded server><br>
<br>
- gluster --version<br>
- glusterd<br>
- gluster v status<br>
- gluster v heal patchy<br>
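<br>
The per-node steps above can be sketched as two shell functions, one for each side of the client edits. This is only an illustration of the sequence listed in this mail, not a tested upgrade script: the names run, stop_and_upgrade, restart_and_heal and the DRY_RUN switch are hypothetical, and by default the commands are only printed, not executed:<br>
<pre>
```shell
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Stop the gluster processes and upgrade the packages on one node.
# $1 is the URL of the release RPM (kept as a parameter here).
stop_and_upgrade() {
    run killall glusterfs glusterfsd glusterd
    run yum install "$1"
    run yum upgrade --enablerepo=centos-gluster40-test glusterfs-server
}

# Restart glusterd on the upgraded node and trigger a heal of volume $1.
restart_and_heal() {
    run gluster --version
    run glusterd
    run gluster v status
    run gluster v heal "$1"
}
```
</pre>
Calling stop_and_upgrade with the release RPM URL, making the client-side edits, and then calling 'restart_and_heal patchy' mirrors the order used in this test.<br>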
<br>
The above starts failing as follows,<br>
[root@centos-glfs-server1 /]# gluster v heal patchy<br>
Launching heal operation to perform index self heal on volume patchy has<br>
been unsuccessful:<br>
Commit failed on centos-glfs-server2.glfstest20<wbr>. Please check log file<br>
for details.<br>
Commit failed on centos-glfs-server3. Please check log file for details.<br>
<br>
From here, if further files or directories are created from the client,<br>
they just get added to the heal backlog, and heal does not catch up.<br>
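<br>
One way to watch whether the backlog is shrinking is to total the pending entries reported per brick. The sketch below assumes that 'gluster volume heal patchy info' prints one 'Number of entries:' line per brick; the function name heal_backlog is hypothetical, and non-numeric counts (for example from a brick that is not connected) are treated as 0:<br>
<pre>
```shell
# Hypothetical helper: sum the "Number of entries:" lines read on stdin.
# Assumes one such line per brick in the heal info output.
heal_backlog() {
    awk -F': *' '/^Number of entries:/ { total += $2 } END { print total + 0 }'
}
```
</pre>
Usage would be along the lines of 'gluster volume heal patchy info | heal_backlog', repeated after creating more files from the client to see whether the total keeps growing.<br>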
<br>
As is obvious, I cannot proceed, as the upgrade procedure is broken. The<br>
issue itself may not be the self-heal daemon but something around<br>
connections. Since the process fails here, I am looking to you to<br>
unblock this as soon as possible, as we are already running a day's slip<br>
in the release.<br>
<br>
Thanks,<br>
Shyam<br>
</blockquote>
<br>
</blockquote>
<br>
</blockquote>
<br>
______________________________<wbr>_________________<br>
maintainers mailing list<br>
<a href="mailto:maintainers@gluster.org" target="_blank">maintainers@gluster.org</a><br>
<a href="http://lists.gluster.org/mailman/listinfo/maintainers" rel="noreferrer" target="_blank">http://lists.gluster.org/mailm<wbr>an/listinfo/maintainers</a><br>
</div></div></blockquote></div><br></div></div>