[Gluster-users] RE : Frequent connect and disconnect messages flooded in logs
Micha Ober
micha2k at gmail.com
Wed Nov 30 17:46:51 UTC 2016
Hi,
as 6 servers with 12 bricks produce a lot of log files, I have uploaded
the last 200 lines of a client log from one server here:
http://paste.ubuntu.com/23558816/
When grepping the C(ritical) messages, there is, for example, this one:
[2016-11-30 12:01:06.813333] C
[rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-gv0-client-3: server
X.X.X.107:49154 has not responded in the last 42 seconds, disconnecting.
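
(As far as I understand, that 42-second window is just the default
network.ping-timeout. A sketch of how I could check or raise it, assuming
the volume name gv0 - although raising it would probably only hide the
real cause:)

gluster volume get gv0 network.ping-timeout
# gluster volume set gv0 network.ping-timeout 60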
For client-3, which is giant4:/gluster/sdc/gv0, I have uploaded the log
for this brick here:
http://paste.ubuntu.com/23558818/
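
(For reference, this is roughly how I pulled those lines; a sketch that
assumes the default Ubuntu log locations and that the volume is mounted
on /scratch - adjust the file names to your mount point and brick path:)

# client (mount) log on the node
grep '] C \[' /var/log/glusterfs/scratch.log | tail -n 20
# brick log on giant4 for /gluster/sdc/gv0
grep '] C \[' /var/log/glusterfs/bricks/gluster-sdc-gv0.log | tail -n 20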
It's hard to tell how to reproduce this issue other than "put load on
the servers/clients".
There are GPGPU compute jobs running on the nodes, but those only
consume 4 of the 6 CPU cores.
The servers are not overloaded. All of them have 16 GB RAM and most of
it is free.
The load on the disks is also very small (<10% according to iostat).
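
(That figure is the %util column from something like the following; the
interval is arbitrary:)

iostat -dx 5
# %util for sdc and sdd stays below 10% on all nodes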
What other information/logs can I provide?
Thanks,
Micha
On 30.11.2016 at 06:57, Mohammed Rafi K C wrote:
>
> Hi Micha,
>
> I have changed the thread and subject so that your original thread
> remains intact for your original query. Let's try to fix the problem
> you observed with 3.8.4; I have started this new thread to discuss the
> frequent disconnect problem.
>
> *If anyone else has experienced the same problem, please respond to
> this mail.*
>
> It would be very helpful if you could give us some more logs from
> the clients and bricks. Also, any steps to reproduce the issue will
> surely help us chase the problem further.
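>
> (Assuming a default installation, something like the following on each
> node should capture all client and brick logs in one go - the path is
> the usual default, adjust it if your logs live elsewhere:)
>
> tar czf /tmp/gluster-logs-$(hostname).tar.gz /var/log/glusterfs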
>
> Regards
>
> Rafi KC
>
> On 11/30/2016 04:44 AM, Micha Ober wrote:
>> I had opened another thread on this mailing list (Subject: "After
>> upgrade from 3.4.2 to 3.8.5 - High CPU usage resulting in disconnects
>> and split-brain").
>>
>> The title may be a bit misleading now, as I am no longer observing
>> high CPU usage after upgrading to 3.8.6, but the disconnects are
>> still happening and the number of files in split-brain is growing.
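>>
>> (For the record, this is roughly how I count those files; a sketch,
>> assuming the scratch volume gv0:)
>>
>> gluster volume heal gv0 info split-brain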
>>
>> Setup: 6 compute nodes, each serving as a glusterfs server and
>> client, Ubuntu 14.04, two bricks per node, distribute-replicate
>>
>> I have two gluster volumes set up (one for scratch data, one for the
>> slurm scheduler). Only the scratch data volume shows critical errors
>> "[...] has not responded in the last 42 seconds, disconnecting.". So
>> I can rule out network problems, the gigabit link between the nodes
>> is not saturated at all. The disks are almost idle (<10%).
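>>
>> (A sketch of how I checked the link utilisation, assuming the sysstat
>> package is installed and eth0 is the interconnect:)
>>
>> sar -n DEV 5 1
>> # rxkB/s and txkB/s on eth0 stay far below gigabit line rate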
>>
>> I have glusterfs 3.4.2 on Ubuntu 12.04 on another compute cluster,
>> running fine since it was deployed.
>> I had glusterfs 3.4.2 on Ubuntu 14.04 on this cluster, running fine
>> for almost a year.
>>
>> After upgrading to 3.8.5, the problems (as described) started. I
>> would like to use some of the new features of the newer versions
>> (like bitrot), but the users can't run their compute jobs right now
>> because the result files are garbled.
>>
>> There also seems to be a bug report with a similar problem (but no
>> progress):
>> https://bugzilla.redhat.com/show_bug.cgi?id=1370683
>>
>> For me, ALL servers are affected (not isolated to one or two servers).
>>
>> I also see messages like "INFO: task gpu_graphene_bv:4476 blocked for
>> more than 120 seconds." in the syslog.
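>>
>> (This is how I pull the full traces for those hung-task messages,
>> nothing exotic:)
>>
>> dmesg | grep -A 20 'blocked for more than 120 seconds'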
>>
>> For completeness (gv0 is the scratch volume, gv2 the slurm volume):
>>
>> [root at giant2: ~]# gluster v info
>>
>> Volume Name: gv0
>> Type: Distributed-Replicate
>> Volume ID: 993ec7c9-e4bc-44d0-b7c4-2d977e622e86
>> Status: Started
>> Snapshot Count: 0
>> Number of Bricks: 6 x 2 = 12
>> Transport-type: tcp
>> Bricks:
>> Brick1: giant1:/gluster/sdc/gv0
>> Brick2: giant2:/gluster/sdc/gv0
>> Brick3: giant3:/gluster/sdc/gv0
>> Brick4: giant4:/gluster/sdc/gv0
>> Brick5: giant5:/gluster/sdc/gv0
>> Brick6: giant6:/gluster/sdc/gv0
>> Brick7: giant1:/gluster/sdd/gv0
>> Brick8: giant2:/gluster/sdd/gv0
>> Brick9: giant3:/gluster/sdd/gv0
>> Brick10: giant4:/gluster/sdd/gv0
>> Brick11: giant5:/gluster/sdd/gv0
>> Brick12: giant6:/gluster/sdd/gv0
>> Options Reconfigured:
>> auth.allow: X.X.X.*,127.0.0.1
>> nfs.disable: on
>>
>> Volume Name: gv2
>> Type: Replicate
>> Volume ID: 30c78928-5f2c-4671-becc-8deaee1a7a8d
>> Status: Started
>> Snapshot Count: 0
>> Number of Bricks: 1 x 2 = 2
>> Transport-type: tcp
>> Bricks:
>> Brick1: giant1:/gluster/sdd/gv2
>> Brick2: giant2:/gluster/sdd/gv2
>> Options Reconfigured:
>> auth.allow: X.X.X.*,127.0.0.1
>> cluster.granular-entry-heal: on
>> cluster.locking-scheme: granular
>> nfs.disable: on
>>
>>
>> 2016-11-30 0:10 GMT+01:00 Micha Ober <micha2k at gmail.com>:
>>
>> There also seems to be a bug report with a similar problem (but
>> no progress):
>> https://bugzilla.redhat.com/show_bug.cgi?id=1370683
>>
>> For me, ALL servers are affected (not isolated to one or two servers).
>>
>> I also see messages like "INFO: task gpu_graphene_bv:4476 blocked
>> for more than 120 seconds." in the syslog.
>>
>> For completeness (gv0 is the scratch volume, gv2 the slurm volume):
>>
>> [root at giant2: ~]# gluster v info
>>
>> Volume Name: gv0
>> Type: Distributed-Replicate
>> Volume ID: 993ec7c9-e4bc-44d0-b7c4-2d977e622e86
>> Status: Started
>> Snapshot Count: 0
>> Number of Bricks: 6 x 2 = 12
>> Transport-type: tcp
>> Bricks:
>> Brick1: giant1:/gluster/sdc/gv0
>> Brick2: giant2:/gluster/sdc/gv0
>> Brick3: giant3:/gluster/sdc/gv0
>> Brick4: giant4:/gluster/sdc/gv0
>> Brick5: giant5:/gluster/sdc/gv0
>> Brick6: giant6:/gluster/sdc/gv0
>> Brick7: giant1:/gluster/sdd/gv0
>> Brick8: giant2:/gluster/sdd/gv0
>> Brick9: giant3:/gluster/sdd/gv0
>> Brick10: giant4:/gluster/sdd/gv0
>> Brick11: giant5:/gluster/sdd/gv0
>> Brick12: giant6:/gluster/sdd/gv0
>> Options Reconfigured:
>> auth.allow: X.X.X.*,127.0.0.1
>> nfs.disable: on
>>
>> Volume Name: gv2
>> Type: Replicate
>> Volume ID: 30c78928-5f2c-4671-becc-8deaee1a7a8d
>> Status: Started
>> Snapshot Count: 0
>> Number of Bricks: 1 x 2 = 2
>> Transport-type: tcp
>> Bricks:
>> Brick1: giant1:/gluster/sdd/gv2
>> Brick2: giant2:/gluster/sdd/gv2
>> Options Reconfigured:
>> auth.allow: X.X.X.*,127.0.0.1
>> cluster.granular-entry-heal: on
>> cluster.locking-scheme: granular
>> nfs.disable: on
>>
>>
>> 2016-11-29 19:21 GMT+01:00 Micha Ober <micha2k at gmail.com>:
>>
>> I had opened another thread on this mailing list (Subject:
>> "After upgrade from 3.4.2 to 3.8.5 - High CPU usage resulting
>> in disconnects and split-brain").
>>
>> The title may be a bit misleading now, as I am no longer
>> observing high CPU usage after upgrading to 3.8.6, but the
>> disconnects are still happening and the number of files in
>> split-brain is growing.
>>
>> Setup: 6 compute nodes, each serving as a glusterfs server
>> and client, Ubuntu 14.04, two bricks per node,
>> distribute-replicate
>>
>> I have two gluster volumes set up (one for scratch data, one
>> for the slurm scheduler). Only the scratch data volume shows
>> critical errors "[...] has not responded in the last 42
>> seconds, disconnecting." So I can rule out network problems:
>> the gigabit link between the nodes is not saturated at all.
>> The disks are almost idle (<10%).
>>
>> I have glusterfs 3.4.2 on Ubuntu 12.04 on another compute
>> cluster, running fine since it was deployed.
>> I had glusterfs 3.4.2 on Ubuntu 14.04 on this cluster,
>> running fine for almost a year.
>>
>> After upgrading to 3.8.5, the problems (as described)
>> started. I would like to use some of the new features of the
>> newer versions (like bitrot), but the users can't run their
>> compute jobs right now because the result files are garbled.
>>
>> 2016-11-29 18:53 GMT+01:00 Atin Mukherjee <amukherj at redhat.com>:
>>
>> Would you be able to share what is not working for you in
>> 3.8.x (please mention the exact version)? 3.4 is quite old, and
>> falling back to an unsupported version doesn't look like a
>> feasible option.
>>
>> On Tue, 29 Nov 2016 at 17:01, Micha Ober
>> <micha2k at gmail.com> wrote:
>>
>> Hi,
>>
>> I was using gluster 3.4 and upgraded to 3.8, but that
>> version turned out to be unusable for me. I now need to
>> downgrade.
>>
>> I'm running Ubuntu 14.04. As upgrades of the op
>> version are irreversible, I guess I have to delete
>> all gluster volumes and re-create them with the
>> downgraded version.
>>
>> 0. Backup data
>> 1. Unmount all gluster volumes
>> 2. apt-get purge glusterfs-server glusterfs-client
>> 3. Remove PPA for 3.8
>> 4. Add PPA for older version
>> 5. apt-get install glusterfs-server glusterfs-client
>> 6. Create volumes
>>
>> Is "purge" enough to delete all configuration files
>> of the currently installed version or do I need to
>> manually clear some residues before installing an
>> older version?
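>>
>> (Or, put differently: do I also need something like the following
>> on every node? A sketch, assuming the default paths and our brick
>> layout; the same would apply to the other brick directories:)
>>
>> rm -rf /var/lib/glusterd
>> setfattr -x trusted.glusterfs.volume-id /gluster/sdc/gv0
>> setfattr -x trusted.gfid /gluster/sdc/gv0
>> rm -rf /gluster/sdc/gv0/.glusterfs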
>>
>> Thanks.
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-users
>>
>> --
>> - Atin (atinm)
>>
>>
>>
>>
>>
>>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-users
>