[Gluster-users] RE : Frequent connect and disconnect messages flooded in logs
Micha Ober
micha2k at gmail.com
Wed Nov 30 17:46:51 UTC 2016
Hi,
as 6 servers with 12 bricks produce a lot of log files, I have uploaded
the last 200 lines of a client log from one server here:
http://paste.ubuntu.com/23558816/
When grepping the C(ritical) messages, there is, for example, this one:
[2016-11-30 12:01:06.813333] C
[rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-gv0-client-3: server
X.X.X.107:49154 has not responded in the last 42 seconds, disconnecting.
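
(As far as I understand, that 42-second window is just the default
network.ping-timeout. A sketch of how I could check or raise it, assuming
the volume name gv0 - although raising it would probably only hide the
real cause:)

gluster volume get gv0 network.ping-timeout
# gluster volume set gv0 network.ping-timeout 60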
For client-3, which is giant4:/gluster/sdc/gv0, I have uploaded the log
for this brick here:
http://paste.ubuntu.com/23558818/
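
(For reference, this is roughly how I pulled those lines; a sketch that
assumes the default Ubuntu log locations and that the volume is mounted
on /scratch - adjust the file names to your mount point and brick path:)

# client (mount) log on the node
grep '] C \[' /var/log/glusterfs/scratch.log | tail -n 20
# brick log on giant4 for /gluster/sdc/gv0
grep '] C \[' /var/log/glusterfs/bricks/gluster-sdc-gv0.log | tail -n 20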
It's hard to tell how to reproduce this issue other than "put load on
the servers/clients".
There are GPGPU compute jobs running on the nodes, but those only
consume 4 of the 6 CPU cores.
The servers are not overloaded. All of them have 16 GB RAM and most of
it is free.
The load on the disks is also very small (<10% according to iostat).
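
(That figure is the %util column from something like the following; the
interval is arbitrary:)

iostat -dx 5
# %util for sdc and sdd stays below 10% on all nodes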
What other information/logs can I provide?
Thanks,
Micha
On 30.11.2016 at 06:57, Mohammed Rafi K C wrote:
>
> Hi Micha,
>
> I have changed the thread and subject so that your original thread
> remains intact for your original query. Let's try to fix the problem
> you observed with 3.8.4; I have started this new thread to discuss the
> frequent disconnect problem.
>
> *If anyone else has experienced the same problem, please respond to
> this mail.*
>
> It would be very helpful if you could give us some more logs from
> the clients and bricks. Also, any steps to reproduce the issue will
> surely help us chase the problem further.
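>
> (Assuming a default installation, something like the following on each
> node should capture all client and brick logs in one go - the path is
> the usual default, adjust it if your logs live elsewhere:)
>
> tar czf /tmp/gluster-logs-$(hostname).tar.gz /var/log/glusterfs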
>
> Regards
>
> Rafi KC
>
> On 11/30/2016 04:44 AM, Micha Ober wrote:
>> I had opened another thread on this mailing list (Subject: "After
>> upgrade from 3.4.2 to 3.8.5 - High CPU usage resulting in disconnects
>> and split-brain").
>>
>> The title may be a bit misleading now, as I am no longer observing
>> high CPU usage after upgrading to 3.8.6, but the disconnects are
>> still happening and the number of files in split-brain is growing.
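>>
>> (For the record, this is roughly how I count those files; a sketch,
>> assuming the scratch volume gv0:)
>>
>> gluster volume heal gv0 info split-brain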
>>
>> Setup: 6 compute nodes, each serving as a glusterfs server and
>> client, Ubuntu 14.04, two bricks per node, distribute-replicate
>>
>> I have two gluster volumes set up (one for scratch data, one for the
>> slurm scheduler). Only the scratch data volume shows critical errors
>> "[...] has not responded in the last 42 seconds, disconnecting.". So
>> I can rule out network problems, the gigabit link between the nodes
>> is not saturated at all. The disks are almost idle (<10%).
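>>
>> (A sketch of how I checked the link utilisation, assuming the sysstat
>> package is installed and eth0 is the interconnect:)
>>
>> sar -n DEV 5 1
>> # rxkB/s and txkB/s on eth0 stay far below gigabit line rate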
>>
>> I have glusterfs 3.4.2 on Ubuntu 12.04 on another compute cluster,
>> running fine since it was deployed.
>> I had glusterfs 3.4.2 on Ubuntu 14.04 on this cluster, running fine
>> for almost a year.
>>
>> After upgrading to 3.8.5, the problems (as described) started. I
>> would like to use some of the new features of the newer versions
>> (like bitrot), but the users can't run their compute jobs right now
>> because the result files are garbled.
>>
>> There also seems to be a bug report with a similar problem (but no
>> progress):
>> https://bugzilla.redhat.com/show_bug.cgi?id=1370683
>>
>> For me, ALL servers are affected (not isolated to one or two servers).
>>
>> I also see messages like "INFO: task gpu_graphene_bv:4476 blocked for
>> more than 120 seconds." in the syslog.
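>>
>> (This is how I pull the full traces for those hung-task messages,
>> nothing exotic:)
>>
>> dmesg | grep -A 20 'blocked for more than 120 seconds'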
>>
>> For completeness (gv0 is the scratch volume, gv2 the slurm volume):
>>
>> [root at giant2: ~]# gluster v info
>>
>> Volume Name: gv0
>> Type: Distributed-Replicate
>> Volume ID: 993ec7c9-e4bc-44d0-b7c4-2d977e622e86
>> Status: Started
>> Snapshot Count: 0
>> Number of Bricks: 6 x 2 = 12
>> Transport-type: tcp
>> Bricks:
>> Brick1: giant1:/gluster/sdc/gv0
>> Brick2: giant2:/gluster/sdc/gv0
>> Brick3: giant3:/gluster/sdc/gv0
>> Brick4: giant4:/gluster/sdc/gv0
>> Brick5: giant5:/gluster/sdc/gv0
>> Brick6: giant6:/gluster/sdc/gv0
>> Brick7: giant1:/gluster/sdd/gv0
>> Brick8: giant2:/gluster/sdd/gv0
>> Brick9: giant3:/gluster/sdd/gv0
>> Brick10: giant4:/gluster/sdd/gv0
>> Brick11: giant5:/gluster/sdd/gv0
>> Brick12: giant6:/gluster/sdd/gv0
>> Options Reconfigured:
>> auth.allow: X.X.X.*,127.0.0.1
>> nfs.disable: on
>>
>> Volume Name: gv2
>> Type: Replicate
>> Volume ID: 30c78928-5f2c-4671-becc-8deaee1a7a8d
>> Status: Started
>> Snapshot Count: 0
>> Number of Bricks: 1 x 2 = 2
>> Transport-type: tcp
>> Bricks:
>> Brick1: giant1:/gluster/sdd/gv2
>> Brick2: giant2:/gluster/sdd/gv2
>> Options Reconfigured:
>> auth.allow: X.X.X.*,127.0.0.1
>> cluster.granular-entry-heal: on
>> cluster.locking-scheme: granular
>> nfs.disable: on
>>
>>
>> 2016-11-30 0:10 GMT+01:00 Micha Ober <micha2k at gmail.com>:
>>
>> There also seems to be a bug report with a similar problem (but
>> no progress):
>> https://bugzilla.redhat.com/show_bug.cgi?id=1370683
>>
>> For me, ALL servers are affected (not isolated to one or two servers).
>>
>> I also see messages like "INFO: task gpu_graphene_bv:4476 blocked
>> for more than 120 seconds." in the syslog.
>>
>> For completeness (gv0 is the scratch volume, gv2 the slurm volume):
>>
>> [root at giant2: ~]# gluster v info
>>
>> Volume Name: gv0
>> Type: Distributed-Replicate
>> Volume ID: 993ec7c9-e4bc-44d0-b7c4-2d977e622e86
>> Status: Started
>> Snapshot Count: 0
>> Number of Bricks: 6 x 2 = 12
>> Transport-type: tcp
>> Bricks:
>> Brick1: giant1:/gluster/sdc/gv0
>> Brick2: giant2:/gluster/sdc/gv0
>> Brick3: giant3:/gluster/sdc/gv0
>> Brick4: giant4:/gluster/sdc/gv0
>> Brick5: giant5:/gluster/sdc/gv0
>> Brick6: giant6:/gluster/sdc/gv0
>> Brick7: giant1:/gluster/sdd/gv0
>> Brick8: giant2:/gluster/sdd/gv0
>> Brick9: giant3:/gluster/sdd/gv0
>> Brick10: giant4:/gluster/sdd/gv0
>> Brick11: giant5:/gluster/sdd/gv0
>> Brick12: giant6:/gluster/sdd/gv0
>> Options Reconfigured:
>> auth.allow: X.X.X.*,127.0.0.1
>> nfs.disable: on
>>
>> Volume Name: gv2
>> Type: Replicate
>> Volume ID: 30c78928-5f2c-4671-becc-8deaee1a7a8d
>> Status: Started
>> Snapshot Count: 0
>> Number of Bricks: 1 x 2 = 2
>> Transport-type: tcp
>> Bricks:
>> Brick1: giant1:/gluster/sdd/gv2
>> Brick2: giant2:/gluster/sdd/gv2
>> Options Reconfigured:
>> auth.allow: X.X.X.*,127.0.0.1
>> cluster.granular-entry-heal: on
>> cluster.locking-scheme: granular
>> nfs.disable: on
>>
>>
>> 2016-11-29 19:21 GMT+01:00 Micha Ober <micha2k at gmail.com>:
>>
>> I had opened another thread on this mailing list (Subject:
>> "After upgrade from 3.4.2 to 3.8.5 - High CPU usage resulting
>> in disconnects and split-brain").
>>
>> The title may be a bit misleading now, as I am no longer
>> observing high CPU usage after upgrading to 3.8.6, but the
>> disconnects are still happening and the number of files in
>> split-brain is growing.
>>
>> Setup: 6 compute nodes, each serving as a glusterfs server
>> and client, Ubuntu 14.04, two bricks per node,
>> distribute-replicate
>>
>> I have two gluster volumes set up (one for scratch data, one
>> for the slurm scheduler). Only the scratch data volume shows
>> critical errors "[...] has not responded in the last 42
>> seconds, disconnecting." So I can rule out network problems:
>> the gigabit link between the nodes is not saturated at all.
>> The disks are almost idle (<10%).
>>
>> I have glusterfs 3.4.2 on Ubuntu 12.04 on another compute
>> cluster, running fine since it was deployed.
>> I had glusterfs 3.4.2 on Ubuntu 14.04 on this cluster,
>> running fine for almost a year.
>>
>> After upgrading to 3.8.5, the problems (as described)
>> started. I would like to use some of the new features of the
>> newer versions (like bitrot), but the users can't run their
>> compute jobs right now because the result files are garbled.
>>
>> 2016-11-29 18:53 GMT+01:00 Atin Mukherjee <amukherj at redhat.com>:
>>
>> Would you be able to share what is not working for you in
>> 3.8.x (please mention the exact version)? 3.4 is quite old, and
>> falling back to an unsupported version doesn't look like a
>> feasible option.
>>
>> On Tue, 29 Nov 2016 at 17:01, Micha Ober
>> <micha2k at gmail.com> wrote:
>>
>> Hi,
>>
>> I was using gluster 3.4 and upgraded to 3.8, but that
>> version turned out to be unusable for me. I now need to
>> downgrade.
>>
>> I'm running Ubuntu 14.04. As upgrades of the op
>> version are irreversible, I guess I have to delete
>> all gluster volumes and re-create them with the
>> downgraded version.
>>
>> 0. Backup data
>> 1. Unmount all gluster volumes
>> 2. apt-get purge glusterfs-server glusterfs-client
>> 3. Remove PPA for 3.8
>> 4. Add PPA for older version
>> 5. apt-get install glusterfs-server glusterfs-client
>> 6. Create volumes
>>
>> Is "purge" enough to delete all configuration files
>> of the currently installed version or do I need to
>> manually clear some residues before installing an
>> older version?
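>>
>> (Or, put differently: do I also need something like the following
>> on every node? A sketch, assuming the default paths and our brick
>> layout; the same would apply to the other brick directories:)
>>
>> rm -rf /var/lib/glusterd
>> setfattr -x trusted.glusterfs.volume-id /gluster/sdc/gv0
>> setfattr -x trusted.gfid /gluster/sdc/gv0
>> rm -rf /gluster/sdc/gv0/.glusterfs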
>>
>> Thanks.
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-users
>>
>> --
>> - Atin (atinm)
>>
>>
>>
>>
>>
>>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-users
>