[Gluster-users] RE : Frequent connect and disconnect messages flooded in logs
Micha Ober
micha2k at gmail.com
Fri Dec 2 19:26:37 UTC 2016
** Update: ** I have downgraded from 3.8.6 to 3.7.17 now, but the
problem still exists.
Client log: http://paste.ubuntu.com/23569065/
Brick log: http://paste.ubuntu.com/23569067/
Please note that each server has two bricks.
Yet, according to the logs, one brick loses its connection to all
other hosts at the same moment:
[2016-12-02 18:38:53.703301] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.219:49121 failed (Broken pipe)
[2016-12-02 18:38:53.703381] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.62:49118 failed (Broken pipe)
[2016-12-02 18:38:53.703380] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.107:49121 failed (Broken pipe)
[2016-12-02 18:38:53.703424] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.206:49120 failed (Broken pipe)
[2016-12-02 18:38:53.703359] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.58:49121 failed (Broken pipe)
The SECOND brick on the SAME host is NOT affected, i.e. no disconnects!
As I said, the network connection is fine and the disks are idle.
The CPU always has 2 free cores.
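
(One way to watch the disconnects from the server side is to list the
clients each brick process currently has connected; a quick check,
using the affected volume gv0:)

    # list the clients connected to each brick of gv0
    gluster volume status gv0 clients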
It looks like I have to downgrade to 3.4 now in order for the disconnects to stop.
- Micha
On 30.11.2016 at 06:57, Mohammed Rafi K C wrote:
>
> Hi Micha,
>
> I have changed the subject and started a new thread so that your
> original thread stays focused on your query. Let's use this new thread
> to track down the frequent disconnect problem you observed with 3.8.4.
>
> *If anyone else has experienced the same problem, please respond to
> this mail.*
>
> It would be very helpful if you could give us some more logs from the
> clients and bricks. Any steps to reproduce the issue would also help
> us chase the problem further.
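>
> (In case it helps: with a default install, the client mount logs live
> under /var/log/glusterfs/ and the brick logs under
> /var/log/glusterfs/bricks/; assuming those default paths, something
> like this on each node should collect everything:)
>
>     # bundle client and brick logs from one node
>     tar czf /tmp/gluster-logs-$(hostname).tar.gz \
>         /var/log/glusterfs/*.log /var/log/glusterfs/bricks/*.log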
>
> Regards
>
> Rafi KC
>
> On 11/30/2016 04:44 AM, Micha Ober wrote:
>> I had opened another thread on this mailing list (Subject: "After
>> upgrade from 3.4.2 to 3.8.5 - High CPU usage resulting in disconnects
>> and split-brain").
>>
>> The title may be a bit misleading now, as I am no longer observing
>> high CPU usage after upgrading to 3.8.6, but the disconnects are
>> still happening and the number of files in split-brain is growing.
>>
>> Setup: 6 compute nodes, each serving as a glusterfs server and
>> client, Ubuntu 14.04, two bricks per node, distribute-replicate
>>
>> I have two gluster volumes set up (one for scratch data, one for the
>> slurm scheduler). Only the scratch data volume shows critical errors
>> like "[...] has not responded in the last 42 seconds, disconnecting.".
>> Since only one of the two volumes is affected, I can rule out network
>> problems; the gigabit link between the nodes is not saturated at all,
>> and the disks are almost idle (<10%).
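>>
>> (The 42 seconds in that message is the network.ping-timeout option,
>> whose default is 42; the effective value can be checked per volume:)
>>
>>     # show the effective ping-timeout for the scratch volume
>>     gluster volume get gv0 network.ping-timeout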
>>
>> I have glusterfs 3.4.2 on Ubuntu 12.04 on another compute cluster,
>> which has been running fine since it was deployed.
>> I had glusterfs 3.4.2 on Ubuntu 14.04 on this cluster, running fine
>> for almost a year.
>>
>> After upgrading to 3.8.5, the problems (as described) started. I
>> would like to use some of the new features of the newer versions
>> (like bitrot), but the users can't run their compute jobs right now
>> because the result files are garbled.
>>
>> There also seems to be a bug report with a similar problem (but no
>> progress so far):
>> https://bugzilla.redhat.com/show_bug.cgi?id=1370683
>>
>> For me, ALL servers are affected (not isolated to one or two servers).
>>
>> I also see messages like "INFO: task gpu_graphene_bv:4476 blocked for
>> more than 120 seconds." in the syslog.
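>>
>> (Those messages come from the kernel's hung-task watchdog. To see
>> where the tasks are stuck, the stacks of all blocked tasks can be
>> dumped to the kernel log, assuming sysrq is enabled and run as root:)
>>
>>     # dump stacks of all blocked (D-state) tasks, then read them
>>     echo w > /proc/sysrq-trigger
>>     dmesg | tail -n 100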
>>
>> For completeness (gv0 is the scratch volume, gv2 the slurm volume):
>>
>> [root@giant2: ~]# gluster v info
>>
>> Volume Name: gv0
>> Type: Distributed-Replicate
>> Volume ID: 993ec7c9-e4bc-44d0-b7c4-2d977e622e86
>> Status: Started
>> Snapshot Count: 0
>> Number of Bricks: 6 x 2 = 12
>> Transport-type: tcp
>> Bricks:
>> Brick1: giant1:/gluster/sdc/gv0
>> Brick2: giant2:/gluster/sdc/gv0
>> Brick3: giant3:/gluster/sdc/gv0
>> Brick4: giant4:/gluster/sdc/gv0
>> Brick5: giant5:/gluster/sdc/gv0
>> Brick6: giant6:/gluster/sdc/gv0
>> Brick7: giant1:/gluster/sdd/gv0
>> Brick8: giant2:/gluster/sdd/gv0
>> Brick9: giant3:/gluster/sdd/gv0
>> Brick10: giant4:/gluster/sdd/gv0
>> Brick11: giant5:/gluster/sdd/gv0
>> Brick12: giant6:/gluster/sdd/gv0
>> Options Reconfigured:
>> auth.allow: X.X.X.*,127.0.0.1
>> nfs.disable: on
>>
>> Volume Name: gv2
>> Type: Replicate
>> Volume ID: 30c78928-5f2c-4671-becc-8deaee1a7a8d
>> Status: Started
>> Snapshot Count: 0
>> Number of Bricks: 1 x 2 = 2
>> Transport-type: tcp
>> Bricks:
>> Brick1: giant1:/gluster/sdd/gv2
>> Brick2: giant2:/gluster/sdd/gv2
>> Options Reconfigured:
>> auth.allow: X.X.X.*,127.0.0.1
>> cluster.granular-entry-heal: on
>> cluster.locking-scheme: granular
>> nfs.disable: on
>>
>>
>> 2016-11-29 18:53 GMT+01:00 Atin Mukherjee <amukherj at redhat.com>:
>>
>> Would you be able to share what exactly is not working for you in
>> 3.8.x (please mention the exact version)? 3.4 is quite old, and
>> falling back to an unsupported version doesn't look like a feasible
>> option.
>>
>> On Tue, 29 Nov 2016 at 17:01, Micha Ober
>> <micha2k at gmail.com> wrote:
>>
>> Hi,
>>
>> I was using gluster 3.4 and upgraded to 3.8, but that
>> version proved to be unusable for me. I now need to
>> downgrade.
>>
>> I'm running Ubuntu 14.04. As upgrades of the op
>> version are irreversible, I guess I have to delete
>> all gluster volumes and re-create them with the
>> downgraded version.
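>>
>> (If useful, the cluster's current op version can be read from
>> glusterd's state file, assuming the default state directory:)
>>
>>     # operating-version is recorded in glusterd.info
>>     grep operating-version /var/lib/glusterd/glusterd.info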
>>
>> 0. Backup data
>> 1. Unmount all gluster volumes
>> 2. apt-get purge glusterfs-server glusterfs-client
>> 3. Remove PPA for 3.8
>> 4. Add PPA for older version
>> 5. apt-get install glusterfs-server glusterfs-client
>> 6. Create volumes
>>
>> Is "purge" enough to delete all configuration files
>> of the currently installed version or do I need to
>> manually clear some residues before installing an
>> older version?
>>
>> Thanks.
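>>
>> (For reference: the volume and peer state under /var/lib/glusterd is
>> generated at runtime rather than shipped as package conffiles, so it
>> may survive a purge. The bricks also keep a volume-id xattr. A sketch
>> of the extra cleanup, assuming default paths and that the data has
>> been backed up:)
>>
>>     # remove leftover glusterd state after the purge
>>     rm -rf /var/lib/glusterd
>>
>>     # let the old brick directories be reused by new volumes
>>     setfattr -x trusted.glusterfs.volume-id /gluster/sdc/gv0
>>     setfattr -x trusted.gfid /gluster/sdc/gv0
>>     rm -rf /gluster/sdc/gv0/.glusterfs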
>>
>> --
>> - Atin (atinm)