[Bugs] [Bug 1250704] New: Random errors when reading multiple files in parallel on disperse volume

bugzilla at redhat.com bugzilla at redhat.com
Wed Aug 5 19:16:03 UTC 2015


https://bugzilla.redhat.com/show_bug.cgi?id=1250704

            Bug ID: 1250704
           Summary: Random errors when reading multiple files in parallel
                    on disperse volume
           Product: GlusterFS
           Version: 3.7.3
         Component: disperse
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: eivind at pacbell.net
                CC: bugs at gluster.org, gluster-bugs at redhat.com



Description of problem:
Reading multiple files in parallel from a disperse volume produces random read
errors (EIO).  If I create a set of files and attempt to read each of them at
the same time, I get random read errors.  The errors occur on different files
and at different offsets each time I run the test.  These errors happen with 4
or more parallel readers.

Tried running the test with and without direct I/O; it makes no difference.
Tried slowing the rate of the reader threads; fewer errors are seen, but the
test still fails.
The test does not fail with only 1-3 reader threads.
The test does not fail with other volume types (distributed-replicated).
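
For reference, these variations correspond to standard fio job options. The
fragment below is only a sketch of the knobs involved (values are illustrative,
not necessarily the exact ones used, and it assumes "direct-io" refers to fio's
direct=1 option rather than the glusterfs direct-io-mode mount option); the
lines would go in the [global] section of the job file shown under step 4.

# 1-3 reader threads: no errors; 4 or more: random EIO
numjobs=4
# throttle the readers: fewer errors, but still failures
rate_iops=100
# toggle direct I/O: no difference either way
direct=1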


Version-Release number of selected component (if applicable):
3.7.3

How reproducible:
Consistently; every run of the parallel-read test with 4 or more reader threads
hits errors.

Steps to Reproduce:
1.  Create a disperse volume
# gluster vol info
Volume Name: volec
Type: Disperse
Volume ID: 1a849e84-a9c2-4a08-9950-ac948f6b3d8d
Status: Started
Number of Bricks: 1 x (4 + 2) = 6
Transport-type: tcp
Bricks:
Brick1: science03:/brick_hdd_0/gv
Brick2: science04:/brick_hdd_0/gv
Brick3: science05:/brick_hdd_0/gv
Brick4: science06:/brick_hdd_0/gv
Brick5: science07:/brick_hdd_0/gv
Brick6: science09:/brick_hdd_0/gv
Options Reconfigured:
performance.readdir-ahead: on

2. Mount the volume from a client node (separate from the Gluster server nodes).
3. Create 10 x 64M files (test.0 .. test.9).
4. Run fio with the following job file:
[global]
ioengine=sync
size=64m
rw=read
bs=4k
#rate_iops=100
directory=/volec
thread
group_reporting
numjobs=10
filename_format=test.$jobnum
[test]
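
For reference, the steps above can be reproduced with commands along the
following lines. This is only a sketch: the brick paths are taken from the vol
info output in step 1, while the choice of science03 as the mount server, the
dd loop used to create the files, and the job-file name read-test.fio are
assumptions, not the exact commands used.

On one of the server nodes (step 1):
# gluster volume create volec disperse 6 redundancy 2 \
    science03:/brick_hdd_0/gv science04:/brick_hdd_0/gv \
    science05:/brick_hdd_0/gv science06:/brick_hdd_0/gv \
    science07:/brick_hdd_0/gv science09:/brick_hdd_0/gv
# gluster volume start volec

On the client node (steps 2-4), with the job file above saved as read-test.fio:
# mkdir -p /volec
# mount -t glusterfs science03:/volec /volec
# for i in $(seq 0 9); do dd if=/dev/zero of=/volec/test.$i bs=1M count=64; done
# fio read-test.fio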

Actual results:
fio: io_u error on file /volec/test.1: Input/output error: read offset=2609152,
buflen=4096
fio: pid=4988, err=5/file:io_u.c:1575, func=io_u error, error=Input/output
error
fio: io_u error on file /volec/test.3: Input/output error: read offset=3215360,
buflen=4096
fio: pid=4986, err=5/file:io_u.c:1575, func=io_u error, error=Input/output
error
fio: io_u error on file /volec/test.0: Input/output error: read offset=3657728,
buflen=4096
fio: pid=4981, err=5/file:io_u.c:1575, func=io_u error, error=Input/output
error
fio: io_u error on file /volec/test.6: Input/output error: read
offset=20516864, buflen=4096
fio: pid=4983, err=5/file:io_u.c:1575, func=io_u error, error=Input/output
error



Expected results:
No errors


Additional info:
See the following messages in the client log.

/var/log/glusterfs/volec.log:
[2015-08-05 18:19:36.250244] W [MSGID: 122053]
[ec-common.c:166:ec_check_status] 0-volec-disperse-0: Operation failed on some
subvolumes (up=3F, mask=2F, remaining=1, good=2E, bad=10)
[2015-08-05 18:19:36.252131] W [MSGID: 122053]
[ec-common.c:166:ec_check_status] 0-volec-disperse-0: Operation failed on some
subvolumes (up=3F, mask=2F, remaining=2, good=2D, bad=10)
[2015-08-05 18:19:36.253987] W [MSGID: 122053]
[ec-common.c:166:ec_check_status] 0-volec-disperse-0: Operation failed on some
subvolumes (up=3F, mask=2F, remaining=4, good=2B, bad=10)
[2015-08-05 18:19:36.257672] W [MSGID: 122053]
[ec-common.c:166:ec_check_status] 0-volec-disperse-0: Operation failed on some
subvolumes (up=3F, mask=2F, remaining=8, good=27, bad=10)
[2015-08-05 18:19:36.259767] W [MSGID: 122053]
[ec-common.c:166:ec_check_status] 0-volec-disperse-0: Operation failed on some
subvolumes (up=3F, mask=2F, remaining=8, good=27, bad=10)
[2015-08-05 18:19:36.265674] W [MSGID: 122002] [ec-common.c:122:ec_heal_report]
0-volec-disperse-0: Heal failed [Transport endpoint is not connected]
The message "W [MSGID: 122002] [ec-common.c:122:ec_heal_report]
0-volec-disperse-0: Heal failed [Transport endpoint is not connected]" repeated
6 times between [2015-08-05 18:19:36.265674] and [2015-08-05 18:19:36.294575]
[2015-08-05 18:19:36.295111] W [MSGID: 122035]
[ec-common.c:462:ec_child_select] 0-volec-disperse-0: Executing operation with
some subvolumes unavailable (11)
[2015-08-05 18:19:36.295680] W [MSGID: 122053]
[ec-common.c:166:ec_check_status] 0-volec-disperse-0: Operation failed on some
subvolumes (up=3F, mask=2E, remaining=0, good=2E, bad=11)
[2015-08-05 18:19:36.301504] W [MSGID: 122002] [ec-common.c:122:ec_heal_report]
0-volec-disperse-0: Heal failed [Transport endpoint is not connected]
[2015-08-05 18:19:36.328919] W [MSGID: 122035]
[ec-common.c:462:ec_child_select] 0-volec-disperse-0: Executing operation with
some subvolumes unavailable (10)
[2015-08-05 18:19:36.330502] W [MSGID: 122053]
[ec-common.c:166:ec_check_status] 0-volec-disperse-0: Operation failed on some
subvolumes (up=3F, mask=2F, remaining=8, good=27, bad=10)
The message "W [MSGID: 122002] [ec-common.c:122:ec_heal_report]
0-volec-disperse-0: Heal failed [Transport endpoint is not connected]" repeated
13 times between [2015-08-05 18:19:36.301504] and [2015-08-05 18:19:36.362407]
