[Gluster-devel] constant "failed to submit message" to rpc-transport

Michael Brown michael at netdirect.ca
Thu May 9 06:24:47 UTC 2013


(there are probably going to be a TON of wrapped lines; the unwrapped version is at http://paste.fedoraproject.org/11223/80643136/ )

OK, after taking a much more serious look at this, I've got: 
* a full packet capture of a situation that causes this problem (358MB compressed) 
* logs from gluster at TRACE level, truncated just before starting the test (88MB compressed) 
* a corefile from the gluster/nfs process generated near the end of the test (10GB, compressible to 133MB or so) 

My configuration: 

Volume Name: gv0 
Type: Distributed-Replicate 
Volume ID: 52b830b4-f6b0-4103-b90b-ce665dfbafec 
Status: Stopped 
Number of Bricks: 8 x 2 = 16 
Transport-type: tcp 
Bricks: 
Brick1: fearless1:/export/bricks/500117310007a6d8/glusterdata 
Brick2: fearless2:/export/bricks/500117310007a674/glusterdata 
Brick3: fearless1:/export/bricks/500117310007a714/glusterdata 
Brick4: fearless2:/export/bricks/500117310007a684/glusterdata 
Brick5: fearless1:/export/bricks/500117310007a7dc/glusterdata 
Brick6: fearless2:/export/bricks/500117310007a694/glusterdata 
Brick7: fearless1:/export/bricks/500117310007a7e4/glusterdata 
Brick8: fearless2:/export/bricks/500117310007a720/glusterdata 
Brick9: fearless1:/export/bricks/500117310007a7ec/glusterdata 
Brick10: fearless2:/export/bricks/500117310007a74c/glusterdata 
Brick11: fearless1:/export/bricks/500117310007a838/glusterdata 
Brick12: fearless2:/export/bricks/500117310007a814/glusterdata 
Brick13: fearless1:/export/bricks/500117310007a850/glusterdata 
Brick14: fearless2:/export/bricks/500117310007a84c/glusterdata 
Brick15: fearless1:/export/bricks/500117310007a858/glusterdata 
Brick16: fearless2:/export/bricks/500117310007a8f8/glusterdata 
Options Reconfigured: 
diagnostics.count-fop-hits: on 
diagnostics.latency-measurement: on 
performance.nfs.io-threads: on 
performance.client-io-threads: on 
performance.io-cache: on 
performance.write-behind-window-size: 4MB 
performance.nfs.write-behind: on 
nfs.ports-insecure: on 
diagnostics.client-log-level: INFO 

Notable "problems": 
RPC context pointer, requested by JoeJulian: 

(gdb) print *this 
$22 = {ops = 0x7f05d50d2580, listener = 0x0, private = 0x27b6d90, xl_private = 0x0, xl = 0x261ffb0, mydata = 0x27a70d0, lock = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, 
__kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}, refcount = 2, ctx = 0x25dd010, options = 0x7f05d70ba3c8, 
name = 0x26215e0 "gv0-client-9", dnscache = 0x28fda50, buf = 0x0, init = 0x7f05d4ec5650 <init>, fini = 0x7f05d4ec6780 <fini>, reconfigure = 0x7f05d4ec56d0 <reconfigure>, 
notify = 0x7f05d83f2bd0 <rpc_clnt_notify>, notify_data = 0x0, peerinfo = {sockaddr = {ss_family = 2, __ss_align = 0, __ss_padding = '\000' <repeats 111 times>}, sockaddr_len = 16, 
identifier = "192.168.11.2:24037", '\000' <repeats 89 times>}, myinfo = {sockaddr = {ss_family = 2, __ss_align = 0, __ss_padding = '\000' <repeats 111 times>}, sockaddr_len = 16, 
identifier = "192.168.11.1:933", '\000' <repeats 91 times>}, total_bytes_read = 134672272, total_bytes_write = 3678711, list = {next = 0x0, prev = 0x0}, bind_insecure = 0} 
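For anyone reading along without the glusterfs source handy, here's a hedged C sketch of the fields I'm reading out of that dump (layout reconstructed from the gdb print above, not copied from rpc-transport.h, so treat names and sizes as approximate):

/* Sketch of the rpc_transport fields inspected above. Reconstructed
 * from the gdb output; the authoritative definition is in the
 * glusterfs tree (rpc-transport.h). */
#include <stdint.h>
#include <sys/socket.h>

typedef struct {
        struct sockaddr_storage sockaddr;
        socklen_t sockaddr_len;
        char identifier[108];            /* printable "host:port" */
} peer_info_sketch_t;

typedef struct {
        /* ops table, xlator pointers, lock, refcount etc. elided */
        char *name;                      /* "gv0-client-9": presumably the
                                          * client xlator for Brick10 on
                                          * fearless2 */
        peer_info_sketch_t peerinfo;     /* remote end: 192.168.11.2:24037 */
        peer_info_sketch_t myinfo;       /* local end:  192.168.11.1:933 */
        uint64_t total_bytes_read;       /* 134672272 in from the brick */
        uint64_t total_bytes_write;      /* 3678711 out: read-heavy traffic,
                                          * matching the workload */
} rpc_transport_sketch_t;

The transport itself looks sane to me (refcount of 2, and the peer address really is the brick), so whatever is wedged seems to be above this layer.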

40 million outstanding frames at this point: 
(gdb) print ((call_pool_t)this->ctx->pool) 
$23 = {{all_frames = {next = 0x25f96c0, prev = 0x25f9620}, all_stacks = {next_call = 0x25f96c0, prev_call = 0x25f9620}}, cnt = 39705952, lock = 0, frame_mem_pool = 0x0, stack_mem_pool = 0x0} 
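What that structure holds, sketched in C (again reconstructed from the gdb print rather than from glusterfs's stack.h, so types are approximate):

/* call_pool as printed above. cnt is the number of call stacks
 * allocated and not yet destroyed: 39705952 at dump time. */
#include <stdint.h>

struct list_sketch { void *next; void *prev; };

typedef struct {
        union {
                struct list_sketch all_frames;   /* every live call stack */
                struct {
                        void *next_call;
                        void *prev_call;
                } all_stacks;
        };
        int64_t cnt;             /* 39705952 = ~40M stacks in flight */
        int lock;                /* printed as 0 (unlocked) */
        void *frame_mem_pool;    /* both printed as 0x0 in this dump */
        void *stack_mem_pool;
} call_pool_sketch_t;

If the pool has been accumulating for roughly the 6021s profile window shown below, that's on the order of 6600 stacks per second being created and never wound down.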

gluster/nfs gets into a state (very quickly) where it's spending all of its time spinning somewhere: 

PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command 
26240 root 20 0 4287M 2062M 2736 R 1687 1.6 21:53.95 /usr/local/glusterfs/sbin/glusterfs -s localhost --volfile-id gluster/nfs 
26656 root 20 0 4287M 2062M 2736 R 99.0 1.6 1:06.76 /usr/local/glusterfs/sbin/glusterfs -s localhost --volfile-id gluster/nfs 
26649 root 20 0 4287M 2062M 2736 R 99.0 1.6 1:07.32 /usr/local/glusterfs/sbin/glusterfs -s localhost --volfile-id gluster/nfs 
26684 root 20 0 4287M 2062M 2736 R 99.0 1.6 1:08.14 /usr/local/glusterfs/sbin/glusterfs -s localhost --volfile-id gluster/nfs 
26682 root 20 0 4287M 2062M 2736 R 99.0 1.6 1:10.35 /usr/local/glusterfs/sbin/glusterfs -s localhost --volfile-id gluster/nfs 
26678 root 20 0 4287M 2062M 2736 R 99.0 1.6 1:07.98 /usr/local/glusterfs/sbin/glusterfs -s localhost --volfile-id gluster/nfs 
26675 root 20 0 4287M 2062M 2736 R 99.0 1.6 1:07.45 /usr/local/glusterfs/sbin/glusterfs -s localhost --volfile-id gluster/nfs 
26679 root 20 0 4287M 2062M 2736 R 99.0 1.6 1:10.32 /usr/local/glusterfs/sbin/glusterfs -s localhost --volfile-id gluster/nfs 
26676 root 20 0 4287M 2062M 2736 R 99.0 1.6 1:10.89 /usr/local/glusterfs/sbin/glusterfs -s localhost --volfile-id gluster/nfs 
26681 root 20 0 4287M 2062M 2736 R 99.0 1.6 1:06.87 /usr/local/glusterfs/sbin/glusterfs -s localhost --volfile-id gluster/nfs 
26677 root 20 0 4287M 2062M 2736 R 99.0 1.6 1:06.87 /usr/local/glusterfs/sbin/glusterfs -s localhost --volfile-id gluster/nfs 
26672 root 20 0 4287M 2062M 2736 R 99.0 1.6 1:11.07 /usr/local/glusterfs/sbin/glusterfs -s localhost --volfile-id gluster/nfs 
26674 root 20 0 4287M 2062M 2736 R 99.0 1.6 1:07.54 /usr/local/glusterfs/sbin/glusterfs -s localhost --volfile-id gluster/nfs 
26680 root 20 0 4287M 2062M 2736 R 99.0 1.6 1:10.12 /usr/local/glusterfs/sbin/glusterfs -s localhost --volfile-id gluster/nfs 
26280 root 20 0 4287M 2062M 2736 R 99.0 1.6 1:10.37 /usr/local/glusterfs/sbin/glusterfs -s localhost --volfile-id gluster/nfs 
26683 root 20 0 4287M 2062M 2736 R 99.0 1.6 1:08.90 /usr/local/glusterfs/sbin/glusterfs -s localhost --volfile-id gluster/nfs 
26652 root 20 0 4287M 2062M 2736 R 98.0 1.6 1:09.92 /usr/local/glusterfs/sbin/glusterfs -s localhost --volfile-id gluster/nfs 
6580 michael 20 0 111M 3112 1232 R 3.0 0.0 6:56.91 htop 
1611 root 20 0 0 0 0 S 0.0 0.0 0:25.56 fio-wq/0 
1647 root 20 0 0 0 0 S 0.0 0.0 0:21.33 fioa-scan 
26228 root 20 0 1386M 57112 2172 S 0.0 0.0 0:09.38 /usr/local/glusterfs/sbin/glusterfsd -s localhost --volfile-id gv0.fearles 
26267 root 20 0 1386M 57112 2172 S 0.0 0.0 0:01.46 /usr/local/glusterfs/sbin/glusterfsd -s localhost --volfile-id gv0.fearles 
26646 root 20 0 1386M 57112 2172 S 0.0 0.0 0:00.59 /usr/local/glusterfs/sbin/glusterfsd -s localhost --volfile-id gv0.fearles 
26671 root 20 0 1386M 57112 2172 S 0.0 0.0 0:00.57 /usr/local/glusterfs/sbin/glusterfsd -s localhost --volfile-id gv0.fearles 
26685 root 20 0 1386M 57112 2172 S 0.0 0.0 0:00.46 /usr/local/glusterfs/sbin/glusterfsd -s localhost --volfile-id gv0.fearles 
26686 root 20 0 1386M 57112 2172 S 0.0 0.0 0:00.25 /usr/local/glusterfs/sbin/glusterfsd -s localhost --volfile-id gv0.fearles 

yet the disks are idle: 

avg-cpu: %user %nice %system %iowait %steal %idle 
18.41 0.00 0.26 0.00 0.00 81.33 

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util 
sdf 0.00 0.00 0.00 0.67 0.00 6.17 18.50 0.00 2.00 2.00 0.13 
sdh 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 
sda 0.00 0.00 0.00 0.33 0.00 1.33 8.00 0.00 0.00 0.00 0.00 
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 
sdg 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 
sdd 0.00 0.00 0.00 0.67 0.00 6.17 18.50 0.00 2.00 2.00 0.13 
sdi 0.00 0.00 0.00 0.33 0.00 1.33 8.00 0.00 0.00 0.00 0.00 
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 
fioa 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 

and volume profile shows calls taking absurdly long to complete (READs account for 99.98% of total latency, averaging 8.45s each with a worst case over 115s): 

NFS Server : localhost 
---------------------- 
Cumulative Stats: 
Block Size: 2b+ 8b+ 16b+ 
No. of Reads: 0 0 0 
No. of Writes: 2 6 18 

Block Size: 32b+ 64b+ 128b+ 
No. of Reads: 0 0 0 
No. of Writes: 85 30 19 

Block Size: 256b+ 512b+ 1024b+ 
No. of Reads: 0 23 18 
No. of Writes: 81 28 54 

Block Size: 2048b+ 4096b+ 8192b+ 
No. of Reads: 6 28 385676 
No. of Writes: 21 16 7 

Block Size: 16384b+ 32768b+ 65536b+ 
No. of Reads: 289 135 31300 
No. of Writes: 235 0 0 

%-latency Avg-latency Min-Latency Max-Latency No. of calls Fop 
--------- ----------- ----------- ----------- ------------ ---- 
0.00 0.00 us 0.00 us 0.00 us 1 FORGET 
0.00 0.00 us 0.00 us 0.00 us 81 RELEASE 
0.00 671.00 us 671.00 us 671.00 us 1 UNLINK 
0.00 151.88 us 139.00 us 166.00 us 8 OPEN 
0.00 1622.00 us 1622.00 us 1622.00 us 1 RENAME 
0.00 411.88 us 345.00 us 502.00 us 8 FSTAT 
0.00 1515.50 us 549.00 us 3835.00 us 8 READDIRP 
0.00 488.23 us 155.00 us 4467.00 us 30 LK 
0.00 832.00 us 211.00 us 3206.00 us 119 SETATTR 
0.00 167.32 us 115.00 us 5326.00 us 677 STATFS 
0.00 11469.73 us 464.00 us 672289.00 us 74 CREATE 
0.00 3335.09 us 60.00 us 3806186.00 us 1200 ACCESS 
0.00 130144.47 us 235.00 us 26674232.00 us 256 LOOKUP 
0.00 223278.53 us 5.00 us 5554879.00 us 602 WRITE 
0.00 582000.45 us 51.00 us 75297085.00 us 278 FLUSH 
0.01 129560.57 us 65.00 us 70316547.00 us 3446 STAT 
99.98 8454338.20 us 99.00 us 115469515.00 us 417475 READ 

Duration: 6021 seconds 
Data Read: 5221816212 bytes 
Data Written: 4168883 bytes 
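
A quick sanity check on those READ numbers using Little's law (mean concurrency = completion rate x mean latency); the constants below are copied straight from the profile output above:

/* Back-of-the-envelope: how many READs must be sitting inside the
 * NFS server at once for the profile numbers above to hold? */
#include <stdio.h>

int main(void)
{
        double calls = 417475;             /* READ calls in the window */
        double avg_latency_s = 8.4543382;  /* 8454338.20 us average */
        double duration_s = 6021;          /* profile duration */

        double rate = calls / duration_s;        /* ~69 READs completed/s */
        double in_flight = rate * avg_latency_s; /* Little's law */

        printf("%.1f READ/s, ~%.0f READs in flight on average\n",
               rate, in_flight);
        return 0;
}

That works out to roughly 580 READs queued inside the server at any instant, even though (per the iostat above) the disks are doing essentially nothing.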

glusterfs really is taking a long time to respond. Tracing a single request through the logs (filter: rpc.xid == 0x8879128a), the same 8KB READ arrives six times (the original call plus five client retransmits), and glusterfs ends up submitting four separate replies, the first a full minute after the original call and the last seven minutes in: 

[2013-05-08 21:16:00.564817] T [rpcsvc.c:382:rpcsvc_request_create] 0-rpc-service: received rpc-message (XID: 0x8879128a, Ver: 2, Program: 100003, ProgVers: 3, Proc: 6) from rpc-transport (socket.nfs-server) 
[2013-05-08 21:16:00.564860] D [nfs3-helpers.c:1757:nfs3_log_rw_call] 0-nfs-nfsv3: XID: 8879128a, READ: args: FH: exportid 52b830b4-f6b0-4103-b90b-ce665dfbafec, gfid 77c67b0a-57a1-4b56-a6dd-15899529facd, offset: 7456178176, count: 8192 
[2013-05-08 21:16:03.161708] T [rpcsvc.c:382:rpcsvc_request_create] 0-rpc-service: received rpc-message (XID: 0x8879128a, Ver: 2, Program: 100003, ProgVers: 3, Proc: 6) from rpc-transport (socket.nfs-server) 
[2013-05-08 21:16:03.161784] D [nfs3-helpers.c:1757:nfs3_log_rw_call] 0-nfs-nfsv3: XID: 8879128a, READ: args: FH: exportid 52b830b4-f6b0-4103-b90b-ce665dfbafec, gfid 77c67b0a-57a1-4b56-a6dd-15899529facd, offset: 7456178176, count: 8192 
[2013-05-08 21:16:13.940617] T [rpcsvc.c:382:rpcsvc_request_create] 0-rpc-service: received rpc-message (XID: 0x8879128a, Ver: 2, Program: 100003, ProgVers: 3, Proc: 6) from rpc-transport (socket.nfs-server) 
[2013-05-08 21:16:13.940662] D [nfs3-helpers.c:1757:nfs3_log_rw_call] 0-nfs-nfsv3: XID: 8879128a, READ: args: FH: exportid 52b830b4-f6b0-4103-b90b-ce665dfbafec, gfid 77c67b0a-57a1-4b56-a6dd-15899529facd, offset: 7456178176, count: 8192 
[2013-05-08 21:16:41.028161] T [rpcsvc.c:382:rpcsvc_request_create] 0-rpc-service: received rpc-message (XID: 0x8879128a, Ver: 2, Program: 100003, ProgVers: 3, Proc: 6) from rpc-transport (socket.nfs-server) 
[2013-05-08 21:16:41.028205] D [nfs3-helpers.c:1757:nfs3_log_rw_call] 0-nfs-nfsv3: XID: 8879128a, READ: args: FH: exportid 52b830b4-f6b0-4103-b90b-ce665dfbafec, gfid 77c67b0a-57a1-4b56-a6dd-15899529facd, offset: 7456178176, count: 8192 
[2013-05-08 21:17:03.172976] D [nfs3-helpers.c:3421:nfs3_log_read_res] 0-nfs-nfsv3: XID: 8879128a, READ: NFS: 0(Call completed successfully.), POSIX: 0(Success), count: 8192, is_eof: 0, vector: count: 1, len: 8192 

[2013-05-08 21:17:03.173042] T [rpcsvc.c:1087:rpcsvc_submit_generic] 0-rpc-service: submitted reply for rpc-message (XID: 0x8879128a, Program: NFS3, ProgVers: 3, Proc: 6) to rpc-transport (socket.nfs-server) 
[2013-05-08 21:17:40.996580] D [nfs3-helpers.c:3421:nfs3_log_read_res] 0-nfs-nfsv3: XID: 8879128a, READ: NFS: 0(Call completed successfully.), POSIX: 0(Success), count: 8192, is_eof: 0, vector: count: 1, len: 8192 
[2013-05-08 21:17:40.996628] T [rpcsvc.c:1087:rpcsvc_submit_generic] 0-rpc-service: submitted reply for rpc-message (XID: 0x8879128a, Program: NFS3, ProgVers: 3, Proc: 6) to rpc-transport (socket.nfs-server) 
[2013-05-08 21:17:44.372915] T [rpcsvc.c:382:rpcsvc_request_create] 0-rpc-service: received rpc-message (XID: 0x8879128a, Ver: 2, Program: 100003, ProgVers: 3, Proc: 6) from rpc-transport (socket.nfs-server) 
[2013-05-08 21:17:44.372956] D [nfs3-helpers.c:1757:nfs3_log_rw_call] 0-nfs-nfsv3: XID: 8879128a, READ: args: FH: exportid 52b830b4-f6b0-4103-b90b-ce665dfbafec, gfid 77c67b0a-57a1-4b56-a6dd-15899529facd, offset: 7456178176, count: 8192 
[2013-05-08 21:18:12.439583] T [rpcsvc.c:382:rpcsvc_request_create] 0-rpc-service: received rpc-message (XID: 0x8879128a, Ver: 2, Program: 100003, ProgVers: 3, Proc: 6) from rpc-transport (socket.nfs-server) 
[2013-05-08 21:18:12.439628] D [nfs3-helpers.c:1757:nfs3_log_rw_call] 0-nfs-nfsv3: XID: 8879128a, READ: args: FH: exportid 52b830b4-f6b0-4103-b90b-ce665dfbafec, gfid 77c67b0a-57a1-4b56-a6dd-15899529facd, offset: 7456178176, count: 8192 
[2013-05-08 21:19:29.949901] D [nfs3-helpers.c:3421:nfs3_log_read_res] 0-nfs-nfsv3: XID: 8879128a, READ: NFS: 0(Call completed successfully.), POSIX: 0(Success), count: 8192, is_eof: 0, vector: count: 1, len: 8192 
[2013-05-08 21:19:29.949953] T [rpcsvc.c:1087:rpcsvc_submit_generic] 0-rpc-service: submitted reply for rpc-message (XID: 0x8879128a, Program: NFS3, ProgVers: 3, Proc: 6) to rpc-transport (socket.nfs-server) 
[2013-05-08 21:23:02.353017] D [nfs3-helpers.c:3421:nfs3_log_read_res] 0-nfs-nfsv3: XID: 8879128a, READ: NFS: 0(Call completed successfully.), POSIX: 0(Success), count: 8192, is_eof: 0, vector: count: 1, len: 8192 
[2013-05-08 21:23:02.353066] T [rpcsvc.c:1087:rpcsvc_submit_generic] 0-rpc-service: submitted reply for rpc-message (XID: 0x8879128a, Program: NFS3, ProgVers: 3, Proc: 6) to rpc-transport (socket.nfs-server) 

-- 
Michael Brown, Systems Consultant 
Net Direct Inc. 
☎: +1 519 883 1172 x5106 

----- Original Message -----

> After a bit of load, I constantly find my gluster server getting into
> a state where it seems to be unable to reply to NFS RPCs:

> [2013-05-05 01:31:16.421507] E [rpcsvc.c:1080:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x3705786983x, Program: NFS3, ProgVers: 3, Proc: 6) to rpc-transport (socket.nfs-server)
> [2013-05-05 01:31:16.421528] E [nfs3.c:627:nfs3svc_submit_vector_reply] 0-nfs-nfsv3: Reply submission failed
>
> http://pastie.org/7803022

> Any idea what to do about it?

> The NFS daemon also gets rather large - I suspect it's storing up
> data for all these RPCs:
> PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
> 30674 root 20 0 37.6G 37.3G 2288 R 99.0 29.6 5:29.88 /usr/local/glusterfs/sbin/glusterfs

> I'm running 3.3.1 with a few patches:
> https://github.com/Supermathie/glusterfs/tree/release-3.3-oracle

> Workload is usually Oracle DNFS.

> M.
> --
> Michael Brown               | `One of the main causes of the fall of
> Systems Consultant          | the Roman Empire was that, lacking zero,
> Net Direct Inc.             | they had no way to indicate successful
> ☎: +1 519 883 1172 x5106    | termination of their C programs.' - Firth