[Gluster-users] Cannot write more than 512 bytes to gluster vol

Jamie Lawrence jlawrence at squaretrade.com
Thu Mar 7 19:32:14 UTC 2019


I just stood up a new cluster running 4.1.7, my first experience with version 4. It is a simple replica 3 volume:

gluster v create la1_db_1 replica 3 \
	gluster-10g-1:/gluster-bricks/la1_db_1/la1_db_1 \
	gluster-10g-2:/gluster-bricks/la1_db_1/la1_db_1 \
	gluster-10g-3:/gluster-bricks/la1_db_1/la1_db_1

gluster v set la1_db_1 storage.owner-uid 130
gluster v set la1_db_1 storage.owner-gid 130
gluster v set la1_db_1 server.allow-insecure on
gluster v set la1_db_1 auth.allow [various IPs]

After mounting on a client, everything appears fine until you try to use it.

dd if=/dev/zero of=/path/on/client/foo

will write 512 bytes and then hang until the ping timeout, at which point it fails with "Transport endpoint is not connected".
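A bounded variant of the same test reproduces it just as well (the path below is a local stand-in so the command is self-contained; on the client I point it at the gluster mount instead):

```shell
# Bounded version of the dd test. conv=fsync flushes the data through to
# the backing store, so a stalled transport shows up as a hang right here
# rather than only when the file is closed.
# /tmp/foo is a stand-in path -- substitute the gluster mount on the client.
dd if=/dev/zero of=/tmp/foo bs=512 count=8 conv=fsync
```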

Notably, if I mount the volume on one of the gluster machines themselves, over the same interface, it behaves as it should. That led me to suspect packet filtering; the rules are configured correctly, and in any case the issue persists even after flushing all rules on all involved machines.
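For reference, the flush I ran (on the client and all three gluster hosts) was along these lines, which should rule packet filtering out entirely:

```shell
# Flush all iptables rules and chains and set default-accept policies,
# so no filtering can be in play while retesting the mount:
iptables -F
iptables -X
iptables -P INPUT ACCEPT
iptables -P FORWARD ACCEPT
iptables -P OUTPUT ACCEPT
```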

cli.log contains a lot of:

[2019-03-06 17:00:02.893553] I [cli.c:773:main] 0-cli: Started running /sbin/gluster with version 4.1.7
[2019-03-06 17:00:02.897199] I [cli-cmd-volume.c:2375:cli_check_gsync_present] 0-: geo-replication not installed
[2019-03-06 17:00:02.897545] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2019-03-06 17:00:02.897617] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2019-03-06 17:00:02.897678] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0
[2019-03-06 17:00:02.898244] I [input.c:31:cli_batch] 0-: Exiting with: 0
[2019-03-06 17:00:02.922637] I [cli.c:773:main] 0-cli: Started running /sbin/gluster with version 4.1.7
[2019-03-06 17:00:02.926599] I [cli-cmd-volume.c:2375:cli_check_gsync_present] 0-: geo-replication not installed
[2019-03-06 17:00:02.926906] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2019-03-06 17:00:02.926956] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2019-03-06 17:00:02.927113] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-glusterfs: error returned while attempting to connect to host:(null), port:0
[2019-03-06 17:00:02.927573] I [input.c:31:cli_batch] 0-: Exiting with: 0

The client log is more interesting, I just don't know what to make of it:

[2019-03-07 19:18:36.674687] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-la1_db_1-client-0: error returned while attempting to connect to host:(null), port:0
[2019-03-07 19:18:36.674726] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-la1_db_1-client-1: error returned while attempting to connect to host:(null), port:0
[2019-03-07 19:18:36.674752] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-la1_db_1-client-2: error returned while attempting to connect to host:(null), port:0
[2019-03-07 19:18:36.674806] I [rpc-clnt.c:2105:rpc_clnt_reconfig] 0-la1_db_1-client-0: changing port to 49152 (from 0)
[2019-03-07 19:18:36.674815] I [rpc-clnt.c:2105:rpc_clnt_reconfig] 0-la1_db_1-client-1: changing port to 49152 (from 0)
[2019-03-07 19:18:36.674927] I [rpc-clnt.c:2105:rpc_clnt_reconfig] 0-la1_db_1-client-2: changing port to 49152 (from 0)
[2019-03-07 19:18:36.675012] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-la1_db_1-client-1: error returned while attempting to connect to host:(null), port:0
[2019-03-07 19:18:36.675054] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-la1_db_1-client-0: error returned while attempting to connect to host:(null), port:0
[2019-03-07 19:18:36.675155] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-la1_db_1-client-2: error returned while attempting to connect to host:(null), port:0
[2019-03-07 19:18:36.675203] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-la1_db_1-client-1: error returned while attempting to connect to host:(null), port:0
[2019-03-07 19:18:36.675243] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-la1_db_1-client-0: error returned while attempting to connect to host:(null), port:0
[2019-03-07 19:18:36.675306] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-la1_db_1-client-2: error returned while attempting to connect to host:(null), port:0
[2019-03-07 19:18:36.675563] I [MSGID: 114046] [client-handshake.c:1095:client_setvolume_cbk] 0-la1_db_1-client-1: Connected to la1_db_1-client-1, attached to remote volume '/gluster-bricks/la1_db_1/la1_db_1'.
[2019-03-07 19:18:36.675573] I [MSGID: 108005] [afr-common.c:5336:__afr_handle_child_up_event] 0-la1_db_1-replicate-0: Subvolume 'la1_db_1-client-1' came back up; going online.
[2019-03-07 19:18:36.675722] I [MSGID: 114046] [client-handshake.c:1095:client_setvolume_cbk] 0-la1_db_1-client-0: Connected to la1_db_1-client-0, attached to remote volume '/gluster-bricks/la1_db_1/la1_db_1'.
[2019-03-07 19:18:36.675728] I [MSGID: 114046] [client-handshake.c:1095:client_setvolume_cbk] 0-la1_db_1-client-2: Connected to la1_db_1-client-2, attached to remote volume '/gluster-bricks/la1_db_1/la1_db_1'.
[2019-03-07 19:18:36.675743] I [MSGID: 108002] [afr-common.c:5611:afr_notify] 0-la1_db_1-replicate-0: Client-quorum is met
[2019-03-07 19:18:36.676564] I [fuse-bridge.c:4294:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 7.26
[2019-03-07 19:18:36.676578] I [fuse-bridge.c:4927:fuse_graph_sync] 0-fuse: switched to graph 0
[2019-03-07 19:18:36.677623] I [MSGID: 109005] [dht-selfheal.c:2342:dht_selfheal_directory] 0-la1_db_1-dht: Directory selfheal failed: Unable to form layout for directory /
[2019-03-07 19:20:01.674361] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-la1_db_1-client-0: server 172.16.0.171:49152 has not responded in the last 42 seconds, disconnecting.
[2019-03-07 19:20:01.674440] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-la1_db_1-client-1: server 172.16.0.172:49152 has not responded in the last 42 seconds, disconnecting.
[2019-03-07 19:20:01.674458] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-la1_db_1-client-2: server 172.16.0.174:49152 has not responded in the last 42 seconds, disconnecting.
[2019-03-07 19:20:01.674455] I [MSGID: 114018] [client.c:2254:client_rpc_notify] 0-la1_db_1-client-0: disconnected from la1_db_1-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2019-03-07 19:20:01.674530] I [MSGID: 114018] [client.c:2254:client_rpc_notify] 0-la1_db_1-client-1: disconnected from la1_db_1-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2019-03-07 19:20:01.674557] W [MSGID: 108001] [afr-common.c:5618:afr_notify] 0-la1_db_1-replicate-0: Client-quorum is not met
[2019-03-07 19:20:01.674776] E [rpc-clnt.c:348:saved_frames_unwind] (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fcfbed74ddb] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xc021)[0x7fcfbeb44021] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xc14e)[0x7fcfbeb4414e] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x8e)[0x7fcfbeb456be] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xe268)[0x7fcfbeb46268] ))))) 0-la1_db_1-client-0: forced unwinding frame type(GlusterFS 4.x v1) op(LOOKUP(27)) called at 2019-03-07 19:18:47.415290 (xid=0x11)
[2019-03-07 19:20:01.674798] W [MSGID: 114031] [client-rpc-fops_v2.c:2540:client4_0_lookup_cbk] 0-la1_db_1-client-0: remote operation failed. Path: /foo (ef9860a9-a366-49a4-9305-08cdb164395b) [Transport endpoint is not connected]
[2019-03-07 19:20:01.674808] E [rpc-clnt.c:348:saved_frames_unwind] (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fcfbed74ddb] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xc021)[0x7fcfbeb44021] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xc14e)[0x7fcfbeb4414e] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x8e)[0x7fcfbeb456be] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xe268)[0x7fcfbeb46268] ))))) 0-la1_db_1-client-1: forced unwinding frame type(GlusterFS 4.x v1) op(LOOKUP(27)) called at 2019-03-07 19:18:47.415303 (xid=0x15)
[2019-03-07 19:20:01.674830] W [MSGID: 114031] [client-rpc-fops_v2.c:2540:client4_0_lookup_cbk] 0-la1_db_1-client-1: remote operation failed. Path: /foo (ef9860a9-a366-49a4-9305-08cdb164395b) [Transport endpoint is not connected]
[2019-03-07 19:20:01.674888] E [rpc-clnt.c:348:saved_frames_unwind] (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fcfbed74ddb] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xc021)[0x7fcfbeb44021] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xc14e)[0x7fcfbeb4414e] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x8e)[0x7fcfbeb456be] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xe268)[0x7fcfbeb46268] ))))) 0-la1_db_1-client-0: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2019-03-07 19:19:19.670189 (xid=0x12)
[2019-03-07 19:20:01.674902] W [rpc-clnt-ping.c:222:rpc_clnt_ping_cbk] 0-la1_db_1-client-0: socket disconnected
[2019-03-07 19:20:01.674916] E [rpc-clnt.c:348:saved_frames_unwind] (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fcfbed74ddb] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xc021)[0x7fcfbeb44021] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xc14e)[0x7fcfbeb4414e] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x8e)[0x7fcfbeb456be] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xe268)[0x7fcfbeb46268] ))))) 0-la1_db_1-client-1: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2019-03-07 19:19:19.670216 (xid=0x16)
[2019-03-07 19:20:01.674932] W [rpc-clnt-ping.c:222:rpc_clnt_ping_cbk] 0-la1_db_1-client-1: socket disconnected
[2019-03-07 19:20:01.674935] I [MSGID: 114018] [client.c:2254:client_rpc_notify] 0-la1_db_1-client-2: disconnected from la1_db_1-client-2. Client process will keep trying to connect to glusterd until brick's port is available
[2019-03-07 19:20:01.674947] E [MSGID: 108006] [afr-common.c:5413:__afr_handle_child_down_event] 0-la1_db_1-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2019-03-07 19:20:01.675040] E [rpc-clnt.c:348:saved_frames_unwind] (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fcfbed74ddb] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xc021)[0x7fcfbeb44021] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xc14e)[0x7fcfbeb4414e] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x8e)[0x7fcfbeb456be] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xe268)[0x7fcfbeb46268] ))))) 0-la1_db_1-client-2: forced unwinding frame type(GlusterFS 4.x v1) op(LOOKUP(27)) called at 2019-03-07 19:18:47.415315 (xid=0x11)
[2019-03-07 19:20:01.675055] W [MSGID: 114031] [client-rpc-fops_v2.c:2540:client4_0_lookup_cbk] 0-la1_db_1-client-2: remote operation failed. Path: /foo (ef9860a9-a366-49a4-9305-08cdb164395b) [Transport endpoint is not connected]
[2019-03-07 19:20:01.675077] E [MSGID: 101046] [dht-common.c:1905:dht_revalidate_cbk] 0-la1_db_1-dht: dict is null
[2019-03-07 19:20:01.675097] W [fuse-resolve.c:61:fuse_resolve_entry_cbk] 0-fuse: 00000000-0000-0000-0000-000000000001/foo: failed to resolve (Transport endpoint is not connected)
[2019-03-07 19:20:01.675164] I [MSGID: 108006] [afr-common.c:5677:afr_local_init] 0-la1_db_1-replicate-0: no subvolumes up
[2019-03-07 19:20:01.675176] I [MSGID: 108006] [afr-common.c:5677:afr_local_init] 0-la1_db_1-replicate-0: no subvolumes up
[2019-03-07 19:20:01.675179] E [MSGID: 101046] [dht-common.c:1502:dht_lookup_dir_cbk] 0-la1_db_1-dht: dict is null
[2019-03-07 19:20:01.675194] W [fuse-bridge.c:565:fuse_entry_cbk] 0-glusterfs-fuse: 18: LOOKUP() /foo => -1 (Transport endpoint is not connected)
[2019-03-07 19:20:01.675280] I [MSGID: 108006] [afr-common.c:5677:afr_local_init] 0-la1_db_1-replicate-0: no subvolumes up
[2019-03-07 19:20:01.675282] E [rpc-clnt.c:348:saved_frames_unwind] (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fcfbed74ddb] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xc021)[0x7fcfbeb44021] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xc14e)[0x7fcfbeb4414e] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x8e)[0x7fcfbeb456be] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xe268)[0x7fcfbeb46268] ))))) 0-la1_db_1-client-2: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2019-03-07 19:19:19.670241 (xid=0x12)
[2019-03-07 19:20:01.675308] W [rpc-clnt-ping.c:222:rpc_clnt_ping_cbk] 0-la1_db_1-client-2: socket disconnected
[2019-03-07 19:20:01.675310] I [MSGID: 108006] [afr-common.c:5677:afr_local_init] 0-la1_db_1-replicate-0: no subvolumes up
[2019-03-07 19:20:01.675314] E [MSGID: 101046] [dht-common.c:1502:dht_lookup_dir_cbk] 0-la1_db_1-dht: dict is null
[2019-03-07 19:20:01.675332] W [fuse-resolve.c:61:fuse_resolve_entry_cbk] 0-fuse: 00000000-0000-0000-0000-000000000001/foo: failed to resolve (Transport endpoint is not connected)
[2019-03-07 19:20:01.675379] I [MSGID: 108006] [afr-common.c:5677:afr_local_init] 0-la1_db_1-replicate-0: no subvolumes up
[2019-03-07 19:20:01.675390] I [MSGID: 108006] [afr-common.c:5677:afr_local_init] 0-la1_db_1-replicate-0: no subvolumes up
[2019-03-07 19:20:01.675393] E [MSGID: 101046] [dht-common.c:1502:dht_lookup_dir_cbk] 0-la1_db_1-dht: dict is null
[2019-03-07 19:20:01.675405] W [fuse-bridge.c:565:fuse_entry_cbk] 0-glusterfs-fuse: 20: LOOKUP() /foo => -1 (Transport endpoint is not connected)
[2019-03-07 19:20:01.675447] I [MSGID: 108006] [afr-common.c:5677:afr_local_init] 0-la1_db_1-replicate-0: no subvolumes up
[2019-03-07 19:20:01.675470] I [MSGID: 108006] [afr-common.c:5677:afr_local_init] 0-la1_db_1-replicate-0: no subvolumes up
[2019-03-07 19:20:01.675472] E [MSGID: 101046] [dht-common.c:1905:dht_revalidate_cbk] 0-la1_db_1-dht: dict is null
[2019-03-07 19:20:01.675485] W [fuse-bridge.c:899:fuse_attr_cbk] 0-glusterfs-fuse: 21: LOOKUP() / => -1 (Transport endpoint is not connected)
[2019-03-07 19:20:11.675946] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-la1_db_1-client-0: error returned while attempting to connect to host:(null), port:0
[2019-03-07 19:20:11.675989] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-la1_db_1-client-1: error returned while attempting to connect to host:(null), port:0
[2019-03-07 19:20:11.676040] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-la1_db_1-client-2: error returned while attempting to connect to host:(null), port:0
[2019-03-07 19:20:11.676101] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-la1_db_1-client-0: error returned while attempting to connect to host:(null), port:0
[2019-03-07 19:20:11.676138] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-la1_db_1-client-1: error returned while attempting to connect to host:(null), port:0
[2019-03-07 19:20:11.676181] I [rpc-clnt.c:2105:rpc_clnt_reconfig] 0-la1_db_1-client-0: changing port to 49152 (from 0)
[2019-03-07 19:20:11.676261] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-la1_db_1-client-2: error returned while attempting to connect to host:(null), port:0
[2019-03-07 19:20:11.676271] I [rpc-clnt.c:2105:rpc_clnt_reconfig] 0-la1_db_1-client-1: changing port to 49152 (from 0)
[2019-03-07 19:20:11.676293] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-la1_db_1-client-0: error returned while attempting to connect to host:(null), port:0
[2019-03-07 19:20:11.676358] I [rpc-clnt.c:2105:rpc_clnt_reconfig] 0-la1_db_1-client-2: changing port to 49152 (from 0)
[2019-03-07 19:20:11.676455] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-la1_db_1-client-0: error returned while attempting to connect to host:(null), port:0
[2019-03-07 19:20:11.676477] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-la1_db_1-client-1: error returned while attempting to connect to host:(null), port:0
[2019-03-07 19:20:11.676571] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-la1_db_1-client-2: error returned while attempting to connect to host:(null), port:0
[2019-03-07 19:20:11.676684] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-la1_db_1-client-1: error returned while attempting to connect to host:(null), port:0
[2019-03-07 19:20:11.676753] W [rpc-clnt.c:1753:rpc_clnt_submit] 0-la1_db_1-client-2: error returned while attempting to connect to host:(null), port:0


This is a 10G network that is otherwise performing flawlessly, and I'm seeing no errors on the switches; as mentioned, I've tried this both in our normal configuration and with iptables flushed. I'm not entirely sure what to try next; any suggestions?
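Given that the small initial write and the RPC handshakes succeed but anything larger stalls, one check I still plan to run is whether large, unfragmentable packets make it across the path at all (hostname below is just one of the gluster hosts from the create command):

```shell
# Probe the path with a large, don't-fragment packet. -M do forbids
# fragmentation, so this fails loudly if an MTU mismatch somewhere on the
# 10G path is silently dropping big frames while small RPCs get through.
ping -c 3 -M do -s 8000 gluster-10g-1
```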

Thanks,

-j

