[Gluster-users] Issue in RDMA transport
beat at 0x1b.ch
Mon Feb 21 06:51:23 UTC 2011
I found some memory corruption in the RDMA transport layer.
Setup is CentOS 5.5, Mellanox OFED 1.5.2 / OpenFabrics OFED 1.5.2,
ConnectX-2 cards, GlusterFS 3.1.2 / Git Master Branch.
The application is ANSYS CFX with transient cases, running with unusual
core counts like 6 or 12.
The symptom is a failure while the case is being written out. Errors are
recorded in both the brick's and the client's logs:
[2011-02-04 15:41:19.688110] W [fuse-bridge.c:1761:fuse_writev_cbk]
glusterfs-fuse: 29810266: WRITE => -1 (Bad address)
[2011-02-04 15:41:19.687733] E [posix.c:2504:posix_writev] home-posix:
write failed: offset 538534184, Bad address
I was able to reproduce the error using a single brick and a single
client. Running server and client on the same system did not trigger the
error; the data has to cross the wire for the bug to show up. Switching
to TCP over IPoIB was a successful workaround.
It looks like a pointer in the iovec structure used by the writev call
is corrupted during the transport over RDMA, which would explain the
"Bad address" (EFAULT) errors above. I can imagine that debugging this
will be rather hard; hopefully you'll be able to find the
root cause. Feel free to ask for additional logs or traces, I'll try to
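
For readers unfamiliar with the symptom, here is a minimal sketch of how an unusable iov_base alone is enough to make writev() fail with "Bad address" on Linux. It does not touch GlusterFS or RDMA at all. The helper names (writev_errno, inaccessible_region, demo_bad_iov_base) are made up for illustration, and the PROT_NONE mapping is only an assumed stand-in for whatever value the corrupted pointer really had.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <unistd.h>

/* Submit a single-entry iovec with writev(); return 0 on success,
 * otherwise the errno that writev() reported. */
int writev_errno(int fd, void *base, size_t len)
{
    struct iovec iov = { .iov_base = base, .iov_len = len };
    return writev(fd, &iov, 1) < 0 ? errno : 0;
}

/* Map a region with PROT_NONE so the kernel cannot read from it.
 * This stands in for a stale or corrupted iov_base (an assumption;
 * the real corrupted pointer value is not known). */
void *inaccessible_region(size_t len)
{
    return mmap(NULL, len, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}

/* Write once from a valid buffer and once from the inaccessible
 * region; return the errno of the second write. */
int demo_bad_iov_base(void)
{
    char good[] = "payload";
    FILE *f = tmpfile();          /* scratch regular file */
    assert(f != NULL);
    int fd = fileno(f);

    void *bad = inaccessible_region(4096);
    assert(bad != MAP_FAILED);

    int ok = writev_errno(fd, good, sizeof good);
    int err = writev_errno(fd, bad, 4096);
    printf("valid buffer: %d, bad buffer: %d (%s)\n", ok, err, strerror(err));

    munmap(bad, 4096);
    fclose(f);
    return err;
}
```

Calling demo_bad_iov_base() from a small main() should report success for the valid buffer and EFAULT for the inaccessible one, which is the same error text that posix_writev logs on the brick.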
\|/ Beat Rubischon <beat at 0x1b.ch>
( 0-0 ) http://www.0x1b.ch/~beat/
My experiences, thoughts and dreams: http://www.0x1b.ch/blog/