[Gluster-users] Gluster Startup Issue

Danny Lee dannyl at vt.edu
Mon Jun 27 04:05:47 UTC 2016


Actually, never mind.  I had another background process running that was
checking the state of the gluster cluster.  I turned that off and when I
ran "gluster volume heal appian full" on server-ip-1, only the
server-ip-1's tailed logs showed anything.  The other two server's logs
didn't output anything.
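
For anyone trying to reproduce the check, this is roughly what I was
doing on each node (a minimal sketch; it assumes the default log
directory /var/log/glusterfs, and that glustershd.log is the self-heal
daemon log on our build):

# on every node, watch the relevant logs
tail -F /var/log/glusterfs/etc-glusterfs-glusterd.vol.log \
        /var/log/glusterfs/cli.log \
        /var/log/glusterfs/glustershd.log

# then, on server-ip-1 only
gluster volume heal appian full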

On Sun, Jun 26, 2016 at 2:02 PM, Danny Lee <dannyl at vt.edu> wrote:

> Originally, I ran "sudo gluster volume heal appian full" on server-ip-1
> and then tailed the logs on all of the servers.  At first, only
> server-ip-1's logs showed anything, so I thought it wasn't even
> connecting to the other boxes.  But after about 15 seconds, entries
> showed up in server-ip-2's and server-ip-3's logs as well.  Thanks for
> pointing that out, Joe.
>
> Here are the tailed logs.  I also noticed some locking errors that pop
> up once in a while in etc-glusterfs-glusterd.vol.log; those logs are
> included below as well.
>
> ==== server-ip-1 ====
>
> ==> etc-glusterfs-glusterd.vol.log <==
> [2016-06-26 16:48:31.405513] I [MSGID: 106533]
> [glusterd-volume-ops.c:857:__glusterd_handle_cli_heal_volume] 0-management:
> Received heal vol req for volume volname
> [2016-06-26 16:48:31.409903] E [MSGID: 106153]
> [glusterd-syncop.c:113:gd_collate_errors] 0-glusterd: Commit failed on
> server-ip-2. Please check log file for details.
>
> ==> cli.log <==
> [2016-06-26 16:48:51.208828] I [cli.c:721:main] 0-cli: Started running
> gluster with version 3.7.11
> [2016-06-26 16:48:51.213391] I
> [cli-cmd-volume.c:1795:cli_check_gsync_present] 0-: geo-replication not
> installed
> [2016-06-26 16:48:51.213674] I [MSGID: 101190]
> [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread
> with index 1
> [2016-06-26 16:48:51.213733] I [socket.c:2356:socket_event_handler]
> 0-transport: disconnecting now
> [2016-06-26 16:48:51.219674] I [cli-rpc-ops.c:8417:gf_cli_heal_volume_cbk]
> 0-cli: Received resp to heal volume
> [2016-06-26 16:48:51.219768] I [input.c:36:cli_batch] 0-: Exiting with: -1
>
> ==> cmd_history.log <==
> [2016-06-26 16:48:51.219596]  : volume heal volname full : FAILED : Commit
> failed on server-ip-2. Please check log file for details.
>
> ==> etc-glusterfs-glusterd.vol.log <==
> [2016-06-26 16:48:51.214185] I [MSGID: 106533]
> [glusterd-volume-ops.c:857:__glusterd_handle_cli_heal_volume] 0-management:
> Received heal vol req for volume volname
> [2016-06-26 16:48:51.219031] E [MSGID: 106153]
> [glusterd-syncop.c:113:gd_collate_errors] 0-glusterd: Commit failed on
> server-ip-2. Please check log file for details.
>
>
> ==== server-ip-2 ====
>
> ==> etc-glusterfs-glusterd.vol.log <==
> [2016-06-26 16:48:30.087365] I [MSGID: 106533]
> [glusterd-volume-ops.c:857:__glusterd_handle_cli_heal_volume] 0-management:
> Received heal vol req for volume volname
> [2016-06-26 16:48:30.092829] E [MSGID: 106153]
> [glusterd-syncop.c:113:gd_collate_errors] 0-glusterd: Commit failed on
> server-ip-2. Please check log file for details.
>
> ==> cli.log <==
> [2016-06-26 16:49:30.099446] I [cli.c:721:main] 0-cli: Started running
> gluster with version 3.7.11
> [2016-06-26 16:49:30.104599] I
> [cli-cmd-volume.c:1795:cli_check_gsync_present] 0-: geo-replication not
> installed
> [2016-06-26 16:49:30.104853] I [MSGID: 101190]
> [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread
> with index 1
> [2016-06-26 16:49:30.104896] I [socket.c:2356:socket_event_handler]
> 0-transport: disconnecting now
> [2016-06-26 16:49:30.177924] I [input.c:36:cli_batch] 0-: Exiting with: 0
>
>
> ==== server-ip-3 ====
>
> ==> cli.log <==
> [2016-06-26 16:48:49.177859] I [cli.c:721:main] 0-cli: Started running
> gluster with version 3.7.11
> [2016-06-26 16:48:49.182887] I
> [cli-cmd-volume.c:1795:cli_check_gsync_present] 0-: geo-replication not
> installed
> [2016-06-26 16:48:49.183146] I [MSGID: 101190]
> [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread
> with index 1
> [2016-06-26 16:48:49.183188] I [socket.c:2356:socket_event_handler]
> 0-transport: disconnecting now
> [2016-06-26 16:48:49.189005] I [cli-rpc-ops.c:8417:gf_cli_heal_volume_cbk]
> 0-cli: Received resp to heal volume
> [2016-06-26 16:48:49.189058] I [input.c:36:cli_batch] 0-: Exiting with: -1
>
>
> ==== All servers ====
>
> ==> glfsheal-volname.log <==
> [2016-06-26 16:51:47.493809] I [MSGID: 104045] [glfs-master.c:95:notify]
> 0-gfapi: New graph 6766732d-332d-6272-6963-6b2d656d7074 (0) coming up
> [2016-06-26 16:51:47.493841] I [MSGID: 114020] [client.c:2106:notify]
> 0-volname-client-0: parent translators are ready, attempting connect on
> transport
> [2016-06-26 16:51:47.496465] I [MSGID: 114020] [client.c:2106:notify]
> 0-volname-client-1: parent translators are ready, attempting connect on
> transport
> [2016-06-26 16:51:47.496729] I [rpc-clnt.c:1868:rpc_clnt_reconfig]
> 0-volname-client-0: changing port to 49152 (from 0)
> [2016-06-26 16:51:47.498945] I [MSGID: 114020] [client.c:2106:notify]
> 0-volname-client-2: parent translators are ready, attempting connect on
> transport
> [2016-06-26 16:51:47.501600] I [MSGID: 114057]
> [client-handshake.c:1437:select_server_supported_programs]
> 0-volname-client-0: Using Program GlusterFS 3.3, Num (1298437), Version
> (330)
> [2016-06-26 16:51:47.502008] I [MSGID: 114046]
> [client-handshake.c:1213:client_setvolume_cbk] 0-volname-client-0:
> Connected to volname-client-0, attached to remote volume
> '/usr/local/volname/local-data/mirrored-data'.
> [2016-06-26 16:51:47.502031] I [MSGID: 114047]
> [client-handshake.c:1224:client_setvolume_cbk] 0-volname-client-0: Server
> and Client lk-version numbers are not same, reopening the fds
> [2016-06-26 16:51:47.502101] I [MSGID: 108005]
> [afr-common.c:4007:afr_notify] 0-volname-replicate-0: Subvolume
> 'volname-client-0' came back up; going online.
> [2016-06-26 16:51:47.502146] I [rpc-clnt.c:1868:rpc_clnt_reconfig]
> 0-volname-client-1: changing port to 49152 (from 0)
> [2016-06-26 16:51:47.502201] I [MSGID: 114035]
> [client-handshake.c:193:client_set_lk_version_cbk] 0-volname-client-0:
> Server lk version = 1
> [2016-06-26 16:51:47.504858] I [rpc-clnt.c:1868:rpc_clnt_reconfig]
> 0-volname-client-2: changing port to 49152 (from 0)
> [2016-06-26 16:51:47.507128] I [MSGID: 114057]
> [client-handshake.c:1437:select_server_supported_programs]
> 0-volname-client-1: Using Program GlusterFS 3.3, Num (1298437), Version
> (330)
> [2016-06-26 16:51:47.507674] I [MSGID: 114046]
> [client-handshake.c:1213:client_setvolume_cbk] 0-volname-client-1:
> Connected to volname-client-1, attached to remote volume
> '/usr/local/volname/local-data/mirrored-data'.
> [2016-06-26 16:51:47.507692] I [MSGID: 114047]
> [client-handshake.c:1224:client_setvolume_cbk] 0-volname-client-1: Server
> and Client lk-version numbers are not same, reopening the fds
> [2016-06-26 16:51:47.507994] I [MSGID: 114035]
> [client-handshake.c:193:client_set_lk_version_cbk] 0-volname-client-1:
> Server lk version = 1
> [2016-06-26 16:51:47.509373] I [MSGID: 114057]
> [client-handshake.c:1437:select_server_supported_programs]
> 0-volname-client-2: Using Program GlusterFS 3.3, Num (1298437), Version
> (330)
> [2016-06-26 16:51:47.509815] I [MSGID: 114046]
> [client-handshake.c:1213:client_setvolume_cbk] 0-volname-client-2:
> Connected to volname-client-2, attached to remote volume
> '/usr/local/volname/local-data/mirrored-data'.
> [2016-06-26 16:51:47.509825] I [MSGID: 114047]
> [client-handshake.c:1224:client_setvolume_cbk] 0-volname-client-2: Server
> and Client lk-version numbers are not same, reopening the fds
> [2016-06-26 16:51:47.526799] I [MSGID: 114035]
> [client-handshake.c:193:client_set_lk_version_cbk] 0-volname-client-2:
> Server lk version = 1
> [2016-06-26 16:51:47.528396] I [MSGID: 104041]
> [glfs-resolve.c:869:__glfs_active_subvol] 0-volname: switched to graph
> 6766732d-332d-6272-6963-6b2d656d7074 (0)
>
>
>
> ==== Possibly Related? ====
> ==== server-ip-1 ====
>
> ==> Some logs of note in etc-glusterfs-glusterd.vol.log <==
> [2016-06-26 11:35:13.583338] W
> [glusterd-locks.c:577:glusterd_mgmt_v3_lock]
> (-->/usr/lib64/glusterfs/3.7.11/xlator/mgmt/glusterd.so(glusterd_op_begin_synctask+0x30)
> [0x7fad0f286520]
> -->/usr/lib64/glusterfs/3.7.11/xlator/mgmt/glusterd.so(gd_sync_task_begin+0x980)
> [0x7fad0f286450]
> -->/usr/lib64/glusterfs/3.7.11/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_lock+0x5f7)
> [0x7fad0f28ac47] ) 0-management: Lock for volname held by
> e24bbfa8-5e0f-4ff2-846d-743bf47d2022
> [2016-06-26 11:35:13.583355] E [MSGID: 106119]
> [glusterd-syncop.c:1830:gd_sync_task_begin] 0-management: Unable to acquire
> lock for volname
> [2016-06-26 11:35:13.601006] I [MSGID: 106494]
> [glusterd-handler.c:3070:__glusterd_handle_cli_profile_volume]
> 0-management: Received volume profile req for volume volname
> [2016-06-26 11:35:13.868117] W
> [glusterd-locks.c:577:glusterd_mgmt_v3_lock]
> (-->/usr/lib64/glusterfs/3.7.11/xlator/mgmt/glusterd.so(glusterd_op_sm+0x29f)
> [0x7fad0f1fb86f]
> -->/usr/lib64/glusterfs/3.7.11/xlator/mgmt/glusterd.so(+0x5c675)
> [0x7fad0f1ed675]
> -->/usr/lib64/glusterfs/3.7.11/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_lock+0x5f7)
> [0x7fad0f28ac47] ) 0-management: Lock for volname held by
> 9c486029-ec26-4222-b302-b609216fd68f
> [2016-06-26 11:35:13.868135] E [MSGID: 106119]
> [glusterd-op-sm.c:3711:glusterd_op_ac_lock] 0-management: Unable to acquire
> lock for volname
> [2016-06-26 11:35:13.868175] E [MSGID: 106376]
> [glusterd-op-sm.c:7591:glusterd_op_sm] 0-management: handler returned: -1
> [2016-06-26 11:35:13.869150] E [MSGID: 106116]
> [glusterd-mgmt.c:135:gd_mgmt_v3_collate_errors] 0-management: Locking
> failed on server-ip-2. Please check log file for details.
> [2016-06-26 11:35:13.869515] W
> [glusterd-locks.c:692:glusterd_mgmt_v3_unlock]
> (-->/usr/lib64/glusterfs/3.7.11/xlator/mgmt/glusterd.so(glusterd_op_sm+0x29f)
> [0x7fad0f1fb86f]
> -->/usr/lib64/glusterfs/3.7.11/xlator/mgmt/glusterd.so(+0x5c362)
> [0x7fad0f1ed362]
> -->/usr/lib64/glusterfs/3.7.11/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_unlock+0x37a)
> [0x7fad0f28afea] ) 0-management: Lock owner mismatch. Lock for vol volname
> held by 9c486029-ec26-4222-b302-b609216fd68f
> [2016-06-26 11:35:13.869533] E [MSGID: 106118]
> [glusterd-op-sm.c:3774:glusterd_op_ac_unlock] 0-management: Unable to
> release lock for volname
> [2016-06-26 11:35:13.869573] E [MSGID: 106376]
> [glusterd-op-sm.c:7591:glusterd_op_sm] 0-management: handler returned: 1
> [2016-06-26 11:35:13.869590] E [MSGID: 106151]
> [glusterd-syncop.c:1868:gd_sync_task_begin] 0-management: Locking Peers
> Failed.
> [2016-06-26 11:35:13.872140] E [MSGID: 106116]
> [glusterd-mgmt.c:135:gd_mgmt_v3_collate_errors] 0-management: Unlocking
> failed on server-ip-2. Please check log file for details.
> [2016-06-26 11:35:13.872187] E [MSGID: 106152]
> [glusterd-syncop.c:1569:gd_unlock_op_phase] 0-management: Failed to unlock
> on some peer(s)
> [2016-06-26 11:35:13.959124] W
> [glusterd-locks.c:577:glusterd_mgmt_v3_lock]
> (-->/usr/lib64/glusterfs/3.7.11/xlator/mgmt/glusterd.so(glusterd_op_begin_synctask+0x30)
> [0x7fad0f286520]
> -->/usr/lib64/glusterfs/3.7.11/xlator/mgmt/glusterd.so(gd_sync_task_begin+0x980)
> [0x7fad0f286450]
> -->/usr/lib64/glusterfs/3.7.11/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_lock+0x5f7)
> [0x7fad0f28ac47] ) 0-management: Lock for volname held by
> e24bbfa8-5e0f-4ff2-846d-743bf47d2022
> The message "I [MSGID: 106494]
> [glusterd-handler.c:3070:__glusterd_handle_cli_profile_volume]
> 0-management: Received volume profile req for volume volname" repeated 7
> times between [2016-06-26 11:35:13.601006] and [2016-06-26 11:35:14.822349]
> [2016-06-26 11:35:40.905084] I [MSGID: 106533]
> [glusterd-volume-ops.c:857:__glusterd_handle_cli_heal_volume] 0-management:
> Received heal vol req for volume volname
> [2016-06-26 11:35:40.909790] E [MSGID: 106153]
> [glusterd-syncop.c:113:gd_collate_errors] 0-glusterd: Commit failed on
> server-ip-2. Please check log file for details.
> The message "I [MSGID: 106488]
> [glusterd-handler.c:1533:__glusterd_handle_cli_get_volume] 0-glusterd:
> Received get vol req" repeated 2 times between [2016-06-26 11:35:13.538007]
> and [2016-06-26 11:35:13.735654]
> [2016-06-26 11:35:13.959143] E [MSGID: 106119]
> [glusterd-syncop.c:1830:gd_sync_task_begin] 0-management: Unable to acquire
> lock for volname
> The message "I [MSGID: 106499]
> [glusterd-handler.c:4331:__glusterd_handle_status_volume] 0-management:
> Received status volume req for volume volname" repeated 8 times between
> [2016-06-26 11:35:13.583212] and [2016-06-26 11:35:14.358853]
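>
> A side note on the lock messages above (this is just how I read them,
> so treat it as an assumption): the UUID in "Lock for volname held by
> <uuid>" can be matched to a node by comparing it against the local
> node's UUID and the peers' UUIDs, e.g.:
>
> cat /var/lib/glusterd/glusterd.info   # UUID of the local node
> gluster peer status                   # Uuid lines for the other peers
>
> If the holder is a node where no operation is actually in progress,
> restarting glusterd there ("systemctl restart glusterd") is what we
> would try in order to clear the stale lock.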
>
> On Sat, Jun 25, 2016 at 11:17 AM, Joe Julian <joe at julianfamily.org> wrote:
>
>> Notice that it actually tells you to look in the logs on server-ip-2,
>> but you did not include any logs from that server.
>>
>> On June 21, 2016 10:22:14 AM PDT, Danny Lee <dannyl at vt.edu> wrote:
>>
>>> Hello,
>>>
>>> We are currently figuring out how to add GlusterFS to our system, via
>>> scripts, in order to make it highly available.  We are using Gluster 3.7.11.
>>>
>>> Problem:
>>> We are trying to migrate, using scripts, from a non-clustered system to
>>> a 3-node replicated GlusterFS cluster.  We have tried various things to
>>> make this work, but it sometimes leaves us in an undesirable state
>>> where calling "gluster volume heal <volname> full" returns the error
>>> message, "Launching heal operation to perform full self heal on volume
>>> <volname> has been unsuccessful on bricks that are down. Please check if
>>> all brick processes are running."  All the brick processes are running
>>> according to "gluster volume status <volname>".
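>>>
>>> (For reference, a sketch of the checks we run, or could run, before
>>> calling heal; just the commands, nothing clever:)
>>>
>>> gluster volume status <volname>      # every brick should show a PID and a port
>>> gluster volume heal <volname> info   # should list each brick without a "not connected" status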
>>>
>>> Things we have tried (in order of preference):
>>> 1. Create Volume with 3 Filesystems with the same data
>>> 2. Create Volume with 2 empty filesystems and one with the data
>>> 3. Create Volume with only one filesystem with data and then using the
>>> "add-brick" command to add the other two empty filesystems (see the
>>> sketch after this list)
>>> 4. Create Volume with one empty filesystem, mounting it, copying the
>>> data over to that one, and then finally using the "add-brick" command
>>> to add the other two empty filesystems
>>> 5. Create Volume with 3 empty filesystems, mounting it, and then copying
>>> the data over
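>>>
>>> For approach 3, the add-brick step looks roughly like this (a sketch
>>> only; the brick paths are placeholders, not our real paths):
>>>
>>> gluster volume create <volname> transport tcp <server-ip-1>:/path/to/brick force
>>> gluster volume start <volname>
>>> gluster volume add-brick <volname> replica 3 <server-ip-2>:/path/to/brick <server-ip-3>:/path/to/brick force
>>> gluster volume heal <volname> full   # sync the data onto the new bricks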
>>>
>>> Other things to note:
>>>   A few minutes after the volume is created and started successfully,
>>> our application server starts up against it, so reads and writes may happen
>>> pretty quickly after the volume has started.  But there is only about 50MB
>>> of data.
>>>
>>> Steps to reproduce (all in a script):
>>> # This is run on the primary node, <server-ip-1>, which has the data
>>> systemctl restart glusterd
>>> gluster peer probe <server-ip-2>
>>> gluster peer probe <server-ip-3>
>>> # Wait until "gluster peer status" shows both peers in the "Peer in Cluster" state
>>> gluster volume create <volname> replica 3 transport tcp ${BRICKS[0]}
>>> ${BRICKS[1]} ${BRICKS[2]} force
>>> gluster volume set <volname> nfs.disable true
>>> gluster volume start <volname>
>>> mkdir -p $MOUNT_POINT
>>> mount -t glusterfs <server-ip-1>:/volname $MOUNT_POINT
>>> find $MOUNT_POINT | xargs stat
>>>
>>> Note that, when we added sleeps around the gluster commands, there was a
>>> higher probability of success, but not 100%.
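>>>
>>> Instead of fixed sleeps we are considering polling for the peer state,
>>> something like this (rough sketch; it assumes exactly the two probed
>>> peers):
>>>
>>> until [ "$(gluster peer status | grep -c 'Peer in Cluster')" -eq 2 ]; do
>>>   sleep 1
>>> done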
>>>
>>> # Once the volume is started, all the clients/servers will mount the
>>> gluster filesystem by polling "mountpoint -q $MOUNT_POINT":
>>> mkdir -p $MOUNT_POINT
>>> mount -t glusterfs <server-ip-1>:/volname $MOUNT_POINT
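>>>
>>> The polling itself is just a retry loop along these lines (sketch):
>>>
>>> mkdir -p $MOUNT_POINT
>>> until mountpoint -q $MOUNT_POINT; do
>>>   mount -t glusterfs <server-ip-1>:/volname $MOUNT_POINT || sleep 1
>>> done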
>>>
>>> Logs:
>>> *etc-glusterfs-glusterd.vol.log* in *server-ip-1*
>>>
>>> [2016-06-21 14:10:38.285234] I [MSGID: 106533]
>>> [glusterd-volume-ops.c:857:__glusterd_handle_cli_heal_volume] 0-management:
>>> Received heal vol req for volume volname
>>> [2016-06-21 14:10:38.296801] E [MSGID: 106153]
>>> [glusterd-syncop.c:113:gd_collate_errors] 0-glusterd: Commit failed on
>>> <server-ip-2>. Please check log file for details.
>>>
>>>
>>> *usr-local-volname-data-mirrored-data.log* in *server-ip-1*
>>>
>>> [2016-06-21 14:14:39.233366] E [MSGID: 114058]
>>> [client-handshake.c:1524:client_query_portmap_cbk] 0-volname-client-0:
>>> failed to get the port number for remote subvolume. Please run 'gluster
>>> volume status' on server to see if brick process is running.
>>> *I think this is caused by the self heal daemon*
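>>>
>>> (When that shows up we sanity-check the brick with something like the
>>> following; just a check, not a fix:)
>>>
>>> gluster volume status <volname> detail   # each brick should show Online: Y and a TCP Port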
>>>
>>> *cmd_history.log* in *server-ip-1*
>>>
>>> [2016-06-21 14:10:38.298800]  : volume heal volname full : FAILED :
>>> Commit failed on <server-ip-2>. Please check log file for details.
>>>
>>> ------------------------------
>>>
>>> Gluster-users mailing list
>>> Gluster-users at gluster.org
>>> http://www.gluster.org/mailman/listinfo/gluster-users
>>>
>>>
>> --
>> Sent from my Android device with K-9 Mail. Please excuse my brevity.
>>
>
>

