[Gluster-users] Gluster Startup Issue
Danny Lee
dannyl at vt.edu
Sun Jun 26 18:02:27 UTC 2016
Originally, I ran "sudo gluster volume heal appian full" on server-ip-1 and
then tailed the logs on all of the servers. At first, new entries only
appeared in the logs on server-ip-1, so I thought it wasn't even connecting
to the other boxes. But after about 15 seconds, entries showed up on
server-ip-2 and server-ip-3 as well. Thanks for pointing that out, Joe.
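For reference, this is roughly what I ran (a sketch; the log paths are the
GlusterFS defaults under /var/log/glusterfs on our boxes):

  # on server-ip-1: kick off the full heal
  sudo gluster volume heal appian full

  # on each of the three servers: follow the relevant logs
  sudo tail -F /var/log/glusterfs/etc-glusterfs-glusterd.vol.log \
               /var/log/glusterfs/cli.log \
               /var/log/glusterfs/cmd_history.log \
               /var/log/glusterfs/glfsheal-volname.log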
Here are the tailed logs. I also noticed some locking errors that pop up
once in a while in etc-glusterfs-glusterd.vol.log; I have included those
below as well (under "Possibly Related?").
==== server-ip-1 ====
==> etc-glusterfs-glusterd.vol.log <==
[2016-06-26 16:48:31.405513] I [MSGID: 106533]
[glusterd-volume-ops.c:857:__glusterd_handle_cli_heal_volume] 0-management:
Received heal vol req for volume volname
[2016-06-26 16:48:31.409903] E [MSGID: 106153]
[glusterd-syncop.c:113:gd_collate_errors] 0-glusterd: Commit failed on
server-ip-2. Please check log file for details.
==> cli.log <==
[2016-06-26 16:48:51.208828] I [cli.c:721:main] 0-cli: Started running
gluster with version 3.7.11
[2016-06-26 16:48:51.213391] I
[cli-cmd-volume.c:1795:cli_check_gsync_present] 0-: geo-replication not
installed
[2016-06-26 16:48:51.213674] I [MSGID: 101190]
[event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread
with index 1
[2016-06-26 16:48:51.213733] I [socket.c:2356:socket_event_handler]
0-transport: disconnecting now
[2016-06-26 16:48:51.219674] I [cli-rpc-ops.c:8417:gf_cli_heal_volume_cbk]
0-cli: Received resp to heal volume
[2016-06-26 16:48:51.219768] I [input.c:36:cli_batch] 0-: Exiting with: -1
==> cmd_history.log <==
[2016-06-26 16:48:51.219596] : volume heal volname full : FAILED : Commit
failed on server-ip-2. Please check log file for details.
==> etc-glusterfs-glusterd.vol.log <==
[2016-06-26 16:48:51.214185] I [MSGID: 106533]
[glusterd-volume-ops.c:857:__glusterd_handle_cli_heal_volume] 0-management:
Received heal vol req for volume volname
[2016-06-26 16:48:51.219031] E [MSGID: 106153]
[glusterd-syncop.c:113:gd_collate_errors] 0-glusterd: Commit failed on
server-ip-2. Please check log file for details.
==== server-ip-2 ====
==> etc-glusterfs-glusterd.vol.log <==
[2016-06-26 16:48:30.087365] I [MSGID: 106533]
[glusterd-volume-ops.c:857:__glusterd_handle_cli_heal_volume] 0-management:
Received heal vol req for volume volname
[2016-06-26 16:48:30.092829] E [MSGID: 106153]
[glusterd-syncop.c:113:gd_collate_errors] 0-glusterd: Commit failed on
server-ip-2. Please check log file for details.
==> cli.log <==
[2016-06-26 16:49:30.099446] I [cli.c:721:main] 0-cli: Started running
gluster with version 3.7.11
[2016-06-26 16:49:30.104599] I
[cli-cmd-volume.c:1795:cli_check_gsync_present] 0-: geo-replication not
installed
[2016-06-26 16:49:30.104853] I [MSGID: 101190]
[event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread
with index 1
[2016-06-26 16:49:30.104896] I [socket.c:2356:socket_event_handler]
0-transport: disconnecting now
[2016-06-26 16:49:30.177924] I [input.c:36:cli_batch] 0-: Exiting with: 0
==== server-ip-3 ====
==> cli.log <==
[2016-06-26 16:48:49.177859] I [cli.c:721:main] 0-cli: Started running
gluster with version 3.7.11
[2016-06-26 16:48:49.182887] I
[cli-cmd-volume.c:1795:cli_check_gsync_present] 0-: geo-replication not
installed
[2016-06-26 16:48:49.183146] I [MSGID: 101190]
[event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread
with index 1
[2016-06-26 16:48:49.183188] I [socket.c:2356:socket_event_handler]
0-transport: disconnecting now
[2016-06-26 16:48:49.189005] I [cli-rpc-ops.c:8417:gf_cli_heal_volume_cbk]
0-cli: Received resp to heal volume
[2016-06-26 16:48:49.189058] I [input.c:36:cli_batch] 0-: Exiting with: -1
==== All servers ====
==> glfsheal-volname.log <==
[2016-06-26 16:51:47.493809] I [MSGID: 104045] [glfs-master.c:95:notify]
0-gfapi: New graph 6766732d-332d-6272-6963-6b2d656d7074 (0) coming up
[2016-06-26 16:51:47.493841] I [MSGID: 114020] [client.c:2106:notify]
0-volname-client-0: parent translators are ready, attempting connect on
transport
[2016-06-26 16:51:47.496465] I [MSGID: 114020] [client.c:2106:notify]
0-volname-client-1: parent translators are ready, attempting connect on
transport
[2016-06-26 16:51:47.496729] I [rpc-clnt.c:1868:rpc_clnt_reconfig]
0-volname-client-0: changing port to 49152 (from 0)
[2016-06-26 16:51:47.498945] I [MSGID: 114020] [client.c:2106:notify]
0-volname-client-2: parent translators are ready, attempting connect on
transport
[2016-06-26 16:51:47.501600] I [MSGID: 114057]
[client-handshake.c:1437:select_server_supported_programs]
0-volname-client-0: Using Program GlusterFS 3.3, Num (1298437), Version
(330)
[2016-06-26 16:51:47.502008] I [MSGID: 114046]
[client-handshake.c:1213:client_setvolume_cbk] 0-volname-client-0:
Connected to volname-client-0, attached to remote volume
'/usr/local/volname/local-data/mirrored-data'.
[2016-06-26 16:51:47.502031] I [MSGID: 114047]
[client-handshake.c:1224:client_setvolume_cbk] 0-volname-client-0: Server
and Client lk-version numbers are not same, reopening the fds
[2016-06-26 16:51:47.502101] I [MSGID: 108005]
[afr-common.c:4007:afr_notify] 0-volname-replicate-0: Subvolume
'volname-client-0' came back up; going online.
[2016-06-26 16:51:47.502146] I [rpc-clnt.c:1868:rpc_clnt_reconfig]
0-volname-client-1: changing port to 49152 (from 0)
[2016-06-26 16:51:47.502201] I [MSGID: 114035]
[client-handshake.c:193:client_set_lk_version_cbk] 0-volname-client-0:
Server lk version = 1
[2016-06-26 16:51:47.504858] I [rpc-clnt.c:1868:rpc_clnt_reconfig]
0-volname-client-2: changing port to 49152 (from 0)
[2016-06-26 16:51:47.507128] I [MSGID: 114057]
[client-handshake.c:1437:select_server_supported_programs]
0-volname-client-1: Using Program GlusterFS 3.3, Num (1298437), Version
(330)
[2016-06-26 16:51:47.507674] I [MSGID: 114046]
[client-handshake.c:1213:client_setvolume_cbk] 0-volname-client-1:
Connected to volname-client-1, attached to remote volume
'/usr/local/volname/local-data/mirrored-data'.
[2016-06-26 16:51:47.507692] I [MSGID: 114047]
[client-handshake.c:1224:client_setvolume_cbk] 0-volname-client-1: Server
and Client lk-version numbers are not same, reopening the fds
[2016-06-26 16:51:47.507994] I [MSGID: 114035]
[client-handshake.c:193:client_set_lk_version_cbk] 0-volname-client-1:
Server lk version = 1
[2016-06-26 16:51:47.509373] I [MSGID: 114057]
[client-handshake.c:1437:select_server_supported_programs]
0-volname-client-2: Using Program GlusterFS 3.3, Num (1298437), Version
(330)
[2016-06-26 16:51:47.509815] I [MSGID: 114046]
[client-handshake.c:1213:client_setvolume_cbk] 0-volname-client-2:
Connected to volname-client-2, attached to remote volume
'/usr/local/volname/local-data/mirrored-data'.
[2016-06-26 16:51:47.509825] I [MSGID: 114047]
[client-handshake.c:1224:client_setvolume_cbk] 0-volname-client-2: Server
and Client lk-version numbers are not same, reopening the fds
[2016-06-26 16:51:47.526799] I [MSGID: 114035]
[client-handshake.c:193:client_set_lk_version_cbk] 0-volname-client-2:
Server lk version = 1
[2016-06-26 16:51:47.528396] I [MSGID: 104041]
[glfs-resolve.c:869:__glfs_active_subvol] 0-volname: switched to graph
6766732d-332d-6272-6963-6b2d656d7074 (0)
==== Possibly Related? ====
==== server-ip-1 ====
==> Some logs of note in etc-glusterfs-glusterd.vol.log <==
[2016-06-26 11:35:13.583338] W [glusterd-locks.c:577:glusterd_mgmt_v3_lock]
(-->/usr/lib64/glusterfs/3.7.11/xlator/mgmt/glusterd.so(glusterd_op_begin_synctask+0x30)
[0x7fad0f286520]
-->/usr/lib64/glusterfs/3.7.11/xlator/mgmt/glusterd.so(gd_sync_task_begin+0x980)
[0x7fad0f286450]
-->/usr/lib64/glusterfs/3.7.11/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_lock+0x5f7)
[0x7fad0f28ac47] ) 0-management: Lock for volname held by
e24bbfa8-5e0f-4ff2-846d-743bf47d2022
[2016-06-26 11:35:13.583355] E [MSGID: 106119]
[glusterd-syncop.c:1830:gd_sync_task_begin] 0-management: Unable to acquire
lock for volname
[2016-06-26 11:35:13.601006] I [MSGID: 106494]
[glusterd-handler.c:3070:__glusterd_handle_cli_profile_volume]
0-management: Received volume profile req for volume volname
[2016-06-26 11:35:13.868117] W [glusterd-locks.c:577:glusterd_mgmt_v3_lock]
(-->/usr/lib64/glusterfs/3.7.11/xlator/mgmt/glusterd.so(glusterd_op_sm+0x29f)
[0x7fad0f1fb86f]
-->/usr/lib64/glusterfs/3.7.11/xlator/mgmt/glusterd.so(+0x5c675)
[0x7fad0f1ed675]
-->/usr/lib64/glusterfs/3.7.11/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_lock+0x5f7)
[0x7fad0f28ac47] ) 0-management: Lock for volname held by
9c486029-ec26-4222-b302-b609216fd68f
[2016-06-26 11:35:13.868135] E [MSGID: 106119]
[glusterd-op-sm.c:3711:glusterd_op_ac_lock] 0-management: Unable to acquire
lock for volname
[2016-06-26 11:35:13.868175] E [MSGID: 106376]
[glusterd-op-sm.c:7591:glusterd_op_sm] 0-management: handler returned: -1
[2016-06-26 11:35:13.869150] E [MSGID: 106116]
[glusterd-mgmt.c:135:gd_mgmt_v3_collate_errors] 0-management: Locking
failed on server-ip-2. Please check log file for details.
[2016-06-26 11:35:13.869515] W
[glusterd-locks.c:692:glusterd_mgmt_v3_unlock]
(-->/usr/lib64/glusterfs/3.7.11/xlator/mgmt/glusterd.so(glusterd_op_sm+0x29f)
[0x7fad0f1fb86f]
-->/usr/lib64/glusterfs/3.7.11/xlator/mgmt/glusterd.so(+0x5c362)
[0x7fad0f1ed362]
-->/usr/lib64/glusterfs/3.7.11/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_unlock+0x37a)
[0x7fad0f28afea] ) 0-management: Lock owner mismatch. Lock for vol volname
held by 9c486029-ec26-4222-b302-b609216fd68f
[2016-06-26 11:35:13.869533] E [MSGID: 106118]
[glusterd-op-sm.c:3774:glusterd_op_ac_unlock] 0-management: Unable to
release lock for volname
[2016-06-26 11:35:13.869573] E [MSGID: 106376]
[glusterd-op-sm.c:7591:glusterd_op_sm] 0-management: handler returned: 1
[2016-06-26 11:35:13.869590] E [MSGID: 106151]
[glusterd-syncop.c:1868:gd_sync_task_begin] 0-management: Locking Peers
Failed.
[2016-06-26 11:35:13.872140] E [MSGID: 106116]
[glusterd-mgmt.c:135:gd_mgmt_v3_collate_errors] 0-management: Unlocking
failed on server-ip-2. Please check log file for details.
[2016-06-26 11:35:13.872187] E [MSGID: 106152]
[glusterd-syncop.c:1569:gd_unlock_op_phase] 0-management: Failed to unlock
on some peer(s)
[2016-06-26 11:35:13.959124] W [glusterd-locks.c:577:glusterd_mgmt_v3_lock]
(-->/usr/lib64/glusterfs/3.7.11/xlator/mgmt/glusterd.so(glusterd_op_begin_synctask+0x30)
[0x7fad0f286520]
-->/usr/lib64/glusterfs/3.7.11/xlator/mgmt/glusterd.so(gd_sync_task_begin+0x980)
[0x7fad0f286450]
-->/usr/lib64/glusterfs/3.7.11/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_lock+0x5f7)
[0x7fad0f28ac47] ) 0-management: Lock for volname held by
e24bbfa8-5e0f-4ff2-846d-743bf47d2022
The message "I [MSGID: 106494]
[glusterd-handler.c:3070:__glusterd_handle_cli_profile_volume]
0-management: Received volume profile req for volume volname" repeated 7
times between [2016-06-26 11:35:13.601006] and [2016-06-26 11:35:14.822349]
[2016-06-26 11:35:40.905084] I [MSGID: 106533]
[glusterd-volume-ops.c:857:__glusterd_handle_cli_heal_volume] 0-management:
Received heal vol req for volume volname
[2016-06-26 11:35:40.909790] E [MSGID: 106153]
[glusterd-syncop.c:113:gd_collate_errors] 0-glusterd: Commit failed on
server-ip-2. Please check log file for details.
The message "I [MSGID: 106488]
[glusterd-handler.c:1533:__glusterd_handle_cli_get_volume] 0-glusterd:
Received get vol req" repeated 2 times between [2016-06-26 11:35:13.538007]
and [2016-06-26 11:35:13.735654]
[2016-06-26 11:35:13.959143] E [MSGID: 106119]
[glusterd-syncop.c:1830:gd_sync_task_begin] 0-management: Unable to acquire
lock for volname
The message "I [MSGID: 106499]
[glusterd-handler.c:4331:__glusterd_handle_status_volume] 0-management:
Received status volume req for volume volname" repeated 8 times between
[2016-06-26 11:35:13.583212] and [2016-06-26 11:35:14.358853]
On Sat, Jun 25, 2016 at 11:17 AM, Joe Julian <joe at julianfamily.org> wrote:
> Notice it actually tells you to look in the logs on server-ip-2, but you
> did not include any logs from that server.
>
> On June 21, 2016 10:22:14 AM PDT, Danny Lee <dannyl at vt.edu> wrote:
>
>> Hello,
>>
>> We are currently scripting the addition of GlusterFS to our system to make
>> it highly available. We are using Gluster 3.7.11.
>>
>> Problem:
>> We are trying to migrate from a non-clustered system to a 3-node
>> replicated GlusterFS cluster using scripts. We have tried various
>> approaches, but sometimes we end up in an undesirable state where calling
>> "gluster volume heal <volname> full" returns the error message, "Launching
>> heal operation to perform full self heal on volume <volname> has been
>> unsuccessful on bricks that are down. Please check if all brick processes
>> are running." However, all the brick processes are running according to
>> "gluster volume status <volname>".
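>>
>> For completeness, this is roughly the check we run when the heal fails (a
>> sketch; <volname> is a placeholder):
>>
>>   gluster volume status <volname>     # every brick shows Online = Y with a PID and port
>>   gluster volume heal <volname> info  # lists any entries pending heal
>>   gluster volume heal <volname> full  # still fails with the message above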
>>
>> Things we have tried:
>> Order of preference
>> 1. Create Volume with 3 Filesystems with the same data
>> 2. Create Volume with 2 empty filesystems and one with the data
>> 3. Create Volume with only one filesystem with data and then use the
>> "add-brick" command to add the other two empty filesystems (see the
>> add-brick sketch after this list)
>> 4. Create Volume with one empty filesystem, mounting it, and then copying
>> the data over to that one. And then finally, using "add-brick" command to
>> add the other two empty filesystems
>> 5. Create Volume with 3 empty filesystems, mounting it, and then copying
>> the data over
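>>
>> For options 3 and 4, the add-brick step looks roughly like this (a sketch;
>> <brick-path> is a placeholder for the brick directory, and "replica 3" is
>> needed because the volume grows from one brick to three):
>>
>>   gluster volume add-brick <volname> replica 3 \
>>       <server-ip-2>:<brick-path> <server-ip-3>:<brick-path>
>>   gluster volume heal <volname> full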
>>
>> Other things to note:
>> A few minutes after the volume is created and started successfully, our
>> application server starts up against it, so reads and writes may happen
>> pretty quickly after the volume has started. But there is only about 50MB
>> of data.
>>
>> Steps to reproduce (all in a script):
>> # This is run on the primary node with the IP address <server-ip-1>, which
>> has the data
>> systemctl restart glusterd
>> gluster peer probe <server-ip-2>
>> gluster peer probe <server-ip-3>
>> Wait until "gluster peer status" shows all peers in the "Peer in Cluster"
>> state (see the polling sketch below)
>> gluster volume create <volname> replica 3 transport tcp ${BRICKS[0]}
>> ${BRICKS[1]} ${BRICKS[2]} force
>> gluster volume set <volname> nfs.disable true
>> gluster volume start <volname>
>> mkdir -p $MOUNT_POINT
>> mount -t glusterfs <server-ip-1>:/volname $MOUNT_POINT
>> find $MOUNT_POINT | xargs stat
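>>
>> The "wait" step above is a polling loop along these lines (a sketch; the
>> 60-second timeout is arbitrary):
>>
>>   # wait until both probed peers report "Peer in Cluster (Connected)"
>>   for i in $(seq 1 60); do
>>     [ "$(gluster peer status | grep -c 'Peer in Cluster (Connected)')" -eq 2 ] && break
>>     sleep 1
>>   done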
>>
>> Note that, when we added sleeps around the gluster commands, there was a
>> higher probability of success, but not 100%.
>>
>> # Once the volume is started, all the clients/servers mount the gluster
>> filesystem, polling "mountpoint -q $MOUNT_POINT" to confirm the mount:
>> mkdir -p $MOUNT_POINT
>> mount -t glusterfs <server-ip-1>:/volname $MOUNT_POINT
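>>
>> In script form, the polling is roughly (a sketch; we simply retry until the
>> mount point reports as mounted):
>>
>>   until mountpoint -q $MOUNT_POINT; do
>>     mount -t glusterfs <server-ip-1>:/volname $MOUNT_POINT
>>     sleep 1
>>   done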
>>
>> Logs:
>> *etc-glusterfs-glusterd.vol.log* in *server-ip-1*
>>
>> [2016-06-21 14:10:38.285234] I [MSGID: 106533]
>> [glusterd-volume-ops.c:857:__glusterd_handle_cli_heal_volume] 0-management:
>> Received heal vol req for volume volname
>> [2016-06-21 14:10:38.296801] E [MSGID: 106153]
>> [glusterd-syncop.c:113:gd_collate_errors] 0-glusterd: Commit failed on
>> <server-ip-2>. Please check log file for details.
>>
>>
>> *usr-local-volname-data-mirrored-data.log* in *server-ip-1*
>>
>> [2016-06-21 14:14:39.233366] E [MSGID: 114058]
>> [client-handshake.c:1524:client_query_portmap_cbk] 0-volname-client-0:
>> failed to get the port number for remote subvolume. Please run 'gluster
>> volume status' on server to see if brick process is running.
>> *I think this is caused by the self heal daemon*
>>
>> *cmd_history.log* in *server-ip-1*
>>
>> [2016-06-21 14:10:38.298800] : volume heal volname full : FAILED :
>> Commit failed on <server-ip-2>. Please check log file for details.
>>
> --
> Sent from my Android device with K-9 Mail. Please excuse my brevity.
>