<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
@Jim Kusznir<br>
<br>
For the heal issue, can you provide the getfattr output of one of
the 8 files in question from all 3 bricks?<br>
Example: `getfattr -d -m . -e hex
/gluster/brick3/data-hdd/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9`<br>
Also provide the stat output of the same file from all 3 bricks.<br>
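Example, using the same file path as in the getfattr example above: `stat
/gluster/brick3/data-hdd/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9`<br>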
<br>
Thanks,<br>
Ravi<br>
<p><br>
</p>
<br>
<div class="moz-cite-prefix">On 05/30/2018 09:47 AM, Krutika
Dhananjay wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAPhYV8M6AGAMZCmSPi9m2ktffha3UeP0dpnuN1mBUsUjzNNj-w@mail.gmail.com">
<div dir="ltr">
<div>Adding Ravi to look into the heal issue.</div>
<div><br>
</div>
<div>As for the fsync hang and subsequent IO errors, it seems a
lot like <a
href="https://bugzilla.redhat.com/show_bug.cgi?id=1497156"
moz-do-not-send="true">https://bugzilla.redhat.com/show_bug.cgi?id=1497156</a>
and Paolo Bonzini from qemu had pointed out that this would be
fixed by the following commit:</div>
<div><br>
</div>
<div>
<pre class="gmail-bz_comment_text gmail-bz_wrap_comment_text" id="gmail-comment_text_20"> commit e72c9a2a67a6400c8ef3d01d4c461dbbbfa0e1f0
Author: Paolo Bonzini <<a href="mailto:pbonzini@redhat.com" moz-do-not-send="true">pbonzini@redhat.com</a>>
Date: Wed Jun 21 16:35:46 2017 +0200
scsi: virtio_scsi: let host do exception handling
virtio_scsi tries to do exception handling after the default 30 seconds
timeout expires. However, it's better to let the host control the
timeout, otherwise with a heavy I/O load it is likely that an abort will
also timeout. This leads to fatal errors like filesystems going
offline.
Disable the 'sd' timeout and allow the host to do exception handling,
following the precedent of the storvsc driver.
Hannes has a proposal to introduce timeouts in virtio, but this provides
an immediate solution for stable kernels too.
[mkp: fixed typo]
Reported-by: Douglas Miller <<a href="mailto:dougmill@linux.vnet.ibm.com" moz-do-not-send="true">dougmill@linux.vnet.ibm.com</a>>
Cc: "James E.J. Bottomley" <<a href="mailto:jejb@linux.vnet.ibm.com" moz-do-not-send="true">jejb@linux.vnet.ibm.com</a>>
Cc: "Martin K. Petersen" <<a href="mailto:martin.petersen@oracle.com" moz-do-not-send="true">martin.petersen@oracle.com</a>>
Cc: Hannes Reinecke <<a href="mailto:hare@suse.de" moz-do-not-send="true">hare@suse.de</a>>
Cc: <a href="mailto:linux-scsi@vger.kernel.org" moz-do-not-send="true">linux-scsi@vger.kernel.org</a>
Cc: <a href="mailto:stable@vger.kernel.org" moz-do-not-send="true">stable@vger.kernel.org</a>
Signed-off-by: Paolo Bonzini <<a href="mailto:pbonzini@redhat.com" moz-do-not-send="true">pbonzini@redhat.com</a>>
Signed-off-by: Martin K. Petersen <<a href="mailto:martin.petersen@oracle.com" moz-do-not-send="true">martin.petersen@oracle.com</a>></pre>
</div>
<div><br>
</div>
<div>Adding Paolo/Kevin to comment.</div>
<div><br>
</div>
<div>As for the poor gluster performance, could you disable
cluster.eager-lock and see if that makes any difference:</div>
<div><br>
</div>
<div># gluster volume set <VOL> cluster.eager-lock off</div>
<div><br>
</div>
<div>Do also capture the volume profile again if you still see
performance issues after disabling eager-lock.<br>
</div>
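<div><br>
</div>
<div>For reference, one way to capture it is the standard gluster CLI
sequence (substitute your volume name, and let the workload run for a
few minutes between start and info):</div>
<div><br>
</div>
<div># gluster volume profile <VOL> start</div>
<div># gluster volume profile <VOL> info</div>
<div># gluster volume profile <VOL> stop</div>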
<div><br>
</div>
<div>-Krutika</div>
<div><br>
</div>
</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">On Wed, May 30, 2018 at 6:55 AM, Jim
Kusznir <span dir="ltr"><<a
href="mailto:jim@palousetech.com" target="_blank"
moz-do-not-send="true">jim@palousetech.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">I also finally found the following in my
system log on one server:
<div><br>
</div>
<div>
<div>[10679.524491] INFO: task glusterclogro:14933
blocked for more than 120 seconds.</div>
<div>[10679.525826] "echo 0 >
/proc/sys/kernel/hung_task_<wbr>timeout_secs" disables
this message.</div>
<div>[10679.527144] glusterclogro D ffff97209832bf40
0 14933 1 0x00000080</div>
<div>[10679.527150] Call Trace:</div>
<div>[10679.527161] [<ffffffffb9913f79>]
schedule+0x29/0x70</div>
<div>[10679.527218] [<ffffffffc060e388>]
_xfs_log_force_lsn+0x2e8/0x340 [xfs]</div>
<div>[10679.527225] [<ffffffffb92cf1b0>] ?
wake_up_state+0x20/0x20</div>
<div>[10679.527254] [<ffffffffc05eeb97>]
xfs_file_fsync+0x107/0x1e0 [xfs]</div>
<div>[10679.527260] [<ffffffffb944f0e7>]
do_fsync+0x67/0xb0</div>
<div>[10679.527268] [<ffffffffb992076f>] ?
system_call_after_swapgs+0xbc/<wbr>0x160</div>
<div>[10679.527271] [<ffffffffb944f3d0>]
SyS_fsync+0x10/0x20</div>
<div>[10679.527275] [<ffffffffb992082f>]
system_call_fastpath+0x1c/0x21</div>
<div>[10679.527279] [<ffffffffb992077b>] ?
system_call_after_swapgs+0xc8/<wbr>0x160</div>
<div>[10679.527283] INFO: task glusterposixfsy:14941
blocked for more than 120 seconds.</div>
<div>[10679.528608] "echo 0 >
/proc/sys/kernel/hung_task_<wbr>timeout_secs" disables
this message.</div>
<div>[10679.529956] glusterposixfsy D ffff972495f84f10
0 14941 1 0x00000080</div>
<div>[10679.529961] Call Trace:</div>
<div>[10679.529966] [<ffffffffb9913f79>]
schedule+0x29/0x70</div>
<div>[10679.530003] [<ffffffffc060e388>]
_xfs_log_force_lsn+0x2e8/0x340 [xfs]</div>
<div>[10679.530008] [<ffffffffb92cf1b0>] ?
wake_up_state+0x20/0x20</div>
<div>[10679.530038] [<ffffffffc05eeb97>]
xfs_file_fsync+0x107/0x1e0 [xfs]</div>
<div>[10679.530042] [<ffffffffb944f0e7>]
do_fsync+0x67/0xb0</div>
<div>[10679.530046] [<ffffffffb992076f>] ?
system_call_after_swapgs+0xbc/<wbr>0x160</div>
<div>[10679.530050] [<ffffffffb944f3f3>]
SyS_fdatasync+0x13/0x20</div>
<div>[10679.530054] [<ffffffffb992082f>]
system_call_fastpath+0x1c/0x21</div>
<div>[10679.530058] [<ffffffffb992077b>] ?
system_call_after_swapgs+0xc8/<wbr>0x160</div>
<div>[10679.530062] INFO: task glusteriotwr13:15486
blocked for more than 120 seconds.</div>
<div>[10679.531805] "echo 0 >
/proc/sys/kernel/hung_task_<wbr>timeout_secs" disables
this message.</div>
<div>[10679.533732] glusteriotwr13 D ffff9720a83f0000
0 15486 1 0x00000080</div>
<div>[10679.533738] Call Trace:</div>
<div>[10679.533747] [<ffffffffb9913f79>]
schedule+0x29/0x70</div>
<div>[10679.533799] [<ffffffffc060e388>]
_xfs_log_force_lsn+0x2e8/0x340 [xfs]</div>
<div>[10679.533806] [<ffffffffb92cf1b0>] ?
wake_up_state+0x20/0x20</div>
<div>[10679.533846] [<ffffffffc05eeb97>]
xfs_file_fsync+0x107/0x1e0 [xfs]</div>
<div>[10679.533852] [<ffffffffb944f0e7>]
do_fsync+0x67/0xb0</div>
<div>[10679.533858] [<ffffffffb992076f>] ?
system_call_after_swapgs+0xbc/<wbr>0x160</div>
<div>[10679.533863] [<ffffffffb944f3f3>]
SyS_fdatasync+0x13/0x20</div>
<div>[10679.533868] [<ffffffffb992082f>]
system_call_fastpath+0x1c/0x21</div>
<div>[10679.533873] [<ffffffffb992077b>] ?
system_call_after_swapgs+0xc8/<wbr>0x160</div>
<div>[10919.512757] INFO: task glusterclogro:14933
blocked for more than 120 seconds.</div>
<div>[10919.514714] "echo 0 >
/proc/sys/kernel/hung_task_<wbr>timeout_secs" disables
this message.</div>
<div>[10919.516663] glusterclogro D ffff97209832bf40
0 14933 1 0x00000080</div>
<div>[10919.516677] Call Trace:</div>
<div>[10919.516690] [<ffffffffb9913f79>]
schedule+0x29/0x70</div>
<div>[10919.516696] [<ffffffffb99118e9>]
schedule_timeout+0x239/0x2c0</div>
<div>[10919.516703] [<ffffffffb951cc04>] ?
blk_finish_plug+0x14/0x40</div>
<div>[10919.516768] [<ffffffffc05e9224>] ?
_xfs_buf_ioapply+0x334/0x460 [xfs]</div>
<div>[10919.516774] [<ffffffffb991432d>]
wait_for_completion+0xfd/0x140</div>
<div>[10919.516782] [<ffffffffb92cf1b0>] ?
wake_up_state+0x20/0x20</div>
<div>[10919.516821] [<ffffffffc05eb0a3>] ?
_xfs_buf_read+0x23/0x40 [xfs]</div>
<div>[10919.516859] [<ffffffffc05eafa9>]
xfs_buf_submit_wait+0xf9/0x1d0 [xfs]</div>
<div>[10919.516902] [<ffffffffc061b279>] ?
xfs_trans_read_buf_map+0x199/<wbr>0x400 [xfs]</div>
<div>[10919.516940] [<ffffffffc05eb0a3>]
_xfs_buf_read+0x23/0x40 [xfs]</div>
<div>[10919.516977] [<ffffffffc05eb1b9>]
xfs_buf_read_map+0xf9/0x160 [xfs]</div>
<div>[10919.517022] [<ffffffffc061b279>]
xfs_trans_read_buf_map+0x199/<wbr>0x400 [xfs]</div>
<div>[10919.517057] [<ffffffffc05c8d04>]
xfs_da_read_buf+0xd4/0x100 [xfs]</div>
<div>[10919.517091] [<ffffffffc05c8d53>]
xfs_da3_node_read+0x23/0xd0 [xfs]</div>
<div>[10919.517126] [<ffffffffc05c9fee>]
xfs_da3_node_lookup_int+0x6e/<wbr>0x2f0 [xfs]</div>
<div>[10919.517160] [<ffffffffc05d5a1d>]
xfs_dir2_node_lookup+0x4d/<wbr>0x170 [xfs]</div>
<div>[10919.517194] [<ffffffffc05ccf5d>]
xfs_dir_lookup+0x1bd/0x1e0 [xfs]</div>
<div>[10919.517233] [<ffffffffc05fd8d9>]
xfs_lookup+0x69/0x140 [xfs]</div>
<div>[10919.517271] [<ffffffffc05fa018>]
xfs_vn_lookup+0x78/0xc0 [xfs]</div>
<div>[10919.517278] [<ffffffffb9425cf3>]
lookup_real+0x23/0x60</div>
<div>[10919.517283] [<ffffffffb9426702>]
__lookup_hash+0x42/0x60</div>
<div>[10919.517288] [<ffffffffb942d519>]
SYSC_renameat2+0x3a9/0x5a0</div>
<div>[10919.517296] [<ffffffffb94d3753>] ?
selinux_file_free_security+<wbr>0x23/0x30</div>
<div>[10919.517304] [<ffffffffb992077b>] ?
system_call_after_swapgs+0xc8/<wbr>0x160</div>
<div>[10919.517309] [<ffffffffb992076f>] ?
system_call_after_swapgs+0xbc/<wbr>0x160</div>
<div>[10919.517313] [<ffffffffb992077b>] ?
system_call_after_swapgs+0xc8/<wbr>0x160</div>
<div>[10919.517318] [<ffffffffb992076f>] ?
system_call_after_swapgs+0xbc/<wbr>0x160</div>
<div>[10919.517323] [<ffffffffb942e58e>]
SyS_renameat2+0xe/0x10</div>
<div>[10919.517328] [<ffffffffb942e5ce>]
SyS_rename+0x1e/0x20</div>
<div>[10919.517333] [<ffffffffb992082f>]
system_call_fastpath+0x1c/0x21</div>
<div>[10919.517339] [<ffffffffb992077b>] ?
system_call_after_swapgs+0xc8/<wbr>0x160</div>
<div>[11159.496095] INFO: task glusteriotwr9:15482
blocked for more than 120 seconds.</div>
<div>[11159.497546] "echo 0 >
/proc/sys/kernel/hung_task_<wbr>timeout_secs" disables
this message.</div>
<div>[11159.498978] glusteriotwr9 D ffff971fa0fa1fa0
0 15482 1 0x00000080</div>
<div>[11159.498984] Call Trace:</div>
<div>[11159.498995] [<ffffffffb9911f00>] ?
bit_wait+0x50/0x50</div>
<div>[11159.498999] [<ffffffffb9913f79>]
schedule+0x29/0x70</div>
<div>[11159.499003] [<ffffffffb99118e9>]
schedule_timeout+0x239/0x2c0</div>
<div>[11159.499056] [<ffffffffc05dd9b7>] ?
xfs_iext_bno_to_ext+0xa7/0x1a0 [xfs]</div>
<div>[11159.499082] [<ffffffffc05dd43e>] ?
xfs_iext_bno_to_irec+0x8e/0xd0 [xfs]</div>
<div>[11159.499090] [<ffffffffb92f7a12>] ?
ktime_get_ts64+0x52/0xf0</div>
<div>[11159.499093] [<ffffffffb9911f00>] ?
bit_wait+0x50/0x50</div>
<div>[11159.499097] [<ffffffffb991348d>]
io_schedule_timeout+0xad/0x130</div>
<div>[11159.499101] [<ffffffffb9913528>]
io_schedule+0x18/0x20</div>
<div>[11159.499104] [<ffffffffb9911f11>]
bit_wait_io+0x11/0x50</div>
<div>[11159.499107] [<ffffffffb9911ac1>]
__wait_on_bit_lock+0x61/0xc0</div>
<div>[11159.499113] [<ffffffffb9393634>]
__lock_page+0x74/0x90</div>
<div>[11159.499118] [<ffffffffb92bc210>] ?
wake_bit_function+0x40/0x40</div>
<div>[11159.499121] [<ffffffffb9394154>]
__find_lock_page+0x54/0x70</div>
<div>[11159.499125] [<ffffffffb9394e85>]
grab_cache_page_write_begin+<wbr>0x55/0xc0</div>
<div>[11159.499130] [<ffffffffb9484b76>]
iomap_write_begin+0x66/0x100</div>
<div>[11159.499135] [<ffffffffb9484edf>]
iomap_write_actor+0xcf/0x1d0</div>
<div>[11159.499140] [<ffffffffb9484e10>] ?
iomap_write_end+0x80/0x80</div>
<div>[11159.499144] [<ffffffffb94854e7>]
iomap_apply+0xb7/0x150</div>
<div>[11159.499149] [<ffffffffb9485621>]
iomap_file_buffered_write+<wbr>0xa1/0xe0</div>
<div>[11159.499153] [<ffffffffb9484e10>] ?
iomap_write_end+0x80/0x80</div>
<div>[11159.499182] [<ffffffffc05f025d>]
xfs_file_buffered_aio_write+<wbr>0x12d/0x2c0 [xfs]</div>
<div>[11159.499213] [<ffffffffc05f057d>]
xfs_file_aio_write+0x18d/0x1b0 [xfs]</div>
<div>[11159.499217] [<ffffffffb941a533>]
do_sync_write+0x93/0xe0</div>
<div>[11159.499222] [<ffffffffb941b010>]
vfs_write+0xc0/0x1f0</div>
<div>[11159.499225] [<ffffffffb941c002>]
SyS_pwrite64+0x92/0xc0</div>
<div>[11159.499230] [<ffffffffb992076f>] ?
system_call_after_swapgs+0xbc/<wbr>0x160</div>
<div>[11159.499234] [<ffffffffb992082f>]
system_call_fastpath+0x1c/0x21</div>
<div>[11159.499238] [<ffffffffb992077b>] ?
system_call_after_swapgs+0xc8/<wbr>0x160</div>
<div>[11279.488720] INFO: task xfsaild/dm-10:1134
blocked for more than 120 seconds.</div>
<div>[11279.490197] "echo 0 >
/proc/sys/kernel/hung_task_<wbr>timeout_secs" disables
this message.</div>
<div>[11279.491665] xfsaild/dm-10 D ffff9720a8660fd0
0 1134 2 0x00000000</div>
<div>[11279.491671] Call Trace:</div>
<div>[11279.491682] [<ffffffffb92a3a2e>] ?
try_to_del_timer_sync+0x5e/<wbr>0x90</div>
<div>[11279.491688] [<ffffffffb9913f79>]
schedule+0x29/0x70</div>
<div>[11279.491744] [<ffffffffc060de36>]
_xfs_log_force+0x1c6/0x2c0 [xfs]</div>
<div>[11279.491750] [<ffffffffb92cf1b0>] ?
wake_up_state+0x20/0x20</div>
<div>[11279.491783] [<ffffffffc0619fec>] ?
xfsaild+0x16c/0x6f0 [xfs]</div>
<div>[11279.491817] [<ffffffffc060df5c>]
xfs_log_force+0x2c/0x70 [xfs]</div>
<div>[11279.491849] [<ffffffffc0619e80>] ?
xfs_trans_ail_cursor_first+<wbr>0x90/0x90 [xfs]</div>
<div>[11279.491880] [<ffffffffc0619fec>]
xfsaild+0x16c/0x6f0 [xfs]</div>
<div>[11279.491913] [<ffffffffc0619e80>] ?
xfs_trans_ail_cursor_first+<wbr>0x90/0x90 [xfs]</div>
<div>[11279.491919] [<ffffffffb92bb161>]
kthread+0xd1/0xe0</div>
<div>[11279.491926] [<ffffffffb92bb090>] ?
insert_kthread_work+0x40/0x40</div>
<div>[11279.491932] [<ffffffffb9920677>]
ret_from_fork_nospec_begin+<wbr>0x21/0x21</div>
<div>[11279.491936] [<ffffffffb92bb090>] ?
insert_kthread_work+0x40/0x40</div>
<div>[11279.491976] INFO: task glusterclogfsyn:14934
blocked for more than 120 seconds.</div>
<div>[11279.493466] "echo 0 >
/proc/sys/kernel/hung_task_<wbr>timeout_secs" disables
this message.</div>
<div>[11279.494952] glusterclogfsyn D ffff97209832af70
0 14934 1 0x00000080</div>
<div>[11279.494957] Call Trace:</div>
<div>[11279.494979] [<ffffffffc0309839>] ?
__split_and_process_bio+0x2e9/<wbr>0x520 [dm_mod]</div>
<div>[11279.494983] [<ffffffffb9913f79>]
schedule+0x29/0x70</div>
<div>[11279.494987] [<ffffffffb99118e9>]
schedule_timeout+0x239/0x2c0</div>
<div>[11279.494997] [<ffffffffc0309d98>] ?
dm_make_request+0x128/0x1a0 [dm_mod]</div>
<div>[11279.495001] [<ffffffffb991348d>]
io_schedule_timeout+0xad/0x130</div>
<div>[11279.495005] [<ffffffffb99145ad>]
wait_for_completion_io+0xfd/<wbr>0x140</div>
<div>[11279.495010] [<ffffffffb92cf1b0>] ?
wake_up_state+0x20/0x20</div>
<div>[11279.495016] [<ffffffffb951e574>]
blkdev_issue_flush+0xb4/0x110</div>
<div>[11279.495049] [<ffffffffc06064b9>]
xfs_blkdev_issue_flush+0x19/<wbr>0x20 [xfs]</div>
<div>[11279.495079] [<ffffffffc05eec40>]
xfs_file_fsync+0x1b0/0x1e0 [xfs]</div>
<div>[11279.495086] [<ffffffffb944f0e7>]
do_fsync+0x67/0xb0</div>
<div>[11279.495090] [<ffffffffb992076f>] ?
system_call_after_swapgs+0xbc/<wbr>0x160</div>
<div>[11279.495094] [<ffffffffb944f3d0>]
SyS_fsync+0x10/0x20</div>
<div>[11279.495098] [<ffffffffb992082f>]
system_call_fastpath+0x1c/0x21</div>
<div>[11279.495102] [<ffffffffb992077b>] ?
system_call_after_swapgs+0xc8/<wbr>0x160</div>
<div>[11279.495105] INFO: task glusterposixfsy:14941
blocked for more than 120 seconds.</div>
<div>[11279.496606] "echo 0 >
/proc/sys/kernel/hung_task_<wbr>timeout_secs" disables
this message.</div>
<div>[11279.498114] glusterposixfsy D ffff972495f84f10
0 14941 1 0x00000080</div>
<div>[11279.498118] Call Trace:</div>
<div>[11279.498134] [<ffffffffc0309839>] ?
__split_and_process_bio+0x2e9/<wbr>0x520 [dm_mod]</div>
<div>[11279.498138] [<ffffffffb9913f79>]
schedule+0x29/0x70</div>
<div>[11279.498142] [<ffffffffb99118e9>]
schedule_timeout+0x239/0x2c0</div>
<div>[11279.498152] [<ffffffffc0309d98>] ?
dm_make_request+0x128/0x1a0 [dm_mod]</div>
<div>[11279.498156] [<ffffffffb991348d>]
io_schedule_timeout+0xad/0x130</div>
<div>[11279.498160] [<ffffffffb99145ad>]
wait_for_completion_io+0xfd/<wbr>0x140</div>
<div>[11279.498165] [<ffffffffb92cf1b0>] ?
wake_up_state+0x20/0x20</div>
<div>[11279.498169] [<ffffffffb951e574>]
blkdev_issue_flush+0xb4/0x110</div>
<div>[11279.498202] [<ffffffffc06064b9>]
xfs_blkdev_issue_flush+0x19/<wbr>0x20 [xfs]</div>
<div>[11279.498231] [<ffffffffc05eec40>]
xfs_file_fsync+0x1b0/0x1e0 [xfs]</div>
<div>[11279.498238] [<ffffffffb944f0e7>]
do_fsync+0x67/0xb0</div>
<div>[11279.498242] [<ffffffffb992076f>] ?
system_call_after_swapgs+0xbc/<wbr>0x160</div>
<div>[11279.498246] [<ffffffffb944f3f3>]
SyS_fdatasync+0x13/0x20</div>
<div>[11279.498250] [<ffffffffb992082f>]
system_call_fastpath+0x1c/0x21</div>
<div>[11279.498254] [<ffffffffb992077b>] ?
system_call_after_swapgs+0xc8/<wbr>0x160</div>
<div>[11279.498257] INFO: task glusteriotwr1:14950
blocked for more than 120 seconds.</div>
<div>[11279.499789] "echo 0 >
/proc/sys/kernel/hung_task_<wbr>timeout_secs" disables
this message.</div>
<div>[11279.501343] glusteriotwr1 D ffff97208b6daf70
0 14950 1 0x00000080</div>
<div>[11279.501348] Call Trace:</div>
<div>[11279.501353] [<ffffffffb9913f79>]
schedule+0x29/0x70</div>
<div>[11279.501390] [<ffffffffc060e388>]
_xfs_log_force_lsn+0x2e8/0x340 [xfs]</div>
<div>[11279.501396] [<ffffffffb92cf1b0>] ?
wake_up_state+0x20/0x20</div>
<div>[11279.501428] [<ffffffffc05eeb97>]
xfs_file_fsync+0x107/0x1e0 [xfs]</div>
<div>[11279.501432] [<ffffffffb944ef3f>]
generic_write_sync+0x4f/0x70</div>
<div>[11279.501461] [<ffffffffc05f0545>]
xfs_file_aio_write+0x155/0x1b0 [xfs]</div>
<div>[11279.501466] [<ffffffffb941a533>]
do_sync_write+0x93/0xe0</div>
<div>[11279.501471] [<ffffffffb941b010>]
vfs_write+0xc0/0x1f0</div>
<div>[11279.501475] [<ffffffffb941c002>]
SyS_pwrite64+0x92/0xc0</div>
<div>[11279.501479] [<ffffffffb992076f>] ?
system_call_after_swapgs+0xbc/<wbr>0x160</div>
<div>[11279.501483] [<ffffffffb992082f>]
system_call_fastpath+0x1c/0x21</div>
<div>[11279.501489] [<ffffffffb992077b>] ?
system_call_after_swapgs+0xc8/<wbr>0x160</div>
<div>[11279.501493] INFO: task glusteriotwr4:14953
blocked for more than 120 seconds.</div>
<div>[11279.503047] "echo 0 >
/proc/sys/kernel/hung_task_<wbr>timeout_secs" disables
this message.</div>
<div>[11279.504630] glusteriotwr4 D ffff972499f2bf40
0 14953 1 0x00000080</div>
<div>[11279.504635] Call Trace:</div>
<div>[11279.504640] [<ffffffffb9913f79>]
schedule+0x29/0x70</div>
<div>[11279.504676] [<ffffffffc060e388>]
_xfs_log_force_lsn+0x2e8/0x340 [xfs]</div>
<div>[11279.504681] [<ffffffffb92cf1b0>] ?
wake_up_state+0x20/0x20</div>
<div>[11279.504710] [<ffffffffc05eeb97>]
xfs_file_fsync+0x107/0x1e0 [xfs]</div>
<div>[11279.504714] [<ffffffffb944f0e7>]
do_fsync+0x67/0xb0</div>
<div>[11279.504718] [<ffffffffb992076f>] ?
system_call_after_swapgs+0xbc/<wbr>0x160</div>
<div>[11279.504722] [<ffffffffb944f3d0>]
SyS_fsync+0x10/0x20</div>
<div>[11279.504725] [<ffffffffb992082f>]
system_call_fastpath+0x1c/0x21</div>
<div>[11279.504730] [<ffffffffb992077b>] ?
system_call_after_swapgs+0xc8/<wbr>0x160</div>
<div>[12127.466494] perf: interrupt took too long (8263
> 8150), lowering kernel.perf_event_max_sample_<wbr>rate
to 24000</div>
</div>
<div><br>
</div>
<div>--------------------</div>
<div>I think this is the cause of the massive ovirt
performance issues irrespective of gluster volume. At
the time this happened, I was also ssh'ed into the host
and was running some rpm query commands. I had just run
rpm -qa | grep glusterfs (to verify which version was
actually installed), and that command took almost 2
minutes to return! Normally it takes less than 2
seconds. That is all pure local SSD IO, too....</div>
<div><br>
</div>
<div>I'm no expert, but it's my understanding that any time
software causes these kinds of issues, it's a serious
bug in the software, even if it's mishandled
exceptions. Is this correct?</div>
<span class="HOEnZb"><font color="#888888">
<div><br>
</div>
<div>--Jim</div>
</font></span></div>
<div class="HOEnZb">
<div class="h5">
<div class="gmail_extra"><br>
<div class="gmail_quote">On Tue, May 29, 2018 at 3:01
PM, Jim Kusznir <span dir="ltr"><<a
href="mailto:jim@palousetech.com"
target="_blank" moz-do-not-send="true">jim@palousetech.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">I think this is the profile
information for one of the volumes that lives on
the SSDs and is fully operational with no
down/problem disks:
<div><br>
</div>
<div>
<div>[root@ovirt2 yum.repos.d]# gluster volume
profile data info</div>
<div>Brick: ovirt2.nwfiber.com:/gluster/br<wbr>ick2/data</div>
<div>------------------------------<wbr>----------------</div>
<div>Cumulative Stats:</div>
<div> Block Size: 256b+
512b+ 1024b+ </div>
<div> No. of Reads: 983
2696 1059 </div>
<div>No. of Writes: 0
1113 302 </div>
<div> </div>
<div> Block Size: 2048b+
4096b+ 8192b+ </div>
<div> No. of Reads: 852
88608 53526 </div>
<div>No. of Writes: 522
812340 76257 </div>
<div> </div>
<div> Block Size: 16384b+
32768b+ 65536b+ </div>
<div> No. of Reads: 54351
241901 15024 </div>
<div>No. of Writes: 21636
8656 8976 </div>
<div> </div>
<div> Block Size: 131072b+ </div>
<div> No. of Reads: 524156 </div>
<div>No. of Writes: 296071 </div>
<div> %-latency Avg-latency Min-Latency
Max-Latency No. of calls Fop</div>
<div> --------- ----------- -----------
----------- ------------ ----</div>
<div> 0.00 0.00 us 0.00 us
0.00 us 4189 RELEASE</div>
<div> 0.00 0.00 us 0.00 us
0.00 us 1257 RELEASEDIR</div>
<div> 0.00 46.19 us 12.00 us
187.00 us 69 FLUSH</div>
<div> 0.00 147.00 us 78.00 us
367.00 us 86 REMOVEXATTR</div>
<div> 0.00 223.46 us 24.00 us
1166.00 us 149 READDIR</div>
<div> 0.00 565.34 us 76.00 us
3639.00 us 88 FTRUNCATE</div>
<div> 0.00 263.28 us 20.00 us
28385.00 us 228 LK</div>
<div> 0.00 98.84 us 2.00 us
880.00 us 1198 OPENDIR</div>
<div> 0.00 91.59 us 26.00 us
10371.00 us 3853 STATFS</div>
<div> 0.00 494.14 us 17.00 us
193439.00 us 1171 GETXATTR</div>
<div> 0.00 299.42 us 35.00 us
9799.00 us 2044 READDIRP</div>
<div> 0.00 1965.31 us 110.00 us
382258.00 us 321 XATTROP</div>
<div> 0.01 113.40 us 24.00 us
61061.00 us 8134 STAT</div>
<div> 0.01 755.38 us 57.00 us
607603.00 us 3196 DISCARD</div>
<div> 0.05 2690.09 us 58.00 us
2704761.00 us 3206 OPEN</div>
<div> 0.10 119978.25 us 97.00 us
9406684.00 us 154 SETATTR</div>
<div> 0.18 101.73 us 28.00 us
700477.00 us 313379 FSTAT</div>
<div> 0.23 1059.84 us 25.00 us
2716124.00 us 38255 LOOKUP</div>
<div> 0.47 1024.11 us 54.00 us
6197164.00 us 81455 FXATTROP</div>
<div> 1.72 2984.00 us 15.00 us
37098954.00 us 103020 FINODELK</div>
<div> 5.92 44315.32 us 51.00 us
24731536.00 us 23957 FSYNC</div>
<div> 13.27 2399.78 us 25.00 us
22089540.00 us 991005 READ</div>
<div> 37.00 5980.43 us 52.00 us
22099889.00 us 1108976 WRITE</div>
<div> 41.04 5452.75 us 13.00 us
22102452.00 us 1349053 INODELK</div>
<div> </div>
<div> Duration: 10026 seconds</div>
<div> Data Read: 80046027759 bytes</div>
<div>Data Written: 44496632320 bytes</div>
<div> </div>
<div>Interval 1 Stats:</div>
<div> Block Size: 256b+
512b+ 1024b+ </div>
<div> No. of Reads: 983
2696 1059 </div>
<div>No. of Writes: 0
838 185 </div>
<div> </div>
<div> Block Size: 2048b+
4096b+ 8192b+ </div>
<div> No. of Reads: 852
85856 51575 </div>
<div>No. of Writes: 382
705802 57812 </div>
<div> </div>
<div> Block Size: 16384b+
32768b+ 65536b+ </div>
<div> No. of Reads: 52673
232093 14984 </div>
<div>No. of Writes: 13499
4908 4242 </div>
<div> </div>
<div> Block Size: 131072b+ </div>
<div> No. of Reads: 460040 </div>
<div>No. of Writes: 6411 </div>
<div> %-latency Avg-latency Min-Latency
Max-Latency No. of calls Fop</div>
<div> --------- ----------- -----------
----------- ------------ ----</div>
<div> 0.00 0.00 us 0.00 us
0.00 us 2093 RELEASE</div>
<div> 0.00 0.00 us 0.00 us
0.00 us 1093 RELEASEDIR</div>
<div> 0.00 53.38 us 26.00 us
111.00 us 16 FLUSH</div>
<div> 0.00 145.14 us 78.00 us
367.00 us 71 REMOVEXATTR</div>
<div> 0.00 190.96 us 114.00 us
298.00 us 71 SETATTR</div>
<div> 0.00 213.38 us 24.00 us
1145.00 us 90 READDIR</div>
<div> 0.00 263.28 us 20.00 us
28385.00 us 228 LK</div>
<div> 0.00 101.76 us 2.00 us
880.00 us 1093 OPENDIR</div>
<div> 0.01 93.60 us 27.00 us
10371.00 us 3090 STATFS</div>
<div> 0.02 537.47 us 17.00 us
193439.00 us 1038 GETXATTR</div>
<div> 0.03 297.44 us 35.00 us
9799.00 us 1990 READDIRP</div>
<div> 0.03 2357.28 us 110.00 us
382258.00 us 253 XATTROP</div>
<div> 0.04 385.93 us 58.00 us
47593.00 us 2091 OPEN</div>
<div> 0.04 114.86 us 24.00 us
61061.00 us 7715 STAT</div>
<div> 0.06 444.59 us 57.00 us
333240.00 us 3053 DISCARD</div>
<div> 0.42 316.24 us 25.00 us
290728.00 us 29823 LOOKUP</div>
<div> 0.73 257.92 us 54.00 us
344812.00 us 63296 FXATTROP</div>
<div> 1.37 98.30 us 28.00 us
67621.00 us 313172 FSTAT</div>
<div> 1.58 2124.69 us 51.00 us
849200.00 us 16717 FSYNC</div>
<div> 5.73 162.46 us 52.00 us
748492.00 us 794079 WRITE</div>
<div> 7.19 2065.17 us 16.00 us
37098954.00 us 78381 FINODELK</div>
<div> 36.44 886.32 us 25.00 us
2216436.00 us 925421 READ</div>
<div> 46.30 1178.04 us 13.00 us
1700704.00 us 884635 INODELK</div>
<div> </div>
<div> Duration: 7485 seconds</div>
<div> Data Read: 71250527215 bytes</div>
<div>Data Written: 5119903744 bytes</div>
<div> </div>
<div>Brick: ovirt3.nwfiber.com:/gluster/br<wbr>ick2/data</div>
<div>------------------------------<wbr>----------------</div>
<div>Cumulative Stats:</div>
<div> Block Size: 1b+ </div>
<div> No. of Reads: 0 </div>
<div>No. of Writes: 3264419 </div>
<div> %-latency Avg-latency Min-Latency
Max-Latency No. of calls Fop</div>
<div> --------- ----------- -----------
----------- ------------ ----</div>
<div> 0.00 0.00 us 0.00 us
0.00 us 90 FORGET</div>
<div> 0.00 0.00 us 0.00 us
0.00 us 9462 RELEASE</div>
<div> 0.00 0.00 us 0.00 us
0.00 us 4254 RELEASEDIR</div>
<div> 0.00 50.52 us 13.00 us
190.00 us 71 FLUSH</div>
<div> 0.00 186.97 us 87.00 us
713.00 us 86 REMOVEXATTR</div>
<div> 0.00 79.32 us 33.00 us
189.00 us 228 LK</div>
<div> 0.00 220.98 us 129.00 us
513.00 us 86 SETATTR</div>
<div> 0.01 259.30 us 26.00 us
2632.00 us 137 READDIR</div>
<div> 0.02 322.76 us 145.00 us
2125.00 us 321 XATTROP</div>
<div> 0.03 109.55 us 2.00 us
1258.00 us 1193 OPENDIR</div>
<div> 0.05 70.21 us 21.00 us
431.00 us 3196 DISCARD</div>
<div> 0.05 169.26 us 21.00 us
2315.00 us 1545 GETXATTR</div>
<div> 0.12 176.85 us 63.00 us
2844.00 us 3206 OPEN</div>
<div> 0.61 303.49 us 90.00 us
3085.00 us 9633 FSTAT</div>
<div> 2.44 305.66 us 28.00 us
3716.00 us 38230 LOOKUP</div>
<div> 4.52 266.22 us 55.00 us
53424.00 us 81455 FXATTROP</div>
<div> 6.96 1397.99 us 51.00 us
64822.00 us 23889 FSYNC</div>
<div> 16.48 84.74 us 25.00 us
6917.00 us 932592 WRITE</div>
<div> 30.16 106.90 us 13.00 us
3920189.00 us 1353046 INODELK</div>
<div> 38.55 1794.52 us 14.00 us
16210553.00 us 103039 FINODELK</div>
<div> </div>
<div> Duration: 66562 seconds</div>
<div> Data Read: 0 bytes</div>
<div>Data Written: 3264419 bytes</div>
<div> </div>
<div>Interval 1 Stats:</div>
<div> Block Size: 1b+ </div>
<div> No. of Reads: 0 </div>
<div>No. of Writes: 794080 </div>
<div> %-latency Avg-latency Min-Latency
Max-Latency No. of calls Fop</div>
<div> --------- ----------- -----------
----------- ------------ ----</div>
<div> 0.00 0.00 us 0.00 us
0.00 us 2093 RELEASE</div>
<div> 0.00 0.00 us 0.00 us
0.00 us 1093 RELEASEDIR</div>
<div> 0.00 70.31 us 26.00 us
125.00 us 16 FLUSH</div>
<div> 0.00 193.10 us 103.00 us
713.00 us 71 REMOVEXATTR</div>
<div> 0.01 227.32 us 133.00 us
513.00 us 71 SETATTR</div>
<div> 0.01 79.32 us 33.00 us
189.00 us 228 LK</div>
<div> 0.01 259.83 us 35.00 us
1138.00 us 89 READDIR</div>
<div> 0.03 318.26 us 145.00 us
2047.00 us 253 XATTROP</div>
<div> 0.04 112.67 us 3.00 us
1258.00 us 1093 OPENDIR</div>
<div> 0.06 167.98 us 23.00 us
1951.00 us 1014 GETXATTR</div>
<div> 0.08 70.97 us 22.00 us
431.00 us 3053 DISCARD</div>
<div> 0.13 183.78 us 66.00 us
2844.00 us 2091 OPEN</div>
<div> 1.01 303.82 us 90.00 us
3085.00 us 9610 FSTAT</div>
<div> 3.27 316.59 us 30.00 us
3716.00 us 29820 LOOKUP</div>
<div> 5.83 265.79 us 59.00 us
53424.00 us 63296 FXATTROP</div>
<div> 7.95 1373.89 us 51.00 us
64822.00 us 16717 FSYNC</div>
<div> 23.17 851.99 us 14.00 us
16210553.00 us 78555 FINODELK</div>
<div> 24.04 87.44 us 27.00 us
6917.00 us 794081 WRITE</div>
<div> 34.36 111.91 us 14.00 us
984871.00 us 886790 INODELK</div>
<div> </div>
<div> Duration: 7485 seconds</div>
<div> Data Read: 0 bytes</div>
<div>Data Written: 794080 bytes</div>
</div>
<div><br>
</div>
<div><br>
</div>
<div>-----------------------</div>
<div>Here is the data from the volume that is
backed by the SSHDs and has one failed disk:</div>
<div>
<div>[root@ovirt2 yum.repos.d]# gluster volume
profile data-hdd info</div>
<div>Brick: 172.172.1.12:/gluster/brick3/d<wbr>ata-hdd</div>
<div>------------------------------<wbr>--------------</div>
<div>Cumulative Stats:</div>
<div> Block Size: 256b+
512b+ 1024b+ </div>
<div> No. of Reads: 1702
86 16 </div>
<div>No. of Writes: 0
767 71 </div>
<div> </div>
<div> Block Size: 2048b+
4096b+ 8192b+ </div>
<div> No. of Reads: 19
51841 2049 </div>
<div>No. of Writes: 76
60668 35727 </div>
<div> </div>
<div> Block Size: 16384b+
32768b+ 65536b+ </div>
<div> No. of Reads: 1744
639 1088 </div>
<div>No. of Writes: 8524
2410 1285 </div>
<div> </div>
<div> Block Size: 131072b+ </div>
<div> No. of Reads: 771999 </div>
<div>No. of Writes: 29584 </div>
<div> %-latency Avg-latency Min-Latency
Max-Latency No. of calls Fop</div>
<div> --------- ----------- -----------
----------- ------------ ----</div>
<div> 0.00 0.00 us 0.00 us
0.00 us 2902 RELEASE</div>
<div> 0.00 0.00 us 0.00 us
0.00 us 1517 RELEASEDIR</div>
<div> 0.00 197.00 us 197.00 us
197.00 us 1 FTRUNCATE</div>
<div> 0.00 70.24 us 16.00 us
758.00 us 51 FLUSH</div>
<div> 0.00 143.93 us 82.00 us
305.00 us 57 REMOVEXATTR</div>
<div> 0.00 178.63 us 105.00 us
712.00 us 60 SETATTR</div>
<div> 0.00 67.30 us 19.00 us
572.00 us 555 LK</div>
<div> 0.00 322.80 us 23.00 us
4673.00 us 138 READDIR</div>
<div> 0.00 336.56 us 106.00 us
11994.00 us 237 XATTROP</div>
<div> 0.00 84.70 us 28.00 us
1071.00 us 3469 STATFS</div>
<div> 0.01 387.75 us 2.00 us
146017.00 us 1467 OPENDIR</div>
<div> 0.01 148.59 us 21.00 us
64374.00 us 4454 STAT</div>
<div> 0.02 783.02 us 16.00 us
93502.00 us 1902 GETXATTR</div>
<div> 0.03 1516.10 us 17.00 us
210690.00 us 1364 ENTRYLK</div>
<div> 0.03 2555.47 us 300.00 us
674454.00 us 1064 READDIRP</div>
<div> 0.07 85.74 us 19.00 us
68340.00 us 62849 FSTAT</div>
<div> 0.07 1978.12 us 59.00 us
202596.00 us 2729 OPEN</div>
<div> 0.22 708.57 us 15.00 us
394799.00 us 25447 LOOKUP</div>
<div> 5.94 2331.74 us 15.00 us
1099530.00 us 207534 FINODELK</div>
<div> 7.31 8311.75 us 58.00 us
1800216.00 us 71668 FXATTROP</div>
<div> 12.49 7735.19 us 51.00 us
3595513.00 us 131642 WRITE</div>
<div> 17.70 957.08 us 16.00 us
13700466.00 us 1508160 INODELK</div>
<div> 24.55 2546.43 us 26.00 us
5077347.00 us 786060 READ</div>
<div> 31.56 49699.15 us 47.00 us
3746331.00 us 51777 FSYNC</div>
<div> </div>
<div> Duration: 10101 seconds</div>
<div> Data Read: 101562897361 bytes</div>
<div>Data Written: 4834450432 bytes</div>
<div> </div>
<div>Interval 0 Stats:</div>
<div> Block Size: 256b+
512b+ 1024b+ </div>
<div> No. of Reads: 1702
86 16 </div>
<div>No. of Writes: 0
767 71 </div>
<div> </div>
<div> Block Size: 2048b+
4096b+ 8192b+ </div>
<div> No. of Reads: 19
51841 2049 </div>
<div>No. of Writes: 76
60668 35727 </div>
<div> </div>
<div> Block Size: 16384b+
32768b+ 65536b+ </div>
<div> No. of Reads: 1744
639 1088 </div>
<div>No. of Writes: 8524
2410 1285 </div>
<div> </div>
<div> Block Size: 131072b+ </div>
<div> No. of Reads: 771999 </div>
<div>No. of Writes: 29584 </div>
<div> %-latency Avg-latency Min-Latency
Max-Latency No. of calls Fop</div>
<div> --------- ----------- -----------
----------- ------------ ----</div>
<div> 0.00 0.00 us 0.00 us
0.00 us 2902 RELEASE</div>
<div> 0.00 0.00 us 0.00 us
0.00 us 1517 RELEASEDIR</div>
<div> 0.00 197.00 us 197.00 us
197.00 us 1 FTRUNCATE</div>
<div> 0.00 70.24 us 16.00 us
758.00 us 51 FLUSH</div>
<div> 0.00 143.93 us 82.00 us
305.00 us 57 REMOVEXATTR</div>
<div> 0.00 178.63 us 105.00 us
712.00 us 60 SETATTR</div>
<div> 0.00 67.30 us 19.00 us
572.00 us 555 LK</div>
<div> 0.00 322.80 us 23.00 us
4673.00 us 138 READDIR</div>
<div> 0.00 336.56 us 106.00 us
11994.00 us 237 XATTROP</div>
<div> 0.00 84.70 us 28.00 us
1071.00 us 3469 STATFS</div>
<div> 0.01 387.75 us 2.00 us
146017.00 us 1467 OPENDIR</div>
<div> 0.01 148.59 us 21.00 us
64374.00 us 4454 STAT</div>
<div> 0.02 783.02 us 16.00 us
93502.00 us 1902 GETXATTR</div>
<div> 0.03 1516.10 us 17.00 us
210690.00 us 1364 ENTRYLK</div>
<div> 0.03 2555.47 us 300.00 us
674454.00 us 1064 READDIRP</div>
<div> 0.07 85.73 us 19.00 us
68340.00 us 62849 FSTAT</div>
<div> 0.07 1978.12 us 59.00 us
202596.00 us 2729 OPEN</div>
<div> 0.22 708.57 us 15.00 us
394799.00 us 25447 LOOKUP</div>
<div> 5.94 2334.57 us 15.00 us
1099530.00 us 207534 FINODELK</div>
<div> 7.31 8311.49 us 58.00 us
1800216.00 us 71668 FXATTROP</div>
<div> 12.49 7735.32 us 51.00 us
3595513.00 us 131642 WRITE</div>
<div> 17.71 957.08 us 16.00 us
13700466.00 us 1508160 INODELK</div>
<div> 24.56 2546.42 us 26.00 us
5077347.00 us 786060 READ</div>
<div> 31.54 49651.63 us 47.00 us
3746331.00 us 51777 FSYNC</div>
<div> </div>
<div> Duration: 10101 seconds</div>
<div> Data Read: 101562897361 bytes</div>
<div>Data Written: 4834450432 bytes</div>
</div>
<div><br>
</div>
</div>
<div class="m_-5992202424066002276HOEnZb">
<div class="m_-5992202424066002276h5">
<div class="gmail_extra"><br>
<div class="gmail_quote">On Tue, May 29,
2018 at 2:55 PM, Jim Kusznir <span
dir="ltr"><<a
href="mailto:jim@palousetech.com"
target="_blank" moz-do-not-send="true">jim@palousetech.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote"
style="margin:0 0 0 .8ex;border-left:1px
#ccc solid;padding-left:1ex">
<div dir="ltr">Thank you for your
response.
<div><br>
</div>
<div>I have 4 gluster volumes. 3 are
replica 2 + arbiter; the replica
bricks are on ovirt1 and ovirt2, and
the arbiter brick is on ovirt3. The
4th volume is replica 3, with a brick
on all three ovirt machines.</div>
<div><br>
</div>
<div>The first 3 volumes are on an SSD
disk; the 4th is on a Seagate SSHD
(same in all three machines). On
ovirt3, the SSHD has reported hard
IO failures, and that brick is
offline. However, the other two
replicas are fully operational
(although they still show contents
in the heal info command that won't
go away, but that may be the case
until I replace the failed disk).</div>
<div><br>
</div>
<div>What is bothering me is that ALL
4 gluster volumes are showing
horrible performance issues. At
this point, as the bad disk has been
completely offlined, I would expect
gluster to perform at normal speed,
but that is definitely not the case.</div>
<div><br>
</div>
<div>I've also noticed that the
performance hits seem to come in
waves: things seem to work
acceptably (but slowly) for a while,
then suddenly it's as if all disk IO
on all volumes (including
non-gluster local OS disk volumes
for the hosts) pauses for about 30
seconds, then IO resumes again.
During those times, I start getting
"VM not responding" and "host not
responding" notices, as well as the
applications having major issues.</div>
<div><br>
</div>
<div>I've shut down most of my VMs and
am down to just my essential core
VMs (shed about 75% of my VMs).
I am still experiencing the same
issues.</div>
<div><br>
</div>
<div>Am I correct in believing that
once the failed disk was brought
offline, performance should
return to normal?</div>
</div>
<div
class="m_-5992202424066002276m_1037085839393797930HOEnZb">
<div
class="m_-5992202424066002276m_1037085839393797930h5">
<div class="gmail_extra"><br>
<div class="gmail_quote">On Tue,
May 29, 2018 at 1:27 PM, Alex K
<span dir="ltr"><<a
href="mailto:rightkicktech@gmail.com"
target="_blank"
moz-do-not-send="true">rightkicktech@gmail.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px #ccc
solid;padding-left:1ex">
<div dir="auto">I would check
disks status and
accessibility of mount
points where your gluster
volumes reside.</div>
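<div dir="auto">For example (device names here are just
placeholders): smartctl -H /dev/sdX for each disk
backing a brick, and df -h /gluster/brick2
/gluster/brick3 to confirm the brick mount
points are still mounted.</div>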
<br>
<div class="gmail_quote">
<div>
<div
class="m_-5992202424066002276m_1037085839393797930m_-4909453786756208844h5">
<div dir="ltr">On Tue,
May 29, 2018, 22:28
Jim Kusznir <<a
href="mailto:jim@palousetech.com"
target="_blank"
moz-do-not-send="true">jim@palousetech.com</a>>
wrote:<br>
</div>
</div>
</div>
<blockquote
class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px #ccc
solid;padding-left:1ex">
<div>
<div
class="m_-5992202424066002276m_1037085839393797930m_-4909453786756208844h5">
<div dir="ltr">On one
ovirt server, I'm
now seeing these
messages:
<div>
<div>[56474.239725]
blk_update_request: 63 callbacks suppressed</div>
<div>[56474.239732]
blk_update_request: I/O error, dev dm-2, sector 0</div>
<div>[56474.240602]
blk_update_request: I/O error, dev dm-2, sector 3905945472</div>
<div>[56474.241346]
blk_update_request: I/O error, dev dm-2, sector 3905945584</div>
<div>[56474.242236]
blk_update_request: I/O error, dev dm-2, sector 2048</div>
<div>[56474.243072]
blk_update_request: I/O error, dev dm-2, sector 3905943424</div>
<div>[56474.243997]
blk_update_request: I/O error, dev dm-2, sector 3905943536</div>
<div>[56474.247347]
blk_update_request: I/O error, dev dm-2, sector 0</div>
<div>[56474.248315]
blk_update_request: I/O error, dev dm-2, sector 3905945472</div>
<div>[56474.249231]
blk_update_request: I/O error, dev dm-2, sector 3905945584</div>
<div>[56474.250221]
blk_update_request: I/O error, dev dm-2, sector 2048</div>
</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
</div>
<div
class="gmail_extra"><br>
<div
class="gmail_quote">On
Tue, May 29, 2018
at 11:59 AM, Jim
Kusznir <span
dir="ltr"><<a
href="mailto:jim@palousetech.com" rel="noreferrer" target="_blank"
moz-do-not-send="true">jim@palousetech.com</a>></span>
wrote:<br>
<blockquote
class="gmail_quote"
style="margin:0
0 0
.8ex;border-left:1px
#ccc
solid;padding-left:1ex">
<div dir="ltr">I
see in
messages on
ovirt3 (my 3rd
machine, the
one upgraded
to 4.2):
<div><br>
</div>
<div>
<div>May 29
11:54:41
ovirt3
ovs-vsctl:
ovs|00001|db_ctl_base|ERR|unix<wbr>:/var/run/openvswitch/db.sock:
database
connection
failed (No
such file or
directory)</div>
<div>May 29
11:54:51
ovirt3
ovs-vsctl:
ovs|00001|db_ctl_base|ERR|unix<wbr>:/var/run/openvswitch/db.sock:
database
connection
failed (No
such file or
directory)</div>
<div>May 29
11:55:01
ovirt3
ovs-vsctl:
ovs|00001|db_ctl_base|ERR|unix<wbr>:/var/run/openvswitch/db.sock:
database
connection
failed (No
such file or
directory)</div>
</div>
<div>(appears
a lot).</div>
<div><br>
</div>
<div>I also found, in the ssh session
on that host, some sysv warnings
about the backing disk for one of
the gluster volumes (straight
replica 3). The glusterfs process
for that disk on that machine went
offline. It's my understanding that
it should continue to work with the
other two machines while I attempt
to replace that disk, right?
Attempted writes (touching an empty
file) can take 15 seconds; repeating
them later is much faster.</div>
<div><br>
</div>
<div>Gluster generates a bunch of
different log files; I don't know
which ones you want, or from which
machine(s).</div>
<div><br>
</div>
<div>How do I
do "volume
profiling"?</div>
<div><br>
</div>
<div>Thanks!</div>
</div>
<div
class="m_-5992202424066002276m_1037085839393797930m_-4909453786756208844m_-1594786904780884718m_492621309039667928HOEnZb">
<div
class="m_-5992202424066002276m_1037085839393797930m_-4909453786756208844m_-1594786904780884718m_492621309039667928h5">
<div
class="gmail_extra"><br>
<div
class="gmail_quote">On
Tue, May 29,
2018 at 11:53
AM, Sahina
Bose <span
dir="ltr"><<a
href="mailto:sabose@redhat.com" rel="noreferrer" target="_blank"
moz-do-not-send="true">sabose@redhat.com</a>></span>
wrote:<br>
<blockquote
class="gmail_quote"
style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">
<div>Do you
see errors
reported in
the mount logs
for the
volume? If so,
could you
attach the
logs?<br>
</div>
Any issues
with your
underlying
disks. Can you
also attach
output of
volume
profiling?<br>
</div>
<div
class="m_-5992202424066002276m_1037085839393797930m_-4909453786756208844m_-1594786904780884718m_492621309039667928m_-6088757787094439702HOEnZb">
<div
class="m_-5992202424066002276m_1037085839393797930m_-4909453786756208844m_-1594786904780884718m_492621309039667928m_-6088757787094439702h5">
<div
class="gmail_extra"><br>
<div
class="gmail_quote">On
Wed, May 30,
2018 at 12:13
AM, Jim
Kusznir <span
dir="ltr"><<a
href="mailto:jim@palousetech.com" rel="noreferrer" target="_blank"
moz-do-not-send="true">jim@palousetech.com</a>></span>
wrote:<br>
<blockquote
class="gmail_quote"
style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">Ok,
things have
gotten MUCH
worse this
morning. I'm
getting random
errors from
VMs, right
now, about a
third of my
VMs have been
paused due to
storage
issues, and
most of the
remaining VMs
are not
performing
well.
<div><br>
</div>
<div>At this
point, I am in
full EMERGENCY
mode, as my
production
services are
now impacted,
and I'm
getting calls
coming in with
problems...</div>
<div><br>
</div>
<div>I'd greatly appreciate help...
VMs are running VERY slowly (when
they run at all), and they are
steadily getting worse. I don't
know why. I was seeing CPU peaks
(to 100%) on several VMs, in
perfect sync, for a few minutes at
a time (while the VM became
unresponsive, and any Linux VMs I
was logged into were giving me the
CPU-stuck messages from my original
post). Is all this storage
related?</div>
<div><br>
</div>
<div>I also
have two
different
gluster
volumes for VM
storage, and
only one had
the issues,
but now VMs in
both are being
affected at
the same time
and same way.</div>
<span
class="m_-5992202424066002276m_1037085839393797930m_-4909453786756208844m_-1594786904780884718m_492621309039667928m_-6088757787094439702m_1448879657997877339HOEnZb"><font
color="#888888">
<div><br>
</div>
<div>--Jim</div>
</font></span></div>
<div
class="m_-5992202424066002276m_1037085839393797930m_-4909453786756208844m_-1594786904780884718m_492621309039667928m_-6088757787094439702m_1448879657997877339HOEnZb">
<div
class="m_-5992202424066002276m_1037085839393797930m_-4909453786756208844m_-1594786904780884718m_492621309039667928m_-6088757787094439702m_1448879657997877339h5">
<div
class="gmail_extra"><br>
<div
class="gmail_quote">On
Mon, May 28,
2018 at 10:50
PM, Sahina
Bose <span
dir="ltr"><<a
href="mailto:sabose@redhat.com" rel="noreferrer" target="_blank"
moz-do-not-send="true">sabose@redhat.com</a>></span>
wrote:<br>
<blockquote
class="gmail_quote"
style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">[Adding
gluster-users
to look at the
heal issue]<br>
</div>
<div
class="gmail_extra"><br>
<div
class="gmail_quote">
<div>
<div
class="m_-5992202424066002276m_1037085839393797930m_-4909453786756208844m_-1594786904780884718m_492621309039667928m_-6088757787094439702m_1448879657997877339m_2506865858631215125h5">On
Tue, May 29,
2018 at 9:17
AM, Jim
Kusznir <span
dir="ltr"><<a
href="mailto:jim@palousetech.com" rel="noreferrer" target="_blank"
moz-do-not-send="true">jim@palousetech.com</a>></span>
wrote:<br>
</div>
</div>
<blockquote
class="gmail_quote"
style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div>
<div
class="m_-5992202424066002276m_1037085839393797930m_-4909453786756208844m_-1594786904780884718m_492621309039667928m_-6088757787094439702m_1448879657997877339m_2506865858631215125h5">
<div dir="ltr">Hello:
<div><br>
</div>
<div>I've been
having some
cluster and
gluster
performance
issues lately.
I also found
that my
cluster was
out of date
and was trying
to apply
updates
(hoping to fix
some of
these), and
discovered the
ovirt 4.1
repos were
taken
completely
offline. So
I was forced
to begin an
upgrade to
4.2.
According to
the docs I
found/read, I
needed only to
add the new
repo, do a yum
update, and
reboot to be
good on my
hosts (I did
the yum
update, and
ran
engine-setup
on my hosted
engine).
Things seemed
to work
relatively
well, except
for a gluster
sync issue
that showed
up.</div>
<div><br>
</div>
<div>My
cluster is a 3
node
hyperconverged
cluster. I
upgraded the
hosted engine
first, then
engine 3.
When engine 3
came back up,
for some
reason one of
my gluster
volumes would
not sync.
Here's sample
output:</div>
<div><br>
</div>
<div>
<div>[root@ovirt3
~]# gluster
volume heal
data-hdd info</div>
<div>Brick
172.172.1.11:/gluster/brick3/d<wbr>ata-hdd</div>
<div>/cc65f671-3377-494a-a7d4-1d9f7<wbr>c3ae46c/images/48d7ecb8-7ac5-4<wbr>725-bca5-b3519681cf2f/0d6080b0<wbr>-7018-4fa3-bb82-1dd9ef07d9b9 </div>
<div>/cc65f671-3377-494a-a7d4-1d9f7<wbr>c3ae46c/images/647be733-f153-4<wbr>cdc-85bd-ba72544c2631/b453a300<wbr>-0602-4be1-8310-8bd5abe00971 </div>
<div>/cc65f671-3377-494a-a7d4-1d9f7<wbr>c3ae46c/images/6da854d1-b6be-4<wbr>46b-9bf0-90a0dbbea830/3c93bd1f<wbr>-b7fa-4aa2-b445-6904e31839ba </div>
<div>/cc65f671-3377-494a-a7d4-1d9f7<wbr>c3ae46c/images/7f647567-d18c-4<wbr>4f1-a58e-9b8865833acb/f9364470<wbr>-9770-4bb1-a6b9-a54861849625 </div>
<div>/cc65f671-3377-494a-a7d4-1d9f7<wbr>c3ae46c/images/f3c8e7aa-6ef2-4<wbr>2a7-93d4-e0a4df6dd2fa/2eb0b1ad<wbr>-2606-44ef-9cd3-ae59610a504b </div>
<div>/cc65f671-3377-494a-a7d4-1d9f7<wbr>c3ae46c/images/b1ea3f62-0f05-4<wbr>ded-8c82-9c91c90e0b61/d5d6bf5a<wbr>-499f-431d-9013-5453db93ed32 </div>
<div>/cc65f671-3377-494a-a7d4-1d9f7<wbr>c3ae46c/images/8c8b5147-e9d6-4<wbr>810-b45b-185e3ed65727/16f08231<wbr>-93b0-489d-a2fd-687b6bf88eaa </div>
<div>/cc65f671-3377-494a-a7d4-1d9f7<wbr>c3ae46c/images/12924435-b9c2-4<wbr>aab-ba19-1c1bc31310ef/07b3db69<wbr>-440e-491e-854c-bbfa18a7cff2 </div>
<div>Status:
Connected</div>
<div>Number of
entries: 8</div>
<div><br>
</div>
<div>Brick
172.172.1.12:/gluster/brick3/d<wbr>ata-hdd</div>
<div>/cc65f671-3377-494a-a7d4-1d9f7<wbr>c3ae46c/images/48d7ecb8-7ac5-4<wbr>725-bca5-b3519681cf2f/0d6080b0<wbr>-7018-4fa3-bb82-1dd9ef07d9b9 </div>
<div>/cc65f671-3377-494a-a7d4-1d9f7<wbr>c3ae46c/images/647be733-f153-4<wbr>cdc-85bd-ba72544c2631/b453a300<wbr>-0602-4be1-8310-8bd5abe00971 </div>
<div>/cc65f671-3377-494a-a7d4-1d9f7<wbr>c3ae46c/images/b1ea3f62-0f05-4<wbr>ded-8c82-9c91c90e0b61/d5d6bf5a<wbr>-499f-431d-9013-5453db93ed32 </div>
<div>/cc65f671-3377-494a-a7d4-1d9f7<wbr>c3ae46c/images/6da854d1-b6be-4<wbr>46b-9bf0-90a0dbbea830/3c93bd1f<wbr>-b7fa-4aa2-b445-6904e31839ba </div>
<div>/cc65f671-3377-494a-a7d4-1d9f7<wbr>c3ae46c/images/7f647567-d18c-4<wbr>4f1-a58e-9b8865833acb/f9364470<wbr>-9770-4bb1-a6b9-a54861849625 </div>
<div>/cc65f671-3377-494a-a7d4-1d9f7<wbr>c3ae46c/images/8c8b5147-e9d6-4<wbr>810-b45b-185e3ed65727/16f08231<wbr>-93b0-489d-a2fd-687b6bf88eaa </div>
<div>/cc65f671-3377-494a-a7d4-1d9f7<wbr>c3ae46c/images/12924435-b9c2-4<wbr>aab-ba19-1c1bc31310ef/07b3db69<wbr>-440e-491e-854c-bbfa18a7cff2 </div>
<div>/cc65f671-3377-494a-a7d4-1d9f7<wbr>c3ae46c/images/f3c8e7aa-6ef2-4<wbr>2a7-93d4-e0a4df6dd2fa/2eb0b1ad<wbr>-2606-44ef-9cd3-ae59610a504b </div>
<div>Status:
Connected</div>
<div>Number of
entries: 8</div>
<div><br>
</div>
<div>Brick
172.172.1.13:/gluster/brick3/d<wbr>ata-hdd</div>
<div>/cc65f671-3377-494a-a7d4-1d9f7<wbr>c3ae46c/images/b1ea3f62-0f05-4<wbr>ded-8c82-9c91c90e0b61/d5d6bf5a<wbr>-499f-431d-9013-5453db93ed32 </div>
<div>/cc65f671-3377-494a-a7d4-1d9f7<wbr>c3ae46c/images/8c8b5147-e9d6-4<wbr>810-b45b-185e3ed65727/16f08231<wbr>-93b0-489d-a2fd-687b6bf88eaa </div>
<div>/cc65f671-3377-494a-a7d4-1d9f7<wbr>c3ae46c/images/12924435-b9c2-4<wbr>aab-ba19-1c1bc31310ef/07b3db69<wbr>-440e-491e-854c-bbfa18a7cff2 </div>
<div>/cc65f671-3377-494a-a7d4-1d9f7<wbr>c3ae46c/images/f3c8e7aa-6ef2-4<wbr>2a7-93d4-e0a4df6dd2fa/2eb0b1ad<wbr>-2606-44ef-9cd3-ae59610a504b </div>
<div>/cc65f671-3377-494a-a7d4-1d9f7<wbr>c3ae46c/images/647be733-f153-4<wbr>cdc-85bd-ba72544c2631/b453a300<wbr>-0602-4be1-8310-8bd5abe00971 </div>
<div>/cc65f671-3377-494a-a7d4-1d9f7<wbr>c3ae46c/images/48d7ecb8-7ac5-4<wbr>725-bca5-b3519681cf2f/0d6080b0<wbr>-7018-4fa3-bb82-1dd9ef07d9b9 </div>
<div>/cc65f671-3377-494a-a7d4-1d9f7<wbr>c3ae46c/images/6da854d1-b6be-4<wbr>46b-9bf0-90a0dbbea830/3c93bd1f<wbr>-b7fa-4aa2-b445-6904e31839ba </div>
<div>/cc65f671-3377-494a-a7d4-1d9f7<wbr>c3ae46c/images/7f647567-d18c-4<wbr>4f1-a58e-9b8865833acb/f9364470<wbr>-9770-4bb1-a6b9-a54861849625 </div>
<div>Status:
Connected</div>
<div>Number of
entries: 8</div>
</div>
<div><br>
</div>
<div>---------</div>
<div>It's been
in this state
for a couple
of days now,
and bandwidth
monitoring
shows no
appreciable
data moving.
I've
repeatedly
tried
commanding a
full heal from
all three
nodes in the
cluster. It's
always the
same files
that need
healing.</div>
<div><br>
</div>
<div>When
running
gluster volume
heal data-hdd
statistics, I
sometimes see
different
information,
but always
some number of
"heal failed"
entries. It
shows 0 for
split brain.</div>
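<div><br>
</div>
<div>(For
reference, the
standard
commands for
this are
gluster volume
heal data-hdd
full to kick
off a full
heal, and
gluster volume
heal data-hdd
info or
gluster volume
heal data-hdd
statistics
heal-count to
check what
still needs
healing.)</div>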
<div><br>
</div>
<div>I'm not
quite sure
what to do. I
suspect it may
be due to
nodes 1 and 2
still being on
the older
ovirt/gluster
release, but
I'm afraid to
upgrade and
reboot them
until I have a
good gluster
sync (don't
need to create
a split brain
issue). How
do I proceed
with this?</div>
<div><br>
</div>
<div>Second
issue: I've
been
experiencing
VERY POOR
performance on
most of my
VMs, to the
point that
logging into a
Windows 10 VM
via remote
desktop can
take 5
minutes, and
launching
QuickBooks
inside said VM
can easily
take 10
minutes. On
some Linux
VMs, I get
random
messages like
this:</div>
<div>
<div>Message
from
syslogd@unifi
at May 28
20:39:23 ...</div>
<div> kernel:[6171996.308904]
NMI watchdog:
BUG: soft
lockup - CPU#0
stuck for 22s!
[mongod:14766]</div>
</div>
<div><br>
</div>
<div>(the
process and
PID are often
different)</div>
<div><br>
</div>
<div>I'm not
quite sure
what to do
about this
either. My
initial
thought was to
upgrade
everything to
current and
see if it's
still there,
but I cannot
move forward
with that
until my
gluster is
healed...</div>
<div><br>
</div>
<div>Thanks!</div>
<span
class="m_-5992202424066002276m_1037085839393797930m_-4909453786756208844m_-1594786904780884718m_492621309039667928m_-6088757787094439702m_1448879657997877339m_2506865858631215125m_-3484925472286407273HOEnZb"><font
color="#888888">
<div>--Jim</div>
</font></span></div>
<br>
</div>
</div>
______________________________<wbr>_________________<br>
Users mailing
list -- <a
href="mailto:users@ovirt.org"
rel="noreferrer" target="_blank" moz-do-not-send="true">users@ovirt.org</a><br>
To unsubscribe
send an email
to <a
href="mailto:users-leave@ovirt.org"
rel="noreferrer" target="_blank" moz-do-not-send="true">users-leave@ovirt.org</a><br>
Privacy
Statement: <a
href="https://www.ovirt.org/site/privacy-policy/" rel="noreferrer
noreferrer"
target="_blank"
moz-do-not-send="true">https://www.ovirt.org/site/pri<wbr>vacy-policy/</a><br>
oVirt Code of
Conduct: <a
href="https://www.ovirt.org/community/about/community-guidelines/"
rel="noreferrer noreferrer" target="_blank" moz-do-not-send="true">https://www.ovirt.org/communit<wbr>y/about/community-guidelines/</a><br>
List Archives:
<a
href="https://lists.ovirt.org/archives/list/users@ovirt.org/message/3LEV6ZQ3JV2XLAL7NYBTXOYMYUOTIRQF/"
rel="noreferrer noreferrer" target="_blank" moz-do-not-send="true">https://lists.ovirt.org/archiv<wbr>es/list/users@ovirt.org/messag<wbr>e/3LEV6ZQ3JV2XLAL7NYBTXOYMYUOT<wbr>IRQF/</a><br>
<br>
</blockquote>
</div>
<br>
</div>
</blockquote>
</div>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
______________________________<wbr>_________________<br>
Users mailing list --
<a
href="mailto:users@ovirt.org"
rel="noreferrer"
target="_blank"
moz-do-not-send="true">users@ovirt.org</a><br>
To unsubscribe send an
email to <a
href="mailto:users-leave@ovirt.org"
rel="noreferrer"
target="_blank"
moz-do-not-send="true">users-leave@ovirt.org</a><br>
Privacy Statement: <a
href="https://www.ovirt.org/site/privacy-policy/" rel="noreferrer
noreferrer"
target="_blank"
moz-do-not-send="true">https://www.ovirt.org/site/pri<wbr>vacy-policy/</a><br>
oVirt Code of Conduct:
<a
href="https://www.ovirt.org/community/about/community-guidelines/"
rel="noreferrer
noreferrer"
target="_blank"
moz-do-not-send="true">https://www.ovirt.org/communit<wbr>y/about/community-guidelines/</a><br>
</div>
</div>
List Archives: <a
href="https://lists.ovirt.org/archives/list/users@ovirt.org/message/ACO7RFSLBSRBAIONIC2HQ6Z24ZDES5MF/"
rel="noreferrer
noreferrer"
target="_blank"
moz-do-not-send="true">https://lists.ovirt.org/archiv<wbr>es/list/users@ovirt.org/messag<wbr>e/ACO7RFSLBSRBAIONIC2HQ6Z24ZDE<wbr>S5MF/</a><br>
</blockquote>
</div>
</blockquote>
</div>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</div>
</div>
<br>
______________________________<wbr>_________________<br>
Users mailing list -- <a href="mailto:users@ovirt.org"
moz-do-not-send="true">users@ovirt.org</a><br>
To unsubscribe send an email to <a
href="mailto:users-leave@ovirt.org" moz-do-not-send="true">users-leave@ovirt.org</a><br>
Privacy Statement: <a
href="https://www.ovirt.org/site/privacy-policy/"
rel="noreferrer" target="_blank" moz-do-not-send="true">https://www.ovirt.org/site/<wbr>privacy-policy/</a><br>
oVirt Code of Conduct: <a
href="https://www.ovirt.org/community/about/community-guidelines/"
rel="noreferrer" target="_blank" moz-do-not-send="true">https://www.ovirt.org/<wbr>community/about/community-<wbr>guidelines/</a><br>
List Archives: <a
href="https://lists.ovirt.org/archives/list/users@ovirt.org/message/3DEQQLJM3WHQNZJ7KEMRZVFZ52MTIL74/"
rel="noreferrer" target="_blank" moz-do-not-send="true">https://lists.ovirt.org/<wbr>archives/list/users@ovirt.org/<wbr>message/<wbr>3DEQQLJM3WHQNZJ7KEMRZVFZ52MTIL<wbr>74/</a><br>
<br>
</blockquote>
</div>
<br>
</div>
</blockquote>
<br>
</body>
</html>