[Gluster-users] Healing will not stop after replacing a brick in disperse volume

柯名峻 fmpnate at gmail.com
Fri Mar 18 09:19:04 UTC 2016

Hi gluster users,

I'm running a write-read-verify test on a 2 + 1 disperse volume.

The test keeps running write-read-verify on a test file until it is told to stop.
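
For readers who want to reproduce this, here is a minimal sketch of the kind of write-read-verify loop described above. It is not the actual test program: the function name, the block size, the rewrite-from-scratch I/O pattern, and the use of a threading.Event as the stop request are all assumptions made for illustration.

```python
import os
import threading


def write_read_verify(path, stop_event, block_size=4096, max_rounds=None):
    """Write random data to `path`, read it back, and compare, in a loop.

    Hypothetical helper: the loop ends when `stop_event` is set or, if
    `max_rounds` is given, after that many successful rounds.
    Returns the number of rounds that verified correctly.
    """
    rounds = 0
    while not stop_event.is_set():
        if max_rounds is not None and rounds >= max_rounds:
            break
        data = os.urandom(block_size)
        # Assumption: each round rewrites the whole file from scratch.
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the data down to the volume
        with open(path, "rb") as f:
            read_back = f.read()
        if read_back != data:
            raise RuntimeError("verification failed in round %d" % rounds)
        rounds += 1
    return rounds
```

Pointing `path` at a file on the mounted disperse volume and calling `stop_event.set()` from another thread would stop the loop after the current round.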

While the test is still running, I replace a brick in the disperse volume.

The test results look as good as usual, with no errors after verification.

The [gluster volume heal info] command shows nothing for the new brick, but
lists the test file's name under the two unreplaced bricks.

Both unreplaced bricks also show a non-zero value for "Number of entries"
(the new brick does not).

I also found that the size of the test file on the new brick grows to the
same size as on the other bricks and then suddenly drops to zero.

After tracing with gdb, I found that while the file is being healed, another
heal operation triggers cluster_ftruncate in __ec_heal_trim_sinks, which
causes the file size to drop to zero.

Running getfattr -d -m. -e hex "filename" shows trusted.ec.dirty with a
non-zero value on the two unreplaced bricks, but no trusted.ec.dirty on the
new brick.
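
To make the hex output easier to read, a small sketch like the following can decode a trusted.ec.dirty value. It assumes the value is 16 bytes holding two 64-bit big-endian counters (data first, then metadata), which is my understanding of the EC xattr layout and should be checked against the ec xlator source. The function name is hypothetical.

```python
import struct


def decode_ec_dirty(hex_value):
    """Decode a trusted.ec.dirty value as printed by `getfattr -e hex`.

    Assumption: 16 bytes, two unsigned 64-bit big-endian counters,
    in the order (data, metadata).
    """
    text = hex_value.strip()
    if text.lower().startswith("0x"):
        text = text[2:]
    raw = bytes.fromhex(text)
    if len(raw) != 16:
        raise ValueError("expected 16 bytes, got %d" % len(raw))
    data_dirty, metadata_dirty = struct.unpack(">QQ", raw)
    return data_dirty, metadata_dirty
```

Under that assumption, a result of (0, 0) would correspond to a file with no pending heal marker on that brick.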

I'm confused about a few things.

1. Why is another heal triggered after the previous heal is complete
(ec_heal_report shows success)?

    I tried reducing disperse.background-heals to 1 or
disperse.heal-wait-qlength to 0, but it did not help.

2. After I stop the write-read-verify test, the file size on each brick
becomes the same, no file name is shown by the [gluster volume heal info]
command, and all trusted.ec.dirty values become 0.

    How does the I/O activity affect the result of healing?

If the healing never finishes and another brick fails, the volume will fail.

Any ideas please?

