[Gluster-Maintainers] Geo-replication infra failures

Amar Tumballi atumball at redhat.com
Thu Jan 18 06:24:02 UTC 2018


Adding the maintainers ML here, as the concerns raised are very important.

On Thu, Jan 18, 2018 at 11:47 AM, Kotresh Hiremath Ravishankar <
khiremat at redhat.com> wrote:

> Hi Nigel,
>
> I debugged the issue and found the root cause. It is indeed a setup issue:
> two gluster binaries are present on this machine,
> one at /usr/local/sbin/gluster and one at /build/install/sbin/gluster.
> Geo-rep is failing because of a gluster version mismatch between master
> and slave. It finds "/usr/local/sbin/gluster" on the master and
> "/build/install/sbin/gluster" on the slave (when run via ssh).
>
>
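The mismatch described above can be checked with a minimal sketch like the following. It assumes the master and the slave are the same host reached over ssh (as with `ssh root at 127.0.0.1` later in this thread), so the slave-side lookup can be emulated by walking the PATH that a non-interactive ssh command reports. The function names `find_executables`, `remote_path` and `check_binary` are hypothetical and are not part of gluster or of the geo-rep patch.

```python
import os
import subprocess


def find_executables(name, path_value):
    """Return every executable called `name` found in the directories
    of a PATH-style string, in lookup order."""
    matches = []
    for directory in path_value.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            if candidate not in matches:
                matches.append(candidate)
    return matches


def remote_path(host):
    """Return PATH as seen by a non-interactive ssh command.
    Assumes key-based login to `host` already works."""
    result = subprocess.run(
        ["ssh", host, "echo $PATH"],  # $PATH is expanded by the remote shell
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def check_binary(name, local_path, ssh_path):
    """Compare which `name` binary is resolved locally and over ssh.
    Only meaningful when both PATH strings refer to the same filesystem."""
    local = find_executables(name, local_path)
    remote = find_executables(name, ssh_path)
    return {
        "local": local,
        "ssh": remote,
        # Same first hit on both sides means the same binary is picked up.
        "consistent": bool(local) and bool(remote) and local[0] == remote[0],
        # More than one distinct binary anywhere is worth cleaning up.
        "duplicates": len(set(local) | set(remote)) > 1,
    }
```

On a regression machine this could be called as `check_binary("gluster", os.environ["PATH"], remote_path("root@127.0.0.1"))`. A result with `consistent` false or `duplicates` true would point at the kind of stray install discussed here.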
Thanks for finding this, Kotresh. Very helpful.


> This is scary! On machines like this, we don't even know whether
> our patches are being tested properly, as the tests might be
> using the wrong gluster binary ("/usr/local/sbin/gluster") when regression
> tests should use "/build/install/sbin/gluster".
>
> Maybe it is the result of developers taking machines for debugging,
> pulling their own copy of the gluster source, and installing
> it normally without using "/opt/qa/build.sh". We need to address this in
> some way.
>
>
The above is very concerning, and scary for sure. I guess that by moving to
chunked regression, where we don't get machines but only the logs, we may be
in a better state, as every time a machine comes up it will start from a fresh
instance.

Everyone using the regression machines until we have the new setup, please make
sure you clean up your stuff.

> I have sent a geo-rep patch [1] which throws more specific error
> messages in these scenarios.
> [1] https://review.gluster.org/19224
>
> Thanks,
> Kotresh HR
>
> On Wed, Jan 17, 2018 at 8:57 PM, Nigel Babu <nigelb at redhat.com> wrote:
>
>> I've granted you access to slave23. You should be able to SSH in as
>> jenkins at slave23.cloud.gluster.org
>>
>> On Wed, Jan 17, 2018 at 8:21 PM, Kotresh Hiremath Ravishankar <
>> khiremat at redhat.com> wrote:
>>
>>> If it's happening consistently, give me the machine. I will root-cause
>>> it.
>>>
>>> On 17 Jan 2018 7:33 pm, "Nigel Babu" <nigelb at redhat.com> wrote:
>>>
>>>> Hi Kotresh,
>>>>
>>>> I can reliably reproduce the geo-rep failure on a machine where `ssh
>>>> root at 127.0.0.1` works. What are the next steps I can take to debug
>>>> this? I can also provide you with access if you'd like.
>>>>
>>>> --
>>>> nigelb
>>>>
>>>
>>
>>
>> --
>> nigelb
>>
>
>
>
> --
> Thanks and Regards,
> Kotresh H R
>



-- 
Amar Tumballi (amarts)