[Gluster-infra] [Gluster-devel] Jenkins Issues this weekend and how we're solving them

Mon Feb 19 12:37:50 UTC 2018

On Mon, Feb 19, 2018 at 5:58 PM, Nithya Balachandran
<nbalacha at redhat.com> wrote:
>
>
> On 19 February 2018 at 13:12, Atin Mukherjee <amukherj at redhat.com> wrote:
>>
>>
>>
>> On Mon, Feb 19, 2018 at 8:53 AM, Nigel Babu <nigelb at redhat.com> wrote:
>>>
>>> Hello,
>>>
>>> As you all most likely know, we store a tarball of the binaries and the
>>> core whenever a regression run dumps core. Occasionally we introduce a
>>> bug in Gluster that makes this tarball take up a lot of space. This happened
>>> recently with the brick multiplex tests: the build-install tar took up 25G,
>>> causing the machine to run out of space and fail continuously.
>>
>>
>> AFAIK, we don't have a .t file in the upstream regression suite where hundreds
>> of volumes are created. At that scale, with brick multiplexing enabled, I
>> can understand that the core would be very large and might consume this
>> much space. FWIW, can we first try to figure out which test caused this
>> crash, and check whether running gcore after certain steps in that test
>> leaves us with a core file of similar size? IOW, have we actually seen
>> core files this large before? If not, we need to investigate what changed
>> to make us start seeing them now.
>
>
> We also need to check whether it is only the core file that is causing the
> increase in size, or whether something else is taking up a lot of space.
>>
>>
>>>
>>>
>>> I've made some changes this morning. Right after we create the tarball,
>>> we'll delete all files in /archive that are greater than 1G. Please be aware
>>> that this means all large files including the newly created tarball will be
>>> deleted. You will have to work with the traceback on the Jenkins job.
>>
>>
>> We really need to first investigate the average size of the core file
>> we get when a system is running with brick multiplexing and ongoing I/O.
>> Without that, immediately deleting core files > 1G will cause trouble for
>> developers debugging genuine crashes, as the traceback alone may not be
>> sufficient.
>>

I'd like to echo what Nithya writes: instead of treating this
incident as an outlier, we might want to do further analysis. If this
had happened on a production system, there would be blood.
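
As a starting point for that kind of analysis, here is a minimal sketch in
Python. It is not the script that runs on the Jenkins machines. It assumes
the archives live under /archive, as described in the quoted text, and that
"1G" means 1 GiB. The names survey and prune, and the dry-run default, are
hypothetical and chosen only for illustration. The idea is to list what is
actually consuming space before anything is deleted.

```python
import os
import sys
from pathlib import Path

ONE_GIB = 1024 ** 3  # assumption: "1G" in the thread means 1 GiB


def survey(root):
    """Return (size_in_bytes, path) for every regular file under root, largest first."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                # Skip symlinks and anything that is not a regular file.
                if path.is_symlink() or not path.is_file():
                    continue
                entries.append((path.stat().st_size, path))
            except OSError:
                continue  # the file disappeared or is unreadable; ignore it
    entries.sort(key=lambda entry: entry[0], reverse=True)
    return entries


def prune(root, limit=ONE_GIB, dry_run=True):
    """Return files strictly larger than limit; delete them only if dry_run is False."""
    oversized = [path for size, path in survey(root) if size > limit]
    if not dry_run:
        for path in oversized:
            path.unlink()
    return oversized


if __name__ == "__main__":
    # Hypothetical default path; pass another directory as the first argument.
    root = sys.argv[1] if len(sys.argv) > 1 else "/archive"
    for size, path in survey(root)[:20]:
        print(f"{size / ONE_GIB:8.2f} GiB  {path}")
    for path in prune(root):
        print(f"would delete: {path}")
```

Run with the default dry_run=True, this only reports the largest files and
what a 1G limit would remove, which would show whether the core file alone
accounts for the space or whether something else is growing as well.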