[Gluster-infra] New spurious regression

Thu Nov 5 12:56:16 UTC 2015

Raised a bug (https://bugzilla.redhat.com/show_bug.cgi?id=1278418) for 
this, and sent a fix (http://review.gluster.org/#/c/12516/) too. It 
would be great if you could also review the patch.
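
For anyone following along, here is a minimal sketch of the general idea behind avoiding this kind of race in a test: poll until the brick is reported as started, with a timeout, before issuing the dependent command. This is only an illustration and not necessarily what the patch above does. The names `wait_until` and `make_delayed_check` are hypothetical helpers, and the delayed check merely simulates a brick that comes up a little late.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool],
               timeout: float = 20.0,
               interval: float = 0.5) -> bool:
    """Poll check() until it returns True or the timeout expires.

    Returns True if the condition was met in time, False otherwise.
    The timeout and interval defaults are arbitrary illustrative values.
    """
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def make_delayed_check(delay: float) -> Callable[[], bool]:
    """Hypothetical stand-in for a 'brick is up' query.

    The returned function starts reporting True only after `delay`
    seconds, mimicking a brick whose status is registered late.
    """
    ready_at = time.monotonic() + delay
    return lambda: time.monotonic() >= ready_at
```

In a test, one would call something like `wait_until(brick_is_up, timeout=...)` before the clone step and fail the test only if it returns False, where `brick_is_up` is whatever status query the test framework provides (assumed here, not shown).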

Regards,
Avra

On 11/05/2015 06:01 PM, Avra Sengupta wrote:
> Hey Michael,
>
> Thanks, but I don't think that would be necessary anymore.
>
> Guys,
>
> I wrote a patch changing the brick status logs to INFO level 
> (http://review.gluster.org/#/c/12515/). Ironically, this patch too did 
> not fail regression on the first go, but did fail on the next iteration. 
> From what I see in the logs (given below), as I had suspected, the 
> brick connectivity happens a tad after the clone command is 
> executed. I don't know why this delay happens on the 
> regression setup (and not every time), and never locally. I can 
> think of various reasons for it (slower regression machines being 
> my prime suspect to begin with), but I can't say for sure. I will 
> raise a bug for this, and try to modify the testcase accordingly.
>
> Logs:
> [2015-11-05 11:25:15.103233] E [MSGID: 106122] 
> [glusterd-snapshot.c:2376:glusterd_snapshot_clone_prevalidate] 
> 0-management: Failed to pre validate
> *[2015-11-05 11:25:15.103265] E [MSGID: 106443] 
> [glusterd-snapshot.c:2398:glusterd_snapshot_clone_prevalidate] 
> 0-management: One or more bricks are not running. Please run snapshot 
> status command to see brick status.*
> Please start the stopped brick and then issue snapshot clone command
> [2015-11-05 11:25:15.103280] W [MSGID: 106443] 
> [glusterd-snapshot.c:8398:glusterd_snapshot_prevalidate] 0-management: 
> Snapshot clone pre-validation failed
> [2015-11-05 11:25:15.103294] W [MSGID: 106122] 
> [glusterd-mgmt.c:166:gd_mgmt_v3_pre_validate_fn] 0-management: 
> Snapshot Prevalidate Failed
> [2015-11-05 11:25:15.103305] E [MSGID: 106122] 
> [glusterd-mgmt.c:820:glusterd_mgmt_v3_pre_validate] 0-management: Pre 
> Validation failed for operation Snapshot on local node
> [2015-11-05 11:25:15.103315] E [MSGID: 106122] 
> [glusterd-mgmt.c:2166:glusterd_mgmt_v3_initiate_snap_phases] 
> 0-management: Pre Validation Failed
> [2015-11-05 11:25:15.103332] E [MSGID: 106027] 
> [glusterd-snapshot.c:7946:glusterd_snapshot_clone_postvalidate] 
> 0-management: unable to find clone clone1 volinfo
> [2015-11-05 11:25:15.103342] W [MSGID: 106444] 
> [glusterd-snapshot.c:8837:glusterd_snapshot_postvalidate] 
> 0-management: Snapshot create post-validation failed
> [2015-11-05 11:25:15.103352] W [MSGID: 106121] 
> [glusterd-mgmt.c:323:gd_mgmt_v3_post_validate_fn] 0-management: 
> postvalidate operation failed
> [2015-11-05 11:25:15.103362] E [MSGID: 106121] 
> [glusterd-mgmt.c:1585:glusterd_mgmt_v3_post_validate] 0-management: 
> Post Validation failed for operation Snapshot on local node
> [2015-11-05 11:25:15.103372] E [MSGID: 106122] 
> [glusterd-mgmt.c:2286:glusterd_mgmt_v3_initiate_snap_phases] 
> 0-management: Post Validation Failed
> [2015-11-05 11:25:15.109994]:++++++++++ 
> G_LOG:./tests/bugs/snapshot/bug-1275616.t: TEST: 42 42 149 
> snap_info_volume CLI Snaps Available patchy ++++++++++
> [2015-11-05 11:25:15.239358]:++++++++++ 
> G_LOG:./tests/bugs/snapshot/bug-1275616.t: TEST: 43 43 150 
> snap_config_volume CLI snap-max-hard-limit patchy ++++++++++
> [2015-11-05 11:25:15.378255]:++++++++++ 
> G_LOG:./tests/bugs/snapshot/bug-1275616.t: TEST: 45 45 200 
> snap_info_volume CLI Snaps Available clone1 ++++++++++
> [2015-11-05 11:25:15.501970] E [MSGID: 106027] 
> [glusterd-snapshot.c:3574:glusterd_snapshot_get_info_by_volume] 
> 0-management: Volume (clone1) does not exist [Invalid argument]
> [2015-11-05 11:25:15.502024] E [MSGID: 106027] 
> [glusterd-snapshot.c:3766:glusterd_handle_snapshot_info] 0-management: 
> Failed to get volume info of volume clone1 [Invalid argument]
> [2015-11-05 11:25:15.502061] W [MSGID: 106063] 
> [glusterd-snapshot.c:9082:glusterd_handle_snapshot_fn] 0-management: 
> Snapshot info failed
> [2015-11-05 11:25:15.510016]:++++++++++ 
> G_LOG:./tests/bugs/snapshot/bug-1275616.t: TEST: 46 46 200 
> snap_config_volume CLI snap-max-hard-limit clone1 ++++++++++
> [2015-11-05 11:25:15.639515] E [MSGID: 106060] 
> [glusterd-snapshot.c:438:snap_max_limits_display_commit] 0-management: 
> Volume (clone1) does not exist
> [2015-11-05 11:25:15.639543] E [MSGID: 106090] 
> [glusterd-snapshot.c:1446:glusterd_handle_snapshot_config] 
> 0-management: snap-max-limit display commit failed.
> [2015-11-05 11:25:15.639558] W [MSGID: 106045] 
> [glusterd-snapshot.c:9101:glusterd_handle_snapshot_fn] 0-management: 
> snapshot config failed
> *[2015-11-05 11:25:15.684746] I 
> [glusterd-utils.c:4883:glusterd_set_brick_status] 0-glusterd: Setting 
> brick 
> slave28.cloud.gluster.org:/var/run/gluster/snaps/7db8306c170541eb98c02633407bf625/brick1 
> status to started*
>
> Regards,
> Avra
>
> On 11/05/2015 05:07 PM, Michael Scherer wrote:
>> On Thursday, 05 November 2015 at 15:59 +0530, Avra Sengupta wrote:
>>> On 11/05/2015 03:57 PM, Avra Sengupta wrote:
>>>> On 11/05/2015 03:56 PM, Vijay Bellur wrote:
>>>>> On Thursday 05 November 2015 12:19 PM, Avra Sengupta wrote:
>>>>>> Hi,
>>>>>>
>>>>>> We investigated the logs of the regression failures that hit
>>>>>> this, and the following are our findings:
>>>>>> 1. snapshot clone failure is indeed the reason for the failure.
>>>>>> 2. snapshot clone has failed in pre-validation with the error that the
>>>>>> brick of snap3 is not up and running.
>>>>>> 3. snap3 was created, and subsequently started (because of
>>>>>> activate-on-create being enabled), long before we tried to create a
>>>>>> clone out of it.
>>>>>> 4. The snap3's brick shows no failure logs, and thereby gives us no
>>>>>> reason to believe that it did not start properly in the course of the
>>>>>> testcase.
>>>>>> 5. This leaves us with the assumption (it is an assumption because we
>>>>>> do not have any logs backing it) that there was some delay either in
>>>>>> the start of the brick process for snap3, or in glusterd registering
>>>>>> that it had started, and that the clone command was executed, and
>>>>>> failed, before either of these events could happen. This would make
>>>>>> it a race.
>>>>>>
>>>>>> Some other things to consider about the particular testcase:
>>>>>> 1. It did pass (and still passes consistently) on our local systems,
>>>>>> making the failure not reproducible locally.
>>>>>> 2. The patch was merged after both the linux and netbsd regressions
>>>>>> passed (in one go).
>>>>>> 3. The release 3.7 backport of the same patch has also passed both
>>>>>> the linux and netbsd regressions as of now.
>>>>>>
>>>>>> The rationale behind mentioning the above three points is that this
>>>>>> testcase has passed locally, as well as on the regression setups (not
>>>>>> just at the time of merge, but even now), which brings me back to the
>>>>>> assumption mentioned in point #5. To get more clarity on that
>>>>>> assumption, we need access to one of the regression setups, so that we
>>>>>> can try reproducing the failure in that environment and get some proof
>>>>>> of what is really happening.
>>>>>>
>>>>>> Vijay,
>>>>>>
>>>>>> Could you please provide us with a jenkins linux slave to perform the
>>>>>> above-mentioned validation?
>>>>>>
>>>>> Please send out a request on gluster-infra if not done so and Michael
>>>>> Scherer should be able to help.
>>>>>
>>>>> Thanks!
>>>>> Vijay
>>>>>
>>>> + Adding gluster-infra and Michael
>>>>
>>>> Could you please provide us with a jenkins linux slave to perform the
>>>> above-mentioned validation?
>> So you just want a single CentOS 6 gluster slave? Who needs access to it,
>> and for how long?
>>
>> Can you provide an ssh key so I can create a snapshot and give it to you?
>>
>>
>
