[Gluster-devel] DHT NULL pointer usage

Mon Nov 17 16:00:18 UTC 2014

On 11/17/2014 10:50 AM, Xavier Hernandez wrote:
> Hi Shyam,
>
> On 11/17/2014 03:50 PM, Shyam wrote:
>> On 11/17/2014 07:20 AM, Emmanuel Dreyfus wrote:
>>> Hello
>>>
>>> I have an almost reliable test that fails on NetBSD:
>>>
>>>> ./tests/basic/ec/quota.t                 (Wstat: 0 Tests: 22 Failed: 3)
>>>>    Failed tests:  19-21
>>>
>>> This one is a real bug: glusterfsd crashed because of a NULL pointer. I
>>> am going to submit a change to avoid touhcing postbuf is op_ret = -1 but
>>> if someone has a better idea, please let me know.
>>
>> This should be fine from a DHT perspective. The current tests check if
>> the file is under migration, using the postbuf attrs sticky and SGID
>> (for the file). If the FOP errore'd out then the code should look to
>> check if this is a dht_inode_missing error and then do a phase2 check to
>> see if the file migrated to a new target (based on a good/bad op_ret).
>>
>> Looking at the test, it seems that we hit a quota error on the write. In
>> which case following the file even if it migrated does not make sense.
>> We should have received an EDQUOT error as the op_errno (if I am not
>> mistaken) and exited this function at the check below,
>>
>> dht_writev_cbk:
>> 35:       if (op_ret == -1 && !dht_inode_missing(op_errno)) {
>> 36:                goto out;
>> 37:        }
>
> It's very strange. The failing test tries to write a file. There should
> be enough quota for the file to be written and it's not being migrated,
> but it seems that the brick is returning ENOENT (at least this is what
> ec_writev_cbk() is receiving). However all other tests writing to files
> seems to work ok, so I'm not sure what is special with this particular
> test.

Hmmm... need to check this out, will do as I get some cycles today or 
tomorrow, or, if we could understand where/why this errno is cropping 
from it would be good, as it may point to a different problem.

In this case ENOENT is very strange, as the fd is open, so how did the 
entry disappear? or, is this some form of open behind in xlators below 
DHT (wild guess)?

The handling of this error, should trigger P2 migration check in DHT, so 
the current fix @ http://review.gluster.org/#/c/9139/ is kind of invalid.

NOTE: P1/2 PHASE1 and PHASE2 related checks in DHT

I think we need this fixed in all P2 (and P1) checks in DHT to treat it 
appropriately. If no postbuf _but_ error is ENOENT or ESTALE, then we 
need to enter P2 checks in DHT. P1 depends on no error and a valid 
postbuf to detect migration. This is irrespective of the problem with 
the errno as discussed above and should be fixed as mentioned in the 
original mail.

Xavi/Manu, let me know if I/we need to handle this in DHT, or if you 
would be comfortable posting the patch.

>
>>
>> Are we not sending an EDQUOT upwards in this case?
>>
>> Validating assumptions:
>> - As this is an fd based operation, the fd should have been open, and so
>> why did the writev fail exactly? EDQUOT?
>
> The op_errno is 2 (ENOENT)
>
>> - Was this an NFS based test, which could explain the use of anon-fds
>> which would return an error if the file is no longer on the backend. Is
>> this a test run from an NFS mount point?
>
> It's a normal FUSE mount.

Thank you.

Shyam