[Bugs] [Bug 1676400] rm -rf fails with "Directory not empty"

bugzilla at redhat.com bugzilla at redhat.com
Tue Feb 12 08:08:31 UTC 2019


https://bugzilla.redhat.com/show_bug.cgi?id=1676400



--- Comment #1 from Nithya Balachandran <nbalacha at redhat.com> ---
RCA for the invisible directory left behind with concurrent rm -rf:
-------------------------------------------------------------------


dht_selfheal_dir_mkdir_lookup_cbk (...) {
...

1381         this_call_cnt = dht_frame_return (frame);
1382
1383         LOCK (&frame->lock);
1384         {
1385                 if ((op_ret < 0) &&
1386                     (op_errno == ENOENT || op_errno == ESTALE)) {
1387                         local->selfheal.hole_cnt = !local->selfheal.hole_cnt ? 1
1388                                                 : local->selfheal.hole_cnt + 1;
1389                 }
1390
1391                 if (!op_ret) {
1392                         dht_iatt_merge (this, &local->stbuf, stbuf, prev);
1393                 }
1394                 check_mds = dht_dict_get_array (xattr, conf->mds_xattr_key,
1395                                                 mds_xattr_val, 1, &errst);
1396                 if (dict_get (xattr, conf->mds_xattr_key) && check_mds && !errst) {
1397                         dict_unref (local->xattr);
1398                         local->xattr = dict_ref (xattr);
1399                 }
1400
1401         }
1402         UNLOCK (&frame->lock);
1403
1404         if (is_last_call (this_call_cnt)) {
1405                 if (local->selfheal.hole_cnt == layout->cnt) {
1406                         gf_msg_debug (this->name, op_errno,
1407                                       "Lookup failed, an rmdir could have "
1408                                       "deleted this entry %s", loc->name);
1409                         local->op_errno = op_errno;
1410                         goto err;
1411                 } else {
1412                         for (i = 0; i < layout->cnt; i++) {
1413                                 if (layout->list[i].err == ENOENT ||
1414                                     layout->list[i].err == ESTALE ||
1415                                     local->selfheal.force_mkdir)
1416                                         missing_dirs++;
1417                         }




There are 2 problems here:

1. The layout is not updated with the new subvol status on error.

In this case, the initial lookup found the directory on the hashed subvol, so
only 2 entries in the layout indicate missing directories. However, by the time
the selfheal code is executed, the racing rmdir has deleted the directory from
all the subvols. At this point the directory does not exist on any subvol, and
dht_selfheal_dir_mkdir_lookup_cbk gets an error from all 3 subvols. This new
status is not recorded in the layout, which still has only 2 missing dirs
marked.
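
As a rough illustration of problem 1, the sketch below records each subvol's
latest lookup result in a layout entry, so that a later count of missing
directories reflects the current state. It is a simplified model, not the
actual DHT code or the eventual fix: struct layout_entry, struct layout,
layout_record_lookup and layout_count_missing are hypothetical names invented
for this example.

```c
#include <errno.h>

#define SUBVOL_CNT 3

/* Hypothetical, simplified stand-in for one entry of a DHT layout. */
struct layout_entry {
    int err; /* 0 if the directory exists on this subvol, else an errno */
};

/* Hypothetical, simplified stand-in for the layout itself. */
struct layout {
    int cnt;
    struct layout_entry list[SUBVOL_CNT];
};

/* Store the result of the most recent lookup on subvol idx in the layout,
 * so the entry no longer reflects only the initial lookup. */
static void layout_record_lookup(struct layout *layout, int idx,
                                 int op_ret, int op_errno)
{
    if (idx < 0 || idx >= layout->cnt)
        return;
    layout->list[idx].err = (op_ret < 0) ? op_errno : 0;
}

/* Count the subvols on which the directory is currently marked missing. */
static int layout_count_missing(const struct layout *layout)
{
    int missing = 0;

    for (int i = 0; i < layout->cnt; i++) {
        if (layout->list[i].err == ENOENT ||
            layout->list[i].err == ESTALE)
            missing++;
    }
    return missing;
}
```

With a layout that starts with 2 missing entries, recording an ENOENT for the
hashed subvol as well brings the count to 3, which matches what the racing
rmdir has actually left on disk.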


2. this_call_cnt = dht_frame_return (frame); is called before the response is
processed. So with a call cnt of 3, it is possible that the second response has
reached line 1404 before the third one has started processing its return
values. At this point local->selfheal.hole_cnt != layout->cnt, so control goes
to line 1412.

At line 1412, since we are still using the old layout, only the directories on
the non-hashed subvols are counted in missing_dirs and considered for healing.
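
One way to picture the ordering issue in problem 2 is the sketch below, which
updates the shared counters and decrements the outstanding call count inside
the same critical section. Whichever caller sees the count reach 0 is then
guaranteed to see every other response's update. This is a minimal model using
a pthread mutex, not the GlusterFS frame API: struct heal_state,
heal_state_init and heal_state_on_reply are hypothetical names, and this is
not necessarily how the bug was fixed.

```c
#include <errno.h>
#include <pthread.h>

/* Hypothetical shared state for one selfheal lookup fan-out. */
struct heal_state {
    pthread_mutex_t lock;
    int call_cnt; /* responses still outstanding */
    int hole_cnt; /* responses that reported ENOENT or ESTALE */
};

static void heal_state_init(struct heal_state *s, int call_cnt)
{
    pthread_mutex_init(&s->lock, NULL);
    s->call_cnt = call_cnt;
    s->hole_cnt = 0;
}

/* Process one response. The counters are updated first and the call count
 * is decremented last, both under the same lock. Returns the number of
 * responses still outstanding; 0 means this caller is the last one. */
static int heal_state_on_reply(struct heal_state *s, int op_ret, int op_errno)
{
    int remaining;

    pthread_mutex_lock(&s->lock);
    if (op_ret < 0 && (op_errno == ENOENT || op_errno == ESTALE))
        s->hole_cnt++;
    remaining = --s->call_cnt;
    pthread_mutex_unlock(&s->lock);

    return remaining;
}
```

In this model, a caller that gets 0 back can compare hole_cnt against the
number of subvols knowing that no earlier response is still waiting to update
it.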


The combination of these two causes the selfheal to start healing the
directories on the non-hashed subvols. It succeeds in creating the dirs on the
non-hashed subvols. However, to set the layout, dht takes an inodelk on the
hashed subvol, which fails because the directory does not exist there. We
therefore end up with directories on the non-hashed subvols with no layouts
set.
