[Gluster-users] No space left on device (when there is actually lots of free space)

Kali Hernandez kali at thenetcircle.com
Tue Apr 6 04:07:53 UTC 2010


Hi all,

We are running glusterfs 3.0.3, installed from the RHEL RPMs, over 30 nodes 
(physical machines, not VMs). Our config pairs the machines in twos under the 
replicate translator as mirrors, and aggregates the 15 resulting mirrors 
under the stripe translator. We previously used distribute instead, but we 
had the same problem.
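
For clarity, the resulting translator graph (condensed from the full 
volfiles at the end of this mail) looks like this:

   stripe (cluster/stripe, 4 MB block size)
      mirror1  = replicate(rsid-a-10, rsid-a-20)
      mirror2  = replicate(rsid-a-11, rsid-a-21)
      ...
      mirror10 = replicate(rsid-a-19, rsid-a-29)
      mirror11 = replicate(rsid-a-35, rsid-a-45)
      ...
      mirror15 = replicate(rsid-a-39, rsid-a-49)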

We are copying (with cp) a large number of files that all reside in the 
same directory, and I have been monitoring the whole copy process to see 
where the failure starts.

In the middle of the copy process we get this error:

cp: cannot create regular file 
`/mnt/gluster_new/videos/1251512-3CA86758640A31E7770EBC7629AEC10F.mpg': 
No space left on device
cp: cannot create regular file 
`/mnt/gluster_new/videos/1758650-3AF69C6B7FDAC0A40D85EABA8C85490D.mswmm': No 
space left on device
cp: cannot create regular file 
`/mnt/gluster_new/videos/179183-A018B5FBE6DCCF04A3BB99C814CD9EAB.wmv': 
No space left on device
cp: cannot create regular file 
`/mnt/gluster_new/videos/2448602-568B1ACF53675DC762485F2B26539E0D.wmv': 
No space left on device
cp: cannot create regular file 
`/mnt/gluster_new/videos/626249-7B7FFFE0B9C56E9BE5733409CB73BCDF_300.jpg': 
No space left on device
cp: cannot create regular file 
`/mnt/gluster_new/videos/1962299-B7CDFF12FB1AD41DF3660BF0C7045CBC.avi': 
No space left on device

(hundreds of times)

When I look at the storage distribution across the backend filesystems, I 
can see this (columns as in df -h):

           Size  Used Avail Use% Mounted on
node 10    37G   14G   23G  38% /glusterfs_storage
node 11    37G   14G   23G  37% /glusterfs_storage
node 12    37G   14G   23G  37% /glusterfs_storage
node 13    37G   14G   23G  37% /glusterfs_storage
node 14    37G   13G   24G  36% /glusterfs_storage
node 15    37G   13G   24G  36% /glusterfs_storage
node 16    37G   13G   24G  35% /glusterfs_storage
node 17    49G   12G   36G  26% /glusterfs_storage
node 18    37G   12G   25G  33% /glusterfs_storage
node 19    37G   12G   25G  33% /glusterfs_storage
node 20    37G   14G   23G  38% /glusterfs_storage
node 21    37G   14G   23G  37% /glusterfs_storage
node 22    37G   14G   23G  37% /glusterfs_storage
node 23    37G   14G   23G  37% /glusterfs_storage
node 24    37G   13G   24G  36% /glusterfs_storage
node 25    37G   13G   24G  36% /glusterfs_storage
node 26    37G   13G   24G  35% /glusterfs_storage
node 27    49G   12G   36G  26% /glusterfs_storage
node 28    37G   12G   25G  33% /glusterfs_storage
node 29    37G   12G   25G  33% /glusterfs_storage
node 35    40G   40G     0 100% /glusterfs_storage
node 36    40G   22G   18G  56% /glusterfs_storage
node 37    40G   18G   22G  45% /glusterfs_storage
node 38    40G   16G   24G  40% /glusterfs_storage
node 39    40G   15G   25G  37% /glusterfs_storage
node 45    40G   40G     0 100% /glusterfs_storage
node 46    40G   22G   18G  56% /glusterfs_storage
node 47    40G   18G   22G  45% /glusterfs_storage
node 48    40G   16G   24G  40% /glusterfs_storage
node 49    40G   15G   25G  37% /glusterfs_storage

(node mirror pairings are 10-19 paired to 20-29, and 35-39 to 45-49)
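
For reference, the table above was gathered with a loop roughly like this 
(the node hostnames and passwordless SSH are particulars of our setup):

   # poll every backend brick for its local disk usage
   for n in $(seq 10 29) $(seq 35 39) $(seq 45 49); do
      ssh node$n "df -h /glusterfs_storage | tail -n 1" \
         | awk -v n="$n" '{ print "node " n, $2, $3, $4, $5, $6 }'
   done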


As you can see, the distribution of space over the cluster is more or less 
even across most of the nodes, except for the node pair 35/45, which has 
run out of space. As a result, every time I try to copy more data onto the 
cluster, I run into the "no space left on device" error mentioned above.

From the mount point's point of view, the gluster free space looks like this:

Filesystem                        1M-blocks    Used Available Use% Mounted on
[...]
/etc/glusterfs/glusterfs.vol.new     586617  240197    340871  42% /mnt/gluster_new


So basically, I get "no space left on device" errors when there are around 
340 GB free on the cluster.
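
(That total is consistent with the backend sizes: the usable capacity is 
the sum over the 15 mirrors of each pair's disk size, i.e. 9x37G + 49G + 
5x40G = 582G, which roughly matches the ~573 GB that df derives from 
586617 1M-blocks.)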


We tried the distribute translator instead of stripe; in fact, that was our 
first setup. We thought the problem might be that we start copying a big 
file (we usually store really big .tar.gz backups here) and the target node 
runs out of space partway through, so we switched to stripe, since in 
theory glusterfs would then place the next block of the file on another 
node. But in both cases (distribute and stripe) we run into the same problem.
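
For reference, the distribute variant of the final volume looked roughly 
like the sketch below (reconstructed from memory). Note that we never set 
the min-free-disk option, which as far as I understand cluster/distribute 
supports for steering new files away from nearly-full subvolumes; it is 
included here only as an illustration:

volume distribute
   type cluster/distribute
   # assumption: we never actually set this option ourselves
   option min-free-disk 10%
   subvolumes mirror1 mirror2 mirror3 mirror4 mirror5 mirror6 mirror7 mirror8 mirror9 mirror10 mirror11 mirror12 mirror13 mirror14 mirror15
end-volume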

So I am wondering: is this a problem with a maximum number of files in a 
single directory or filesystem, or something else?
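
One thing I have not ruled out myself is inode exhaustion, which would also 
produce "no space left on device" even when df shows free blocks; it should 
be easy to check on the full pair:

   # check free inodes on the full mirror pair (hostnames as above)
   ssh node35 df -i /glusterfs_storage
   ssh node45 df -i /glusterfs_storage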


Any ideas on this issue?



Our config is as follows:

Each node has:

--------------
volume posix
   type storage/posix
   option directory /glusterfs_storage
end-volume

volume locks
   type features/posix-locks
   subvolumes posix
end-volume

volume server
   type protocol/server
   option transport-type tcp
   option auth.addr.locks.allow 10.20.0.*
   subvolumes locks
end-volume
-------------
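
The server process on each node is started with something along these 
lines (the volfile path is our local convention, not a default):

   # start the brick server from the volfile above (path assumed)
   glusterfsd -f /etc/glusterfs/glusterfsd.vol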

And the mount client has:
=======

##### Old blades (37 GB each, except rsid-a-17 and rsid-a-27, which have 49 GB)

volume rsid-a-10
   type protocol/client
   option transport-type tcp
   option remote-host 10.20.0.150
   option remote-subvolume locks
end-volume

volume rsid-a-11
   type protocol/client
   option transport-type tcp
   option remote-host 10.20.0.151
   option remote-subvolume locks
end-volume

volume rsid-a-12
   type protocol/client
   option transport-type tcp
   option remote-host 10.20.0.152
   option remote-subvolume locks
end-volume

volume rsid-a-13
   type protocol/client
   option transport-type tcp
   option remote-host 10.20.0.153
   option remote-subvolume locks
end-volume

volume rsid-a-14
   type protocol/client
   option transport-type tcp
   option remote-host 10.20.0.154
   option remote-subvolume locks
end-volume

volume rsid-a-15
   type protocol/client
   option transport-type tcp
   option remote-host 10.20.0.155
   option remote-subvolume locks
end-volume

volume rsid-a-16
   type protocol/client
   option transport-type tcp
   option remote-host 10.20.0.156
   option remote-subvolume locks
end-volume

volume rsid-a-17
   type protocol/client
   option transport-type tcp
   option remote-host 10.20.0.157
   option remote-subvolume locks
end-volume

volume rsid-a-18
   type protocol/client
   option transport-type tcp
   option remote-host 10.20.0.158
   option remote-subvolume locks
end-volume

volume rsid-a-19
   type protocol/client
   option transport-type tcp
   option remote-host 10.20.0.159
   option remote-subvolume locks
end-volume

volume rsid-a-20
   type protocol/client
   option transport-type tcp
   option remote-host 10.20.0.160
   option remote-subvolume locks
end-volume

volume rsid-a-21
   type protocol/client
   option transport-type tcp
   option remote-host 10.20.0.161
   option remote-subvolume locks
end-volume

volume rsid-a-22
   type protocol/client
   option transport-type tcp
   option remote-host 10.20.0.162
   option remote-subvolume locks
end-volume

volume rsid-a-23
   type protocol/client
   option transport-type tcp
   option remote-host 10.20.0.163
   option remote-subvolume locks
end-volume

volume rsid-a-24
   type protocol/client
   option transport-type tcp
   option remote-host 10.20.0.164
   option remote-subvolume locks
end-volume

volume rsid-a-25
   type protocol/client
   option transport-type tcp
   option remote-host 10.20.0.165
   option remote-subvolume locks
end-volume

volume rsid-a-26
   type protocol/client
   option transport-type tcp
   option remote-host 10.20.0.166
   option remote-subvolume locks
end-volume

volume rsid-a-27
   type protocol/client
   option transport-type tcp
   option remote-host 10.20.0.167
   option remote-subvolume locks
end-volume

volume rsid-a-28
   type protocol/client
   option transport-type tcp
   option remote-host 10.20.0.168
   option remote-subvolume locks
end-volume

volume rsid-a-29
   type protocol/client
   option transport-type tcp
   option remote-host 10.20.0.169
   option remote-subvolume locks
end-volume

##### New blades (40 GB each)

volume rsid-a-35
   type protocol/client
   option transport-type tcp
   option remote-host 10.20.0.180
   option remote-subvolume locks
end-volume

volume rsid-a-36
   type protocol/client
   option transport-type tcp
   option remote-host 10.20.0.181
   option remote-subvolume locks
end-volume

volume rsid-a-37
   type protocol/client
   option transport-type tcp
   option remote-host 10.20.0.182
   option remote-subvolume locks
end-volume

volume rsid-a-38
   type protocol/client
   option transport-type tcp
   option remote-host 10.20.0.183
   option remote-subvolume locks
end-volume

volume rsid-a-39
   type protocol/client
   option transport-type tcp
   option remote-host 10.20.0.184
   option remote-subvolume locks
end-volume


volume rsid-a-45
   type protocol/client
   option transport-type tcp
   option remote-host 10.20.0.190
   option remote-subvolume locks
end-volume

volume rsid-a-46
   type protocol/client
   option transport-type tcp
   option remote-host 10.20.0.191
   option remote-subvolume locks
end-volume


volume rsid-a-47
   type protocol/client
   option transport-type tcp
   option remote-host 10.20.0.192
   option remote-subvolume locks
end-volume

volume rsid-a-48
   type protocol/client
   option transport-type tcp
   option remote-host 10.20.0.193
   option remote-subvolume locks
end-volume

volume rsid-a-49
   type protocol/client
   option transport-type tcp
   option remote-host 10.20.0.194
   option remote-subvolume locks
end-volume


### mirroring blade volumes, 1x <---> 2x and 3x <---> 4x


volume mirror1
   type cluster/replicate
   subvolumes rsid-a-10 rsid-a-20
end-volume

volume mirror2
   type cluster/replicate
   subvolumes rsid-a-11 rsid-a-21
end-volume

volume mirror3
   type cluster/replicate
   subvolumes rsid-a-12 rsid-a-22
end-volume

volume mirror4
   type cluster/replicate
   subvolumes rsid-a-13 rsid-a-23
end-volume

volume mirror5
   type cluster/replicate
   subvolumes rsid-a-14 rsid-a-24
end-volume

volume mirror6
   type cluster/replicate
   subvolumes rsid-a-15 rsid-a-25
end-volume

volume mirror7
   type cluster/replicate
   subvolumes rsid-a-16 rsid-a-26
end-volume

volume mirror8
   type cluster/replicate
   subvolumes rsid-a-17 rsid-a-27
end-volume

volume mirror9
   type cluster/replicate
   subvolumes rsid-a-18 rsid-a-28
end-volume

volume mirror10
   type cluster/replicate
   subvolumes rsid-a-19 rsid-a-29
end-volume

volume mirror11
   type cluster/replicate
   subvolumes rsid-a-35 rsid-a-45
end-volume

volume mirror12
   type cluster/replicate
   subvolumes rsid-a-36 rsid-a-46
end-volume

volume mirror13
   type cluster/replicate
   subvolumes rsid-a-37 rsid-a-47
end-volume

volume mirror14
   type cluster/replicate
   subvolumes rsid-a-38 rsid-a-48
end-volume

volume mirror15
   type cluster/replicate
   subvolumes rsid-a-39 rsid-a-49
end-volume

### final volume, striped mirrors: 15x2 blades. 4 MB block size so that
### small files fit completely in a single storage node
### (currently only new blades are mirrored)

volume stripe
   type cluster/stripe
   option block-size 4MB
   subvolumes mirror1 mirror2 mirror3 mirror4 mirror5 mirror6 mirror7 mirror8 mirror9 mirror10 mirror11 mirror12 mirror13 mirror14 mirror15
end-volume

volume writebehind
   type performance/write-behind
   option cache-size 4MB
   option disable-for-first-nbytes 128KB
   subvolumes stripe
end-volume

volume iocache
   type performance/io-cache
   subvolumes writebehind
   option cache-size 4MB
   option cache-timeout 5
end-volume
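
The client graph is then mounted from this volfile, matching the filesystem 
name shown by df earlier:

   # mount the client volfile at the mountpoint (paths from our setup)
   glusterfs -f /etc/glusterfs/glusterfs.vol.new /mnt/gluster_new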





