[Gluster-users] Configuring legacy Gluster NFS
Strahil Nikolov
hunter86_bg at yahoo.com
Mon May 25 14:13:52 UTC 2020
On 25 May 2020 at 5:49:00 GMT+03:00, Olivier <Olivier.Nicole at cs.ait.ac.th> wrote:
>Strahil Nikolov <hunter86_bg at yahoo.com> writes:
>
>> On May 23, 2020 7:29:23 AM GMT+03:00, Olivier
>> <Olivier.Nicole at cs.ait.ac.th> wrote:
>>>Hi,
>>>
>>>I have been struggling with NFS Ganesha: one gluster node with
>>>Ganesha serving only one client could not handle the load when
>>>dealing with thousands of small files. Legacy gluster NFS works
>>>flawlessly with 5 or 6 clients.
>>>
>>>But the documentation for gNFS is scarce and I could not find where
>>>to configure the various authorizations, so any pointer is greatly
>>>welcome.
>>>
>>>Best regards,
>>>
>>>Olivier
>>
>> Hi Olivier,
>>
>> Can you give me a hint why you are using gluster with a single node
>> in the TSP serving only 1 client?
>> Usually, this is not a typical gluster workload.
>
>Hi Strahil,
>
>Of course I have more than one node; the other nodes are supporting the
>bricks and the data. I am using a node with no data to solve this issue
>with NFS. But in my comparison between gNFS and Ganesha, I was using the
>same configuration, with one node with no brick accessing the other
>nodes for the data. So the only change between what is working and what
>was not is the NFS server. Besides, I have been using NFS for over 15
>years and know that given my data and type of activity, one single NFS
>server should be able to serve 5 to 10 clients without a problem, which
>is why I suspected Ganesha from the beginning.
You are not comparing apples to apples. Kernel NFS has been around since the early UNIXes and long predates modern OSes; Linux has shipped it for decades and the kernel has been optimized for it, while Ganesha is newer technology running in user space and requires some tuning.
You haven't mentioned what kind of issues you see - searching a directory, reading a lot of files, writing a lot of small files, etc.
Usually a negative lookup - searching for or accessing a non-existent object (file, dir, etc.) - causes serious performance degradation.
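If your Gluster release ships the predefined 'nl-cache' group file under /var/lib/glusterd/groups, enabling negative-lookup caching is one way to soften that (a sketch, to be tested against your own workload):

gluster volume set gv0 group nl-cache
# or the individual options, if your version has no group file:
gluster volume set gv0 performance.nl-cache on
gluster volume set gv0 performance.nl-cache-timeout 600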
>If I cannot configure gNFS, I think I could mount the volume with the
>glusterfs client and use the native NFS server of Linux, but that would
>add overhead and leave some features behind, which is why my focus is
>primarily on configuring gNFS.
>
>>
>> Also can you specify:
>> - Brick block device type and details (raid type, lvm, vdo, etc )
>
>All nodes are VMware virtual machines, the RAID being at VMware level
Yeah, that's not very descriptive.
For write-intensive and small-file workloads the optimal RAID mode is RAID 10 with at least 12 disks per node.
What is the I/O scheduler? Are you using thin or thick LVM? How many snapshots do you have?
Are you using striping at the LVM level (if you use local storage, then most probably no striping)?
What is the PE size of the VG?
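You can collect most of that in a few commands, for example (device and volume names are placeholders):

cat /sys/block/sdb/queue/scheduler                     # current I/O scheduler for the brick disk
lvs -o lv_name,lv_attr,pool_lv,stripes,stripe_size     # thin vs. thick LVs, and LVM striping
vgs -o vg_name,vg_extent_size                          # PE size of the VG
gluster snapshot list gv0                              # how many snapshots exist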
>> - xfs_info of the brick
What kind of FS are you using? You need to be sure that the inode size is at least 512 bytes (1024 for Swift) in order to be supported - the extra space holds Gluster's extended attributes.
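A quick way to check, assuming the bricks are on XFS (the device name in the mkfs example is a placeholder):

xfs_info /gluster1/br        # look for isize=512 (or larger) in the meta-data line
# a brick created with the old 256-byte default would have to be rebuilt, e.g.:
# mkfs.xfs -f -i size=512 /dev/<brick-device>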
>> - mount options for the brick
>
>Bricks are not mounted
It is not good to share one VMDK between the OS and the Gluster bricks - give each brick its own block device and mount it separately. You can then benefit from mount options like 'noatime,nodiratime,nobarrier,inode64'. Note that 'nobarrier' should only be used on storage with a battery- or flash-backed write cache.
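As a sketch, a dedicated brick filesystem in /etc/fstab could look like this (device and mount point are placeholders; add 'nobarrier' only if the write cache is protected and your kernel still accepts the option):

/dev/mapper/gluster_vg-gluster_lv  /gluster1  xfs  noatime,nodiratime,inode64  0 0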
>> - SELINUX/APPARMOR status
>> - sysctl tunables (including tuned profile)
>
>All systems are vanilla Ubuntu with no tuning.
I have done some tests and you can benefit from the rhgs random-IO tuned profile. The latest source RPM can be found at:
ftp://ftp.redhat.com/redhat/linux/enterprise/7Server/en/RHS/SRPMS/redhat-storage-server-3.5.0.0-1.el7rhgs.src.rpm
On top of that you need to modify the profile to disable LRO, which is automatically enabled on VMXNET NICs. LRO increases bandwidth but adds latency, and low latency is crucial when looking up thousands of files/directories.
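Something along these lines, assuming the rebuilt profile ends up installed as 'rhgs-random-io' and the VMXNET3 NIC is ens192 (both names are assumptions):

tuned-adm profile rhgs-random-io
ethtool -K ens192 lro off                        # disable LRO on the storage NIC
ethtool -k ens192 | grep large-receive-offload   # verify it is really off
# put the ethtool line in the tuned profile's script so it survives reboots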
>> - gluster volume information and status
>
>sudo gluster volume info gv0
>
>Volume Name: gv0
>Type: Distributed-Replicate
>Volume ID: cc664830-1dd0-4dd4-9f1c-493578297e79
>Status: Started
>Snapshot Count: 0
>Number of Bricks: 2 x 2 = 4
>Transport-type: tcp
>Bricks:
>Brick1: gluster3000:/gluster1/br
>Brick2: gluster5000:/gluster/br
>Brick3: gluster3000:/gluster2/br
>Brick4: gluster2000:/gluster/br
>Options Reconfigured:
>features.quota-deem-statfs: on
>features.inode-quota: on
>features.quota: on
>transport.address-family: inet
>nfs.disable: off
>features.cache-invalidation: on
>on@gluster3:~$ sudo gluster volume status gv0
>Status of volume: gv0
>Gluster process                               TCP Port  RDMA Port  Online  Pid
>------------------------------------------------------------------------------
>Brick gluster3000:/gluster1/br                49152     0          Y       1473
>Brick gluster5000:/gluster/br                 49152     0          Y       724
>Brick gluster3000:/gluster2/br                49153     0          Y       1549
>Brick gluster2000:/gluster/br                 49152     0          Y       723
>Self-heal Daemon on localhost                 N/A       N/A        Y       1571
>NFS Server on localhost                       N/A       N/A        N       N/A
>Quota Daemon on localhost                     N/A       N/A        Y       1560
>Self-heal Daemon on gluster2000.cs.ait.ac.th  N/A       N/A        Y       835
>NFS Server on gluster2000.cs.ait.ac.th        N/A       N/A        N       N/A
>Quota Daemon on gluster2000.cs.ait.ac.th      N/A       N/A        Y       735
>Self-heal Daemon on gluster5000.cs.ait.ac.th  N/A       N/A        Y       829
>NFS Server on gluster5000.cs.ait.ac.th        N/A       N/A        N       N/A
>Quota Daemon on gluster5000.cs.ait.ac.th      N/A       N/A        Y       736
>Self-heal Daemon on fbsd3500                  N/A       N/A        Y       2584
>NFS Server on fbsd3500                        2049      0          Y       2671
>Quota Daemon on fbsd3500                      N/A       N/A        Y       2571
>
>Task Status of Volume gv0
>------------------------------------------------------------------------------
>Task   : Rebalance
>ID     : 53e7c649-27f0-4da0-90dc-af59f937d01f
>Status : completed
You don't have any tunings on the volume, despite the predefined groups that ship in /var/lib/glusterd/groups.
Both the metadata-cache and nl-cache groups bring some performance improvement for small-file workloads. You have to try them and check the results - use a real-world workload for testing, as synthetic benchmarks do not always show the real picture.
In order to reset (revert) a setting you can use 'gluster volume reset gv0 <setting>'.
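For example (group contents and option names depend on the Gluster release, so treat this as a sketch):

ls /var/lib/glusterd/groups/                             # see which predefined groups your version ships
gluster volume set gv0 group metadata-cache              # apply a whole group in one step
gluster volume get gv0 all | grep -E 'cache|lookup'      # review what actually changed
gluster volume reset gv0 performance.md-cache-timeout    # revert a single option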
>> - ganesha settings
>
>MDCACHE
>{
>Attr_Expiration_Time = 600;
>Entries_HWMark = 50000;
>LRU_Run_Interval = 90;
>FD_HWMark_Percent = 60;
>FD_LWMark_Percent = 20;
>FD_Limit_Percent = 90;
>}
>EXPORT
>{
> Export_Id = 2;
> etc.
>}
>
>> - Network settings + MTU
>
>MTU 1500 (I think it is my switch that never worked with jumbo
>frames). I have a dedicated VLAN for NFS and gluster and a VLAN for
>user connections.
Verify that there is no fragmentation between the TSP nodes and between the NFS server (Ganesha) and the cluster.
For example, if the MTU is 1500, use a payload size of 1500 - 28 (ICMP + IP headers) = 1472:
ping -M do -s 1472 -c 4 -I <interface> <other gluster node>
Even the cheapest gigabit switches support jumbo frames of 9000 (anything above that needs more expensive hardware), so I would recommend verifying whether jumbo frames are possible at least between the TSP nodes and maybe the NFS server.
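For example, to test 9000-byte jumbo frames end to end (the interface name is a placeholder, and the switch ports must allow the larger MTU too):

ip link set dev ens192 mtu 9000
ping -M do -s 8972 -c 4 -I ens192 <other gluster node>   # 9000 - 28 bytes of headers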
>I hope that helps.
>
>Best regards,
>
>Olivier
>
>>
>> Best Regards,
>> Strahil Nikolov
>>
As you can see, you are getting deeper and deeper into this, and we haven't covered the storage stack yet, nor any Ganesha settings :)
Good luck!
Best Regards,
Strahil Nikolov