[Gluster-users] cascading errors (GlusterFS v3.7.x) - what's new about my issues?

Geoffrey Letessier geoffrey.letessier at cnrs.fr
Sat Aug 15 23:35:24 UTC 2015


Hi,

Since I upgraded GlusterFS from 3.5.3 to 3.7.x, trying to solve my quota miscalculations and poor performance (as advised by the user support team), we have been out of production for roughly 7 weeks because of the many v3.7.x issues we are hitting:

	- T-file apparition. I notice a lot of T-files (with permissions ---------T) located in my brick paths. Vijay explained to me that T-files appear when a rename is performed or when a brick is added/removed; but the problem is that, since I completely re-created the volume (with RAID initialization, etc.) and imported my data into it, I have renamed nothing and never added or removed any brick.
So why are these T-files present in my new volume? For example, for my /derreumaux_team directory, I count 13891 real files and 704 T-files in total across the brick paths…
How can I clean them up while avoiding side effects?

The first time I noticed this kind of file was after I set a quota below the real size of the path, which resulted in some quota explosions (quota daemon failure) and the appearance of T-files...
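
In case it helps others, here is how I count them on a brick; a rough sketch, where /export/brick_home is a placeholder for the real brick path, and which assumes (as I understand it) that DHT link files are zero-length files whose mode is exactly ---------T, carrying a trusted.glusterfs.dht.linkto xattr:

	# count suspected T-files on one brick, skipping Gluster's internal
	# .glusterfs tree (mode exactly 1000, i.e. ---------T, and empty)
	find /export/brick_home -path '*/.glusterfs' -prune -o \
	     -type f -perm 1000 -size 0 -print | wc -l

	# double-check that one of them really is a DHT link file
	getfattr -n trusted.glusterfs.dht.linkto /export/brick_home/derreumaux_team/<a_T_file>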

	- 7006 files in split-brain status after transferring my data back (30TB, 6.2M files) from a backup server into my freshly created volume. Thanks to Mathieu Chateau, who put me on the right road (GFID vs real file path), this problem has been fixed manually.
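
For the archives, the GFID-vs-path trick looks roughly like this (a sketch; /export/brick_home is a placeholder brick path):

	# read the GFID of a suspect file directly on a brick
	getfattr -n trusted.gfid -e hex /export/brick_home/path/to/file

	# a GFID of 0xaabbccdd... maps to a hard link kept under the brick's
	# .glusterfs tree, indexed by its first two bytes:
	ls -l /export/brick_home/.glusterfs/aa/bb/aabbccdd-....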

	- log issue. After creating only one file (35GB), I see more than 186000 new lines in the brick log files. I can stop them by setting brick-log-level to CRITICAL, but I suspect this issue severely impacts IO performance and throughput. Vijay told me he has fixed this problem in the code, but I apparently need to wait for the next release to take advantage of it… Very nice for production!

In fact, if I don't set brick-log-level to CRITICAL, I can fill my /var partition (10GB) in less than a day just by running some tests/benchmarks on the volume…
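
For reference, the workaround is a one-liner (assuming the volume is named vol_home; diagnostics.brick-log-level is the volume option behind the brick-log-level setting I mentioned):

	# lower brick log verbosity so /var stops filling up
	gluster volume set vol_home diagnostics.brick-log-level CRITICAL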

	- volume healing issue: slightly fewer than 14000 files were in a bad state (# gluster volume heal <vol_home> info) and forcing a new heal on my volume changed nothing. Thanks to Krutika and Pranith, this problem is now fixed.

	- du/df/stat/etc. hangs because of the RDMA protocol. This problem seems to no longer occur since I upgraded GlusterFS from v3.7.2 to v3.7.3. It was probably due to the brick crashes (a few minutes or a few days after [re]starting the volume) we had with the RDMA transport type. I noticed it only with v3.7.2.

	- quota problem: after successfully forcing a quota re-calculation (with a simple du for each defined quota), and after a couple of days with good values, the quota daemon failed again (some quota explosions, etc.)
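
The forced re-calculation itself was nothing more than walking each quota path from a client mount and comparing with what the quota feature reports; roughly (with /mnt/vol_home as a placeholder mount point and vol_home as the volume name):

	# re-walk the directory tree through a client mount
	du -s /mnt/vol_home/derreumaux_team

	# compare with the usage the quota feature reports
	gluster volume quota vol_home list /derreumaux_team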

	- a lot of warnings during TAR operations on replicated volumes, e.g.:
tar: linux-4.1-rc6/sound/soc/codecs/wm8962.c: file changed as we read it


	- low I/O performance and throughput:

		1- if I enable the quota feature, my IO throughput is divided by 2, so for the moment I have disabled it with the command shown after this list… (this only happens since I upgraded GlusterFS to 3.7.x)
		2- since I upgraded GlusterFS from 3.5.3 to 3.7.3, my I/O performance and throughput are lower than before, as you can read below (keeping in mind that the quota feature is disabled).
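
For reference, disabling it is just (assuming the volume is named vol_home):

	# turn the quota feature off entirely for the volume
	gluster volume quota vol_home disable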

IO operation tests with a Linux kernel archive (80MB tarball, ~53000 files, 550MB uncompressed); each column corresponds to an operation sketched below, after the first table:
------------------------------------------------------------------------
|                          PRODUCTION HARDWARE                         |
------------------------------------------------------------------------
|             |  UNTAR  |   DU   |  FIND   |  GREP  |   TAR   |   RM   |
------------------------------------------------------------------------
| native FS   |    ~16s |  ~0.1s |   ~0.1s |  ~0.1s |    ~24s |    ~3s |
------------------------------------------------------------------------
|                        GlusterFS version 3.5.3                       |
------------------------------------------------------------------------
| distributed |  ~2m57s |   ~23s |    ~22s |   ~49s |    ~50s |   ~54s |
------------------------------------------------------------------------
| dist-repl   | ~29m56s |  ~1m5s |   ~1m4s | ~1m32s |  ~1m31s | ~2m40s |
------------------------------------------------------------------------
|                        GlusterFS version 3.7.3                       |
------------------------------------------------------------------------
| distributed |  ~2m49s |   ~20s |    ~29s |   ~58s |    ~60s |   ~41s |
------------------------------------------------------------------------
| dist-repl   | ~28m24s |   ~51s |    ~37s | ~1m16s |  ~1m14s | ~1m17s |
------------------------------------------------------------------------
Notes:
	- distributed: 4 bricks (2 bricks on 2 servers)
	- dist-repl: 4 bricks (2 bricks on 2 servers) per replica, 2 replicas
	- native FS: each brick path directly (XFS)
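
For anyone who wants to reproduce the tables, each column corresponds to an operation of roughly this shape, timed on a client mount (a sketch; the mount point is a placeholder and the exact options may differ from what I actually ran):

	cd /mnt/vol_home/bench                                  # placeholder mount point
	time tar xzf linux-4.1-rc6.tar.gz                       # UNTAR
	time du -sh linux-4.1-rc6                               # DU
	time find linux-4.1-rc6 -type f | wc -l                 # FIND
	time grep -r MODULE_LICENSE linux-4.1-rc6 >/dev/null    # GREP
	time tar czf linux-test.tar.gz linux-4.1-rc6            # TAR
	time rm -rf linux-4.1-rc6                               # RM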

And the craziest thing is that I ran the same test on a crashtest storage cluster (2 old Dell servers, every brick a single 2TB 7.2k hard drive, 2 bricks per server) and its performance exceeds that of the production hardware (4 recent servers, 2 bricks each, each brick a 24TB RAID6 array with good LSI RAID controllers, 1 controller per brick):
------------------------------------------------------------------------
|                           CRASHTEST HARDWARE                         |
------------------------------------------------------------------------
|             |  UNTAR  |   DU   |  FIND   |  GREP  |   TAR   |   RM   |
------------------------------------------------------------------------
| native FS   |    ~19s |  ~0.2s |   ~0.1s |  ~1.2s |    ~29s |    ~2s |
------------------------------------------------------------------------
| single      |  ~3m45s |   ~43s |    ~47s |        |  ~3m10s | ~3m15s |
------------------------------------------------------------------------
| single v2*  |  ~3m24s |   ~13s |    ~33s | ~1m10s |    ~46s |   ~48s |
------------------------------------------------------------------------
| single NFS  | ~23m51s |    ~3s |     ~1s |   ~27s |    ~36s |   ~13s |
------------------------------------------------------------------------
| replicated  |  ~5m10s |   ~59s |   ~1m6s |        |  ~1m19s | ~1m49s |
------------------------------------------------------------------------
| distributed |  ~4m18s |   ~41s |    ~57s |        |  ~2m24s | ~1m38s |
------------------------------------------------------------------------
| dist-repl   |   ~7m1s |   ~19s |    ~31s | ~1m34s |  ~1m26s | ~2m11s |
------------------------------------------------------------------------
| FhGFS(dist) |  ~3m33s |   ~15s |     ~2s | ~1m31s |  ~1m31s |   ~52s |
------------------------------------------------------------------------
*: with default parameters


Concerning throughput (for both write and read operations) on the production hardware: it was around 600MB/s (dist-repl volume) and 1.1GB/s (distributed volume) with GlusterFS 3.5.3 over the TCP transport type (RDMA never worked in my storage cluster before GlusterFS 3.7.x).
Now it is around 500-600MB/s with RDMA and 150-300MB/s with TCP for the dist-repl volume, and around 600-700MB/s with RDMA and 500-600MB/s with TCP for the distributed volume.
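
Roughly, the kind of test behind these numbers is a large sequential transfer of this shape (a sketch; /mnt/vol_home and the sizes are placeholders):

	# sequential write; fdatasync so the page cache doesn't flatter the result
	dd if=/dev/zero of=/mnt/vol_home/ddtest bs=1M count=10240 conv=fdatasync

	# sequential read of the same file
	dd if=/mnt/vol_home/ddtest of=/dev/null bs=1M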

Could you help us bring our HPC center back into production by solving the above-mentioned issues? Or do you advise me to downgrade to v3.5.3 (the most stable version I have known since I started using GlusterFS in production)? Or move on? ;-)

Thanks in advance.
Geoffrey
------------------------------------------------------
Geoffrey Letessier
Responsable informatique & ingénieur système
UPR 9080 - CNRS - Laboratoire de Biochimie Théorique
Institut de Biologie Physico-Chimique
13, rue Pierre et Marie Curie - 75005 Paris
Tel: 01 58 41 50 93 - eMail: geoffrey.letessier at ibpc.fr


