[Bugs] [Bug 1749664] New: The result (hostname) of getnameinfo for all bricks (ipv6 addresses) are the same, while they are not.

bugzilla at redhat.com bugzilla at redhat.com
Fri Sep 6 08:02:10 UTC 2019


https://bugzilla.redhat.com/show_bug.cgi?id=1749664

            Bug ID: 1749664
           Summary: The result (hostname) of getnameinfo for all bricks
                    (ipv6 addresses)  are the same, while they are not.
           Product: GlusterFS
           Version: 7
          Hardware: All
                OS: Linux
            Status: NEW
         Component: glusterd
          Severity: urgent
          Priority: high
          Assignee: bugs at gluster.org
          Reporter: moagrawa at redhat.com
                CC: amgad.saleh at nokia.com, amukherj at redhat.com,
                    avishwan at redhat.com, bugs at gluster.org,
                    moagrawa at redhat.com, pasik at iki.fi,
                    rkothiya at redhat.com, srakonde at redhat.com
        Depends On: 1739320
            Blocks: 1747746
  Target Milestone: ---
    Classification: Community



+++ This bug was initially created as a clone of Bug #1739320 +++

Description of problem:

When creating a volume using IPv6 addresses, the command fails with an error
claiming the bricks are on the same hostname, while they are not.

The result (hostname) of getnameinfo for all bricks (IPv6 addresses) is the
same, even though the bricks are on different hosts.

Version-Release number of selected component (if applicable):
6.3-1 and 6.4-1

How reproducible:


Steps to Reproduce:
1. Create a volume with replica 3 using the command:
gluster --mode=script volume create vol_b6b4f444031cb86c969f3fc744f2e999
replica 3  2001:db8:1234::10:/root/test/a 2001:db8:1234::5:/root/test/a 
2001:db8:1234::14:/root/test/a
2. Error happens that all bricks on the same hostname
3. check those addresses using nslookup which shows the opposite, those IP
belongs to different hostnames

Actual results:
===============
# gluster --mode=script volume create vol_b6b4f444031cb86c969f3fc744f2e999
replica 3  2001:db8:1234::10:/root/test/a 2001:db8:1234::5:/root/test/a 
2001:db8:1234::14:/root/test/a
volume create: vol_b6b4f444031cb86c969f3fc744f2e999: failed: Multiple bricks of
a replicate volume are present on the same server. This setup is not optimal.
Bricks should be on different nodes to have best fault tolerant configuration.
Use 'force' at the end of the command if you want to override this behavior.

# nslookup 2001:db8:1234::10
Server:         2001:db8:1234::5
Address:        2001:db8:1234::5#53

0.1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.4.3.2.1.8.b.d.0.1.0.0.2.ip6.arpa       
name = roger-1812-we-01.

# nslookup 2001:db8:1234::5
Server:         2001:db8:1234::5
Address:        2001:db8:1234::5#53

5.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.4.3.2.1.8.b.d.0.1.0.0.2.ip6.arpa       
name = roger-1903-we-01.

# nslookup 2001:db8:1234::14
Server:         2001:db8:1234::5
Address:        2001:db8:1234::5#53

4.1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.4.3.2.1.8.b.d.0.1.0.0.2.ip6.arpa       
name = roger-1812-cwes-01.


Expected results:
Volume should succeed

Additional info:

Here are the code snippets from glusterfs 6.4 that exhibit the problem.
For some reason, the result (hostname) of getnameinfo for all bricks (IPv6
addresses) is the same, while the bricks are actually different.
================
xlators/mgmt/glusterd/src/glusterd-volume-ops.c

gf_ai_compare_t
glusterd_compare_addrinfo(struct addrinfo *first, struct addrinfo *next)
{
    int ret = -1;
    struct addrinfo *tmp1 = NULL;
    struct addrinfo *tmp2 = NULL;
    char firstip[NI_MAXHOST] = {
        0,
    };
    char nextip[NI_MAXHOST] = {
        0,
    };

    for (tmp1 = first; tmp1 != NULL; tmp1 = tmp1->ai_next) {
        ret = getnameinfo(tmp1->ai_addr, tmp1->ai_addrlen, firstip, NI_MAXHOST,
                          NULL, 0, NI_NUMERICHOST);
        if (ret)
            return GF_AI_COMPARE_ERROR;
        for (tmp2 = next; tmp2 != NULL; tmp2 = tmp2->ai_next) {
            ret = getnameinfo(tmp2->ai_addr, tmp2->ai_addrlen, nextip,
                              NI_MAXHOST, NULL, 0, NI_NUMERICHOST);
            if (ret)
                return GF_AI_COMPARE_ERROR;
            if (!strcmp(firstip, nextip)) {
                return GF_AI_COMPARE_MATCH;
            }
        }
    }
    return GF_AI_COMPARE_NO_MATCH;
}

...
            if (GF_AI_COMPARE_MATCH == ret)
                goto found_bad_brick_order;

...

found_bad_brick_order:
    gf_msg(this->name, GF_LOG_INFO, 0, GD_MSG_BAD_BRKORDER,
           "Bad brick order found");
    if (type == GF_CLUSTER_TYPE_DISPERSE) {
        snprintf(err_str, sizeof(found_string), found_string, "disperse");
    } else {
        snprintf(err_str, sizeof(found_string), found_string, "replicate");
    }
....
   const char found_string[2048] =
        "Multiple bricks of a %s "
        "volume are present on the same server. This "
        "setup is not optimal. Bricks should be on "
        "different nodes to have best fault tolerant "
        "configuration. Use 'force' at the end of the "
        "command if you want to override this "
        "behavior. ";

--- Additional comment from Amgad on 2019-08-11 04:34:04 UTC ---

Any response?

--- Additional comment from Ravishankar N on 2019-08-12 04:21:10 UTC ---

CC'ing glusterd maintainer to take a look.

--- Additional comment from Amgad on 2019-08-14 05:12:19 UTC ---

can someone provide a pointer to the "getnameinfo" source code while looking at
the issue

--- Additional comment from Atin Mukherjee on 2019-08-14 05:19:45 UTC ---

Aravinda - can you please help here?

--- Additional comment from Aravinda VK on 2019-08-14 06:15:06 UTC ---

I think it is failing while doing strcmp comparison.

```
            if (!strcmp(firstip, nextip)) {
                return GF_AI_COMPARE_MATCH;
            }
```

Wrote a small script to compare the hostnames

```
#include <stdio.h>
#include <string.h>

int main()
{
    char* first = "roger-1812-we-01";
    char* second = "roger-1903-we-01";
    char* third = "roger-1812-cwes-01";
    printf("First(%s)  vs Second(%s): %d\n", first, second, strcmp(first, second));
    printf("First(%s)  vs Third(%s): %d\n", first, third, strcmp(first, third));
    printf("Second(%s) vs Third(%s): %d\n", second, third, strcmp(second, third));
}
```

And the output is


First(roger-1812-we-01)  vs Second(roger-1903-we-01): -1
First(roger-1812-we-01)  vs Third(roger-1812-cwes-01): 20
Second(roger-1903-we-01) vs Third(roger-1812-cwes-01): 1


We should change the comparison to 

```
if (strcmp(firstip, nextip) == 0) {
    return GF_AI_COMPARE_MATCH;
}
```

--- Additional comment from Aravinda VK on 2019-08-14 06:25:06 UTC ---

Ignore my previous comment. I was wrong. Thanks Amar for pointing out that
`!-1` is `0`.

--- Additional comment from Amgad on 2019-08-16 14:50:41 UTC ---

I did some testing on "getnameinfo" and it works fine. When you pass the IPv6
address, it returns the right IP address.

I used the following test program:
#include <arpa/inet.h>
#include <netdb.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SIZE 1024

int main(int argc, char *argv[])
{
    char host[SIZE];
    char service[SIZE];

    struct sockaddr_in6 sa;
    memset(&sa, 0, sizeof(sa));  /* zero sin6_port and sin6_scope_id */
    sa.sin6_family = AF_INET6;
    if (argc < 2 || inet_pton(AF_INET6, argv[1], &sa.sin6_addr) != 1)
        exit(1);

    int res = getnameinfo((struct sockaddr*)&sa, sizeof(sa), host,
                          sizeof(host), service, sizeof(service), 0);

    if (res)
    {
        exit(1);
    }
    else
    {
        printf("Hostname: %s\n", host);
        printf("Service: %s\n", service);
    }

    return 0;
}

So I think the problem is in what is passed to:
glusterd_compare_addrinfo(struct addrinfo *first, struct addrinfo *next)

--- Additional comment from Amgad on 2019-08-16 20:52:30 UTC ---

Digging more into the glusterd-volume-ops.c file, where
"glusterd_compare_addrinfo" is called by "glusterd_check_brick_order": in the
following code, which prepares what is passed to "glusterd_compare_addrinfo",
"getaddrinfo" does not seem to return the right address.


    brick_list_dup = brick_list_ptr = gf_strdup(brick_list);
    /* Resolve hostnames and get addrinfo */
    while (i < brick_count) {
        ++i;
        brick = strtok_r(brick_list_dup, " \n", &tmpptr);
        brick_list_dup = tmpptr;
        if (brick == NULL)
            goto check_failed;
        brick = strtok_r(brick, ":", &tmpptr);
        if (brick == NULL)
            goto check_failed;
        ret = getaddrinfo(brick, NULL, NULL, &ai_info);
        if (ret != 0) {
            ret = 0;
            gf_msg(this->name, GF_LOG_ERROR, 0, GD_MSG_HOSTNAME_RESOLVE_FAIL,
                   "unable to resolve "
                   "host name");
            goto out;
        }
        ai_list_tmp1 = MALLOC(sizeof(addrinfo_list_t));
        if (ai_list_tmp1 == NULL) {
            ret = 0;
            gf_msg(this->name, GF_LOG_ERROR, ENOMEM, GD_MSG_NO_MEMORY,
                   "failed to allocate "
                   "memory");
            freeaddrinfo(ai_info);
            goto out;
        }
        ai_list_tmp1->info = ai_info;
        cds_list_add_tail(&ai_list_tmp1->list, &ai_list->list);
        ai_list_tmp1 = NULL;
    }

I wrote a small program to call it, and it always returns "0.0.0.0", so maybe
that's why the code later assumes it's the same host.
It does work for IPv4, though. Also, you have to loop through the list to get
the right address.


I'll dig more, but I hope this gives some direction to the other developers.

--- Additional comment from Amgad on 2019-08-28 19:55:02 UTC ---

Hi Amar /GlusterFS team

I was busy addressing other development issues - back to this IPv6 one.
In this problem, the volume is created through heketi and fails in the
"glusterd-volume-ops.c" file when "glusterd_compare_addrinfo" is called.

In a different test (a system configured with pure IPv6), where volumes were
created using the gluster CLI, the volumes are created on different servers,
but "glustershd" failed to come up with the following errors:

[2019-08-28 19:11:36.645541] I [MSGID: 100030] [glusterfsd.c:2847:main]
0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 6.5 (args:
/usr/sbin/glusterfs -s 2001:db8:1234::8 --volfile-id gluster/glustershd -p 
/var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log
-S /var/run/gluster/3a1e3977fd7318f2.socket --xlator-option
*replicate*.node-uuid=8e2b40a7-098c-4f0a-b323-2e764bd315f3 --process-name
glustershd --client-pid=-6)
[2019-08-28 19:11:36.646207] I [glusterfsd.c:2556:daemonize] 0-glusterfs: Pid
of current running process is 26375
[2019-08-28 19:11:36.655872] I [socket.c:902:__socket_server_bind]
0-socket.glusterfsd: closing (AF_UNIX) reuse check socket 9
[2019-08-28 19:11:36.656708] E [MSGID: 101075]
[common-utils.c:508:gf_resolve_ip6] 0-resolver: getaddrinfo failed (family:2)
(Address family for hostname not supported)
[2019-08-28 19:11:36.656730] E [name.c:258:af_inet_client_get_remote_sockaddr]
0-glusterfs: DNS resolution failed on host 2001:db8:1234::8
[2019-08-28 19:11:36.658459] I [MSGID: 101190]
[event-epoll.c:680:event_dispatch_epoll_worker] 0-epoll: Started thread with
index 0
[2019-08-28 19:11:36.658744] I [glusterfsd-mgmt.c:2443:mgmt_rpc_notify]
0-glusterfsd-mgmt: disconnected from remote-host: 2001:db8:1234::8
[2019-08-28 19:11:36.658766] I [glusterfsd-mgmt.c:2463:mgmt_rpc_notify]
0-glusterfsd-mgmt: Exhausted all volfile servers
[2019-08-28 19:11:36.658832] I [MSGID: 101190]
[event-epoll.c:680:event_dispatch_epoll_worker] 0-epoll: Started thread with
index 1
[2019-08-28 19:11:36.659376] W [glusterfsd.c:1570:cleanup_and_exit]
(-->/lib64/libgfrpc.so.0(+0xf1d3) [0x7f61e883a1d3]
-->/usr/sbin/glusterfs(+0x12fef) [0x5653bb1c9fef]
-->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x5653bb1c201b] ) 0-: received
signum (1), shutting down

It indicates that the function "gf_resolve_ip6" in "common-utils.c" is failing
because of (family:2) -- since the IP is IPv6, the family should be 10, not 2,
so the call fails with "address family not supported".
The same applies to "af_inet_client_get_remote_sockaddr".

Any suggestion on what could be passing the family as "2" (IPv4) rather than
"10" (IPv6)?

Regards,
Amgad

--- Additional comment from Amgad on 2019-08-28 21:46:03 UTC ---

GlusterFS team:

Can someone check urgently if "hints.ai_family" in these function calls is set
to "AF_INET6" and not "AF_UNSPEC" to force version?

Regards,
Amgad

--- Additional comment from Amgad on 2019-08-30 03:15:12 UTC ---


*** I verified that the "ai_family" passed from
"af_inet_client_get_remote_sockaddr" to "gf_resolve_ip6"
[rpc/rpc-transport/socket/src/name.c] is IPv4 ("2") and not IPv6 (it should be
"10"):

af_inet_client_get_remote_sockaddr(rpc_transport_t *this,
                                   struct sockaddr *sockaddr,
                                   socklen_t *sockaddr_len)
.......

    /* TODO: gf_resolve is a blocking call. kick in some
       non blocking dns techniques */
    ret = gf_resolve_ip6(remote_host, remote_port, sockaddr->sa_family,
                         &this->dnscache, &addr_info);
    gf_log(this->name, GF_LOG_ERROR, "CSTO-DEBUG: Family Address is %d",
           sockaddr->sa_family);              ==> my added debug msg
AND

*** In [libglusterfs/src/common-utils.c], where "gf_resolve_ip6" is defined,
the IPv6 address is passed to "getaddrinfo" as the host name, and the call
fails because the ai_family is not right:

int32_t
gf_resolve_ip6(const char *hostname, uint16_t port, int family, void **dnscache,
               struct addrinfo **addr_info)
{
...
        if ((ret = getaddrinfo(hostname, port_str, &hints, &cache->first)) !=
            0) {
            gf_msg("resolver", GF_LOG_ERROR, 0, LG_MSG_GETADDRINFO_FAILED,
                   "getaddrinfo failed (family:%d) (%s)", family,
                   gai_strerror(ret));

            gf_msg("resolver", GF_LOG_ERROR, 0, LG_MSG_GETADDRINFO_FAILED,    ==> my added debug msg
                   "CSTO-DEBUG: getaddrinfo failed (hostname:%s) (%s)",
                   hostname, gai_strerror(ret));

.........
/var/log/glusterfs/glustershd.log output:
.....
[2019-08-30 01:03:51.871225] E [MSGID: 101075]
[common-utils.c:512:gf_resolve_ip6] 0-resolver: CSTO-DEBUG: getaddrinfo failed
(hostname:2001:db8:1234::8) (Address family for hostname not supported)

[2019-08-30 01:03:51.871239] E [name.c:256:af_inet_client_get_remote_sockaddr]
0-glusterfs: CSTO-DEBUG: Family Address is 2 ==>
[2019-08-30 01:03:51.871249] E [name.c:260:af_inet_client_get_remote_sockaddr]
0-glusterfs: DNS resolution failed on host 2001:db8:1234::8
........

That's why DNS resolution failed and glustershd did not come up.
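The family mismatch can be reproduced in isolation: getaddrinfo() rejects an
IPv6 literal when hints.ai_family is AF_INET (2) but accepts it with AF_INET6
(10). This is a simplified sketch, not the actual gf_resolve_ip6 code; the
AI_NUMERICHOST flag is set here only so the demonstration needs no DNS:

```c
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Resolve an IPv6 literal with a fixed address family, the way
 * gf_resolve_ip6() receives sockaddr->sa_family from the transport
 * layer. Returns the getaddrinfo() error code (0 on success). */
static int resolve_with_family(const char *host, int family)
{
    struct addrinfo hints, *res = NULL;
    int ret;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = family;        /* AF_INET == 2, AF_INET6 == 10 */
    hints.ai_flags = AI_NUMERICHOST; /* literal address, skip DNS */
    ret = getaddrinfo(host, NULL, &hints, &res);
    if (ret == 0)
        freeaddrinfo(res);
    return ret;
}

int main(void)
{
    const char *host = "2001:db8:1234::8";

    /* AF_INET (family 2) fails; AF_INET6 (family 10) succeeds. */
    printf("AF_INET : %s\n", gai_strerror(resolve_with_family(host, AF_INET)));
    printf("AF_INET6: %s\n", gai_strerror(resolve_with_family(host, AF_INET6)));
    return 0;
}
```

This matches the log above: with family 2 the IPv6 host cannot be resolved at
all, so the client transport never connects.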

--- Additional comment from Mohit Agrawal on 2019-08-30 08:59:40 UTC ---

Hi,

To enable ipv6 for gluster processes, you need to change
"transport.address-family" in /etc/glusterfs/glusterd.vol and restart glusterd.

The issue has been fixed upstream by the patch below:
https://review.gluster.org/#/c/glusterfs/+/21948/

By default the transport address family is inet, and the value is commented out
in the file /etc/glusterfs/glusterd.vol.

To enable ipv6, please change the value to inet6 and uncomment the line, as below:

cat /etc/glusterfs/glusterd.vol 
volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    option transport-type socket
    option transport.socket.keepalive-time 10
    option transport.socket.keepalive-interval 2
    option transport.socket.read-fail-log off
    option transport.socket.listen-port 24007
    option ping-timeout 0
    option event-threads 1
#   option lock-timer 180
    option transport.address-family inet6
#   option base-port 49152
    option max-port  60999
end-volume


Thanks,
Mohit Agrawal

--- Additional comment from Amgad on 2019-08-30 13:43:24 UTC ---

Hi Mohit:

Our "/etc/glusterfs/glusterd.vol" is already set with IPv6, so this is not the
cause; see below:

Regards,
Amgad

volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    option transport-type socket,rdma
    option transport.socket.keepalive-time 10
    option transport.socket.keepalive-interval 2
    option transport.socket.read-fail-log off
    option transport.socket.listen-port 24007
    option transport.rdma.listen-port 24008
    option transport.address-family inet6
    option transport.socket.bind-address 2001:db8:1234::8
    option transport.tcp.bind-address 2001:db8:1234::8
    option transport.rdma.bind-address 2001:db8:1234::8
    option ping-timeout 0
    option event-threads 1
    option transport.listen-backlog 1024
#   option base-port 49152
end-volume

--- Additional comment from Mohit Agrawal on 2019-08-30 14:03:31 UTC ---

Hi,

Can you please share complete dump of /var/log/gluster directory.

Thanks,
Mohit Agrawal

--- Additional comment from Amgad on 2019-08-30 20:58:32 UTC ---

I'm attaching the tar file. I just reverted the private version with my
debugging statements back to 6.5-1. Keep in mind this was upgraded back and
forth several times, so the logs contain different versions, but the latest is
6.5-1.

--- Additional comment from Amgad on 2019-08-30 21:00:59 UTC ---



--- Additional comment from Mohit Agrawal on 2019-08-31 03:06:04 UTC ---

Hi,

 As per the currently shared logs, it seems you are now facing a different
issue; the issue related to "DNS resolution failed" is already resolved.
 It seems that earlier the correct transport-type was not mentioned in the
volfile, so the brick was not coming up (throwing an "Address family not
supported" error), but now the brick is failing because it is not able to
connect to glusterd, since glusterd is not up.

>>>>>>>>>>>>>>>>>>>>>>
.....
.....
[2019-08-30 01:03:41.480435] W [socket.c:721:__socket_rwv] 0-glusterfs: readv
on 2001:db8:1234::8:24007 failed (No data available)
[2019-08-30 01:03:41.480554] I [glusterfsd-mgmt.c:2443:mgmt_rpc_notify]
0-glusterfsd-mgmt: disconnected from remote-host:
ceph-cs-01.storage.bcmt.cluster.local
[2019-08-30 01:03:41.480573] I [glusterfsd-mgmt.c:2463:mgmt_rpc_notify]
0-glusterfsd-mgmt: Exhausted all volfile servers
.....
....
>>>>>>>>>>>>>>>>>>>>>>>>>>

I am seeing similar messages in the other brick log files as well.
glusterd is not coming up because it is throwing an "Address already in use"
error.

>>>>>>>>>>>>>>>>>

[2019-08-30 01:03:43.493787] I [socket.c:904:__socket_server_bind]
0-socket.management: closing (AF_UNIX) reuse check socket 10
[2019-08-30 01:03:43.499501] I [MSGID: 106513]
[glusterd-store.c:2394:glusterd_restore_op_version] 0-glusterd: retrieved
op-version: 60000
[2019-08-30 01:03:43.503539] I [MSGID: 106544]
[glusterd.c:152:glusterd_uuid_init] 0-management: retrieved UUID:
8e2b40a7-098c-4f0a-b323-2e764bd315f3
[2019-08-30 01:03:43.855699] I [MSGID: 106498]
[glusterd-handler.c:3687:glusterd_friend_add_from_peerinfo] 0-management:
connect returned 0
[2019-08-30 01:03:43.860181] I [MSGID: 106498]
[glusterd-handler.c:3687:glusterd_friend_add_from_peerinfo] 0-management:
connect returned 0
[2019-08-30 01:03:43.860245] W [MSGID: 106061]
[glusterd-handler.c:3490:glusterd_transport_inet_options_build] 0-glusterd:
Failed to get tcp-user-timeout
[2019-08-30 01:03:43.860284] I [rpc-clnt.c:1005:rpc_clnt_connection_init]
0-management: setting frame-timeout to 600
[2019-08-30 01:03:43.966588] E [name.c:256:af_inet_client_get_remote_sockaddr]
0-management: CSTO-DEBUG: Family Address is 10
[2019-08-30 01:03:43.967196] I [rpc-clnt.c:1005:rpc_clnt_connection_init]
0-management: setting frame-timeout to 600
[2019-08-30 01:03:43.969757] E [name.c:256:af_inet_client_get_remote_sockaddr]
0-management: CSTO-DEBUG: Family Address is 10
[2019-08-30 01:03:44.681604] E [socket.c:923:__socket_server_bind]
0-socket.management: binding to  failed: Address already in use
[2019-08-30 01:03:44.681645] E [socket.c:925:__socket_server_bind]
0-socket.management: Port is already in use
[2019-08-30 01:03:45.681776] E [socket.c:923:__socket_server_bind]
0-socket.management: binding to  failed: Address already in use
[2019-08-30 01:03:45.681883] E [socket.c:925:__socket_server_bind]
0-socket.management: Port is already in use
[2019-08-30 01:03:46.681992] E [socket.c:923:__socket_server_bind]
0-socket.management: binding to  failed: Address already in use
[2019-08-30 01:03:46.682027] E [socket.c:925:__socket_server_bind]
0-socket.management: Port is already in use
[2019-08-30 01:03:47.682249] E [socket.c:925:__socket_server_bind]
0-socket.management: Port is already in use
[2019-08-30 01:03:47.682191] E [socket.c:923:__socket_server_bind]
0-socket.management: binding to  failed: Address already in use
[2019-08-30 01:03:43.967187] W [MSGID: 106061]
[glusterd-handler.c:3490:glusterd_transport_inet_options_build] 0-glusterd:
Failed to get tcp-user-timeout
[2019-08-30 01:03:48.598585] W [glusterfsd.c:1570:cleanup_and_exit]
(-->/lib64/libpthread.so.0(+0x7dd5) [0x7f4af4b0fdd5]
-->glusterd(glusterfs_sigwaiter+0xe5) [0x5584ef3131b5]
-->glusterd(cleanup_and_exit+0x6b) [0x5584ef31301b] ) 0-: received signum (15),
shutting down
>>>>>>>>>>>>>>>>>>>>>>

We have fixed the same issue in release-6 recently:
https://review.gluster.org/#/c/glusterfs/+/23268/

Kindly apply this patch, or install a build that includes it.

Regards,
Mohit Agrawal

--- Additional comment from Amgad on 2019-09-01 04:09:00 UTC ---

Hi Mohit:

The patch is already applied and glusterd is running. Glustershd is the one not
running. I re-applied the latest rpm from the 6.x branch and verified the code
change - I called it 6.5-1.7.el7.x86_64 (please check the newly uploaded tar
file, starting at 2019-09-01 03:44:34).

Here's the glusterd status:
# systemctl status glusterd
● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/etc/systemd/system/glusterd.service; disabled; vendor preset: disabled)
   Active: active (running) since Sun 2019-09-01 03:57:43 UTC; 5min ago
  Process: 20220 ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level $LOG_LEVEL $GLUSTERD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 20221 (glusterd)
    Tasks: 113
   Memory: 140.7M (limit: 3.8G)
   CGroup: /system.slice/glusterd.service
           ├─20221 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level ER...
           ├─29021 /usr/sbin/glusterfsd -s ceph-cs-01.storage.bcmt.cluster.lo...
           ├─29033 /usr/sbin/glusterfsd -s ceph-cs-01.storage.bcmt.cluster.lo...
           ├─29044 /usr/sbin/glusterfsd -s ceph-cs-01.storage.bcmt.cluster.lo...
           ├─29055 /usr/sbin/glusterfsd -s ceph-cs-01.storage.bcmt.cluster.lo...
           └─29067 /usr/sbin/glusterfsd -s ceph-cs-01.storage.bcmt.cluster.lo...

Sep 01 03:57:42 ceph-cs-01 systemd[1]: Starting GlusterFS, a clustered file.....
Sep 01 03:57:43 ceph-cs-01 systemd[1]: Started GlusterFS, a clustered file-...r.
Hint: Some lines were ellipsized, use -l to show in full.

# gluster volume status
Status of volume: bcmt-glusterfs
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick ceph-cs-02.storage.bcmt.cluster.local
:/data0/glusterfs                           49152     0          Y       2381 
Brick ceph-cs-03.storage.bcmt.cluster.local
:/data0/glusterfs                           49152     0          Y       3607 
Brick ceph-cs-01.storage.bcmt.cluster.local
:/data0/glusterfs                           49152     0          Y       29021
Self-heal Daemon on localhost               N/A       N/A        N       N/A  
Self-heal Daemon on ceph-cs-03.storage.bcmt
.cluster.local                              N/A       N/A        N       N/A  
Self-heal Daemon on ceph-cs-02              N/A       N/A        N       N/A  

.......

Regards,
Amgad

--- Additional comment from Amgad on 2019-09-01 04:10:42 UTC ---



--- Additional comment from Amgad on 2019-09-01 04:12:42 UTC ---

I did restart at 2019-09-01 03:57:43

--- Additional comment from Mohit Agrawal on 2019-09-01 05:49:40 UTC ---

Hi Amgad,

  Thanks for sharing the reply. I have checked the code.
  We need to fix the issue and update the transport address family in the case
of a client process.

  For now, I can provide a workaround to start shd with ipv6:
  copy the existing arguments from the shd log file and add an --xlator-option
argument.

  As per the last shared logs, shd was spawned with the arguments below:

  /usr/sbin/glusterfs version 6.5 (args: /usr/sbin/glusterfs -s
2001:db8:1234::8 --volfile-id gluster/glustershd -p
/var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log
-S /var/run/gluster/3a1e3977fd7318f2.socket --xlator-option
*replicate*.node-uuid=8e2b40a7-098c-4f0a-b323-2e764bd315f3 --process-name
glustershd --client-pid=-6)

  So, to start shd with ipv6 enabled, run it as below after adding
--xlator-option transport.address-family=inet6 to the command-line arguments:

  >>>>>>>>>>>>>>>

  /usr/sbin/glusterfs -s 2001:db8:1234::8 --volfile-id gluster/glustershd -p
/var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log
-S /var/run/gluster/3a1e3977fd7318f2.socket --xlator-option
*replicate*.node-uuid=8e2b40a7-098c-4f0a-b323-2e764bd315f3 --xlator-option
transport.address-family=inet6 --process-name glustershd --client-pid=-6 

  >>>>>>>>>>>>.

  I will upload a patch to resolve this.
  Let me know if you can test the patch in your environment.

Thanks,
Mohit Agrawal

--- Additional comment from Mohit Agrawal on 2019-09-01 06:40:07 UTC ---

Uploaded an upstream patch to resolve the same:
https://review.gluster.org/#/c/glusterfs/+/23340/

--- Additional comment from Amgad on 2019-09-01 17:41:15 UTC ---

Thanks a lot Mohit -- I'll try that.

How about the original problem - where it thinks the bricks are on the same
server while they are not? Any suggestion or clue? Please refer to the first
comment reporting the issue:


# gluster --mode=script volume create vol_b6b4f444031cb86c969f3fc744f2e999
replica 3  2001:db8:1234::10:/root/test/a 2001:db8:1234::5:/root/test/a 
2001:db8:1234::14:/root/test/a
volume create: vol_b6b4f444031cb86c969f3fc744f2e999: failed: Multiple bricks of
a replicate volume are present on the same server. This setup is not optimal.
Bricks should be on different nodes to have best fault tolerant configuration.
Use 'force' at the end of the command if you want to override this behavior.
......

--- Additional comment from Mohit Agrawal on 2019-09-02 05:01:29 UTC ---

Hi Amgad,

 There is an issue in the function glusterd_check_brick_order when parsing an
ipv6 host address.
 As we can see, the code below looks for a single ":" to fetch the host
address. In the case of ipv4 that is fine, but an ipv6 address contains
multiple ":" characters, so it does not work.

 >>>>>>>>>>>>>>>>
  brick = strtok_r(brick, ":", &tmpptr);
  if (brick == NULL)
       goto check_failed;
  ret = getaddrinfo(brick, NULL, NULL, &ai_info);
  if (ret != 0) {
      ret = 0;
      gf_msg(this->name, GF_LOG_ERROR, 0, GD_MSG_HOSTNAME_RESOLVE_FAIL,
              "unable to resolve " "host name");
      goto out;
  }

 >>>>>>>>>>>>>

 I will send a separate patch to resolve the same. 
 Let me know if you can test my patch in your environment.

Thanks,
Mohit Agrawal

--- Additional comment from Amgad on 2019-09-02 05:18:51 UTC ---

Thanks Mohit:

Sure, I can test the new patch in our environment. Please provide the link to
the patch.

BTW, I rebuilt the RPMs with your glustershd patch and it works fine.

Regards,
Amgad

--- Additional comment from Mohit Agrawal on 2019-09-02 05:22:39 UTC ---

Below is the upstream patch to resolve the same

https://review.gluster.org/#/c/glusterfs/+/23341/

Thanks,
Mohit Agrawal

--- Additional comment from Amgad on 2019-09-02 05:33:34 UTC ---

Thanks Mohit -  

I assume it's fine despite the -1 for CentOS-regression and smoke, right?

--- Additional comment from Mohit Agrawal on 2019-09-02 05:36:24 UTC ---

Yes, I think it should work fine.
I could not test it because environment is not available as of now.

Thanks,
Mohit Agrawal

--- Additional comment from Amgad on 2019-09-02 15:59:23 UTC ---

Thx Mohit:

Built it and the initial testing looks good. Will do more testing today.

Regards,
Amgad

--- Additional comment from Mohit Agrawal on 2019-09-02 16:20:38 UTC ---

Thanks Amgad for verifying the fix.

If you are satisfied with previous patch kindly give your vote.
https://review.gluster.org/#/c/glusterfs/+/23340/

Thanks,
Mohit Agrawal

--- Additional comment from Worker Ant on 2019-09-06 07:07:40 UTC ---

REVIEW: https://review.gluster.org/23372 (rpc: Update address family if it is
not provide in cmd-line arguments) posted (#1) for review on release-6 by MOHIT
AGRAWAL

--- Additional comment from Worker Ant on 2019-09-06 07:09:27 UTC ---

REVIEW: https://review.gluster.org/23373 (glusterd: IPV6 hostname address is
not parsed correctly) posted (#1) for review on release-6 by MOHIT AGRAWAL


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1739320
[Bug 1739320] The result (hostname) of getnameinfo for all bricks (ipv6
addresses)  are the same, while they are not.
https://bugzilla.redhat.com/show_bug.cgi?id=1747746
[Bug 1747746] The result (hostname) of getnameinfo for all bricks (ipv6
addresses)  are the same, while they are not.
-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.