MariaDB Cluster with Galera Replication

artemidis · 01-13-2016, 05:36 AM

Dear All,
I have an issue on my 3-node cluster.
I have set the variable max_allowed_packet=200M in the configuration file /etc/my.cnf.d/server.cnf.
At random, this value changes back to the default (1M) on one node, and I do not know why!
To fix the issue I set the global value manually with:

Code:

SET GLOBAL max_allowed_packet=209715200;
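
For reference, the number in the statement above is simply 200M written out in bytes, since size suffixes in option files are multiples of 1024. A minimal Python sketch of that arithmetic follows; the helper name size_to_bytes is made up for illustration and is not part of any MariaDB tooling.

```python
# Hypothetical helper: convert an option-file size such as "200M" to bytes.
# MySQL/MariaDB option files treat K, M and G as powers of 1024.
SUFFIXES = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}

def size_to_bytes(text: str) -> int:
    text = text.strip().upper()
    if text and text[-1] in SUFFIXES:
        return int(text[:-1]) * SUFFIXES[text[-1]]
    return int(text)

print(size_to_bytes("200M"))  # 209715200, the value passed to SET GLOBAL
print(size_to_bytes("1M"))    # 1048576, the 1M value the node falls back to
```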

This is my configuration file:

Code:

[mariadb-10.0]
binlog_format=ROW
default-storage-engine=innodb
innodb_autoinc_lock_mode=2
innodb_locks_unsafe_for_binlog=1
query_cache_size=0
query_cache_type=0
bind-address=0.0.0.0

datadir=/var/lib/mysql
innodb_log_file_size=300M
innodb_file_per_table
innodb_flush_log_at_trx_commit=2
#max_allowed_packet=200M
#replicate_do_db="test1"

wsrep_provider=/usr/lib64/galera/libgalera_smm.so
wsrep_cluster_address="gcomm://10.105.228.147,10.105.228.175,10.105.228.168"
#wsrep_cluster_address="gcomm://"
wsrep_cluster_name='galera_cluster'
wsrep_node_address='10.105.228.147'
wsrep_node_name='db1'
wsrep_sst_method=rsync
wsrep_sst_auth=sst_user:pass
#
# These groups are read by MariaDB server.
# Use it for options that only the server (but not clients) should see
#
# See the examples of server my.cnf files in /usr/share/mysql/
#

# this is read by the standalone daemon and embedded servers
[server]

# this is only for the mysqld standalone daemon
[mysqld]
max_allowed_packet=200M
#
# * Galera-related settings
#
[galera]
# Mandatory settings
#wsrep_provider=
#wsrep_cluster_address=
#binlog_format=row
#default_storage_engine=InnoDB
#innodb_autoinc_lock_mode=2
#bind-address=0.0.0.0
#
# Optional setting
#wsrep_slave_threads=1
#innodb_flush_log_at_trx_commit=0

# this is only for embedded server
[embedded]
# This group is only read by MariaDB servers, not by MySQL.
# If you use the same .cnf file for MySQL and MariaDB,
# you can put MariaDB-only options here
[mariadb]

# This group is only read by MariaDB-10.0 servers.
# If you use the same .cnf file for MariaDB of different versions,
# use this group for options that older servers don't understand
[mariadb-10.0]
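
In the file above the option appears twice: commented out under [mariadb-10.0] and active under [mysqld]. When comparing the files on all three nodes, a small script can list every mention of an option together with its group. This is only a sketch with a hand-written parser; the function name find_option and the shortened sample text are illustrative assumptions.

```python
def find_option(cnf_text: str, option: str):
    """Return (group, value, active) for each line that mentions the option."""
    hits = []
    group = None
    for raw in cnf_text.splitlines():
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            group = line[1:-1]
            continue
        active = not line.startswith("#")
        body = line.lstrip("#").strip()
        name, sep, value = body.partition("=")
        # Option files accept dashes and underscores interchangeably.
        if sep and name.strip().replace("-", "_") == option:
            hits.append((group, value.strip(), active))
    return hits

# Shortened excerpt of the configuration posted above.
sample = """\
[mariadb-10.0]
#max_allowed_packet=200M
[mysqld]
max_allowed_packet=200M
"""
for group, value, active in find_option(sample, "max_allowed_packet"):
    print(group, value, "active" if active else "commented out")
```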

Here is my log file after a restart of the mysql service on every node and after the issue occurred:

Code:

160111 17:59:41 [Note] /usr/sbin/mysqld: Normal shutdown

160111 17:59:41 [Note] WSREP: Stop replication
160111 17:59:41 [Note] WSREP: Closing send monitor...
160111 17:59:41 [Note] WSREP: Closed send monitor.
160111 17:59:41 [Note] WSREP: gcomm: terminating thread
160111 17:59:41 [Note] WSREP: gcomm: joining thread
160111 17:59:41 [Note] WSREP: gcomm: closing backend
160111 17:59:41 [Note] WSREP: view(view_id(NON_PRIM,2d755602,20) memb {
	95e065df,0
} joined {
} left {
} partitioned {
	2d755602,0
	ba5fe03e,0
})
160111 17:59:41 [Note] WSREP: view((empty))
160111 17:59:41 [Note] WSREP: gcomm: closed
160111 17:59:41 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
160111 17:59:41 [Note] WSREP: Flow-control interval: [16, 16]
160111 17:59:41 [Note] WSREP: Received NON-PRIMARY.
160111 17:59:41 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 7602990)
160111 17:59:41 [Note] WSREP: Received self-leave message.
160111 17:59:41 [Note] WSREP: Flow-control interval: [0, 0]
160111 17:59:41 [Note] WSREP: Received SELF-LEAVE. Closing connection.
160111 17:59:41 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 7602990)
160111 17:59:41 [Note] WSREP: RECV thread exiting 0: Success
160111 17:59:41 [Note] WSREP: New cluster view: global state: 104ce0a3-fafd-11e4-9924-ea8a73e19897:7602990, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 3
160111 17:59:41 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
160111 17:59:41 [Note] WSREP: New cluster view: global state: 104ce0a3-fafd-11e4-9924-ea8a73e19897:7602990, view# -1: non-Primary, number of nodes: 0, my index: -1, protocol version 3
160111 17:59:41 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
160111 17:59:41 [Note] WSREP: applier thread exiting (code:0)
160111 17:59:41 [Note] WSREP: recv_thread() joined.
160111 17:59:41 [Note] WSREP: Closing replication queue.
160111 17:59:41 [Note] WSREP: Closing slave action queue.
160111 17:59:43 [Note] WSREP: rollbacker thread exiting
160111 17:59:43 [Note] Event Scheduler: Purging the queue. 0 events
160111 17:59:43 [Note] WSREP: dtor state: CLOSED
160111 17:59:43 [Note] WSREP: mon: entered 1593483 oooe fraction 0 oool fraction 2.63574e-05
160111 17:59:43 [Note] WSREP: mon: entered 1593483 oooe fraction 0.00164796 oool fraction 2.94951e-05
160111 17:59:43 [Note] WSREP: mon: entered 1624277 oooe fraction 0 oool fraction 3.07829e-06
160111 17:59:43 [Note] WSREP: cert index usage at exit 0
160111 17:59:43 [Note] WSREP: cert trx map usage at exit 26
160111 17:59:43 [Note] WSREP: deps set usage at exit 0
160111 17:59:43 [Note] WSREP: avg deps dist 45.7738
160111 17:59:43 [Note] WSREP: avg cert interval 0.00801264
160111 17:59:43 [Note] WSREP: cert index size 15
160111 17:59:43 [Note] WSREP: Service thread queue flushed.
160111 17:59:43 [Note] WSREP: wsdb trx map usage 0 conn query map usage 0
160111 17:59:43 [Note] WSREP: MemPool(LocalTrxHandle): hit ratio: 0.999761, misses: 127, in use: 0, in pool: 127
160111 17:59:43 [Note] WSREP: MemPool(SlaveTrxHandle): hit ratio: 0.999779, misses: 236, in use: 0, in pool: 236
160111 17:59:43 [Note] WSREP: Shifting CLOSED -> DESTROYED (TO: 7602990)
160111 17:59:43 [Note] WSREP: Flushing memory map to disk...
160111 17:59:43 [Note] InnoDB: FTS optimize thread exiting.
160111 17:59:43 [Note] InnoDB: Starting shutdown...
160111 17:59:45 [Note] InnoDB: Shutdown completed; log sequence number 162143401020
160111 17:59:45 [Note] /usr/sbin/mysqld: Shutdown complete

160111 17:59:45 mysqld_safe mysqld from pid file /var/lib/mysql/db1-smartshop.pid ended
160111 18:02:45 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
160111 18:02:45 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.CimjBr' --pid-file='/var/lib/mysql/db1-smartshop-recover.pid'
160111 18:02:47 mysqld_safe WSREP: Recovered position 104ce0a3-fafd-11e4-9924-ea8a73e19897:7602990
160111 18:02:47 [Note] WSREP: wsrep_start_position var submitted: '104ce0a3-fafd-11e4-9924-ea8a73e19897:7602990'
160111 18:02:47 [Note] WSREP: Read nil XID from storage engines, skipping position init
160111 18:02:47 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so'
160111 18:02:47 [Note] WSREP: wsrep_load(): Galera 25.3.9(r3387) by Codership Oy <info@codership.com> loaded successfully.
160111 18:02:47 [Note] WSREP: CRC-32C: using hardware acceleration.
160111 18:02:47 [Note] WSREP: Found saved state: 104ce0a3-fafd-11e4-9924-ea8a73e19897:7602990
160111 18:02:47 [Note] WSREP: Passing config to GCS: base_host = 10.105.228.147; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc.ignore_quorum = false; pc.ignore_sb = false; pc.npvo = false; pc.recov
160111 18:02:47 [Note] WSREP: Service thread queue flushed.
160111 18:02:47 [Note] WSREP: Assign initial position for certification: 7602990, protocol version: -1
160111 18:02:47 [Note] WSREP: wsrep_sst_grab()
160111 18:02:47 [Note] WSREP: Start replication
160111 18:02:47 [Note] WSREP: Setting initial position to 104ce0a3-fafd-11e4-9924-ea8a73e19897:7602990
160111 18:02:47 [Note] WSREP: protonet asio version 0
160111 18:02:47 [Note] WSREP: Using CRC-32C for message checksums.
160111 18:02:47 [Note] WSREP: backend: asio
160111 18:02:47 [Warning] WSREP: access file(gvwstate.dat) failed(No such file or directory)
160111 18:02:47 [Note] WSREP: restore pc from disk failed
160111 18:02:47 [Note] WSREP: GMCast version 0
160111 18:02:47 [Note] WSREP: (23db9d57, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
160111 18:02:47 [Note] WSREP: (23db9d57, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
160111 18:02:47 [Note] WSREP: EVS version 0
160111 18:02:47 [Note] WSREP: gcomm: connecting to group 'galera_cluster', peer '10.105.228.147:,10.105.228.175:,10.105.228.168:'
160111 18:02:47 [Warning] WSREP: (23db9d57, 'tcp://0.0.0.0:4567') address 'tcp://10.105.228.147:4567' points to own listening address, blacklisting
160111 18:02:47 [Note] WSREP: (23db9d57, 'tcp://0.0.0.0:4567') address 'tcp://10.105.228.147:4567' pointing to uuid 23db9d57 is blacklisted, skipping
160111 18:02:47 [Note] WSREP: (23db9d57, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: 
160111 18:02:47 [Note] WSREP: (23db9d57, 'tcp://0.0.0.0:4567') address 'tcp://10.105.228.147:4567' pointing to uuid 23db9d57 is blacklisted, skipping
160111 18:02:47 [Note] WSREP: (23db9d57, 'tcp://0.0.0.0:4567') address 'tcp://10.105.228.147:4567' pointing to uuid 23db9d57 is blacklisted, skipping
160111 18:02:47 [Note] WSREP: (23db9d57, 'tcp://0.0.0.0:4567') address 'tcp://10.105.228.147:4567' pointing to uuid 23db9d57 is blacklisted, skipping
160111 18:02:48 [Note] WSREP: declaring 30a93685 at tcp://10.105.228.168:4567 stable
160111 18:02:48 [Note] WSREP: declaring 348472a5 at tcp://10.105.228.175:4567 stable
160111 18:02:48 [Note] WSREP: Node 30a93685 state prim
160111 18:02:48 [Note] WSREP: view(view_id(PRIM,23db9d57,3) memb {
	23db9d57,0
	30a93685,0
	348472a5,0
} joined {
} left {
} partitioned {
})
160111 18:02:48 [Note] WSREP: save pc into disk
160111 18:02:48 [Note] WSREP: gcomm: connected
160111 18:02:48 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
160111 18:02:48 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
160111 18:02:48 [Note] WSREP: Opened channel 'galera_cluster'
160111 18:02:48 [Note] WSREP: Waiting for SST to complete.
160111 18:02:48 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 3
160111 18:02:48 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 2474ff90-b885-11e5-a6c2-2755ac968ec3
160111 18:02:48 [Note] WSREP: STATE EXCHANGE: sent state msg: 2474ff90-b885-11e5-a6c2-2755ac968ec3
160111 18:02:48 [Note] WSREP: STATE EXCHANGE: got state msg: 2474ff90-b885-11e5-a6c2-2755ac968ec3 from 0 (db1)
160111 18:02:48 [Note] WSREP: STATE EXCHANGE: got state msg: 2474ff90-b885-11e5-a6c2-2755ac968ec3 from 1 (db3)
160111 18:02:48 [Note] WSREP: STATE EXCHANGE: got state msg: 2474ff90-b885-11e5-a6c2-2755ac968ec3 from 2 (db2)
160111 18:02:48 [Note] WSREP: Quorum results:
	version    = 3,
	component  = PRIMARY,
	conf_id    = 2,
	members    = 3/3 (joined/total),
	act_id     = 7602990,
	last_appl. = -1,
	protocols  = 0/7/3 (gcs/repl/appl),
	group UUID = 104ce0a3-fafd-11e4-9924-ea8a73e19897
160111 18:02:48 [Note] WSREP: Flow-control interval: [28, 28]
160111 18:02:48 [Note] WSREP: Restored state OPEN -> JOINED (7602990)
160111 18:02:48 [Note] WSREP: New cluster view: global state: 104ce0a3-fafd-11e4-9924-ea8a73e19897:7602990, view# 3: Primary, number of nodes: 3, my index: 0, protocol version 3
160111 18:02:48 [Note] WSREP: SST complete, seqno: 7602990
160111 18:02:48 [Note] WSREP: Member 0.0 (db1) synced with group.
160111 18:02:48 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 7602990)
2016-01-11 18:02:48 7fbe2553f880 InnoDB: Warning: Using innodb_locks_unsafe_for_binlog is DEPRECATED. This option may be removed in future releases. Please use READ COMMITTED transaction isolation level instead, see http://dev.mysql.com/doc/refman/5.6/en/set-transaction.html.
160111 18:02:48 [Note] InnoDB: Using mutexes to ref count buffer pool pages
160111 18:02:48 [Note] InnoDB: The InnoDB memory heap is disabled
160111 18:02:48 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
160111 18:02:48 [Note] InnoDB: Memory barrier is not used
160111 18:02:48 [Note] InnoDB: Compressed tables use zlib 1.2.7
160111 18:02:48 [Note] InnoDB: Using Linux native AIO
160111 18:02:48 [Note] InnoDB: Using CPU crc32 instructions
160111 18:02:48 [Note] InnoDB: Initializing buffer pool, size = 128.0M
160111 18:02:48 [Note] InnoDB: Completed initialization of buffer pool
160111 18:02:48 [Note] InnoDB: Highest supported file format is Barracuda.
160111 18:02:48 [Note] InnoDB: 128 rollback segment(s) are active.
160111 18:02:48 [Note] InnoDB: Waiting for purge to start
160111 18:02:48 [Note] InnoDB:  Percona XtraDB (http://www.percona.com) 5.6.22-72.0 started; log sequence number 162143404318
160111 18:02:48 [Note] Plugin 'FEEDBACK' is disabled.
160111 18:02:48 [Note] Server socket created on IP: '0.0.0.0'.
160111 18:02:48 [Note] Event Scheduler: Loaded 0 events
160111 18:02:48 [Note] /usr/sbin/mysqld: ready for connections.
Version: '10.0.17-MariaDB-wsrep'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MariaDB Server, wsrep_25.10.r4144
160111 18:02:48 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
160111 18:02:48 [Note] WSREP: REPL Protocols: 7 (3, 2)
160111 18:02:48 [Note] WSREP: Service thread queue flushed.
160111 18:02:48 [Note] WSREP: Assign initial position for certification: 7602990, protocol version: 3
160111 18:02:48 [Note] WSREP: Service thread queue flushed.
160111 18:02:48 [Note] WSREP: Synchronized with group, ready for connections
160111 18:02:48 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
160111 18:02:49 [Warning] Hostname 'db1-smartshop' does not resolve to '10.105.228.147'.
160111 18:02:49 [Note] Hostname 'db1-smartshop' has the following IP addresses:
160111 18:02:49 [Note]  - 159.8.42.247
160111 18:02:49 [Warning] IP address '10.105.228.168' could not be resolved: Name or service not known
160111 18:02:49 [Warning] IP address '10.105.228.175' could not be resolved: Name or service not known
160111 18:02:51 [Note] WSREP: (23db9d57, 'tcp://0.0.0.0:4567') turning message relay requesting off
160111 23:11:40 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000000 of size 134217728 bytes
160111 23:12:30 [Note] WSREP: Deleted page /var/lib/mysql/gcache.page.000000
160111 23:12:31 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000001 of size 166064580 bytes
160111 23:13:09 [Note] WSREP: Deleted page /var/lib/mysql/gcache.page.000001
160112  0:28:03 [Warning] IP address '23.247.5.32' could not be resolved: Name or service not known
160112  1:08:50 [Warning] IP address '115.239.196.59' could not be resolved: Name or service not known
160112  2:03:14 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000002 of size 134217728 bytes
160112  2:04:05 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000003 of size 134217728 bytes
160112  2:04:26 [Note] WSREP: Deleted page /var/lib/mysql/gcache.page.000002
160112  2:04:28 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000004 of size 166064580 bytes
160112  2:05:02 [Note] WSREP: Deleted page /var/lib/mysql/gcache.page.000003
160112  2:05:05 [Note] WSREP: Deleted page /var/lib/mysql/gcache.page.000004
160112  4:25:56 [Warning] IP address '222.186.190.37' could not be resolved: Name or service not known
160112  6:24:57 [Warning] IP address '118.244.158.237' could not be resolved: Name or service not known
160112  7:47:57 [Warning] IP address '58.221.55.130' could not be resolved: Name or service not known
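
The "could not be resolved" warnings in the log above mix the two other cluster nodes with addresses that are not listed in wsrep_cluster_address (for example 23.247.5.32). A rough sketch for separating the two groups, assuming the log is available as text; the function name unresolved_ips and the CLUSTER set are illustrative, with the addresses copied from the configuration above.

```python
import ipaddress
import re

# Node addresses taken from wsrep_cluster_address in the posted configuration.
CLUSTER = {"10.105.228.147", "10.105.228.175", "10.105.228.168"}
PATTERN = re.compile(r"IP address '([^']+)' could not be resolved")

def unresolved_ips(log_text):
    """Split unresolved addresses into cluster and non-cluster sets."""
    cluster, other = set(), set()
    for ip in PATTERN.findall(log_text):
        ipaddress.ip_address(ip)  # raises ValueError on a malformed address
        (cluster if ip in CLUSTER else other).add(ip)
    return cluster, other

# Two lines copied from the log above.
sample = """\
160111 18:02:49 [Warning] IP address '10.105.228.168' could not be resolved: Name or service not known
160112  0:28:03 [Warning] IP address '23.247.5.32' could not be resolved: Name or service not known
"""
cluster, other = unresolved_ips(sample)
print(sorted(cluster), sorted(other))
```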

How can I fix this issue, or what could I check to understand what is happening?
Thank you so much.

Antonio

dijetlo · 01-14-2016, 04:47 AM

Does this thing actually replicate?

Quote:

wsrep_cluster_address="gcomm://10.105.228.147,10.105.228.175,10.105.228.168"
#wsrep_cluster_address="gcomm://"
wsrep_cluster_name='galera_cluster'
wsrep_node_address='10.105.228.147'

because if I'm reading this right, it's not finding the other members of the cluster.

Quote:

160111 18:02:49 [Warning] Hostname 'db1-smartshop' does not resolve to '10.105.228.147'.
160111 18:02:49 [Note] Hostname 'db1-smartshop' has the following IP addresses:
160111 18:02:49 [Note] - 159.8.42.247
160111 18:02:49 [Warning] IP address '10.105.228.168' could not be resolved: Name or service not known
160111 18:02:49 [Warning] IP address '10.105.228.175' could not be resolved: Name or service not known

Which might explain why the configuration file is intermittently resetting/failing to replicate through the cluster.

artemidis · 01-14-2016, 04:51 AM

Hi dijetlo,
thanks for your reply.
Yes, they are replicating. I got rid of the warning by fixing my DNS, but the problem has come back...
Any ideas?

Antonio

dijetlo · 01-14-2016, 05:00 AM

Sorry Art, I've never used the Galera product. The only question I might ask is are you using some type of configuration management software (chef/Ansible/etc)? That's normally the first place I check when I have inconsistent configurations across a class of servers.

artemidis · 01-14-2016, 05:09 AM

No, I configured everything manually, and until 2 months ago everything in the production environment worked well...

dijetlo · 01-14-2016, 05:35 AM

Well, if it's not a configuration management problem and it's not associated with replication issues, that only leaves MySQL as a possible culprit.

I'm assuming you're running a syntax checker on your modifications to the cnf file and validating the changes are surviving mysql stop/start and an instance reboot. Assuming that's true, that suggests it's being reset inadvertently by the use of some tool, MySQL Workbench, for example. Are you the only person working on these DBs?
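
One way to make that validation routine is to record the value each node reports for SHOW GLOBAL VARIABLES LIKE 'max_allowed_packet' and compare it with the configured one. A minimal sketch of the comparison step only; the function name drifted_nodes and the sample numbers are illustrative assumptions, not values taken from this thread.

```python
EXPECTED = 200 * 1024 * 1024  # 200M from the configuration file, in bytes

def drifted_nodes(reported: dict, expected: int = EXPECTED) -> list:
    """Return the names of nodes whose reported value differs from expected."""
    return sorted(name for name, value in reported.items() if int(value) != expected)

# Values as they might be returned by
#   SHOW GLOBAL VARIABLES LIKE 'max_allowed_packet';
# on each node (made-up numbers for illustration).
reported = {"db1": "209715200", "db2": "1048576", "db3": "209715200"}
print(drifted_nodes(reported))  # ['db2']
```

Running such a check on a schedule and noting the time of the first mismatch would narrow down when the value changes.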

artemidis · 01-14-2016, 05:51 AM

Yes, I'm the only one working on the DBs.
I do not believe any tool is modifying the configuration, because the issue is random. The only tool I use to connect to the DBs is a mysql client, nothing more... Maybe the application on the front-end servers is causing the modification, but that would be very strange.
Restarting the services or rebooting the instance fixes the issue temporarily, but generally I use the "SET GLOBAL max_allowed_packet=209715200" command to hot-fix the issue.
In the log file I cannot see anything that points me to the right solution...