galera mysql cluster 故障節點再次接入集群遇到問題該怎么辦

發布時間：2021-11-16 11:10:17 來源：億速云閱讀：525 作者：柒染欄目：MySQL數據庫

這期內容當中小編將會給大家帶來有關galera mysql cluster 故障節點再次接入集群遇到問題該怎么辦，文章內容豐富且以專業的角度為大家分析和敘述，閱讀完這篇文章希望大家可以有所收獲。

galera cluster  是mysql 的多主集群.
我們目前搭建了3個節點的測試集群.
第一輪測試的時候, 發現一個問題,節點故障了, 下線,然后重新加入集群,無法加入.

然后直接整個節點內容作為一個新節點加入, 也是失敗的. 搞了兩天, 頭大了. 失敗告終.

報錯信息如下:

170609 16:55:59 [Note] WSREP: Read nil XID from storage engines, skipping position init
170609 16:55:59 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera-3/libgalera_smm.so'
170609 16:55:59 [Note] WSREP: wsrep_load(): Galera 3.20(r7e383f7) by Codership Oy <info@codership.com> loaded successfully.
170609 16:55:59 [Note] WSREP: CRC-32C: using hardware acceleration.
170609 16:55:59 [Note] WSREP: Found saved state: 51391c6d-4bff-11e7-a1c3-b797743e8629:824276, safe_to_bootsrap: 0
170609 16:55:59 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 192.168.11.152; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 300M; gcache.recover = no; gcache.size = 300M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc
170609 16:55:59 [Note] WSREP: GCache history reset: old(51391c6d-4bff-11e7-a1c3-b797743e8629:0) -> new(51391c6d-4bff-11e7-a1c3-b797743e8629:824276)
170609 16:55:59 [Note] WSREP: Assign initial position for certification: 824276, protocol version: -1
170609 16:55:59 [Note] WSREP: wsrep_sst_grab()
170609 16:55:59 [Note] WSREP: Start replication
170609 16:55:59 [Note] WSREP: Setting initial position to 51391c6d-4bff-11e7-a1c3-b797743e8629:824276
170609 16:55:59 [Note] WSREP: protonet asio version 0
170609 16:55:59 [Note] WSREP: Using CRC-32C for message checksums.
170609 16:55:59 [Note] WSREP: backend: asio
170609 16:55:59 [Note] WSREP: gcomm thread scheduling priority set to other:0
170609 16:55:59 [Warning] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)
170609 16:55:59 [Note] WSREP: restore pc from disk failed
170609 16:55:59 [Note] WSREP: GMCast version 0
170609 16:55:59 [Warning] WSREP: Failed to resolve tcp:// 192.168.11.98:4567
170609 16:55:59 [Warning] WSREP: Failed to resolve tcp:// 192.168.12.75 :4567
170609 16:55:59 [Note] WSREP: (753e6ee4, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
170609 16:55:59 [Note] WSREP: (753e6ee4, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
170609 16:55:59 [Note] WSREP: EVS version 0
170609 16:55:59 [Note] WSREP: gcomm: connecting to group 'mycluster', peer '192.168.11.152:, 192.168.11.98:, 192.168.12.75 :'
170609 16:55:59 [Note] WSREP: (753e6ee4, 'tcp://0.0.0.0:4567') connection established to 753e6ee4 tcp://192.168.11.152:4567
170609 16:55:59 [Warning] WSREP: (753e6ee4, 'tcp://0.0.0.0:4567') address 'tcp://192.168.11.152:4567' points to own listening address, blacklisting
170609 16:56:02 [Warning] WSREP: no nodes coming from prim view, prim not possible
170609 16:56:02 [Note] WSREP: view(view_id(NON_PRIM,753e6ee4,1) memb {
      753e6ee4,0
} joined {
} left {
} partitioned {
})
170609 16:56:02 [Note] WSREP: (753e6ee4, 'tcp://0.0.0.0:4567') connection to peer 753e6ee4 with addr tcp://192.168.11.152:4567 timed out, no messages seen in PT3S
170609 16:56:03 [Warning] WSREP: last inactive check more than PT1.5S ago (PT3.50193S), skipping check
170609 16:56:32 [Note] WSREP: view((empty))
170609 16:56:32 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
      at gcomm/src/pc.cpp:connect():158
170609 16:56:32 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
170609 16:56:32 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1404: Failed to open channel 'mycluster' at 'gcomm://192.168.11.152, 192.168.11.98, 192.168.12.75 ? gmcast.segment=0 & evs.max_install_timeouts=1': -110 (Connection timed out)
170609 16:56:32 [ERROR] WSREP: gcs connect failed: Connection timed out
170609 16:56:32 [ERROR] WSREP: wsrep::connect(gcomm://192.168.11.152, 192.168.11.98, 192.168.12.75 ? gmcast.segment=0 & evs.max_install_timeouts=1) failed: 7
170609 16:56:32 [ERROR] Aborting

170609 16:56:32 [Note] WSREP: Service disconnected.
170609 16:56:33 [Note] WSREP: Some threads may fail to exit.
170609 16:56:33 [Note] /usr/sbin/mysqld: Shutdown complete

然后就在也加入不了集群了.

人都蒙了, 一度懷疑國內的最大的集群是怎么維護這個問題的?

刪除所有的測試vm , 從新安裝os .從新來過.

這兩天重新開始測試這個問題.

繼續重復測試這個案例.

節點刪除后, 重現了相同的問題.

幾點不管是清空所有數據,重新加入, 還是保留原數據加入集群. 都是失敗, 報錯信息跟上面是一樣的.

又無解了.

又開始郁悶了. 按理說不應該. 開始分析報錯信息. 從信息上了. 似乎總是讀了第一個節點, 也就是本身這個節點.
報錯無法連接. 然后重復7次, 然后timeout 退出.

我們集群有3個節點, 不應該啊, 第一個無法連接, 應該會roundrobin  嘗試后面的節點連接啊.
但是從日志里,沒有體現出來這個問題.

我突然開始懷疑這部門軟件代碼的設計上是不是有問題呢?

源代碼就不用看了, 我們可以修改下配嘛.

于是我修改了 wsrep_cluster_address 的配置把第一個節點的ip 的位置拿到了最后面.

然后重新啟動數據庫, 奇跡發生了.

170609 16:57:09 [Note] WSREP: Read nil XID from storage engines, skipping position init
170609 16:57:09 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera-3/libgalera_smm.so'
170609 16:57:09 [Note] WSREP: wsrep_load(): Galera 3.20(r7e383f7) by Codership Oy <info@codership.com> loaded successfully.
170609 16:57:09 [Note] WSREP: CRC-32C: using hardware acceleration.
170609 16:57:09 [Note] WSREP: Found saved state: 51391c6d-4bff-11e7-a1c3-b797743e8629:-1, safe_to_bootsrap: 0
170609 16:57:09 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 192.168.11.152; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 300M; gcache.recover = no; gcache.size = 300M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc
170609 16:57:09 [Note] WSREP: GCache history reset: old(51391c6d-4bff-11e7-a1c3-b797743e8629:0) -> new(51391c6d-4bff-11e7-a1c3-b797743e8629:824276)
170609 16:57:09 [Note] WSREP: Assign initial position for certification: 824276, protocol version: -1
170609 16:57:09 [Note] WSREP: wsrep_sst_grab()
170609 16:57:09 [Note] WSREP: Start replication
170609 16:57:09 [Note] WSREP: Setting initial position to 51391c6d-4bff-11e7-a1c3-b797743e8629:824276
170609 16:57:09 [Note] WSREP: protonet asio version 0
170609 16:57:09 [Note] WSREP: Using CRC-32C for message checksums.
170609 16:57:09 [Note] WSREP: backend: asio
170609 16:57:09 [Note] WSREP: gcomm thread scheduling priority set to other:0
170609 16:57:09 [Warning] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)
170609 16:57:09 [Note] WSREP: restore pc from disk failed
170609 16:57:09 [Note] WSREP: GMCast version 0
170609 16:57:09 [Warning] WSREP: Failed to resolve tcp:// 192.168.12.75:4567
170609 16:57:09 [Note] WSREP: (9f2dfc7e, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
170609 16:57:09 [Note] WSREP: (9f2dfc7e, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
170609 16:57:09 [Note] WSREP: EVS version 0
170609 16:57:09 [Note] WSREP: gcomm: connecting to group 'mycluster', peer '192.168.11.98:, 192.168.12.75:,192.168.11.152 :'
170609 16:57:09 [Note] WSREP: (9f2dfc7e, 'tcp://0.0.0.0:4567') connection established to 9f2dfc7e tcp://192.168.11.152:4567
170609 16:57:09 [Warning] WSREP: (9f2dfc7e, 'tcp://0.0.0.0:4567') address 'tcp://192.168.11.152:4567' points to own listening address, blacklisting
170609 16:57:09 [Note] WSREP: (9f2dfc7e, 'tcp://0.0.0.0:4567') connection established to 017c00ff tcp://192.168.11.98:4567
170609 16:57:09 [Note] WSREP: (9f2dfc7e, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.12.75:4567
170609 16:57:10 [Note] WSREP: (9f2dfc7e, 'tcp://0.0.0.0:4567') connection established to 325d47d6 tcp://192.168.12.75:4567
170609 16:57:10 [Note] WSREP: declaring 017c00ff at tcp://192.168.11.98:4567 stable
170609 16:57:10 [Note] WSREP: declaring 325d47d6 at tcp://192.168.12.75:4567 stable
170609 16:57:10 [Note] WSREP: Node 017c00ff state prim
170609 16:57:10 [Note] WSREP: view(view_id(PRIM,017c00ff,13) memb {
      017c00ff,0
      325d47d6,0
      9f2dfc7e,0
} joined {
} left {
} partitioned {
})
170609 16:57:10 [Note] WSREP: save pc into disk
170609 16:57:10 [Note] WSREP: gcomm: connected
170609 16:57:10 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
170609 16:57:10 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
170609 16:57:10 [Note] WSREP: Opened channel 'mycluster'
170609 16:57:10 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 2, memb_num = 3
170609 16:57:10 [Note] WSREP: Waiting for SST to complete.
170609 16:57:10 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
170609 16:57:10 [Note] WSREP: STATE EXCHANGE: sent state msg: 9f7acc66-4cf1-11e7-878b-a2d0231f71b0
170609 16:57:10 [Note] WSREP: STATE EXCHANGE: got state msg: 9f7acc66-4cf1-11e7-878b-a2d0231f71b0 from 0 (11_98)
170609 16:57:10 [Note] WSREP: STATE EXCHANGE: got state msg: 9f7acc66-4cf1-11e7-878b-a2d0231f71b0 from 1 (12_75)
170609 16:57:10 [Note] WSREP: STATE EXCHANGE: got state msg: 9f7acc66-4cf1-11e7-878b-a2d0231f71b0 from 2 (11_152)
170609 16:57:10 [Note] WSREP: Quorum results:
      version = 4,
      component  = PRIMARY,
      conf_id = 12,
      members = 3/3 (joined/total),
      act_id    = 824276,
      last_appl. = -1,
      protocols  = 0/7/3 (gcs/repl/appl),
      group UUID = 51391c6d-4bff-11e7-a1c3-b797743e8629
170609 16:57:10 [Note] WSREP: Flow-control interval: [28, 28]
170609 16:57:10 [Note] WSREP: Restored state OPEN -> JOINED (824276)
170609 16:57:10 [Note] WSREP: New cluster view: global state: 51391c6d-4bff-11e7-a1c3-b797743e8629:824276, view# 13: Primary, number of nodes: 3, my index: 2, protocol version 3
170609 16:57:10 [Note] WSREP: SST complete, seqno: 824276
170609 16:57:10 [Note] WSREP: Member 2.0 (11_152) synced with group.
170609 16:57:10 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 824276)
170609 16:57:10 [Note] Plugin 'FEDERATED' is disabled.
170609 16:57:10 InnoDB: The InnoDB memory heap is disabled
170609 16:57:10 InnoDB: Mutexes and rw_locks use InnoDB's own implementation
170609 16:57:10 InnoDB: Compressed tables use zlib 1.2.3
170609 16:57:10 InnoDB: Using Linux native AIO
170609 16:57:10 InnoDB: Initializing buffer pool, size = 122.0M
170609 16:57:10 InnoDB: Completed initialization of buffer pool
170609 16:57:10 InnoDB: highest supported file format is Barracuda.
170609 16:57:11  InnoDB: Waiting for the background threads to start
170609 16:57:12 InnoDB: 5.5.54 started; log sequence number 6024720364
170609 16:57:12 [Note] Server hostname (bind-address): '0.0.0.0'; port: 3306
170609 16:57:12 [Note] - '0.0.0.0' resolves to '0.0.0.0';
170609 16:57:12 [Note] Server socket created on IP: '0.0.0.0'.
170609 16:57:12 [Note] Event Scheduler: Loaded 0 events
170609 16:57:12 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.54'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Community Server (GPL), wsrep_25.19.20170106.aa7e07d
170609 16:57:12 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
170609 16:57:12 [Note] WSREP: REPL Protocols: 7 (3, 2)
170609 16:57:12 [Note] WSREP: Assign initial position for certification: 824276, protocol version: 3
170609 16:57:12 [Note] WSREP: Service thread queue flushed.
170609 16:57:12 [Note] WSREP: Synchronized with group, ready for connections
170609 16:57:12 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
170609 16:57:13 [Note] WSREP: (9f2dfc7e, 'tcp://0.0.0.0:4567') connection to peer 9f2dfc7e with addr tcp://192.168.11.152:4567 timed out, no messages seen in PT3S
170609 16:57:13 [Note] WSREP: (9f2dfc7e, 'tcp://0.0.0.0:4567') turning message relay requesting off

節點順利的連接并加入了集群.

然后我又測試了, 把數據文件都清空的情況, 也是順利的加入了集群,并自動完成了數據同步,

從另個一個幾點的日志可以看到數據同步的情況:

170608 12:05:48 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 3
170608 12:05:48 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
"/var/log/mysqld.log" 744L, 62162C                                                                               112,1       13%
170609 16:42:43 [Note] WSREP: Quorum results:
      version = 4,
      component  = PRIMARY,
      conf_id = 4,
      members = 2/3 (joined/total),
      act_id    = 824275,
      last_appl. = 824274,
      protocols  = 0/7/3 (gcs/repl/appl),
      group UUID = 51391c6d-4bff-11e7-a1c3-b797743e8629
170609 16:42:43 [Note] WSREP: Flow-control interval: [28, 28]
170609 16:42:43 [Note] WSREP: New cluster view: global state: 51391c6d-4bff-11e7-a1c3-b797743e8629:824275, view# 5: Primary, number of nodes: 3, my index: 0, protocol version 3
170609 16:42:43 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
170609 16:42:43 [Note] WSREP: REPL Protocols: 7 (3, 2)
170609 16:42:43 [Note] WSREP: Assign initial position for certification: 824275, protocol version: 3
170609 16:42:43 [Note] WSREP: Service thread queue flushed.
170609 16:42:43 [Note] WSREP: Member 1.0 (11_152) requested state transfer from '*any*'. Selected 0.0 (11_98)(SYNCED) as donor.
170609 16:42:43 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 824275)
170609 16:42:43 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
170609 16:42:43 [Note] WSREP: Running: 'wsrep_sst_rsync --role 'donor' --address '192.168.11.152:4444/rsync_sst' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --gtid '51391c6d-4bff-11e7-a1c3-b797743e8629:824275''
170609 16:42:43 [Note] WSREP: sst_donor_thread signaled with 0
170609 16:42:43 [Note] WSREP: Flushing tables for SST...
170609 16:42:43 [Note] WSREP: Provider paused at 51391c6d-4bff-11e7-a1c3-b797743e8629:824275 (831018)
170609 16:42:43 [Note] WSREP: Tables flushed.

通過這一點, 也基本上驗證了我的猜測.

節點在退出集群后, 從新加入的時候, 如果這個故障節點的ip 在自己的配置文件 wsrep_cluster_address 的選項中的第一個ip .

那么這個節點是永遠都無法再加入這個集群了.

怎么辦呢, 把他的ip 從這個配置項里面,換一下位置. 這個問題就完美解決了.

通過進一步的測試. 如果這個節點是master , 通過--wsrep-new-cluster 啟動的節點,如果ip 排在第一位會有這個問題.
如果這個節點經過上述的步驟能夠重新加入解群了.  那么這個節點應該就拿不到這個master 的角色了.

這個時候, 就不會發生上述的問題, 即便ip 排在第一個的位置,也是可以加入集群的.

這個應該是一個bug 了.

再進一步驗證后, 可以提交bug 記錄了.

規避這個問題的方案就是節點的機器上的配置 wsrep-cluster-address 的配置選項里, 本機的ip 不要放在第一位.

上述就是小編為大家分享的galera mysql cluster 故障節點再次接入集群遇到問題該怎么辦了，如果剛好有類似的疑惑，不妨參照上述分析進行理解。如果想知道更多相關知識，歡迎關注億速云行業資訊頻道。

向AI問一下細節

中文字幕av专区_日韩电影在线播放_精品国产精品久久一区免费式_av在线免费观看网站

galera mysql cluster 故障節點再次接入集群遇到問題該怎么辦

規避這個問題的方案就是節點的機器上的配置 wsrep-cluster-address 的配置選項里, 本機的ip 不要放在第一位.

猜你喜歡

中文字幕av专区_日韩电影在线播放_精品国产精品久久一区免费式_av在线免费观看网站

galera mysql cluster 故障節點再次接入集群遇到問題該怎么辦

規避這個問題的方案就是節點的機器上的配置 wsrep-cluster-address 的配置選項里, 本機的ip 不要放在第一位.

猜你喜歡

最新資訊

相關推薦

相關標簽