随着mysql mgr版本的升级和技术的成熟,mgr在取代mha成为mysql高可用首选方案后,其搭建过程虽然不算复杂,但仍涉及一系列手动操作。为了简化mgr的搭建和故障诊断,开发了一个自动化脚本,用于实现mgr的自动化搭建、自动化故障诊断及修复。
为了简化起见,MGR的自动化搭建测试采用单机多实例模式,预先安装了三个MySQL实例,端口号分别为7001、7002和7003,其中7001作为写入节点,其余两个节点作为读取节点,8000节点为作者的另一个测试节点,可忽略。在指定主从节点的情况下,以下是mgr_tool.py一键搭建MGR集群的测试demo。
MGR故障模拟1
MGR节点故障的自动监测和自愈实现如下图所示,搭建完成后的MGR集群目前处于完全正常的状态。
主观造成主从节点间binlog的丢失
在主节点上对从节点丢失的数据进行操作,GTID无法找到对应的数据,组复制立即停止。
非写入节点出现错误
查看errorlog
如果手动解决,仍需使用GTID跳过错误事务的常规方法,获取master上的GTID信息。
尝试跳过最新的一个事务ID,然后重新连接到组,可以正常连接到组,另一个节点仍处于错误状态。
stop group_replication;
SET GTID_NEXT=’6c81c118-e67c-4416-9cb0-2d573d178c1d:13′;
BEGIN; COMMIT;
set gtid_next=’automatic’;
东盟商机最新AC版As2007 SP1
AS系统本次的主要更新和新开发的功能如下(暂不详述): 1、修复了普及版的一些大大小小的BUG 2、重新规划整个后台,使后台更加个性化、智能化、更加易用 3、重写了广告部分模块,使其更加专业化 4、重写了文章采集模块,添加了定时自动采集功能 5、添加了供求信息采集功能 6、重写了友情连接功能(原来的太简单了) 8、重写了生成HTML模块。(几个主要模块首页不用原来的生成方式,不再会被卡巴斯机杀毒软
0 查看详情
另一个节点类似,依次解决。
MGR故障模拟2
从节点脱离Group
这种情况相对简单,重新启动组复制即可,执行start group_replication。
MGR故障自动检测和修复
对于上述两种情况,1,如果是从节点丢失主节点的事务,尝试在从节点上跳过GTID,重新开始复制即可;2,如果是从节点非丢失主节点事务,尝试在从节点重新开始组复制即可。
实现代码如下:
def auto_fix_mgr_error(conn_master_dict,conn_slave_dict): group_replication_status = get_group_replication_status(conn_slave_dict) if(group_replication_status[0]["MEMBER_STATE"]=="ERROR" or group_replication_status[0]["MEMBER_STATE"] == "OFFLINE"): print(conn_slave_dict["host"]+str(conn_slave_dict["port"])+'------>'+group_replication_status[0]["MEMBER_STATE"]) print("auto fixing......") while 1 > 0: master_gtid_list = get_gtid(conn_master_dict) slave_gtid_list = get_gtid(conn_slave_dict) master_executed_gtid_value = int((master_gtid_list[-1]["Executed_Gtid_Set"]).split("-")[-1]) slave_executed_gtid_value = int(slave_gtid_list[-1]["Executed_Gtid_Set"].split("-")[-1]) slave_executed_gtid_prefix = slave_gtid_list[-1]["Executed_Gtid_Set"].split(":")[0] slave_executed_skiped_gtid = slave_executed_gtid_value + 1 if (master_executed_gtid_value > slave_executed_gtid_value): print("skip gtid and restart group replication,skiped gtid is " + slave_gtid_list[-1]["Executed_Gtid_Set"].split(":")[-1].split("-")[0] + ":"+str(slave_executed_skiped_gtid)) slave_executed_skiped_gtid = slave_executed_gtid_prefix+":"+str(slave_executed_skiped_gtid) skip_gtid_on_slave(conn_slave_dict,slave_executed_skiped_gtid) time.sleep(10) start_group_replication(conn_slave_dict) if(get_group_replication_status(conn_slave_dict)[0]["MEMBER_STATE"]=="ONLINE"): print("mgr cluster fixed,back to normal") break else: start_group_replication(conn_slave_dict) if(get_group_replication_status(conn_slave_dict)[0]["MEMBER_STATE"]=="ONLINE"): print("mgr cluster fixed,back to normal") break elif (group_replication_status[0]['MEMBER_STATE'] == 'ONLINE'): print("mgr cluster is normal,nothing to do") check_replication_group_members(conn_slave_dict)
对于故障类型1,GTID事务不一致的自动化修复
对于故障类型2从节点offline的自动化修复
完整的实现代码
该过程要求MySQL实例必须满足MGR的基本条件,如果环境本身无法满足MGR,一切都无从谈起,因此要非常清楚MGR环境的最基本要求。
完成的实现代码如下,花了一个下午编写,目前存在以下不足:1,创建复制用户时,未指定具体的slave机器,目前直接指定为%:create user repl@’%’ identified by repl2;2,对于slave的修复,目前无法整体修复,只能一台一台修复,实际上是缺少了一个循环slave机器判断的过程;3,目前搭建之前都会reset master(不管主从,主要是清理可能的残留GTID),因此只适合新环境的搭建;4,目前只支持offline和gtid事务冲突的错误类型修复,无法支持其他MGR错误类型的修复;5,开发环境是单机多实例模式测试,没有在多机单实例模式下充分测试。这些问题将逐步改善和加强。
# -*- coding: utf-8 -*-import pymysqlimport loggingimport timeimport decimal
def execute_query(conn_dict,sql):conn = pymysql.connect(host=conn_dict['host'], port=conn_dict['port'], user=conn_dict['user'], passwd=conn_dict['password'], db=conn_dict['db'])cursor = conn.cursor(pymysql.cursors.DictCursor)cursor.execute(sql)list = cursor.fetchall()cursor.close()conn.close()return list
def execute_noquery(conn_dict,sql):conn = pymysql.connect(host=conn_dict['host'], port=conn_dict['port'], user=conn_dict['user'], passwd=conn_dict['password'], db=conn_dict['db'])cursor = conn.cursor()cursor.execute(sql)conn.commit()cursor.close()conn.close()return list
def get_gtid(conn_dict):sql = "show master status;"list = execute_query(conn_dict,sql)return list
def skip_gtid_on_slave(conn_dict,gtid):sql_1 = 'stop group_replication;'sql_2 = '''set gtid_next='{0}';'''.format(gtid)sql_3 = 'begin;'sql_4 = 'commit;'sql_5 = '''set gtid_next='automatic';'''
try: execute_noquery(conn_dict, sql_1) execute_noquery(conn_dict, sql_2) execute_noquery(conn_dict, sql_3) execute_noquery(conn_dict, sql_4) execute_noquery(conn_dict, sql_5)except: raisedef get_group_replication_status(conn_dict):sql = '''select MEMBER_STATE from performance_schema.replication_group_members where (MEMBER_HOST = '{0}' or ifnull(MEMBER_HOST,'') = '') AND (MEMBER_PORT={1} or ifnull(MEMBER_PORT,'') ='') ; '''.format(conn_dict["host"], conn_dict["port"])result = execute_query(conn_dict,sql)if result:return resultelse:return None
def check_replication_group_members(conn_dict):print('-------------------------------------------------------')result = execute_query(conn_dict, " select * from performance_schema.replication_group_members; ")if result:column = result[0].keys()current_row = ''for key in column:current_row += str(key) + " "print(current_row)
for row in result: current_row = '' for key in row.values(): current_row += str(key) + " " print(current_row)print('-------------------------------------------------------')def auto_fix_mgr_error(conn_master_dict,conn_slave_dict):group_replication_status = get_group_replication_status(conn_slave_dict)if(group_replication_status[0]["MEMBER_STATE"]=="ERROR" or group_replication_status[0]["MEMBER_STATE"] == "OFFLINE"):print(conn_slave_dict["host"]+str(conn_slave_dict["port"])+'------>'+group_replication_status[0]["MEMBER_STATE"])print("auto fixing......")while 1 > 0:master_gtid_list = get_gtid(conn_master_dict)slave_gtid_list = get_gtid(conn_slave_dict)master_executed_gtid_value = int((master_gtid_list[-1]["Executed_Gtid_Set"]).split("-")[-1])slave_executed_gtid_value = int(slave_gtid_list[-1]["Executed_Gtid_Set"].split("-")[-1])slave_executed_gtid_prefix = slave_gtid_list[-1]["Executed_Gtid_Set"].split(":")[0]slave_executed_skiped_gtid = slave_executed_gtid_value + 1if (master_executed_gtid_value > slave_executed_gtid_value):print("skip gtid and restart group replication,skiped gtid is "
slave_gtid_list[-1]["Executed_Gtid_Set"].split(":")[-1].split("-")[0]":"+str(slave_executed_skiped_gtid))slave_executed_skiped_gtid = slave_executed_gtid_prefix+":"+str(slave_executed_skiped_gtid)skip_gtid_on_slave(conn_slave_dict,slave_executed_skiped_gtid)time.sleep(10)start_group_replication(conn_slave_dict)if(get_group_replication_status(conn_slave_dict)[0]["MEMBER_STATE"]=="ONLINE"):print("mgr cluster fixed,back to normal")breakelse:start_group_replication(conn_slave_dict)if(get_group_replication_status(conn_slave_dict)[0]["MEMBER_STATE"]=="ONLINE"):print("mgr cluster fixed,back to normal")breakelif (group_replication_status[0]['MEMBER_STATE'] == 'ONLINE'):print("mgr cluster is normal,nothing to do")check_replication_group_members(conn_slave_dict)
'''reset master'''def reset_master(conn_dict):try:execute_noquery(conn_dict, "reset master;")except:raise
def install_group_replication_plugin(conn_dict):get_plugin_sql = "SELECT name,dl FROM mysql.plugin WHERE name = 'group_replication';"install_plugin_sql = '''install plugin group_replication soname 'group_replication.so'; '''try:result = execute_query(conn_dict, get_plugin_sql)if not result:execute_noquery(conn_dict, install_plugin_sql)except:raise
def create_mgr_repl_user(conn_master_dict,user,password):try:reset_master(conn_master_dict)sql_exists_user = '''select user from mysql.user where user = '{0}'; '''.format(user)user_list = execute_query(conn_master_dict,sql_exists_user)if not user_list:create_user_sql = '''create user {0}@'%' identified by '{1}'; '''.format(user,password)grant_privilege_sql = '''grant replication slave on . to {0}@'%';'''.format(user)execute_noquery(conn_master_dict,create_user_sql)execute_noquery(conn_master_dict, grant_privilege_sql)execute_noquery(conn_master_dict, "flush privileges;")except:raise
def set_super_read_only_off(conn_dict):super_read_only_off = '''set global super_read_only = 0;'''execute_noquery(conn_dict, super_read_only_off)
def open_group_replication_bootstrap_group(conn_dict):sql = '''select variable_name,variable_value from performance_schema.global_variables where variable_name = 'group_replication_bootstrap_group';'''result = execute_query(conn_dict, sql)open_bootstrap_group_sql = '''set @@global.group_replication_bootstrap_group=on;'''if result and result[0]['variable_value']=="OFF":execute_noquery(conn_dict, open_bootstrap_group_sql)
def close_group_replication_bootstrap_group(conn_dict):sql = '''select variable_name,variable_value from performance_schema.global_variables where variable_name = 'group_replication_bootstrap_group';'''result = execute_query(conn_dict, sql)close_bootstrap_group_sql = '''set @@global.group_replication_bootstrap_group=off;'''if result and result[0]['variable_value'] == "ON":execute_noquery(conn_dict, close_bootstrap_group_sql)
def start_group_replication(conn_dict):start_group_replication = '''start group_replication;'''group_replication_status = get_group_replication_status(conn_dict)if not (group_replication_status[0]['MEMBER_STATE'] == 'ONLINE'):execute_noquery(conn_dict, start_group_replication)
def connect_to_group(conn_dict,repl_user,repl_password):connect_to_group_sql = '''change master to master_user='{0}', master_password='{1}' for channel 'group_replication_recovery'; '''.format(repl_user,repl_password)try:execute_noquery(conn_dict, connect_to_group_sql)except:raise
def start_mgr_on_master(conn_master_dict,repl_user,repl_password):try:set_super_read_only_off(conn_master_dict)reset_master(conn_master_dict)create_mgr_repl_user(conn_master_dict,repl_user,repl_password)connect_to_group(conn_master_dict,repl_user,repl_password)
open_group_replication_bootstrap_group(conn_master_dict) start_group_replication(conn_master_dict) close_group_replication_bootstrap_group(conn_master_dict) group_replication_status = get_group_replication_status(conn_master_dict) if (group_replication_status[0]['MEMBER_STATE'] == 'ONLINE'): print("master added in mgr and run successfully") return Trueexcept: raiseprint("############start master mgr error################")exit(1)def start_mgr_on_slave(conn_slave_dict,repl_user,repl_password):try:set_super_read_only_off(conn_slave_dict)reset_master(conn_slave_dict)connect_to_group(conn_slave_dict,repl_user,repl_password)start_group_replication(conn_slave_dict)
wait for 10
time.sleep(10) # then check mgr status group_replication_status = get_group_replication_status(conn_slave_dict) if (group_replication_status[0]['MEMBER_STATE'] == 'ONLINE'): print("slave added in mgr and run successfully") if (group_replication_status[0]['MEMBER_STATE'] == 'RECOVERING'): print("slave is recovering")except: print("############start slave mgr error################") exit(1)def auto_mgr(conn_master,conn_slave_1,conn_slave_2,repl_user,repl_password):install_group_replication_plugin(conn_master)master_replication_status = get_group_replication_status(conn_master)
if not (master_replication_status[0]['MEMBER_STATE'] == 'ONLINE'): start_mgr_on_master(conn_master,repl_user,repl_password)slave1_replication_status = get_group_replication_status(conn_slave_1)if not (slave1_replication_status[0]['MEMBER_STATE'] == 'ONLINE'): install_group_replication_plugin(conn_slave_1) start_mgr_on_slave(conn_slave_1, repl_user, repl_user)slave2_replication_status = get_group_replication_status(conn_slave_2)if not (slave2_replication_status[0]['MEMBER_STATE'] == 'ONLINE'): install_group_replication_plugin(conn_slave_2) start_mgr_on_slave(conn_slave_2, repl_user, repl_user)check_replication_group_members(conn_master)if name == 'main':conn_master = {'host': '127.0.0.1', 'port': 7001, 'user': 'root', 'password': 'root', 'db': 'mysql', 'charset': 'utf8mb4'}conn_slave_1 = {'host': '127.0.0.1', 'port': 7002, 'user': 'root', 'password': 'root', 'db': 'mysql', 'charset': 'utf8mb4'}conn_slave_2 = {'host': '127.0.0.1', 'port': 7003, 'user': 'root', 'password': 'root', 'db': 'mysql', 'charset': 'utf8mb4'}repl_user = "repl"repl_password = "repl"
auto_mgr(conn_master,conn_slave_1,conn_slave_2,repl_user,repl_password)
auto_fix_mgr_error(conn_master,conn_slave_1)check_replication_group_members(conn_master)以上就是MySQL MGR集群单主模式的自动搭建和自动化故障修复的详细内容,更多请关注创想鸟其它相关文章!
版权声明:本文内容由互联网用户自发贡献,该文观点仅代表作者本人。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。
如发现本站有涉嫌抄袭侵权/违法违规的内容, 请发送邮件至 chuangxiangniao@163.com 举报,一经查实,本站将立刻删除。
发布者:程序猿,转转请注明出处:https://www.chuangxiangniao.com/p/814295.html
微信扫一扫
支付宝扫一扫