草庐IT

记录线上服务出现MySQL死锁的排查过程

妩媚的轻狂书生 2023-03-28 原文

死锁原因

MySQL中的锁种类繁多,并且MySQL自带死锁检测机制。虽然正常的业务开发中很难遇到死锁的情况,但是最近团队里的一个服务它真的出现了deadlock!!!
遇事不慌,对于别人提的bug或者问题,我们首先确认是不是这样,再来考虑怎么办。所以,我让运维把相关服务日志打包发给我,这里截取一段:

[2022-04-20 13:02:46.360] [http-nio-8080-exec-88] [ERROR] [ at com.cmonelink.osms.cnpnplatform.exception.CustomExceptionResolver.handleException(CustomExceptionResolver.java:137)
] 通用异常 
org.springframework.dao.CannotAcquireLockException: 
### Error updating database.  Cause: com.mysql.cj.jdbc.exceptions.MySQLTransactionRollbackException: Lock wait timeout exceeded; try restarting transaction
### The error may involve com.cmonelink.osms.cnpnplatform.mapper.CbDsmPerformanceNrcellMapper.deleteByList-Inline
### The error occurred while setting parameters
### SQL: delete from CB_DSM_PERFORMANCE_NRCELL         WHERE                        NRCELLDU_ID=? and START_TIME=?          or              NRCELLDU_ID=? and START_TIME=?          or              NRCELLDU_ID=? and START_TIME=?          or              NRCELLDU_ID=? and START_TIME=?          or              NRCELLDU_ID=? and START_TIME=?          or              NRCELLDU_ID=? and START_TIME=?          or              NRCELLDU_ID=? and 

从日志出发,我们可以得到:

  • 确实出现了MySQL锁相关异常(MySQLTransactionRollbackException: Lock wait timeout exceeded)
  • 锁异常的原因是超时
    所以,这并不能代表出现了死锁的情况。因为锁等待超时和死锁是两个概念。于是,我让运维人员在线上MySQL上执行SHOW ENGINE INNODB STATUS,下面截取一段执行结果:
------------------------
LATEST DETECTED DEADLOCK
------------------------
2022-04-20 10:30:22 0x7f7eaf419700
*** (1) TRANSACTION:
TRANSACTION 51226606, ACTIVE 1 sec inserting
mysql tables in use 1, locked 1
LOCK WAIT 129 lock struct(s), heap size 24784, 403 row lock(s), undo log entries 201
MySQL thread id 6687, OS thread handle 140181895087872, query id 41203034 10.215.0.50 cbapp update
INSERT INTO CB_DSM_ZG_GNODEB(NE_ID, LIFECYCLE_STATUS,DEVICE_TYPE,FREQUENCY,GNODEB_NAME,
        LONGITUDE, CREATE_TIME, LATITUDE)
        VALUES
          
            ('2299606',2,1,'2.6GHz','name1',null,
            '2022-04-20 10:30:22.081',null)
         , 
            ('2318773',2,1,'2.6GHz','name2',null,
            '2022-04-20 10:30:22.081',null)
         , 
            ('2318891',2,1,'2.6GHz','name3',null,
            '2022-04-20 10:30:22.081',null)
         , 
            ('2318977',2,1,'2.6GHz','name4',null,
            '2022-04-20 10:30:22.081',null)
         , 
            ('2311856',2,1,'2.6GHz','name5',null,
            '2022-04-20 10:30:22.081',null)
         , 
            ('2306380',2,1,'2.6GHz','name6',null,
            '2022-04-20 10:30

*** (1) HOLDS THE LOCK(S):
RECORD LOCKS space id 280 page no 270 n bits 128 index PRIMARY of table `cnpn`.`CB_DSM_ZG_GNODEB` trx id 51226606 lock_mode X
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
 0: len 8; hex 73757072656d756d; asc supremum;;


*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 280 page no 270 n bits 128 index PRIMARY of table `cnpn`.`CB_DSM_ZG_GNODEB` trx id 51226606 lock_mode X insert intention waiting
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
 0: len 8; hex 73757072656d756d; asc supremum;;


*** (2) TRANSACTION:
TRANSACTION 51226607, ACTIVE 1 sec inserting
mysql tables in use 1, locked 1
LOCK WAIT 136 lock struct(s), heap size 24784, 403 row lock(s), undo log entries 201
MySQL thread id 6684, OS thread handle 140186323556096, query id 41203036 10.215.0.50 cbapp update
INSERT INTO CB_DSM_ZG_GNODEB(NE_ID, LIFECYCLE_STATUS,DEVICE_TYPE,FREQUENCY,GNODEB_NAME,
        LONGITUDE, CREATE_TIME, LATITUDE)
        VALUES
          
            ('2331081',2,1,'2.6GHz','name1',null,
            '2022-04-20 10:30:22.088',null)
         , 
            ('2331059',2,1,'2.6GHz','name2',null,
            '2022-04-20 10:30:22.088',null)
         , 
            ('2331761',2,1,'2.6GHz','name3',null,
            '2022-04-20 10:30:22.088',null)
         , 
            ('2331174',2,1,'2.6GHz','name4',null,
            '2022-04-20 10:30:22.088',null)
         , 
            ('2334290',2,1,'700M','name5',null,
            '2022-04-20 10:30:22.088',null)
         , 
            ('2331086',2,1,'2.6GHz',name6',null,
            '2022-0

*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 280 page no 270 n bits 128 index PRIMARY of table `cnpn`.`CB_DSM_ZG_GNODEB` trx id 51226607 lock_mode X
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
 0: len 8; hex 73757072656d756d; asc supremum;;


*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 280 page no 270 n bits 128 index PRIMARY of table `cnpn`.`CB_DSM_ZG_GNODEB` trx id 51226607 lock_mode X insert intention waiting
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
 0: len 8; hex 73757072656d756d; asc supremum;;

*** WE ROLL BACK TRANSACTION (1)

从执行的结果来看,

  1. 确实是发生了死锁。
 ------------------------
LATEST DETECTED DEADLOCK
------------------------
  1. 死锁产生的原因是两个批量insert的事务,事务id分别为5122660651226607
*** (1) TRANSACTION:
TRANSACTION 51226606, ACTIVE 1 sec inserting
.........................................................................
.........................................................................
..........................省略中间部分.........................
*** (2) TRANSACTION:
TRANSACTION 51226607, ACTIVE 1 sec inserting
  1. 两个事务均持有一个锁
    事务51226606
*** (1) HOLDS THE LOCK(S):
RECORD LOCKS space id 280 page no 270 n bits 128 index PRIMARY of table `cnpn`.`CB_DSM_ZG_GNODEB` trx id 51226606 lock_mode X
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
 0: len 8; hex 73757072656d756d; asc supremum;;

事务51226607

*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 280 page no 270 n bits 128 index PRIMARY of table `cnpn`.`CB_DSM_ZG_GNODEB` trx id 51226607 lock_mode X
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
 0: len 8; hex 73757072656d756d; asc supremum;;

从打印的内容来看,两个事务持有的锁内容一致,并且根据asc supremum;;可以大概判断,两个事务持有的均为间隙锁

  1. 两个事务均在等待锁
    事务51226606
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 280 page no 270 n bits 128 index PRIMARY of table `cnpn`.`CB_DSM_ZG_GNODEB` trx id 51226606 lock_mode X insert intention waiting
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
 0: len 8; hex 73757072656d756d; asc supremum;;

事务51226607

*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 280 page no 270 n bits 128 index PRIMARY of table `cnpn`.`CB_DSM_ZG_GNODEB` trx id 51226607 lock_mode X insert intention waiting
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
 0: len 8; hex 73757072656d756d; asc supremum;;

lock_mode X insert intention waiting可以得出,两个事务均在等待插入意向锁的释放。且两个事务等的锁内容与上文中两个事务持有的锁一致,我们可以推测出一个假设:两个事务均持有了一个范围比较大的间隙锁,而后续的insert操作均需等待对方释放间隙锁才能拿到插入意向锁,从而产生死锁!

死锁复现

产生死锁的服务业务逻辑通过下面的事务来概要表示

begin
batch_delete_method()//删除表中数据
batch_insert_method()//插入数据
commit

目前怀疑是两个事务的批量delete操作中包含了不存在的数据,因此产生了间隙锁;后续批量insert的时候需要互相等对方释放间隙锁,下面开始验证假设。

  1. 首先创建一张表,id为主键索引,number为普通索引
CREATE TABLE `test1` (
  `id` int(1) NOT NULL AUTO_INCREMENT,
  `number` int(1) NOT NULL COMMENT '数字',
  PRIMARY KEY (`id`),
  KEY `number` (`number`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=24 DEFAULT CHARSET=utf8;
  1. 在表中插入几条数据
INSERT INTO `test1` VALUES (0, 0);
INSERT INTO `test1` VALUES (10, 10);
INSERT INTO `test1` VALUES (15, 15);
  1. 开启两个事务
    事务A
begin
delete from test1 where id = 12//记录不存在,在(10,15)上加间隙锁
insert into test1 values(12,12)//需等事务B释放间隙锁
commit

事务B

begin
delete from test1 where id = 14//记录不存在,在(10,15)上加间隙锁
insert into test1 values(14,14)//需等事务B释放间隙锁
commit

4.执行结果

deadlock.png

可以看到,确实发生了死锁,执行SHOW ENGINE INNODB STATUS,输出结果如下


=====================================
2022-04-21 16:20:48 0x5258 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 7 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 81 srv_active, 0 srv_shutdown, 83574 srv_idle
srv_master_thread log flush and writes: 0
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 19
OS WAIT ARRAY INFO: signal count 16
RW-shared spins 4, rounds 4, OS waits 0
RW-excl spins 4, rounds 120, OS waits 4
RW-sx spins 0, rounds 0, OS waits 0
Spin rounds per wait: 1.00 RW-shared, 30.00 RW-excl, 0.00 RW-sx
------------------------
LATEST DETECTED DEADLOCK
------------------------
2022-04-21 16:18:49 0x5258
*** (1) TRANSACTION:
TRANSACTION 56319, ACTIVE 12 sec inserting
mysql tables in use 1, locked 1
LOCK WAIT 3 lock struct(s), heap size 1136, 2 row lock(s)
MySQL thread id 23, OS thread handle 17400, query id 1806 localhost ::1 root update
insert into test1 values(12,12)
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 353 page no 4 n bits 80 index PRIMARY of table `ocb_cp_5gmall`.`test1` trx id 56319 lock_mode X locks gap before rec insert intention waiting
Record lock, heap no 3 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
 0: len 4; hex 8000000f; asc     ;;
 1: len 6; hex 00000000dbf9; asc       ;;
 2: len 7; hex 81000000c10151; asc       Q;;
 3: len 4; hex 8000000f; asc     ;;

*** (2) TRANSACTION:
TRANSACTION 56320, ACTIVE 9 sec inserting
mysql tables in use 1, locked 1
3 lock struct(s), heap size 1136, 2 row lock(s)
MySQL thread id 26, OS thread handle 21080, query id 1810 localhost ::1 root update
insert into test1 values(14,14)
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 353 page no 4 n bits 80 index PRIMARY of table `ocb_cp_5gmall`.`test1` trx id 56320 lock_mode X locks gap before rec
Record lock, heap no 3 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
 0: len 4; hex 8000000f; asc     ;;
 1: len 6; hex 00000000dbf9; asc       ;;
 2: len 7; hex 81000000c10151; asc       Q;;
 3: len 4; hex 8000000f; asc     ;;

*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 353 page no 4 n bits 80 index PRIMARY of table `ocb_cp_5gmall`.`test1` trx id 56320 lock_mode X locks gap before rec insert intention waiting
Record lock, heap no 3 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
 0: len 4; hex 8000000f; asc     ;;
 1: len 6; hex 00000000dbf9; asc       ;;
 2: len 7; hex 81000000c10151; asc       Q;;
 3: len 4; hex 8000000f; asc     ;;

*** WE ROLL BACK TRANSACTION (2)
------------
TRANSACTIONS
------------
Trx id counter 56325
Purge done for trx's n:o < 56325 undo n:o < 0 state: running but idle
History list length 14
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 282892835750928, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 282892835749248, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 282892835753448, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 282892835752608, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 282892835748408, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 282892835747568, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 56319, ACTIVE 131 sec
3 lock struct(s), heap size 1136, 3 row lock(s), undo log entries 1
MySQL thread id 23, OS thread handle 17400, query id 1813 localhost ::1 root
.............................................................
.....................省略部分内容.............................
END OF INNODB MONITOR OUTPUT
============================

输出结果与线上MySQL的输出一致,验证了我们的假设!
最后用一张表来演示死锁产生的过程

顺序 事务A 事务B
1 begin
2 begin
3 delete from test1 where id = 12//记录不存在,在(10,15)上加间隙锁
4 delete from test1 where id = 14//记录不存在,在(10,15)上加间隙锁
5 insert into test1 values(12,12)//需等事务B释放间隙锁
6 insert into test1 values(14,14)//需等事务A释放间隙锁

死锁解决

找到了出现死锁的原因,我们要想到解决问题的方法。这里我把死锁产生的原因再贴出来
两个事务的批量delete操作涉及到表中不存在的数据,从而均持有了一个范围比较大的间隙锁,而后续的insert操作均需等待对方释放间隙锁才能拿到插入意向锁,从而产生死锁!
说到底,产生死锁的原因就是间隙锁。因此这里从两个角度来解决,

  • 数据库角度,将事务隔离级别设置为RC,防止产生间隙锁
  • 业务代码角度,在每次delete之前,执行一次select,确保表中存在相关数据,再执行delete,从而防止出现间隙锁

个人体会

网上的博客看的再多,没遇到过真实的问题那仅停留在纸上谈兵,实践出真知!

有关记录线上服务出现MySQL死锁的排查过程的更多相关文章

  1. ruby - 使用 ruby​​ 和 savon 的 SOAP 服务 - 2

    我正在尝试使用ruby​​和Savon来使用网络服务。测试服务为http://www.webservicex.net/WS/WSDetails.aspx?WSID=9&CATID=2require'rubygems'require'savon'client=Savon::Client.new"http://www.webservicex.net/stockquote.asmx?WSDL"client.get_quotedo|soap|soap.body={:symbol=>"AAPL"}end返回SOAP异常。检查soap信封,在我看来soap请求没有正确的命名空间。任何人都可以建议我

  2. ruby - 具有身份验证的私有(private) Ruby Gem 服务器 - 2

    我想安装一个带有一些身份验证的私有(private)Rubygem服务器。我希望能够使用公共(public)Ubuntu服务器托管内部gem。我读到了http://docs.rubygems.org/read/chapter/18.但是那个没有身份验证-如我所见。然后我读到了https://github.com/cwninja/geminabox.但是当我使用基本身份验证(他们在他们的Wiki中有)时,它会提示从我的服务器获取源。所以。如何制作带有身份验证的私有(private)Rubygem服务器?这是不可能的吗?谢谢。编辑:Geminabox问题。我尝试“捆绑”以安装新的gem..

  3. ruby - Sinatra:运行 rspec 测试时记录噪音 - 2

    Sinatra新手;我正在运行一些rspec测试,但在日志中收到了一堆不需要的噪音。如何消除日志中过多的噪音?我仔细检查了环境是否设置为:test,这意味着记录器级别应设置为WARN而不是DEBUG。spec_helper:require"./app"require"sinatra"require"rspec"require"rack/test"require"database_cleaner"require"factory_girl"set:environment,:testFactoryGirl.definition_file_paths=%w{./factories./test/

  4. ruby-on-rails - 启动 Rails 服务器时 ImageMagick 的警告 - 2

    最近,当我启动我的Rails服务器时,我收到了一长串警告。虽然它不影响我的应用程序,但我想知道如何解决这些警告。我的估计是imagemagick以某种方式被调用了两次?当我在警告前后检查我的git日志时。我想知道如何解决这个问题。-bcrypt-ruby(3.1.2)-better_errors(1.0.1)+bcrypt(3.1.7)+bcrypt-ruby(3.1.5)-bcrypt(>=3.1.3)+better_errors(1.1.0)bcrypt和imagemagick有关系吗?/Users/rbchris/.rbenv/versions/2.0.0-p247/lib/ru

  5. ruby-on-rails - s3_direct_upload 在生产服务器中不工作 - 2

    在Rails4.0.2中,我使用s3_direct_upload和aws-sdkgems直接为s3存储桶上传文件。在开发环境中它工作正常,但在生产环境中它会抛出如下错误,ActionView::Template::Error(noimplicitconversionofnilintoString)在View中,create_cv_url,:id=>"s3_uploader",:key=>"cv_uploads/{unique_id}/${filename}",:key_starts_with=>"cv_uploads/",:callback_param=>"cv[direct_uplo

  6. ruby-on-rails - Rails 5 Active Record 记录无效错误 - 2

    我有两个Rails模型,即Invoice和Invoice_details。一个Invoice_details属于Invoice,一个Invoice有多个Invoice_details。我无法使用accepts_nested_attributes_forinInvoice通过Invoice模型保存Invoice_details。我收到以下错误:(0.2ms)BEGIN(0.2ms)ROLLBACKCompleted422UnprocessableEntityin25ms(ActiveRecord:4.0ms)ActiveRecord::RecordInvalid(Validationfa

  7. ruby - 用 Ruby 编写一个简单的网络服务器 - 2

    我想在Ruby中创建一个用于开发目的的极其简单的Web服务器(不,不想使用现成的解决方案)。代码如下:#!/usr/bin/rubyrequire'socket'server=TCPServer.new('127.0.0.1',8080)whileconnection=server.acceptheaders=[]length=0whileline=connection.getsheaders想法是从命令行运行这个脚本,提供另一个脚本,它将在其标准输入上获取请求,并在其标准输出上返回完整的响应。到目前为止一切顺利,但事实证明这真的很脆弱,因为它在第二个请求上中断并出现错误:/usr/b

  8. ruby-on-rails - 在 Rails 中调试生产服务器 - 2

    您如何在Rails中的实时服务器上进行有效调试,无论是在测试版/生产服务器上?我试过直接在服务器上修改文件,然后重启应用,但是修改好像没有生效,或者需要很长时间(缓存?)我也试过在本地做“脚本/服务器生产”,但是那很慢另一种选择是编码和部署,但效率很低。有人对他们如何有效地做到这一点有任何见解吗? 最佳答案 我会回答你的问题,即使我不同意这种热修补服务器代码的方式:)首先,你真的确定你已经重启了服务器吗?您可以通过跟踪日志文件来检查它。您更改的代码显示的View可能会被缓存。缓存页面位于tmp/cache文件夹下。您可以尝试手动删除

  9. 使用canal同步MySQL数据到ES - 2

    文章目录一、概述简介原理模块二、配置Mysql使用版本环境要求1.操作系统2.mysql要求三、配置canal-server离线下载在线下载上传解压修改配置单机配置集群配置分库分表配置1.修改全局配置2.实例配置垂直分库水平分库3.修改group-instance.xml4.启动监听四、配置canal-adapter1修改启动配置2配置映射文件3启动ES数据同步查询所有订阅同步数据同步开关启动4.验证五、配置canal-admin一、概述简介canal是Alibaba旗下的一款开源项目,Java开发。基于数据库增量日志解析,提供增量数据订阅&消费。Git地址:https://github.co

  10. ruby-on-rails - 事件记录 : Select max of limit - 2

    我正在尝试将以下SQL查询转换为ActiveRecord,它正在融化我的大脑。deletefromtablewhereid有什么想法吗?我想做的是限制表中的行数。所以,我想删除少于最近10个条目的所有内容。编辑:通过结合以下几个答案找到了解决方案。Temperature.where('id这给我留下了最新的10个条目。 最佳答案 从您的SQL来看,您似乎想要从表中删除前10条记录。我相信到目前为止的大多数答案都会如此。这里有两个额外的选择:基于MurifoX的版本:Table.where(:id=>Table.order(:id).

随机推荐