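I recently ran into a rather strange problem: a regionserver lost its ZooKeeper session because of a long GC pause and was eventually kicked out of the cluster. What happened next, though, was where the nightmare really began.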
1. A regionserver timed out its ZooKeeper connection because of GC and was eventually removed from the cluster.
HBase regionserver log:
2018-05-31 11:42:17,739 INFO [MemStoreFlusher.0] regionserver.HRegion: Started memstore flush for cn_kong_groups,\x00\x00\xBB\xE9\x03\x03\x00D\xDF,1527701650816.a177e358544ffe3157a4c0531feb8e5a., current region memstore size 123.40 MB, and 1/1 column families' memstores are being flushed.
2018-05-31 11:42:17,740 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 52612ms
GC pool 'ParNew' had collection(s): count=1 time=45897ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=2 time=6814ms
2018-05-31 11:42:17,741 WARN [B.defaultRpcServer.handler=0,queue=0,port=16020] ipc.RpcServer: (responseTooSlow): {"processingtimems":52721,"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)","client":"10.21.23.232:55676","starttimems":1527738085020,"queuetimems":0,"class":"HRegionServer","responsesize":15,"method":"Scan"}
2018-05-31 11:42:17,745 INFO [regionserver/regionserver1.bigdata.com/172.16.11.66:16020-SendThread(ip-10-21-14-154.bigdata.com:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 61597ms for sessionid 0x15f3454790276db, closing socket connection and attempting reconnect
2018-05-31 11:42:17,745 INFO [main-SendThread(ip-10-21-14-154.bigdata.com:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 61595ms for sessionid 0x15f3454790276da, closing socket connection and attempting reconnect
2018-05-31 11:42:17,893 INFO [sync.3] wal.FSHLog: Slow sync cost: 148 ms, current pipeline: [DatanodeInfoWithStorage[172.16.11.66:50010,DS-ac448b6f-2964-4900-aeda-4547f2d956b8,DISK], DatanodeInfoWithStorage[172.16.11.67:50010,DS-e8b1727a-81d5-4b65-9854-6b6ad6749b64,DISK], DatanodeInfoWithStorage[10.21.23.41:50010,DS-56eb0e28-5b3c-4047-acc3-d79f3f6bacc2,DISK]]
2018-05-31 11:42:17,893 INFO [sync.2] wal.FSHLog: Slow sync cost: 150 ms, current pipeline: [DatanodeInfoWithStorage[172.16.11.66:50010,DS-ac448b6f-2964-4900-aeda-4547f2d956b8,DISK], DatanodeInfoWithStorage[172.16.11.67:50010,DS-e8b1727a-81d5-4b65-9854-6b6ad6749b64,DISK], DatanodeInfoWithStorage[10.21.23.41:50010,DS-56eb0e28-5b3c-4047-acc3-d79f3f6bacc2,DISK]]
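The link between the JVM pause and the ZooKeeper session timeout in the log above can be checked mechanically. The following is only a minimal sketch: the regular expressions are written against the two message formats shown in this log, and `ASSUMED_SESSION_TIMEOUT_MS` is a made-up value for illustration, not the timeout actually configured on this cluster.

```python
import re

# Patterns written against the two message formats that appear in the log above.
PAUSE_RE = re.compile(r"pause of approximately (\d+)ms")
ZK_RE = re.compile(r"have not heard from server in (\d+)ms for sessionid (0x[0-9a-f]+)")


def find_long_pauses(lines, session_timeout_ms):
    """Return the JVM pause durations (ms) that reach the session timeout."""
    pauses = []
    for line in lines:
        match = PAUSE_RE.search(line)
        if match and int(match.group(1)) >= session_timeout_ms:
            pauses.append(int(match.group(1)))
    return pauses


def find_zk_timeouts(lines):
    """Return (session id, silence in ms) for each ZooKeeper timeout message."""
    timeouts = []
    for line in lines:
        match = ZK_RE.search(line)
        if match:
            timeouts.append((match.group(2), int(match.group(1))))
    return timeouts


# Two lines copied from the log above.
SAMPLE = [
    "2018-05-31 11:42:17,740 WARN [JvmPauseMonitor] util.JvmPauseMonitor: "
    "Detected pause in JVM or host machine (eg GC): pause of approximately 52612ms",
    "2018-05-31 11:42:17,745 INFO [main-SendThread(ip-10-21-14-154.bigdata.com:2181)] "
    "zookeeper.ClientCnxn: Client session timed out, have not heard from server in "
    "61595ms for sessionid 0x15f3454790276da, closing socket connection and attempting reconnect",
]

# Hypothetical value for illustration only; use the cluster's real setting.
ASSUMED_SESSION_TIMEOUT_MS = 40000

long_pauses = find_long_pauses(SAMPLE, ASSUMED_SESSION_TIMEOUT_MS)
zk_timeouts = find_zk_timeouts(SAMPLE)
print(long_pauses)
print(zk_timeouts)
```

A pause that is longer than the session timeout, logged within milliseconds of a "Client session timed out" message, is a strong hint that the GC pause is what cost the regionserver its session.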
2. Setting the full GC problem aside for now, something strange happened: after this regionserver was restarted, it died again shortly afterwards. Then other regionservers started dying one after another.
2018-05-31 12:20:12,919 ERROR [RS_OPEN_REGION-regionserver1:16020-2] coprocessor.CoprocessorHost: The coprocessor org.apache.dfs.storage.hbase.cube.v1.coprocessor.observer.AggregateRegionObserver threw org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
    at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1774)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1313)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3856)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1006)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:843)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
2018-05-31 12:20:12,923 ERROR [RS_OPEN_REGION-regionserver1:16020-2] regionserver.HRegionServer: ABORTING region server regionserver1.bigdata.com,16020,1527740222472: The coprocessor org.apache.dfs.storage.hbase.cube.v1.coprocessor.observer.AggregateRegionObserver threw
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
    at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1774)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1313)
3. Below is the log from another regionserver. Both logs show "Operation category READ is not supported in state standby", followed by "ABORTING region server".
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
    at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1774)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1313)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3856)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1006)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:843)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
2018-05-31 12:20:13,678 ERROR [RS_OPEN_REGION-regionserver2:16020-2] regionserver.HRegionServer: ABORTING region server regionserver2.bigdata.com,16020,1519906845598: The coprocessor
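When several regionservers abort within seconds of each other, it helps to line the aborts up across log files. The sketch below assumes one log record per line and relies only on the "ABORTING region server" and StandbyException messages shown above; the function and variable names are made up for illustration.

```python
import re

# Timestamp and host name from an "ABORTING region server" line, as seen above.
ABORT_RE = re.compile(r"^(\S+ \S+) ERROR .*?ABORTING region server ([^,\s]+),")
STANDBY_MARKER = "Operation category READ is not supported in state standby"


def summarize_aborts(lines):
    """Return ([(timestamp, host), ...], number of standby errors seen)."""
    aborts = []
    standby_hits = 0
    for line in lines:
        if STANDBY_MARKER in line:
            standby_hits += 1
        match = ABORT_RE.search(line)
        if match:
            aborts.append((match.group(1), match.group(2)))
    return aborts, standby_hits


# Lines copied (and shortened) from the two regionserver logs above.
SAMPLE = [
    "2018-05-31 12:20:12,923 ERROR [RS_OPEN_REGION-regionserver1:16020-2] "
    "regionserver.HRegionServer: ABORTING region server "
    "regionserver1.bigdata.com,16020,1527740222472: The coprocessor",
    "org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): "
    "Operation category READ is not supported in state standby",
    "2018-05-31 12:20:13,678 ERROR [RS_OPEN_REGION-regionserver2:16020-2] "
    "regionserver.HRegionServer: ABORTING region server "
    "regionserver2.bigdata.com,16020,1519906845598: The coprocessor",
]

aborts, standby_hits = summarize_aborts(SAMPLE)
for timestamp, host in aborts:
    print(timestamp, host)
print("standby errors:", standby_hits)
```

Sorting the collected pairs by timestamp across all servers makes the order of the chain reaction visible at a glance.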
4. Why did this happen? It seemed very strange.
After some digging, and with guidance from a veteran colleague, we found the cause. Some HBase tables use the coprocessor feature, but that in itself is not the key point.
The information for one such HBase table is shown below. Note this part, which does not name a specific host: hdfs://dfs/dfs/dfs_metadata/coprocessor/*****
'dfs_ZZQSWFZ4VW', {TABLE_ATTRIBUTES => {
  coprocessor$1 => 'hdfs://dfs/dfs/dfs_metadata/coprocessor/dfs-coprocessor-1.5.4.1-0.jar|org.apache.dfs.storage.hbase.cube.v2.coprocessor.endpoint.CubeVisitService|1001|',
  coprocessor$2 => 'hdfs://dfs/dfs/dfs_metadata/coprocessor/dfs-coprocessor-1.5.4.1-0.jar|org.apache.dfs.storage.hbase.cube.v1.coprocessor.observer.AggregateRegionObserver|1002|',
  METADATA => {'CREATION_TIME' => '1478338090339', 'GIT_COMMIT' => 'c4e31c1b3a664f598352061ae8703812e9d9cef7;', 'dfs_HOST' => 'dfs_metadata', 'OWNER' => 'xxxx.owner@bigdata.com', 'SEGMENT' => 'WindGreenwichOffline_dev[20161105070130_20161105092132]', 'SPLIT_POLICY' => 'org.apache.hadoop.hbase.regionserver.DisabledRegionSplitPolicy', 'USER' => 'ADMIN'}}},
{NAME => 'F1', DATA_BLOCK_ENCODING => 'FAST_DIFF', BLOOMFILTER => 'NONE', COMPRESSION => 'SNAPPY'}
5. Here is the key point. The HDFS NameNode currently runs in HA mode, but this cluster originally had a single NameNode, and some of the coprocessor-enabled HBase tables created back then are what triggered the problem.
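These coprocessors were registered with the HDFS address of one specific NameNode, so they can be loaded only while that original NameNode is active. Once an active-standby failover happens, these tables can no longer be loaded or accessed, which eventually brings down the whole regionserver.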
The information for such an HBase table is shown below. Note this part, which uses a specific host name: hdfs://old_namenode_host.bigdata.com:9000/dfs/dfs_metadata/*****
'dfs_1KT8V5FL1C', {TABLE_ATTRIBUTES => {
  coprocessor$1 => '|org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint|107374663|',
  coprocessor$2 => 'hdfs://old_namenode_host.bigdata.com:9000/dfs/dfs_metadata/coprocessor/dfs-coprocessor-1.5.4.1-1.jar|org.apache.dfs.storage.hbase.cube.v2.coprocessor.endpoint.CubeVisitService|1001|',
  coprocessor$3 => 'hdfs://old_namenode_host.bigdata.com:9000/dfs/dfs_metadata/coprocessor/dfs-coprocessor-1.5.4.1-1.jar|org.apache.dfs.storage.hbase.cube.v1.coprocessor.observer.AggregateRegionObserver|1002|',
  METADATA => {'CREATION_TIME' => '1493632152602', 'GIT_COMMIT' => 'c4e31c1b3a664f598352061ae8703812e9d9cef7;', 'dfs_HOST' => 'dfs_metadata', 'OWNER' => 'xxxx.owner@bigdata.com', 'SEGMENT' => 'WindCGN2_clone[20160501000000_20170501000000]', 'SPLIT_POLICY' => 'org.apache.hadoop.hbase.regionserver.DisabledRegionSplitPolicy', 'USER' => 'ADMIN'}}},
{NAME => 'F1', DATA_BLOCK_ENCODING => 'FAST_DIFF', BLOOMFILTER => 'NONE', COMPRESSION => 'SNAPPY'}
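To find every table affected this way, the coprocessor attributes in the table descriptors can be scanned for jar paths that point at a single NameNode. The following is a rough sketch, not a tool from the original investigation. It uses a heuristic that is an assumption here: an hdfs:// URI with an explicit port names one host, whereas the HA form seen earlier (hdfs://dfs/...) carries no port. The helper names are hypothetical.

```python
import re
from urllib.parse import urlsplit

# One coprocessor attribute in the descriptor text: coprocessor$N => 'jar|class|priority|args'
COPROC_RE = re.compile(r"coprocessor\$(\d+) => '([^']*)'")


def coprocessor_specs(descriptor):
    """Yield (index, jar_uri, class_name) for each coprocessor attribute."""
    for match in COPROC_RE.finditer(descriptor):
        fields = match.group(2).split("|")
        yield int(match.group(1)), fields[0], fields[1]


def pinned_to_host(jar_uri):
    """Heuristic (an assumption): an hdfs:// URI with a port targets one NameNode."""
    if not jar_uri:
        return False  # empty jar path: the class is loaded from the classpath
    parts = urlsplit(jar_uri)
    return parts.scheme == "hdfs" and parts.port is not None


def to_nameservice(jar_uri, nameservice):
    """Rewrite the URI authority to the HA nameservice, keeping the path."""
    return urlsplit(jar_uri)._replace(netloc=nameservice).geturl()


# Shortened copy of the descriptor above.
DESCRIPTOR = (
    "'dfs_1KT8V5FL1C', {TABLE_ATTRIBUTES => {"
    "coprocessor$1 => '|org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint|107374663|', "
    "coprocessor$2 => 'hdfs://old_namenode_host.bigdata.com:9000/dfs/dfs_metadata/coprocessor/"
    "dfs-coprocessor-1.5.4.1-1.jar|org.apache.dfs.storage.hbase.cube.v2.coprocessor.endpoint.CubeVisitService|1001|', "
    "coprocessor$3 => 'hdfs://old_namenode_host.bigdata.com:9000/dfs/dfs_metadata/coprocessor/"
    "dfs-coprocessor-1.5.4.1-1.jar|org.apache.dfs.storage.hbase.cube.v1.coprocessor.observer.AggregateRegionObserver|1002|'}}"
)

pinned = [(i, jar) for i, jar, _ in coprocessor_specs(DESCRIPTOR) if pinned_to_host(jar)]
for index, jar in pinned:
    # "dfs" is the nameservice name that appears in the HA-style table earlier.
    print(f"coprocessor${index}: {jar} -> {to_nameservice(jar, 'dfs')}")
```

Any table reported by such a scan would need its coprocessor attribute re-registered with the nameservice-based path before its regions can be opened reliably after a failover.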
To summarize: why did the other regionservers die one by one? And why, when the NameNode failover had happened long before, did the chain reaction start only after GC killed a regionserver?
1. GC killed the first regionserver. The master then had to reassign its regions to other regionservers, but those could not take over the affected regions either, so a second regionserver died with the same error, then a third. In theory, every regionserver would eventually die.
2. The regionservers had started, and had opened their regions successfully, before the first regionserver died and before the NameNode active-standby failover, so nothing went wrong at the time. Once a regionserver restarts, however, it can no longer open these regions.
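The two points above can be illustrated with a toy model. This is a deliberately simplified sketch and not how the HBase master actually balances regions: it assumes the master hands each orphaned region to the live server with the fewest regions, and that any server asked to open a "broken" region (one whose coprocessor jar can no longer be read) aborts. All names are hypothetical.

```python
from collections import deque


def simulate_cascade(assignment, broken_regions, first_failed):
    """Return the order in which servers abort.

    assignment:     dict mapping server -> list of regions already open on it
    broken_regions: regions that can stay open but can no longer be re-opened
    first_failed:   the server that dies first (here, because of GC)
    """
    assignment = {server: list(regions) for server, regions in assignment.items()}
    dead = []
    pending = deque([first_failed])  # servers that are about to abort
    while pending:
        server = pending.popleft()
        dead.append(server)
        orphans = assignment.pop(server)
        for region in orphans:
            live = [s for s in assignment if s not in pending]
            if not live:
                break  # nobody left to take the region
            # Simplified placement: the least-loaded live server gets the region.
            target = min(live, key=lambda s: len(assignment[s]))
            assignment[target].append(region)
            if region in broken_regions:
                # Opening the region fails while loading the coprocessor jar,
                # and the server aborts, orphaning its own regions in turn.
                pending.append(target)
    return dead


cluster = {"rs1": ["a", "bad"], "rs2": ["b"], "rs3": ["c"]}
order = simulate_cascade(cluster, {"bad"}, "rs1")
print(order)
```

In this model a single region that cannot be re-opened is enough to take down every server in turn, while the same region causes no harm as long as it stays where it was opened before the failover.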
For more on the coprocessor feature, see:
http://www.zhyea.com/2017/04/13/using-hbase-coprocessor.html
https://www.3pillarglobal.com/insights/hbase-coprocessors
Title: Analysis of HBase regionservers dying one after another
Source: http://www.rwnh.cn/article0/jsdhoo.html