“HBase”

fujohnwang

2011-12-20


HBase的前世今生

Born from the idea of Google Bigtable

HBase的配置

依赖前提

  1. Java 6 or above
  2. SSH - using passwordless login (Google “ssh passwordless login”)
  3. DNS
  4. NTP
  5. ulimit and nproc
  • base需要很多文件句柄,所以系统默认的文件句柄数基本上不够,一般需要设置为1w以上
  • “upping the file descriptors and nproc for the user who is running the HBase process is an operating system configuration, not an HBase configuration.”

分布式部署

Hbase的配置需要在所有结点之间同步, 可以在一台机器上编辑完,然后用rsync同步,或者,专门搞一台发布机,集中管理配置, 然后分发(puppet之类的软件应该是干这个事情的)。

配置hbase的时候, 最基本的需要通过override默认的配置,告知hbase:

  1. 使用什么FileSystem,是local的还是HDFS等;
  2. 要使用的zookeeper部署位置是什么;

Cautions

HBase will lose data unless it is running on an HDFS that has a durable sync.

HBase Replication

8.6.4. Write Ahead Log (WAL)
    The WAL is in HDFS in /hbase/.logs/ with subdirectories per region.

应该需要了解region server在zk里的ephemeral node,以便在region server failover之后,eromanga也可以转到新的region server上从新的hlog开始抓取变更。

API - http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/replication/package-summary.html#requirements

Once a HRegionServer starts and is opening the regions it hosts it checks if there are some left over log files and applies those all the way down in Store.doReconstructionLog(). Replaying a log is simply done by reading the log and adding the contained edits to the current MemStore. At the end an explicit flush of the MemStore (note, this is not the flush of the log!) helps writing those changes out to disk.

|—-HLog Replay

HBase Replication Ref

  1. http://hbase.apache.org/replication.html
  2. http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/regionserver/wal/HLog.html
  3. http://blog.sematext.com/2011/03/11/hbase-backup-options/ - HBase Backup Options > You might want to check out at MapR’s distro for Apache Hadoop (www.mapr.com). It has consistent point-in-time snapshots, as well the ability to mirror the snapshots to another data-center for disaster-recovery.
  4. https://github.com/mozilla-metrics/akela/blob/master/src/main/java/com/mozilla/hadoop/Backup.java | http://blog.mozilla.com/data/2011/02/04/migrating-hbase-in-the-trenches/ - Mozilla Backup tool
  5. http://javamaster.wordpress.com/2010/03/19/replication-architecture-in-cassandra-and-hbase/
  6. http://koven2049.iteye.com/blog/983633
  7. How Google Serves Data From Multiple Datacenters
  8. Hbase的log管理(一)
  9. Hbase的log管理(二)
  10. HBase异常——当RegionServer Crash之后

Issues

  1. When hbase.hlog.split.skip.errors is set to false, we fail the split but thats it
  2. Figure how to deal with eof splitting logs
  3. Multi data center replication

Schema Design

in general its best to avoid using a timestamp or a sequence (e.g. 1, 2, 3) as the row-key.

http://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/

Distributions

  1. Apache HBase
  2. Cloudera
  3. Others

原理

  1. hbase 源码解析之master篇1
  2. hbase 源码解析之master篇2

References

  1. http://www.pigi-project.org/ - Powerful, Invincible, Great Indexing for HBase
  2. ** Bigtable: A Distributed Storage System for Structured Data **
  3. HBase Schema Design - Things you need to know
  4. HBase Architecture 101 - Storage
  5. HBase Architecture 101 - Write-ahead-Log
  6. HBase Schema Design - Things you need to know
  7. http://www.spnguru.com/tag/hbase/ - 趋势科技中国研发中心SPN研发团队hbase tag
  8. Coprocessors: Support aggregate functions
  9. HBase在淘宝的应用和优化小结
  10. http://www.meetup.com/hbaseusergroup/files/
  11. https://www.guru99.com/hbase-architecture-data-flow-usecases.html - Guru99 HBase Tutorial

>>>>>> 更多阅读 <<<<<<


「为AI疯狂」星球上,扶墙老师正在和朋友们讨论有趣的AI话题,你要不要⼀起来呀?^-^
这里

  1. 不但有及时新鲜的AI资讯和深度探讨
  2. 还分享AI工具、产品方法和商业机会
  3. 更有体系化精品付费内容等着你,加入星球(https://t.zsxq.com/0dI3ZA0sL) 即可免费领取。(加入之后一定记得看置顶消息呀!)

知识星球二维码

开天窗,拉认知,订阅「福报」,即刻拥有自己的全模态人工智能。

订阅「福报」