“HBase”

fujohnwang

2011-12-20



1 HBase: Past and Present

Born from the idea of Google Bigtable

2 HBase Configuration

2.1 Prerequisites

  1. Java 6 or above
  2. SSH - using passwordless login (Google “ssh passwordless login”)
  3. DNS
  4. NTP
  5. ulimit and nproc
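
The ulimit and nproc item can be checked from a script before HBase is started. The following is a minimal sketch that reads the current user's soft limits on a Unix-like system; the thresholds are illustrative assumptions, not values taken from these notes, and should be replaced with whatever your HBase version recommends.

```python
import resource

# Thresholds are illustrative assumptions, not official HBase requirements.
MIN_OPEN_FILES = 10240
MIN_PROCESSES = 1024


def check_limits(min_nofile=MIN_OPEN_FILES, min_nproc=MIN_PROCESSES):
    """Return {limit name: (soft limit, is high enough)} for the current user."""
    wanted = {
        "nofile": (resource.RLIMIT_NOFILE, min_nofile),
        "nproc": (resource.RLIMIT_NPROC, min_nproc),
    }
    report = {}
    for name, (limit_id, minimum) in wanted.items():
        soft, _hard = resource.getrlimit(limit_id)
        # RLIM_INFINITY means "unlimited", which always satisfies the minimum.
        ok = soft == resource.RLIM_INFINITY or soft >= minimum
        report[name] = (soft, ok)
    return report


if __name__ == "__main__":
    for name, (soft, ok) in check_limits().items():
        print(f"{name}: soft limit {soft} -> {'ok' if ok else 'too low'}")
```

Run it as the user that will start HBase, since limits are per user.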

2.2 Distributed Deployment

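HBase configuration must be kept in sync across all nodes. You can edit it on one machine and push it out with rsync, or set up a dedicated deployment machine that manages the configuration centrally and distributes it (this is presumably what tools like puppet are for).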
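
As a rough illustration of the rsync approach, the sketch below only builds the rsync command lines, one per node, and prints them. The host names and the /opt/hbase/conf path are hypothetical; adjust them to your own layout.

```python
import shlex


def build_rsync_commands(conf_dir, hosts, remote_dir):
    """Build one rsync argv per host; nothing is executed here."""
    # Trailing slash: copy the directory's contents, not the directory itself.
    source = conf_dir.rstrip("/") + "/"
    return [
        ["rsync", "-az", "--delete", source, f"{host}:{remote_dir}"]
        for host in hosts
    ]


if __name__ == "__main__":
    # Hypothetical host names and paths, for illustration only.
    nodes = ["node1.example.com", "node2.example.com"]
    for argv in build_rsync_commands("/opt/hbase/conf", nodes, "/opt/hbase/conf/"):
        print(shlex.join(argv))
```

Note that --delete removes files on the target that no longer exist in the source directory, so drop it if nodes keep local-only files there. To actually run a command, pass the argv list to subprocess.run(argv, check=True).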

When configuring HBase, at a minimum you need to override the default configuration to tell HBase:

  1. which FileSystem to use, local or HDFS, etc.;
  2. where the ZooKeeper deployment to use is located.
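
These two overrides normally go into conf/hbase-site.xml. The sketch below generates such a file from Python so that the same values can be pushed to every node. The property names hbase.rootdir, hbase.zookeeper.quorum and hbase.cluster.distributed are standard HBase settings; the host names and port in the usage example are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_hbase_site(rootdir, zk_quorum, distributed=True):
    """Return hbase-site.xml content that overrides a few defaults."""
    overrides = {
        # Which file system to use, e.g. file:///... or hdfs://...
        "hbase.rootdir": rootdir,
        # Comma-separated list of ZooKeeper hosts.
        "hbase.zookeeper.quorum": ",".join(zk_quorum),
        "hbase.cluster.distributed": "true" if distributed else "false",
    }
    root = ET.Element("configuration")
    for name, value in overrides.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical NameNode and ZooKeeper hosts, for illustration only.
    print(build_hbase_site(
        "hdfs://namenode.example.com:8020/hbase",
        ["zk1.example.com", "zk2.example.com", "zk3.example.com"],
    ))
```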

2.3 Cautions

HBase will lose data unless it is running on an HDFS that has a durable sync.

3 HBase Replication

8.6.4. Write Ahead Log (WAL)
    The WAL is in HDFS in /hbase/.logs/ with subdirectories per region.

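It is probably necessary to understand the region server's ephemeral node in ZooKeeper, so that after a region server failover, eromanga can also switch to the new region server and start capturing changes from the new HLog.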

API - http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/replication/package-summary.html#requirements

Once a HRegionServer starts and is opening the regions it hosts it checks if there are some left over log files and applies those all the way down in Store.doReconstructionLog(). Replaying a log is simply done by reading the log and adding the contained edits to the current MemStore. At the end an explicit flush of the MemStore (note, this is not the flush of the log!) helps writing those changes out to disk.

  -- HLog Replay
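
The replay sequence in the quoted passage (read the leftover log, add its edits to the MemStore, then flush the MemStore explicitly) can be pictured with a toy model. This is not HBase code: ToyStore and its fields are hypothetical names, and the "files" are just Python dicts.

```python
class ToyStore:
    """Toy stand-in for a store: an in-memory buffer plus flushed 'files'."""

    def __init__(self):
        self.memstore = {}
        self.flushed_files = []

    def replay(self, log_edits):
        """Re-apply leftover log edits, then flush them out."""
        # Edits are applied in log order, so a later edit to a row wins.
        for row, value in log_edits:
            self.memstore[row] = value
        # Explicit flush of the in-memory buffer (not of the log itself).
        self.flush()

    def flush(self):
        """Write the buffer out as one sorted 'file' and clear it."""
        if self.memstore:
            self.flushed_files.append(dict(sorted(self.memstore.items())))
            self.memstore = {}


if __name__ == "__main__":
    store = ToyStore()
    store.replay([("row1", "a"), ("row2", "b"), ("row1", "c")])
    print(store.flushed_files)
```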

3.1 HBase Replication Ref

  1. http://hbase.apache.org/replication.html
  2. http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/regionserver/wal/HLog.html
  3. http://blog.sematext.com/2011/03/11/hbase-backup-options/ - HBase Backup Options > You might want to check out at MapR’s distro for Apache Hadoop (www.mapr.com). It has consistent point-in-time snapshots, as well the ability to mirror the snapshots to another data-center for disaster-recovery.
  4. https://github.com/mozilla-metrics/akela/blob/master/src/main/java/com/mozilla/hadoop/Backup.java | http://blog.mozilla.com/data/2011/02/04/migrating-hbase-in-the-trenches/ - Mozilla Backup tool
  5. http://javamaster.wordpress.com/2010/03/19/replication-architecture-in-cassandra-and-hbase/
  6. http://koven2049.iteye.com/blog/983633
  7. How Google Serves Data From Multiple Datacenters
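  8. HBase log management, part 1 (Hbase的log管理(一))
  9. HBase log management, part 2 (Hbase的log管理(二))
  10. HBase exceptions: after a RegionServer crash (HBase异常——当RegionServer Crash之后)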

3.2 Issues

  1. When hbase.hlog.split.skip.errors is set to false, we fail the split but thats it
  2. Figure how to deal with eof splitting logs
  3. Multi data center replication

4 Schema Design

In general, it's best to avoid using a timestamp or a sequence (e.g. 1, 2, 3) as the row key: monotonically increasing keys send all new writes to the same region, which becomes a hotspot.

http://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/
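
One common way to avoid monotonically increasing row keys is to prefix each key with a small hash-derived "salt", so that consecutive keys land in different parts of the key space. The sketch below is only an illustration of that idea; the bucket count and the key format are assumptions, not something prescribed by HBase.

```python
import hashlib

NUM_BUCKETS = 16  # assumed number of salt buckets


def salted_row_key(key, buckets=NUM_BUCKETS):
    """Prefix the key with a bucket number derived from its hash."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    bucket = digest[0] % buckets
    # Zero-padded prefix keeps the bucket part a fixed width.
    return f"{bucket:02d}-{key}"


if __name__ == "__main__":
    # Sequential ids no longer sort next to each other once salted.
    for seq in range(1000, 1005):
        print(salted_row_key(str(seq)))
```

The trade-off is on the read side: a scan over a range of the original keys now has to be issued once per bucket and the results merged.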

5 Distributions

  1. Apache HBase
  2. Cloudera
  3. Others

6 Internals

  1. HBase source code analysis: the master, part 1 (hbase 源码解析之master篇1)
  2. HBase source code analysis: the master, part 2 (hbase 源码解析之master篇2)

7 References

  1. http://www.pigi-project.org/ - Powerful, Invincible, Great Indexing for HBase
  2. ** Bigtable: A Distributed Storage System for Structured Data **
  3. HBase Schema Design - Things you need to know
  4. HBase Architecture 101 - Storage
  5. HBase Architecture 101 - Write-ahead-Log
  6. http://www.spnguru.com/tag/hbase/ - hbase tag on the blog of the SPN R&D team at the Trend Micro China R&D Center
  7. Coprocessors: Support aggregate functions
  8. A summary of HBase applications and optimizations at Taobao (HBase在淘宝的应用和优化小结)
  9. http://www.meetup.com/hbaseusergroup/files/
  10. https://www.guru99.com/hbase-architecture-data-flow-usecases.html - Guru99 HBase Tutorial



© 王福强, personal copyright. All Rights Reserved.