“Notes Of GraphDBs”

fujohnwang

2012-01-10

Basic Concepts

图存储 + 图运算引申出一套生态系统， GraphDB属于存储序列，图计算则以google的pregel为典型代表（使用BSP计算模型），还包括了Hama，Giraph等图计算框架。本Note主要侧重GraphDB相关内容，包括基本概念，基本操作和访问，存储结构等等。

A graph is composed of Nodes(Vertices) [with Properties] and Edges [with Labels]. graph sample picture from wikipedia

Graph Types

directed graph（有向图） -> flow network
undirected graph
Mixed Graph(单向，双向，无向边混合)
Multigraph (多边)
Weighted Graph(加权)
Property Graph
- 对于Property Graph类型，每个node和edge都可以有多组k-v形式的properties来附加更多信息，另外，每个edge可以有label，用来表示不同类型的edge，这应该通常多见于multigraph中，即两个node之间存在多个edge的情形，而多个edge可以是不同类型的edge，比如like， follows等。 label和properties对于edge来说是不同的属性，不要混淆。

Basic Operations

The basic operations provided by a graph data structure G usually include:

adjacent(G, x, y): tests whether there is an edge from node x to node y.
neighbors(G, x): lists all nodes y such that there is an edge from x to y.
add(G, x, y): adds to G the edge from x to y, if it is not there.
delete(G, x, y): removes the edge from x to y, if it is there.
get_node_value(G, x): returns the value associated with the node x.
set_node_value(G, x, a): sets the value associated with the node x to a.

Structures that associate values to the edges usually also provide:

get_edge_value(G, x, y): returns the value associated to the edge (x,y).
set_edge_value(G, x, y, v): sets the value associated to the edge (x,y) to v.

Graph-theoretic data structures

List structures
- Incidence list
- Adjacency list
Matrix structures
- Incidence matrix
- Adjacency matrix
- Laplacian matrix or “Kirchhoff matrix” or “Admittance matrix”
- Distance matrix

图论应用场景

Task planning
Scheduling
Process assignation
Routing
Logistics
League planning
Pattern Recognition
Dependency analysis
Impact analysis
Network flow – Traffic analysis and optimization – Delivery optimization
Optimization of tasks

Typical Products

Neo4J

Graph DB, AGPL3 license(Sucks, I think), java made, ACID, HA, M-S Caching Shards, Domain-specific, embeddable, REST, etc.

Neo4j的授权协议包括商业性质的授权和开源协议的授权，但开源协议授权是基于AGPL，意味着如果要用neo4j，那你的产品也要开源，这对许多公司的产品来说是不可接受的，另外，许多高级特性只在商业版中才有，比如HA， backup等，没有了这些高级特性， neo4j也就剩下玩玩的资本了。

所以，不再对neo4j做更多了解和涉猎，只罗列部分资料，如果有人感兴趣可以看下：

OrientDB

http://www.orientechnologies.com/orient-graphdb.htm

Document-Graph DB, Apache2 License, Java Made, ACID Support, HA via replication, embeddable, REST&SQL-like access.

OrientDB实际上不是一个纯的GraphDB，它是构建在DocumentDB的模型之上。除了GraphDB， OrientDB还在基础的DocumentDB基础模型之上构建了KV, OO等类型的DB。

References Of OrientDB:

FlockDB

twitter特定场景下的graphed产品，并非力图实现所有的features，而只关注twitter自身或者网站常见的一些问题，比如在twitter内部只用flockdb存储用户关系图以及二级索引等信息。

flocked存储采用node+edge一起存储的策略，而且针对一个edge，会按照forward和backward分别存储2条，这种策略内聚性不错，但是否耦合性太紧，不便于扩展那？！（更深层次的edge关联不好管理，尤其是引入分布式存储结构之后！）

flockdb底层使用mysql作为存储，使用twitter的gizzard分布式数据访问层来管理mysql分区集群，在这之上，才是实现的重头戏，成为flapps服务集群， flapps是stateless的，所以可以很容易的横向扩展，而m底层的mysql存储层在gizzard的帮助下也可以横向扩展，所以总的来说， flockdb的横向扩展能力是没啥问题的。

https://github.com/twitter/flockdb/raw/master/doc/flockdb-layout.png

从使用场景上来看， flockdb更适合浅层次的遍历和查询(少于2个层次？！)，而深层次的遍历估计会因为同时牵扯index的查询和数据的查询而性能急剧下降。（个人猜测）从这个角度来看， flockdb还真是一个twitter特定的产品，跟一般意义上的graphdb有很大的差异，即traversal能力的强弱。

flockdb使用thrift暴露远程服务，所以，客户端可以使用多种语言实现，包括ruby， java， c/c++等， twitter内部应该用的是gizzmo（ruby客户端）。

结论： flockdb虽然速度和横向扩展能力不错，但graph模型和处理能力太弱，不一定适合“来往”的业务模型。

References

DEX

C++ core, Java, .Net API, dual Personal-evaluation or Commercial license

Labeled, Directed And Attributed Multigraph Model

Scenarios for DEX

Social Networks
- Twitter, Facebook, Linkedin, Flickr, Delicious, MySpace
Information Networks
- Bibliographical databases, Wikipedia, IMDB
Security Networks & fraud detection
- Economic transactions analysis
Recommendation
- ecommerce
Media Analysis
- Audiovisual content recommendation
Physical Netorks
- Logistics, Transport, Electrical, Telecom Networks
Biological Networks

Curiosity

基于磁盘的高容量存储是怎么做的？！
存储结构是什么样儿的？自定义？
横向扩展能力如何？

References

Infogrid

MeshNetg公司开源产品， MeshNet提供商业支持。产品整体按照不同的关注点分为不同的子项目，包括核心的graphdb，关注cluster的grid，关注存储的stores以及展现和工具支持，可以使用不同的存储方案，包括mysql， FS， HDFS等。

References

Other Ones…

InfiniteGraph
- Gluecon InfiniteGraph Presentation: Scaling the Social Graph in the Cloud
HypergraphDB
- LGPL, Java Made ,ACID, P2P distribution and replication, storage on BDB
- HyperGraphDB:Data Management for Complex Systems
TinkerGraph - In-Memory Graph, Not for production.

Tools & Support

TinkerPop公司在这个领域显然做到工作很出色！

Blueprints
Gremlin
- The Pathology of Graph Databases - Gremlin Graph Language
reXster
Prefuse - visualization tool

关于GraphDB的一些个人想法

个人感觉，GraphDB的服务层可以独立于存储层，只要对存储层做合理的抽象，现在的许多存储方案都可以使用，包括常见的RDBMS，以及各种KV啦，ColumnOrientedDB啦，等等。但数据的组织，应该参考GraphDB的服务层设计，尽量贴近服务层，规避特定存储层在GraphDB的访问模式下的各种弊端，比如“索引的空间和时间消耗”.

GraphDB强调索引的Vertices Adjacency能力，即通过Vertex可以直接获得临近的Vertices，而不用像在RDBMS那样，去B-Tree等索引中先查看索引，再做join的方式来获取相关vertices，换句话说，当前vertex本身已经是临近的vertices的索引。

当然， GraphDB并不排斥其它索引形式，比如完全可以通过B-Tree, B+Tree或者lucency等手段对vertices或者edge的properties进行索引或者对root vertices set进行索引，以加快graph其它相关信息的检索和访问。

存储结构抽象和引申

首先， graph中核心元素的存储可以分开，比如分为：

node存储（适合使用索引的存储）
edge存储（组合索引？！）
properties存储（比较适合用KV存储）

分开的好处是，不同元素的存储方案选择比较灵活，另外分区扩展也很容易，但也有弊端，比如构建索引的空间开销，以及搜索期间跨越不同存储的时间开销等。为了规避弊端，可以根据某些指标（比如子图）将核心元素的存储分区做locality，不过，这又进一步会引入跨分区遍历的开销…

反过来，几种核心元素一起存储又会怎样？！想自然语言的语句似的将所有信息都包含进一个单元（比如句子）当然可以，绝对高内聚，但使用的适合可能就不太方便，要做多级索引才能提高查询和计算速度，而索引增多，随着查找层次的加深，又会进一步降低性能，所以完全的高内聚的一起存储所有信息显然不是啥太明智的做法。

SNS的关系网络原则上并非十分密集，所以，采用相邻列表(Adjacency list)的结构来存储应该比较合适。而使用相邻列表(Adjacency list)的话， node以及其关联的edge就可以一同存储了。在这种情况下，即使node和edge的properties单独存储，也是很容易切割和扩展的。