Original paper: Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency

IDEA

A cloud storage system that provides customers the ability to store seemingly limitless amounts of data with high availability and strong consistency.

System characteristics:

High availability and strong consistency
Global and scalable namespace/storage
Multiple data abstractions from a single stack: several kinds of data are supported
Automatic load balancing
Range partitioning vs. hashing: dynamic range partitioning is used instead of hashing
Append-only system: the storage system only performs append operations
End-to-end checksum
Separate log file per RangePartition: logs are kept at RangePartition granularity

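High availability is achieved through replication (three replicas by default), and atomic write operations guarantee strong consistency. Azure supports three kinds of data: Blobs (data blocks), Tables (structured storage) and Queues (message queues). All data is written by appending.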

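System overview:

The largest storage unit in Azure is the storage stamp, which is made up of many storage nodes. At the time of the paper one storage stamp holds about 2 PB of data, and this is to be expanded to 30 PB in the future. A client resolves a URI to the right storage stamp through the Location Service and DNS. The service Azure exposes as a whole looks like the figure below:

From top to bottom, each storage stamp is divided into three layers: the Front-Ends, the Partition Layer and the Stream Layer.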

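The Stream Layer stores data in files called "streams". A stream is an ordered sequence of chunks (called extents in the paper). The Stream Layer manages and replicates extents, but it knows nothing about the higher-level objects.
It is worth pointing out that the Stream Layer is essentially a file system, which Azure took over from Cosmos, the storage system behind Bing.
The Partition Layer understands and manages the higher-level data abstractions (Blobs, Tables, Queues) and provides a scalable namespace. Logically it is the layer that stores the object data, and it caches data during I/O.
The Front-Ends are a set of stateless servers that accept incoming requests. They keep a copy of the RangePartition map, so that each incoming request can be forwarded to the right partition server.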
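The Front-End lookup described above can be sketched in Python as a search over sorted range boundaries. This is only an illustration: the map layout, the server names (ps-1, ps-2, ps-3) and the route helper are assumptions, not the format the paper uses.

```python
import bisect

# Hypothetical partition map: each entry is (low_key, partition_server).
# A RangePartition covers keys from its low_key up to, but not including,
# the low_key of the next entry.
PARTITION_MAP = [
    ("", "ps-1"),    # keys below "h"
    ("h", "ps-2"),   # keys from "h" up to "q"
    ("q", "ps-3"),   # keys from "q" upward
]
LOW_KEYS = [low for low, _ in PARTITION_MAP]


def route(partition_name: str) -> str:
    """Return the partition server whose key range contains partition_name."""
    # bisect_right finds the first low_key greater than the name;
    # the entry just before it owns the name.
    idx = bisect.bisect_right(LOW_KEYS, partition_name) - 1
    return PARTITION_MAP[idx][1]
```

Because the Front-Ends are stateless, any of them can answer a request as long as it holds a current copy of such a map.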

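Within a storage stamp, the layers relate to each other as shown in the figure below:

Data structures

Now that the storage architecture is roughly clear, let's look at how Azure organizes data, as shown in the figure below:

Take the file //foo as an example. It consists of several extents (similar to chunks), and each extent in turn consists of several blocks.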

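Block: the smallest unit of reading and writing. A client reads or writes one or more contiguous blocks at a time. Block size is not fixed; the client decides the size of each block when it appends.
Extent: the unit of replication in Azure. Completing an extent is called sealing it. Once sealed, an extent accepts no further appends; instead, a new extent is added to the stream after it. The nodes that store extents are called ENs (extent nodes).
Stream: corresponds to a file in a file system. The namespace keeps a name for each stream.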
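The block / extent / stream hierarchy above can be modelled in a few lines of Python. This is a toy sketch, not the real on-disk layout: the class names, the max_extent_bytes threshold and the size-based sealing rule are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Extent:
    blocks: list = field(default_factory=list)  # variable-sized byte blocks
    sealed: bool = False

    def append(self, block: bytes) -> None:
        if self.sealed:
            raise ValueError("cannot append to a sealed extent")
        self.blocks.append(block)

    def size(self) -> int:
        return sum(len(b) for b in self.blocks)


@dataclass
class Stream:
    name: str
    extents: list = field(default_factory=lambda: [Extent()])
    max_extent_bytes: int = 8  # tiny hypothetical seal threshold for the demo

    def append(self, block: bytes) -> None:
        last = self.extents[-1]
        # Seal the last extent when the new block would overflow it,
        # then continue appending in a fresh extent.
        if last.blocks and last.size() + len(block) > self.max_extent_bytes:
            last.sealed = True
            last = Extent()
            self.extents.append(last)
        last.append(block)

    def read_all(self) -> bytes:
        return b"".join(b for e in self.extents for b in e.blocks)
```

Appending b"abc", b"defgh" and b"i" to Stream("//foo") fills the first extent with two blocks, seals it, and puts the third block in a second extent. Only the last extent of a stream is ever open for appends.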

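Since the namespace has come up, let's look at it next:

Namespace

As mentioned in the list of characteristics, Azure uses a single global namespace. No matter how many data centers exist worldwide or how many storage nodes are added, a given piece of data has exactly one globally unique name. All data can be accessed through a URI of the following form: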

http(s)://AccountName.<service>.core.windows.net/PartitionName/ObjectName

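AccountName is the customer's account name and is part of the DNS host name; DNS uses it to find the storage cluster that holds the account's data
PartitionName locates the data once the request has reached the storage cluster
ObjectName identifies an individual object within that partition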
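As a small illustration of the URI layout above, the following Python sketch splits a storage URI into its parts. The helper name parse_storage_uri and the sample account, partition and object names are made up for the example.

```python
from urllib.parse import urlparse


def parse_storage_uri(uri: str) -> dict:
    """Split http(s)://AccountName.<service>.core.windows.net/PartitionName/ObjectName."""
    parsed = urlparse(uri)
    host_parts = parsed.hostname.split(".")
    account, service = host_parts[0], host_parts[1]
    # The first path segment is the PartitionName; the rest is the ObjectName.
    partition, _, obj = parsed.path.lstrip("/").partition("/")
    return {
        "account": account,
        "service": service,
        "partition": partition,
        "object": obj,
    }
```

For a hypothetical URI such as https://myaccount.blob.core.windows.net/photos/cat.jpg this yields account "myaccount", service "blob", partition "photos" and object "cat.jpg".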

Next, let's look at the Stream Layer and the Partition Layer in detail, starting at the bottom with the Stream Layer:

Stream Layer

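The Stream Layer has the structure of a typical DFS. It consists of clients, ENs (extent nodes) and the SM cluster:

The SM (stream manager) cluster plays the role of a metadata server cluster and uses the Paxos algorithm internally to stay consistent. ENs are similar to data servers; they are where the data is actually stored. The figure above shows the flow of a client write, and it shows that Azure writes in a pipelined fashion similar to the Hadoop write pipeline. The SM (stream manager) has the following responsibilities: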

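Maintaining the stream namespace and the state of all active streams and extents
Monitoring the health of the ENs
Creating extents and assigning them to ENs
Performing lazy re-replication of extent replicas
Garbage collecting extents that are no longer pointed to by any stream
Scheduling the erasure coding of extent data according to stream policy. Erasure coding is a redundant error-correcting encoding; the paper uses Reed-Solomon. The algorithm splits an extent into fragments of roughly equal size and adds M fragments of error-correcting code. As long as no more than M fragments are lost, the extent can be fully reconstructed. Compared with keeping three full replicas of everything, this saves storage space.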
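To make the erasure coding idea concrete, here is a deliberately simplified Python sketch. It is not Reed-Solomon: it uses a single XOR parity fragment, which corresponds to the special case M = 1. The function names and the zero-padding scheme are assumptions for illustration only.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, n: int) -> list:
    """Split data into n equal-sized fragments (zero-padded) plus 1 XOR parity fragment."""
    size = -(-len(data) // n)  # ceiling division
    padded = data.ljust(size * n, b"\0")
    fragments = [padded[i * size:(i + 1) * size] for i in range(n)]
    parity = reduce(xor_bytes, fragments)
    return fragments + [parity]


def recover(fragments: list) -> list:
    """Rebuild at most one missing fragment (marked None) by XOR-ing the others."""
    missing = [i for i, f in enumerate(fragments) if f is None]
    if len(missing) > 1:
        raise ValueError("single parity can only recover one lost fragment")
    if missing:
        present = [f for f in fragments if f is not None]
        fragments = list(fragments)
        fragments[missing[0]] = reduce(xor_bytes, present)
    return fragments
```

With n data fragments and one parity fragment the stored size is (n + 1) / n times the original, for example 4/3 for n = 3, compared with 3 times for full three-way replication. Reed-Solomon generalizes this to M parity fragments so that any M losses can be tolerated.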

My impression is that most of this part is Cosmos's work, so the paper does not dwell on it and devotes more space to the authors' own work, the Partition Layer.

Partition Layer

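The Partition Layer stores the different types of objects and understands what a transaction means for each object type. It provides the following functions:

Storing different types of objects
Handling the logic and semantics of each object type
Managing a large, scalable object namespace
Load balancing across the available partition servers
Ordering transactions and guaranteeing strong consistency for objects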

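The Partition Layer also has three components: the PM (Partition Manager), the PSs (Partition Servers) and the LS (Lock Server). Its structure is shown in the figure below:

The Partition Layer has one important data structure for storing objects, the OT (Object Table; I think of it as a structured abstraction over everything that is stored). OTs support operations such as insert, update and delete. An OT is a very large structure, so within the Partition Layer it is broken into a number of RangePartitions, and each PS holds some of them.

The paper defines a RangePartition as follows: "A RangePartition is a contiguous range of rows in an OT from a given low-key to high-key." In other words, a RangePartition is one slice of the Object Table. For example, an Object Table keyed alphabetically from a to z could be divided into three RangePartitions: a~h, h~q and q~z. The partitioning is dynamic: RangePartitions can be merged and split depending on load and other factors.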
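The split and merge operations on RangePartitions can be sketched in Python as operations on a sorted list of key ranges. The representation is an assumption: each range is a (low_key, high_key) tuple that includes low_key and excludes high_key, and the function names are made up.

```python
def split(partitions: list, index: int, split_key: str) -> list:
    """Split the range at `index` into two ranges at split_key."""
    low, high = partitions[index]
    if not (low < split_key < high):
        raise ValueError("split key must fall strictly inside the range")
    return (partitions[:index]
            + [(low, split_key), (split_key, high)]
            + partitions[index + 1:])


def merge(partitions: list, index: int) -> list:
    """Merge the range at `index` with the range that follows it."""
    (low, mid), (next_low, high) = partitions[index], partitions[index + 1]
    if mid != next_low:
        raise ValueError("only adjacent ranges can be merged")
    return partitions[:index] + [(low, high)] + partitions[index + 2:]
```

Starting from [("a", "h"), ("h", "q"), ("q", "z")], splitting the middle range at "m" gives four ranges, and merging the two halves again restores the original three. Unlike hashing, neighbouring keys stay together, which is what makes splitting a hot range and merging cold ranges possible.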

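Now that we know what data the Partition Layer handles, let's see what the PM, PS and LS each do:

PM (Partition Manager): stays in contact with the PSs, splits the Object Table into RangePartitions and assigns them to the PSs
PS (Partition Server): serves requests for the RangePartitions assigned to it
LS (Lock Server): runs the Paxos algorithm and is used to elect the leader among the PM instances

Coming back to the RangePartition, here is the data each one keeps on disk and in memory: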

Persistent Data Structures

Metadata stream
Commit log stream: the operations applied to the RangePartition since the last checkpoint
Row data stream: the checkpoints and the indexes of the checkpoints
Blob data stream: used by the Blob Table to store blob data

In-Memory Data Structures

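Memory table: the in-memory version of the RangePartition's commit log, holding recent updates that have not yet been checkpointed
Index cache: caches the checkpoint indexes of the row data stream
Row data cache: an in-memory cache of checkpoint row data
Bloom filters: when the data is not found in memory, the checkpoints on disk have to be searched; a bloom filter tells whether a row may be in a given checkpoint, so needless lookups are avoided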
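The read path implied by these structures can be sketched in Python as follows. It is a simplification under stated assumptions: the memory table and row data cache are plain dicts, a Checkpoint object stands in for an on-disk checkpoint, and the BloomFilter class, its sizes and its SHA-256 based hashing are illustrative choices, not what the real system uses.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class Checkpoint:
    """Stand-in for one on-disk checkpoint in the row data stream."""

    def __init__(self, rows: dict):
        self.rows = dict(rows)
        self.bloom = BloomFilter()
        for key in rows:
            self.bloom.add(key)
        self.disk_reads = 0  # counts simulated disk lookups

    def lookup(self, key: str):
        self.disk_reads += 1
        return self.rows.get(key)


def read_row(key, memory_table: dict, row_cache: dict, checkpoints: list):
    # 1. Try the in-memory structures first.
    for cache in (memory_table, row_cache):
        if key in cache:
            return cache[key]
    # 2. Fall back to checkpoints, newest first, skipping any checkpoint
    #    whose bloom filter says the key is definitely not there.
    for cp in reversed(checkpoints):
        if cp.bloom.might_contain(key):
            value = cp.lookup(key)
            if value is not None:
                return value
    return None
```

A bloom filter never gives a false negative, so skipping a checkpoint is always safe; an occasional false positive only costs one extra lookup.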

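Other notes

Metrics the PM uses for load balancing: transactions/second, average pending transaction count, throttling rate, CPU usage, network usage, request latency, data size of the RangePartition
Throttling (aimed at misbehaving users): the Sample-Hold algorithm is used to track the N busiest AccountNames and PartitionNames. When a server becomes overloaded, this information is used to decide what to throttle.
Separate log files per RangePartition: this isolates RangePartitions from each other, which shortens recovery after a failure or restart.
End-to-end checksum
Multiple data abstractions from a single stack (structured and unstructured data are handled by the same stack)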
Append-only system
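For the throttling note above, here is a minimal Python sketch of the sample-and-hold idea: a request from an untracked key is sampled with some probability, and once a key is tracked every later request from it is counted. The class name, the sample_prob parameter and the use of a Counter are assumptions; the paper does not give implementation details.

```python
import random
from collections import Counter


class SampleAndHold:
    def __init__(self, sample_prob: float, rng=None):
        self.sample_prob = sample_prob
        self.rng = rng or random.Random()
        self.counts = Counter()

    def record(self, key: str) -> None:
        if key in self.counts:
            # Held: every request from a tracked key is counted.
            self.counts[key] += 1
        elif self.rng.random() < self.sample_prob:
            # Sampled: start tracking this key from now on.
            self.counts[key] = 1

    def top(self, n: int) -> list:
        """Return the n busiest tracked keys as (key, count) pairs."""
        return self.counts.most_common(n)
```

Busy accounts send many requests, so they are very likely to be sampled early and then counted almost exactly, while rarely seen accounts mostly never enter the table. That keeps the memory needed to find the top N small.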
Finally, the paper discusses how Azure differs from GFS and BigTable:

GFS allows relaxed consistency across replicas and does not guarantee that all replicas are bitwise the same, whereas WAS provides that guarantee
BigTable combines multiple tablets into a single commit log and writes them to two GFS files in parallel to avoid GFS hiccups, whereas we found we could work around both of these by using journaling in our stream layer
we provide a scalable Blob storage system and batch Table transactions integrated into a BigTable-like framework.

呆鸥

