The Facebook database is in reality a carefully selected combination of different database systems, each used for a different purpose in order to manage big data. The first database system is MySQL, which is mainly used to serve the content of profile pages. The key to understanding its usage at Facebook is the asynchronous approach to using MySQL. In addition, it is supplemented by a large distributed cache called Memcache that offloads the persistent store. Memcache can be considered a general-purpose networked in-memory data store with a key-value data model. More details can be found here. This approach, however, has increasingly shown its limits.
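The combination of a persistent MySQL store with a Memcache tier in front of it is commonly described as look-aside caching. The sketch below illustrates the pattern only; the dicts stand in for a memcached cluster and a MySQL shard, and the class name is hypothetical.

```python
# Sketch of the look-aside caching pattern: try the cache first, fall back
# to the persistent store on a miss, and invalidate on writes. Plain dicts
# stand in for the Memcache tier and the MySQL database (illustrative only).

class LookAsideCache:
    def __init__(self, database):
        self.cache = {}           # stands in for the Memcache tier
        self.database = database  # stands in for persistent MySQL storage

    def get(self, key):
        # 1. Try the cache first.
        if key in self.cache:
            return self.cache[key]
        # 2. On a miss, read from the database and fill the cache.
        value = self.database.get(key)
        if value is not None:
            self.cache[key] = value
        return value

    def put(self, key, value):
        # Write to the database and invalidate the cached entry, so the
        # next read repopulates the cache with fresh data.
        self.database[key] = value
        self.cache.pop(key, None)

db = {"user:42": {"name": "Alice"}}
store = LookAsideCache(db)
store.get("user:42")                        # miss: reads DB, fills cache
store.put("user:42", {"name": "Alice B."})  # write invalidates the entry
store.get("user:42")                        # miss again, returns fresh data
```

Invalidating on writes rather than updating the cache in place keeps the cache logic simple, at the cost of one extra database read after each write.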
The second database system is a giant graph database called ‘The Associations and Objects’ (TAO). This database serves well the ‘creation-time locality’ that dominates the Facebook workload: a specific data item is likely to be accessed if it has been recently created. TAO has been in production for several years and runs on a large collection of geographically distributed server clusters. It is used with thousands of data types and handles over a billion read requests and millions of write requests every second at Facebook. It is based on a graph data model and offers an interesting API to developers. A managed data set is typically partitioned into hundreds of thousands of so-called shards. All objects and associations in the same shard are stored persistently in the same MySQL database, and they are cached on the same set of servers in each caching cluster. More details can be found here.
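One way such sharding can work is to embed the shard id directly in each object id, so any server can route a request to the right MySQL database without a lookup table. The bit layout and helper names below are illustrative assumptions, not Facebook's actual scheme.

```python
# Sketch of id-embedded sharding: the low bits of an object id carry its
# shard, and all objects/associations of one shard map to one MySQL database.
# SHARD_BITS and the packing scheme are hypothetical.

SHARD_BITS = 16  # hypothetical; real deployments use hundreds of thousands of shards

def make_object_id(sequence, shard):
    # Pack the shard id into the low bits of the object id.
    return (sequence << SHARD_BITS) | shard

def shard_of(object_id):
    # Recover the shard directly from the id -- no lookup needed,
    # so any server can route the request to the right database.
    return object_id & ((1 << SHARD_BITS) - 1)

oid = make_object_id(sequence=123, shard=3)
assert shard_of(oid) == 3
# Associations can then be stored on the shard of their source object,
# keeping an object and its outgoing edges in the same database.
```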
The third one is a NoSQL database system named Apache Cassandra, which was developed at Facebook to power the inbox search feature and is used at Instagram today. Facebook released Cassandra as an open-source project on Google Code in 2008, and it afterwards became a major Apache project. The motivation for developing Cassandra was that Facebook has lots of data to deal with, such as copies of messages, reverse indices of messages, and all per-user data. The Facebook database faces many incoming requests, resulting in a lot of random reads and random writes. No existing production-ready solution on the market met all these requirements. As a consequence, Cassandra was developed, with over 200 nodes deployed at Facebook. More details can be found at Facebook or here. There is also a nice presentation about Cassandra on SlideShare.
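A reverse index of messages, as mentioned above, maps each search term back to the messages that contain it, partitioned per user. The toy in-memory version below shows only the data layout; it uses no Cassandra API, and the function names are illustrative.

```python
from collections import defaultdict

# Toy reverse index for inbox search: user -> term -> set of message ids.
# Partitioning by user mirrors how inbox-search data is kept per user;
# everything else here is a simplified stand-in.

index = defaultdict(lambda: defaultdict(set))

def index_message(user, message_id, text):
    # Tokenize naively on whitespace and record each term.
    for term in text.lower().split():
        index[user][term].add(message_id)

def search(user, term):
    # Look up a term in one user's partition of the index.
    return sorted(index[user].get(term.lower(), set()))

index_message("alice", 1, "lunch tomorrow")
index_message("alice", 2, "tomorrow works for lunch")
search("alice", "lunch")  # both messages contain the term
```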
The use of Cassandra was paused in 2010, when Facebook built its Messaging platform on HBase instead. The reason was that Facebook engineers found Cassandra’s eventual consistency model to be a difficult pattern to work with. Facebook moved off its pre-Apache Cassandra deployment in late 2010, when it replaced Inbox Search with the Facebook Messaging platform and HBase, as written here. Facebook later began using Apache Cassandra again in its Instagram unit. More details can be found here.
The fourth database system is also a NoSQL technology, Apache HBase, which is slightly extended at Facebook and used for messages. These messages combine SMS, chat, email, and traditional Facebook messages into one inbox per end user. The open-source HBase database system is a distributed key-value data store running on top of the Hadoop Distributed File System (HDFS). HBase was chosen as the underlying durable data store because it provides the high write throughput and low-latency random read performance necessary for the Facebook Messages platform. Other important features are horizontal scalability, strong consistency, and high availability via automatic failover. HBase is thus used for point-read, online transaction processing workloads like Messages. A key requirement for Facebook is reliability in the face of challenges such as the time to recover from individual failing hosts, rack-level failures, and the overall nuances of running on dense storage. Here HBase faces limits as well, justifying some extensions to it.
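In HBase, rows are stored sorted by row key, so key design determines data locality. A common pattern for a message store, sketched below, prefixes the key with the user id and inverts the timestamp so that a user's newest messages sort first and can be read with one short scan. This is an illustrative design, not Facebook's actual schema.

```python
# Sketch of an HBase-style row key for a per-user message store.
# Fixed-width fields keep the lexicographic order meaningful.
# MAX_TS and the field widths are hypothetical.

MAX_TS = 10**13  # hypothetical upper bound on millisecond timestamps

def row_key(user_id, timestamp_ms):
    # Inverting the timestamp (MAX_TS - ts) makes newer messages sort
    # first, so "latest N messages for a user" is one short prefix scan.
    return f"{user_id:>12}:{MAX_TS - timestamp_ms:013d}"

# Sorting the keys lexicographically (as HBase stores rows) groups all of
# alice's messages together, newest (ts=3000) first.
keys = sorted(row_key("alice", ts) for ts in (1000, 3000, 2000))
```

Because all of a user's messages share the same key prefix, a single range scan over that prefix retrieves an inbox without touching other users' data.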
Today, HBase is used for many other services at Facebook, such as online analytics processing workloads that involve large data scans. Other examples of its usage at Facebook are the internal monitoring system, search indexing, streaming data analysis, and data scraping for its internal data warehouses. More details can be found here.
Facebook database details
We refer to the following video for more details: