Sharding - A Database Optimization Technique
What is Sharding ?
Sharding (also known as Data Partitioning) is the process of splitting a large dataset into many small partitions which are placed on different machines. Each partition is known as a "shard".
It is the process of splitting up a DB/table across multiple machines to improve the manageability, performance, availability, and load balancing of an application. The justification for data partitioning is that, after a certain scale point, it is cheaper and more feasible to scale horizontally by adding more machines than to grow it vertically by adding beefier servers.
Types of Sharding
Horizontal Sharding In Horizontal Sharding each new table either has the same schema but unique rows.
In case of Vertical Sharding, the schema is a proper subset of the original table’s schema
There are many different schemes one could use to decide how to break up an application database into multiple smaller DBs. Below are three of the most popular schemes used by various large scale applications.
Key Based Sharding
Key based sharding, also known as hash based sharding, involves using a value taken from newly written data — such as a customer’s ID number, a client application’s IP address, a ZIP code, etc. — and plugging it into a hash function to determine which shard the data should go to. A hash function is a function that takes as input a piece of data (for example, a customer email) and outputs a discrete value, known as a hash value.
The fundamental problem with this approach is that it effectively fixes the total number of DB servers, since adding new servers means changing the hash function which would require redistribution of data and downtime for the service. A workaround for this problem is to use Consistent Hashing (will be explored further in a separate blog).
Directory Based Sharding
Directory based sharding, requires one to create and maintain a lookup table that uses a shard key to keep track of which shard holds which data. In a nutshell, a lookup table is a table that holds a static set of information about where specific data can be found.
List Based Sharding
In this scheme, each partition is assigned a list of values, so whenever we want to insert a new record, we will see which partition contains our key and then store it there. For example, we can decide all users living in continent of NorthAmerica will be stored in a partition for the North America.
Sharding based on 'Continent' putting together all records belonging to one continent in one shard.
Range Based Sharding
Range based sharding involves sharding data based on ranges of a given value. To illustrate, let’s say in the same original database shown above , we now break it further into tables based on a range of values like in this case the age-bracket as shown below.
Sharding based on 'Age-Bracket' putting together all records belonging within a certain range of values for age (age-bracket) in one shard.
Round -Robin Sharding
This is a very simple strategy that ensures uniform data distribution. With ‘n’ partitions, the ‘i’ tuple is assigned to partition (i mod n).
Under this scheme, we combine any of the above partitioning schemes to devise a new scheme. For example, first applying a list partitioning scheme and then a hash based partitioning. Consistent hashing could be considered a composite of hash and list partitioning where the hash reduces the key space to a size that can be listed.
Advantages of Sharding
With a sharded database, if there is outage in one database shard it makes only some part of application or website unavailable to some users, but other shards can continue operating without any issue. If database is unsharded, then an outage has potential to make entire application unavailable.
Faster queries response
Sharded database architecture speed up query response times. When you submit a query on a database that hasn’t been sharded, it may have to search every row in the table you’re querying before it can find the result set you are looking for. For an application with a large, monolithic database, queries can become prohibitively slow. By Sharding one table into multiple, though, queries have to go over fewer rows and their result sets are returned much more quickly.
More write bandwidth
With no master database serializing writes you can write in parallel which increases your write throughput. Writing is major bottleneck for many websites.
Sharding a database can help to facilitate horizontal scaling, known as scaling out. A parallel backend means you can do more work simultaneously. You can handle higher user loads, especially when writing data, because there are parallel paths through your system. You can load balance web servers, which access shards over different network paths, which are processed by separate CPUs, which use separate caches of RAM and separate disk IO paths to process work. Very few bottlenecks limit your work.
Common problems of Sharding
Joins and Denormalization Performing joins on a database which is running on one server is straightforward, but once a database is partitioned and spread across multiple machines it is often not feasible to perform joins that span database partitions. Such joins will not be performance efficient since data has to be compiled from multiple servers. A common workaround for this problem is to denormalize the database so that queries that previously required joins can be performed from a single table. Of course, the service now has to deal with all the perils of denormalization such as data inconsistency.
As we saw that performing a cross-partition query on a partitioned database is not feasible, similarly, trying to enforce data integrity constraints such as foreign keys in a partitioned database can be extremely difficult.
Most of RDBMS do not support foreign keys constraints across databases on different database servers. Which means that applications that require referential integrity on partitioned databases often have to enforce it in application code. Often in such cases, applications have to run regular SQL jobs to clean up dangling references.
There could be many reasons we have to change our partitioning scheme:
The data distribution is not uniform, e.g., there are a lot of places for a particular ZIP code that cannot fit into one database partition.
There is a lot of load on a partition, e.g., there are too many requests being handled by the DB partition dedicated to user photos.
Common scenarios for performing Sharding
Sharding is usually only performed when dealing with very large amounts of data. Here are some common scenarios where it may be beneficial to shard a database:
The amount of application data grows to exceed the storage capacity of a single database node.
The volume of writes or reads to the database surpasses what a single node or its read replicas can handle, resulting in slowed response times or timeouts.
The network bandwidth required by the application outpaces the bandwidth available to a single database node and any read replicas, resulting in slowed response times or timeouts.
Whether or not one should implement a sharded database architecture should be evaluated based on one's database size, performance requirements etc. Some perceive sharding as an inevitable outcome for databases that reach a certain size, while others see it as an overhead that should be avoided unless it’s absolutely necessary, due to the operational complexity that sharding may add to the system, if not chosen wisely.
Sharding if applied suitably to address some of the database painpoints mentioned above can turn out to be a great database optimization technique that can improve the database performance tremendously. Again, when it comes to choosing the right strategy there is no one-size fits all methodology and the right sharding technique needs to be chosen based on the unique problem scenario you are trying to solve for your databases.