MongoDB Shard Cluster
We all pick one database or another to store, query and fetch data. The real question is which type of database best supports the project at hand, and that question goes deeper than “SQL vs. NoSQL”. For example, if I need a flexible schema with recursive graph queries, I go with MongoDB, a NoSQL database that stores JSON documents made up of field and value pairs.
    {
      "name": "Angelina",
      "place": "Bangalore"    // field: value
    }
Since the document structure is not fixed and queries are themselves JSON, queries are easy to compose.
What is Database Sharding?
Consider a scenario where I have a standalone MongoDB instance running on one machine, holding data for 2 million users. My business is approaching the point where that single machine can no longer cope, and will likely surpass 2.5 million users soon. So, I decide to break the database up into two parts, each on its own machine.
The system capacity is now doubled!
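To make the idea concrete, here is a minimal sketch in plain JavaScript, runnable with Node.js. No MongoDB is involved: two in-memory maps stand in for the two machines, and the routing rule (user id modulo the number of shards) and the names shardFor, saveUser and findUser are assumptions made up for this illustration.

```javascript
// Illustrative only: plain JavaScript, not a MongoDB API.
// Two in-memory maps stand in for two database machines.
const shards = [new Map(), new Map()];

// Hypothetical routing rule: pick a shard from the numeric user id.
function shardFor(userId) {
  return shards[userId % shards.length];
}

function saveUser(user) {
  shardFor(user.id).set(user.id, user);
}

function findUser(userId) {
  // The same rule tells us where to look, so only one shard is queried.
  return shardFor(userId).get(userId);
}

for (let id = 1; id <= 10; id++) {
  saveUser({ id, name: "user" + id });
}

console.log(shards.map((s) => s.size)); // [ 5, 5 ]
console.log(findUser(7).name);          // user7
```

Each map holds only half of the users, and a lookup touches only the map that owns the requested id.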
What is MongoDB Sharding?
MongoDB uses sharding to support deployments with very large data sets by distributing the data across multiple machines. This results in high-throughput operations.
Components:
- Shards: Each shard contains a subset of the data in the sharded cluster.
- Mongos (query router): Provides an interface between client applications and the sharded cluster.
- Config servers: Store metadata and configuration settings for the cluster.
Data is distributed based on the shard key, which is a field or a set of fields that you select from the documents of the target collection.
Types of Sharding:
- Ranged Sharding: Divides data into contiguous ranges determined by the shard key values.
- Hashed Sharding: Uses a hashed index of the shard key to partition data across your sharded cluster.
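To see why the choice matters, here is a small simulation in plain JavaScript, runnable with Node.js. It is a sketch under assumptions: the range boundaries (1000 and 2000), the three-shard layout and the FNV-1a hash are all stand-ins chosen for illustration. MongoDB uses its own hash function and chunk boundaries.

```javascript
// Illustrative simulation; not MongoDB's real hash or chunk logic.
const SHARD_COUNT = 3;

// Ranged: contiguous key ranges map to shards (hypothetical boundaries).
function rangedShard(key) {
  if (key < 1000) return 0;
  if (key < 2000) return 1;
  return 2;
}

// Hashed: a stand-in 32-bit FNV-1a hash over the key's string form.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep it an unsigned 32-bit value
  }
  return hash;
}

function hashedShard(key) {
  return fnv1a(String(key)) % SHARD_COUNT;
}

// Count how many keys each shard receives under a given rule.
function distribution(pickShard, keys) {
  const counts = new Array(SHARD_COUNT).fill(0);
  for (const key of keys) counts[pickShard(key)]++;
  return counts;
}

// Steadily increasing keys, e.g. newly assigned sequence numbers 1..300.
const keys = Array.from({ length: 300 }, (_, i) => i + 1);

const rangedCounts = distribution(rangedShard, keys);
const hashedCounts = distribution(hashedShard, keys);
console.log("ranged:", rangedCounts); // every key falls in the first range
console.log("hashed:", hashedCounts);
```

With steadily increasing keys, the ranged rule sends every new document to the same shard, while the hashed rule spreads them out. The trade-off is that a range query on the shard key can be targeted at a few shards under ranged sharding, but generally has to be sent to all shards under hashed sharding.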
The demo below shows how to enable hashed sharding for a database, with “_id” as the shard key.
Deploy a Shard cluster
1. Config Server replica set
For a production deployment, deploy the config servers as a replica set with at least three members. Below is the MongoDB configuration for config servers.
    sharding:
      clusterRole: configsvr
    replication:
      replSetName: <replica_set_name>
Deploy the config server replica set with 3 members, i.e. 1 primary and 2 secondaries.
    config-server replica set name: "test-config-server"
    ...
    mongod --configsvr --replSet test-config-server --dbpath /data --bind_ip localhost,<hostname(s)|ip address(es)>
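Starting the mongod processes does not by itself form a replica set; the set also has to be initiated once from a shell connected to one of its members. As a sketch, the snippet below builds the configuration document that would be passed to rs.initiate(). It runs under Node.js only to print the document. The helper name configServerReplSet is made up for illustration, and the hostnames are the example ones used for mongos later in this article.

```javascript
// Sketch: build the document passed to rs.initiate() for a config server replica set.
// configServerReplSet is a hypothetical helper; hostnames and port are example values.
function configServerReplSet(name, hosts) {
  return {
    _id: name,        // must match the --replSet value given to mongod
    configsvr: true,  // marks this as a config server replica set
    members: hosts.map((host, index) => ({ _id: index, host })),
  };
}

const config = configServerReplSet("test-config-server", [
  "cfg1.example.net:27019",
  "cfg2.example.net:27019",
  "cfg3.example.net:27019",
]);

console.log(JSON.stringify(config, null, 2));

// In the mongo shell, connected to one of the members:
//   rs.initiate(config)
```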
2. Shard Replica set
Below is the mongod configuration for each shard replica set.
    sharding:
      clusterRole: shardsvr
    replication:
      replSetName: <replica_set_name>
Create three MongoDB shard replica sets, each with a primary, a secondary and an arbiter; add more replica members if needed. Give each shard a distinct replica set name.
    mongod --shardsvr --replSet testing-shard-1 --dbpath <path> --bind_ip localhost,<hostname(s)|ip address(es)>
    mongod --shardsvr --replSet testing-shard-2 --dbpath <path> --bind_ip localhost,<hostname(s)|ip address(es)>
    mongod --shardsvr --replSet testing-shard-3 --dbpath <path> --bind_ip localhost,<hostname(s)|ip address(es)>
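Each shard replica set also needs to be initiated once, and the arbiter has to be marked as such in the member list. The snippet below is a minimal sketch that builds one such configuration document per shard under Node.js. The helper shardReplSet and all hostnames are hypothetical, and port 27018 is assumed because it is the default for mongod started with --shardsvr.

```javascript
// Sketch: build rs.initiate() documents for shard replica sets with one arbiter each.
// shardReplSet and the hostnames below are hypothetical, for illustration only.
function shardReplSet(name, dataHosts, arbiterHost) {
  const members = dataHosts.map((host, index) => ({ _id: index, host }));
  // The arbiter votes in elections but holds no data.
  members.push({ _id: members.length, host: arbiterHost, arbiterOnly: true });
  return { _id: name, members };
}

const shardConfigs = [1, 2, 3].map((n) =>
  shardReplSet(
    `testing-shard-${n}`,
    [`shard${n}a.example.net:27018`, `shard${n}b.example.net:27018`],
    `shard${n}arb.example.net:27018`
  )
);

console.log(JSON.stringify(shardConfigs[0], null, 2));

// In the mongo shell, connected to a data-bearing member of each shard:
//   rs.initiate(<the matching document from shardConfigs>)
```

An arbiter is only an option for the shard replica sets; a config server replica set must not contain one.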
3. Mongos for the shard cluster
When running mongos, add the setting below to its configuration file, i.e. the config server replica set name and at least one member of that replica set.
    sharding:
      configDB: <configReplSetName>/<cfg-server1-ip>:<port>,<cfg-server2-ip>:<port>,<cfg-server3-ip>:<port>
Running mongos:
mongos --configdb test-config-server/cfg1.example.net:27019,cfg2.example.net:27019,cfg3.example.net:27019 --bind_ip localhost,<hostname(s)|ip address(es)>
4. Connect to the Shard Cluster
mongo --host <hostname> --port <port>
4.1 Add Shards to the Cluster
    sh.addShard("testing-shard-1/<hostname(s)|ip address(es)>")
    sh.addShard("testing-shard-2/<hostname(s)|ip address(es)>")
    sh.addShard("testing-shard-3/<hostname(s)|ip address(es)>")
    ...
    sh.addShard("<shard-names>/<hostname(s)|ip address(es)>")
4.2 Create database
Creating users database
    use users
    ...
    use <db name>
4.3 Create collection
Creating collection with name indianUsers
    db.createCollection("indianUsers")
    ...
    db.createCollection("<collectionName>")
4.4 Check indexes for collection
By default, every collection has an index on {_id: 1}. Sharding on that index would give range-based sharding.
    db.indianUsers.getIndexes()
    ...
    db.<collectionName>.getIndexes()
Output:
    [
      {
        "key": { "_id": 1 },
        "name": "_id_",
        "ns": "users.indianUsers",
        "v": 2
      }
    ]
4.5 Create hashed index
We are going with hash-based sharding, so create an index on {_id: "hashed"}.
    db.indianUsers.createIndex({_id: "hashed"})
    ...
    db.<collectionName>.createIndex({_id: "hashed"})
    ...
    // Check indexes again
    db.indianUsers.getIndexes()
Output:
    [
      {
        "key": { "_id": 1 },
        "name": "_id_",
        "ns": "users.indianUsers",
        "v": 2
      },
      {
        "key": { "_id": "hashed" },
        "name": "_id_hashed",
        "ns": "users.indianUsers",
        "v": 2
      }
    ]
4.6 Enable sharding for database
    sh.enableSharding('users')
    ...
    sh.enableSharding('<db name>')
4.7 Enable hash based sharding for collection
    sh.shardCollection("users.indianUsers", {_id: "hashed"})
    ...
    sh.shardCollection("<db name>.<collectionName>", {_id: "hashed"})
4.8 Check collection distribution
    db.indianUsers.getShardDistribution()
    ...
    db.<collectionName>.getShardDistribution()
Output:
    Shard testing-shard-1-v2 at testing-shard-1-v2/<replica-server-ip:port>
     data : 0B docs : 0 chunks : 2
     estimated data per chunk : 0B
     estimated docs per chunk : 0

    Shard testing-shard-3-v2 at testing-shard-3-v2/<replica-server-ip:port>
     data : 0B docs : 0 chunks : 2
     estimated data per chunk : 0B
     estimated docs per chunk : 0

    Shard testing-shard-2-v2 at testing-shard-2-v2/<replica-server-ip:port>
     data : 0B docs : 0 chunks : 2
     estimated data per chunk : 0B
     estimated docs per chunk : 0

    Totals
     data : 0B docs : 0 chunks : 6
     Shard testing-shard-1-v2 contains 0% data, 0% docs in cluster, avg obj size on shard : 0B
     Shard testing-shard-3-v2 contains 0% data, 0% docs in cluster, avg obj size on shard : 0B
     Shard testing-shard-2-v2 contains 0% data, 0% docs in cluster, avg obj size on shard : 0B
4.9 Insert Data into the collection
Use a simple for loop to insert some data into the collection.
    for (var i = 1; i <= 1000; i++) {
      db.indianUsers.insert({ 'first': 'helloWorld', 'rollno': i })
    }
Output:
    WriteResult({ "nInserted" : 1 })

    // This inserts 1000 documents into the collection.
    // Check collection count
    db.indianUsers.count()
    ...
    db.<collection name>.count()
Output: 1000
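The loop above sends one insert per document. As an alternative sketch, the same 1000 documents can be built as a single array and sent in one call with insertMany(), which is available on collections in the mongo shell. The snippet runs under Node.js only to build and count the array; the shell call is left as a comment because it needs a live connection through mongos.

```javascript
// Build the same 1000 documents as the loop above, as one array.
const docs = [];
for (let i = 1; i <= 1000; i++) {
  docs.push({ first: "helloWorld", rollno: i });
}

console.log(docs.length); // 1000

// In the mongo shell, connected through mongos:
//   db.indianUsers.insertMany(docs)
```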
Check the shard distribution again. The data is now divided across the 3 shards: mongos routes each document to the shard that owns the range containing its hashed key, and the balancer keeps the chunks evenly spread across the shards.
    db.indianUsers.getShardDistribution()
    ...
    db.<collectionName>.getShardDistribution()

Output:

    Shard testing-shard-2-v2 at testing-shard-2-v2/<replica-server-ip:port>
     data : 17KiB docs : 324 chunks : 2
     estimated data per chunk : 8KiB
     estimated docs per chunk : 162

    Shard testing-shard-1-v2 at testing-shard-1-v2/<replica-server-ip:port>
     data : 17KiB docs : 332 chunks : 2
     estimated data per chunk : 8KiB
     estimated docs per chunk : 166

    Shard testing-shard-3-v2 at testing-shard-3-v2/<replica-server-ip:port>
     data : 18KiB docs : 344 chunks : 2
     estimated data per chunk : 9KiB
     estimated docs per chunk : 172

    Totals
     data : 53KiB docs : 1000 chunks : 6
     Shard testing-shard-2-v2 contains 32.4% data, 32.4% docs in cluster, avg obj size on shard : 55B
     Shard testing-shard-1-v2 contains 33.2% data, 33.2% docs in cluster, avg obj size on shard : 55B
     Shard testing-shard-3-v2 contains 34.39% data, 34.39% docs in cluster, avg obj size on shard : 55B
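The percentages in that output follow directly from the per-shard document counts. As a quick check, this small Node.js sketch recomputes each shard's share and the gap between the fullest and emptiest shard. The counts are copied from the output above; the variable names are arbitrary.

```javascript
// Document counts per shard, copied from the getShardDistribution() output.
const docsPerShard = {
  "testing-shard-1-v2": 332,
  "testing-shard-2-v2": 324,
  "testing-shard-3-v2": 344,
};

const counts = Object.values(docsPerShard);
const total = counts.reduce((sum, n) => sum + n, 0);

for (const [shard, count] of Object.entries(docsPerShard)) {
  const share = ((count / total) * 100).toFixed(1);
  console.log(`${shard}: ${count} docs, ${share}% of ${total}`);
}

// Difference between the fullest and the emptiest shard.
const spread = Math.max(...counts) - Math.min(...counts);
console.log(`largest gap between shards: ${spread} docs`); // 20 docs
```

A gap of 20 documents out of 1000 means no shard is more than about two percentage points away from an even one-third share.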
Conclusion
MongoDB makes sharding and scaling more manageable than many other popular databases.
I hope this article was able to shed some light on what sharding is in MongoDB.
Originally published at https://www.coffeebeans.io.