Integrating MongoDB with Python Using PyMongo

Introduction

In this post, we will dive into MongoDB as a data store from a Python perspective. To that end, we'll write a simple script to showcase what we can achieve and any benefits we can reap from it.

Web applications, like many other software applications, are powered by data. The organization and storage of this data are important as they dictate how we interact with the various applications at our disposal. The kind of data handled can also have an influence on how we undertake this process.

Databases allow us to organize and store this data, while also controlling how we access and secure the information.

NoSQL Databases

There are two main types of databases - relational and non-relational databases.

Relational databases allow us to store, access, and manipulate data in relation to another piece of data in the database. Data is stored in organized tables with rows and columns, with relationships linking the information among tables. To work with these databases, we use the Structured Query Language (SQL). Examples include MySQL and PostgreSQL.

Non-relational databases do not store data in related tables the way relational databases do. They are also referred to as NoSQL databases, since we generally do not use SQL to interact with them.

Furthermore, NoSQL databases can be divided into key-value stores, graph stores, column stores, and document stores, the category MongoDB falls under.

MongoDB and When to Use it

MongoDB is a document store and non-relational database. It allows us to store data in collections that are made up of documents.

In MongoDB, a document is stored in a JSON-like binary serialization format referred to as BSON, or Binary JSON, and has a maximum size of 16 megabytes. This size limit is in place to ensure efficient memory and bandwidth usage during transmission.

MongoDB also provides the GridFS specification in case there is a need to store files larger than the set limit.

Documents are made up of field-value pairs, just like in regular JSON data. However, this BSON format can also contain more data types, such as Date types and Binary Data types. BSON was designed to be lightweight, easily traversable, and efficient when encoding and decoding data to and from BSON.
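To make the idea of a binary, length-prefixed document more concrete, here is a minimal sketch of a BSON encoder that handles only flat documents with string and 32-bit integer values. The function name encode_bson and the constant MAX_BSON_SIZE are ours, chosen for illustration; in practice the bson package that ships with PyMongo does this work, and it supports every BSON type.

```python
import struct


def encode_bson(document):
    """Encode a flat dict of str / int32 values as BSON (illustrative subset only)."""
    body = b""
    for key, value in document.items():
        name = key.encode("utf-8") + b"\x00"  # field names are NUL-terminated strings
        if isinstance(value, str):
            data = value.encode("utf-8") + b"\x00"
            # Type 0x02 (string): int32 byte length including the trailing NUL, then the bytes
            body += b"\x02" + name + struct.pack("<i", len(data)) + data
        elif isinstance(value, int) and not isinstance(value, bool):
            # Type 0x10 (32-bit integer), little-endian
            body += b"\x10" + name + struct.pack("<i", value)
        else:
            raise TypeError(f"unsupported type in this sketch: {type(value).__name__}")
    # The total length counts the 4-byte length prefix and the final NUL byte
    return struct.pack("<i", len(body) + 5) + body + b"\x00"


MAX_BSON_SIZE = 16 * 1024 * 1024  # the 16 MB document limit mentioned above

encoded = encode_bson({"name": "Suits", "year": 2011})
print(len(encoded), "bytes; within limit:", len(encoded) <= MAX_BSON_SIZE)
```

The length prefix is what makes BSON easy to traverse: a reader can skip over a whole document or string without parsing its contents.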

Being a NoSQL datastore, MongoDB allows us to enjoy the advantages that come with using a non-relational database over a relational one. One advantage is that it offers high scalability by efficiently scaling horizontally through sharding or partitioning of the data and placing it on multiple machines.
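The idea behind sharding can be sketched in a few lines of plain Python. This is only a conceptual illustration under our own assumptions: the function pick_shard and the choice of the name field as the shard key are hypothetical, and a real MongoDB cluster manages data distribution itself rather than using this exact scheme.

```python
import hashlib


def pick_shard(shard_key_value, shard_count):
    """Map a shard-key value to a shard number using a stable hash."""
    digest = hashlib.sha256(str(shard_key_value).encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


shows = [
    {"name": "Game of Thrones", "year": 2012},
    {"name": "House of Cards", "year": 2013},
    {"name": "Suits", "year": 2011},
]

SHARD_COUNT = 3  # hypothetical number of machines
shards = {number: [] for number in range(SHARD_COUNT)}
for show in shows:
    # The same key always lands on the same shard, so lookups know where to go
    shards[pick_shard(show["name"], SHARD_COUNT)].append(show)

for number, documents in shards.items():
    print(number, [doc["name"] for doc in documents])
```

Because each document is self-contained, it can live on any machine without needing rows from other tables nearby, which is what makes this kind of horizontal partitioning practical.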

MongoDB also allows us to store large volumes of structured, semi-structured, and unstructured data without having to maintain relationships between records. Since the Community Server is free to use, the cost of implementing MongoDB is mostly limited to maintenance and expertise.

Like any other solution, there are downsides to using MongoDB. The first one is that it does not enforce relationships between stored data. Due to this, it is harder to guarantee consistency across related documents than it is in a relational database.

Multi-document ACID transactions have been available since MongoDB 4.0, but relying on them adds complexity. MongoDB, like other NoSQL data stores, is also younger than the established relational databases, and this can make it harder to find experts.

The non-relational nature of MongoDB makes it ideal for the storage of data in specific situations over its relational counterparts. For instance, a scenario where MongoDB is more suitable than a relational database is when the data format is flexible and has no relations.

With flexible/non-relational data, we don't need to maintain ACID properties when storing data as opposed to relational databases. MongoDB also allows us to easily scale data into new nodes.

However, with all its advantages, MongoDB is not ideal when our data is relational in nature, for instance, if we are storing customer records and their orders.

In this situation, we will need a relational database to maintain the relationships between our data, which are important. MongoDB is also a less natural fit if our application depends heavily on ACID guarantees.

Interacting with MongoDB via Mongo Shell

To work with MongoDB, we will need to install the MongoDB Server, which we can download from the official homepage. For this demonstration, we will use the free Community Server.

The MongoDB server comes with a Mongo Shell that we can use to interact with the server via the terminal.

To activate the shell, just type mongo in your terminal (newer MongoDB releases ship the shell as mongosh instead). You'll be greeted with information about the MongoDB server set-up, including the MongoDB and Mongo Shell version, alongside the server URL.

For instance, our server is running on:

mongodb://127.0.0.1:27017

In MongoDB, a database is used to hold collections that contain documents. Through the Mongo shell, we can create a new database or switch to an existing one using the use command:

> use SeriesDB

Every operation we execute after this will be effected in our SeriesDB database. In the database, we will store collections, which are similar to tables in relational databases.

For example, for the purposes of this tutorial, let's add a few series to the database:

> db.series.insertMany([
... { name: "Game of Thrones", year: 2012},
... { name: "House of Cards", year: 2013 },
... { name: "Suits", year: 2011}
... ])

We're greeted with:

{
    "acknowledged" : true,
    "insertedIds" : [
        ObjectId("5e300724c013a3b1a742c3b9"),
        ObjectId("5e300724c013a3b1a742c3ba"),
        ObjectId("5e300724c013a3b1a742c3bb")
    ]
}

To fetch all the documents stored in our series collection, we use db.series.find({}), whose SQL equivalent is SELECT * FROM series. Passing an empty query (i.e. {}) will return all the documents:

> db.series.find({})

{ "_id" : ObjectId("5e3006258c33209a674d1d1e"), "name" : "The Blacklist", "year" : 2013 }
{ "_id" : ObjectId("5e300724c013a3b1a742c3b9"), "name" : "Game of Thrones", "year" : 2012 }
{ "_id" : ObjectId("5e300724c013a3b1a742c3ba"), "name" : "House of Cards", "year" : 2013 }
{ "_id" : ObjectId("5e300724c013a3b1a742c3bb"), "name" : "Suits", "year" : 2011 }

We can also query data using the equality condition, for instance, to return all the TV series that premiered in 2013:

> db.series.find({ year: 2013 })
{ "_id" : ObjectId("5e3006258c33209a674d1d1e"), "name" : "The Blacklist", "year" : 2013 }
{ "_id" : ObjectId("5e300724c013a3b1a742c3ba"), "name" : "House of Cards", "year" : 2013 }

The SQL equivalent would be SELECT * FROM series WHERE year=2013.
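For readers coming from SQL, the comparison can be tried out with Python's built-in sqlite3 module. This sketch assumes a hypothetical in-memory table named series holding the same four shows; it is only meant to show the relational counterpart of the equality query above, not anything MongoDB-specific.

```python
import sqlite3

# An in-memory relational table mirroring the documents in our collection
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE series (name TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO series (name, year) VALUES (?, ?)",
    [
        ("The Blacklist", 2013),
        ("Game of Thrones", 2012),
        ("House of Cards", 2013),
        ("Suits", 2011),
    ],
)

# Relational counterpart of db.series.find({ year: 2013 })
rows = conn.execute("SELECT * FROM series WHERE year = ?", (2013,)).fetchall()
for row in rows:
    print(row)

conn.close()
```

Note that the table needs its columns declared up front, whereas the MongoDB collection accepted our documents without any schema definition.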

MongoDB also allows us to update individual documents using db.collection.updateOne(), or perform batch updates using db.collection.updateMany(). For example, to update the release year for Suits:

> db.series.updateOne(
{ name: "Suits" },
{
    $set: { year: 2010 }
}
)
{ "acknowledged" : true, "matchedCount" : 1, "modifiedCount" : 1 }

Finally, to delete documents, the Mongo Shell offers the db.collection.deleteOne() and db.collection.deleteMany() functions.

For instance, to delete all the series that premiered in 2012, we'd run:

> db.series.deleteMany({ year: 2012 })
{ "acknowledged" : true, "deletedCount" : 1 }

More information on the CRUD operations on MongoDB can be found in the online reference including more examples, performing operations with conditions, atomicity, and mapping of SQL concepts to MongoDB concepts and terminology.

Integrating Python with MongoDB

MongoDB provides drivers and tools for interacting with a MongoDB datastore using various programming languages including Python, JavaScript, Java, Go, and C#, among others.

PyMongo is the official MongoDB driver for Python, and we will use it to create a simple script that we will use to manipulate data stored in our SeriesDB database.

With Python 3.6+ and Virtualenv installed on our machines, let us create a virtual environment for our application and install PyMongo via pip:

$ virtualenv --python=python3 env
$ source env/bin/activate
$ pip install pymongo

Using PyMongo, we are going to write a simple script that we can execute to perform different operations on our MongoDB database.

Connecting to MongoDB

First, we import MongoClient from pymongo in our mongo_db_script.py and create a client connected to our locally running instance of MongoDB:

from pymongo import MongoClient

# Create the client
client = MongoClient('localhost', 27017)

# Connect to our database
db = client['SeriesDB']

# Fetch our series collection
series_collection = db['series']

So far, we have created a client that connects to our MongoDB server and used it to fetch our 'SeriesDB' database. We then fetch our 'series' collection and store it in an object.
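MongoClient also accepts a connection string like the mongodb://127.0.0.1:27017 URL we saw in the shell. The helper below is a small sketch for assembling such a string; the name build_mongo_uri and its parameters are our own, and the percent-escaping of credentials follows the approach suggested in PyMongo's documentation for usernames and passwords that contain reserved characters.

```python
from urllib.parse import quote_plus


def build_mongo_uri(host="localhost", port=27017, username=None, password=None):
    """Assemble a MongoDB connection string (hypothetical helper)."""
    if username is not None and password is not None:
        # Reserved characters such as '@' or '/' must be percent-escaped
        credentials = f"{quote_plus(username)}:{quote_plus(password)}@"
    else:
        credentials = ""
    return f"mongodb://{credentials}{host}:{port}"


print(build_mongo_uri("127.0.0.1"))
# The resulting string can be passed to the client in place of host and port:
# client = MongoClient(build_mongo_uri("127.0.0.1"))
```

Keeping the connection string in one place makes it easier to switch between a local server and a remote one later on.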

Creating Documents

To make our script more convenient, we will write functions that wrap around PyMongo to enable us to easily manipulate data. We will use Python dictionaries to represent documents and we will pass these dictionaries to our functions. First, let us create a function to insert data into our 'series' collection:

# Imports truncated for brevity

def insert_document(collection, data):
    """ Function to insert a document into a collection and
    return the document's id.
    """
    return collection.insert_one(data).inserted_id

This function receives a collection and a dictionary of data and inserts the data into the provided collection. The function then returns an identifier that we can use to accurately query the individual object from the database.

We should also note that an _id field is added automatically to each document we insert when we do not provide one ourselves.

Now let's try adding a show using our function:

new_show = {
    "name": "FRIENDS",
    "year": 1994
}
print(insert_document(series_collection, new_show))

The output is:

5e4465cfdcbbdc68a6df233f

When we run our script, the _id of our new show is printed on the terminal and we can use this identifier to fetch the show later on.
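If we have several shows to add, a companion wrapper around PyMongo's insert_many() saves us from calling insert_document() in a loop. The function name insert_documents is our own; like the other wrappers in this post, it only assumes it is handed a PyMongo collection.

```python
def insert_documents(collection, documents):
    """Insert several documents in one call and return the list of their ids."""
    if not documents:
        # insert_many() rejects an empty list, so there is nothing to do here
        return []
    return collection.insert_many(documents).inserted_ids


# Example usage against the collection created earlier (requires a running server):
# ids = insert_documents(series_collection, [
#     {"name": "Game of Thrones", "year": 2012},
#     {"name": "House of Cards", "year": 2013},
# ])
```

This mirrors the insertMany() call we ran in the Mongo Shell, and the returned ids are in the same order as the documents we passed in.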

We can provide an _id value instead of having it assigned automatically, which we'd provide in the dictionary:

new_show = {
    "_id": "1",
    "name": "FRIENDS",
    "year": 1994
}

And if we were to try and store a document with an existing _id, we'd be greeted with an error similar to the following:

DuplicateKeyError: E11000 duplicate key error index: SeriesDB.series.$_id_ dup key: { : "1" }

Retrieving Documents

To retrieve documents from the database we'll use find_document(), which queries our collection for single or multiple documents. Our function will receive a dictionary that contains the elements we want to filter by, and an optional argument to specify whether we want one document or multiple documents:

# Imports and previous code truncated for brevity

def find_document(collection, elements, multiple=False):
    """ Function to retrieve single or multiple documents from a provided
    Collection using a dictionary containing a document's elements.
    """
    if multiple:
        # find() returns a cursor; list() reads all matching documents from it
        return list(collection.find(elements))
    else:
        return collection.find_one(elements)

And now, let's use this function to find some documents:

result = find_document(series_collection, {'name': 'FRIENDS'})
print(result)

When executing our function, we did not provide the multiple parameter and the result is a single document:

{'_id': ObjectId('5e3031440597a8b07d2f4111'), 'name': 'FRIENDS', 'year': 1994}

When the multiple parameter is set to True, the result is a list of all the documents in our collection whose name field is set to FRIENDS.
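The filter dictionary is not limited to equality checks. As a small sketch, the helper below builds a range filter using MongoDB's $gte and $lte query operators; the name year_range_query is hypothetical, and the year field is the one used by the documents in this post.

```python
def year_range_query(start_year, end_year):
    """Build a filter matching documents whose year lies in [start_year, end_year]."""
    return {"year": {"$gte": start_year, "$lte": end_year}}


query = year_range_query(2011, 2013)
print(query)

# The filter can be passed straight to the function defined above:
# shows = find_document(series_collection, query, multiple=True)
```

Because filters are plain dictionaries, they can be built, combined, and tested without a database connection.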

Updating Documents

Our next function, update_document(), will be used to update a single document. It accepts any query, but in the example that follows we locate the document by its _id within the collection it belongs to:

# Imports and previous code truncated for brevity

def update_document(collection, query_elements, new_values):
    """ Function to update a single document in a collection.
    """
    collection.update_one(query_elements, {'$set': new_values})

Now, let's insert a document:

new_show = {
    "name": "FRIENDS",
    "year": 1995
}
id_ = insert_document(series_collection, new_show)

With that done, let's update the document, which we'll specify using the _id returned from adding it:

update_document(series_collection, {'_id': id_}, {'name': 'F.R.I.E.N.D.S'})

And finally, let's fetch it to verify that the new value has been put in place and print the result:

result = find_document(series_collection, {'_id': id_})
print(result)

When we execute our script, we can see that our document has been updated:

{'_id': ObjectId('5e30378e96729abc101e3997'), 'name': 'F.R.I.E.N.D.S', 'year': 1995}
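Instead of fetching the document again, we could also inspect the result object that update_one() returns. The variant below is a sketch with a name of our own choosing, update_document_checked; it reports how many documents matched the query and how many were actually changed.

```python
def update_document_checked(collection, query_elements, new_values):
    """Update a single document and return (matched_count, modified_count)."""
    result = collection.update_one(query_elements, {"$set": new_values})
    # matched_count is 0 when the query found nothing; modified_count is 0
    # when the matched document already held the new values
    return result.matched_count, result.modified_count


# Example usage (requires a running server):
# matched, modified = update_document_checked(
#     series_collection, {"_id": id_}, {"name": "F.R.I.E.N.D.S"}
# )
```

These are the same matchedCount and modifiedCount values we saw in the Mongo Shell output earlier.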

Deleting Documents

And finally, let's write a function for deleting documents:

# Imports and previous code truncated for brevity

def delete_document(collection, query):
    """ Function to delete a single document from a collection.
    """
    collection.delete_one(query)

Since we're using the delete_one method, only one document can be deleted per call, even if the query matches multiple documents.

Now, let's use the function to delete an entry:

delete_document(series_collection, {'_id': id_})

If we try retrieving that same document:

result = find_document(series_collection, {'_id': id_})
print(result)

We're greeted with the expected result:

None
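To remove every document that matches a query, as we did with deleteMany() in the Mongo Shell, we could add a wrapper around PyMongo's delete_many(). The function name delete_documents and the guard against an empty query are our own choices for this sketch, since an empty filter would match, and therefore delete, every document in the collection.

```python
def delete_documents(collection, query):
    """Delete all documents matching the query and return how many were removed."""
    if not query:
        # An empty filter matches everything; refuse it to avoid wiping the collection
        raise ValueError("refusing to delete with an empty query")
    return collection.delete_many(query).deleted_count


# Example usage (requires a running server):
# removed = delete_documents(series_collection, {"year": 2012})
```

Returning the count gives the caller a quick way to confirm that the deletion affected the expected number of documents.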

Next Steps

We have highlighted and used a few of PyMongo's methods to interact with our MongoDB server from a Python script. However, we have not utilized all the methods available to us through the module.

All the available methods can be found in the official PyMongo documentation and are classified according to the submodules.

We've written a simple script that performs rudimentary CRUD functionality on a MongoDB database. While we could import the functions in a more complex codebase, or into a Flask/Django application for example, these frameworks already have libraries that achieve the same results. These libraries make connecting to MongoDB easier and more convenient.

For example, with Django we can use libraries such as Django MongoDB Engine and Djongo, while Flask has Flask-PyMongo that helps bridge the gap between Flask and PyMongo to facilitate seamless connectivity to a MongoDB database.

Conclusion

MongoDB is a document store and falls under the category of non-relational databases (NoSQL). It has certain advantages compared to relational databases, as well as some disadvantages.

While it is not suitable for all situations, we can still use MongoDB to store data and manipulate the data from our Python applications using PyMongo among other libraries - allowing us to harness the power of MongoDB in situations where it is best suited.

It is therefore up to us to carefully examine our requirements before making the decision to use MongoDB to store data.

The script we have written in this post can be found on GitHub.