by Al Danial

Work with MongoDB in MATLAB Using the pymongo Python Module

Part 6 of the Python Is The Ultimate MATLAB Toolbox series.

Earlier posts showed how Python can help MATLAB interact with SQLite and PostgreSQL. I’ll conclude the database portion of this series with examples showing how to work with MongoDB in MATLAB using the Python module pymongo. As with the other databases, the best way to interact with MongoDB in MATLAB is to use the MathWorks' Database Toolbox. This post may only be useful to you if you can include Python in your MATLAB workflow and don’t have access to the Database Toolbox.

Install pymongo

You’ll need the pymongo Python module in your installation to call it from MATLAB. Check your Python installation to see if it has pymongo by starting the Python command line REPL and importing the module. If it imports successfully you can also check the version number:

> python
>>> import pymongo
>>> pymongo.__version__
'3.11.0'

If you get a ModuleNotFoundError, you’ll need to install it. That can be done on an Anaconda Python installation with

> conda install pymongo

or with

> python -m pip install pymongo

using other Python distributions.

Connecting to a MongoDB Database

As with the PostgreSQL post, probably the hardest step in this post involves making the initial connection to a MongoDB server. That will only be successful if you have

  • an account and password
  • permission to create and delete databases
  • network access to the MongoDB server (if it isn’t on your local computer) on or to a MongoDB server.

To keep things simple, here I’ll work with a MongoDB instance running on my localhost. Refer to instructions on starting a MongoDB database on Ubuntu, Red Hat, Windows, macOS if you’re unfamiliar with the process. The mongo command line tool shows that the database engine is running and that I’m able to connect to it:

> mongo
MongoDB shell version v3.6.8
connecting to: mongodb://127.0.0.1:27017
Implicit session: session { "id" : UUID("e629be9e-e016-4e37-872c-e3f4ea3eb739") }
MongoDB server version: 3.6.8
Server has startup warnings:
2022-07-08T09:40:01.686-0700 I STORAGE  [initandlisten]
2022-07-08T09:40:03.337-0700 I CONTROL  [initandlisten]
2022-07-08T09:40:03.337-0700 I CONTROL  [initandlisten] ** WARNING: Access control is not enabled for the database.
2022-07-08T09:40:03.337-0700 I CONTROL  [initandlisten] **          Read and write access to data and configuration is unrestricted.
2022-07-08T09:40:03.337-0700 I CONTROL  [initandlisten]

> show dbs;
admin            0.000GB
config           0.000GB
local            0.000GB

Connecting manually works so I’ll try it programmatically, first in Python, then in MATLAB:

Python: mongo_connect.py

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
#!/usr/bin/env python
# file: mongo_connect.py
from pymongo import MongoClient
client = MongoClient('localhost:27017')
for db in client.list_databases():
    # db is a dictionary with entries like
    #   {'name': 'admin', 'sizeOnDisk': 32768.0, 'empty': False}
    print(db)
    print(db['name'])
client.close()

Running this gives the same result as show dbs; in the mongo shell:

> ./mongo_connect.py
admin
config
local

Now in MATLAB:

MATLAB: mongo_connect.m

1
2
3
4
5
6
7
% file: mongo_connect.m
pymongo = py.importlib.import_module('pymongo');
client = pymongo.MongoClient('localhost:27017');
for db = py.list(client.list_databases())
    fprintf('%s\n', string(db{1}{'name'}))
end
client.close()

The MATLAB version is a bit more complex for a couple reasons:

  1. The for loop cannot iterate over client.list_databases() directly as Python can. Attempting to do that results in the error Array formation and parentheses-style indexing with objects of class 'py.pymongo.command_cursor.CommandCursor' is not allowed. Use objects of class 'py.pymongo.command_cursor.CommandCursor' only as scalars or use a cell array. Error in mongo_connect The work-around is to explicitly expand this function call to a Python list.
  2. The loop variable db is a single-item Python list containing the dictionary we expect rather than the dictionary itself. The {1} index before the {'name'} key retrieves the dictionary out of the list. Next, the database name returned by db{1}{'name'} is a a Python string, but if we attempt to fprintf() that directly, MATLAB will print the individual characters of the string. To get around that we have to explicitly convert the Python string to a MATLAB version so we end up with string(db{1}{'name'}).

MATLAB 2020b and higher then give the expected output:

>> mongo_connect
admin
config
local

While relatively simple, the issues we’re stumbling across using MongoDB in MATLAB with pymongo foreshadow more challenges ahead.

Aside: connecting to a remote MongoDB Database

The client connection to a local MongoDB instance done with

client = MongoClient('localhost:27017')

takes a notably different form when connecting to a remote database that requires authentication. In that case the connection string passed to the MongoClient() function is more involved:

Python:

db_user = 'al'
db_passwd = 'SuperSecret'
db_host = '413.198.231.158'
db_port = 27027
uri = "mongodb://%s:%s@%d/purchase_orders" % (
         db_user, db_passwd, db_host, db_port)
client = MongoClient(uri)

MATLAB 2020b and higher:

db_user = 'al';
db_passwd = 'SuperSecret';
db_host = '413.198.231.158';
db_port = 27027;
uri = sprintf("mongodb://%s:%s@%s:%d/purchase_orders", ...
        db_user, db_passwd, db_host, db_port);
client = pymongo.MongoClient(uri);

Insert

I’ll insert the same package delivery data used for the SQLite+MATLAB and PostgreSQL+MATLAB. Well, not exactly the same since NoSQL database engines such as MongoDB don’t have a concept of tables. Rather than separating the delivery data into orders and rates tables, all information will be inserted into documents (BSON records) of a collection.

The first rows of each SQL table in the SQLite example are

orders table

origin item weight destination
Z66 tube 1.73 N51

rates table

item cost by weight
tube 0.46

These can be represented by the following JSON record (JSON is a subset of BSON):

  { "origin" : "Z66",
    "item" : "tube",
    "weight" : 1.73,
    "dest" : "N51",
    "cost_by_weight" : 0.46}

Inserting a single record

Creating a collection and inserting a document takes just a few lines of code. This is an expanded form of the connection tester which inserts the record above:

Python: mongo_insert_one.py

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
#!/usr/bin/env python
# file: mongo_insert_one.py
from pymongo import MongoClient
client = MongoClient('localhost:27017')
db = client.deliveries
record =  { "origin" : "Z66", "item" : "tube",
            "weight" : 1.73, "dest" : "N51",
            "cost_by_weight" : 0.46}
db.deliveries.insert_one( record )
client.close()

Note that I didn’t pass a JSON string to the .insert_one() method. Instead, I gave it a Python dictionary. This is a convenience provided by the pymongo module which converts dictionaries to JSON—or BSON if needed—for us. Once again the MATLAB version requires more effort as it won’t create the deliveries collection for us automatically as in Python. Here we have to explicitly spell out a database and a collection to create:

MATLAB: mongo_insert_one.m

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
% file: mongo_insert_one.m
pymongo = py.importlib.import_module('pymongo');
client = pymongo.MongoClient('localhost:27017');
deliveries = client.get_database('deliveries').get_collection('deliveries');
record = py.dict(pyargs( ...
                 "origin", "Z66", "item", "tube", ...
                 "weight", 1.73, "dest", "N51",   ...
                 "cost_by_weight", 0.46));
deliveries.insert_one( record )
client.close()

Now that we’ve covered the basic mechanics of connecting to a MongoDB database and inserting a record we can move on to a more realistic program that reads many records from a text file then inserts them in bulk.

Inserting multiple records

Once again I’ll reuse the fake_data.m m-file from the SQLite article to create a text file of delivery data. Running it in MATLAB,

>> fake_data
>>
leaves me with the text file, orders.txt, which has 100,000 lines of delivery orders.

I’d also like to reuse the four line Python module load_orders.py to speed up data ingest in MATLAB as I did for the SQLite and PostgreSQL examples but the output from its readfile() function isn’t well-suited for use with MongoDB. We saw that pymongo’s insert_one() method takes a Python dictionary as an input argument. The corresponding bulk insert method, insert_many(), takes a list of dictionaries. The load_orders.readfile() function returns a list of lists which would have to be manipulated further to a list of dictionaries. Rather than doing that, I’ll implement a new file reader that directly returns the structure I want. It looks like this:

Python: load_orders_for_mongodb.py

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
# file: load_orders_for_mongodb.py
import pathlib
def readfile(File):
    cbw = {'tube'   : 0.46, 'package': 0.65,
           'letter' : 0.16, 'card'   : 0.05}
    P = pathlib.Path(File)
    all_orders = []
    rows = [_.split() for _ in P.read_text().split('\n') if _]
    for origin, item, weight, dest in rows:
        order = { 'origin' : origin, 'item' : item,
                  'weight' : float(weight), 'dest' : dest,
                  'cost_by_weight' : cbw[item] }
        all_orders.append( order )
    return all_orders

I’ll import this module and call its readfile() function in both Python and MATLAB bulk insert programs:

Python: mongo_insert_many.py

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
#!/usr/bin/env python
# file: mongo_insert_many.py
from pymongo import MongoClient
from load_orders_for_mongodb import readfile
import time
client = MongoClient('localhost:27017')
db = client.deliveries

tic = time.time()
all_records = readfile('orders.txt')
toc = time.time() - tic
print(f'read   {len(all_records)} took {toc:.3f} s')

# delete the deliveries collection if it already exists
if 'deliveries' in db.list_collection_names():
    db['deliveries'].drop()

tic = time.time()
db.deliveries.insert_many( all_records )
toc = time.time() - tic
print(f'insert {len(all_records)} took {toc:.3f} s')

client.close()

MATLAB: mongo_insert_many.m

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
% file: mongo_insert_many.m
pymongo = py.importlib.import_module('pymongo');
LO = py.importlib.import_module('load_orders_for_mongodb');
client = pymongo.MongoClient('localhost:27017');
db = client.get_database('deliveries');
deliveries = db.get_collection('deliveries');

tic;
all_records = LO.readfile('orders.txt');
fprintf('read   %d took %.3f s\n', length(all_records), toc)

% delete the deliveries collection if it already exists
if db.list_collection_names().count('deliveries') > 0
    db.drop_collection('deliveries');
end

tic;
deliveries.insert_many( all_records );
fprintf('insert %d took %.3f s\n', length(all_records), toc)

client.close()

Insert performance is fairly impressive. (Times in seconds; hardware is a 2015 Dell XPS 13 laptop running Ubuntu 20.04, Anaconda Python 2021.05, and MATLAB 2020b)

n_rows MATLAB + Python read MATLAB insert Python read Python insert
100,000 0.33 1.41 0.18 1.30
1,000,000 2.89 15.15 2.23 14.87

Query

The SQL statement

select origin,sum(weight*cost_by_weight) from orders O, rates R
  where O.item = R.item group by origin;
used in the SQLite and PostgreSQL examples takes a starkly different form in MongoDB. There the query can be expressed as
db.deliveries.aggregate([{$group :
    {_id : {F1:'$origin', F2:'$dest'},
     tot : {$sum : '$weight'} } } ])
which can be implemented like this in Python:

Python: mongo_query.py

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
#!/usr/bin/env python
# file: mongo_query.py
from pymongo import MongoClient
client = MongoClient('localhost:27017')
db = client.deliveries

n_records = db.deliveries.count_documents({})
print(f'number of records in deliveries = {n_records}')

pipeline = [{'$group' : {'_id' : {'F1':'$origin', 'F2':'$dest'},
                         'tot' : {'$sum' : '$weight'}}}]

for x in db.deliveries.aggregate(pipeline=pipeline):
    origin = x['_id']['F1']
    dest   = x['_id']['F2']
    sum_weight = x['tot']
    print(f'{origin} -> {dest}   {sum_weight:6.2f}')

client.close()

Lines 9-13 in the MATLAB code below show the rather clumsy notation needed to express the multiply-nested dictionaries of the aggregation pipeline:

MATLAB: mongo_query.m

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
% file: mongo_query.m
pymongo = py.importlib.import_module('pymongo');
client = pymongo.MongoClient('localhost:27017');
deliveries = client.get_database('deliveries').get_collection('deliveries');

n_records = deliveries.count_documents(py.dict());
fprintf('number of records in deliveries = %d\n', n_records)

query = py.dict(pyargs('$group', py.dict(pyargs(             ...
        '_id', py.dict(pyargs('F1','$origin','F2','$dest')), ...
        'tot', py.dict(pyargs('$sum','$weight')) )) ));
pipeline = py.list();
pipeline.append(query);

for x = py.list(deliveries.aggregate(pyargs('pipeline',pipeline)))
    origin = string(x{1}{'_id'}{'F1'});
    dest   = string(x{1}{'_id'}{'F2'});
    sum_weight = x{1}{'tot'};
    fprintf('%s -> %s   %6.2f\n', origin, dest, sum_weight)
end
client.close()

Update

In the SQLite and PostgreSQL examples I updated those respective databases by scaling the weight of packages by 2.2 with this SQL statement:

update orders set weight = weight*2.2 where item = 'package';
The equivalent in MongoDB is considerably more complex with my setup as the version I’m running, v3.6.8, does not support multiple record updates with operators (this is supported in v4.2). My options are to write a loop over the single record update function update_one(), or to completely replace the records which have item = 'package' with new entries in Python. Looping over a single record update function sounds slow so I’ll take the second approach.

Python: mongo_updage.py

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
#!/usr/bin/env python
# file: mongo_update.py
from pymongo import MongoClient

def print_records(cursor):
    for i,x in enumerate(cursor):
        origin = x['origin']
        item = x['item']
        weight = x['weight']
        dest = x['dest']
        print(f'{i+1:3d}. {origin} {item:8s} {weight:10.6f} {dest}')

def heaviest_packages(n, db):
    packages = { "item": { "$eq": "package" } }
    return db.find(packages).sort([('weight', -1)]).limit(n)

client = MongoClient('localhost:27017')
db = client.deliveries

print('Before scaling weight of packages')
print_records(heaviest_packages(3, db.deliveries))

packages = { "item": { "$eq": "package" } }
package_records = db.deliveries.find(packages)
updated_packages = []
for record in package_records:
    new_record = record
    new_record['weight'] *= 2.2
    del new_record['_id']
    updated_packages.append(new_record)

db.deliveries.delete_many(packages)
db.deliveries.insert_many(updated_packages)

print('After scaling weight of packages')
print_records(heaviest_packages(3, db.deliveries))

client.close()

MATLAB: mongo_updage.m

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
% file: mongo_update.m
pymongo = py.importlib.import_module('pymongo');
client = pymongo.MongoClient('localhost:27017');
deliveries = client.get_database('deliveries').get_collection('deliveries');

fprintf('Before scaling weight of packages\n')
print_records(heaviest_packages(10, deliveries))

packages = py.dict(pyargs("item",py.dict(pyargs("$eq","package"))));
%        = { "item" : { "$eq" : "packages } }
package_records = deliveries.find(packages);
updated_packages = py.list();
for record = py.list(package_records)
    new_record = record;
    new_record{1}.update(pyargs('weight', new_record{1}{'weight'} * 2.2));
    new_record{1}.pop('_id'); % == del new_record['_id']
    updated_packages.append(new_record{1});
end

deliveries.delete_many(packages);
deliveries.insert_many(updated_packages);

fprintf('After  scaling weight of packages\n')
print_records(heaviest_packages(3, deliveries))

client.close()

function print_records(cursor)
    i = 0;
    for x = py.list(cursor)
        i = i + 1;
        origin = string(x{1}{'origin'});
        item = string(x{1}{'item'});
        weight = x{1}{'weight'};
        dest = string(x{1}{'dest'});
        fprintf('%3d. %s %8s %10.6f %s\n', ...
            i, origin, item, weight, dest)
    end
end

function [cursor] = heaviest_packages(n, db)
    packages = py.dict(pyargs("item",py.dict(pyargs("$eq","package"))));
    by_weight = py.list({py.tuple({'weight', int64(-1)})});
    cursor = db.find(packages).sort(by_weight).limit(int64(n));
end

As a sanity check, both programs print records having the three heaviest packages before and after the update. The output from both is correct but also a bit disappointing:

Before scaling weight of packages
  1. C30 package    5.000000 Q38
  2. Y33 package    5.000000 E99
  3. B72 package    5.000000 L20
After scaling weight of packages
  1. C30 package   11.000000 Q38
  2. Y33 package   11.000000 E99
  3. B72 package   11.000000 L20

There’s no variation in the weight because the fake data generator only used three significant figures for weight. Consequently the maximum value of 5.00 appears many times in 100,000 records. A more convincing test is to work with a small database (for example, set n_rows to 50 in fake_data.m) then print all records before and after.

Join me again on August 6, 2022 to see how MATLAB can work with the Redis network-based key/value store using the redis-py Python module.