Work with MongoDB in MATLAB Using the pymongo Python Module
Part 6 of the Python Is The Ultimate MATLAB Toolbox series.
Earlier posts showed how Python can help MATLAB interact
with SQLite
and PostgreSQL.
I’ll conclude the database portion of this series with examples
showing how to work with
MongoDB
in MATLAB using
the Python module pymongo
.
As with the other databases, the best way to interact with MongoDB in MATLAB is
to use the MathWorks'
Database Toolbox.
This post may only be useful to you if you
can include Python in your MATLAB workflow
and don’t have access to the Database Toolbox.
Install pymongo
You’ll need the pymongo
Python module in your installation to
call it from MATLAB.
Check your Python installation to see if it has pymongo
by starting the Python command line REPL and importing
the module. If it imports successfully you can also check
the version number:
> python
>>> import pymongo
>>> pymongo.__version__
'3.11.0'
If you get a ModuleNotFoundError
, you’ll need to install it.
That can be done on an Anaconda Python installation with
> conda install pymongo
or with
> python -m pip install pymongo
using other Python distributions.
Connecting to a MongoDB Database
As with the PostgreSQL post, probably the hardest step in this post involves making the initial connection to a MongoDB server. That will only be successful if you have
- an account and password
- permission to create and delete databases
- network access to the MongoDB server (if it isn’t on your local computer) on or to a MongoDB server.
To keep things simple, here I’ll work with a MongoDB instance running
on my localhost.
Refer to instructions on starting a MongoDB database on
Ubuntu,
Red Hat,
Windows,
macOS
if you’re unfamiliar with the process.
The mongo
command line tool shows that the database engine is
running and that I’m able to connect to it:
> mongo
MongoDB shell version v3.6.8
connecting to: mongodb://127.0.0.1:27017
Implicit session: session { "id" : UUID("e629be9e-e016-4e37-872c-e3f4ea3eb739") }
MongoDB server version: 3.6.8
Server has startup warnings:
2022-07-08T09:40:01.686-0700 I STORAGE [initandlisten]
2022-07-08T09:40:03.337-0700 I CONTROL [initandlisten]
2022-07-08T09:40:03.337-0700 I CONTROL [initandlisten] ** WARNING: Access control is not enabled for the database.
2022-07-08T09:40:03.337-0700 I CONTROL [initandlisten] ** Read and write access to data and configuration is unrestricted.
2022-07-08T09:40:03.337-0700 I CONTROL [initandlisten]
> show dbs;
admin 0.000GB
config 0.000GB
local 0.000GB
Connecting manually works so I’ll try it programmatically, first in Python, then in MATLAB:
Python: mongo_connect.py
|
|
Running this gives the same result as show dbs;
in the mongo shell:
> ./mongo_connect.py
admin
config
local
Now in MATLAB:
MATLAB: mongo_connect.m
|
|
The MATLAB version is a bit more complex for a couple reasons:
- The
for
loop cannot iterate overclient.list_databases()
directly as Python can. Attempting to do that results in the errorArray formation and parentheses-style indexing with objects of class 'py.pymongo.command_cursor.CommandCursor' is not allowed. Use objects of class 'py.pymongo.command_cursor.CommandCursor' only as scalars or use a cell array. Error in mongo_connect
The work-around is to explicitly expand this function call to a Python list. - The loop variable
db
is a single-item Python list containing the dictionary we expect rather than the dictionary itself. The{1}
index before the{'name'}
key retrieves the dictionary out of the list. Next, the database name returned bydb{1}{'name'}
is a a Python string, but if we attempt tofprintf()
that directly, MATLAB will print the individual characters of the string. To get around that we have to explicitly convert the Python string to a MATLAB version so we end up withstring(db{1}{'name'})
.
MATLAB 2020b and higher then give the expected output:
>> mongo_connect
admin
config
local
While relatively simple, the issues we’re stumbling across using
MongoDB in MATLAB with pymongo
foreshadow more challenges ahead.
Aside: connecting to a remote MongoDB Database
The client connection to a local MongoDB instance done with
client = MongoClient('localhost:27017')
takes a notably different form when connecting to a remote
database that requires authentication. In that case
the connection string passed to the
MongoClient()
function is more involved:
Python:
db_user = 'al'
db_passwd = 'SuperSecret'
db_host = '413.198.231.158'
db_port = 27027
uri = "mongodb://%s:%s@%d/purchase_orders" % (
db_user, db_passwd, db_host, db_port)
client = MongoClient(uri)
MATLAB 2020b and higher:
db_user = 'al';
db_passwd = 'SuperSecret';
db_host = '413.198.231.158';
db_port = 27027;
uri = sprintf("mongodb://%s:%s@%s:%d/purchase_orders", ...
db_user, db_passwd, db_host, db_port);
client = pymongo.MongoClient(uri);
Insert
I’ll insert the same package delivery data used for the
SQLite+MATLAB
and
PostgreSQL+MATLAB.
Well, not exactly the same since NoSQL database engines such as
MongoDB don’t have a concept of tables. Rather than separating the
delivery data into orders
and rates
tables, all information
will be inserted into documents
(BSON records)
of a collection.
The first rows of each SQL table in the SQLite example are
orders
table
origin | item | weight | destination |
---|---|---|---|
Z66 | tube | 1.73 | N51 |
rates
table
item | cost by weight |
---|---|
tube | 0.46 |
These can be represented by the following JSON record (JSON is a subset of BSON):
{ "origin" : "Z66",
"item" : "tube",
"weight" : 1.73,
"dest" : "N51",
"cost_by_weight" : 0.46}
Inserting a single record
Creating a collection and inserting a document takes just a few lines of code. This is an expanded form of the connection tester which inserts the record above:
Python: mongo_insert_one.py
|
|
Note that I didn’t pass a JSON string to the .insert_one()
method. Instead, I gave it a Python dictionary. This is
a convenience provided by the pymongo
module which converts
dictionaries to JSON—or BSON if needed—for us.
Once again the MATLAB version requires more effort as it
won’t create the deliveries
collection for us automatically
as in Python.
Here we have to explicitly spell out a database and
a collection to create:
MATLAB: mongo_insert_one.m
|
|
Now that we’ve covered the basic mechanics of connecting to a MongoDB database and inserting a record we can move on to a more realistic program that reads many records from a text file then inserts them in bulk.
Inserting multiple records
Once again I’ll reuse the
fake_data.m
m-file from the SQLite article
to create a text file of delivery data.
Running it in MATLAB,
>> fake_data
>>
orders.txt
, which
has 100,000 lines of delivery orders.
I’d also like to reuse the four line Python module
load_orders.py
to speed up data ingest in MATLAB
as I did for the SQLite and PostgreSQL examples
but the output from its readfile()
function
isn’t well-suited for use with MongoDB.
We saw that pymongo
’s insert_one()
method takes a
Python dictionary as an input argument. The corresponding bulk insert
method, insert_many()
, takes a list of dictionaries.
The load_orders.readfile()
function returns a list of lists which
would have to be manipulated further to a list of dictionaries.
Rather than doing that, I’ll implement
a new file reader that directly returns the structure I want.
It looks like this:
Python: load_orders_for_mongodb.py
|
|
I’ll import this module and call its readfile()
function in both Python
and MATLAB bulk insert programs:
Python: mongo_insert_many.py
|
|
MATLAB: mongo_insert_many.m
|
|
Insert performance is fairly impressive. (Times in seconds; hardware is a 2015 Dell XPS 13 laptop running Ubuntu 20.04, Anaconda Python 2021.05, and MATLAB 2020b)
n_rows | MATLAB + Python read | MATLAB insert | Python read | Python insert |
---|---|---|---|---|
100,000 | 0.33 | 1.41 | 0.18 | 1.30 |
1,000,000 | 2.89 | 15.15 | 2.23 | 14.87 |
Query
The SQL statement
select origin,sum(weight*cost_by_weight) from orders O, rates R
where O.item = R.item group by origin;
db.deliveries.aggregate([{$group :
{_id : {F1:'$origin', F2:'$dest'},
tot : {$sum : '$weight'} } } ])
which can be implemented like this in Python:
Python: mongo_query.py
|
|
Lines 9-13 in the MATLAB code below show the rather clumsy notation needed to express the multiply-nested dictionaries of the aggregation pipeline:
MATLAB: mongo_query.m
|
|
Update
In the SQLite and PostgreSQL examples I updated those respective databases by scaling the weight of packages by 2.2 with this SQL statement:
update orders set weight = weight*2.2 where item = 'package';
update_one()
, or to completely replace the records
which have item = 'package'
with new entries in Python.
Looping over a single record update function sounds slow so
I’ll take the second approach.
Python: mongo_updage.py
|
|
MATLAB: mongo_updage.m
|
|
As a sanity check, both programs print records having the three heaviest packages before and after the update. The output from both is correct but also a bit disappointing:
Before scaling weight of packages
1. C30 package 5.000000 Q38
2. Y33 package 5.000000 E99
3. B72 package 5.000000 L20
After scaling weight of packages
1. C30 package 11.000000 Q38
2. Y33 package 11.000000 E99
3. B72 package 11.000000 L20
There’s no variation in the weight because
the fake data generator only used three significant
figures for weight. Consequently the maximum value of 5.00
appears
many times in
100,000 records. A more convincing test is to work with a small
database (for example, set n_rows
to 50 in fake_data.m
)
then print all records before and after.
Table of Contents | Previous: Interact with PostgreSQL | Next : Interact with Redis