by Al Danial

Work with PostgreSQL in MATLAB Using the psycopg2 Python Module

Part 5 of the Python Is The Ultimate MATLAB Toolbox series.

In Read and write SQLite databases in MATLAB with Python I showed how MATLAB can work with SQLite database files using the sqlite3 module from the Python standard library. Here I’ll demonstrate CRUD—create, read, update, delete—operations using the PostgreSQL database engine and the psycopg2 Python module. The best way to work with PostgreSQL in MATLAB is to use the MathWorks' Database Toolbox. This post may only interest you if you lack access to the Database Toolbox and you’re willing to include Python in your MATLAB workflow.

Install psycopg2

Unlike sqlite3, psycopg2 is not in the standard Python library. Check your Python installation to see if it has psycopg2 by starting the Python command line REPL and importing the module. If it imports successfully you can also check the version number:

> python
>>> import psycopg2
>>> psycopg2.__version__
'2.9.3 (dt dec pq3 ext lo64)'

If you get a ModuleNotFoundError, you’ll need to install it. That can be done on an Anaconda Python installation with

> conda install psycopg2

or with

> python -m pip install psycopg2

using other Python distributions.

Connecting to a PostgreSQL Database

The most challenging step in this post is setting up, configuring, and making an initial connection to a PostgreSQL database service—all of which are beyond the scope of this post. Ideally you are already familiar with the process, or have access to an existing database service, or know someone who can get you set up. If not, follow the PostgreSQL getting started documentation. The rest of this post assumes you have

  • an account and password
  • permission to create and delete databases
  • network access

on or to a PostgreSQL server. Briefly, a PostgreSQL server administrator on a Linux computer might issue commands such as

> sudo -u postgres psql
postgres=# create role Al with login password 'SuperSecret';
postgres=# create database deliveries;
postgres=# grant all privileges on database deliveries to Al;
postgres=# quit
to set me up.

Run this small Python program to check if you can connect to your PostgreSQL server:

#!/usr/bin/env python
# test_pg_connection.py
import psycopg2
conn = psycopg2.connect(host="big_db_server", port=5432, # default port
                        user="Al", password="SuperSecret",
                        database="deliveries")
conn.close()

Of course change the hostname, username, password, and/or port number as appropriate for your setup. All is well if the program runs without errors. If there are errors, they’ll likely be one of these:

  • psycopg2.OperationalError: could not connect to server: Connection refused
  • psycopg2.OperationalError: FATAL: password authentication failed for user Al
  • psycopg2.OperationalError: FATAL: database deliveries does not exist

The first means either the database service is down, or a network issue (such as a firewall) is preventing access. The second and third errors show the database service is reachable, but the account or database you want to use either does not exist or is not configured properly. You or your database administrator will need to reexamine the setup.

Create (and delete)

As hinted by the deliveries database in the SQL commands above, I’ll repeat the package delivery example from the MATLAB+SQLite article. To start, I’ll run the fake_data.m program in MATLAB to create a text file, orders.txt, containing 100,000 lines of data to insert into the database.

MATLAB:

>> fake_data
>>

A Python program that reads orders.txt, creates and inserts the values into the orders table in batches of 10,000 rows at a time looks like this:

File: make_pg_db.py

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
#!/usr/bin/env python
# make_pg_db.py
import time
import psycopg2
import pathlib
conn = psycopg2.connect(host="localhost", port=5432,
                        user="Al", password="SuperSecret",
                        database="deliveries")

cursor = conn.cursor()

# delete/create
cursor.execute("drop table if exists orders;")
cursor.execute("drop table if exists rates;")
cursor.execute("""create table rates(item varchar primary key,
                                     cost_by_weight float);""")
create_orders_table = """create table orders (
                              origin varchar,
                              item   varchar,
                              weight float,
                              dest   varchar);"""
cursor.execute(create_orders_table)

rows = [ ['tube'   , 0.46],
         ['package', 0.65],
         ['letter' , 0.16],
         ['card'   , 0.05] ]
insert = "insert into rates (item,cost_by_weight) values(%s,%s)"
cursor.executemany(insert, rows)

# load orders data from the text file
T_s = time.time()
P = pathlib.Path('orders.txt')
rows = [_.split() for _ in P.read_text().split('\n') if _]
print(f'Data read time = {time.time() - T_s:.3f} seconds')

insert = "insert into orders (origin,item,weight,dest) values(%s,%s,%s,%s)"
T_s = time.time()
batch_size = 10_000 # insert this many rows at a time
for i in range(0, len(rows), batch_size):
    rows_subset = rows[i:i+batch_size]
    cursor.executemany(insert, rows_subset)
    print(f'inserted {i+1} to {i+batch_size}')
conn.commit()
print(f'Data insert time = {time.time() - T_s:.3f} seconds')

conn.close()

I’m connecting to a PostgreSQL instance running on the same computer (2015 Dell XPS 13 laptop running Ubuntu 20.04) where I run make_pg_db.py. Insert performance is about 8,000 rows per second, 50x slower that SQLite inserts. I’m sure there are PostgreSQL configuration optimizations I could make to increase insert performance if I knew more about this database engine.

As with inserts to a SQLite database, MATLAB needs to store its insertion data row-major. We already saw that reading and reshuffling orders.txt from column-major to row-major in MATLAB is slow so our MATLAB implementation of the PostgreSQL insert program will use the same load_orders.py Python module used to speed up MATLAB in the SQLite example.

Two equivalent MATLAB insertion program appear below. The first relies on the Python function keyword argument enhancements in 2022a.

File: make_pg_db.m

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
% MATLAB 2022a and newer
% make_pg_db.m
psycopg2 = py.importlib.import_module('psycopg2');
conn = psycopg2.connect(host="localhost", port=int64(5432), ...
                        user="Al", password="SuperSecret",  ...
                        database="deliveries");
cursor = conn.cursor();

% delete/create
cursor.execute("drop table if exists rates");
cursor.execute("drop table if exists orders");
cursor.execute("create table rates(item varchar primary key, " + ...
                                   "cost_by_weight float);");
create_orders_table = "create table orders (" + ...
                           "origin varchar,"  + ...
                           "item   varchar,"  + ...
                           "weight float,"    + ...
                           "dest   varchar);";
cursor.execute(create_orders_table);

% populate the rates table
rows = { {'tube'   , 0.46}, ...
         {'package', 0.65}, ...
         {'letter' , 0.16}, ...
         {'card'   , 0.05} };
insert = "insert into rates (item,cost_by_weight) values(%s,%s)";
cursor.executemany(insert, rows);

% load orders data from the text file and insert to orders table
tic;
rows = py.load_orders.readfile('orders.txt');
fprintf('Data read time = %.6f seconds\n', toc)

insert = "insert into orders (origin,item,weight,dest) values(%s,%s,%s,%s)";                                                                                          
tic
batch_size = 10000; % insert this many rows at a time
n_rows = length(rows);
for i = 1: batch_size:n_rows
    i_end = min(i+batch_size-1, n_rows);
    rows_subset = rows(i:i_end);
    cursor.executemany(insert, rows_subset)
    fprintf('inserted %d to %d\n', i, i_end);
end
conn.commit();
fprintf('Data insert time = %.6f seconds\n', toc)
conn.close();

while the version below uses the older pyargs() function to wrap Python keyword arguments. It works with MATLAB 2020b and newer. The two versions of this program differ only at lines 4, 5 and 6.

File: make_pg_db_2020b.m

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
% MATLAB 2020b and newer
% make_pg_db_2020b.m
psycopg2 = py.importlib.import_module('psycopg2');
conn = psycopg2.connect(pyargs("host","localhost","port",int64(5432), ...
                               "user","Al","password","SuperSecret",  ...
                               "database","deliveries"));
cursor = conn.cursor();

% delete/create
cursor.execute("drop table if exists rates");
cursor.execute("drop table if exists orders");
cursor.execute("create table rates(item varchar primary key, " + ...
                                   "cost_by_weight float);");
create_orders_table = "create table orders (" + ...
                           "origin varchar,"  + ...
                           "item   varchar,"  + ...
                           "weight float,"    + ...
                           "dest   varchar);";
cursor.execute(create_orders_table);

% populate the rates table
rows = { {'tube'   , 0.46}, ...
         {'package', 0.65}, ...
         {'letter' , 0.16}, ...
         {'card'   , 0.05} };
insert = "insert into rates (item,cost_by_weight) values(%s,%s)";
cursor.executemany(insert, rows);

% load orders data from the text file and insert to orders table
tic;
rows = py.load_orders.readfile('orders.txt');
fprintf('Data read time = %.6f seconds\n', toc)

insert = "insert into orders (origin,item,weight,dest) values(%s,%s,%s,%s)";                                                                                          
tic
batch_size = 10000; % insert this many rows at a time
n_rows = length(rows);
for i = 1: batch_size:n_rows
    i_end = min(i+batch_size-1, n_rows);
    rows_subset = rows(i:i_end);
    cursor.executemany(insert, rows_subset)
    fprintf('inserted %d to %d\n', i, i_end);
end
conn.commit();
fprintf('Data insert time = %.6f seconds\n', toc)
conn.close();

MATLAB insert performance matches Python’s, about 8,000 rows per second—in other words, not great compared to SQLite inserts.

Read

This query computes the cummulative cost of shipments by originating office.

select origin,sum(weight*cost_by_weight) from orders O, rates R
  where O.item = R.item group by origin;

Programmatic implementations of the query in Python and MATLAB resemble those in the SQLite select example:

File: select_pg.py

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
#!/usr/bin/env python
# select_pg.py
import psycopg2
conn = psycopg2.connect(host="localhost", port=5432,
                        user="Al", password="SuperSecret",
                        database="deliveries")
cursor = conn.cursor()

query = "select origin,sum(weight*cost_by_weight) from orders O, rates R" \
        "    where O.item = R.item group by origin order by origin;"
cursor.execute(query)
for row in cursor.fetchall():
    print(f'{row[0]}  {row[1]:7.2f}')

conn.close()

and equivalently in MATLAB 2022a+ with

File: select_pg.m

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
% MATLAB 2022a and newer
% select_pg.m
psycopg2 = py.importlib.import_module('psycopg2');
conn = psycopg2.connect(host="localhost", port=int64(5432), ...
                        user="Al", password="SuperSecret",  ...
                        database="deliveries");
cursor = conn.cursor();

query = "select origin,sum(weight*cost_by_weight) from orders O, rates R" + ...
        "    where O.item = R.item group by origin order by origin;";
cursor.execute(query);
for row = cursor.fetchall()
    fprintf('%s  %7.2f\n', string(row{1}{1}), row{1}{2});
end

conn.close()

and in MATLAB 2020b+ with

File: select_pg_2020b.m

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
% MATLAB 2020b and newer
% select_pg.m
psycopg2 = py.importlib.import_module('psycopg2');
conn = psycopg2.connect(pyargs("host","localhost","port",int64(5432), ...
                               "user","Al","password","SuperSecret",  ...
                               "database","deliveries"));
cursor = conn.cursor();

query = "select origin,sum(weight*cost_by_weight) from orders O, rates R" + ...
        "    where O.item = R.item group by origin order by origin;";
cursor.execute(query);
for row = cursor.fetchall()
    fprintf('%s  %7.2f\n', string(row{1}{1}), row{1}{2});
end

conn.close()

Update

As in the SQLite update example, this SQL update scales the weight of packages by 2.2 while leaving the wieghts of cards, letters, and tubes alone:

update orders set weight = weight*2.2 where item = 'package';

The programs below repeat the select statement from the previous section to show total cost by origin. The SQL update is implemente with line 17 of the Python program and lines 18 in the two MATLAB versions. After updating the weight field the programs re-query for the new cummulative costs:

File: select_update_pg.py

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
#!/usr/bin/env python
# select_update_pg.py
import psycopg2
conn = psycopg2.connect(host="localhost", port=5432,
                        user="Al", password="SuperSecret",
                        database="deliveries")
cursor = conn.cursor()

query = "select origin,sum(weight*cost_by_weight) from orders O, rates R" \
        "    where O.item = R.item group by origin order by origin;"
print('Original cost:')
cursor.execute(query)
for row in cursor.fetchall():
    print(f'{row[0]}  {row[1]:7.2f}')

# update:  scale the weight of packages
cursor.execute("update orders set weight = weight*2.2 where item = 'package'")

print('After the weight correction:')
cursor.execute(query)
for row in cursor.fetchall():
    print(f'{row[0]}  {row[1]:7.2f}')

conn.close()

MATLAB 2022a+:

File: select_update_pg.m

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
% MATLAB 2022a and newer
% select_update_pg.m
psycopg2 = py.importlib.import_module('psycopg2');
conn = psycopg2.connect(pyargs("host","localhost","port",int64(5432), ...
                               "user","Al","password","SuperSecret",  ...
                               "database","deliveries"));
cursor = conn.cursor();

query = "select origin,sum(weight*cost_by_weight) from orders O, rates R" + ...
        "    where O.item = R.item group by origin order by origin;";
fprintf('Original cost:\n')
cursor.execute(query);
for row = cursor.fetchall()
    fprintf('%s  %7.2f\n', string(row{1}{1}), row{1}{2});
end

% update:  scale the weight of packages
cursor.execute("update orders set weight = weight*2.2 where item = 'package'");

fprintf('After the weight correction:\n')
cursor.execute(query);
for row = cursor.fetchall()
    fprintf('%s  %7.2f\n', string(row{1}{1}), row{1}{2});
end

conn.close()

MATLAB 2020b+:

File: select_update_pg_2020b.m

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
% MATLAB 2020b and newer
% select_update_pg_2020b.m
psycopg2 = py.importlib.import_module('psycopg2');
conn = psycopg2.connect(pyargs("host","localhost","port",int64(5432), ...
                               "user","Al","password","SuperSecret",  ...
                               "database","deliveries"));
cursor = conn.cursor();

query = "select origin,sum(weight*cost_by_weight) from orders O, rates R" + ...
        "    where O.item = R.item group by origin order by origin;";
fprintf('Original cost:\n')
cursor.execute(query);
for row = cursor.fetchall()
    fprintf('%s  %7.2f\n', string(row{1}{1}), row{1}{2});
end

% update:  scale the weight of packages
cursor.execute("update orders set weight = weight*2.2 where item = 'package'");

fprintf('After the weight correction:\n')
cursor.execute(query);
for row = cursor.fetchall()
    fprintf('%s  %7.2f\n', string(row{1}{1}), row{1}{2});
end

conn.close()

Join me again on July 23, 2022 to see how MATLAB can work with MongoDB databases using the pymongo Python module.