Part 11 of the Python Is The Ultimate MATLAB Toolbox series.
Upper panel: flow computed with MATLAB 2022b using code by Jamie Johns.
Lower panel: flow computed with MATLAB 2022b calling the same Navier-Stokes solver translated to Python with Numba. It repeats the simulation more than three times in the amount of time it takes the MATLAB-only solver to do this once.
MATLAB’s number crunching power is impressive. If you’re working with a compute-bound application though, you probably want it to run even faster. After you’ve exhausted the obvious speed enhancements like using more efficient algorithms, vectorizing all operations, and running on faster hardware, your last hope may be to to rewrite the slow parts in a compiled language linked to MATLAB through mex. This is no easy task though. Re-implementing a MATLAB function in C++, Fortran, or Java is tedious at best, and completley impractical at worst if the code to be rewritten calls complex MATLAB or Toolbox functions.
Python and Numba are an alternative to mex
In this article I show how Python and Numba can greatly improve MATLAB performance with less effort, and with greater flexibility, than using mex with C++, Fortran, or Java. You’ll need to include Python in your MATLAB workflow, but that’s no more difficult than installing compilers and configuring mex build environments.
Two examples demonstrate the performance boost Python and Numba can bring MATLAB: a simple Mandelbrot set computation, and a much more involved Navier-Stokes solver for incompressible fluid flow. In both cases, the Python augmentation makes the MATLAB application run several times faster.
- Python+Numba as an alternative to mex
- What is Numba?
- Hardware, OS details
- Example 1: Mandelbrot set
- Example 2: incompressible fluid flow
Python+Numba as an alternative to mex
The mex compiler front-end enables the creation of functions written in C, C++, Fortran, Java, and .Net that can be called directly from MATLAB. To use it effectively, you’ll need to be proficient in one of these languages and have a supported compiler. With those in place, the fastest way to begin writing a mex-compiled function is to start with existing working code such as any of the MathWorks' mex examples and modifying it to your needs.
A big challenge to translating MATLAB code is that one line of MATLAB can translate to dozens of lines in a compiled language. Standard MATLAB statements such as
y = max((A\b)*v',,'all');
become a logistical headache in the compiled languages.
Do you have access to an optimized linear algebra library?
Do you know how to set up the calling arguments?
What if the inputs are complex?
A is rectangular?
It is not straightforward.
Python, on the otherhand, can often match MATLAB expressions one-to-one (although the Python expressions are usually longer). The Python version of the line above is
y = np.max((np.linalg.solve(A,b).dot(v.T))
np. is the prefix for the NumPy module.
If you’re familiar with Python and its ecosystem of numeric and scientific modules, you’ll be able to translate a MATLAB function to Python much faster than you can translate MATLAB to any compiled language.
For the most part, numerically intensive codes run at comparable speeds in MATLAB and Python. Simply translating MATLAB to Python will rarely give you a worthwhile performance boost. To see real acceleration you’ll need to additionally use a Python code compiler such as Cython, Pythran, f2py, or Numba.
Numba offers the most bang for the buck—the greatest performance for the least amount of effort—and is the focus of this post.
What is Numba?
From the Numba project’s Overview:
Numba is a compiler for Python array and numerical functions that gives you the power to speed up your applications with high performance functions written directly in Python.
Numba generates optimized machine code from pure Python code using the LLVM compiler infrastructure. With a few simple annotations, array-oriented and math-heavy Python code can be just-in-time optimized to performance similar as C, C++ and Fortran, without having to switch languages or Python interpreters.
Type casting Python function arguments
Three steps are needed to turn a conventional Python function into a much faster Numba-compiled function:
- Import the
jitfunction and all data types you plan to use from the Numba module. Also import the
prangefunction if you plan to run for loops in parallel.
- Precede your function definition with
- Pass to
@jit()type declarations for of each input argument and return value
The steps are illustrated below
with a simple function that accepts a 2D array
of 32 bit floating point numbers and a 64 bit floating point
returns an unsigned 64 bit integer containing the count of
terms in the array that are less than the tolerance.
We’ll ignore the fact that this is a simple operation
in both MATLAB (
sum(A < tol)) and Python (
np.sum(A < tol)).)
First the plain Python version:
def n_below(A, tol): count = 0 nR, nC = A.shape for r in range(nR): for c in range(nC): if A[r,c] < tol: count += 1 return count
Now with Numba:
Only the two highlighted lines were added; the body of the function did not change. Let’s see what this buys us:
In : A = np.random.rand(2000,3000).astype(np.float32) In : tol = 0.5 In : %timeit n_below(A, tol) # conventional Python 11.3 s ± 72.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) In : %timeit n_below_numba(A, tol) # Python+Numba 9.92 ms ± 251 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Time dropped from 11.3 seconds to 9.92 milliseconds, a factor of more than 1,000. While this probably says more about how slow Python 3.8 is for this problem1, the Numba version is impressive. It is in fact even a bit faster than the native NumPy version:
In : %timeit np.sum(A < tol) 10.2 ms ± 193 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
For completeness, MATLAB 2022b does this about as quickly as NumPy:
>> A = single(rand(2000,3000)); >> tol = 0.5; >> tic; sum(A < tol); toc
Repeating the last line a few times gives a peak (that is, minimum) time of 0.011850 seconds (11.85 ms).
@jit() decorator has a number of options that affect how
Python code is compiled. All four of the options
below appear on every
@jit() instance in the Navier-Stokes
code. Here’s an example:
@jit(float64[:,:](float64[:,:], float64), nopython=True, fastmath=True, parallel=True, cache=True) def DX2(a,dn): # finite difference for Laplace of operator in two dimensions # (second deriv x + second deriv ystaggered grid) return (a[:-2,1:-1] + a[2:,1:-1] + a[1:-1,2:] + a[1:-1,:-2] - 4*a[1:-1,1:-1])/(dn**2)
Numba compiled functions can work either in
object mode, where
the code can interact with the Python interpreter, or in
mode where the Python interpeter is not used.
excludes the Python interpreter and results in higher performance.
This option matches the
-Ofast optimization switch
used by compilers in the
Gnu Compiler Collection. It allows the compiler to
resequence floating point computations in a non IEEE 754
compliant manner—enabling significant speed increases
at the risk of generating incorrect (!) results.
To use this option with confidence, compare solutions produced with and without it. Obviously if the results differ appreciably you shouldn’t use this option.
This option tells Numba to attempt automatic parallelization of
code blocks in jit compiled functions. Parallelization may
be possible even
if these functions do not explicitly call the
for loop run in parallel.
Let’s try this out on the simple
from numba import jit, prange, uint64, float32, float64 @jit(uint64(float32[:,:], float64), nopython=True, parallel=True) def n_below_parallel(A, tol): count = 0 nR, nC = A.shape for r in prange(nR): # <- parallel for loop for c in range(nC): if A[r,c] < tol: count += 1 return count
By parallelizing the rows of
A over cores I get twice the
performance on my 4 core laptop:
In : %timeit n_below_parallel(A, tol) 5.96 ms ± 299 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Even more impressively,
the Navier-Stokes solver runs nearly 2x more quickly with
parallel=True on my 4 core laptop even though it has no
A program that includes Numba-enhanced Python functions start more slowly because the jit compiler needs to compile the functions before the program can begin running. The delay is barely noticable with small functions.
That’s not the case for the more complicated jit-compiled
functions in the Navier-Stokes solver. These impose a 10 second
start-up delay on my laptop. The
cache=True option tells Numba
to cache the result of its compilations for immediate on
subsequent runs. The code is only recompiled if any of the
decorated functions change.
Hardware, OS details
The examples were run on a 2015 Dell XPS13 laptop with 4 cores of i5-5200U CPU @ 2.2 GHz and 8 GB memory. The OS is Ubuntu 20.04. MATLAB 2022b was used but the code will run with any MATLAB version from 2020b onward.
Anaconda Python 2020.07 with Python 3.8.8 was used to permit running on older versions of MATLAB, specifically 2020b. (Yes, I’m aware Python 3.11 runs more quickly than 3.8. MATLAB does not yet support that version though.)
Example 1: Mandelbrot set
This example comes from the High Performance Computing chapter of my book. There I implement eight versions of code that compute terms of the Mandelbrot set: MATLAB, Python, Python with multiprocessing, Python with Cython, Python with Pythran, Python with f2py, Python with Numba, and finally MATLAB calling the Python/Numba functions.
The baseline MATLAB version is much faster than the baseline Python implementation. However, employing any of the compiler-augmented tools like Cython, Pythran, f2py, or Numba turns the tables and makes the Python solutions much faster than MATLAB.
baseline MATLAB2 :
The Python with Numba version is the same as the baseline Python version with five additional Numba-specific lines. An explanation of each highlighted line appears below the code:
Python with Numba:
Line 5 imports the necessary items from the
jit() is a decorator that precedes functions
we want Numba to compile.
prange() is a parallel version of
range() function. It turns regular
for loops into
for loops where work is automatically distributed
over the cores of your computer.
The remaining items from
complex128 define data
types that will be employed in function signatures.
Line 6 defines the signature of the
as taking a complex scalar and 64 bit integer as inputs
and returning an unsigned 8 bit integer.
(The remaining keyword arguments like
nopython= were explained above.)
Lines 15 and 16 define the signature of the
as taking a pair of 1D double precision arrays and a scalar 64 integer
and returning a 2D array of unsigned 8 bit integers.
Line 21 implements a parallel
for loop; iterations for
different values of
i are distributed over the cores.
What have we gained with the five extra Numba lines? This table shows how our three codes perform for different image sizes; times are in seconds.
|N||MATLAB 2022b||Python 3.8.8||Python 3.8.8 + Numba|
The Python+Numba performance borders on the unbelievable—but don’t take my word for it, run the three implementations yourself!
One “cheat” the Python+Numba solution enjoys is that its parallel
for loop takes advantage of the four cores on my laptop.
parfor parallel for loop construct but, inexplicably,
it runs slower than a conventional sequential
for loop. Evidently
multiple cores are only used if you have a license for
the Parallel Computing Toolbox (which I don’t).
MATLAB + Python + Numba
If Python can run faster with Numba, MATLAB can too. All we need
to do is import the
MB_numba module and call its
MB() function from MATLAB:
is a Python function so it returns a Python result.
To make the benchmark against the baseline
MATLAB version fair, the program includes conversion of the
img array to a MATLAB matrix (using
in the elapsed time.
The table below repeats the MATLAB baseline times from the
previous table. Numeric values in the middle and right column
are elapsed time in seconds.
|N||MATLAB 2022b||MATLAB 2022b + Python 3.8.8 + Numba|
Visualizing the result
functions on the
if you want to see what the matrix looks like.
Similarly, in Python:
Example 2: incompressible fluid flow
The Mandelbrot example is useful for demonstrating how to use Numba and for giving insight to the performance boosts it can give. But useful as a computational pattern resembling real work? Not so much.
The second example represents computations as might appear in a realistic scientific application. It demonstrates MATLAB+Python+Numba with a two dimensional Navier-Stokes solver for incompressible fluid flow. I chose an implementation by Jamie Johns which shows MATLAB at its peak performance using fully vectorized code. While this code only computes flow in 2D, the sequence of calculations and memory access patterns are representative of scientific and engineering applications. Jamie Johns' YouTube channel has interesting videos produced with his code.
A clever aspect of Jamie Johns' code is that boundary conditions are defined graphically. You create an image, for example a PNG, where colors define properties at each grid point: black = boundary, red= outflow (source), and green = inflow (sink). The image borders, if they aren’t already black, also define boundaries.
This PNG image, which can be found in the original code’s
represents flow moving from left to right around a circle,
a canonical fluid dynamics problem I studied as an undergrad:
Of course, if boundary conditions are supplied as images, they can be arbitrarily complex. I recalled an amusing commercial by Maxell, “Blown Away Guy”, from my youth. With a bit of photo editing I converted
to the following boundary conditions image:
The domain dimensions are 15 meters x 3.22 meters and the CFD mesh has 400 x 86 grid points, giving a relatively coarse resolution of 3.7 cm between points. My memory-limited (8 GB) laptop can’t handle a finer mesh.
- time step: 0.003 seconds
- number of iterations: 12,900
- air density = 1 kg/m$^3$
- dynamic viscosity = 0.001 kg/(m s)
- air speed leaving the speaker: 0.45 m/s
The low air speed is admittedly contrived and certainly lower than the commercial’s video suggests. (my guess is for that is at least 2 m/s). I found the solver fails to converge, then exits with an error, if the flow speed is too high. Same thing for number of iterations—too many and the solver diverges. Trial and error brought me to 0.45 m/s and 12,900 iterations as values that produce the longest animation.
I don’t understand the reason of the failure (mesh too coarse?) and haven’t been motivated to figure out what’s going on. This is, after all, an exercise in performance tweaking rather than studying turbulence or computing coefficients of drag.
Running the code
My modifications to Jamie Johns' MATLAB code and the translation
to Python + Numba can be found at
The code there can be run three ways:
entirely in MATLAB, entirely in Python, or with
MATLAB calling Python.
One thing to note is that the computational codes are purely batch
programs that write
.npy data files.
No graphics appear.
The flow can be visualized with
a separate post-processing
step after the flow speeds are saved as files.
MATLAB (2020b through 2022b)
Start MATLAB and enter the name of the main program,
Main, at the prompt. I had to start MATLAB with
to suppress the GUI (and close all other applications, especially
web browsers) to give the program enough memory.
Python + Numba
On Linux and macOS, just type
./py_Main.py in a console.
On Windows, open a Python-configured terminal (such as the
Anaconda console) and enter
MATLAB + Python + Numba
Main.m and change the value of
Save it, then run
Main and the MATLAB prompt.
Regardless of how you run the code (MATLAB / Python / MATLAB + Python),
the steps of operation are the same: the application creates a subdirectory
NS_velocity then populates it with
.npy files (Python
and MATLAB+Python) or
.mat files (pure MATLAB).
These files contain the x and y components of the flow speed at
each grid point for a batch of iterations.
The first time the Python or MATLAB+Python program starts,
expect to see delay of about 30 seconds while Numba compiles the
@jit decorated functions in
This is a one-time cost though; subsequent runs will start
much more quickly.
MATLAB runs more than three times faster using the Python+Numba
Navier-Stokes solver. More surprisingly, Python without Numba
also runs faster than MATLAB. I updated the performance table
on 2022-10-30 following a LinkedIn post by
asking how the recently released 3.11 version fares on this problem.
Numba is not yet available for 3.11 so I reran without
decorators—and threw in a 3.8.8 run without Numba as well:
|MATLAB 2022b + Python 3.8.8 + Numba||128.89|
|Python 3.8.8 + Numba||126.99|
Two surprises here: plain Python + NumPy is faster than MATLAB, and Python 3.11.0 is no faster than 3.8.8 for this compute-bound problem.
animate_navier_stokes.py generates an animation
of flow computed earlier with
Main.m. The most direct
way to use it is by giving it the directory (
default) containing flow velocity
files and whether you want to see
*.mat files generated by MATLAB (use
*.npy files generated by either Python or the MATLAB+Python combination
(Note: the program loads many Python modules and takes a
long time to start the first time.)
./animate_navier_stokes.py --python NS_velocity
With this command the program first loads all
directory, then writes a large
merged file (
merged_frames.npy, 1.7 GB)
containing the entire solution in one file. You can later
load the merged file much more quickly than reading
The first thing you’ll notice is the animation proceeds slowly. To speed things up, skip a few frames. The next command reads from the merged frame file and only shows every 50th frame:
./animate_navier_stokes.py --npy merged_frames.npy --incr 50
--png-dir switch to write individual frames to PNG files
in the given directory. For example
./animate_navier_stokes.py --npy merged_frames.npy --png-dir PNG --incr 400
will create the directory
PNG then populate it with 33 files with
PNG/frame_08000.png. I created the video at the
top of this post by saving individual frames, overlaying the Maxell
image on each, then generating an MP4 file with ffmpeg.
Table of Contents | Previous: Symbolic math | Next : Distribute MATLAB Computations To Python Running on a Cluster