mædoc's notes

NumPy performance and fragmentation

NumPy is the standard package for working with arrays in Python. The algorithms I work on, originally written in MATLAB, were ported to NumPy long ago and provide the reference implementation for our simulator engine. However, these routines pose two problems: performance and fragmentation.

These factors have translated into a myriad of minor simulator reimplementations (including several of my own).

why are our simulations slow?

The simulations run for many time steps, and the memory access pattern within each step is irregular, so much of the time goes to waiting on memory rather than arithmetic.

example data, kernels and timing
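
A minimal sketch of the kind of kernel and timing meant here, assuming a gather through an index array is representative of the irregular access; the shapes, names and data below are made up for illustration.

import numpy as np
from timeit import timeit

rng = np.random.default_rng(42)
n_nodes, horizon = 96, 256                           # made-up sizes
weights = rng.random((n_nodes, n_nodes))             # dense coupling weights
lags = rng.integers(0, horizon, (n_nodes, n_nodes))  # per-pair lag into the buffer
history = rng.random((horizon, n_nodes))             # ring buffer of past states

def kernel():
    # irregular gather: each (i, j) pair reads a different row of history
    gathered = history[lags, np.arange(n_nodes)]
    return (weights * gathered).sum(axis=1)

t = timeit(kernel, number=1000)
print(f"{t / 1000 * 1e6:.1f} us per call")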

let's build NumPy from scratch

# from inside a clone of the NumPy source tree, with submodules checked out
pip install -r requirements/build_requirements.txt
pip install . -Csetup-args=-Dcpu-baseline="avx2 fma3" -Csetup-args=-Dcpu-dispatch="max"
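
To check what the resulting build actually targets, np.show_config() reports the SIMD extensions compiled in (baseline and dispatched), at least on recent NumPy versions.

import numpy as np

# prints build information, including the SIMD baseline and dispatched extensions
np.show_config()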

let's optimize the memory access instead

Intuitively, retrieving values which are nearby in memory requires less work, because they are likely already in cache or are fetched together in the same cache line.
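
A quick way to see this with NumPy fancy indexing (a toy benchmark, actual numbers will vary by machine): gathering through sorted indices touches memory nearly sequentially, while a random permutation of the same indices does not.

import numpy as np
from timeit import timeit

n = 10_000_000
x = np.random.rand(n)                    # ~80 MB, well beyond cache
idx_random = np.random.permutation(n)    # scattered reads all over x
idx_sorted = np.sort(idx_random)         # same indices, near-sequential reads

t_random = timeit(lambda: x[idx_random], number=20)
t_sorted = timeit(lambda: x[idx_sorted], number=20)
print(f"random gather {t_random:.2f} s, sorted gather {t_sorted:.2f} s")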

how well does a very highly optimized implementation do?

ISPC fused kernel, random numbers etc.
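
For reference, one way such a compiled kernel can be wired back into the NumPy code is via ctypes; the library name fused.so and the fused_step signature below are hypothetical stand-ins, not the actual ISPC kernel.

import ctypes
import numpy as np
from numpy.ctypeslib import ndpointer

lib = ctypes.CDLL("./fused.so")   # hypothetical library built from the ISPC kernel
lib.fused_step.restype = None     # hypothetical signature: fused_step(n, state, weights, out)
lib.fused_step.argtypes = [
    ctypes.c_int,
    ndpointer(np.float32, flags="C_CONTIGUOUS"),
    ndpointer(np.float32, flags="C_CONTIGUOUS"),
    ndpointer(np.float32, flags="C_CONTIGUOUS"),
]

n = 1024
state = np.zeros(n, dtype=np.float32)
weights = np.random.rand(n).astype(np.float32)
out = np.empty_like(state)
lib.fused_step(n, state, weights, out)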

fragmentation

what is jax about anyway?

JAX tries to bring TensorFlow-style thinking to users of NumPy: array code written against a NumPy-like API is traced and compiled (via XLA) rather than executed eagerly one operation at a time.
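
A minimal sketch of what that looks like (the update rule is a toy, not one of our kernels): the same function is written against jax.numpy and then compiled with jax.jit.

import jax
import jax.numpy as jnp

def step(x, weights, dt=0.1):
    # toy update written with the NumPy-style API
    return x + dt * (jnp.tanh(weights @ x) - x)

step_jit = jax.jit(step)       # traced once per input shape, compiled by XLA

x = jnp.zeros(96)
weights = jnp.ones((96, 96)) / 96
x = step_jit(x, weights)       # later calls reuse the compiled kernel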

problems

#algo