
Probably a commonplace question, but how can I parallelize this loop in Python?

N = Nx.shape[2]
Ku = np.zeros((N, N))  # output array, initialized before the loop
for i in range(N):
  for j in range(N):
    NI = Nx[:,:,i]; NJ = Nx[:,:,j]
    Ku[i,j] = (NI[~mask]*NJ[~mask]).sum()

So my question: what's the easiest way to parallelize this code?

---------- EDIT ----------

An example of data

import numpy as np

#my input
Nx = np.random.rand(5,5,5)  

#mask creation
mask_positions = list(zip(*np.where(Nx[:,:,0] < 0.4)))  # list() needed on Python 3, where zip is lazy
mask_array_positions = np.asarray(mask_positions)
i, j = mask_array_positions.T
mask = np.zeros(Nx[:,:,0].shape, bool)
mask[i,j] = True

And I want to calculate Ku in parallel. My aim is to use the Ku array to solve a linear problem, so I have to set the masked values apart (they represent nearly half of my array).

  • Please provide some sample data and expected output. Commented Feb 27, 2015 at 19:15
  • multiprocessing.Pool is a good starting place. Commented Feb 27, 2015 at 21:20
  • By the way, outside of implementing @hpaulj's excellent answer, you might want to consider that your mask, which you fabricate in 5 lines, is equivalent to just writing mask = Nx[:,:,0] < 0.4. Commented Feb 27, 2015 at 21:38
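As a quick sanity check of that last comment (a minimal sketch; the seed and sizes here are arbitrary), the one-line mask matches the multi-step construction from the question:

```python
import numpy as np

np.random.seed(0)
Nx = np.random.rand(5, 5, 5)

# Multi-step construction from the question
mask_positions = list(zip(*np.where(Nx[:, :, 0] < 0.4)))
i, j = np.asarray(mask_positions).T
mask_long = np.zeros(Nx[:, :, 0].shape, bool)
mask_long[i, j] = True

# One-line equivalent suggested in the comment
mask_short = Nx[:, :, 0] < 0.4

print(np.array_equal(mask_long, mask_short))  # True
```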

2 Answers


I think you want to 'vectorize', to use numpy terminology, not parallelize with multiprocessing.

Your calculation is essentially a dot (matrix) product. Apply the mask once to the whole array to get a 2d array, NIJ. Its shape will be (N,5), where N is the number of True values in ~mask. Then it's just a (5,N) array 'dotted' with an (N,5) one - i.e. a sum over the N dimension, leaving you with a (5,5) array.

NIJ = Nx[~mask, :]
Ku = np.dot(NIJ.T, NIJ)

In quick tests it matches the Ku produced by your double loop. Depending on the underlying library used for np.dot there might be some multicore calculation, but that's usually not a priority issue for numpy users.
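For example, a small self-contained check (array sizes chosen arbitrarily) that the vectorized product matches the question's double loop:

```python
import numpy as np

np.random.seed(1)
Nx = np.random.rand(5, 5, 5)
mask = Nx[:, :, 0] < 0.4

# Double loop from the question (mask!=True is the same as ~mask)
n = Nx.shape[2]
Ku_loop = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        NI = Nx[:, :, i]
        NJ = Nx[:, :, j]
        Ku_loop[i, j] = (NI[~mask] * NJ[~mask]).sum()

# Vectorized version: one mask application, then a matrix product
NIJ = Nx[~mask, :]          # shape (N, 5)
Ku_vec = np.dot(NIJ.T, NIJ)  # shape (5, 5)

print(np.allclose(Ku_loop, Ku_vec))  # True
```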


Applying the large boolean mask is the most time consuming part of these calculations - both the vectorized and iterative versions.

For a mask with 400,000 True values, compare these 2 indexing times:

In [195]: timeit (NI[:400,:1000],NJ[:400,:1000])
100000 loops, best of 3: 4.87 us per loop
In [196]: timeit (NI[mask],NJ[mask])
10 loops, best of 3: 98.8 ms per loop

Selecting the same number of items with basic (slice) indexing is several orders of magnitude faster than advanced indexing with the mask.

Substituting np.dot(NI[mask],NJ[mask]) for (NI[mask]*NJ[mask]).sum() only saves a few ms.
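A rough way to reproduce this comparison on your own machine (shapes are illustrative, roughly matching the ~400,000-element selections above; absolute timings will vary):

```python
import numpy as np
import timeit

NI = np.random.rand(1400, 1528)
mask = np.random.rand(1400, 1528) < 0.19  # roughly 400,000 True values

# Basic (slice) indexing returns a view - nearly free
t_slice = timeit.timeit(lambda: NI[:400, :1000], number=100)
# Boolean (advanced) indexing copies the selected elements
t_bool = timeit.timeit(lambda: NI[mask], number=100)

print(f"slice: {t_slice:.5f}s  boolean: {t_bool:.5f}s")
```

The gap comes from slicing producing a view while boolean indexing must gather and copy every selected element.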


13 Comments

I have to do this on large arrays a great number of times, and then I solve a system with these arrays using np.linalg.solve. I would like to accelerate it...
@user3601754 did you try this code? It's fast. Perform some timing comparisons if you need convincing. I know the history of your questions and you should definitely consider this solution before attempting to do multiprocessing.
13.237869 seconds for my case, and I've been waiting for the results of this code for 2 minutes
What size of arrays are you working with? And which takes more time, applying the mask or doing the dot? With large enough ones there might be a trade-off between memory use and vectorization.
My Nx array has a shape of (1400, 1528, 20) and I have to repeat the calculation many times! It seems to be the mask step

I'd like to extend @hpaulj's excellent answer (great analysis of the problem too by the way) for large matrices.

The operation

Ku = np.dot(NIJ.T,NIJ)

can be replaced by

Ku = np.einsum('ij,ik->jk', NIJ, NIJ)

It should also be noted that np.dot could fall back to slower routines if numpy was not compiled to use BLAS.

For a test matrix NIJ of shape (1250711, 50), I got 54.9 s with the dot method, while the einsum does it in 1.67 s. On my system, numpy is compiled with BLAS support.
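To convince yourself the two routes agree numerically (a smaller stand-in shape is used here so it runs quickly; the answer's test matrix was (1250711, 50)):

```python
import numpy as np

NIJ = np.random.rand(10000, 50)  # stand-in for the large test matrix

Ku_dot = np.dot(NIJ.T, NIJ)
Ku_einsum = np.einsum('ij,ik->jk', NIJ, NIJ)

print(np.allclose(Ku_dot, Ku_einsum))  # True
```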

Remark: np.einsum does not always outperform np.dot; on my system this became apparent when comparing any of the following

Nx = np.random.rand(1400,1528,20).astype(np.float16)
Nx = np.random.rand(1400,1528,20).astype(np.float32)

(or even a dtype of np.float64).

4 Comments

Thanks for your help! If I use your code, I get 263.650027 sec :) For the moment, my code is better (in my case)!
@user3601754, you should consider hpaulj's comment: without more information about the size of the arrays you're working on (and possibly the size of the RAM), it's hard to improve on this algorithm any further. Just FYI, using hpaulj's algorithm (without the einsum), on a matrix of Nx = np.random.rand(500,5000,50), I saw a threefold speed improvement compared to your method. But such speed gains depend a lot on the dimensions of the arrays involved (is the last axis the biggest one?)
Ok! My shape is (1400, 1528, 20)
and my type is float64
