
Probably a commonplace question, but how can I parallelize this loop in Python?

N = Nx.shape[2]
Ku = np.zeros((N, N))  # output array, initialized before the loop
for i in range(N):
  for j in range(N):
    NI = Nx[:,:,i]; NJ = Nx[:,:,j]
    Ku[i,j] = (NI[~mask]*NJ[~mask]).sum()

So my question: what's the easiest way to parallelize this code?

---------- EDIT ----------

An example of data

import numpy as np

#my input
Nx = np.random.rand(5,5,5)  

#mask creation
mask_positions = list(zip(*np.where(Nx[:,:,0] < 0.4)))  # list() needed on Python 3, where zip is lazy
mask_array_positions = np.asarray(mask_positions)
i, j = mask_array_positions.T
mask = np.zeros(Nx[:,:,0].shape, bool)
mask[i,j] = True

And I want to calculate Ku in parallel. My aim is to use the Ku array to solve a linear problem, so I have to set the masked values apart (they represent nearly half of my array).

  • Please provide some sample data and expected output. Commented Feb 27, 2015 at 19:15
  • multiprocessing.Pool is a good starting place. Commented Feb 27, 2015 at 21:20
  • By the way, outside of implementing @hpaulj's excellent answer, you might want to consider that your mask, which you fabricate in 5 lines, is equivalent to just writing mask = Nx[:,:,0] < 0.4. Commented Feb 27, 2015 at 21:38
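As a quick sanity check of that last comment (a minimal sketch; the seed and sizes here are arbitrary), the one-line mask matches the multi-step construction from the question:

```python
import numpy as np

np.random.seed(0)
Nx = np.random.rand(5, 5, 5)

# Multi-step construction from the question
mask_positions = list(zip(*np.where(Nx[:, :, 0] < 0.4)))
i, j = np.asarray(mask_positions).T
mask_long = np.zeros(Nx[:, :, 0].shape, bool)
mask_long[i, j] = True

# One-line equivalent suggested in the comment
mask_short = Nx[:, :, 0] < 0.4

print(np.array_equal(mask_long, mask_short))  # True
```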

2 Answers


I think you want to 'vectorize', to use numpy terminology, not parallelize with multiprocessing.

Your calculation is essentially a dot (matrix) product. Apply the mask once to the whole array to get a 2d array, NIJ. Its shape will be (N,5), where N is the number of True values in ~mask. Then it's just a (5,N) array 'dotted' with an (N,5) one - i.e. a sum over the N dimension, leaving you with a (5,5) array.

NIJ = Nx[~mask, :]
Ku = np.dot(NIJ.T, NIJ)

In quick tests it matches the Ku produced by your double loop. Depending on the underlying library used for np.dot there might be some multicore calculation, but that's usually not a priority issue for numpy users.
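For example, a small self-contained check (array sizes chosen arbitrarily) that the vectorized product matches the question's double loop:

```python
import numpy as np

np.random.seed(1)
Nx = np.random.rand(5, 5, 5)
mask = Nx[:, :, 0] < 0.4

# Double loop from the question (mask!=True is the same as ~mask)
n = Nx.shape[2]
Ku_loop = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        NI = Nx[:, :, i]
        NJ = Nx[:, :, j]
        Ku_loop[i, j] = (NI[~mask] * NJ[~mask]).sum()

# Vectorized version: one mask application, then a matrix product
NIJ = Nx[~mask, :]          # shape (N, 5)
Ku_vec = np.dot(NIJ.T, NIJ)  # shape (5, 5)

print(np.allclose(Ku_loop, Ku_vec))  # True
```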


Applying the large boolean mask is the most time consuming part of these calculations - both the vectorized and iterative versions.

For a mask with 400,000 True values, compare these 2 indexing times:

In [195]: timeit (NI[:400,:1000],NJ[:400,:1000])
100000 loops, best of 3: 4.87 us per loop
In [196]: timeit (NI[mask],NJ[mask])
10 loops, best of 3: 98.8 ms per loop

Selecting the same number of items with basic (slice) indexing is several orders of magnitude faster than advanced indexing with the mask.

Substituting np.dot(NI[mask],NJ[mask]) for (NI[mask]*NJ[mask]).sum() only saves a few ms.
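A rough way to reproduce this comparison on your own machine (shapes are illustrative, roughly matching the ~400,000-element selections above; absolute timings will vary):

```python
import numpy as np
import timeit

NI = np.random.rand(1400, 1528)
mask = np.random.rand(1400, 1528) < 0.19  # roughly 400,000 True values

# Basic (slice) indexing returns a view - nearly free
t_slice = timeit.timeit(lambda: NI[:400, :1000], number=100)
# Boolean (advanced) indexing copies the selected elements
t_bool = timeit.timeit(lambda: NI[mask], number=100)

print(f"slice: {t_slice:.5f}s  boolean: {t_bool:.5f}s")
```

The gap comes from slicing producing a view while boolean indexing must gather and copy every selected element.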


13 Comments

I have to do this on large arrays a great number of times, and then I solve a system with these arrays using np.linalg.solve. I would like to accelerate it...
@user3601754 did you try this code? It's fast. Perform some timing comparisons if you need convincing. I know the history of your questions and you should definitely consider this solution before attempting to do multiprocessing.
13.237869 seconds for my case, and I've been waiting for the results of this code for 2 minutes
What size of arrays are you working with? And which takes more time, applying the mask or doing the dot? With large enough ones there might be a trade-off between memory use and vectorization.
My Nx array has a shape of (1400, 1528, 20) and I have to repeat the calculation many times! It seems to be the mask step

I'd like to extend @hpaulj's excellent answer (great analysis of the problem too by the way) for large matrices.

The operation

Ku = np.dot(NIJ.T,NIJ)

can be replaced by

Ku = np.einsum('ij,ik->jk', NIJ, NIJ)

It should also be noted that np.dot could fall back to slower routines if numpy was not compiled to use BLAS.

For a test matrix NIJ of shape (1250711, 50), I got 54.9 s with the dot method, while the einsum does it in 1.67 s. On my system, numpy is compiled with BLAS support.
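To convince yourself the two routes agree numerically (a smaller stand-in shape is used here so it runs quickly; the answer's test matrix was (1250711, 50)):

```python
import numpy as np

NIJ = np.random.rand(10000, 50)  # stand-in for the large test matrix

Ku_dot = np.dot(NIJ.T, NIJ)
Ku_einsum = np.einsum('ij,ik->jk', NIJ, NIJ)

print(np.allclose(Ku_dot, Ku_einsum))  # True
```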

Remark: np.einsum does not always outperform np.dot; on my system this became apparent when comparing any of the following

Nx = np.random.rand(1400,1528,20).astype(np.float16)
Nx = np.random.rand(1400,1528,20).astype(np.float32)

(or even a dtype of np.float64).

4 Comments

Thanks for your help! If I use your code, I get 263.650027 sec :) For the moment, my code is better (in my case)!
@user3601754, you should consider hpaulj's comment: without more information about the size of the arrays you're working on (and possibly the size of the RAM), it's hard to improve on this algorithm any further. Just FYI, using hpaulj's algorithm (without the einsum), on a matrix of Nx = np.random.rand(500,5000,50), I saw a threefold speed improvement compared to your method. But such speed gains depend a lot on the dimensions of the arrays involved (is the last axis the biggest one?)
Ok! My shape is (1400, 1528, 20)
and my type is float64
