Sunday, 19 September 2010

filtering data: the where() function

I'm now trying to use my favorite IDL function in Python: the where() function. It selects the subscripts of an array where a condition is fulfilled.
As I'm using the pylab version of ipython, it comes with a where function that more or less matches the IDL one. But it seems that another method can also be used...
Let's do an example: from a table of 1000000 rows and 10 columns, I want to select the elements whose values in the 3rd and 7th columns are both bigger than 0.5.


I'm first creating a 2D-table on which I will apply my filter.

a = random((1000000, 10))

I first had to realize that the subscripts are in the inverse order compared to IDL: first rows, then columns.
So the elements of the 3rd column are a[:,2], and the ones of the 7th column are a[:,6].
I first try:

tt = where(a[:,2] > 0.5 and a[:,6] > 0.5)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

OK, I'm not yet at the level of understanding what Python told me, but clearly it's not correct. I finally found that with parentheses and & instead of and, things go well:
tt = where((a[:,2] > 0.5) & (a[:,6] > 0.5))
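
The error comes from the keyword and, which asks for a single True/False value for each whole array, whereas & combines the two comparisons element by element. The parentheses are needed because & binds more tightly than >. A minimal sketch on a small hand-made example (the names col3 and col7 and their values are hypothetical, and numpy is imported explicitly instead of relying on pylab):

```python
import numpy as np

# Small stand-ins for a[:,2] and a[:,6] (hypothetical values)
col3 = np.array([0.1, 0.7, 0.9, 0.6])
col7 = np.array([0.8, 0.2, 0.6, 0.9])

# 'and' needs one truth value per operand, which an array cannot give
try:
    col3 > 0.5 and col7 > 0.5
    and_failed = False
except ValueError:
    and_failed = True

# '&' works element by element on the two boolean arrays
both = (col3 > 0.5) & (col7 > 0.5)
idx = np.where(both)[0]
print(and_failed, both, idx)
```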

tt is a so-called "tuple" (here it holds a single array of subscripts), and it has a size:
In [90]: size(tt)
Out[90]: 250248


It's a credible value given the condition and the size of the input table: about a quarter of the 1000000 rows, as expected for two independent uniform columns.

I can now use this variable to extract the values from the initial table:
x=a[tt,2]
y=a[tt,6]

The main problem here is that the tables I get in x and y are not correctly shaped:
In [112]: x.shape
Out[112]: (1, 250248)
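
The extra dimension appears because the tuple itself is used as a subscript: it is treated as a sequence holding one index array, so the result gets a leading axis of length 1. Taking the first element of the tuple gives a flat result. A sketch on a smaller random table (the explicit numpy import, the seed and the table size are assumptions made for this example):

```python
import numpy as np

rng = np.random.default_rng(0)      # seed chosen only for repeatability
a = rng.random((1000, 10))          # smaller table than in the post

tt = np.where((a[:, 2] > 0.5) & (a[:, 6] > 0.5))

print(type(tt), len(tt))            # a tuple holding one index array
x_2d = a[tt, 2]                     # tuple used as a subscript: extra axis
x_1d = a[tt[0], 2]                  # index array itself: flat result
print(x_2d.shape, x_1d.shape)
```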


If I try to plot this (using plot(x,y,'.')), it doesn't work (well, after waiting some minutes for a result I killed the plot!).
I can reshape the x and y tables using for example the transpose function:
plot(transpose(x),transpose(y),'.')
but it is better to transpose the filter before using it:
tt2 = transpose(where((a[:,2] > 0.5) & (a[:,6] > 0.5)))
plot(a[tt2,2],a[tt2,6],'.')
works fine. BTW, tt2 is no longer a tuple; it is now an array of integers, so for example:
In [146]: tt2.size
Out[146]: 250248

Another way of filtering the data is to generate a table of booleans:
tt3 = (a[:,2] > 0.5) & (a[:,6] > 0.5)
This is an array:
In [147]: tt3.size
Out[147]: 1000000

In [148]: tt3.dtype
Out[148]: dtype('bool')

Contrary to IDL, it can be used directly as a subscript of a table:
In [149]: a[tt3,2].size
Out[149]: 250248

In [150]: a[tt3,2].shape
Out[150]: (250248,)


Fine, the plot also works with this:
In [154]: plot(a[tt3,2],a[tt3,6],'.')
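
To check that the boolean table and the subscripts select the same rows, something like the following could be used. This is only a sketch: np.flatnonzero is a NumPy function brought in for the example (it is not used in the session above), and the table is smaller than the one in the post.

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.random((1000, 10))

mask = (a[:, 2] > 0.5) & (a[:, 6] > 0.5)   # boolean table, one entry per row
idx = np.flatnonzero(mask)                  # subscripts where mask is True

same_x = np.array_equal(a[mask, 2], a[idx, 2])
same_count = mask.sum() == idx.size
print(same_x, same_count)
```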

I tried both methods (where and boolean) on a big table, but didn't see any real difference in execution time. If some readers could tell me which one is really the more "Python way"...
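
One way to compare the two methods is the standard timeit module. The sketch below assumes a smaller table than in the post and uses hypothetical helper names (with_where, with_mask). The timings depend on the machine and the NumPy version, so no figures are quoted here.

```python
import timeit
import numpy as np

rng = np.random.default_rng(2)
a = rng.random((100_000, 10))   # smaller than the post's table, to keep it quick

def with_where():
    # subscripts first, then extraction
    idx = np.where((a[:, 2] > 0.5) & (a[:, 6] > 0.5))[0]
    return a[idx, 2]

def with_mask():
    # boolean table used directly as a subscript
    mask = (a[:, 2] > 0.5) & (a[:, 6] > 0.5)
    return a[mask, 2]

t_where = timeit.timeit(with_where, number=20)
t_mask = timeit.timeit(with_mask, number=20)
print(f"where: {t_where:.4f} s, mask: {t_mask:.4f} s")
```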

3 comments:

  1. It is possible to use the where function to obtain not the subscripts but the booleans (as in the tt3 example):
    tt3bis = where((a[:,2] > 0.5) & (a[:,6] > 0.5),True,False)

  2. And I found another way to use the where function:
    tt5 = where(logical_and(a[:,2] > 0.5,a[:,6] > 0.5),1,0)
    x=compress(tt5,a[:,2])

  3. You have to unpack the tuple returned by where
    tt, = where((a[:,2] > 0.5) & (a[:,6] > 0.5))
    (Note the comma after tt)

    You can also use boolean indexing:
    tt = (a[:,2] > 0.5) & (a[:,6] > 0.5)

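
The variants from the comments can be checked against the boolean table in one go. This is a sketch on a small random table, with an explicit numpy import; the names cond, flags, x_compress and x_unpacked are chosen for this example only.

```python
import numpy as np

rng = np.random.default_rng(3)
a = rng.random((1000, 10))
cond = (a[:, 2] > 0.5) & (a[:, 6] > 0.5)

as_bool = np.where(cond, True, False)     # comment 1: booleans instead of subscripts
flags = np.where(cond, 1, 0)              # comment 2: integer 0/1 flags
x_compress = np.compress(flags, a[:, 2])  # keeps elements where the flag is nonzero

tt, = np.where(cond)                      # comment 3: unpack the one-element tuple
x_unpacked = a[tt, 2]

print(np.array_equal(as_bool, cond),
      np.array_equal(x_compress, a[cond, 2]),
      np.array_equal(x_unpacked, a[cond, 2]))
```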