Wednesday, 13 October 2010

list comprehension

Python has a very compact way to write loops and tests on lists: the so-called list comprehension.
Here is an example found on a French forum: http://www.developpez.net/forums/d962510/autres-langages/python-zope/general-python/operations-listes-dictionnaires/
Someone wanted to transform a dictionary like this:
d= {'Ei': (1,3,4,4,6) , 'id' : ('r','r','t','t','t')}
into the list of the Ei values that occur for each id:

[('r', [1, 3]), ('t', [4, 4, 6])] 

One of the answers was:

L=[(d['id'][i],d['Ei'][i]) for i in xrange(0,len(d['Ei']))]

R=[(x,[y[1] for y in L if y[0]==x]) for x in set(d['id']) ] 

Quite efficient!

The first step is to build L, the list of (id, Ei) pairs:

[('r', 1), ('r', 3), ('t', 4), ('t', 4), ('t', 6)]

This is done with something like this:

In [8]: L2 = []

In [9]: for i in xrange(0,len(d['Ei'])):
   ...:     L2.append((d['id'][i],d['Ei'][i]))

which can indeed be compacted as:

L=[(d['id'][i],d['Ei'][i]) for i in xrange(0,len(d['Ei']))] 

The second step (R) is the list comprehension form of the following, where we collect the Ei values that occur for each id. First find the set of distinct id values: set() does exactly that.

Then loop over these values, collect the matching Ei values for each one, and append the (id, values) pair to a list. The expanded form would be:

R2=[]

uniqid = set(d['id']) 
for x in uniqid:
    values = []              # Ei values found for this id
    for y in L:
        if y[0]==x:
            values.append(y[1])
    R2.append((x,values))    # one entry per unique id

This can also be obtained with:

R=[(x,[y[1] for y in L if y[0]==x]) for x in set(d['id']) ] 

This is more compact, in some sense more elegant, and I think also faster, but I have not measured it...
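As a hedged sketch of the same two steps, assuming Python 3 (where xrange no longer exists): zip and sorted are my own substitutions and are not part of the forum answer. zip avoids the index arithmetic, and sorting the unique ids gives a reproducible order, since a set has no guaranteed order.

```python
# Sketch only: same grouping as above, assuming Python 3.
d = {'Ei': (1, 3, 4, 4, 6), 'id': ('r', 'r', 't', 't', 't')}

# Step 1: pair each id with its Ei value (zip replaces the index loop).
L = list(zip(d['id'], d['Ei']))

# Step 2: for each unique id, collect the Ei values that go with it.
# sorted() is used only to get a reproducible order out of the set.
R = [(x, [ei for (key, ei) in L if key == x]) for x in sorted(set(d['id']))]

print(L)
print(R)
```

Note that the inner comprehension rescans L once per unique id, so for many distinct ids a single pass that fills a dictionary may scale better.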

Monday, 11 October 2010

Save and Restore (2)

I guess I got it: the syntax is not the same as in IDL, but reading and writing are about as fast:
def save(file,**kwargs):
    """
    Save the value of some data in a file.
    Usage: save('misdatos.pypic',a=a,b=b,test=test)
    """
    import cPickle
    f=open(file,"wb")
    cPickle.dump(kwargs,f,protocol=2)
    f.close()
def restore(file):
    """
    Read data saved with save function.
    Usage: datos = restore('misdatos.pypic')
    """
    import cPickle
    f=open(file,"rb")
    result = cPickle.load(f)
    f.close()
    return result
For example (note that I usually import my own module as CM):
CM.save('data3.pypic',data3=data3,a=a,b=b)
and
dd=CM.restore('data3.pypic')
take barely a second to write and read.
Something interesting:
dd is a dictionary containing the 3 variables data3, a and b.
Now if I want to use data3, I can "extract" it:

data3 = dd['data3']

Nice, and very quick, as it's not a copy but rather a second name pointing to the same object in memory:
In [15]: data3 is dd['data3']
Out[15]: True
In [16]: id(data3)
Out[16]: 4393756504
In [17]: id(dd['data3'])
Out[17]: 4393756504
A nice way to do things ;-)
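One consequence of sharing the object is worth a small sketch. The list below is only a stand-in for the restored data, and copy.deepcopy is my own suggestion for the case where an independent copy is really needed:

```python
import copy

dd = {'data3': [1.0, 2.0, 3.0]}   # stand-in for the restored dictionary

data3 = dd['data3']               # binds a second name to the same object
print(data3 is dd['data3'])

data3[0] = 99.0                   # the change is visible through dd as well
print(dd['data3'][0])

independent = copy.deepcopy(dd['data3'])  # a real copy, if one is needed
independent[1] = -1.0                     # does not touch dd['data3']
print(dd['data3'][1])
```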

ADD:
Some people may find it easier to save data using

save(file,"data")
others will prefer:
save(file,data=data)
You can use both with the following save function:

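def save(file,*args,**kwargs):
    """
    Save the value of some data in a file.
    Usage: save('misdatos.pypic','a',b=b)
    """
    import cPickle
    import sys
    # Names given as strings are looked up in the caller's namespace:
    # a bare eval(name) only sees save()'s own locals and module globals.
    caller = sys._getframe(1)
    dico = kwargs
    for name in args:
        dico[name] = eval(name, caller.f_globals, caller.f_locals)
    f=open(file,"wb")
    cPickle.dump(dico,f,protocol=2)
    f.close()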
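A possible alternative to evaluating names, sketched under assumptions: the caller passes its namespace explicitly, so the names become plain dictionary lookups. The function name save_names and its namespace parameter are hypothetical, and the sketch assumes Python 3, where cPickle has been merged into pickle.

```python
import os
import pickle
import tempfile

def save_names(filename, names, namespace, **kwargs):
    """Hypothetical variant: save namespace[name] for each name, plus kwargs."""
    dico = dict(kwargs)
    for name in names:
        dico[name] = namespace[name]   # plain lookup, no eval
    with open(filename, "wb") as f:    # the with block closes the file
        pickle.dump(dico, f, protocol=2)

def restore(filename):
    """Read back the dictionary written by save_names."""
    with open(filename, "rb") as f:
        return pickle.load(f)

a = [1, 2, 3]
b = "hello"
path = os.path.join(tempfile.mkdtemp(), "misdatos.pypic")
save_names(path, ["a"], locals(), b=b)   # 'a' by name, b as a keyword
dd = restore(path)
print(sorted(dd))
```

The call is slightly longer than save(file,'a',b=b), but it does not depend on frame inspection.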

Sunday, 10 October 2010

Save and Restore

I use the save/restore facilities in IDL a lot. By the way, the lack of a complete set of save/restore tools is what keeps me from using GDL in some projects. I was looking for an equivalent in Python, and realized that there are several tools; the main problem now is choosing the right one ;-)
I found this page: http://kbyanc.blogspot.com/2007/07/python-serializer-benchmarks.html
and made a few tests with cPickle, which is part of Python (no need to install an extra module).
It is actually very efficient when used with the protocol=2 option.
marshal seems to be even faster, but cannot handle rec.array objects (at least it doesn't work for me...).
Example of use:
In [2]: import CMorisset as CM
In [3]: data3 = CM.ReadFortran('test3.dat','a10,1x,f6.2,1x,f6.2,1x,i2',['name', 'ra', 'dec','mag'])
In [4]: import cPickle
In [5]: cPickle.dump(data3, open("data3.pickle", "wb"),protocol=2)
In [6]: data3=cPickle.load(open("data3.pickle","rb"))

Reading the 27 MB test3.dat file takes me 30 seconds with the Fortran format, and only 1 second with cPickle! The main problems are that 1) one needs to know the name of the stored variable, and 2) only one object can be saved at a time.
I think this can be worked around by using a dictionary containing the variables and their names.
My main issue now is to build the dictionary from the arguments passed to a function, so that I could just write:

save('data',dat1,dat2,dat3)

I'll look at this later...

Parameters in functions: take care!

I found a nice page describing something that is really strange to me; I'll have to keep it in mind because it is totally different from IDL.
The page: http://hetland.org/writing/instant-python.html
The "problem", as I rewrote it:
Let's define 2 functions:
def test(x):
    x=2
def test2(x):
    x[0]=2
Now, we'll call test and test2 with different variables:
In [3]: y=1
In [4]: test(y)
In [5]: y
Out[5]: 1
y is not changed. BUT:
In [7]: y=[1,2,3]
In [8]: test2(y)
In [9]: y
Out[9]: [2, 2, 3]
Now y[0] is changed!!! And even worse:
In [10]: y=[[4,5,6],[14,15,16],[24,25,26]]
In [11]: test2(y)
In [12]: y
Out[12]: [2, [14, 15, 16], [24, 25, 26]]
And the best for last:

In [13]: y=[[4,5,6],[14,15,16],[24,25,26]] 

In [15]: test2(y[0])
In [16]: y
Out[16]: [[2, 5, 6], [14, 15, 16], [24, 25, 26]]
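So part of a list can be changed by a function, but a single variable cannot: exactly the opposite of IDL... The reason is that x=2 only rebinds the local name x, while x[0]=2 modifies the list object that the caller also refers to.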

Sunday, 3 October 2010

My first module! Still on reading ASCII formatted files.

I wrote my first program in Python! Here it is:

def ReadFortran(file,format,names,comment="#"):
    """
    Read a file using a Fortran-style format.
    Return a NumPy rec.array with each column named following the given names.
    Example: data = ReadFortran('test2.dat','a10,1x,f6.2,1x,f6.2,1x,i2',['name', 'ra', 'dec','mag'],comment="#")
    Morisset, IA-UNAM, Oct. 2010
    """
    import Scientific.IO.FortranFormat as FF
    import numpy.core.records as nprec
    FFformat = FF.FortranFormat(format)
    f=open(file,'r')
    rows=[]
    for line in f:
        if line[0] != comment:
            row = FF.FortranLine(line,FFformat)
            rows.append(row.data)
    f.close()
    return nprec.fromrecords(rows, names=names)

OK, it's not a very big one, but I spent a lot of time trying to avoid the list.append command, and did not find a way. Anyway, it seems that most of the execution time is spent in the FortranLine call.
It is much slower than the equivalent in IDL: it reads a 1000000-line file in some 45 seconds, while IDL takes 5... csv2rec takes 20 seconds.
Perhaps one of these days I'll try calling a Fortran routine to read the file...

ADD:
It seems that a more compact and Pythonic way of writing the loop is to change:
rows=[]
for line in f:
    if line[0] != comment:
        row = FF.FortranLine(line,FFformat)
        rows.append(row.data)
into:
rows = [FF.FortranLine(line,FFformat).data for line in f if line[0] != comment]

The map function could also be used:

rows = map(lambda line:FF.FortranLine(line,FFformat).data,f)

BUT written this way, the comment lines are not skipped; they would have to be filtered out first (for example with filter).

I got these tips from http://jaynes.colorado.edu/PythonIdioms.html
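The comment lines can still be handled with map by filtering them out first. The sketch below assumes Python 3 and uses a hypothetical parse_line function as a stand-in for FF.FortranLine(line,FFformat).data, so that it runs without ScientificPython installed:

```python
def parse_line(line):
    """Hypothetical stand-in for FF.FortranLine(line, FFformat).data."""
    return line.split()

comment = "#"
lines = ["# header\n", "alpha 193.63 18.40 19\n", "beta 280.12 0.52 16\n"]

# List comprehension: filter and transform in one expression.
rows_lc = [parse_line(line) for line in lines if not line.startswith(comment)]

# map + filter: drop the comment lines first, then parse the rest.
# list() is needed in Python 3, where map returns a lazy iterator.
rows_map = list(map(parse_line,
                    filter(lambda line: not line.startswith(comment), lines)))

print(rows_lc == rows_map)
```

Both forms give the same rows; the list comprehension is arguably easier to read.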

Reading formatted ASCII files à la Fortran

I first learned to program in Fortran (after some introductions to Basic, Ada, and Turbo Pascal in the early 80's). Then I met IDL in 1994 (thanks to my friend Philippe) and life changed! Interactive + Data + Language was exactly what I needed. But as already said at the beginning of this blog, I now want to switch to a free, open language.
But I find this change very difficult; I'm like a baby learning to walk and talk... For example, I spent 2 weeks looking for a way to read a simple formatted ASCII file, as I used to do in IDL.
The file is just:

     alpha 193.63  18.40 19
      beta 280.12   0.52 16
     gamma 206.59   0.06 17
     delta  23.74  17.92 19
       eta  18.10  10.07 19
and the IDL process is:

data = replicate({name:'',ra:0.0,dec:0.0,mag:0},n_lines)

openr,lun,/get_lun,file

readf,lun,data,format='(a10,1x,f6.2,1x,f6.2,1x,i2)'

et voilà!

The format string uses the Fortran convention, which is powerful enough to describe almost any fixed format. It seemed that this was not possible in Python, until I found a module providing this facility! Developed by French people from the CNRS in Orléans, it is available here:
http://dirac.cnrs-orleans.fr/plone/software/scientificpython/

The part of the module I want is this one:

Module FortranFormat

Fortran-style formatted input/output

This module provides two classes that aid in reading and writing Fortran-formatted text files.

Examples:

Input::

   >>>s = '   59999'
   >>>format = FortranFormat('2I4')
   >>>line = FortranLine(s, format)
   >>>print line[0]
   >>>print line[1]

 prints::

   >>>5
   >>>9999


 Output::

   >>>format = FortranFormat('2D15.5')
   >>>line = FortranLine([3.1415926, 2.71828], format)
   >>>print str(line)

 prints::

   '3.14159D+00    2.71828D+00'

I used it to read the same file as previously in IDL:
import numpy
data=numpy.rec.array(['          ',0.,0.,0], names=['name', 'ra', 'dec','mag'])
from Scientific.IO import FortranFormat as FF
format = FF.FortranFormat('a10,1x,f6.2,1x,f6.2,1x,i2')
f=open('test1.dat','r')
for line in f:
#    data['name'],data['ra'],data['dec'],data['mag'] = FF.FortranLine(line,format)
    data.name,data.ra,data.dec,data.mag = FF.FortranLine(line,format)
    print data
The main problem is that I don't know how to get the whole array into the data variable. Anyway, the problem of reading fixed-format ASCII files is solved ;-)
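One way to keep every row, sketched under assumptions: collect the parsed lines in a list instead of overwriting a single record. To stay self-contained, the sketch does not use FortranFormat; it swaps in plain string slicing with column positions derived from the format a10,1x,f6.2,1x,f6.2,1x,i2. The helper parse_fixed is hypothetical.

```python
def parse_fixed(line):
    """Parse one line laid out as a10,1x,f6.2,1x,f6.2,1x,i2 (hypothetical helper)."""
    name = line[0:10].strip()   # a10
    ra = float(line[11:17])     # 1x, then f6.2
    dec = float(line[18:24])    # 1x, then f6.2
    mag = int(line[25:27])      # 1x, then i2
    return (name, ra, dec, mag)

sample = [
    "     alpha 193.63  18.40 19",
    "      beta 280.12   0.52 16",
]

# Collect every row instead of overwriting a single record.
rows = [parse_fixed(line) for line in sample]
print(rows)
```

A list of tuples like this is the kind of input that numpy's fromrecords accepts, together with a list of column names, to build the whole rec.array in one call.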