I was watching a YouTube video about Haskell and its lazy file reading, and I thought... wait a sec, I can do that in Python too. We have generators too :P
Generators are pretty cool when it comes to lazy evaluation. Now, for the purists reading this post... I know this may not strictly be lazy evaluation, but it comes close.
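As a quick sketch of the idea (the function name and file are mine, purely for illustration), a generator reads a file one line at a time instead of slurping it into memory:

```python
def read_lines(path):
    # A generator: each line is produced on demand, only when the
    # caller asks for the next one -- the whole file is never loaded.
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")

# Nothing is actually read until you iterate:
# for line in read_lines("big.log"):
#     process(line)
```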
One of my favorite one-liners is converting a list of key-value pairs into a dictionary. It's awesome how easily I can do it in Python.
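For instance (assuming the items are really two-element key-value pairs, which is what dict() expects; the data here is made up):

```python
pairs = [("alice", 1), ("bob", 2)]
d = dict(pairs)                      # one-liner: list of pairs -> dict
print(d)                             # {'alice': 1, 'bob': 2}

# A dict comprehension works too, when you need to transform as you go:
squares = {n: n * n for n in range(4)}
print(squares)                       # {0: 0, 1: 1, 2: 4, 3: 9}
```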
This is an extension of a problem I faced while messing with the boost::asio library. I finally attempted to run different async loops in multiple threads and failed spectacularly. Let me share some of my experiences with it. The following code uses the Boost libraries; don't worry if you don't understand it all. The easiest way to interface C++ with Python is to write a C wrapper around it and then build a shared .so file from it. Python has an awesome (well, not always so awesome) library called ctypes that lets you call into .so files from your Python program.
So, compiling this program seems straightforward, assuming Boost has already been set up.
$ g++ -c -fPIC scale.cpp -o scale.o -L/usr/lib -lboost_system -lboost_thread
So ideally this should work, right? I mean, why not: you have created your object file with the necessary Boost link flags and then built a shared .so from it. Let's create a simple Python wrapper for it and call the run function.
If you run it in an interpreter... voila, as soon as you load the library you will see the following error. Damn! It can't resolve the boost_thread symbols... But how is that possible?
>>> lib = cdll.LoadLibrary('./libscale.so')
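For reference, the wrapper itself is tiny (a sketch: the exported function name run comes from the post, but its signature is my assumption):

```python
from ctypes import CDLL

def load_scale(path="./libscale.so"):
    # CDLL raises OSError at load time if the .so has unresolved
    # symbols -- exactly the failure described above.
    lib = CDLL(path)
    lib.run.restype = None  # assuming the C wrapper exports: void run(void)
    return lib
```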
If you inspect the linked libraries (ldd libscale.so), libscale.so does not list the Boost libraries at all. Why on earth did the compiler not warn me about that? (It turns out the -l flags are simply ignored during a compile-only -c step, and by default the linker allows undefined symbols in a shared library.) Finally I had to recreate the .so, this time explicitly mentioning boost_thread and boost_system:
rahulram~/programs/cpp $ g++ -shared -Wl,-soname,libscale.so -o libscale.so scale.o -lboost_system -lboost_thread
Ooh la la... it seems to have linked the desired libraries. Now re-run the same Python program; it should work.
PS: I learnt this the hard way :(
The XY problem is a classic case of a user or client asking about problem X while actually trying to solve problem Y, and ultimately sapping everyone's time. So, this is how it goes.
Here's an example I encountered.
Problem Title: Pattern Matching and Regular Expression
The input file is in the following format and the data is to be crunched.
The title is so obtuse. This is a simple case of structuring unstructured data. Anyway, after a lot of hotch-potch I figured out what was really required and started brainstorming on it.
When it comes to regex in Python, I kind of hesitate to use it; in fact I try to avoid it as much as possible. Why?
First, I am bad at regex. I tend to miss edge cases and then get screwed. Secondly, I have seen some deadly Perl regexes that can crunch data way faster than what my Python could do. I don't intend to start a language debate, but most of the time I have seen Perl regexes do much better than Python's. You can compile the regex in Python for better performance, but this is still one area where Perl dominates.
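For what it's worth, compiling the pattern once does help when it is reused in a loop (the pattern and sample data below are mine, not from the original problem):

```python
import re

# Compile once, reuse many times instead of re-parsing the pattern
# on every call.
title_re = re.compile(r"^Title:\s*(.+)$")

lines = ["Title: Foo", "body text", "Title: Bar"]
titles = [m.group(1) for m in map(title_re.match, lines) if m]
print(titles)  # ['Foo', 'Bar']
```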
The initial solution I saw for the above problem was to fetch the whole file content into a string and run a compiled regex over it. Seems like a fair deal, but here is the problem.
I prefer reading the file line by line and structuring it as I go, because all we needed to extract was each title and its relevant context. My main concern was to avoid loading the whole file at once and to keep regex to a minimum.
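A sketch of that line-by-line approach (the "Title:" marker and the field layout are my invention, since the original input format isn't shown here):

```python
def structure(lines):
    # Walk the input once, grouping each title with the lines under it.
    records, current = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith("Title:"):
            current = line[len("Title:"):].strip()
            records[current] = []
        elif current is not None and line:
            records[current].append(line)
    return records

sample = ["Title: Foo", "ctx one", "", "Title: Bar", "ctx two"]
print(structure(sample))  # {'Foo': ['ctx one'], 'Bar': ['ctx two']}
```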
I agree, the code is not exactly Pythonic. It can be made more Pythonic, but I leave that exercise to the OP ;)
Python memory allocation seemed pretty convoluted when I first began to wonder what happens to memory once an object is no longer used. Well, allocation ultimately happens using malloc and free (CPython is written in C), and these operations are pretty expensive (I will quantify that in a while).
Now, you would usually think that when I allocate an object and then delete it, the memory consumed by the object is freed. Guess what? It definitely is not, and I assure you that's not a bug in Python. Let's take a closer look at how this works.
Until around Python 2.3, Python handled memory management primarily through reference counting (Google it :P). Here is an interesting article (a must-read) on how memory handling happens when only reference counting is used. To summarize: Python holds on to memory even after objects are deleted. Why? Because memory operations are expensive, so if your program has a sudden burst of object creation, without pymalloc you would suffer a huge performance hit. Since the Python interpreter holds the memory for you (which any decent high-level language should do), you don't have to worry about it, unless you intend to write a long-running daemon process that throws ints, floats and lists all over the place (ints, floats and lists are not handled via pymalloc; they keep their own free lists). But it is definitely a big reason to worry if the code runs on a server for a very, very long time.
So what do we do about it ?
Good news. In later versions (probably Python 2.6 onwards, I am not too sure about that) Python supplements reference counting with a cyclic garbage collector. I am not getting into the debate on whether GC is a good thing or not. Hard-core C++ programmers will never agree with me, as GC is not suitable for real-time systems, since it uses a significant amount of computing power to clean up memory.
Note that automatic garbage collection is triggered by allocation counts, not by memory pressure: it will not run just because your Python process is running out of memory.
In your Python program, you can "import gc" (Python's garbage-collector interface). By default GC is enabled in Python, unless you manually disable it with gc.disable().
gc.get_threshold() # Returns the collector's allocation-count thresholds (not bytes); a collection runs once these are exceeded
gc.collect() # Collects unreachable objects and frees up memory for you
If you are running a long daemon process, you can call gc.collect() periodically, for example after certain function calls or on a timer. It's up to you.
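Putting those calls together (the threshold values shown are typical CPython defaults; yours may differ):

```python
import gc

print(gc.isenabled())      # True: the cycle collector is on by default
print(gc.get_threshold())  # e.g. (700, 10, 10): allocation counts, not bytes
freed = gc.collect()       # force a full collection right now
print("unreachable objects found:", freed)
```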
A few caveats to keep in mind:
You may have deleted l, but all the memory pertaining to it is still being held by the Python interpreter (What?? Still confused? Read the article I mentioned above). In such scenarios, using xrange is a better idea.
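To see why, compare the two. The post is about Python 2's xrange; the demonstration below uses Python 3, where range itself is the lazy object xrange used to be:

```python
import sys

# A lazy range is a constant-size object no matter how many numbers
# it describes; materializing them as a list costs memory per element.
lazy = range(10**8)
eager = list(range(10**4))
print(sys.getsizeof(lazy))   # small and constant
print(sys.getsizeof(eager))  # grows with the number of elements
```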
Here we are creating a single object. The following is the output.
As you can see, the GC has detected that the dictionary L is unreachable, and after we run gc.collect() the object is freed.