Py4J Backlog, Bytes, and Open Source

Since my Ph.D. thesis is being printed right now, I thought I could give a status update on Py4J.

One Py4J contributor/user reported a problem with how Py4J handles byte arrays almost a year ago. Because Py4J was treating byte arrays like any other array (i.e., as a reference), access to individual cells in the array was costly (one roundtrip per access). Byte arrays are special beasts because when you go down to the level of bytes, you usually want the raw power and the hanging rope that come with it: you certainly don’t want the programming language or a particular library to stand in your way. Because Py4J uses a String protocol (e.g., newlines are used as separators), transferring raw bytes would require a lot of modifications and would introduce a special case that would need more code than the usual case.

I thus implemented a naive solution that just shifted each byte by 8 bits, to make sure that I could still use my dear newlines. The same person came back to me a few months later though, and introduced me to the concept of UTF-16 surrogates and how Java did not like these special pairs of characters, even in UTF-8, the default encoding for Py4J.

I boosted the priority of this issue, but because I had started a new job and I was trying to finish my thesis during the weekends (advice: this is the fastest way to end up in an asylum), I had neither the time nor the strength to find a solution. Fortunately, a contributor from The Atlantic made a nice Christmas present to Py4J users: he implemented a fix using Base64 and opened a pull request. I merged the pull request in January, but I’m still fighting with some test glitches caused by the differences between Python 2 and Python 3. The Open Source community has been very kind to me and I have been fortunate to receive significant contributions from Py4J users in the past (Python 3 support anyone?). Because I am working for a company that is sympathetic to open source contributions, I will make sure in the near future that the effort behind the various Py4J patches was not in vain.
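
To make the Base64 idea concrete, here is a minimal sketch of why it fits a newline-delimited text protocol. The helper names (encode_bytes_arg, decode_bytes_arg) are made up for this illustration and this is not Py4J's actual wire format; only the standard base64 module is used.

```python
import base64


def encode_bytes_arg(raw):
    """Turn raw bytes into ASCII text that is safe inside a line-based protocol."""
    # b64encode (unlike encodebytes) never inserts line breaks, so the
    # result cannot be confused with the newline separator.
    return base64.b64encode(raw).decode("ascii")


def decode_bytes_arg(text):
    """Recover the original bytes from their text form."""
    return base64.b64decode(text.encode("ascii"))


# Every possible byte value, including the newline separator itself.
payload = bytes(bytearray(range(256)))
line = encode_bytes_arg(payload)
restored = decode_bytes_arg(line)
```

The price is a payload that is about one third larger, which seems an acceptable trade-off for keeping the protocol purely textual.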

There are currently 5 open issues that I need to close before releasing 0.8, but all issues have some work in progress so I am confident that I will go through this backlog soon. After that, I will try to come back to a regular release cycle.

Py4J and Exceptions

Py4J 0.6 is almost ready to be released, thanks to Jakub L. Gustak who submitted important bug reports, feature requests, and patches. I have been trying to polish Py4J in the latest releases to make the API more consistent and predictable and the biggest “feature” of 0.6 will no doubt be how Py4J treats Exceptions.

Currently, exceptions can be raised in four places: (1) in the Py4J Python code, (2) in the Py4J Java code, (3) in the Java client code, and (4) in the network stack. An exception might be raised in the Py4J code if the client code is not correct, for example, if the client tries to call from Python a Java method that does not exist. Before 0.6, Py4J raised a Py4JError in cases 1, 2, and 3 and a Py4JNetworkError (a subtype of Py4JError) in case 4. Moreover, if the exception was raised on the Java side, the Java stack trace was copied, as a string, into the Py4JError.

There are two issues with this approach. First, the client does not have access to the exception instance on the Java side, and this exception may have some important fields and methods that can help the error recovery. Second, it is very difficult for the client to determine at runtime the source of the error.

Starting from 0.6, Py4J will raise three types of exceptions: Py4JNetworkError in case #4, Py4JJavaError in case #3, and Py4JError in cases #1 and #2. Py4JNetworkError and Py4JJavaError will be subtypes of Py4JError (so a client can implement a catch-all). Py4JJavaError will also have a method that will return the instance of the Java exception, and Py4JError will still display the Java stack trace for case #2.
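
The hierarchy described above can be sketched with local stand-in classes. These are not imported from Py4J: the accessor name get_java_exception and the string used in place of a real Java exception instance are assumptions made for illustration only.

```python
class Py4JError(Exception):
    """Base class: errors raised by the Py4J code itself (cases #1 and #2)."""


class Py4JNetworkError(Py4JError):
    """Raised when the network stack fails (case #4)."""


class Py4JJavaError(Py4JError):
    """Raised when the Java client code throws an exception (case #3)."""

    def __init__(self, msg, java_exception):
        super(Py4JJavaError, self).__init__(msg)
        self._java_exception = java_exception

    def get_java_exception(self):  # hypothetical accessor name
        return self._java_exception


def describe(action):
    """Run action() and report where the error came from."""
    try:
        action()
    except Py4JJavaError as e:  # subtypes must be tested before the base class
        return "java: " + str(e.get_java_exception())
    except Py4JNetworkError:
        return "network"
    except Py4JError:  # catch-all for the remaining Py4J errors
        return "py4j"
    return "ok"


def raise_java():
    # A plain string stands in for the Java exception instance.
    raise Py4JJavaError("call failed", "java.lang.IllegalStateException")


def raise_network():
    raise Py4JNetworkError("connection refused")


def raise_py4j():
    raise Py4JError("method does not exist")
```

A client that only cares about failure in general can keep a single except Py4JError clause, while a client that wants to recover from a specific Java exception can inspect the instance carried by Py4JJavaError.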

Stay tuned for 0.6!

 

Memory Management and Circular References in Python – Part 3

In the previous post, we saw that finalizers are difficult to write in the presence of circular references, even when using weak references to hold the finalizer callback. The main issue is that if object A’s finalizer is held by object A, and object A has a circular reference with object B, the finalizer will never get invoked because both objects are removed before the finalizer has a chance to be invoked.

There are two strategies to work around this issue: one is to save the finalizers in an external object that does not belong to a cycle with object A, the other is to remove the circular reference with object B. This is essentially what JavaObject1 and JavaObject2 do in the next code snippet:

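import weakref

# inc1() comes from Part 1; inc2() is assumed to be its counterpart for accumulator2.
finalizers = []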

class JavaObject1(object):
    def __init__(self, id):
        self._id = id
        self._methods = {}
        finalizers.append(weakref.ref(self, lambda wr : inc1()))

    def __getattr__(self, name):
        if name not in self._methods:
            self._methods[name] = JavaMember(name, self)
        return self._methods[name]

class JavaObject2(object):
    def __init__(self, id):
        self._id = id
        self._wr = weakref.ref(self, lambda wr : inc2())  

    def __getattr__(self, name):
        return JavaMember(name, self)

class JavaMember(object):
    def __init__(self, name, container):
        self.name = name
        self.container = container

    def __call__(self, *args):
        j = 0
        for i in xrange(1, 10):
            j += i

As we can see, the weak reference/finalizer in JavaObject1 is held by the global variable finalizers. JavaObject2 is no longer part of a cycle because it creates a new JavaMember every time a member is requested. The JavaMember still refers to the JavaObject2 instance to prevent the eager garbage collection of the JavaObject2 instance discussed at the beginning of the last post. If we try to test our new implementations, we obtain these results:

def m1():
    for j in xrange(0, 100):
        java_object = JavaObject1('o' + str(j))
        for i in xrange(10000):
            java_object.method1()

def m2():
    for j in xrange(0, 100):
        java_object = JavaObject2('o' + str(j))
        for i in xrange(10000):
            java_object.method1()

if __name__ == '__main__':
    timer(m1,'With JavaObject1: ')
    timer(m2,'With JavaObject2: ')

...

With JavaObject1: 1.83600997925
acc1:47
acc2:0
With JavaObject2: 2.38709688187
acc1:47
acc2:100

REMINDER: the float number is the time it took to make 1 million method calls (100 instances x 10 000 calls), the number beside acc1 indicates how many times a finalizer of a JavaObject1 instance was called (max 100), and acc2 indicates how many times a finalizer of a JavaObject2 instance was called.

First, we can see that JavaObject1 is performing a little better than JavaObject2 because method objects are cached and do not need to be instantiated at every call. Second, we can see that not all finalizers of JavaObject1 instances are called. Why?

Well, the answer lies in the garbage collection strategies used by CPython. Roughly, when there is no circular dependency, CPython uses a reference counting strategy: when the last reference to an object is deleted, the object is immediately garbage collected (hence the perfect acc2 score for JavaObject2). But when there is a circular dependency, the reference counts never drop to zero, so CPython falls back on a separate cycle-detecting collector that is only executed once in a while. In other words, the Python interpreter has to stop the program execution, inspect the objects, and clean up the unreachable ones. The frequency of the garbage collection can be tuned using the gc module (see gc.set_threshold).
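
The difference between the two strategies can be observed with a small, self-contained sketch. The Node class and the make_cycle helper are made up for this illustration; automatic collection is disabled for the duration of the experiment so that the result does not depend on the current thresholds.

```python
import gc
import weakref


class Node(object):
    pass


def make_cycle():
    """Create two objects that reference each other and return a probe."""
    a, b = Node(), Node()
    a.other, b.other = b, a
    # The weak reference lets us check whether 'a' is still alive
    # without keeping it alive ourselves.
    return weakref.ref(a)


# Three allocation thresholds, one per generation of the collector.
thresholds = gc.get_threshold()

gc.disable()  # make sure no automatic collection interferes
try:
    probe = make_cycle()
    # No name refers to the cycle anymore, but reference counting
    # alone cannot reclaim it.
    alive_before = probe() is not None
    gc.collect()  # explicit run of the cycle collector
    alive_after = probe() is not None
finally:
    gc.enable()
```

Lowering the first threshold with gc.set_threshold makes the cycle collector run more often, at the cost of more frequent pauses.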

In the previous example, because the program did not execute long enough (more precisely, because the number of allocations/deallocations did not reach the gc thresholds), some objects were never collected before the end of the program. If instead we explicitly invoke the garbage collector, we see that now all finalizers are invoked:

if __name__ == '__main__':
    timer(m1,'With JavaObject1: ')
    timer(m2,'With JavaObject2: ')

    print('Running GC')
    gc.collect()
    print('acc1:' + str(accumulator1))
    print('acc2:' + str(accumulator2))

...

With JavaObject1: 1.83600997925
acc1:47
acc2:0
With JavaObject2: 2.38709688187
acc1:47
acc2:100
Running GC
acc1:100
acc2:100

The next post will provide a summary of the lessons learned!

Eating your own dog food

I believe there are three cost-effective strategies to polish and simplify the API of an open-source project. The first strategy is to write getting started documentation (e.g., a tutorial): bad APIs are embarrassing to write about, and tutorials about good APIs are also easier and quicker to write.

The second strategy is to build a community and listen to their enhancement requests, while being able to say no to keep the code cohesive and focused. Of course, usability is generally improved and evaluated through observation of usage and not through interviews and requests. But usability studies are not cost-effective in the context of open source projects: who has the time and the money to perform such studies?

The third strategy is to use your own project. This often happens when a project starts, but I suspect that as the project matures, some of the contributors don’t use their project as often so new features might not get the same polishing attention as earlier features.

I’m currently eating my own dog food by using Py4J to analyze and index Java code through Eclipse. I’m delighted that accessing Eclipse from Python and exploring new Java APIs are so easy to do. What used to take me one hour of exploration now takes fifteen minutes. Usually, my workflow when hacking in Eclipse looks like this: read the javadoc and the code of potentially interesting classes, write a plug-in, debug the plug-in, look for alternatives in the Eclipse code, repeat. With Py4J, I just try to call a method and see how it goes. When I get something back, I call gateway.help() and I have a list of methods I can call. No more leap of faith (“let’s see what happens when I call these 10 methods together”).

Still, I encountered a few frustrating things that I will fix as soon as the next big feature (callback) is implemented:

Arguably, I knew about the first and third points, but I had no idea how frustrating they were!

Memory Management and Circular References in Python – Part 2

When we left off last time, I had presented a simplified version of the JavaObject and JavaMember classes in Py4J. In Py4J, when a JavaMember is called, Py4J calls the equivalent method on the JVM, and when a JavaObject is garbage collected, it is dereferenced on the JVM. Then, I asked the question: what is wrong with the way garbage collection is handled?

class JavaObject1(object):
    def __init__(self, id):
        self._id = id
        self._methods = {}
        self._wr = weakref.ref(self, lambda wr : inc1())  

    def __getattr__(self, name):
        if name not in self._methods:
            self._methods[name] = JavaMember(name)
        return self._methods[name]

class JavaMember(object):
    def __init__(self, name):
        self.name = name

    def __call__(self, *args):
        j = 0
        for i in xrange(1, 10):
            j += i

The problem comes from the fact that JavaMember does not reference JavaObject1, so in the following statement:

javaObject.method1()

if javaObject is no longer referenced in the Python program, javaObject could be garbage collected before method1() is called. Indeed, the order of the operations could be:

  1. get attribute method1()
  2. decrease reference count of javaObject
  3. garbage collect javaObject on Python VM
  4. garbage collect javaObject on Java VM
  5. call method1.__call__() method
  6. call javaObject.method1 on Java VM
  7. Error!!! javaObject no longer exists on the Java VM!

One solution is to make sure that javaObject is never garbage collected until all its methods have been garbage collected too. This is done by adding a reference to JavaObject from JavaMember:

class JavaObject1(object):
    def __init__(self, id):
        self._id = id
        self._methods = {}
        self._wr = weakref.ref(self, lambda wr : inc1())  

    def __getattr__(self, name):
        if name not in self._methods:
            self._methods[name] = JavaMember(name, self)
        return self._methods[name]

class JavaMember(object):
    def __init__(self, name, container):
        self.name = name
        self.container = container

    def __call__(self, *args):
        j = 0
        for i in xrange(1, 10):
            j += i

Now, if we try to run the following test, we see some strange results:

def m1():
    for j in xrange(0, 100):
        java_object = JavaObject1('o' + str(j))
        for i in xrange(10000):
            java_object.method1()

if __name__ == '__main__':
    timer(m1,'With JavaObject1: ')

# Returns:
With JavaObject1: 1.8906121254
acc1:0

Although there is a circular reference between JavaMember and JavaObject1, this should not be a problem because weak references, unlike __del__ methods, do not prevent objects from being garbage collected. Right?

Well, from the output of the test, we see that acc1 = 0 so the finalizer of JavaObject1 was never called! The reason, and it took me a while to figure this out, is that the weak reference carrying the finalizer (self._wr) is held by the JavaObject1 instance itself, so it gets deleted along with the instance before the finalizer has a chance to run. Indeed, a weak reference callback is not invoked if the weak reference object holding the callback is garbage collected before the callback is invoked.

It follows that the instances are really garbage collected (this can be seen by using the gc module), but the finalizers are never called.
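
The effect can be reproduced in isolation with the following sketch. All names here (Member, SelfHeld, External, bump, external_refs) are made up for this illustration and are not Py4J classes. Both classes take part in a reference cycle; the only difference is where the weak reference carrying the callback is stored. The behavior shown is the one I would expect on CPython.

```python
import gc
import weakref

calls = {"self_held": 0, "external": 0}
external_refs = []  # lives outside of any cycle


def bump(key):
    calls[key] += 1


class Member(object):
    def __init__(self, container):
        self.container = container  # back-reference that creates the cycle


class SelfHeld(object):
    def __init__(self):
        self.member = Member(self)
        # The weak reference is stored on the instance, so it is part of
        # the garbage that the cycle collector reclaims.
        self._wr = weakref.ref(self, lambda wr: bump("self_held"))


class External(object):
    def __init__(self):
        self.member = Member(self)
        # The weak reference is stored in a module-level list, so it
        # survives the collection of the instance.
        external_refs.append(weakref.ref(self, lambda wr: bump("external")))


def run():
    SelfHeld()
    External()
    gc.collect()  # both cycles are unreachable at this point


run()
```

After run(), both instances are gone, but only the callback whose weak reference was stored outside the cycle has been invoked.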

The problem, for Py4J, is that the Python VM must tell the Java VM when an object is garbage collected to avoid creating a memory leak. It turns out that there are only two families of solutions, each family bringing its own trade-offs. This is the topic of the next post.

Go to Part 3 of this series.

Memory Management and Circular References in Python – Part 1

Coming from the Java world, it was a big, unpleasant surprise to discover that Python and circular references are no friends. Sure, you can always find your way around, but in general, it’s a PITA to deal with circular references. This series of posts will cover my trip into the wonderful world of garbage collection, circular references, and finalizers in Python.

Finalizers
Finalizers in Python go wild when there is a circular reference around. Roughly speaking, a finalizer is a function that is called when an object is about to be destroyed/garbage collected. These are special methods that should be used with care, especially because there is no guarantee when they will be called (even in Java). Finalizers can be implemented by overriding the __del__ method of a class, but objects that override the __del__ method and that have a circular reference are not garbage collected until the circular references are manually broken. The recommended way of implementing a finalizer is to create a weak reference to the object that will call a function once there is no longer a strong reference to the object. But even this solution is not trouble-free. Read along to learn why.
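
As a minimal illustration of the weak reference approach, consider the following sketch. The Resource class and the on_collect callback are made up for this example; on CPython, the callback runs as soon as the last strong reference disappears because there is no cycle involved.

```python
import weakref


class Resource(object):
    pass


destroyed = []


def on_collect(wr):
    # Called with the (now dead) weak reference as its only argument.
    destroyed.append("resource")


res = Resource()
# The weak reference does not keep 'res' alive; it only watches it.
watcher = weakref.ref(res, on_collect)
before = list(destroyed)

del res  # last strong reference removed: the callback fires
```

Note that the watcher variable must stay alive for the callback to be invoked, which is precisely what becomes tricky once circular references are involved.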

Py4J Model
To illustrate how memory management can be tricky with Python, I will use a simplified representation of the Py4J model. Py4J enables Python programs to access objects residing in a Java Virtual Machine. The Java objects are represented by a JavaObject instance in Python while Java Methods are represented by a JavaMember instance in Python. Here is one possible implementation of JavaObject and JavaMember:

class JavaObject1(object):
    def __init__(self, id):
        self._id = id
        self._methods = {}

    def __getattr__(self, name):
        if name not in self._methods:
            self._methods[name] = JavaMember(name)
        return self._methods[name]

class JavaMember(object):
    def __init__(self, name):
        self.name = name

    def __call__(self, *args):
        j = 0
        # Do some work. In Py4J, make a remote call to the Java method.
        for i in xrange(1,10):
            j += i

# Example of use:
# my_object = JavaObject1('oid123')
# my_object.someJavaMethod() 
#   --> calls: JavaObject1.__getattr__, then calls JavaMember.__call__

In the previous example, JavaObject1 represents a java object with an identifier and any call to a method will create an appropriate instance of that method and cache it in _methods.

Garbage Collection
Here comes the fun part. In object brokering systems like Py4J, there must be a way from one side to tell the other side that an object is no longer used.

Assume you created an instance in Python like this: my_obj = JavaObject1('oid123'). In the (real) Py4J, this has the effect of creating a reference to an object on the corresponding JVM. When my_obj is no longer referenced by Python, it will be garbage collected on the Python side, but the JVM will keep its own reference, creating a leak on the JVM side. There must thus be a way to link the garbage collection process of Python with that of the JVM.

This is where finalizers come in handy: when Python is ready to collect a JavaObject instance, the instance’s finalizer should warn the JVM that it no longer needs to reference this instance. This is what we do in the new version of JavaObject1. Instead of communicating with the JVM, we will simply increase an accumulator, to keep track of the destroyed instances:

import time
import weakref

accumulator1 = 0

def inc1():
    global accumulator1
    accumulator1 += 1 

class JavaObject1(object):
    def __init__(self, id):
        self._id = id
        self._methods = {}
        self._wr = weakref.ref(self, lambda wr : inc1())  

    ...

def m1():
    for j in xrange(0,100):
        java_object = JavaObject1('o' + str(j))
        for i in xrange(10000):
            java_object.method1()

def timer(func,run):
    start = time.time()
    func()
    print(run + str(time.time() - start))
    print('acc1:' + str(accumulator1))

if __name__ == '__main__':
    timer(m1,'With JavaObject1: ')

When we create a new instance of JavaObject1, this instance creates a weak reference to itself with a callback that will be invoked when the instance is about to be garbage collected. We also defined a function that will time the execution of (1) creating 100 JavaObject instances and (2) calling a method 100 * 10 000 times. This will be useful when we compare various implementations of JavaObject. If we run our timer function:

With JavaObject1: 1.8631708622
acc1:100

We see that the finalizers worked: 100 JavaObject1 instances were “destroyed”. But there is a problem with this scheme. Can you spot it? Here again is how one would use Py4J. See the next post for the solution:

my_obj = JavaObject1('oid123')
my_obj.method1() # Calls: JavaObject1.__getattr__, then calls JavaMember.__call__

Go to Part 2 of this series.

Context Resuming

Poor souls like me who work on their open source projects in their spare time sometimes suffer from a form of contextus resumis. You know, when you sit down, think of all the time you can finally spend on your favorite project, and then, realize, horror-struck, that you don’t know how to resume the task you were working on two weeks ago?

This is particularly an issue when you are working on core tasks that affect most parts of the project and that have deep design implications. My guess is that they are also the kind of tasks that cause contextus resumis: small tasks can (and should) generally be completed in one coding session.

Sure, there are software solutions like Mylyn that can make your IDE look like the way it was when you started to work on your task, but I always found that this kind of solution did not work well for system-wide tasks. What would Mylyn do? Open up all source files of Py4J? Anyway, Mylyn is not an option right now because it cannot connect to SourceForge’s trac installations, a problem that has been known for 9 months now.

One obvious solution is to divide your big task into smaller tasks (I know, you wanted to shout this from the beginning). But I’m currently changing the network and threading models of Py4J and these two models cannot be separated from each other. They also impact both the Java and the Python sides, and these changes are part of a bigger redesign effort to enable Java code to call back Python code (more on this in the next post).

Do you have any tips or tricks to share?