Py4J and Exceptions

Py4J 0.6 is almost ready to be released, thanks to Jakub L. Gustak who submitted important bug reports, feature requests, and patches. I have been trying to polish Py4J in the latest releases to make the API more consistent and predictable and the biggest “feature” of 0.6 will no doubt be how Py4J treats Exceptions.

Currently, exceptions can be raised in four places: (1) in the Py4J Python code, (2) in the Py4J Java code, (3) in the Java client code, and (4) in the network stack. An exception might be raised in the Py4J code if the client code is not correct, for example, if the client tries to call from Python a Java method that does not exist. Before 0.6, Py4J raised a Py4JError in cases 1,2,3 and a Py4JNetworkError (a subtype of Py4JError) in case 4. Moreover, if the Java exception was raised on the Java side, the Java stack trace was copied, as a string, in the Py4JError.

There are two issues with this approach. First, the client does not have access to the exception instance on the Java side, and this exception may have some important fields and methods that can help the error recovery. Second, it is very difficult for the client to determine at runtime the source of the error.

Starting from 0.6, Py4J will raise three types of exceptions: Py4JNetworkError in case #4, Py4JJavaError in case #3, Py4JError in cases #1 and #2. Py4JNetworkError and Py4JJavaError will be a subtype of Py4JError (so a client can implement a catch all). Py4JJavaError will also have a method that will return the instance of the Java exception and Py4JError will still display the Java stack trace for case #2.

Stay tuned for 0.6!

 

Py4J code moves to GitHub

It is now official. Py4J code has been moved to GitHub. The issues have been transferred there too thanks to a handy trac-to-github python script. I will be shutting down the trac instance on SourceForge. The web site, mailing list, and alternate download site will still remain on SourceForge (Py4J has always been available on pypi as well).

The decision to move to a Distributed Version Control System (DVCS) was a no brainer. I tried mercurial for another open source project I work on, Qualyzer, and DVCS are clearly superior when it comes to merging and enabling collaboration. The speed boost is also quite welcome (thank you local commit).

The main question was thus: mercurial or git? Actually, the real question was, bitbucket or github? This was not an easy decision, but in the end, the numerous outages and bugs of bitbucket and the higher number of potential collaborators on github convinced me to go with github. In addition, github seems speedier and they developed a tool, hg-git, that allows mercurial users to work with github repositories.

I must say that I was disappointed with the issue tracker at github which is slightly inferior to the bitbucket tracker (only tags, no milestones, no automatic reference between commit and issues except when closing issues, no complex queries, etc.) and I just learned that their wiki engine was only recently moved to an implicit git repository: I believe bitbucket wikis have always been backed by a mercurial repository… Anyway, I just wanted to say this because I have read many posts bashing bitbucket over github and I did not want to sound like one of them…

 

 

Py4J 0.5 Released!

Oh Yeah! It’s another release of Py4J and there are many interesting features that made their way in:

  • The ability to import packages (e.g., java_import(gateway.jvm, 'java.io.*'))
  • Support for pattern filtering in JavaGateway.help() (e.g., gateway.help(obj,'get*Foo*Bar'))
  • Support for automatic conversion of Python collections (list, set, dictionary) to Java collections.
  • Two Eclipse features: one embeds the Py4J Java library. The other provides a default GatewayServer that is started when Eclipse starts. Both features are available on the new Py4J Eclipse update site: http://py4j.sourceforge.net/py4j_eclipse
  • New module decomposition: there are no more mandatory circular dependencies among modules.
  • Have a look at the Trac 0.5 milestone for a detailed description of the issues fixed in this release!

This feature marks the return to a predictable and regular schedule: one release every two months. I am particularly proud that I could make this release given my crazy and stressful schedule these last days 😉

I should restate that I use Py4J every day since version 0.4 and most of the features that made it painful to use are now completed. The next release will focus on source code cleanup (be more pythonic), a few corner case features that are difficult to implement but that will make Py4J a killer application, and better unit test organisation. As usual, comments are welcome and if there is any feature you would like to see in the next release, now would be a good time to make your voice heard!

Py4J 0.4 released

Py4J 0.4 has just been released! New features include:

  • Polishing of existing features: fields can be set (not just read), None is accepted as a method parameter, methods are sorted alhabetically in gateway.help(), etc.
  • Java Exception Stack Trace are now propagated to Python side.
  • Changed interfaces member in Callback classes to implements.
  • Internal refactoring to adopt clearer terminology and make Py4J protocol extensible.
  • Many bug fixes: most are related to the callback feature.

Since the beginning of Py4J, I have been trying to release early and often and to release on a regular basis (2 months). Unfortunately, because I have been extremely busy during the summer (hey, we moved to our new condo!), I have made the mistake of pushing back the release date in the hope that it would give me time to complete all the tickets for 0.4.

This is not how it should work. The next release is due in 6 weeks and even if there is only one new bug fix or one new feature, it will be released at that date.

In the meantime, you should go ahead and download Py4J!

Py4J 0.4 Release Delayed

Unfortunately, I must delay the 0.4 release because I have other priorities for now. The tentative release date is by the end of August.

The SVN head, available at https://py4j.svn.sourceforge.net/svnroot/py4j/trunk, contains the most important new features and bug fixes of 0.4 so do not hesitate to use the code from the repository. In fact, I use this code every day for the tool I’m developing for my Ph.D.

As always, comments are welcome!

Py4J 0.3 is now available!

I just released Py4J 0.3, a major milestone. This release adds support for more Java collections such as array and set, but more importantly, it enables Java objects to call Python objects that implements a Java interface.

While working on Py4J, I constantly pushed the hard features and the nasty bug fixes to milestone 0.3. Let’s just say that I’m happy this release is now behind me and I am promoting Py4J to beta because all major features are now implemented and relatively well tested.

Enabling callback required a major rearchitecture of Py4J internals and I had to implement a multi-threaded and multi-socket RMI-like server on both the Python and the Java sides. Fortunately, this is not the first time I am involved in shady businesses like this (I remember a graduate software fault tolerance course where I created a redundant framework relying on an orgy of sockets, threads, and processes).

Another area that caused me some worries was the garbage collection in Python and in Java. It took me dozen of hours to reach a good level of understanding of Python handling of garbage collection, weak references, and circular references. I would have preferred to have spent these hours on Team Fortress 2, but in the end, I am glad I learned something valuable.

Getting this kind of code to work is hard and I expect it will take a few more releases to polish it and to uncover bugs, but I believe the time I took to design the new architecture on paper was worth it and the initial tests are green 🙂

Now, stop reading this post and go download Py4J 0.3!

Memory Management and Circular References in Python – Part 3

In the previous post, we saw that finalizers are difficult to write in the presence of circular references, even when using weak references to hold the finalizer callback. The main issue is that if object A’s finalizer is held by object A, and object A has a circular reference with object B, the finalizer will never get invoked because both objects are removed before the finalizer has a change to be invoked.

There are two strategies to work around this issue: one is to save the finalizers in an external object that does not belong to a cycle with object A, the other is to remove the circular reference with object B. This is essentially what JavaObject1 and JavaObject2 do in the next code snippet:

finalizers = []

class JavaObject1(object):
    def __init__(self, id):
        self._id = id
        self._methods = {}
        finalizers.append(weakref.ref(self, lambda wr : inc1()))

    def __getattr__(self, name):
        if name not in self._methods:
            self._methods[name] = JavaMember(name, self)
        return self._methods[name]

class JavaObject2(object):
    def __init__(self, id):
        self._id = id
        self._wr = weakref.ref(self, lambda wr : inc2())  

    def __getattr__(self, name):
        return JavaMember(name, self)

class JavaMember(object):
    def __init__(self, name, container):
        self.name = name
        self.container = container

    def __call__(self, *args):
        j = 0
        for i in xrange(1, 10):
            j += i

As we can see, the weak reference/finalizer in JavaObject1 is held by the global variable finalizers. JavaObject2 is no longer part of a cycle because it creates a new JavaMember every time a member is requested. The JavaMember still refers to the JavaObject2 instance to prevent the eager garbage collection of the JavaObject2 instance discussed at the beginning of the last post. If we try to test our new implementations, we obtain these results:

def m1():
    for j in xrange(0, 100):
        java_object = JavaObject1('o' + str(j))
        for i in xrange(10000):
            java_object.method1()

def m2():
    for j in xrange(0, 100):
        java_object = JavaObject2('o' + str(j))
        for i in xrange(10000):
            java_object.method1()

if __name__ == '__main__':
    timer(m1,'With JavaObject1: ')
    timer(m2,'With JavaObject2: ')

...

With JavaObject1: 1.83600997925
acc1:47
acc2:0
With JavaObject2: 2.38709688187
acc1:47
acc2:100

REMINDER: the float number is the time it took to make 1 million method calls (1oo instances X 10 000 calls), the number beside acc1 indicates how many times a finalizer of a JavaObject1 instance was called (max 100), and acc2 indicates how many times a finalizer of a JavaObject2 instance was called.

First, we can see that JavaObject1 is performing a little better than JavaObject2 because method objects are cached and do not need to be instantiated at every call. Second, we can see that not all finalizers of JavaObject1 instances are called. Why?

Well, the answer lies in the garbage collection strategies used by CPython. Roughly, when there is no circular dependency, CPython uses a reference counting strategy: when the last reference to an object is deleted, the object is immediately garbaged collected (hence the perfect acc2 score for JavaObject2). But when there is a circular dependency, CPython uses a mark and sweep strategy which is only executed once in a while. In other words, the Python interpreter has to stop the program execution, inspect the objects, and clean the objects. The frequency of the garbage collection can be set using the gc module.

In the previous example, because the program did not execute long enough (more precisely, because the number of allocations/deallocations did not reach the gc thresholds), some objects were never collected before the end of the program. If instead we explicitly invoke the garbage collector, we see that now all finalizers are invoked:

if __name__ == '__main__':
    timer(m1,'With JavaObject1: ')
    timer(m2,'With JavaObject2: ')

    print('Running GC')
    gc.collect()
    print('acc1:' + str(accumulator1))
    print('acc2:' + str(accumulator2))

...

With JavaObject1: 1.83600997925
acc1:47
acc2:0
With JavaObject2: 2.38709688187
acc1:47
acc2:100
Running GC
acc1:100
acc2:100

The next post will provide a summary of the lessons learned!