A client reported that they were calling "time.sleep(1)" and that it sometimes raised an IOError exception with errno set to 514. There isn't much discussion of this on Google, and many of the hits go off in the wrong direction, so I wanted to blog about it so others can find information on it more easily. The short answer is that it's probably a bug in the Linux kernel...
The system in question has two dual-core 2GHz Xeon 5100 series processors. The same code works fine on other systems, including one with four several-year-old Xeon CPUs (with hyperthreading enabled), so the problem seems to be at least somewhat related to this particular system.
Python's "time.sleep()" is implemented by calling the select system-call, to allow for sub-second sleeps. In looking at recent Linux kernel source (18.104.22.168), I see that errno 514 is ERESTARTNOHAND (restart if no handler), and is in a section marked as "should never be seen by user programs".
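To make the mechanism concrete, here is a minimal sketch of sleeping through select() directly from Python, assuming a Unix-like system (an empty select() call is not accepted on Windows). The helper name select_sleep is hypothetical, and the constant simply records the value quoted above from the kernel headers; this mirrors what the text describes and is not a copy of the interpreter's own implementation.

```python
import select
import time

# Kernel-internal value named in include/linux/errno.h, as quoted in the text.
ERESTARTNOHAND = 514


def select_sleep(seconds):
    """Sleep by calling select() with no file descriptors and only a timeout.

    Hypothetical helper for illustration; Unix-like systems only.
    """
    select.select([], [], [], seconds)


start = time.time()
select_sleep(0.25)  # sub-second sleep via the select timeout
elapsed = time.time() - start
print("slept for about %.2f seconds" % elapsed)
```

Because a timeout is always passed, any error that select() returns while waiting on it surfaces in Python as an exception carrying that errno.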
So it would seem that the kernel is leaking this value where it shouldn't. I dug into fs/select.c a bit, and I see two places where it could leak. The first, I think, is likely the problem:
Note that in the Python case, a time value is always passed to select(). So, when reviewing the code, I'm assuming that the "if (tvp)" code paths are being called.
Again, I don't fully understand what clearing ERESTARTNOHAND in these cases would imply to other code (in and outside the kernel). At the very least, it looks like the statement in include/linux/errno.h saying that this error should never reach user-space is wrong. It mostly is true, but sporadically it is making it to user-space.
In Python, this is easy enough to catch with a sleep wrapper that does:
    import time

    try:
        time.sleep(1)
    except IOError as e:
        # 514 is ERESTARTNOHAND; ignore it, re-raise anything else
        if e.errno != 514:
            raise
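Swallowing the error once means the sleep can return early. A slightly fuller sketch, under the assumption that retrying is acceptable for the caller, keeps sleeping until the requested time has passed. The name safe_sleep and its sleep_func and clock parameters are hypothetical and exist mainly so the behavior can be exercised without a misbehaving kernel.

```python
import time

ERESTARTNOHAND = 514  # value quoted from include/linux/errno.h in the text


def safe_sleep(seconds, sleep_func=time.sleep, clock=time.time):
    """Sleep for `seconds`, retrying if the sleep fails with errno 514.

    Hypothetical wrapper: any other error is re-raised unchanged.
    """
    deadline = clock() + seconds
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            return
        try:
            sleep_func(remaining)
            return
        except (IOError, OSError) as e:
            if e.errno != ERESTARTNOHAND:
                raise
            # Spurious ERESTARTNOHAND: loop and sleep for whatever time is left.


safe_sleep(0.1)
```

Catching both IOError and OSError keeps the sketch usable on newer interpreters, where the two names refer to the same exception type.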
It would be nice to know more about whether this really should be making it back to user-space or not.