Python+Linux time.sleep() returning IOError 514 (tummy.com, ltd. Journal Entry)

Wednesday January 10, 2007 at 16:10
Subject: Python+Linux time.sleep() returning IOError 514
Keywords: Bug, Linux, Python
Posted by: Sean Reifschneider

A client reported that "time.sleep(1)" was sometimes raising an IOError exception with errno set to 514. There isn't much discussion of this on Google, and many of the hits go off in the wrong direction, so I wanted to blog about it so others can find information on it more easily. The short answer is that it's probably a bug in the Linux kernel...

The system in question has two dual-core 2GHz Xeon 5100 series processors. The same code works fine on other systems, including one with 4 several-year-old Xeon CPUs (with hyperthreading enabled), so the problem seems to be at least somewhat related to this particular system.

Python's "time.sleep()" is implemented by calling the select() system call, which allows for sub-second sleeps. Looking at recent Linux kernel source (2.6.19.1), I see that errno 514 is ERESTARTNOHAND (restart if no handler), and is in a section marked as "should never be seen by user programs".
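
To make the mechanism concrete, here is a minimal sketch of a select()-based sleep written in Python itself. It only illustrates the idea and is not the interpreter's actual C implementation; the name select_sleep and the ERESTARTNOHAND constant are my own, with the value 514 taken from the kernel header mentioned above. It also shows that Python's errno module has no symbolic name for 514, which is consistent with the code being kernel-internal.

```python
import errno
import select
import time

# Kernel-internal value from include/linux/errno.h (assumed constant name;
# it is not something user space is supposed to see).
ERESTARTNOHAND = 514

def select_sleep(seconds):
    """Sleep by calling select() with no file descriptors and a timeout."""
    # With three empty lists, select() simply waits for the timeout.
    # This works on POSIX systems; Windows rejects three empty lists.
    select.select([], [], [], seconds)

start = time.monotonic()
select_sleep(0.05)
elapsed = time.monotonic() - start

# The errno module only lists user-visible codes, so the lookup falls
# back to the default string.
name = errno.errorcode.get(ERESTARTNOHAND, "unknown")
```

On a healthy system the select() call returns normally once the timeout expires; the bug described here is the rare case where it fails with errno 514 instead.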

So, it would seem that the kernel is leaking this error code where it shouldn't be. I dug into fs/select.c a bit, and I see two places where it could leak. I think the first is the likely problem:

  • In sys_select(), if STICKY_TIMEOUTS is not set, and the copy_to_user() call (fs/select.c:418 for 2.6.19.1) returns non-zero, ERESTARTNOHAND could be propagated to user-space. Moving the "if (ret == -ERESTARTNOHAND)" block outside one or both of the "if" blocks it's currently in could prevent this. However, I don't fully understand the implications of this move.
  • sys_pselect7() has similar code, but after the block mentioned above it has an "if (ret == -ERESTARTNOHAND)" block (fs/select.c:500 for 2.6.19.1) that never changes the -ERESTARTNOHAND into -EINTR. So, it looks like ERESTARTNOHAND can definitely propagate back to user-space here.

Note that in the Python case, a time value is always passed to select(). So, when reviewing the code, I'm assuming that the "if (tvp)" code paths are being called.

Again, I don't fully understand what clearing ERESTARTNOHAND in these cases would imply for other code (in and outside the kernel). At the very least, it looks like the statement in include/linux/errno.h saying that this error should never reach user-space is wrong. It is mostly true, but sporadically the error does make it to user-space.

In Python, this is easy enough to catch with a sleep wrapper that does:

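import time
def safe_sleep(seconds):
    try: time.sleep(seconds)
    except IOError as e:  # "as" syntax works on Python 2.6+ and 3
        if e.errno != 514: raise  # ignore only ERESTARTNOHAND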
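
One caveat with simply swallowing the error is that the sleep may end earlier than requested. A possible refinement, sketched below under my own assumptions, is to track a deadline and go back to sleep for the remaining time. The name sleep_fully and its sleep/clock parameters are hypothetical, added only so the behaviour can be exercised without the kernel bug; this is not what the client actually deployed.

```python
import time

ERESTARTNOHAND = 514  # kernel-internal errno from include/linux/errno.h

def sleep_fully(seconds, sleep=time.sleep, clock=time.monotonic):
    """Sleep for at least `seconds`, resuming after spurious errno 514 errors."""
    deadline = clock() + seconds
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            return
        try:
            sleep(remaining)
        except OSError as e:  # IOError is an alias of OSError on Python 3
            if e.errno != ERESTARTNOHAND:
                raise  # unrelated failure: let the caller see it
            # errno 514: loop around and sleep for whatever time is left
```

Calling sleep_fully(1) in place of time.sleep(1) keeps the caller's timing roughly intact even if the error fires partway through, while any other error is still raised.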

It would be nice to know more about whether this really should be making it back to user-space, or not.

Comment
Hiro Sugawara
Subject: ERESTARTNOHAND in userspace
I've seen the same thing here. The platform is an x86_64 with SMP running 2.6.12. The server is a multi-threaded process. So far, only one incident has been reported.

There is an interesting posting by IBM for their S390 Linux at http://www-128.ibm.com/developerworks/linux/linux390/linux-2.6.5-s390-25-april2004.html that refers to the exact same symptom, but it patches the assembly code in entry.S, which bears little similarity to my case.

Comment
Zac Conn
Subject: ERESTARTNOHAND
I have a similar problem here. Did you reach any final conclusion about this?

Comment
Author: Sean Reifschneider
Subject: Don't know about the solution.
I never worked this through to a complete solution. I believe what happened was that the client worked around this in their code.