
A client reported that they were calling “time.sleep(1)” and it was sometimes raising an IOError exception with errno set to 514. There isn't much discussion of this on Google, and many of the hits go off in the wrong direction, so I wanted to blog about it so that others can find information on it more easily. The short answer is that it's probably a bug in the Linux kernel…

The system in question has two dual-core 2GHz Xeon 5100 series processors. The same code works fine on other systems, including one with four several-year-old Xeon CPUs (with hyperthreading enabled), so the problem seems to be at least somewhat related to this particular system.

Python's “time.sleep()” is implemented by calling the select() system call, which allows for sub-second sleeps. Looking at recent Linux kernel source (2.6.19.1), I see that errno 514 is ERESTARTNOHAND (restart if no handler), and that it sits in a section marked as “should never be seen by user programs”.
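To make the mechanism concrete, here is a minimal sketch of a select-based sleep: select() is called with no file descriptors and only a timeout. The helper name select_sleep is made up for illustration, and this is not CPython's actual source; other Python versions and platforms may implement time.sleep() differently. The last line checks that Python's errno table has no name for 514, which is consistent with it being a kernel-internal value.

```python
import errno
import select
import time


def select_sleep(seconds):
    """Sleep by waiting on no file descriptors until the timeout expires.

    Hypothetical helper for illustration; assumes a Unix-like system,
    where select() accepts three empty descriptor lists.
    """
    select.select([], [], [], seconds)


start = time.time()
select_sleep(0.05)  # a sub-second sleep, which select() allows
elapsed = time.time() - start

# ERESTARTNOHAND (514) is kernel-internal, so the errno module does not name it.
name = errno.errorcode.get(514)
```

If select() itself fails, the error surfaces as an exception from the sleep call, which is how a leaked 514 would end up in the client's IOError.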

So, it would seem that the kernel is leaking this value where it shouldn't. I dug into fs/select.c a bit, and I see two places where it could leak. I think the first is the likely problem:

  • In sys_select(), if STICKY_TIMEOUTS is not set and the copy_to_user() call (fs/select.c:418 for 2.6.19.1) returns non-zero, ERESTARTNOHAND could be propagated to user-space. Moving the “if (ret == -ERESTARTNOHAND)” block outside one or both of the “if” blocks it's currently in could prevent this. However, I don't fully understand the implications of this move.
  • sys_pselect7() has similar code. After the block mentioned above it has a second “if (ret == -ERESTARTNOHAND)” block (fs/select.c:500 for 2.6.19.1), but that block never changes -ERESTARTNOHAND into -EINTR. So, it looks like ERESTARTNOHAND can definitely propagate back to user-space here.

Note that in the Python case, a time value is always passed to select(). So, when reviewing the code, I'm assuming that the “if (tvp)” code paths are being called.

Again, I don't fully understand what clearing ERESTARTNOHAND in these cases would imply for other code (in and outside the kernel). At the very least, it looks like the statement in include/linux/errno.h saying that this error should never reach user-space is wrong. It is mostly true, but sporadically the error does make it to user-space.

In Python, this is easy enough to catch with a sleep wrapper that does:

import time
try:
    time.sleep(1)
except IOError as e:
    if e.errno != 514:  # 514 is ERESTARTNOHAND; re-raise anything else
        raise
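One drawback of simply swallowing the error is that the sleep may end early. A slightly fuller sketch of such a wrapper is below. It assumes it is acceptable to go back to sleep for whatever time is left. The name safe_sleep and the injectable sleep and clock parameters are my own additions for illustration and testing, not anything from Python itself.

```python
import time

ERESTARTNOHAND = 514  # kernel-internal value; not defined in Python's errno module


def safe_sleep(seconds, sleep=time.sleep, clock=time.time):
    """Sleep for roughly `seconds`, resuming if errno 514 leaks out.

    Hypothetical wrapper: `sleep` and `clock` are injectable so the
    retry logic can be exercised without a misbehaving kernel.
    """
    deadline = clock() + seconds
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            return
        try:
            sleep(remaining)
        except IOError as e:
            if e.errno != ERESTARTNOHAND:
                raise
            # Interrupted early: loop and sleep for whatever time is left.
```

Calling safe_sleep(1) in place of time.sleep(1) keeps the intended delay roughly intact, while any other IOError still propagates to the caller.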

It would be nice to know more about whether this really should be making it back to user-space, or not.

