We've had a similar situation occur. For about three months, we've run
several Windows 2008 R2 guests with virtio drivers that record video
surveillance. We have long suffered an issue where the guest appears to
hang indefinitely (or until we intervene). For the sake of this
conversation, we call this state "wedged", because it appears something
(rbd, qemu, virtio, etc.) gets stuck in a deadlock. When a guest gets
wedged, we see the following:
- the guest will not respond to pings
- the qemu-system-x86_64 process drops to 0% cpu
- graphite graphs show the interface traffic dropping to 0bps
- the guest will stay wedged forever (or until we intervene)
- strace of qemu-system-x86_64 shows QEMU is making progress 
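For concreteness, these are the sorts of checks we use to spot a wedged
guest (the pid and address variables are illustrative):

    ping -c 3 -W 1 $GUEST_IP        # no replies while wedged
    pidstat -p $QEMU_PID 1 5        # ~0% cpu while wedged
    strace -f -p $QEMU_PID          # yet syscall activity continues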
We can "un-wedge" the guest by opening a NoVNC session or running a
'virsh screenshot' command. After that, the guest resumes and runs as
expected. At that point we can examine the guest. Each time we'll see:
- No Windows error logs whatsoever while the guest is wedged
- A time sync typically occurs right after the guest gets un-wedged
- Scheduled tasks do not run while wedged
- Windows error logs do not show any evidence of suspend, sleep, etc
We had so many issues with guests becoming wedged that we wrote a script
to 'virsh screenshot' them via cron (a sketch follows below). Then we
installed some updates and had a month or so of higher stability (wedging
happened maybe 1/10th as often). Until today we couldn't figure out why.
Yesterday, I realized qemu was starting the instances without specifying
cache=writeback. We corrected that, and let them run overnight. With RBD
writeback re-enabled, wedging came back as often as we had seen in the
past. I've counted ~40 occurrences in the past 12-hour period. So I feel
like writeback caching in RBD certainly makes the deadlock more likely.
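For anyone comparing setups: the cache mode is set per drive on the qemu
command line (the pool/image name below is illustrative; under libvirt it
is the cache='writeback' attribute on the disk's <driver> element):

    qemu-system-x86_64 ... \
        -drive file=rbd:rbd/guest-image,format=raw,cache=writeback,if=virtio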
Joshd asked us to gather RBD client logs:
"joshd> it could very well be the writeback cache not doing a callback
at some point - if you could gather logs of a vm getting stuck with
debug rbd = 20, debug ms = 1, and debug objectcacher = 30 that would be
great"
We'll do that over the weekend. If you could as well, we'd love the help!
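If you want to gather the same logs, those settings go in the [client]
section of ceph.conf on the hypervisor and take effect once the guest's
librbd client restarts (the log file path is just an example):

    [client]
        debug rbd = 20
        debug ms = 1
        debug objectcacher = 30
        log file = /var/log/ceph/$name.$pid.log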
Post by Oliver Francke
I believe I'm the winner of buzzword bingo for today.
But seriously speaking... as I don't have this particular problem with
qcow2 on kernel 3.2, nor with qemu-1.2.2, nor with newer kernels, I hope I'm not
We have a rising number of tickets from people reinstalling from ISOs.
The fast fallback is to start all VMs with qemu-1.2.2, but we then lose
some features à la latency-free RBD cache ;)
with all dirty details.
Installing a backport kernel (3.9.x) or upgrading the Ubuntu kernel to
3.8.x "fixes" it. So we have a bad combination on all distros with a
3.2 kernel and rbd as the storage backend, I assume.
Any similar findings?
Any ideas for tracing/debugging (Josh? ;)) are very welcome,