Jay Taylor's notes
back to listing indexFixing race conditions and deadlocks in Go
[web search]In the past month we worked to scale up botbot.me. Every few days the bot would get stuck in a deadlock situation that caused outage of the logging service. I had some time in front of me that I decided to dedicate to improving the situation. I learned a few things along the way that I would like to share with you.
Bumpy road toward a better code base
The issues described in this post will likely remain unnoticed in your application for long period of time. One day, however, you'll reach the critical scale where they will start to become visible. For us it was the sheer number of networks, channels, messages we are receiving and logging.
As I was learning more about Go and the subtlety of the code base I started to notice that it was suffering from many race conditions. We wrote most of the code base in the early days of Go 1.0, predating the introduction of the race detector. Upgrading to the latest Go version let me take advantage of this tool to fix them one at a time. A bit more about the race detector in the next section Tooling.
Part of fixing the races involve using an RWMutex
, it is a reader/writer mutual exclusion lock. No rocket science there, it is pretty straightforward. You add a RWMutex
on your struct and then you can call:
theStruct.Lock()
theStruct.Unlock()
theStruct.RLock()
theStruct.RUnlock()
Here is an example of usage inside github.com/BotBotMe/botbot-bot
Even if there is nothing hard there it is easy to end up trading race conditions for deadlocks. This is especially true when you have several independent goroutines sharing mutable states. Fighting this deadlock situation has been the hardest thing in the last month. It took a long time to understand the problem and find the patterns leading to these situations. And this because I overlooked another fantastic tool in the Go standard library net/http/pprof
. It serves runtime profiling data via an HTTP server in the format expected by the pprof
visualization tool.
We still have some road in front of us to make botbot a race free, deadlock free, panic free program but we've already learned some important lessons along the way.
Tooling
Race detector
The race detector added in Go 1.1 is simple to use. You just need to pass the -race
flag to your build process to detect race conditions. They will appear in your log as:
==================
WARNING: DATA RACE
Read by goroutine 12:
github.com/BotBotMe/botbot-bot/network/irc.(*ircBot).String()
<autogenerated>:12 +0xc6
fmt.(*pp).handleMethods()
/home/yml/go/src/pkg/fmt/print.go:699 +0x694
fmt.(*pp).printArg()
/home/yml/go/src/pkg/fmt/print.go:790 +0x5c4
fmt.(*pp).doPrint()
/home/yml/go/src/pkg/fmt/print.go:1194 +0x33f
fmt.Fprintln()
/home/yml/go/src/pkg/fmt/print.go:254 +0x7f
github.com/golang/glog.(*loggingT).println()
/srv/virtualenvs/botbotenv/src/github.com/golang/glog/glog.go:616 +0x9c
github.com/golang/glog.Infoln()
/srv/virtualenvs/botbotenv/src/github.com/golang/glog/glog.go:970 +0x5d
github.com/BotBotMe/botbot-bot/network/irc.(*ircBot).readSocket()
/srv/virtualenvs/botbotenv/src/github.com/BotBotMe/botbot-bot/network/irc/irc.go:593 +0x4b8
Previous write by goroutine 13:
github.com/BotBotMe/botbot-bot/network/irc.(*ircBot).act()
/srv/virtualenvs/botbotenv/src/github.com/BotBotMe/botbot-bot/network/irc/irc.go:619 +0x2da
github.com/BotBotMe/botbot-bot/network/irc.(*ircBot).ListenAndSend()
/srv/virtualenvs/botbotenv/src/github.com/BotBotMe/botbot-bot/network/irc/irc.go:223 +0x3b2
Goroutine 12 (running) created at:
github.com/BotBotMe/botbot-bot/network/irc.(*ircBot).Connect()
/srv/virtualenvs/botbotenv/src/github.com/BotBotMe/botbot-bot/network/irc/irc.go:324 +0x132
github.com/BotBotMe/botbot-bot/network/irc.(*ircBot).Init()
/srv/virtualenvs/botbotenv/src/github.com/BotBotMe/botbot-bot/network/irc/irc.go:290 +0x1f5
github.com/BotBotMe/botbot-bot/network/irc.NewBot()
/srv/virtualenvs/botbotenv/src/github.com/BotBotMe/botbot-bot/network/irc/irc.go:123 +0xa0d
github.com/BotBotMe/botbot-bot/network.(*NetworkManager).Connect()
/srv/virtualenvs/botbotenv/src/github.com/BotBotMe/botbot-bot/network/network.go:105 +0x161
github.com/BotBotMe/botbot-bot/network.(*NetworkManager).RefreshChatbots()
/srv/virtualenvs/botbotenv/src/github.com/BotBotMe/botbot-bot/network/network.go:70 +0x590
main.NewBotBot()
/srv/virtualenvs/botbotenv/src/github.com/BotBotMe/botbot-bot/botbot.go:37 +0x17c
main.main()
/srv/virtualenvs/botbotenv/src/github.com/BotBotMe/botbot-bot/main.go:32 +0x19e
Goroutine 13 (running) created at:
github.com/BotBotMe/botbot-bot/network/irc.(*ircBot).Init()
/srv/virtualenvs/botbotenv/src/github.com/BotBotMe/botbot-bot/network/irc/irc.go:293 +0x219
github.com/BotBotMe/botbot-bot/network/irc.NewBot()
/srv/virtualenvs/botbotenv/src/github.com/BotBotMe/botbot-bot/network/irc/irc.go:123 +0xa0d
github.com/BotBotMe/botbot-bot/network.(*NetworkManager).Connect()
/srv/virtualenvs/botbotenv/src/github.com/BotBotMe/botbot-bot/network/network.go:105 +0x161
github.com/BotBotMe/botbot-bot/network.(*NetworkManager).RefreshChatbots()
/srv/virtualenvs/botbotenv/src/github.com/BotBotMe/botbot-bot/network/network.go:70 +0x590
main.NewBotBot()
/srv/virtualenvs/botbotenv/src/github.com/BotBotMe/botbot-bot/botbot.go:37 +0x17c
main.main()
/srv/virtualenvs/botbotenv/src/github.com/BotBotMe/botbot-bot/main.go:32 +0x19e
As you can see, you will get lots of details to help you fix the issue. Most of the time races are easy to fix but hard to discover by human inspection.
Package pprof
Instead of paraphrasing the doc let me quote it
Package pprof serves via its HTTP server runtime profiling data in the format expected by the pprof visualization tool. For more information about pprof, see http://code.google.com/p/google-perftools/.
The package is typically only imported for the side effect of registering its HTTP handlers. The handled paths all begin with /debug/pprof/.
To use pprof, import this package into your program.
This tool as been invaluable to troubleshoot the deadlock that started to appear on botbot.me every few days. The one key feature we have been using is full goroutine stack dump. It gives you an "on demand stack trace" with the current status of each goroutine. It is a pretty lengthy piece of text but these are what you want to look for:
- semacquire
- mutliple goroutine blocked on
chan send, 3 minutes
for the same amount of time.
Below you will find a truncated version of the full goroutine stack dump that we are going to use analyze the issue.
In the first example we can see that the pointer to ircBot (0xc2080761c0
) has 2 goroutines (27, 26) blocked on a chan send operation. This exhibits a fault in our design; the goroutine readSocket
can blocked by ListenAndSend
.
goroutine 105 [running]:
[ ... ]
goroutine 27 [chan send, 1 minutes]:
github.com/BotBotMe/botbot-bot/network/irc.(*ircBot).act(0xc2080761c0, 0xc20843c0a0)
/srv/virtualenvs/botbotenv/src/github.com/BotBotMe/botbot-bot/network/irc/irc.go:678 +0x8df
github.com/BotBotMe/botbot-bot/network/irc.(*ircBot).ListenAndSend(0xc2080761c0, 0xc208004a80)
/srv/virtualenvs/botbotenv/src/github.com/BotBotMe/botbot-bot/network/irc/irc.go:223 +0x3b3
created by github.com/BotBotMe/botbot-bot/network/irc.(*ircBot).Init
/srv/virtualenvs/botbotenv/src/github.com/BotBotMe/botbot-bot/network/irc/irc.go:293 +0x219
[ ... ]
goroutine 27 [chan send, 3 minutes]:
github.com/BotBotMe/botbot-bot/network/irc.(*ircBot).act(0xc2080761c0, 0xc20843c0a0)
/srv/virtualenvs/botbotenv/src/github.com/BotBotMe/botbot-bot/network/irc/irc.go:678 +0x8df
github.com/BotBotMe/botbot-bot/network/irc.(*ircBot).ListenAndSend(0xc2080761c0, 0xc208004a80)
/srv/virtualenvs/botbotenv/src/github.com/BotBotMe/botbot-bot/network/irc/irc.go:223 +0x3b3
created by github.com/BotBotMe/botbot-bot/network/irc.(*ircBot).Init
/srv/virtualenvs/botbotenv/src/github.com/BotBotMe/botbot-bot/network/irc/irc.go:293 +0x219
[ ... ]
goroutine 26 [chan send, 3 minutes]:
github.com/BotBotMe/botbot-bot/network/irc.(*ircBot).readSocket(0xc2080761c0, 0xc208004a80)
/srv/virtualenvs/botbotenv/src/github.com/BotBotMe/botbot-bot/network/irc/irc.go:594 +0x51d
created by github.com/BotBotMe/botbot-bot/network/irc.(*ircBot).Connect
/srv/virtualenvs/botbotenv/src/github.com/BotBotMe/botbot-bot/network/irc/irc.go:324 +0x132
goroutine 28 [select]:
github.com/BotBotMe/botbot-bot/network/irc.(*ircBot).monitor(0xc2080761c0, 0xc208004a80)
/srv/virtualenvs/botbotenv/src/github.com/BotBotMe/botbot-bot/network/irc/irc.go:157 +0xa6d
created by github.com/BotBotMe/botbot-bot/network/irc.(*ircBot).Init
/srv/virtualenvs/botbotenv/src/github.com/BotBotMe/botbot-bot/network/irc/irc.go:296 +0x23e
[ ... ]
The wall of text from the goroutines stack dump is available in this gist. It exhibits a few other issues that you might want to try to figure out.
** Pro tip:** Logging the address of the pointer receiver can be really useful to relate the log with the behavior of the application and the stack traces. In addition, it lets you grep the very verbose stack traces for the elements related to the problem you are debugging.
curl http://127.0.0.1:3030/debug/pprof/goroutine?debug=2 | grep -B10 -A 20 "0xc2080761c0"
Conclusions
- Always update to the latest version of Go to take advantage of the constantly improving toolchain and standard library. One side effect is that your code will be faster and the garbage collector more efficient. All this by just rebuilding your binary.
- Make sure your goroutines are orthogonal or do not share mutable states. If for some reason they have to share mutable states explicitly pass them instead of relying on Mutex. It is even better if you can refactor your code to share memory by communicating instead of communicating by sharing memory.
- Avoid nested select and blocking cases.
- Make sure that none of your case inside a select can block for an unknown period of time.
- Evolving a code base to get rid of these issues takes some time and may require a significant effort. However Go provides tooling that will help you along the way.
Botbot.me is the guinea pig on which we decided to learn Go. We made some mistakes along the way by implementing non-idiomatic Go solutions. Part of the growing process, scaling to several hundreds of channels on many networks, required re-architecting the application.
Thank you for your patience and interest if you have read this far!
About Yann Malet
Get In Touch
Subscribe and Learn
Join Django Round-Up, a low traffic mailing list with high quality Django content.
→Learn more