DNS, eglibc and resolv-replace on Heroku
2015-03-01: Fixed versions of eglibc are available for Ubuntu Precise and Trusty. Time to update.
I work on the team that runs Heroku Postgres. As we have continued to grow, I have been using Rollbar to track an intermittent error that occurs about once every 50,000 HTTP requests. As we make many hundreds of thousands of API calls a minute to various services, this error can pop up fairly frequently and in very inconvenient places. The most common traceback seems to indicate a failure to resolve DNS:
#<SocketError: getaddrinfo: Name or service not known>
/app/vendor/ruby-2.1.5/lib/ruby/2.1.0/net/http.rb:879:in 'initialize'
/app/vendor/ruby-2.1.5/lib/ruby/2.1.0/net/http.rb:879:in 'open'
/app/vendor/ruby-2.1.5/lib/ruby/2.1.0/net/http.rb:879:in 'block in connect'
...
Google led me to a pertinent blog post that recommended using Ruby's Resolv library for all DNS requests via a script called resolv-replace. Adding a single line to our initializers, require 'resolv-replace', caused errors while submitting Logplex messages to drop immediately, as did errors from trying to interact with our monitoring service, Observatory.
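For reference, the one-line change can be sketched as follows. The initializer path in the comment is an assumption for a Rails-style app and is not taken from our codebase; the only part the fix depends on is the require itself.

```ruby
# config/initializers/resolv_replace.rb (hypothetical path)
#
# Loading resolv-replace patches Ruby's socket classes so that hostname
# lookups go through the pure-Ruby Resolv library instead of the system
# getaddrinfo().
require 'resolv-replace'

# Sockets opened after this point resolve hostnames through Resolv.
# IP literals are passed through without a DNS query:
address = IPSocket.getaddress('127.0.0.1')
```

Because the patch is global, it needs to load before any code that opens sockets, which is why an initializer is a natural home for it.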
In an internal thread, Ed Muller pointed out a Go workaround for a bug in glibc that is very likely to be a factor in this error:
Under high load, getaddrinfo() starts sending DNS queries to random file descriptors, e.g. some unrelated socket connected to a remote service.
As Heroku is a shared platform with multitenant runtime instances, any runtime can come under high load, and the cedar-14 glibc binaries are known to be affected by this bug. Version 2.20 of glibc has a fix, and the fix was backported to Ubuntu Precise as of 2.15-0ubuntu10.11 and to Trusty as of 2.19-0ubuntu6.6. However, Ubuntu Precise currently ships 2.15-0ubuntu10.10 and Trusty provides 2.19-0ubuntu6.5, so this bug may continue to be a problem for some time to come.
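To check a given system against the versions above, something like the following rough sketch may help. The helper names are hypothetical, and the comparison is a deliberate simplification: it only handles version strings shaped like the ones quoted in this post and does not implement the full Debian version-ordering rules.

```ruby
# First package versions carrying the backported fix, keyed by upstream
# version (2.15 is Precise, 2.19 is Trusty), as quoted in this post.
FIXED_VERSIONS = {
  '2.15' => '2.15-0ubuntu10.11',
  '2.19' => '2.19-0ubuntu6.6'
}.freeze

# Split a string such as "2.19-0ubuntu6.5" into comparable integers.
# Returns nil when the string does not have that shape.
def parse_libc_version(version)
  match = /\A(\d+)\.(\d+)-(\d+)ubuntu(\d+)(?:\.(\d+))?/.match(version)
  match && match.captures.map(&:to_i)
end

# Returns true if the version should contain the fix, false if it is known
# not to, and nil when this table has no data for that release.
def libc_patched?(installed)
  parts = parse_libc_version(installed)
  return nil unless parts
  # Upstream 2.20 and later include the fix.
  return true if (parts[0, 2] <=> [2, 20]) >= 0

  fixed = FIXED_VERSIONS[parts[0, 2].join('.')]
  return nil unless fixed
  (parts <=> parse_libc_version(fixed)) >= 0
end

# On a Debian or Ubuntu host the installed version could be read with:
#   installed = `dpkg-query -W -f='${Version}' libc6`.strip
```

With the versions mentioned above, libc_patched?('2.19-0ubuntu6.5') is false and libc_patched?('2.19-0ubuntu6.6') is true.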
My immediate recommendation is to use language-native DNS resolution like resolv-replace whenever possible, on Heroku or other systems. However, if you require IPv6 or run into problems with third-party gems attempting to resolve nil addresses, and are stuck with the system DNS, upgrade yourself, and please indicate that this bug affects you on the Launchpad bug report requesting backporting to supported versions of Ubuntu.
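If the global patch causes trouble, one possible middle ground is to call Resolv directly only for the lookups you control. The helper below is a hypothetical sketch, not something we run in production; it guards against nil hosts and leaves every other socket on the system resolver.

```ruby
require 'resolv'

# Hypothetical helper: resolve a hostname with Ruby's Resolv library rather
# than the system getaddrinfo(). Returns the address as a string, or nil
# when the host is nil or cannot be resolved.
def resolve_with_resolv(host)
  return nil if host.nil?

  Resolv.getaddress(host)
rescue Resolv::ResolvError
  nil
end
```

The caller can then connect to the returned address, at the cost of handling the nil case itself. This only covers code you write; lookups made inside third-party gems still go through whichever resolver is in effect for the process.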
Thanks to Ed Muller, Michael Hale, Keiko Oda, Steve Conklin, Terence Lee and Richard Schneeman for help in figuring this out.