Breaking DNS with consul and dnsmasq forwarding
Last month, I broke DNS on a production cluster by attempting to resolve a hostname.
Discovery
On my laptop, I run Debian 10, with systemd-resolved for DNS.
My VPN client hooks into systemd-resolved to add a private DNS server and local domains (including *.service.consul and *.node.consul) to the configuration. That makes it very easy to connect to internal services by name. Multiple instances of these services are deployed for high availability, and consul DNS always directs me to a working instance.
Last week I upgraded my laptop from Debian 9 to 10. This also came with a change in defaults for systemd-resolved: DNSSEC is now set to allow-downgrade by default, which makes it send additional DNS queries.
After connecting to the VPN, I try to connect to a consul service. This fails because of a DNS resolution failure.
Manually trying to resolve the name with dig results in connection timeouts as well.
Then trying with systemd-resolve results in another error message.
lars@lars-debian:~$ systemd-resolve xyz.service.consul
xyz.service.consul: resolve call failed: DNSSEC validation failed: no-signature
The logs of systemd-resolved are filled with these DNSSEC validation failures too.
After many retries, systemd-resolved marks the DNS server as non-DNSSEC capable:
Jul 18 14:31:22 lars-debian systemd-resolved[635]: Server 10.88.30.3 does not support DNSSEC, downgrading to non-DNSSEC mode.
Unfortunately, by this time the damage is already done.
Walking in circles
What happened? To check for DNSSEC support, systemd-resolved sends a DNS query for the DS record type of consul. to its configured recursor.
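To make that probe concrete, here is a minimal sketch of what a DS query for consul. looks like on the wire. The helper names (encode_name, build_query) and the query ID are made up for illustration, and the packet is simplified: a real systemd-resolved query would normally also carry EDNS0 options, which are left out here.

```python
import struct

DS_RECORD_TYPE = 43  # DS (Delegation Signer) record type, RFC 4034
IN_CLASS = 1         # Internet class


def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    encoded = b""
    for label in name.rstrip(".").split("."):
        if label:
            encoded += bytes([len(label)]) + label.encode("ascii")
    return encoded + b"\x00"


def build_query(name: str, qtype: int, query_id: int = 0x1234) -> bytes:
    """Build a bare DNS query: 12-byte header followed by one question."""
    # Header fields: ID, flags (only "recursion desired" set),
    # 1 question, 0 answer, 0 authority, 0 additional records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    question = encode_name(name) + struct.pack("!HH", qtype, IN_CLASS)
    return header + question


packet = build_query("consul.", DS_RECORD_TYPE)
print(len(packet), packet.hex())
```

The important part is the question section: the name is the bare consul. zone itself, not a service name, and the type is DS rather than A.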
In this case, that is the dnsmasq server running at 10.88.10.3. dnsmasq receives the query and sees that a specific resolver has been configured for the consul. domain, so it forwards the DNS query to the consul server at 10.88.10.2 to resolve it.
Consul is configured with a recursor, so DNS queries for other (not *.consul) domains would be forwarded to the same dnsmasq server.
However, the underlying DNS server implementation of consul delegates DS queries to the parent handler, which in this setup means the configured recursors.
The result is that a DS query for consul. is sent from consul (10.88.10.2) to the dnsmasq server running at 10.88.10.3.
Upon seeing a query for consul., dnsmasq forwards it back to consul, and round and round in circles it goes.
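The forwarding rules described above can be sketched as a tiny model. This is not how dnsmasq or consul are implemented; the function names and the hop limit are assumptions made for illustration, and a query that leaves the two servers (for example towards a public upstream) is simply treated as answered.

```python
from typing import List, Optional

DNSMASQ = "dnsmasq (10.88.10.3)"
CONSUL = "consul (10.88.10.2)"


def next_hop(server: str, qname: str, qtype: str) -> Optional[str]:
    """Return the server a query is forwarded to, or None if it is answered."""
    in_consul_domain = qname == "consul." or qname.endswith(".consul.")
    if server == DNSMASQ:
        # dnsmasq has a dedicated upstream for the consul. domain.
        # Anything else leaves this model, so we treat it as answered.
        return CONSUL if in_consul_domain else None
    if server == CONSUL:
        # consul answers names in its own domain, except DS queries,
        # which are handed to the configured recursor.
        if in_consul_domain and qtype != "DS":
            return None
        return DNSMASQ
    raise ValueError(f"unknown server: {server}")


def trace(qname: str, qtype: str, max_hops: int = 8) -> List[str]:
    """List the servers a query visits, giving up after max_hops."""
    path = [DNSMASQ]  # the client always asks dnsmasq first
    while len(path) < max_hops:
        hop = next_hop(path[-1], qname, qtype)
        if hop is None:
            break
        path.append(hop)
    return path


print(trace("xyz.service.consul.", "A"))
print(trace("consul.", "DS"))
```

An ordinary A query stops at consul after one forward, while the DS query keeps bouncing between the two servers until the artificial hop limit cuts it off. In the real setup there was no such limit.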
Amplification
Dnsmasq has a maximum number of concurrent pending queries, by default 150.
When this limit is reached, it starts answering SERVFAIL to all new requests.
This response starts making its way back up the stack of recursive DNS queries.
Upon receiving a SERVFAIL answer, dnsmasq will retry the DNS query once more.
There is a stack of 150 pending queries, each waiting for the response to the previous one.
Every time a slot becomes available, it is immediately occupied by a retry.
The result is that very few legitimate DNS queries are able to get through. The only way to resolve this loop is to break it by restarting either dnsmasq or consul.
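A rough way to see why a single looping query is enough to exhaust dnsmasq is to count slots. The sketch below is a deliberately simplified model, not dnsmasq's real bookkeeping: it assumes every pass of the query through dnsmasq takes one pending slot that is never released, and it ignores the retries described above. The function name follow_loop is made up for illustration.

```python
from typing import Tuple

DNS_FORWARD_MAX = 150  # dnsmasq's default limit on concurrent pending queries


def follow_loop(limit: int = DNS_FORWARD_MAX) -> Tuple[int, int, str]:
    """Follow one looping query until dnsmasq has no free slot left.

    Returns (pending slots in use, passes through dnsmasq, final answer).
    """
    pending = 0
    passes = 0
    while True:
        passes += 1
        if pending >= limit:
            # No free slot: this pass is refused instead of forwarded.
            return pending, passes, "SERVFAIL"
        # dnsmasq forwards to consul, which sends the query straight back,
        # so this slot stays occupied while the next pass arrives.
        pending += 1


pending, passes, answer = follow_loop()
print(f"{pending} slots held after {passes} passes, last answer: {answer}")
```

Under these assumptions one client query fills all 150 slots by itself, and the first refusal only happens on the pass after that. Every unrelated query arriving in the meantime competes for the same slots.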
Conclusion
Don’t create a loop in your DNS recursors’ configuration. In our case, nothing was querying consul directly anyway, since the dnsmasq server was configured as the system resolver.
I reported an issue to the consul project, because either the documentation or the code is incorrect, and that discrepancy led us to think that our configuration was okay.