I operate authoritative name servers for almost 10.000 domains. Originally, I used a default TTL of 2 days, as recommended by RIPE-203¹ (which is also compatible with the recommendations of RFC 1912²), but this was not accepted by users, who didn’t want to wait two days. Therefore, for all records except SOA and NS records, I changed the default TTL to one hour, which I still use as the default value unless a change is scheduled and/or planned, in which case I lower it to 5 minutes. I do not want to lower it any more, as I’ve heard rumors of buggy resolvers interpreting “too low” TTLs as bad, and reverting to some very-high default TTL, and thereby wrecking my carefully planned DNS changeover. I have, however, not seen any real numbers or good references on what numbers are “too low”, and would like to hear from anyone who might have some information on this.
Unless you have insight into the end users' DNS deployments, I would say this is the appropriate amount of caution to apply. Besides just TTL being low, a frequent issue I had when first migrating to AWS years ago was CNAME to CNAME records not resolving among some end users. Primary schools were the worst offenders; I assume some of them still have Novell deployed.
I am hesitant to say; we only target the local area, and our home page isn’t even available in English. Our main role is as a domain name registrar, also providing, in increasingly tangential order, domain name strategy planning, some trade mark strategy, DNS hosting, HTTP redirects, E-mail, and web hosting. Our main value proposition is support; call us and talk to us directly, or send an e-mail, and get an answer more or less immediately. We only very reluctantly provide self-service control panels, and we don’t mention its availability unless people directly ask for it, and we generally discourage its use, preferring that people simply tell us what they want done in their DNS. Some people, including some very large companies, prefer this arrangement, and if you are one of them, and you are part of our local market, I’m sure you’ll be able to find us.
The irony of all of this is that those TTLs are almost meaningless as a server operator anyway. Even if you set your TTL to 5 minutes, there are a whole lot of clients that will ignore it.
When I made a DNS switch at reddit, even with a 5 minute TTL, it still took an hour for 80% of the traffic to shift. After a week, only 95% had shifted. After two weeks we still had 1% of traffic going to the old IP.
And after a month there was still some traffic at the old endpoint. At some point I just shut off the old endpoint with active traffic (mostly scrapers with hard coded IPs at that point as far as I could tell).
One of my friends who ran an ISP in Alaska told me that they would ignore all TTLs and set them all to 7 days because they didn't have enough bandwidth to make all the DNS queries to the lower 48.
So yeah, set your TTL to 40 hours. It won't matter anyway. In an emergency, you'll need something other than DNS to rapidly shift your traffic (like a routed IP where you can change the router configs).
It was some time ago, but I've had similar trouble. Back when I had a small webhost (I stopped when shared/reseller hosting descended too far into an overselling-and-deliberately-misleading-advertising-to-compete-on-price race to the bottom), a customer who "left" (was told to get lost due to non-payment) demanded I keep his content up because some users were still ending up there. As far as I could tell, none of the records had ever had a TTL longer than four hours (my default) while they were pointing at that address, yet more than a month later some traffic was still coming in to that address for that domain. I didn't look too deeply into it due to the history of the client in question, but it was certainly a real problem at that point in time.
So for my own stuff if there is a controlled change I try to keep the original address operating as a relay to the new address for a chunk of time, and for a while after that have it host a message saying "your DNS resolution seems to be broken, you shouldn't have been sent here, please report this to your ISP or local SysAdmin, if you want me to try fix it for you here is a list of my consulting fees".
I've had a lot of issues with this for some of our customers. Nobody wants to run old deploy environments for weeks..
Turns out their routers somehow set an insane TTL (like max int) and report this to their clients, which in turn also get stuck with an insane TTL. You have to reboot or flush both the router and the clients to get it unstuck. I don't know if it's a "feature" or some sort of memory corruption / race condition.
The routers were almost always Asus routers, ex: Asus RT66. And the same customers were repeat offenders (not every time, but often enough).
In the end we had them set DNS on their computers to something like 184.108.40.206
I know that HTTP clients on some platforms like .NET will not resend the DNS query until the underlying TCP connection is closed. So if the server is using keep-alive and you keep sending requests, you might actually end up using the same IP address for a long time.
I think this is actually the desirable behavior. If you've got very long-running connections and you want to force a switch-over you can always drop the connection even if you don't have some in-band mechanism. If you're constantly watching for a DNS change all it takes is one DNS failure to kill a connection (or all connections). In general transient network issues are probably going to be more common than IP changes on your infrastructure and issues caused by the former are harder to debug.
As someone who overrides TTL's for all domains on my home network, I agree with this. I use Unbound DNS to query upstream servers over a VPN. I override min ttl to 20 minutes and that has never caused any issues as far as I can tell. I have been doing this for many years.
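For anyone wondering what a minimum-TTL override amounts to, here is a rough sketch of the clamping a caching resolver might apply before storing a record. The function name and the one-day cap are made up for illustration; only the 20-minute floor comes from the setup described above, and real resolvers may apply their limits differently.

```python
def effective_ttl(record_ttl: int,
                  min_ttl: int = 20 * 60,
                  max_ttl: int = 24 * 60 * 60) -> int:
    """Clamp an upstream TTL (in seconds) into the range [min_ttl, max_ttl]."""
    if min_ttl > max_ttl:
        raise ValueError("min_ttl must not exceed max_ttl")
    # Short TTLs are raised to the floor, very long ones cut at the cap.
    return max(min_ttl, min(record_ttl, max_ttl))


# A few upstream TTLs and what the cache would actually keep them for.
for ttl in (5, 60, 3600, 172800):
    print(f"upstream {ttl:>6}s -> cached for {effective_ttl(ttl):>5}s")
```

The trade-off is the one discussed elsewhere in the thread: anything that relies on a 5-second TTL being honoured will see stale answers for up to the floor value.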
Even with short TTLs, I often see Facebook mobile app users lingering for days(!) on the old IP in the logs, long after all other traffic is gone.
I'm not really sure what's up with that, as no-one has ever reported the site not being reachable in the Facebook app after an IP change. Either it does some very aggressive caching, or something is pretending to be the Facebook app.
The author seems to be missing one of the big reasons ridiculously low TTLs are used: it lets passive eavesdroppers discover a good approximation of your browsing history. Passive logging of HTTP has (fortunately) been hindered as most traffic moved to HTTPS, but DNS is still plaintext.
Low TTLs mean a new DNS request happens (approximately) every time someone clicks a link. Seeing which domain names someone is interacting with every 60s (or less!) is enough to build a very detailed pattern-of-life. Remember, it's probably not just one domain name per click; the set of domain names that are requested to fetch the js/css/images/etc for each page can easily fingerprint specific activities within a domain.
Yes, TTLs need to have some kind of saner minimum. Even more important is moving to an encrypted protocol. Unfortunately DOH doesn't solve this problem; it just moves the passive eavesdropping problem to a different upstream server (e.g. Cloudflare). The real solution is an encrypted protocol that allows everyone to do the recursive resolution locally.
> The author seems to be missing one of the big reasons ridiculously low TTLs are used: it lets passive eavesdroppers discover a good approximation of your browsing history.
I operate DNS for hundreds of thousands of domains. I've tried to reassemble browsing history from DNS logs, and I can tell you it is damn near impossible. You have DNS caches in the browser, the OS, broadband routers, and ISPs/public resolvers to account for - and half of them don't respect TTLs anyways.
The reason people set low TTLs is they don't want to wait around for things to expire when they want to make a change. DNS operators encourage low TTLs because it appears broken to the user when they make a change and "it doesn't work" for anywhere from a few hours to a few days.
I can't tell. I run Firefox at home, and set up my own DoH server (mainly because I saw the writing on the wall and if Mozilla/Google are going to shove this down my throat, I want it shoved down on my terms, but I digress). If I visit my blog (which has a DNS TTL of 86,400) I get a query for my domain not only on every request, but even if I just hover over the link. It will also do a query when I click on a link to news.ycombinator.com (with a TTL of 300) but not when I hover over a link. It's bizarre.
I seem to remember a paper a few years ago that (IIRC) tested this by setting a very low TTL (like 60), changing the value, and seeing how long they continued to receive requests at the old value... and most updated within the TTL, but there were some that took up to (I want to say) an hour. I'm probably getting bits of this wrong though..
The violations in that paper that are important are those that have increased the TTL. Reducing the TTL increases costs for the DNS provider, but isn't important here. The slowest update was about 2 hours (with the TTL set to 333).
Of those that violated the TTL, we don't know what portion would function correctly with a different TTL (increasing the TTL indicates they're already not following spec). So I wouldn't assume that increasing the TTL would get them to abide by your requested TTL. They're following their own rules, and those could be anything.
Considering how common low TTLs are... you're worrying about a DNS server that's already potentially causing errors for major well known websites.
It is important to note that this study used active probes asking selected recursive resolvers around the world.
From my own experience when changing records and seeing when the long tail of clients stops calling the old addresses (with the name), it is a really long tail. An extreme example that lasted almost six months was a web spider that just refused to update their DNS records and continued to request websites using the old addresses.
Is there a lot of custom written code that does its own DNS caching? Yes. One other example is internal DNS servers that shadow external DNS. There is a lot of very old DNS software running year after year. Occasionally at work we stumble onto servers which were very clearly handwritten a few decades ago by people with only a vague idea of what the RFCs actually say. Those are not public resolvers of major ISPs, so the above study would not catch them.
Naturally if you have a public resolver where people are constantly accessing common sites with low TTLs then issues would crop up quickly and the support cost would get them to fix the resolver. If it's an internal resolver inside a company where non-work sites are blocked then you might not notice until the company moves to a new web hosting solution and suddenly all employees can't access the new site, an hour later they call the public DNS hosting provider, the provider diagnoses the issue to be internal of the customer's network, and then finally several hours later the faulty resolver gets fixed.
Yep, older java versions had some ridiculous caching of both positive and negative DNS responses. That was some weird problem to troubleshoot. We ended up writing our own caching then, back in Java7ish. And the first version of our DNS caching was broken and promptly triggered load alerts on 2 DNS servers of our operations team by issuing ... a lot of DNS queries very very quickly :)
> Of course, a service can switch to a new cloud provider, a new server, a new network, requiring clients to use up-to-date DNS records. And having reasonably low TTLs helps make the transition friction-free. However, no one moving to a new infrastructure is going to expect clients to use the new DNS records within 1 minute, 5 minutes or 15 minutes. Setting a minimum TTL of 40 minutes instead of 5 minutes is not going to prevent users from accessing the service.
Note that you can still get the benefit of a low TTL during a planned switch to a new cloud provider, server, or network even if you run with a high TTL normally. You just have to lower it as you approach the switch.
For example, let's say you normally run with a TTL of 24 hours. 25 hours before you are going to throw the switch on the provider change, change the TTL to 1 hour. 61 minutes before the switch, change TTL to 1 minute.
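The arithmetic in that example can be sketched as below. This is only an illustration: the function name and the switch date are invented, and the rule it encodes (publish each lower TTL at least "old TTL + new TTL" before the switch) is my reading of the 25-hour and 61-minute figures above, which it reproduces. It also assumes every cache honours TTLs, which other comments here suggest is optimistic.

```python
from datetime import datetime, timedelta


def ttl_lowering_schedule(switch_at, normal_ttl, steps):
    """Return [(when, new_ttl), ...] for stepping a TTL down before a switch.

    Each lower TTL is published `previous_ttl + new_ttl` seconds before the
    switch, so anything cached under the previous TTL has expired by then.
    All TTLs are in seconds; `steps` must be strictly decreasing.
    """
    schedule = []
    previous_ttl = normal_ttl
    for new_ttl in steps:
        if new_ttl >= previous_ttl:
            raise ValueError("each step must lower the TTL")
        when = switch_at - timedelta(seconds=previous_ttl + new_ttl)
        schedule.append((when, new_ttl))
        previous_ttl = new_ttl
    return schedule


# Hypothetical switch time; 24 h normal TTL, lowered to 1 h and then 1 min.
switch = datetime(2024, 1, 10, 12, 0)
for when, ttl in ttl_lowering_schedule(switch, 24 * 3600, [3600, 60]):
    print(f"{when:%Y-%m-%d %H:%M}  set TTL to {ttl}s")
```

After the switch the TTL can be raised back to its normal value once the new records have been verified.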
If your cloud provider does an oopsie (e.g. https://news.ycombinator.com/item?id=20064169) and takes down your entire infrastructure, or you have to move quickly for some other reason, or you're recovering from a misconfiguration, the long TTL can add 24 hours to your mitigation time.
If you're just playing around with your personal project/web site, you just added a giant round of whack-a-cache to your "let's finally clean up my personal server mess" evening.
As anyone who's ever worked with web hosting can confirm, small business customers often have no idea of what they're doing, and I've talked to many people who switched providers after seeing an ad for cheap hosting, without realising that they have to a) wait for the DNS changes to propagate, and b) actually move their web site from one provider to another.
Subsequently, my previous employer lowered the default TTL simply because it got rid of all the bad Trustpilot ratings about customers being "prevented from leaving", and started offering a "move my WordPress site for me" service to profit from all the panicking new-comers who had no idea about how to do trivial things like importing/exporting a database and transferring files.
It would have been interesting to see actual delay rather than qualitative results of the nature "<x>% wasn't in cache so this is horrible!". Admins and users don't care if it's in cache, they care what the impact to operations and load time is. https://www.dnsperf.com/dns-speed-benchmark says lookup times for my personal domain are 20ms-40ms. Ironically the same dns test for 00f.net is taking 100ms-150ms.
99% of apps will gladly trade a 30ms increase in session start (assuming the browser's prefetcher hasn't already beaten them to it) to not have to worry about things taking an hour to change. Not all efficiency is about how technically slick something is.
I just tested 00f.net and got numbers as low as 6ms. Latency is a question of network traffic between the client and the server, and unless you use anycast you will get different latency depending on where in the world the client and server reside, and if you use anycast it depends on how good the contracts and spread of the anycast network are.
This is very common (Dynect, NS1, AWS, GCP, etc all depend on this for monitoring and failover). The author is incorrect.
amazon.com, reddit.com, facebook.com, and others use low TTLs on their domains for this reason. Anyone who can't maintain an anycast infrastructure around the world and doesn't want to depend on Cloudflare will use this method.
For literally my entire career in SRE, well over a decade now, I’ve only interacted with systems that use DNS for this purpose, from small shops to parts of every HN reader’s life. That sentence in your quote, as well as the assertive nature of the post on such a weak foundation, are sufficient to disqualify a hiring candidate on account of lack of experience despite the authored software presented. It simply does not align with reality when presented with two logically separate networks and a required mechanism to transition between them.
The only other alternative for that scenario is using anycast addressing, and that has a colorful bag of limitations that are quite different from those of low-TTL DNS (including being out of reach for most).
DNS failover is used extensively, especially in the cloud world.
I see nothing wrong with using low DNS TTLs for failover - really don't understand the author's objections here, and them claiming that "DNS is not used for failover any more" significantly discredits them, IMO.
> I mean, my company does this for certain failure scenarios involving our CDNs. Can anyone tell me why we're idiots, or is this just hyperbole?
I came here to say exactly that. Our company uses DNS entries with low TTL for failover and load balancing purposes as well -- it's a very common approach. Services like AWS Route 53 and CloudFlare make it very easy and low cost to set up. I was surprised that the author didn't give much acknowledgement to this type of usage.
How would I use a load balancer to fail traffic between, say, London and Amsterdam with no fiber in place between them? Where would the load balancer physically exist in that scenario and how would it fail to the other when power is lost in one location? Would I make a third PoP to isolate it? What would then be my redundancy story for that PoP? How would I relocate traffic to my backup load balancer PoP number four?
Within a single network, sure, load balance all you want. That’s not the scenario low TTLs go after.
> How would I use a load balancer to fail traffic between, say, London and Amsterdam with no fiber in place between them?
What people use in those situations is Anycast.
Of course, DNS itself, or e-mail, don’t need this kind of redundancy, since the NS (or MX) records themselves provide a list of failover servers. The corresponding alternative for HTTP, SRV records, has been consistently stonewalled by standard writers for HTTP/2, QUIC, etc.
There is an interesting draft RFC which I am keeping an eye on, but I don’t want to get my hopes up:
This requires that you blow a publicly-advertisable prefix for every unique combination of services you would want to fail-over.
E.g., if you wanted to be able to have independent fail-over between your customer-facing self-service portal and your webmail interface (each relying on specific state that you can't replicate synchronously, and can't guarantee to replicate consistently with each other), you would need two /24s, one dedicated to anycast for the webmail interface and one for the self-service portal, both separate from any services which are active-active.
Whereas using DNS, you could use your other existing public /24s that you are already using for your active-active services.
In the last days of IPv4, an extra 2 /24s just for this is quite an expense.
Within a region, Anycast is how many big companies move things around mostly seamlessly. Why inside a region? If RIPE or ARIN catch you advertising their IP in the other's territory, they will send nasty emails and threaten to take back your CIDR blocks. I have no idea if they will follow through as we always stopped violating their rules when told to.
Outside of a region, they use DNS and sometimes a combination of WAN accelerators and VPN's, dark fiber.
No, there isn’t. The specification as implemented requires no invalidation mechanism, which means no such mechanism across all caches exists, nor will it ever. The long tail kills you in such a failure scenario, and remember, people who make kitchen appliances write DNS resolvers.
In fact, if you use Route 53 with an alias to an ELB, the TTL is hard-coded at 60s -- it is not even configurable. If it were, we'd follow the practice of lowering it prior to changes, and raising it again once things are stable, but as it is, that's not an option (moving DNS off AWS would be a hard sell, not because it's terribly hard but afamic, there's not really any value to doing it).
I would maintain that if you are experiencing poor performance for a web site, there are MUCH more fruitful places to look than DNS latency. Third party objects, excessive page sizes, lack of overall optimization based on device are just the tip of the iceberg.
There was a dns record looked up primarily by large supercomputers that had a 0 ttl. It was used for stats via a UDP packet (because it was non blocking, nevermind that the dns query was blocking). This was set to 0 for "failover" but it hadn't changed in years. I worked out that our systems alone had caused billions of queries for this name.
After I complained I think they upped the ttl.. to 60.
Reminds me of a server pair at the last healthcare place I worked. Between the two of them they'd generate something around 1200 DNS lookups per second (about 60% of the load on the DNS servers) of their own name. I think the logic was if the name stopped responding then server A was primary. If the name was responding the server that owned the IP it was responding for was primary. If the servers wanted to swap primary/secondary they would issue a DDNS request.
After about 8 years we were restructuring our DNS infrastructure for performance and I rate limited those two to 10 or so queries per second each. In that time there must have been 300 billion or so requests from those two boxes alone.
In my experience that sort of thing is from the local hostname not being present in /etc/hosts and (of course) a caching resolver not in use.
Some process on the system wants to connect to itself, which then causes a dns lookup. Add a high transaction rate on top of that, and 1,200/second is easy.
The funniest one I remember seeing was thousands and thousands of lookups for HTACCESS. Turned out apache was running on top of a web root stored in AFS and not configured to stop looking for .htaccess files at the project root, so it would try to open
Can anyone explain why ping.ring.com needs to have such a low TTLs?
$ drill ping.ring.com @220.127.116.11
;; ->>HEADER<<- opcode: QUERY, rcode: NXDOMAIN, id: 36008
;; flags: qr rd ra ; QUERY: 1, ANSWER: 2, AUTHORITY: 1, ADDITIONAL: 0
;; QUESTION SECTION:
;; ping.ring.com. IN A
;; ANSWER SECTION:
ping.ring.com. 3 IN CNAME iperf.ring.com.
iperf.ring.com. 3 IN CNAME ap-southeast-2-iperf.ring.com.
;; AUTHORITY SECTION:
ring.com. 573 IN SOA ns-385.awsdns-48.com. awsdns-hostmaster.amazon.com. 1 7200 900 1209600 86400
I've been trying to find out from Ring support for a few days, and while the support layer has been trying to find out, not much information seems to be getting back. To put this in perspective, in my house with two Ring devices (a doorbell and a chime), I am getting 10,000+ DNS un-cached requests a day, which easily is 20x more than the second most requested domain.
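To put that number in context, here is a back-of-the-envelope sketch of how many upstream queries per day a single name can generate behind one cache that honours the TTL. The function name is made up, and the bound is approximate (it ignores the off-by-one at the edges of the day).

```python
SECONDS_PER_DAY = 24 * 60 * 60


def max_lookups_per_day(ttl_seconds: int) -> int:
    """Approximate ceiling on cache misses per day for one name behind one cache."""
    if ttl_seconds <= 0:
        raise ValueError("with a TTL of 0 every lookup is a miss, so there is no ceiling")
    # A TTL-honouring cache can miss at most once per TTL interval.
    return SECONDS_PER_DAY // ttl_seconds


for ttl in (3, 60, 300, 3600):
    print(f"TTL {ttl:>4}s -> at most about {max_lookups_per_day(ttl):>5} queries/day")
```

With the 3-second TTL shown in the answer section, the ceiling works out to 28,800 a day per name, so 10,000+ a day is about a third of what the TTL permits, or one un-cached lookup every 8 to 9 seconds on average.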
The command output has the answer: it's a CNAME to whatever random AWS instance happens to be up and running. They probably let the instances autoscale to load and don't guarantee they'll be around for any amount of time, and rather than configure an additional service for heartbeating they just used DNS.
There are caching nameservers that allow you to override the minimum TTL but be aware the device is likely relying on this being immediately up to date and may not work during a change with an extended TTL set.
That's actually a pretty long TTL by Amazon standards.
;; ANSWER SECTION:
amazon.com. 60 IN A 18.104.22.168
amazon.com. 60 IN A 22.214.171.124
amazon.com. 60 IN A 126.96.36.199
Or some AWS services
;; ANSWER SECTION:
glacier.us-east-1.amazonaws.com. 60 IN A 188.8.131.52
S3 has even shorter:
;; ANSWER SECTION:
s3.ap-northeast-1.amazonaws.com. 5 IN A 184.108.40.206
Or, say, DynamoDB:
;; ANSWER SECTION:
dynamodb.us-east-1.amazonaws.com. 5 IN A 220.127.116.11
The main reason to do so is to be nimble: to be able to react to incidents as fast as you can, and to make certain deployment patterns possible.
From time to time, you need to do something with customer facing infrastructure: Remove the DNS entry, watch the traffic drain over the next 5-10 minutes, and then do what you need to do on the device, test, and then add it back in the DNS again, from which you can watch traffic return to normal levels and verify everything is good.
Well, in my case it makes sense, I think: I host my server at home, and have a dynamic IPv4. I don't know when it could change, so I just set the TTL to something low.
Since the traffic is low, though, I can afford to check for an IP change every ~5min, and although I set a TTL of ~15min on most services, the main CNAME (ovh-provided dynamic dns service, TTL set by them) is set to 60s.
My IPv6 record was set to 1h, but I'll look into increasing it. It is true that my mobile phone often pings my server, so I imagine that it could reduce the battery usage.
Please excuse any ignorant use of terminology, I am not a DNS expert like others on here, but I can share some experience in the smaller business world.
A company I worked with a couple of years ago was using Dyn as their DNS provider, and one day we got a notification that we had passed the usage limits for our account. This seemed impossible considering our site was getting a couple of hundred unique visitors a day. A few things came out of the analytics.
1) A short TTL on an A record had been left on from a website migration project. The majority of the requests were coming from our internal website administrators. I moved it up to a couple of hours and this went away.
2) We were getting a huge amount of AAAA record hits. I think most modern browsers/OS try quad A first??? We didn't have IPv6 configured, and therefore the negative resolution had a TTL set to the minimum on the SOA record, which was 1 second! A change of this to 60 caused a huge reduction in requests. I suppose I should have set up ipv6, but I didn't.
3) When we sent out stuff to our mailing list the SPF (or rather TXT) records saw a peak that was off the chart. We had a pretty settled infrastructure, so I moved that TTL to a day (I think from memory) and it flattened the peak somewhat.
4) There was a large peak in MX request around 9am. I put this down to people opening their email when they got to work and replying to us. I had to set the TTL to a couple of days (of course) to smooth that one.
I like to think it was worthwhile and improved things for users. I at least had a nice warm glow that I had saved the internet from a bunch of junk requests, and it just felt tidier.
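On the negative-caching point in item 2: under RFC 2308, a "no such record" answer is cached for the smaller of the SOA record's own TTL and its MINIMUM field (the last number in the SOA). A minimal sketch of that rule follows; the helper name and the example.com zone data are invented for illustration, and it only handles a single-line SOA in presentation format.

```python
def negative_cache_ttl(soa_record: str) -> int:
    """Return how long (seconds) a negative answer may be cached, per RFC 2308.

    Expects: owner ttl class SOA mname rname serial refresh retry expire minimum
    """
    fields = soa_record.split()
    if len(fields) != 11 or fields[3].upper() != "SOA":
        raise ValueError("expected a single-line SOA record in presentation format")
    soa_ttl = int(fields[1])
    minimum = int(fields[10])
    # The negative TTL is the lesser of the SOA's TTL and its MINIMUM field.
    return min(soa_ttl, minimum)


# Hypothetical zone, before and after raising MINIMUM from 1 to 60 seconds.
before = "example.com. 3600 IN SOA ns1.example.com. hostmaster.example.com. 2024010101 7200 900 1209600 1"
after = "example.com. 3600 IN SOA ns1.example.com. hostmaster.example.com. 2024010102 7200 900 1209600 60"
print(negative_cache_ttl(before), negative_cache_ttl(after))
```

So with no AAAA record published, every dual-stack client lookup turns into a negative answer that caches may only hold for that long, which is why a MINIMUM of 1 second produced so much traffic.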
This already exists (NOTIFY), but it's only used for master-slave setups (ie. a bunch of DNS servers serving some authoritative zone who want changes to be transmitted to all slaves ASAP)
It would be interesting to (ab)use this mechanism in the way you suggest. A recursive DNS server could ask to be NOTIFY'ed of changes in the zone they are querying...it would, of course, add load to the server, and it would need strict limits to avoid DoS, but it seems an interesting idea.
The big problem, to the extent there is one, is between the client and the recursive server. Not as much the recursive and the authoritative. Cost is highly amortized between recursive and authoritative for busy names.
The author says low TTLs are bad because of latency but never attempts to quantify how much latency we are actually talking about. It's hard to know how outraged I'm supposed to be without actually seeing the numbers.
It seems that a lot of sites are ok with slightly higher latencies if it means greater operational flexibility.
Latency is dependent on many things. DNS server location - accessing my website from australia will take 500ms for DNS lookup (or twice as much if I'm using cnames). If this is not cached somewhere, that's 500ms every few seconds with those <1minute TTLs. If I'm on GPRS, or similar, that will add a bunch more hundreds of ms to every useless DNS resolution, incl. unpredictable variability.
I run my own dnsmasq server on an old laptop and force really long TTL caching regardless of what the records come back with. I even cache nxdomain. It works great, except once or twice a month I have to flush the cache because Slack seems to not handle it well.
I’m honestly not sure what this author is complaining about. If the infrastructure can handle it and the zone owner is willing to pay for the excessive traffic, and DNS cache operators are fine with it, then this seems like a call for premature optimization.
I have worked in a place using GTM to fail over from a bad data center to a good data center, with a TTL of maybe a few minutes. I worried about it, but availability is much higher this way, especially combined with an only-change-one-data-center-at-a-time policy.
I'm kind of surprised that I can't see any other comments talking about GTM's (I assume you mean F5 Global Traffic Managers)
Where I am at the moment GTM's are used everywhere, and everywhere the TTL is set to 30s.
The only part of this that really annoys me is that, under the global default configuration, only a single IP address is returned when you resolve down to the A record, rather than a subset of the list of IP addresses.
When I've pressed the issue that _at least_ on our internal GTM's we should just return a bunch of IP addresses every time someone resolves the address, I've been told that it would break load balancing... which blows my mind, because who on earth is relying on DNS to load balance traffic with a 30s TTL? I would have thought that the normal thing to do, if you actually wanted load to balance, would be to return a subset of IP addresses in a different order and with a different subset each time. That way other DNS resolvers that cache that record can at least return multiple addresses to all the clients they serve, as opposed to everyone using that resolver getting stuck to a single address for 30 seconds...
But all of that being said, it would make perfect sense to me to just return like 4 IP addresses publicly for every resolution and rotating setting the TTL to like 30s so that clients could spend 30s iterating through the A Records they have cached, then hit your resolver up again and get a different sites addresses back if your site had gone down...
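A toy sketch of the "different subset, different order each time" idea, purely to make it concrete. This is not how a GTM is configured; the function name is invented and the addresses are from the 192.0.2.0/24 documentation range.

```python
def rotating_subset(pool, size, query_number):
    """Return `size` addresses from `pool`, starting one position later per query."""
    if not pool:
        raise ValueError("address pool is empty")
    size = min(size, len(pool))
    start = query_number % len(pool)
    # Wrap around the end of the pool so every address gets its turn at the front.
    return [pool[(start + i) % len(pool)] for i in range(size)]


POOL = ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4", "192.0.2.5", "192.0.2.6"]

# Each successive answer hands out four of the six addresses, shifted by one.
for n in range(3):
    print(n, rotating_subset(POOL, 4, n))
```

A resolver that caches any one of these answers still passes several addresses on to its clients, so a single dead address does not strand everyone behind that resolver for the full TTL.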
To avoid delay when migrating a website IP, what I usually do is first migrate to an HAProxy (about 2 days before switching) so all ISP DNS caches are updated, and on D-day I switch my backend to the new website/VM.
And then I change my DNS again to the new IP.
You have to tune a bit to get the right IP in your logs but so far it works.
>> The urban legend that DNS-based load balancing depends on TTLs (it doesn’t - since Netscape Navigator, clients pick a random IP from a RR set, and transparently try another one if they can’t connect)
Unless you do not return an RR set and what you return is based on geolocation and data center health.
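For reference, the client-side behaviour the quoted sentence describes (pick a random address from the set, fall through to the next one on failure) looks roughly like this. It is a sketch with an invented function name, not what any particular browser does, and as the reply points out it only helps when the answer actually contains more than one address.

```python
import random
import socket


def connect_any(candidates, timeout=3.0):
    """Try (host, port) pairs in random order; return the first connected socket."""
    shuffled = list(candidates)
    random.shuffle(shuffled)
    last_error = None
    for host, port in shuffled:
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError as exc:
            # Unreachable or refused: remember the error and try the next address.
            last_error = exc
    raise ConnectionError(f"no candidate address was reachable: {last_error}")


# In real use the candidates would come from the resolver, for example:
#   infos = socket.getaddrinfo("example.com", 443, type=socket.SOCK_STREAM)
#   candidates = [info[4][:2] for info in infos]
```

Note that none of this consults the TTL; the failover happens within a single cached answer.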
Thanks. It seems, unfortunately, that only Google DNS and OpenDNS (Cisco iirc) include the data as of now. Older articles even mention how you have to have your website (well, nameservers) whitelisted for them to forward client subnet as part of DNS queries, not sure if that is still the case.
Of course, caching gets more complicated and less useful with this.
This is definitely a feature I've also thought would be useful to have in DNS providers.
I've worked on managing thousands of (sub)domains and the administrative overhead of changing the TTLs for everything manually would be considerable. I'd certainly like an automated way to say "These records should gradually increase TTL up to <X> time over <Y> time" (e.g., gradually raise TTL to 2 days over 2 weeks if there are no changes).
There are downsides to high TTLs though: (1) you need to remember to preemptively lower them ahead of any planned changes (if you want those changes to take effect quickly), and (2) you can't change the records quickly in an emergency. But, fortunately, lots of record types are ones that you probably don't need to change in an emergency -- and for ones that you do, you can use a low TTL.
Anyway, I'd personally like to see automated TTL management as a feature in DNS software.
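Something like the following is what I have in mind, as a sketch only. The function name, the linear ramp, and the 5-minute starting TTL are my own assumptions; the 2-day ceiling and 2-week ramp are the example figures from above.

```python
DAY = 24 * 60 * 60


def grown_ttl(seconds_since_change: int,
              base_ttl: int = 300,
              max_ttl: int = 2 * DAY,
              ramp: int = 14 * DAY) -> int:
    """TTL (seconds) to publish, growing linearly the longer a record is unchanged."""
    if seconds_since_change <= 0:
        return base_ttl
    # Fraction of the ramp completed, capped at 1 once the record is "old".
    fraction = min(seconds_since_change / ramp, 1.0)
    return int(base_ttl + fraction * (max_ttl - base_ttl))


for days in (0, 1, 7, 14, 30):
    print(f"{days:>2} days unchanged -> TTL {grown_ttl(days * DAY)}s")
```

Any edit to the record would reset the clock, and a planned change would still need the TTL lowered in advance.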
Maybe up to a point, but really the TTL should be set for how long is acceptable for traffic to continue to flow to the old destination after a change. That's not necessarily correlated with time between changes: just because a service IP hasn't changed for two years doesn't mean I would want to wait a day for most traffic to move.
Of course, the reality is some traffic will continue to flow to the old destination for as long as you care to measure. There's plenty of absurdly broken DNS caching out there.
I've seen some that have a somewhat reasonable minimum time. You can go below it but it will reset to their minimum after a day or two.
But it's a risky play for providers. It reduces their DNS load (which is pretty cheap to handle), but increases the risk that a customer will come yelling why they couldn't fix their outage quickly because the algorithm increased their TTL to something large.
They've collected data on DNS queries "for a few hours". By definition, clients that have a DNS answer cached (IOW, most clients, since browsers and operating-system resolver calls do that for you) will not issue DNS requests for any record whose TTL has not yet expired.
So, they've caught all (well, all that were re-requested) the TTLs shorter than whatever "a few hours" is, and only those longer ones that expired exactly during the experiment and were re-requested.
To run a proper experiment comparing "short" vs "regular" (let's say 1-3 days) TTLs, you need to collect data for days (e.g., at least 7 days, preferably at least 30), but even that would not capture most TTLs longer than 7 or 30 days.
Articles like this are bad because they can easily confuse even knowledgeable people like the HN crowd.
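A back-of-the-envelope sketch of that sampling bias, under a deliberately simplified model that I am assuming here: every name is requested constantly, so a cache re-queries upstream roughly once per TTL. The function name and the numbers are made up for illustration.

```python
def observed_share(names_per_ttl, window_seconds):
    """Estimate each TTL's share of the queries seen during a capture window.

    names_per_ttl maps a TTL in seconds to the number of names using it.
    Simplifying assumption: each name is re-queried about once per TTL,
    so it contributes window_seconds / ttl queries on average.
    """
    weights = {ttl: count * window_seconds / ttl
               for ttl, count in names_per_ttl.items()}
    total = sum(weights.values())
    return {ttl: weight / total for ttl, weight in weights.items()}


# Equal numbers of 1-minute and 1-day records, captured for 3 hours.
shares = observed_share({60: 100, 86400: 100}, window_seconds=3 * 3600)
```

In this toy setup more than 99% of the captured queries carry the 60-second TTL, even though half of the records have a 1-day TTL.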
Nothing precludes you from upping the TTL after the change. Traditionally, DNS admins progressively drop the TTL prior to a change to reduce the time an RRSet is in flux (so if your TTL is N, then N + 1 seconds prior you drop it to half of N, and again and again until it's your preferred window size), and cautious ones slowly ramp back up to the regular value afterwards.
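One way to turn the halving procedure into a timetable, as a sketch. I am assuming each step waits out the previous TTL plus a one-second margin before the next drop, and that the TTL is never lowered below the target. The function name is hypothetical.

```python
def ttl_lowering_schedule(current_ttl, target_ttl):
    """Return [(seconds_before_change, new_ttl), ...], earliest step first."""
    if target_ttl <= 0 or current_ttl < target_ttl:
        raise ValueError("need current_ttl >= target_ttl > 0")

    # Forward pass: halve until the target is reached, never going below it.
    steps = []  # (ttl_before_drop, ttl_after_drop)
    ttl = current_ttl
    while ttl > target_ttl:
        new_ttl = max(ttl // 2, target_ttl)
        steps.append((ttl, new_ttl))
        ttl = new_ttl

    # Backward pass: each drop must be published at least one old TTL
    # (plus a 1 s margin) before the next drop or the change itself.
    schedule = []
    lead = 0
    for old_ttl, new_ttl in reversed(steps):
        lead += old_ttl + 1
        schedule.append((lead, new_ttl))
    schedule.reverse()
    return schedule


# Going from 1 hour to 5 minutes:
# [(6754, 1800), (3153, 900), (1352, 450), (451, 300)]
print(ttl_lowering_schedule(3600, 300))
```

So for a one-hour TTL the first drop has to go out a little under two hours before the planned change.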
Am I missing something, or is the reason most of the queries observed have low TTL because, well, they have a low TTL? IOW, the higher TTL responses would be cached downstream and so you'd see them less often. If that is the case, the distribution shown is not all that surprising.
It's weird how people are not understanding this: perhaps it's the way you phrased it. Or perhaps you forgot to mention the core part from the article: the experiment was only run "for a few hours". This means that many a DNS record (well, most) with a TTL greater than the experiment duration would not show up in the data.
FWIW, I've learned in the past that while there are plenty of people who claim to want communication to be as succinct as possible, the majority are unable to understand when somebody is really terse (while still saying exactly enough). I've learned to follow up such a terse statement with examples and longer explanations for the majority that does not get it.
But maybe it's just that people don't expect mathematics-level precision on the internet :)
Anycast addresses change all the time. Ask Google, Microsoft, Amazon, Akamai, Cloudflare and so on if you don't believe me. About the only anycast IPs that don't change are public DNS resolvers, but that's true of unicast resolvers as well.
By that logic BGP is a layer 7 load balancer since it has an application layer. BGP only exchanges layer 3 reachability information to update route tables therefore you can only load balance layer 3 with it.
This is true in that the range encapsulates most keep-alive timeouts, not in that keep-alives longer than a minute or two are actually the majority. nginx defaults to 100 seconds, Apache is less than that. Most don't mess with these, let alone bump them to 900. Generally 60 to 120 is considered standard, with some cap on the number of active keep-alive sessions as well. Some go ultra-low or disable it altogether; very few go ultra-high.
> WTF is your timeout?
Also please try to keep the conversations together.
Your argument about keep alive makes no sense. You're confusing the nginx documentation. 100 is the number of default connections it will hold open. 60 is the default timeout. Also it's sent as an HTTP response to the client in the headers...
Meant "TTLs" as in the plural of TTL but my phone capitalized the whole block on me.
> Your argument about keep alive makes no sense. You're confusing the nginx documentation. 100 is the number of default connections it will hold open. 60 is the default timeout. Also it's sent as an HTTP response to the client in the headers... Here you need to read up: http://nginx.org/en/docs/http/ngx_http_core_module.html#keep...
You're right about the 100 being the default active keepalives, not the default timeout. According to your own nginx link, 75 is the timeout, not 60, though: "Default:
Either way, 75/60/100/120 are significantly far off from 15 minutes.
> The timeout only matters if you're not making requests.
Or if the server reaches max connections.
> Setting a low keep alive will actually result in more DNS requests...doh
Which is how the discussion on DNS TTLs comes about in the first place. It's trivial to set the DNS TTL astronomically higher than the HTTP keep-alive, in which case the browser & OS won't actually make a lookup request since it's cached.
One use case where short TTLs make sense is running a service on a residential network, where a power outage or router reboot can trigger an IP address change. With a short TTL, if the IP address changes you won't be offline for too long.
Yes, it is not exactly great, but at least it works well enough for self-hosting services.