Random DNS problem?

I'm looking for troubleshooting suggestions. I have a Perl script on another VPS that runs every night and uses Net::FTP to transfer a zip file to my Linode.

This has been working perfectly every night for over a year.

In the last couple of weeks, it has started to fail about half the time. The script reports "Net::FTP: Bad hostname 'www.MYLINODE.com'", where MYLINODE.com is a domain with an A record pointing to my Linode. The domain works fine all day long (I also have Apache, WordPress, and email working fine on it). I am using Linode's nameservers for my DNS (ns1.linode.com, ns2.linode.com, etc.), and the TTL for my "www" record is 1 hour. On the remote side I am using my other VPS provider's DNS servers.

It seems that in the middle of the night (the job ran at 1am PT, and I recently tried switching it to 3am PT), my other VPS sometimes can't resolve a DNS name that points to my Linode. I can't tell whether the remote DNS server is unresponsive, my Linode is down, or nsX.linode.com is just not responding at that time of night. The script only tries once, since this has never been an issue until now.

I could simply plug in the Linode's static IP address to the script, but I kind of want to know why this is failing on principle. I'm also too old to stay up until 1am and do any "live" troubleshooting. Every time I run the script manually during the day, it works fine with no errors. I can't reproduce this from 7am to 11pm. I am running CSF/LFD firewall on both hosts, but that doesn't explain the random nature of this failure (and both IP addresses are whitelisted on each box anyway).

Any suggestions on where to start narrowing this down?

12 Replies

It is tough to troubleshoot when you are using a third-party recursive server. Sometimes DNS issues like this will go away if you try more than once, since the response to the original request comes in and gets cached by the recursive server after you have already timed out.

To see whether your linode DNS is causing the problems, you might run one of these before your script (they both bypass your recursive DNS server and talk directly to the authoritative nameservers):

````
dig +trace www.MYLINODE.com
dig +nssearch www.MYLINODE.com
````

(You might have to install dig)
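
Since the failure only shows up at night, it may help to capture the dig output to a file from cron just before the transfer runs. The following is a rough Perl sketch, assuming dig is installed on the remote VPS. The `log_command` helper and the log path are made-up names for illustration, not part of the original script.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Run a command (passed as a list, so no shell is involved) and append its
# output to $logfile under a timestamped header line.
# Returns the command's exit status, or -1 if it could not be started.
sub log_command {
    my ($logfile, @cmd) = @_;
    open(my $log, '>>', $logfile) or die "Cannot open $logfile: $!";
    print {$log} strftime('=== %Y-%m-%d %H:%M:%S %Z ', localtime), "@cmd ===\n";

    my $status = -1;
    if (open(my $out, '-|', @cmd)) {
        print {$log} $_ while <$out>;
        close($out);              # close() on a pipe sets $? to the exit status
        $status = $? >> 8;
    } else {
        print {$log} "could not run $cmd[0]: $!\n";
    }
    close($log);
    return $status;
}

# Hypothetical usage, run right before the FTP transfer:
# log_command('/tmp/dns-check.log', 'dig', '+trace',    'www.MYLINODE.com');
# log_command('/tmp/dns-check.log', 'dig', '+nssearch', 'MYLINODE.com');
```

The next morning the log shows whether the authoritative nameservers answered at the time the script ran, and how long each took.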

Thanks Stever,

I will give that a try. I think you meant 'nosearch' rather than 'nssearch'. I haven't studied Net::FTP much to see how long it waits for a reply, etc. I also thought about just adding a quick sleep-and-try-again routine if the first connection fails. I'm guessing any of these might help.

It's more about "why did it start failing and then only some of the time". I can't stand things like that. :) I'd love to be able to point to one thing and say "here's what is happening and how to fix it".

@haus:

> I think you meant 'nosearch' rather than 'nssearch'.

Nope, meant it exactly as it was typed. 'nssearch' hits ALL the authoritative nameservers and tells you how long they took to respond.

@man dig:

> +[no]nssearch
>
> When this option is set, dig attempts to find the authoritative name servers for the zone containing the name being looked up and display the SOA record that each name server has for the zone.

````
$ dig +nssearch linode.com
SOA ns1.linode.com. dns.linode.com. 2010122118 7200 3600 604800 86400 from server ns3.linode.com in 33 ms.
SOA ns1.linode.com. dns.linode.com. 2010122118 7200 3600 604800 86400 from server ns4.linode.com in 36 ms.
SOA ns1.linode.com. dns.linode.com. 2010122118 7200 3600 604800 86400 from server ns1.linode.com in 71 ms.
SOA ns1.linode.com. dns.linode.com. 2010122118 7200 3600 604800 86400 from server ns2.linode.com in 103 ms.
SOA ns1.linode.com. dns.linode.com. 2010122118 7200 3600 604800 86400 from server ns5.linode.com in 113 ms.

````

Never mind, I didn't look hard enough. I see it now. Sorry!

When I do subdomain.MYLINODE.com I get nothing back. When I do MYLINODE.com I get results like the ones you posted.

I'm going to be obvious here and point out that neither mylinode.com nor subdomain.mylinode.com is an extant domain. As hoopycat would point out, it's difficult to help you when you're giving us fake information.

I'm not willing to provide my real domain name to an open troubleshooting forum.

Obviously there's a chance this is a configuration issue specific to the domain in question, but since this is a new intermittent failure I'm guessing it relates more to a broader issue (something changed elsewhere beyond my immediate control), particularly as I haven't made any DNS changes in over 6 months for any of the VPSes or domains in question.

Anyway, I "solved" the issue by adding a quick loop in my script that tries the FTP connection up to 5 times with a short break in between. Last night it failed on the first try and then succeeded on the second attempt. So thanks again to Stever for the suggestions and helping me learn something new.
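
A loop like that might look roughly like the following. This is only a sketch, not the poster's actual code: the `retry` helper, the 10-second pause, and the 30-second timeout are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Call $attempt->() up to $max_tries times, sleeping $delay seconds after
# each failure except the last. Returns the first true result, or undef
# if every try fails.
sub retry {
    my ($attempt, $max_tries, $delay) = @_;
    for my $try (1 .. $max_tries) {
        my $result = $attempt->();
        return $result if $result;
        # Net::FTP->new leaves its error message in $@ on failure.
        warn "Attempt $try of $max_tries failed: $@\n";
        sleep $delay if $try < $max_tries;
    }
    return;
}

# Hypothetical usage with Net::FTP (host name is the placeholder from above):
# use Net::FTP;
# my $ftp = retry(sub { Net::FTP->new('www.MYLINODE.com', Timeout => 30) }, 5, 10);
# die "FTP connection failed after 5 tries: $@\n" unless $ftp;
```

Logging which attempt finally succeeded also gives a rough record of how often the first lookup fails.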

If I manage to sort out the issue for real someday I'll post the solution, but for now this will work.

@haus:

> I'm not willing to provide my real domain name to an open troubleshooting forum.
>
> Obviously there's a chance this is a configuration issue specific to the domain in question, but since this is a new intermittent failure I'm guessing it relates more to a broader issue (something changed elsewhere beyond my immediate control), particularly as I haven't made any DNS changes in over 6 months for any of the VPSes or domains in question.

Yes, but it also means that nobody else can reproduce the problem on their own linode.

Anyone can test this. It's a Perl script. Plug in values for $ftp_hostname, $ftp_port, and $ftp_passive, and run from cron or the command line.

````
#!/usr/local/bin/perl
use strict;
use warnings;

use Net::FTP;

my $ftp_hostname = '';   # ftp host name
my $ftp_port     = '21'; # typical value
my $ftp_passive  = 0;    # change to 1 for passive mode

# new() returns undef on failure and leaves the reason in $@
my $ftp = Net::FTP->new($ftp_hostname, Port => $ftp_port, Passive => $ftp_passive);

print "Content-type: text/html\n\n";
if (!$ftp) {
    print "FTP connection failed: $@";
} else {
    print "FTP connection successful.";
}
exit;
````

That's just a snippet pulled from my original code, which would result in a "bad hostname" error about half the time when run in the wee hours of the morning. Again just for clarity, this script is running on a different host, trying to connect via FTP to my linode.

The script only runs once per day, so I suspect a DNS cache is involved. That might explain why it works all day long when I try it at the command line: by then the lookup has already completed and been cached for the day, even though the nightly run may not have waited long enough for the first query to finish.

Out of curiosity, if you are giving the server a public DNS name or public IP address, then does it really matter whether you are posting in an open forum? The server is already public. Not posting the DNS name is simply security through obscurity.

Obscurity doesn't provide security but it does preclude identity.

If it's on another box, it could be the DNS server used on that box is flaky. You may want to try changing the DNS server to something else (such as Google's Public DNS at 8.8.8.8 or 8.8.4.4) and seeing if the problem still occurs.

Yes, that's a great idea. Thank you. I'm also going to learn how to do more detailed logging of DNS queries on that box so hopefully I can get more info than just "bad hostname".
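
One simple way to get more than "bad hostname" is to time the lookup itself and log the result right before the FTP attempt. The sketch below uses only core Perl modules. The function names `timed_lookup` and `lookup_log_line` are invented for illustration, and `gethostbyname` only covers IPv4, so treat it as a starting point rather than a complete diagnostic.

```perl
use strict;
use warnings;
use POSIX qw(strftime);
use Socket qw(inet_ntoa);
use Time::HiRes qw(time);

# Resolve $host through the system resolver and time how long it takes.
# gethostbyname returns an empty list on failure.
sub timed_lookup {
    my ($host) = @_;
    my $start = time();
    my ($name, $aliases, $addrtype, $length, @packed) = gethostbyname($host);
    my $elapsed = time() - $start;
    my @addresses = map { inet_ntoa($_) } @packed;
    return {
        host      => $host,
        ok        => scalar(@addresses) ? 1 : 0,
        addresses => \@addresses,
        seconds   => $elapsed,
    };
}

# Format one lookup result as a single timestamped log line.
sub lookup_log_line {
    my ($r) = @_;
    my $stamp  = strftime('%Y-%m-%d %H:%M:%S', localtime);
    my $result = $r->{ok} ? join(',', @{ $r->{addresses} }) : 'FAILED';
    return sprintf("%s %s -> %s (%.3f s)\n",
        $stamp, $r->{host}, $result, $r->{seconds});
}

# Hypothetical usage, just before the FTP connection:
# print STDERR lookup_log_line(timed_lookup('www.MYLINODE.com'));
```

A slow success followed by a fast one would point at the recursive server's cache. This only records the outcome and latency as seen by the local resolver, not which upstream server answered.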
