Linode suddenly stuck on SYN_RECV for most requests

Hi everyone,

My server suddenly started timing out on every request from my local network yesterday.

I'm pretty inexperienced in networking and would love to learn a process for debugging these connectivity issues.

What confuses me is that yesterday, some people (my phone, me at home, friends at home) could consistently access the site, and netstat showed those connections as established. I disabled firewalls and set iptables to accept all connections to rule out any automatic rules blacklisting our IP. I'm not sure if it's relevant, but a traceroute from the local network times out, while traceroutes from some machines outside it reach my server.

My phone can reach 69.164.201.172, but nobody on this network can. Judging by the server load charts, I'm guessing my network isn't the only one locked out.

I can SSH into this linode from my other linode, but will get the SYN_RECV timeout if I attempt to ssh into my linode from my local network.
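For comparing vantage points like this, a small connect test can help. The following is only a minimal sketch, assuming Python 3 is available on each machine; `probe` is a hypothetical helper name, not part of any tool mentioned here. It makes one TCP connection attempt and reports whether it connected, was refused, or timed out.

```python
import socket

def probe(host, port, timeout=3.0):
    """Make one TCP connection attempt and report how it ended."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"   # three-way handshake completed
    except socket.timeout:
        return "timeout"         # no usable reply arrived within the timeout
    except ConnectionRefusedError:
        return "refused"         # the host answered, but nothing is listening
    except OSError as exc:
        return "error: %s" % exc # e.g. no route to host
```

Running something like `probe("69.164.201.172", 80)` and `probe("69.164.201.172", 22)` from the home network and from the other linode gives a quick side-by-side. A "timeout" from one location and "connected" from another points at the path between them rather than at Apache or sshd.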

I've confirmed various settings are correct by comparing to the settings on my development server.

The following files and command output match my dev environment (except for their respective IP addresses):

/etc/hosts
/etc/hosts.allow
/etc/hosts.deny
/etc/network/interfaces
ifconfig

Apache is listening on port 80 and the setup looks exactly the same as my functioning server.

# server that doesn't work:
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN      22008/apache2
tcp        0      0 69.164.201.172:80       71.56.137.10:57487      SYN_RECV    -

# server that does work
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN      3334/apache2
tcp        0      0 72.14.189.46:80         71.56.137.10:57490      ESTABLISHED 20931/apache2

Here's my attempt at understanding what is happening…

Every time I load the page, netstat -an | grep :80 shows all connections in the SYN_RECV state.

tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN
tcp        0      0 69.164.201.172:80       71.56.137.10:56657      SYN_RECV
tcp        0      0 69.164.201.172:80       71.56.137.10:56669      SYN_RECV
tcp        0      0 69.164.201.172:80       71.56.137.10:56671      SYN_RECV

So SYN_RECV means the server has received a SYN, replied with a SYN-ACK, and is waiting for the client to send back the final ACK.
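To watch this over time instead of eyeballing the listing, the netstat output can be tallied by state. This is a rough sketch, assuming the column layout shown above (protocol, two queue columns, local address, remote address, state); `count_states` is a made-up name for illustration.

```python
from collections import Counter

def count_states(netstat_text, port=80):
    """Tally TCP states for connections whose local port matches `port`."""
    states = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Expected columns: proto recv-q send-q local remote state [pid/program]
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, state = fields[3], fields[5]
        if local.rsplit(":", 1)[-1] == str(port):
            states[state] += 1
    return states

sample = """\
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN
tcp        0      0 69.164.201.172:80       71.56.137.10:56657      SYN_RECV
tcp        0      0 69.164.201.172:80       71.56.137.10:56669      SYN_RECV
tcp        0      0 69.164.201.172:80       71.56.137.10:56671      SYN_RECV
"""
print(count_states(sample))
```

If the SYN_RECV count keeps growing while ESTABLISHED stays at zero, handshakes are starting but never finishing.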

How do I debug whether an ACK is being sent back?

How do I debug where this communication is failing?

Here's what a tcpdump looks like when I attempt to load the page once.

What does this mean? That the client isn't getting the response? Or perhaps I'm swallowing the response somewhere in the server? How do I know to narrow down the culprit further?

`tcpdump -i eth0 -n -tttt port 80
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
2011-05-25 20:12:54.627417 IP 71.56.137.10.57160 > 69.164.201.172.80: Flags [S], seq 382527960, win 8192, options [mss 1460,nop,wscale 2,nop,nop,sackOK], length 0
2011-05-25 20:12:54.627512 IP 69.164.201.172.80 > 71.56.137.10.57160: Flags [S.], seq 1330600505, ack 382527961, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 6], length 0
2011-05-25 20:12:54.814463 IP 69.164.201.172.80 > 71.56.137.10.57157: Flags [S.], seq 604630211, ack 496040070, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 6], length 0
2011-05-25 20:12:55.214482 IP 69.164.201.172.80 > 71.56.137.10.57158: Flags [S.], seq 998358186, ack 2224730755, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 6], length 0
2011-05-25 20:12:57.624737 IP 71.56.137.10.57160 > 69.164.201.172.80: Flags [S], seq 382527960, win 8192, options [mss 1460,nop,wscale 2,nop,nop,sackOK], length 0
2011-05-25 20:12:57.624793 IP 69.164.201.172.80 > 71.56.137.10.57160: Flags [S.], seq 1330600505, ack 382527961, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 6], length 0
2011-05-25 20:12:59.014477 IP 69.164.201.172.80 > 71.56.137.10.57160: Flags [S.], seq 1330600505, ack 382527961, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 6], length 0
2011-05-25 20:13:03.618790 IP 71.56.137.10.57160 > 69.164.201.172.80: Flags [S], seq 382527960, win 8192, options [mss 1460,nop,nop,sackOK], length 0
2011-05-25 20:13:03.618866 IP 69.164.201.172.80 > 71.56.137.10.57160: Flags [S.], seq 1330600505, ack 382527961, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 6], length 0
2011-05-25 20:13:05.014514 IP 69.164.201.172.80 > 71.56.137.10.57160: Flags [S.], seq 1330600505, ack 382527961, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 6], length 0
2011-05-25 20:13:17.014504 IP 69.164.201.172.80 > 71.56.137.10.57160: Flags [S.], seq 1330600505, ack 382527961, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 6], length 0`

tcpdump for the functional server:
`00:00:00.000000 IP 71.56.137.10.57260 > 72.14.189.46.80: Flags [S], seq 34114118s [mss 1460,nop,wscale 2,nop,nop,sackOK], length 0
00:00:00.000110 IP 72.14.189.46.80 > 71.56.137.10.57260: Flags [S.], seq 2454858 win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 5], length 0
00:00:00.061827 IP 71.56.137.10.57260 > 72.14.189.46.80: Flags [.], ack 1, win 1
00:00:00.004292 IP 71.56.137.10.57260 > 72.14.189.46.80: Flags [P.], seq 1:597, ngth 596
00:00:00.000074 IP 72.14.189.46.80 > 71.56.137.10.57260: Flags [.], ack 597, win
00:00:00.493990 IP 72.14.189.46.80 > 71.56.137.10.57260: Flags [.], seq 1:2921, ngth 2920
00:00:00.000024 IP 72.14.189.46.80 > 71.56.137.10.57260: Flags [P.], seq 2921:30, length 98
00:00:00.065135 IP 71.56.137.10.57260 > 72.14.189.46.80: Flags [.], ack 3019, wi
00:00:00.034766 IP 71.56.137.10.57260 > 72.14.189.46.80: Flags [P.], seq 597:12925, length 699
00:00:00.000035 IP 72.14.189.46.80 > 71.56.137.10.57260: Flags [.], ack 1296, wi
00:00:00.000457 IP 72.14.189.46.80 > 71.56.137.10.57260: Flags [P.], seq 3019:328, length 211
00:00:00.019196 IP 71.56.137.10.57262 > 72.14.189.46.80: Flags [S], seq 10674886s [mss 1460,nop,wscale 2,nop,nop,sackOK], length 0`

Any suggestions, explanations, or comments would be hugely appreciated so that I can understand TCP a little more and hopefully be a little more useful next time I need to debug a problem like this.

I've gotten some feedback on ServerFault, but the netmask that appears in ifconfig is the same on both servers. 

> The one time I've seen this before it was a strange timing issue. The connections were getting stuck in the half-open state (what SYN_RECV means) and hanging. What ended up being the problem was twofold:
>
> The server had an incorrect netmask (/16 instead of /24)
>
> There were two devices on the server subnet that issued proxy-ARP packets

Can anyone suggest where or how I debug this next? Even reading material to point me in the right direction would help.

My solution at the moment is to just ignore this problem and upgrade the dev linode, but I'd really like to be able to figure this out. 

Thank you so much!

EDIT: Update - the problem has fixed itself... but I wish it hadn't, so I could feel less helpless facing this situation. 

I'd like to be able to get enough information out of my debugging attempts to get a reasonable idea why something is failing. It seems like the fact that it fixed itself without any further input from me means it was outside my control in the first place.

2 Replies

You mentioned traceroute from the local network timed out.

I would guess there was a routing problem between the source location (home) that was experiencing the timeouts and the destination (linode). Disabling iptables is a good start to rule out problems. Then you should run mtr (WinMTR on Windows) from both ends of the connection. That will give you an idea whether it's a routing issue. I have seen cases where traffic was getting to the destination (server), but the replies were not getting back to the workstation because of peering issues, which usually resolve themselves within a reasonable amount of time.
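The tcpdump from the broken server fits that pattern: the client keeps resending the same SYN (same seq) and the server keeps resending its SYN-ACK, which is what you would expect if the SYN-ACK never reaches the client. Here is a rough sketch for pulling that out of a capture, assuming the text format printed by `tcpdump -n` as in the post above; `summarize_handshakes` and `diagnose` are hypothetical names, and the sample lines are copied from that dump.

```python
import re
from collections import defaultdict

# Matches "IP a.b.c.d.port > a.b.c.d.port: Flags [..]" in tcpdump -n output.
LINE = re.compile(
    r"IP (\d+\.\d+\.\d+\.\d+)\.(\d+) > (\d+\.\d+\.\d+\.\d+)\.(\d+): Flags \[([^\]]+)\]"
)

def summarize_handshakes(dump_text, server_port=80):
    """Count SYN, SYN-ACK, and client ACK packets per client ip:port."""
    seen = defaultdict(lambda: {"syn": 0, "synack": 0, "ack": 0})
    for m in LINE.finditer(dump_text):
        src_ip, src_port, dst_ip, dst_port, flags = m.groups()
        if int(dst_port) == server_port:      # client -> server
            key = f"{src_ip}:{src_port}"
            if flags == "S":
                seen[key]["syn"] += 1
            elif "." in flags and "S" not in flags:
                seen[key]["ack"] += 1         # any ACK-bearing packet from the client
        elif int(src_port) == server_port:    # server -> client
            if flags == "S.":
                seen[f"{dst_ip}:{dst_port}"]["synack"] += 1
    return dict(seen)

def diagnose(counts):
    """Turn one client's packet counts into a one-line reading."""
    if counts["ack"] > 0:
        return "handshake completed"
    if counts["synack"] > 0:
        return "server answered but never saw the client's ACK"
    return "server did not answer the SYN"

sample = """\
2011-05-25 20:12:54.627417 IP 71.56.137.10.57160 > 69.164.201.172.80: Flags [S], seq 382527960, win 8192, length 0
2011-05-25 20:12:54.627512 IP 69.164.201.172.80 > 71.56.137.10.57160: Flags [S.], seq 1330600505, ack 382527961, win 14600, length 0
2011-05-25 20:12:54.814463 IP 69.164.201.172.80 > 71.56.137.10.57157: Flags [S.], seq 604630211, ack 496040070, win 14600, length 0
2011-05-25 20:12:57.624737 IP 71.56.137.10.57160 > 69.164.201.172.80: Flags [S], seq 382527960, win 8192, length 0
2011-05-25 20:12:57.624793 IP 69.164.201.172.80 > 71.56.137.10.57160: Flags [S.], seq 1330600505, ack 382527961, win 14600, length 0
"""

for client, counts in summarize_handshakes(sample).items():
    print(client, counts, "->", diagnose(counts))
```

The options fields are trimmed from the sample lines to keep them short. Running the same capture on the client side at the same time would show whether the SYN-ACKs arrive there at all.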

Once you have ruled out networking issues, you can start looking at the application layer.

Travis

Hi everyone,

My server is also experiencing a similar issue. I get many SYN_RECV, SYN_SENT, and CLOSE_WAIT states. But currently, I'm seeing many SYN_RECV.

This appears to be due to normal traffic and not a denial of service.

example:

$ netstat -an | grep :80
tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN
tcp 0 0 173.255.221.179:80 66.249.68.147:40942 SYN_RECV
tcp 0 0 173.255.221.179:80 124.115.0.169:58194 SYN_RECV
tcp 0 0 173.255.221.179:80 220.181.108.111:9299 SYN_RECV
tcp 0 0 173.255.221.179:80 220.181.94.227:62780 SYN_RECV
tcp 0 0 173.255.221.179:80 119.63.196.107:44705 SYN_RECV
tcp 0 0 173.255.221.179:80 173.255.221.179:55539 SYN_RECV
tcp 0 0 173.255.221.179:80 69.171.224.244:59395 SYN_RECV
tcp 0 0 173.255.221.179:80 173.255.221.179:55540 SYN_RECV
tcp 0 0 173.255.221.179:80 173.242.125.206:51632 SYN_RECV
tcp 0 0 173.255.221.179:80 173.242.125.206:43500 SYN_RECV

another example:

$ netstat -an | grep :80
tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN
tcp 1 0 173.255.221.179:80 123.125.67.204:28346 CLOSE_WAIT

Because the SYN_RECV entries don't go away, I think they're exhausting my MaxClients limit, so no new visitors can see my sites.

It seems mainly due to bot crawlers. If so, not sure if there's anything we can do about that. This hasn't happened with other providers.
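One way to check the crawler theory is to group the half-open entries by remote address and see whether a few sources dominate. The snippet below is only a sketch, assuming the same netstat column layout as in the listing above; `half_open_by_remote` is a made-up helper name.

```python
from collections import Counter

def half_open_by_remote(netstat_text, state="SYN_RECV"):
    """Count connections in `state`, grouped by remote IP address."""
    remotes = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Expected columns: proto recv-q send-q local remote state
        if len(fields) >= 6 and fields[0].startswith("tcp") and fields[5] == state:
            remote_ip = fields[4].rsplit(":", 1)[0]
            remotes[remote_ip] += 1
    return remotes

sample = """\
tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN
tcp 0 0 173.255.221.179:80 66.249.68.147:40942 SYN_RECV
tcp 0 0 173.255.221.179:80 124.115.0.169:58194 SYN_RECV
tcp 0 0 173.255.221.179:80 220.181.108.111:9299 SYN_RECV
tcp 0 0 173.255.221.179:80 220.181.94.227:62780 SYN_RECV
tcp 0 0 173.255.221.179:80 119.63.196.107:44705 SYN_RECV
tcp 0 0 173.255.221.179:80 173.255.221.179:55539 SYN_RECV
tcp 0 0 173.255.221.179:80 69.171.224.244:59395 SYN_RECV
tcp 0 0 173.255.221.179:80 173.255.221.179:55540 SYN_RECV
tcp 0 0 173.255.221.179:80 173.242.125.206:51632 SYN_RECV
tcp 0 0 173.255.221.179:80 173.242.125.206:43500 SYN_RECV
"""

for ip, count in half_open_by_remote(sample).most_common():
    print(count, ip)
```

In this listing no single remote address dominates, and two of the half-open entries come from the server's own address (173.255.221.179), which may be worth a closer look.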

** Restarting apache2 does get things going again – if only for a short period of time. For now, I've increased my MaxClients hoping it's good enough to wait out the

** hoping the problem at least goes away -- but like yuchant -- knowing the root cause is better.

Thanks!
