[RndTbl] sshd: Corrupted MAC on input.

Gilbert E. Detillieux gedetil at cs.umanitoba.ca
Thu Jul 30 10:07:20 CDT 2020


On 2020-07-29 8:31 p.m., Trevor Cordes wrote:
> On 2020-07-29 Gilbert E. Detillieux wrote:
>> What's the likely cause of this?  A bad NIC?  Bad RAM?  (I'm guessing
>> something is corrupting the packets once in a while, but I'm not sure
>> what.  If so, it seems to get past TCP's error correcting.)
> 
> I would try the same type of transfer using a different client to the
> same server.  Then try a different server for the same client.  If you
> can get the same behavior with a different server, that would be
> extremely useful.

This is from a local backup server to an off-site backup.  I can easily 
try a different local server, but won't be able to exactly replicate the 
rsync, though I can try with other large file(s).  As for a different 
remote destination, that's not easily replicated, but I'd at least know 
if the problem is limited to the off-site data path and/or server.

> You could also try using nc from /dev/zero from the server to the
> client into a file, then use a script (or something) to check if the
> file is all zeros.

A script?  Just using "od" would tell me that.  :)
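
For the record, plain "od" works because it collapses runs of identical 
lines into a single "*", so an all-zero file prints only a couple of 
lines.  If a list of the bad offsets is wanted, a minimal sketch along 
these lines would also do, assuming a cmp that supports "-n" (GNU 
diffutils does).  The function name, file name, host name and port are 
made up for illustration, and the nc flags differ between netcat variants.

```shell
# Check that FILE contains only zero bytes.
# Exit status 0 means all zeros.  Otherwise cmp -l prints one line per
# bad byte: its 1-based decimal offset, then the two byte values in octal.
check_zeros() {
    file=$1
    size=$(( $(wc -c < "$file") )) || return 2
    # Compare against /dev/zero, but only for the length of the file.
    cmp -l -n "$size" "$file" /dev/zero
}

# Hypothetical usage (flags vary by netcat implementation):
#   receiver:  nc -l 9000 > zeros.bin
#   sender:    dd if=/dev/zero bs=1024 count=1048576 | nc receiver-host 9000
#   afterwards, on the receiver:  check_zeros zeros.bin
```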

> It would be neat to see the actual corruption that
> occurs.  Make sure nc is using TCP (though UDP would be an interesting
> test as well, but not critical or required).
> 
> You're right that TCP shouldn't really allow such (line) errors to get
> through to the ssh layer.

TCP checksums aren't perfect: the checksum is only 16 bits, so with very 
large transfers there is a real chance of corrupted segments getting 
through if the underlying layers aren't doing their job.  (Normally, the 
Ethernet frame CRC is more likely to weed out bad packets than the TCP 
checksum, but I remember in the days of PPP over dial-up, that TCP 
checksums were often inadequate.  If something in the Ethernet data path 
is letting bad packets through, sshd could be seeing errors that TCP 
misses.)
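
To illustrate the kind of damage the checksum cannot see, here is a 
minimal sketch of the 16-bit ones'-complement sum that TCP uses.  It is 
a simplification: real TCP also covers a pseudo-header and pads 
odd-length data, and the function name and sample words are made up.

```shell
# Internet checksum over 16-bit words given as arguments (e.g. 0x1234).
inet_checksum() {
    sum=0
    for word in "$@"; do
        sum=$(( sum + $word ))
    done
    # Fold any carries back into the low 16 bits (ones'-complement add).
    while [ $(( sum >> 16 )) -ne 0 ]; do
        sum=$(( (sum & 0xFFFF) + (sum >> 16) ))
    done
    # The checksum is the complement of the folded sum.
    printf '%04x\n' $(( ~sum & 0xFFFF ))
}

inet_checksum 0x1234 0x5678 0x9abc   # original data
inet_checksum 0x5678 0x1234 0x9abc   # two words swapped
inet_checksum 0x1235 0x5677 0x9abc   # one word +1, another word -1
```

All three calls print the same value, so reordered words or errors that 
cancel out in the sum pass unnoticed.  A MAC such as the one ssh uses 
does catch these, which fits with sshd being the layer that complains.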

> If your NIC has TCP checksum offloading, try turning it off (ethtool is
> what I used to use for that, not sure if it's still "the way").  That
> will eliminate the NIC and bus from the equation, leaving you with
> RAM/CPU and/or mobo between the two (but not out to the cards/bridge).
> 
> If you turn off offloading and the problem goes away, your transfer
> performance should tank because it'll be doing TCP retries each time.

Good suggestion.  This is an onboard Intel NIC, and on another server, I 
had to do this...

# Prevent Intel e1000e hangs/resets due to buggy GSO, GRO and TSO.
# As suggested here...
# https://serverfault.com/questions/616485/e1000e-reset-adapter-unexpectedly-detected-hardware-unit-hang

ethtool -K em1 gso off gro off tso off

It's a different chipset here, and I'm not seeing this specific error, 
but it could be something chipset-related anyway.
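
For the checksum offload test suggested above, something like the 
following sketch should do.  The helper names are made up and "em1" is 
only a placeholder interface name.  Settings changed with "ethtool -K" 
do not survive a reboot, so this is easy to back out.

```shell
# Show the current checksum and segmentation/receive offload settings.
show_offloads() {
    ethtool -k "$1" | grep -E 'checksumming|segmentation-offload|receive-offload'
}

# Turn off RX and TX checksum offload, so the kernel computes and
# verifies the checksums itself instead of trusting the NIC.
disable_csum_offload() {
    ethtool -K "$1" rx off tx off
}

# Usage (as root):
#   show_offloads em1
#   disable_csum_offload em1
```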

> My guess, as always, is... wait for it... bad caps on the board, likely
> near the NIC slot, or, if onboard, near the NIC onboard chip.  I've had
> weird NIC behavior before and it's always turned out to be the caps
> near the card slot, usually 1000uf little jobbers.
> 
> I just decommissioned my main workstation I used since 2008(!) that was
> starting to get occasional VGA lockups, and lo and behold, the caps
> near the slots were just starting to get puffy (on a very high end
> Intel board).  I'll be repairing them soon to repurpose the system.
> 
> P.S. If a repair or replacement isn't possible for a while, sometimes
> moving the NIC as far away from the puffiest caps can help for a while
> until more caps go bad.  Each 1 or 2 slots usually gets its own cap(s).
> Also, putting in a junkier NIC might help if it draws less power.
> These cap problems are always exacerbated by higher (transient/peak)
> power draws.

I had thought of just putting in a network card, and disabling the 
onboard NIC, but I didn't want to do that until I was sure it was the 
NIC and not something software related or MB related.  And since this is 
an off-site system (albeit still on campus), I have to coordinate with 
someone else who's normally working from home these days.

So, looking for things I can test remotely, at the moment...

> Keep us posted!

Will do.

Gilbert

-- 
Gilbert E. Detillieux        E-mail:  <gedetil at cs.umanitoba.ca>
Dept. of Computer Science    Web:     http://www.cs.umanitoba.ca/~gedetil/
University of Manitoba       Phone:   (204)474-8161
Winnipeg MB CANADA  R3T 2N2  Fax:     (204)474-7609

