Sudden application crash in servers

Hi,

This weekend an application crashed suddenly on the server.
I did not know where to start investigating, so I first looked at /var/adm/syslog/syslog.log, and this is what I found:

Dec 17 00:38:02 L28bi01 sshd[126]: error: accept: No buffer space available
Dec 17 00:38:02 L28bi01 sshd[24333]: error: setsockopt SO_KEEPALIVE: Invalid argument
Dec 17 00:38:07 L28bi01 sshd[24379]: error: setsockopt SO_KEEPALIVE: Invalid argument
Dec 17 00:38:21 L28bi01 sshd[24445]: error: PAM: No account present for user for illegal user UlGLXBTX from 10.61.1.55
Dec 17 00:38:21 L28bi01 sshd[24447]: error: PAM: No account present for user for illegal user anonymous from 10.61.1.55
Dec 17 00:38:26 L28bi01 sshd[24511]: error: PAM: No account present for user for illegal user guest from 10.61.1.55
Dec 17 00:38:27 L28bi01 sshd[24515]: error: PAM: No account present for user for illegal user IyoYLEnT from 10.61.1.55
Dec 17 00:38:28 L28bi01 sshd[24517]: error: PAM: No account present for user for illegal user shelladmin from 10.61.1.55
Dec 17 00:38:31 L28bi01 sshd[24524]: error: PAM: Authentication failed for root from 10.61.1.55
Dec 17 00:38:31 L28bi01 sshd[24525]: error: PAM: No account present for user for illegal user netscreen from 10.61.1.55
Dec 17 00:38:33 L28bi01 sshd[24528]: error: PAM: No account present for user for illegal user admin from 10.61.1.55
Dec 17 00:38:38 L28bi01 sshd[24534]: error: PAM: Authentication failed for root from 10.61.1.55
Dec 17 00:38:58 L28bi01 sshd[24542]: error: PAM: No account present for user for illegal user admin1 from 10.61.1.55
Dec 17 00:39:06 L28bi01 sshd[24552]: error: PAM: No account present for user for illegal user admin from 10.61.1.55
Dec 17 00:39:18 L28bi01 sshd[24561]: error: PAM: No account present for user for illegal user emailswitch from 10.61.1.55
Dec 17 00:39:22 L28bi01 sshd[24584]: error: PAM: No account present for user for illegal user product from 10.61.1.55
Dec 17 00:39:23 L28bi01 sshd[24599]: error: PAM: No account present for user for illegal user admin from 10.61.1.55
Dec 17 00:39:27 L28bi01 sshd[24621]: error: PAM: Authentication failed for root from 10.61.1.55
Dec 17 00:39:29 L28bi01 sshd[24626]: error: PAM: No account present for user for illegal user n3ssus from 10.61.1.55
Dec 17 00:39:31 L28bi01 sshd[24632]: error: PAM: Authentication failed for root from 10.61.1.55
Dec 17 00:41:01 L28bi01 sshd[126]: error: accept: No buffer space available
Dec 17 00:41:01 L28bi01 sshd[25366]: error: setsockopt SO_KEEPALIVE: Invalid argument
Dec 17 00:41:55 L28bi01 sshd[26128]: error: PAM: No account present for user for illegal user cisco from 10.61.1.55
Dec 17 00:42:00 L28bi01 sshd[26134]: error: PAM: No account present for user for illegal user Cisco from 10.61.1.55
Dec 17 00:42:02 L28bi01 sshd[26142]: error: PAM: No account present for user for illegal user admin from 10.61.1.55
Dec 17 00:42:04 L28bi01 sshd[26175]: error: PAM: No account present for user for illegal user  from 10.61.1.55
Dec 17 00:42:10 L28bi01 sshd[26254]: error: PAM: No account present for user for illegal user manage from 10.61.1.55
Dec 17 00:42:15 L28bi01 sshd[26273]: error: PAM: No account present for user for illegal user monitor from 10.61.1.55
Dec 17 00:42:19 L28bi01 sshd[26280]: error: PAM: No account present for user for illegal user ftp from 10.61.1.55
Dec 17 00:42:54 L28bi01 sshd[26792]: error: PAM: No account present for user for illegal user Fortimanager_Access from 10.61.1.55
Dec 17 00:42:54 L28bi01 sshd[26791]: error: PAM: No account present for user for illegal user nessus_oJgOWh46 from 10.61.1.55
Dec 17 00:42:56 L28bi01 sshd[26791]: error: PAM: No account present for user for illegal user nessus_oJgOWh46 from 10.61.1.55
Dec 17 00:43:27 L28bi01 sshd[26926]: error: setsockopt SO_KEEPALIVE: Invalid argument

The error that seems most related to this problem is "No buffer space available".
When I googled this error, I found no solid solution. Some say it is memory pressure, and some say to check the kernel value "tcp_conn_request_max", but I do not see this value on the server at all.

However, the application logs show these errors:

File: data.c, Line: 2963, Time: 2017.12.17 00:36:56, RC: -23
Text:    CL_receive_message failed
Error during 'read'
  System error: Connection timed out

File: data.c, Line: 2963, Time: 2017.12.17 00:37:46, RC: -23
Text:    CL_receive_message failed
Error during 'read'
  System error: Connection timed out

File: data.c, Line: 2963, Time: 2017.12.17 00:37:46, RC: -23
Text:    CL_receive_message failed
Error during 'read'
  System error: Connection timed out

File: data.c, Line: 825, Time: 2017.12.17 00:38:52, RC: -28
Text:
Connection between client and server was terminated

File: data.c, Line: 918, Time: 2017.12.17 00:38:52, RC: -28
Text:
Connection between client and server was terminated

File: data.c, Line: 3564, Time: 2017.12.17 00:43:27, RC: -20
Text:
Socket option error
  System error: Invalid argument

File: dta_ids.c, Line: 4027, Time: 2017.12.17 00:43:27, RC: 0
Text:    DaTA shutting down: ids clients finished

File: dta_ids.c, Line: 4052, Time: 2017.12.17 00:43:28, RC: 0
Text:    DaTA shutting down: std clients finished

File: dta_ids.c, Line: 4078, Time: 2017.12.17 00:43:31, RC: 0
Text:    DaTA shutting down: file queues synchronized

Could this be a network issue?
How do I investigate this problem? I need to find the root cause (RCA). Please help.
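
One possible starting point for the root-cause analysis is to put both logs on a single timeline. The Python sketch below is only an illustration, not a tested tool. It assumes the syslog stamps carry no year (2017 is taken from the application log), and the function names are made up for this example. The sample lines are copied from the excerpts in this post.

```python
import re
from datetime import datetime

# Syslog line: "Dec 17 00:38:02 host sshd[126]: message" (the stamp has no year).
SYSLOG_RE = re.compile(r"^(\w{3} +\d+ \d\d:\d\d:\d\d) \S+ (\S+?)(?:\[\d+\])?: (.*)$")
# Application log header: "File: data.c, Line: 2963, Time: 2017.12.17 00:36:56, RC: -23"
APPLOG_RE = re.compile(
    r"^File: (\S+), Line: (\d+), Time: (\d{4}\.\d\d\.\d\d \d\d:\d\d:\d\d), RC: (-?\d+)"
)


def parse_syslog(lines, year):
    """Turn syslog lines into (timestamp, 'syslog', text) tuples."""
    events = []
    for line in lines:
        m = SYSLOG_RE.match(line)
        if m:
            ts = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
            events.append((ts, "syslog", f"{m.group(2)}: {m.group(3)}"))
    return events


def parse_applog(lines):
    """Turn application log entries into (timestamp, 'app', text) tuples."""
    events = []
    for line in lines:
        m = APPLOG_RE.match(line)
        text = line.strip()
        if m:
            ts = datetime.strptime(m.group(3), "%Y.%m.%d %H:%M:%S")
            events.append([ts, "app", f"{m.group(1)}:{m.group(2)} RC={m.group(4)}"])
        elif events and text and text != "Text:":
            # Continuation line: attach it to the most recent header.
            if text.startswith("Text:"):
                text = text[len("Text:"):].strip()
            events[-1][2] += " | " + text
    return [tuple(e) for e in events]


def merged_timeline(syslog_lines, applog_lines, year):
    """Merge both sources and sort by timestamp (stable for equal stamps)."""
    events = parse_syslog(syslog_lines, year) + parse_applog(applog_lines)
    return sorted(events, key=lambda e: e[0])


syslog_sample = [
    "Dec 17 00:38:02 L28bi01 sshd[126]: error: accept: No buffer space available",
    "Dec 17 00:43:27 L28bi01 sshd[26926]: error: setsockopt SO_KEEPALIVE: Invalid argument",
]
applog_sample = """File: data.c, Line: 2963, Time: 2017.12.17 00:36:56, RC: -23
Text:    CL_receive_message failed
Error during 'read'
  System error: Connection timed out
File: data.c, Line: 3564, Time: 2017.12.17 00:43:27, RC: -20
Text:
Socket option error
  System error: Invalid argument
""".splitlines()

timeline = merged_timeline(syslog_sample, applog_sample, year=2017)
for ts, source, text in timeline:
    print(ts.isoformat(sep=" "), f"{source:<6}", text)
```

In the excerpts posted here, the first application timeouts (00:36:56) come before the first sshd error (00:38:02), and the application's "Socket option error" at 00:43:27 falls in the same second as an sshd SO_KEEPALIVE failure. That may point to both processes hitting the same shortage, but the full logs would be needed to say so.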

Q:
Are you using (at your site, that is) a "security" device that "scans" for possible security issues?

Possibly the system has run out of network buffer resources. Has the load/number of users been steadily increasing?

Network configurations determine the number of buffers available for network packets (of various different sizes) arriving and departing, and also the maximum number of connections.

Exactly, and some security scanning devices, though said to be "non-intrusive", manage to get you into such embarrassing situations... Why do these messages occur?

No account present for user for illegal user admin from 10.61.1.55

What is behind that IP address?...
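
To get a picture of what that address was doing, the PAM lines can be tallied per source address and user name. This is a rough Python sketch. The function names are made up, the "scanner-like" check is only a heuristic based on the user names seen in the posted log, and the sample lines are copied from the syslog excerpt in the first post.

```python
import re
from collections import Counter, defaultdict

# The two sshd/PAM message shapes seen in the posted syslog excerpt.
ILLEGAL_RE = re.compile(r"No account present for user for illegal user (\S*) from (\S+)")
FAILED_RE = re.compile(r"Authentication failed for (\S+) from (\S+)")


def tally_attempts(lines):
    """Return {source address: Counter of attempted user names}."""
    per_source = defaultdict(Counter)
    for line in lines:
        m = ILLEGAL_RE.search(line) or FAILED_RE.search(line)
        if m:
            user, source = m.groups()
            # One logged attempt has an empty user name; keep it visible.
            per_source[source][user or "<empty>"] += 1
    return per_source


def scanner_hints(users):
    """User names that look like scanner probes (heuristic, an assumption)."""
    return sorted(u for u in users if "nessus" in u.lower() or "n3ssus" in u.lower())


sample = [
    "Dec 17 00:38:21 L28bi01 sshd[24445]: error: PAM: No account present for user for illegal user UlGLXBTX from 10.61.1.55",
    "Dec 17 00:38:31 L28bi01 sshd[24524]: error: PAM: Authentication failed for root from 10.61.1.55",
    "Dec 17 00:38:33 L28bi01 sshd[24528]: error: PAM: No account present for user for illegal user admin from 10.61.1.55",
    "Dec 17 00:38:38 L28bi01 sshd[24534]: error: PAM: Authentication failed for root from 10.61.1.55",
    "Dec 17 00:39:06 L28bi01 sshd[24552]: error: PAM: No account present for user for illegal user admin from 10.61.1.55",
    "Dec 17 00:39:29 L28bi01 sshd[24626]: error: PAM: No account present for user for illegal user n3ssus from 10.61.1.55",
    "Dec 17 00:42:04 L28bi01 sshd[26175]: error: PAM: No account present for user for illegal user  from 10.61.1.55",
    "Dec 17 00:42:54 L28bi01 sshd[26791]: error: PAM: No account present for user for illegal user nessus_oJgOWh46 from 10.61.1.55",
]

tally = tally_attempts(sample)
for source, users in tally.items():
    print(source, sum(users.values()), "attempts")
    for user, count in users.most_common():
        print(f"  {user}: {count}")
    print("  scanner-like names:", scanner_hints(users))
```

User names such as "n3ssus" and "nessus_oJgOWh46" suggest, but do not prove, that the attempts come from a vulnerability scanner rather than a person typing passwords.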

You need to tell us all what hardware this is, what O/S you are running and which version. Give us a clue!!

EDIT: Okay, you've posted to the HP-UX forum, got that much.

Tons of login attempts for various users would make me very suspicious, even though they seem to come from the local area network, and all the more so as they include four failed attempts for root.
Identify the machine that the attempts come from and check it for malware.

Hi,

I have seen almost exactly this before, it was a product called "Foundstone" - the Wintel team had deployed this straight out of the box and it caused mayhem on the Unix Estate.

I would think that this is one of two things.

  1. You have a security breach and you're going to have a significant issue.
  2. Or from what I can see there is some kind of Scanner running and it needs to be configured or stopped before it starts locking accounts and causing other issues.

I would be tempted to speak to the other teams and find out what has changed. Also watch for it happening on a regular basis (weekly, monthly, etc.).
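
Watching for a regular pattern can be done from the same syslog. The Python sketch below is a hypothetical helper that counts sshd PAM errors per calendar day and reports the weekday, so a weekly or monthly rhythm would stand out once more days of log are fed in. Syslog stamps carry no year, so the year is passed in as an assumption (2017 here, taken from the application log timestamps). The sample lines are copied from the posted excerpt.

```python
import re
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Only sshd PAM error lines are counted; group 1 is the "Mon DD" part.
STAMP_RE = re.compile(r"^(\w{3} +\d+) \d\d:\d\d:\d\d \S+ sshd\[\d+\]: error: PAM:")


def pam_errors_per_day(lines, year):
    """Count sshd PAM error lines per calendar day."""
    days = Counter()
    for line in lines:
        m = STAMP_RE.match(line)
        if m:
            day = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d").date()
            days[day] += 1
    return days


def busy_days(days, threshold):
    """Days with at least `threshold` PAM errors, as (date, weekday, count)."""
    return [(d.isoformat(), WEEKDAYS[d.weekday()], n)
            for d, n in sorted(days.items()) if n >= threshold]


sample = [
    "Dec 17 00:38:02 L28bi01 sshd[126]: error: accept: No buffer space available",
    "Dec 17 00:38:21 L28bi01 sshd[24447]: error: PAM: No account present for user for illegal user anonymous from 10.61.1.55",
    "Dec 17 00:38:26 L28bi01 sshd[24511]: error: PAM: No account present for user for illegal user guest from 10.61.1.55",
    "Dec 17 00:38:31 L28bi01 sshd[24524]: error: PAM: Authentication failed for root from 10.61.1.55",
]

result = busy_days(pam_errors_per_day(sample, year=2017), threshold=3)
for date, weekday, count in result:
    print(date, weekday, count)
```

If the same weekday or day of the month keeps showing up, that would fit a scheduled scan better than a one-off intrusion.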

It could also be that someone has got something to evaluate and doesn't understand the implications.

As advised earlier in the thread, find the machine and beat the user up - you have an excuse!

Regards

Gull04

Hi,

Thanks all for the tons of replies and suggestions.

It is possible that the load/number of users has been steadily increasing.
What do you mean by network configurations? Which file?
By maximum connections, do you mean the maximum number of ssh connections allowed? I believe it is still at the system default, which is 60.
A week or two earlier, there were a few occasions when users were not able to log in to the server because of too many ssh connections from one user which had root (0) privilege.

Machine model : HP RX7640 ia64
OS version : HPUX B11.23
CPU : 4 Physical, 8 Core
Memory : 128 G

I am not aware of any security scanning devices used on the server.

There was nothing in particular happening on the server at the time of the crash.

The scanner is not on the server but somewhere on your network (internal, I hope), or it may be an attack, as mentioned. Either way, there seems to be something like such a device at the IP address I pointed out. Check it!

Years ago I had HP-UX crashes about once a month, until I decided to write to management saying that those "non-intrusive" devices were anything but, and were crashing HP servers and some mainframe devices, mostly the trusted ones that had or used NFS. The main reason is that the scanner opens a great many connections. It does not care about that itself (an MS OS?), but on a UNIX server the timeouts are regularly over 5 minutes, so an opened port cannot be reused until it is cleaned up. In such cases you quickly run out, and then no one can connect, not even root, and you are doomed... Once I proved where it came from, the HP servers were taken off the scanning list. They have since changed the system, but I no longer have any HP servers either. If you have NFS mounted on that server and the scan manages to make it unreadable, then your system, depending on what is running, will keep trying desperately to read, and its load will climb out of control until the system crashes...
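
If connections really are piling up like this, it should be visible as many sockets lingering in states such as TIME_WAIT. One rough way to check is to save the output of `netstat -an` and count the TCP states and remote hosts. The Python sketch below assumes each `tcp` line has six columns with the remote address in the fifth ("address.port") and the state in the last. The sample text is invented for illustration (192.0.2.10 is a placeholder for the server) and is not output from this machine.

```python
from collections import Counter


def tcp_rows(netstat_text):
    """Yield the whitespace-split fields of each TCP line."""
    for line in netstat_text.splitlines():
        fields = line.split()
        # Assumed layout: proto, recv-q, send-q, local, remote, state.
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            yield fields


def count_tcp_states(netstat_text):
    """Count connections per TCP state (last column)."""
    return Counter(fields[-1] for fields in tcp_rows(netstat_text))


def connections_per_remote(netstat_text):
    """Count connections per remote host, dropping the trailing '.port'."""
    return Counter(fields[4].rsplit(".", 1)[0] for fields in tcp_rows(netstat_text))


# Invented sample in the assumed format, for illustration only.
sample = """\
Proto Recv-Q Send-Q  Local Address          Foreign Address        (state)
tcp        0      0  192.0.2.10.22          10.61.1.55.51001       TIME_WAIT
tcp        0      0  192.0.2.10.22          10.61.1.55.51002       TIME_WAIT
tcp        0      0  192.0.2.10.22          10.61.1.55.51003       ESTABLISHED
tcp        0      0  *.22                   *.*                    LISTEN
"""

states = count_tcp_states(sample)
remotes = connections_per_remote(sample)
print("states:", dict(states))
print("top remote hosts:", remotes.most_common(3))
```

Running this on snapshots taken a few minutes apart would show whether the number of lingering sockets keeps growing and which remote host they belong to.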

Hi,

As "vbe" and I have already said, track down the machine/user combination at "10.61.1.55" - if this is not a machine on your LAN/WAN, it is some kind of intrusion.

If it is on your network, find out what its function is. Even if you resolve the issues locally on your HP-UX system, it is likely that they will return if this machine carries on doing what it is doing.

Regards

Gull04