I am facing a situation where, when there are fewer users, I am able to log in to the AIX server. As the number of users increases, the login prompt is delayed, and sometimes the login even times out. This started after the upgrade to AIX 7.1 TL 4. Can someone suggest a way to overcome this?
How are you logging in?
What is the number of users on the system when you can no longer login reliably?
What is the load average on the system when you can no longer login reliably?
Has the amount of work assigned to this system changed since the upgrade? If so, how?
Maybe, but to do so we need to know more about the system:
How many is an "increased number" of users? (no exact number required, but are we talking about 10, 100 or 1000 users here?)
What is the size of the system? How many processors? How much memory? Have you tried running
vmstat -tw 1
at a time when many users are logged in? Also,
vmstat -vs
might help. If so, please post the significant portions of the output here (within CODE tags).
How is the name resolution done? We once had "network problems" where the network itself was working fine, but the name server took a long time to answer, so connections seemed to be established slowly. Overall, can you describe your network a bit (in case the culprit is there)?
How is your system used by the many users you describe? Are they using a certain application, working on the command line mainly, do they only connect via the application layer (like a database running on the server and your users use DB Visualizer or a similar tool to connect to it)?
This is - off the top of my head - all I can come up with. Please provide some more data and we will take it from there.
I hope this helps.
bakunin
/PS: seems like I am not typing fast enough to keep up with Don.
About 250 users were logged in when this issue was happening. The connection is over SSH. We use a GUI terminal (xterm) to log in to the servers.
Network does not seem to be an issue.
System configuration: lcpu=64 mem=256511MB
kthr  memory            page                    faults            cpu             time
----- ----------------- ----------------------- ----------------- --------------- --------
 r  b      avm      fre  re  pi  po  fr  sr  cy   in     sy    cs  us  sy  id  wa hr mi se
 0  0  4749214  4687647   0   0   0   0   0   0    6   2970   760   0   0  99   0 00:19:28
 0  0  4750341  4686520   0   0   0   0   0   0   18  35490   893   0   1  99   0 00:19:29
 0  0  4749218  4687643   0   0   0   0   0   0    2  14121   815   0   0  99   0 00:19:30
 0  0  4749218  4687638   0   0   0   0   0   0   17   3871   836   0   0  99   0 00:19:31
vmstat -vs
4465310913 total address trans. faults
294360283 page ins
49324316 page outs
0 paging space page ins
0 paging space page outs
0 total reclaims
1465526223 zero filled pages faults
60683 executable filled pages faults
114267450 pages examined by clock
0 revolutions of the clock hand
55792932 pages freed by the clock
13653791 backtracks
0 free frame waits
0 extend XPT waits
52515800 pending I/O waits
343680275 start I/Os
80649413 iodones
3850688645 cpu context switches
228964899 device interrupts
44816910 software interrupts
2057467216 decrementer interrupts
505680 mpc-sent interrupts
505680 mpc-receive interrupts
230520 phantom interrupts
0 traps
262745045277 syscalls
65666912 memory pages
63779457 lruable pages
4687825 free pages
8 memory pools
4365673 pinned pages
90.0 maxpin percentage
3.0 minperm percentage
90.0 maxperm percentage
88.0 numperm percentage
56179798 file pages
0.0 compressed percentage
0 compressed pages
88.0 numclient percentage
90.0 maxclient percentage
56171829 client pages
0 remote pageouts scheduled
1190 pending disk I/Os blocked with no pbuf
0 paging space I/Os blocked with no psbuf
2400 filesystem I/Os blocked with no fsbuf
44434 client filesystem I/Os blocked with no fsbuf
25130 external pager filesystem I/Os blocked with no fsbuf
7.3 percentage of memory used for computational pages
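For anyone reading the numbers above: vmstat on AIX counts memory in 4 KB pages, so the page counts can be turned into megabytes with a little arithmetic. A minimal sketch follows; the helper name pages_to_mb is made up for illustration, and the three values are copied from the vmstat -vs output above.

```shell
#!/bin/sh
# Convert a count of 4 KB pages (as printed by vmstat) to whole megabytes.
# pages_to_mb is a hypothetical helper name, not an AIX command.
pages_to_mb() {
    echo $(( $1 * 4 / 1024 ))
}

# Values copied from the vmstat -vs output above.
echo "memory pages: $(pages_to_mb 65666912) MB"
echo "free pages:   $(pages_to_mb 4687825) MB"
echo "file pages:   $(pages_to_mb 56179798) MB"
```

The first line comes out as 256511 MB, which matches the mem=256511MB in the vmstat header. Roughly 219 GB of that is file cache and about 18 GB is free, which fits with the 88.0 numperm percentage sitting just under the 90.0 maxperm percentage.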
---------- Post updated 02-03-17 at 07:17 AM ---------- Previous update was 02-02-17 at 09:20 PM ----------
This is the output right now, while the problem is happening:
vmstat -vs
4704754850 total address trans. faults
295484675 page ins
50773171 page outs
0 paging space page ins
0 paging space page outs
0 total reclaims
1548595741 zero filled pages faults
62036 executable filled pages faults
114267450 pages examined by clock
0 revolutions of the clock hand
55792932 pages freed by the clock
14409311 backtracks
0 free frame waits
0 extend XPT waits
52975790 pending I/O waits
346252431 start I/Os
81917039 iodones
3938511790 cpu context switches
233154812 device interrupts
46845701 software interrupts
2125584724 decrementer interrupts
528400 mpc-sent interrupts
528400 mpc-receive interrupts
242521 phantom interrupts
0 traps
264369806068 syscalls
65666912 memory pages
63779457 lruable pages
3413598 free pages
8 memory pools
4381020 pinned pages
90.0 maxpin percentage
3.0 minperm percentage
90.0 maxperm percentage
89.8 numperm percentage
57312540 file pages
0.0 compressed percentage
0 compressed pages
89.8 numclient percentage
90.0 maxclient percentage
57304175 client pages
0 remote pageouts scheduled
1235 pending disk I/Os blocked with no pbuf
0 paging space I/Os blocked with no psbuf
2400 filesystem I/Os blocked with no fsbuf
44434 client filesystem I/Os blocked with no fsbuf
25136 external pager filesystem I/Os blocked with no fsbuf
7.5 percentage of memory used for computational pages
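Since vmstat -vs prints cumulative counters, the two listings are easier to compare as differences. The following is only a sketch of one way to do that: it assumes each output has been saved to a file, and the function name vmstat_delta and the file names before.txt and after.txt are made up for illustration. The sample lines are copied from the two outputs above.

```shell
#!/bin/sh
# vmstat_delta BEFORE AFTER   (hypothetical helper, not an AIX command)
# Print every counter whose value differs between two saved "vmstat -vs" outputs.
vmstat_delta() {
    awk '
        # Return the text that follows the leading number on the current line.
        function label(   s) { s = $0; sub(/^[ \t]*[0-9.]+[ \t]+/, "", s); return s }

        $1 !~ /^[0-9.]+$/ { next }                  # skip lines that are not "number label"
        FILENAME == ARGV[1] { before[label()] = $1; next }
        {
            name = label()
            if ((name in before) && ($1 + 0 != before[name] + 0))
                printf "%-60s %+.1f\n", name, $1 - before[name]
        }
    ' "$1" "$2"
}

# Demo with a few lines copied from the two outputs above.
dir=${TMPDIR:-/tmp}/vmstat_delta.$$
mkdir -p "$dir"
cat > "$dir/before.txt" <<'EOF'
4687825 free pages
88.0 numperm percentage
1190 pending disk I/Os blocked with no pbuf
2400 filesystem I/Os blocked with no fsbuf
25130 external pager filesystem I/Os blocked with no fsbuf
EOF
cat > "$dir/after.txt" <<'EOF'
3413598 free pages
89.8 numperm percentage
1235 pending disk I/Os blocked with no pbuf
2400 filesystem I/Os blocked with no fsbuf
25136 external pager filesystem I/Os blocked with no fsbuf
EOF
vmstat_delta "$dir/before.txt" "$dir/after.txt"
rm -rf "$dir"
```

Counters that did not move (the 2400 fsbuf line here) are left out. For these sample lines the free pages drop by 1274227, numperm rises by 1.8 points, and the pbuf and external pager counters grow by 45 and 6.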
Please use code tags, not icode tags; just press the button.
A debugging tip: you can trace sshd.
# proctree | grep -w sshd
258180 /usr/sbin/sshd a
274564 sshd: root@pts/0 a
Then trace the originator/listener, i.e. pid 258180:
# truss -f -p 258180
And then do a login. You'll see what it does, and where it takes a long time.
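To make the slow spot easier to find, the trace can be timestamped and written to a file instead of the terminal. This is a sketch only: it assumes the -d (timestamp each line) and -o (output file) flags of AIX truss, and the wrapper name trace_sshd_login and the output file names are made up for illustration.

```shell
#!/bin/sh
# trace_sshd_login PID [OUTFILE]   (hypothetical wrapper, not an AIX command)
# Attach truss to the listening sshd, follow the children it forks for the
# new login (-f), timestamp every line (-d) and save the trace to a file (-o).
# Stop it with Ctrl-C once the slow login has finished.
trace_sshd_login() {
    pid=$1
    out=${2:-/tmp/sshd_truss.$pid}    # hypothetical default file name
    truss -d -f -o "$out" -p "$pid"
}

# Example, using the listener pid from the proctree output above:
# trace_sshd_login 258180 /tmp/sshd_login.truss
```

In the saved file, a large jump between two consecutive timestamps points at the call that is taking the time.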
Further investigation has shown the delays are occurring when the loginsuccess() function, called by sshd, tries to acquire a lock on the file /etc/security/lastlog.
Any idea what could be causing this, or where to look for a solution?