How to troubleshoot a 1000-node Apache cluster?

Hi all.

May I get some expert advice on troubleshooting performance issues on a 1000-node load-balanced Apache cluster? Users report that web pages load and respond slowly. The cluster hosts different websites for different clients, but all of them are reporting the same issue.

Could you please let me know which basic aspects you would take into consideration for this sort of issue?

FYI: I do not have access to the load balancers.

The load balancer is a likely choke point, in terms of both bandwidth and CPU utilization. If bandwidth is the limit, consider adding links, for example through interface bonding. If CPU utilization is the limit, consider running multiple load balancers (much as google.com does) so the load is spread across several physical machines. I'd recommend the latter anyway, for availability reasons.
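To check whether bandwidth is the limit on a machine you can log in to, one option is to sample the interface byte counters twice and compare the rate with the link speed. The following is a minimal sketch, assuming a Linux host that exposes /proc/net/dev; the interface name and link speed in the usage comment are placeholders, not values from this thread.

```python
def parse_counters(text, iface):
    """Return (rx_bytes, tx_bytes) for iface from /proc/net/dev contents."""
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name.strip() == iface:
            fields = rest.split()
            # Field 0 is received bytes, field 8 is transmitted bytes.
            return int(fields[0]), int(fields[8])
    raise ValueError(f"interface {iface!r} not found")


def utilization(before, after, seconds, link_bits_per_sec):
    """Fraction of link capacity used in the busier direction."""
    rx_rate = (after[0] - before[0]) * 8 / seconds
    tx_rate = (after[1] - before[1]) * 8 / seconds
    return max(rx_rate, tx_rate) / link_bits_per_sec


# Example use (assumed interface "eth0" on a 1 Gbit/s link):
#   import time
#   a = parse_counters(open("/proc/net/dev").read(), "eth0")
#   time.sleep(5)
#   b = parse_counters(open("/proc/net/dev").read(), "eth0")
#   print(f"{utilization(a, b, 5, 1_000_000_000):.1%} of link capacity")
```

A value that stays close to 1.0 during the slow periods would point at link saturation; a low value suggests looking elsewhere.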

Before anyone can give you specific advice, though, you need to localize the performance issue. For example, if you run a JMeter test against the serving nodes directly, is that faster? If it is, the load balancer is the choke point. If not, check whether the network links are saturated using a command-line bandwidth monitoring tool (I use bwm-ng). If that's not the issue, move on to CPU, memory, etc. Once you've done that, you should be in a better position to do something about the performance problems.
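If JMeter is not at hand, the same comparison can be roughed out with a short script: time a few requests through the load balancer, time the same requests against one serving node directly, and compare the medians. This is only a sketch under assumptions: the URLs and Host header in the usage comment are hypothetical, and the 1.5x ratio is an arbitrary threshold, not a recommended value.

```python
import statistics
import time
import urllib.request


def sample_latency(url, host_header=None, n=20, timeout=10):
    """Fetch url n times and return the elapsed seconds of each request."""
    samples = []
    for _ in range(n):
        req = urllib.request.Request(url)
        if host_header:
            # Lets a request sent to a node's IP select the right virtual host.
            req.add_header("Host", host_header)
        start = time.perf_counter()
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            resp.read()
        samples.append(time.perf_counter() - start)
    return samples


def p95(samples):
    """Nearest-rank 95th percentile of the samples."""
    ordered = sorted(samples)
    return ordered[(95 * len(ordered) + 99) // 100 - 1]


def likely_bottleneck(via_lb, direct, ratio=1.5):
    """Guess where the delay is by comparing median latencies."""
    if statistics.median(via_lb) > ratio * statistics.median(direct):
        return "load balancer path"
    return "serving nodes or backend"


# Example use (hypothetical addresses):
#   via_lb = sample_latency("http://www.example.com/")
#   direct = sample_latency("http://10.0.0.12/", host_header="www.example.com")
#   print(likely_bottleneck(via_lb, direct), p95(via_lb), p95(direct))
```

Run it from a machine on the same network as the nodes, so that the direct measurement does not pick up unrelated WAN latency.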

It's also possible that the latency is at the application level. For example, if a particular website is served by a particular DB cluster and that cluster is slow, it will slow down the end user's experience.
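One way to see whether the slowness follows particular sites is to group Apache's own request timings by virtual host. The sketch below assumes a custom LogFormat whose first field is %v (virtual host name) and whose last field is %D (time taken to serve the request, in microseconds); that layout is an assumption for illustration, so adjust the field positions to whatever format the cluster actually logs.

```python
import statistics
from collections import defaultdict


def median_ms_by_vhost(lines):
    """Map each virtual host to its median request time in milliseconds.

    Assumed line layout: "<%v> ... <%D>", i.e. vhost first, microseconds last.
    """
    durations = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or not fields[-1].isdigit():
            continue  # skip lines that do not match the assumed format
        durations[fields[0]].append(int(fields[-1]) / 1000)
    return {vhost: statistics.median(ms) for vhost, ms in durations.items()}


# Example use (hypothetical log path):
#   with open("/var/log/apache2/access.log") as f:
#       for vhost, ms in sorted(median_ms_by_vhost(f).items()):
#           print(f"{vhost}: {ms:.0f} ms")
```

If only some virtual hosts show high medians, their shared backend (such as a DB cluster) is a reasonable suspect; if every host is slow by a similar amount, the cause is more likely common infrastructure.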