I have an opportunity to improve the performance of batch processing that sends out thousands of e-mails per day to inform customers (yes, real people spending money!) that their order has been picked, packed, dispatched or whatever. The logic in the code that notifies them works on blocks of customer orders. It fires e-mails as it detects processing is moving on and waits to confirm that the email has been accepted by our internal mail-relay server. There are multiple queues of work being processed so there can be multiple email requests being sent to the mail relay server at the same time. The batch waiting is a business requirement and I'm not sure I can change that.
Both the application server and the mail-relay server are CentOS 6
The problem I have seems to be on the mail relay side. According to the logs (and I'm not sure I'm reading them correctly) the records per email transaction is as follows:-
Jun 19 13:17:31 client=unknown[application server IP]
Jun 19 13:17:54 message-id=<2082491272.62178179332185502.JavaMail.wasuser@server.with.our.internal.suffix>
Jun 19 13:17:54 warning:
Jun 19 13:17:54 from=<orders@my-company-email.domain>,
Jun 19 13:17:55 enabling
Jun 19 13:17:56 to=<customer-email@email.provider.domain>,
Jun 19 13:17:56 removed
This sanitised records is an extract from /var/log/maillog and is fields 1, 2, 3 & 7 based on field 6 being a unique reference number, in this case EA866605CA I've grabbed only these fields as the rest are markers such as what wrote the message etc. I can provide the full (sanitised) output if that is needed but I'm hoping it's not.
The problem I see is that only about 30% of the mail is being processed within 1 second. The above example has a delay of 23 seconds between the first two records. As a cumulative frequency graph, after 30% getting processed within 1 second (or actually under 2 seconds I guess) to get to 31% we have to count everything under 18 seconds. Half is accepted under 22 seconds and 98% under 28 seconds. The volumes don't seem to matter if this is a few days worth of logs, or tied to a specific hour - it's a 24-hour operation so there are no real quiet periods.
The mail is sporadic and I can find no pattern between multiple requests and a slow-down at all. Some running by themselves are slow and some running in parallel fly through.
I have been digging to find what I would consider a normal performance bottleneck or contention (CPU is 88% idle at worst, swap is unused, eth0 is ca. 20Kbps for both in- and out-bound etc.) and I'm very confused where to look next
There are various delays on actually sending the email out of the company, but I think that the application will have carried on by that point, so I'm not too worried about that side. Sadly, the server has been build with the minimum two single filesystems ( / & /boot ) which isn't great but it's not full (currently 46% of 7.5Gb used)
Can anyone suggest where I should be looking next? I would very much like a simple guide because I've only worked with mail as either a client or defining which server is the relay before.
Many thanks, in advance,
Robin