Accepting mail from batch is slow

rbatte1 · July 15, 2016, 9:36am

I have an opportunity to improve the performance of batch processing that sends out thousands of e-mails per day to inform customers (yes, real people spending money!) that their order has been picked, packed, dispatched or whatever. The logic in the code that notifies them works on blocks of customer orders. It fires e-mails as it detects processing is moving on and waits to confirm that the email has been accepted by our internal mail-relay server. There are multiple queues of work being processed so there can be multiple email requests being sent to the mail relay server at the same time. The batch waiting is a business requirement and I'm not sure I can change that.

Both the application server and the mail-relay server are CentOS 6

The problem I have seems to be on the mail relay side. According to the logs (and I'm not sure I'm reading them correctly) the records per email transaction is as follows:-

Jun 19 13:17:31 client=unknown[application server IP]
Jun 19 13:17:54 message-id=<2082491272.62178179332185502.JavaMail.wasuser@server.with.our.internal.suffix>
Jun 19 13:17:54 warning:
Jun 19 13:17:54 from=<orders@my-company-email.domain>,
Jun 19 13:17:55 enabling
Jun 19 13:17:56 to=<customer-email@email.provider.domain>,
Jun 19 13:17:56 removed

This sanitised records is an extract from /var/log/maillog and is fields 1, 2, 3 & 7 based on field 6 being a unique reference number, in this case EA866605CA I've grabbed only these fields as the rest are markers such as what wrote the message etc. I can provide the full (sanitised) output if that is needed but I'm hoping it's not.

The problem I see is that only about 30% of the mail is being processed within 1 second. The above example has a delay of 23 seconds between the first two records. As a cumulative frequency graph, after 30% getting processed within 1 second (or actually under 2 seconds I guess) to get to 31% we have to count everything under 18 seconds. Half is accepted under 22 seconds and 98% under 28 seconds. The volumes don't seem to matter if this is a few days worth of logs, or tied to a specific hour - it's a 24-hour operation so there are no real quiet periods.

The mail is sporadic and I can find no pattern between multiple requests and a slow-down at all. Some running by themselves are slow and some running in parallel fly through.

I have been digging to find what I would consider a normal performance bottleneck or contention (CPU is 88% idle at worst, swap is unused, eth0 is ca. 20Kbps for both in- and out-bound etc.) and I'm very confused where to look next

There are various delays on actually sending the email out of the company, but I think that the application will have carried on by that point, so I'm not too worried about that side. Sadly, the server has been build with the minimum two single filesystems ( / & /boot ) which isn't great but it's not full (currently 46% of 7.5Gb used)

Can anyone suggest where I should be looking next? I would very much like a simple guide because I've only worked with mail as either a client or defining which server is the relay before.

Many thanks, in advance,
Robin

RudiC · July 15, 2016, 10:04am

A DNS problem, mayhap? Timing out when target name not found? Is it always the same sender and/or recipient and/or application server?

rbatte1 · July 15, 2016, 10:16am

Thank you for your thoughts RudiC,

I can confirm that it is always the same sender because we force the sender name.

We have considered DNS, but the frequency seems too high to be mistakes so the mail won't deliver (besides we just need to get it into the queue) and if it was a general problem I wouldn't think we should get any processed really quickly. The nearest I have on DNS is that the mail is from application generating the mail is an active-active cluster and what appears in the log is the boot IP address sending mail in, not a DNS name. Perhaps it is trying to reverse lookup, but the time is not consistent so leads me away from that being the issue. DNS query timeout has not yet been adjusted as people are not keen to fiddle with it, but it's still an option if we can be fairly certain.

Kind regards,
Robin

RudiC · July 15, 2016, 10:33am

What made me suspicious is the client=unknown in direct combination with the 23 sec delay. Is that a syslog excerpt? Could you post a good (full) example along with a bad one, best from the same originator?

RudiC · July 15, 2016, 10:48am

Searching the net for client=unknown , I found this (in German), this, and the postfix.org site saying

BTW, are you running postfix ?