Remove Duplicates on multiple Key Columns and get the Latest Record from Date/Time Column

vijaykodukula · April 23, 2013, 5:39am

Hi Experts ,

we have a CDC file where we need to get the latest record of the Key columns
Key Columns will be CDC_FLAG and SRC_PMTN_I
and fetch the latest record from the CDC_PRCS_TS

Can we do it with a single awk command.
Please help.

CDC_PRCS_TS|CDC_SEQ_I|CDC_FLAG|CDC_UPDT_USER|SRC_PMTN_I|TGT_PMTN_I|PMTN_N|PMTN_DESC_T
2013-03-27 10:32:30|0|I|NOT SET   |124|215|PROO-2:PMI PROJECT|INITIAL PHASE
2013-03-27 10:32:31|0|I|NOT SET   |124|215|PROO-2:PMI PROJECT|INITIAL PHASE
2013-03-27 10:32:32|0|I|NOT SET   |124|215|PRMO-2:PMI PROJECT|INITIAL PHASE
2013-03-27 10:32:33|0|I|NOT SET   |124|215|PROO-2:PMI PROJECT|INITIAL PHASE
2013-03-27 10:32:30|0|U|NOT SET   |125|216|PROO-2:PMI PROJECT|INITIAL PHASE
2013-03-27 10:32:31|0|U|NOT SET   |125|216|PROO-2:PMI PROJECT|INITIAL PHASE
2013-03-27 10:32:32|0|U|NOT SET   |125|216|PROO-2:PMI PROJECT|INITIAL PHASE
2013-03-27 10:32:33|0|U|NOT SET   |125|216|PROO-2:PMI PROJECT|INITIAL PHASE
2013-03-27 10:32:30|0|U|NOT SET   |125|216|PROO-2:PMI PROJECT|INITIAL PHASE
2013-03-27 10:32:30|0|I|NOT SET   |126|217|PROO-2:PMI PROJECT|INITIAL PHASE
2013-03-27 10:32:30|0|U|NOT SET   |127|768|MDS PROJECT|UAT PHASE
2013-03-27 10:32:30|0|U|NOT SET   |128|454|MDS PROJECT|UAT PHASE
2013-03-27 10:32:30|0|U|NOT SET   |129|234|PROJECT|UAT PHASE
2013-03-27 10:32:30|0|D|NOT SET   |130|123|PROMO-2:PMI PROJECT|INITIAL PHASE
2013-03-27 10:32:30|0|D|NOT SET   |131|212|PROO-2:PMI PROJECT|INITIAL PHASE
2013-03-27 10:32:30|0|D|NOT SET   |132|213|PROO-2:PMI PROJECT|INITIAL PHASE

Don_Cragun · April 25, 2013, 7:55pm

Can we assume that CDC_FLAG will always be the 3rd field, SRC_PMTN_I will always be the 5th field, and CDC_PRCS_TS will always be the 1st field; or do we have to match the strings against the header line to determine which fields to use?

Do the output records need to be in the same order as they appeared in the input file or can the output be in random order except that the 1st output line must be the 1st input line (the headings)?

vijaykodukula · April 26, 2013, 12:03am

Hi Don,

We have multiple files .Each file has different layout and has its own awk command to identify the fields for the Key columns.This is one sample.

Advance thanks a lot for the help.

we are using this for one of the sample file.

Regards

Don_Cragun · April 26, 2013, 1:01am

Hi,
Please use CODE tags (rather than the color blue) to display sample input, output, and code.

You didn't really answer my questions. If you're trying to get a single script that can handle any of your files, how will that script know which fields need to be used to select entries, which field will be used to select the latest entry, and which field should be used to sort the output?

Is it always fields with the headers CDC_FLAG and SRC_PMTN_I as keys, always the field with the header CDC_PRCS_TS to select a line out of all lines with matching keys, and always sort the output with CDC_SEQ_I as the primary sort key and CDC_PRCS_TS as the secondary sort key?

Do you have a file that contains filenames that will be processed and the associated field numbers (or names) from which the above data can be extracted by the script?

The sample input file you showed is not sorted in the order that you said should be the output sort order. It seems to be sorted with field 5 as the primary sort key and field 1 as the secondary sort key, not field 2 as the primary sort key. Is this a mistake in your specification or is the script supposed to reorder the output as well as eliminate duplicates?