This is a question that is related to one I had last August when I was trying to sort/merge two files by millsecond time column (in this case column 6).
The script (below) that helped me last august by RudiC solved the puzzle of sorting/merging two files by time, except it gets lost when the day/time changes back to zero (next day).
awk '
NR == 1 {getline HD1 < F1
HD2 = $0
next
}
$6 >= T[6] {do {LAST = TMP
ST = getline TMP < F1
split (TMP, T, FS)
}
while (($6 >= T[6]) && (ST == 1))
if (ST == 0) {LAST = TMP
T[6] = "ZZZ"
}
print HD1
print LAST
print HD2
print
next
}
{print
}
' FS="," F1=file1 file2
Recap: The output file needs to have file1 contents on top of file2 contents (headers included) where file2 column 6 is >= to file1 column 6, and file2 column 6 (same value) is < file1 column 6 (next value).
It works well until the day changes and time restarts, then it gets lost and doesn't match data properly.
Original files (snippets from very large files)
Column 6 = Bold
File1
TIMEFORMATTED G_CCSDS_2HDR_FLAG G_CCSDS_APID G_CCSDS_SEQ_COUNT G_CCSDS_DOY G_CCSDS_MSEC
6/19/2017 23:59:58 1 572 11353 21719 86398214
6/19/2017 23:59:59 1 572 11354 21719 86399214
6/20/2017 0:00:00 1 572 11355 21720 214
6/20/2017 0:00:01 1 572 11356 21720 1214
File2
TIMEFORMATTED CCSDS_2HDR_FLAG CCSDS_APID CCSDS_SEQ_COUNT CCSDS_DOY CCSDS_MSEC
6/19/2017 23:59:58 1 544 6677 21719 86398318
6/19/2017 23:59:59 1 544 6678 21719 86399318
6/20/2017 0:00:00 1 544 6679 21720 318
6/20/2017 0:00:01 1 544 6680 21720 1318
I need it to do this below, so that when the day/time changes, it still needs to match column 6 properly.
You can see how the times from both files match in column 1, but in column 6 (millisecond) the preceding file1 is always less than file2 column 6.
When the day changes to 6/20, it still needs to sort properly like this.
Desired Output
File1 - green
File2 - red
TIMEFORMATTED G_CCSDS_2HDR_FLAG G_CCSDS_APID G_CCSDS_SEQ_COUNT G_CCSDS_DOY G_CCSDS_MSEC
6/19/2017 23:59:58 1 572 11353 21719 86398214
TIMEFORMATTED CCSDS_2HDR_FLAG CCSDS_APID CCSDS_SEQ_COUNT CCSDS_DOY CCSDS_MSEC
6/19/2017 23:59:58 1 544 6677 21719 86398318
TIMEFORMATTED G_CCSDS_2HDR_FLAG G_CCSDS_APID G_CCSDS_SEQ_COUNT G_CCSDS_DOY G_CCSDS_MSEC
6/19/2017 23:59:59 1 572 11354 21719 86399214
TIMEFORMATTED CCSDS_2HDR_FLAG CCSDS_APID CCSDS_SEQ_COUNT CCSDS_DOY CCSDS_MSEC
6/19/2017 23:59:59 1 544 6678 21719 86399318
TIMEFORMATTED G_CCSDS_2HDR_FLAG G_CCSDS_APID G_CCSDS_SEQ_COUNT G_CCSDS_DOY G_CCSDS_MSEC
6/20/2017 0:00:00 1 572 11355 21720 214
TIMEFORMATTED CCSDS_2HDR_FLAG CCSDS_APID CCSDS_SEQ_COUNT CCSDS_DOY CCSDS_MSEC
6/20/2017 0:00:00 1 544 6679 21720 318
TIMEFORMATTED G_CCSDS_2HDR_FLAG G_CCSDS_APID G_CCSDS_SEQ_COUNT G_CCSDS_DOY G_CCSDS_MSEC
6/20/2017 0:00:01 1 572 11356 21720 1214
TIMEFORMATTED CCSDS_2HDR_FLAG CCSDS_APID CCSDS_SEQ_COUNT CCSDS_DOY CCSDS_MSEC
6/20/2017 0:00:01 1 544 6680 21720 1318
This is so difficult for me to explain, sorry!
---------- Post updated 06-29-17 at 01:30 PM ---------- Previous update was 06-28-17 at 03:26 PM ----------
I may have found a workaround by adding column 1 in the conditional checks. That way it not only finds >= in the column 1 time, but still keeps the the lower millisecond number (column 6) with file 1 row preceding file 2 row.
There is probably a better way to compensate for the changing of the day/time back to zero, but this seems to work.
awk '
NR == 1 {getline HD1 < F1
HD2 = $0
next
}
$6 >= T[6] && $1 >= T[1] {do {LAST = TMP
ST = getline TMP < F1
split (TMP, T, FS)
}
while (($6 >= T[6]) && ($1 >= T[1]) && (ST == 1))
if (ST == 0) {LAST = TMP
T[6] = "ZZZ"
}
print HD1
print LAST
print HD2
print
next
}
{print
}
' FS="," F1=file1 file2