Keep only the closet match of timestamped row (include headers) from file1 to precede file2 row/s

aachave1 · August 11, 2016, 1:27pm

My original files are like this below and I distinguish them from the AP_ID (file1 has 572 and file2 has 544). Also, the header on file1 has �G_� pre-pended. NOTE: these are only snippets of very large files and much of the data is not present here.

Original File 1:

TIMEFORMATTED,G_CCSDS_VERSION,G_CCSDS_TYPE,G_CCSDS_2HDR_FLAG,G_CCSDS_APID,G_CCSDS_GRP_FLAGS,G_CCSDS_SEQ_COUNT,G_CCSDS_PKT_LEN,G_CCSDS_DOY,G_CCSDS_MSEC
2014/04/07 16:02:55,0,0,1,572,3,0,1917,20550,57775339
2014/04/07 16:03:00,0,0,1,572,3,0,1917,20550,57780339
2014/04/07 16:03:05,0,0,1,572,3,0,1917,20550,57785339
2014/04/07 16:03:10,0,0,1,572,3,0,1917,20550,57790339
2014/04/07 16:03:15,0,0,1,572,3,0,1917,20550,57795339

Original File 2:    

TIMEFORMATTED,CCSDS_VERSION,CCSDS_TYPE,CCSDS_2HDR_FLAG,CCSDS_APID,CCSDS_GRP_FLAGS,CCSDS_SEQ_COUNT,CCSDS_PKT_LEN,CCSDS_DOY,CCSDS_MSEC
2014/04/07 16:03:12,0,0,1,544,3,0,985,20550,57788894
2014/04/07 16:03:13,0,0,1,544,3,0,985,20550,57793894
2014/04/07 16:03:14,0,0,1,544,3,0,985,20550,57794894
2014/04/07 16:03:15,0,0,1,544,3,0,985,20550,57795894
2014/04/07 16:03:16,0,0,1,544,3,0,985,20550,57796894
2014/04/07 16:03:17, 0,0,1,544,3,0,985,20550,57797894

I sorted/merged with this code below. Note: the �-k21,21� was only used with my very large �real� files.

#!/bin/bash
function f() { awk 'NR==1{h=$0; next} {print $0 "\t" h}' $1; }; sort -t"," -k21,21 <(f file1) <(f file2)  | 
  awk -F'\t' '$2!=p{print $2; p=$2} {print $1}' > temp5

PROBLEM: I only need one row from file1 that is an equal match or nearest match to file2 timestamp/s row/s and precede the file2 row/s (1 file1 row to 1 to many file2 rows). As you can see on this �example�, there are 3 rows (after first header) from file1 that are not needed. I only need the file1 row with timestamp �16:03:10�. So basically I only need the last row from file1 (AP_ID=572) to precede file2 row/s (1 to many). The space is only for readability between matched data.

MY OUTPUT:

TIMEFORMATTED,G_CCSDS_VERSION,G_CCSDS_TYPE,G_CCSDS_2HDR_FLAG,G_CCSDS_APID,G_CCSDS_GRP_FLAGS,G_CCSDS_SEQ_COUNT,G_CCSDS_PKT_LEN,G_CCSDS_DOY,G_CCSDS_MSEC
2014/04/07 16:02:55,0,0,1,572,3,0,1917,20550,57775339
2014/04/07 16:03:00,0,0,1,572,3,0,1917,20550,57780339
2014/04/07 16:03:05,0,0,1,572,3,0,1917,20550,57785339
2014/04/07 16:03:10,0,0,1,572,3,0,1917,20550,57790339
TIMEFORMATTED,CCSDS_VERSION,CCSDS_TYPE,CCSDS_2HDR_FLAG,CCSDS_APID,CCSDS_GRP_FLAGS,CCSDS_SEQ_COUNT,CCSDS_PKT_LEN,CCSDS_DOY,CCSDS_MSEC
2014/04/07 16:03:12,0,0,1,544,3,0,985,20550,57788894
2014/04/07 16:03:13,0,0,1,544,3,0,985,20550,57793894
2014/04/07 16:03:14,0,0,1,544,3,0,985,20550,57794894

TIMEFORMATTED,G_CCSDS_VERSION,G_CCSDS_TYPE,G_CCSDS_2HDR_FLAG,G_CCSDS_APID,G_CCSDS_GRP_FLAGS,G_CCSDS_SEQ_COUNT,G_CCSDS_PKT_LEN,G_CCSDS_DOY,G_CCSDS_MSEC
2014/04/07 16:03:15,0,0,1,572,3,0,1917,20550,57795339
TIMEFORMATTED,CCSDS_VERSION,CCSDS_TYPE,CCSDS_2HDR_FLAG,CCSDS_APID,CCSDS_GRP_FLAGS,CCSDS_SEQ_COUNT,CCSDS_PKT_LEN,CCSDS_DOY,CCSDS_MSEC
2014/04/07 16:03:15,0,0,1,544,3,0,985,20550,57795894
2014/04/07 16:03:16,0,0,1,544,3,0,985,20550,57796894
2014/04/07 16:03:17, 0,0,1,544,3,0,985,20550,57797894

I then ran this below to try and resolve, but it only kept the FIRST file1 row, but not the preferred LAST.
QUESTION: How can I modify this code to keep only the last file1 (AP_ID=572) row?

#!/bin/bash
function f() { awk 'NR==1{h=$0; next} {print $0 "\t" h}' $1; }; sort -t"," -k21,21 <(f file1) <(f file2)  | 
  awk -F'\t' '$2!=p{print $2; p=$2; b++; c=1} !(b%2)||c&&c--{print $1}' > temp5

I hope this isn't too long winded and confusing. Thank you!!

joker · August 11, 2016, 3:07pm

Sorry, but I do not understand what you are writing here at all. Can you explain in detail with an example with your shown data on every sentence what you are trying to achieve?

aachave1 · August 11, 2016, 3:48pm

I'll try to simplify, but not sure if I can.

I have a file1 and file2 that have timestamped rows of data and headers. I need to use the data from file1 as metadata to precede file2 data when sorted/merged (including headers). I can only use the file1 data that the timestamp is equal to or nearest match to file2 rows (below example).

My problem is that I only need ONE row from the file1 data that the timestamp matches or nearest to a file2 timestamped data. There can be many file2 rows, but only one file1 row assigned to them (because it is the closest match timestamp wise).

The rows in RED are the only rows I need displayed because the file1 metadata row timestamp is the nearest match to the file2 rows. All other non-red file1 rows must be deleted.

MY OUTPUT:

TIMEFORMATTED,G_CCSDS_VERSION,G_CCSDS_TYPE,G_CCSDS_2HDR_FLAG,G_CCSDS_APID,G_CCSDS_GRP_FLAGS,G_CCSDS_SEQ_COUNT,G_CCSDS_PKT_LEN,G_CCSDS_DOY,G_CCSDS_MSEC
2014/04/07 16:02:55,0,0,1,572,3,0,1917,20550,57775339
2014/04/07 16:03:00,0,0,1,572,3,0,1917,20550,57780339
2014/04/07 16:03:05,0,0,1,572,3,0,1917,20550,57785339
2014/04/07 16:03:10,0,0,1,572,3,0,1917,20550,57790339
TIMEFORMATTED,CCSDS_VERSION,CCSDS_TYPE,CCSDS_2HDR_FLAG,CCSDS_APID,CCSDS_GRP_FLAGS,CCSDS_SEQ_COUNT,CCSDS_PKT_LEN,CCSDS_DOY,CCSDS_MSEC
2014/04/07 16:03:12,0,0,1,544,3,0,985,20550,57788894
2014/04/07 16:03:13,0,0,1,544,3,0,985,20550,57793894
2014/04/07 16:03:14,0,0,1,544,3,0,985,20550,57794894

TIMEFORMATTED,G_CCSDS_VERSION,G_CCSDS_TYPE,G_CCSDS_2HDR_FLAG,G_CCSDS_APID,G_CCSDS_GRP_FLAGS,G_CCSDS_SEQ_COUNT,G_CCSDS_PKT_LEN,G_CCSDS_DOY,G_CCSDS_MSEC
2014/04/07 16:03:15,0,0,1,572,3,0,1917,20550,57795339
TIMEFORMATTED,CCSDS_VERSION,CCSDS_TYPE,CCSDS_2HDR_FLAG,CCSDS_APID,CCSDS_GRP_FLAGS,CCSDS_SEQ_COUNT,CCSDS_PKT_LEN,CCSDS_DOY,CCSDS_MSEC
2014/04/07 16:03:15,0,0,1,544,3,0,985,20550,57795894
2014/04/07 16:03:16,0,0,1,544,3,0,985,20550,57796894
2014/04/07 16:03:17, 0,0,1,544,3,0,985,20550,57797894

These are very large files and sometimes there will be ONE file1 timestamp match or nearest match to ONE file2 timestamp and sometimes there will be ONE file1 timestamp match or nearest match to MANY file2 timestamps. It all depends on the rate both files were ran because as you can see file1 timestamps are 5 seconds apart, while file2 timestamps are 1 second apart.

joker · August 11, 2016, 4:13pm

0 TIMEFORMATTED,G_CCSDS_VERSION,G_CCSDS_TYPE,G_CCSDS_2HDR_FLAG,G_CCSDS_APID,G_CCSDS_GRP_FLAGS,G_CCSDS_SEQ_COUNT,G_CCSDS_PKT_LEN,G_CCSDS_DOY,G_CCSDS_MSEC 
1 2014/04/07 16:03:15,0,0,1,572,3,0,1917,20550,57795339 
2 TIMEFORMATTED,CCSDS_VERSION,CCSDS_TYPE,CCSDS_2HDR_FLAG,CCSDS_APID,CCSDS_GRP_FLAGS,CCSDS_SEQ_COUNT,CCSDS_PKT_LEN,CCSDS_DOY,CCSDS_MSEC 
3 2014/04/07 16:03:15,0,0,1,544,3,0,985,20550,57795894 
4 2014/04/07 16:03:16,0,0,1,544,3,0,985,20550,57796894 
5 2014/04/07 16:03:17, 0,0,1,544,3,0,985,20550,57797894

To say what I understood:

The green part is from File1 and the blue part is from File2.
Line 1 is the needed Line from File1 for File2

Correct?

Questions

Is Line 1 needed for the complete File2 ore only for Line3 of File2?
Is/Are there another Line(s) from File1 needed for the blue line 4 and 5 of File2 which is closest to the time of the line?

aachave1 · August 11, 2016, 4:34pm

"To say what I understood:"
"The green part is from File1 and the blue part is from File2." Yes, but also the headers are different because the file1 header (green also) has "G_" pre-pended to the field names, while the file2 header does not (blue also).
"Line 1 is the needed Line from File1 for File2" Yes, since it is the closest timestamp match to lines 3, 4, and 5.

"Questions"
"Is Line 1 needed for the complete File2 ore only for Line3 of File2?" It is needed for the complete file2 since line 1 matches or is nearest timestamp to lines 3, 4, & 5.
Is/Are there another Line(s) from File1 needed for the blue line 4 and 5 of File2 which is closest to the time of the line? No, just the one file1 line 1 (plus its header) since it is the nearest timestamp match to 4 & 5. (like the example).

Sorry for the confusion. It is difficult to explain this.

joker · August 11, 2016, 4:43pm

Line 3,4 and 5 of File2 have different Timestamps. What if there are different timestamps nearest to different lines?

Example

0 TIMEFORMATTED,G_CCSDS_VERSION,G_CCSDS_TYPE,G_CCSDS_2HDR_FLAG,G_CCSDS_APID,G_CCSDS_GRP_FLAGS,G_CCSDS_SEQ_COUNT,G_CCSDS_PKT_LEN,G_CCSDS_DOY,G_CCSDS_MSEC 
1 2014/04/07 16:03:15,0,0,1,572,3,0,1917,20550,57795339 
2 2014/04/07 16:03:16,0,0,1,572,3,0,1917,20550,57795339 
3 2014/04/07 16:03:17,0,0,1,572,3,0,1917,20550,57795339  
4 TIMEFORMATTED,CCSDS_VERSION,CCSDS_TYPE,CCSDS_2HDR_FLAG,CCSDS_APID,CCSDS_GRP_FLAGS,CCSDS_SEQ_COUNT,CCSDS_PKT_LEN,CCSDS_DOY,CCSDS_MSEC 
5 2014/04/07 16:03:15,0,0,1,544,3,0,985,20550,57795894  
6 2014/04/07 16:03:16,0,0,1,544,3,0,985,20550,57796894  
7 2014/04/07 16:03:17,0,0,1,544,3,0,985,20550,57797894

Here line 1 timestamp of File1 is matching exactly line 5 of File2.
line 2 / File1 matching line 6 / File2
line 3 / File1 matching line 7 / File2

Which one is correct line to choose from file1 for file2?

No problem. Lots of fun puzzling for a solution

aachave1 · August 11, 2016, 5:00pm

Is this a specific scenario question, because my examples are different?

But, if this were a real output from file1 and file2 of mine, it would look like this below after a proper sort/merge.

0 TIMEFORMATTED,G_CCSDS_VERSION,G_CCSDS_TYPE,G_CCSDS_2HDR_FLAG,G_CCSDS_APID,G_CCSDS_GRP_FLAGS,G_CCSDS_SEQ_COUNT,G_CCSDS_PKT_LEN,G_CCSDS_DOY,G_CCSDS_MSEC 
1 2014/04/07 16:03:15,0,0,1,572,3,0,1917,20550,57795339
4 TIMEFORMATTED,CCSDS_VERSION,CCSDS_TYPE,CCSDS_2HDR_FLAG,CCSDS_APID,CCSDS_GRP_FLAGS,CCSDS_SEQ_COUNT,CCSDS_PKT_LEN,CCSDS_DOY,CCSDS_MSEC 
5 2014/04/07 16:03:15,0,0,1,544,3,0,985,20550,57795894

0 TIMEFORMATTED,G_CCSDS_VERSION,G_CCSDS_TYPE,G_CCSDS_2HDR_FLAG,G_CCSDS_APID,G_CCSDS_GRP_FLAGS,G_CCSDS_SEQ_COUNT,G_CCSDS_PKT_LEN,G_CCSDS_DOY,G_CCSDS_MSEC 
2 2014/04/07 16:03:16,0,0,1,572,3,0,1917,20550,57795339 
4 TIMEFORMATTED,CCSDS_VERSION,CCSDS_TYPE,CCSDS_2HDR_FLAG,CCSDS_APID,CCSDS_GRP_FLAGS,CCSDS_SEQ_COUNT,CCSDS_PKT_LEN,CCSDS_DOY,CCSDS_MSEC   
6 2014/04/07 16:03:16,0,0,1,544,3,0,985,20550,57796894

0 TIMEFORMATTED,G_CCSDS_VERSION,G_CCSDS_TYPE,G_CCSDS_2HDR_FLAG,G_CCSDS_APID,G_CCSDS_GRP_FLAGS,G_CCSDS_SEQ_COUNT,G_CCSDS_PKT_LEN,G_CCSDS_DOY,G_CCSDS_MSEC 
3 2014/04/07 16:03:17,0,0,1,572,3,0,1917,20550,57795339
4 TIMEFORMATTED,CCSDS_VERSION,CCSDS_TYPE,CCSDS_2HDR_FLAG,CCSDS_APID,CCSDS_GRP_FLAGS,CCSDS_SEQ_COUNT,CCSDS_PKT_LEN,CCSDS_DOY,CCSDS_MSEC   
7 2014/04/07 16:03:17,0,0,1,544,3,0,985,20550,57797894

joker · August 11, 2016, 5:10pm

I'm just asking. If what I'm asking will never happen, just say it.

What I'm making out of your response is that the output file may contain exactly one or more of this group of lines:

G_HEADERF1_FIELD1,G_HEADERF1_FIELD2,....
F1_FIELD1,F1_FIELD2,...
HEADERF2_FIELD1,HEADERF2_FIELD2,...
F2_FIELD1,F2_FIELD2,...
F2_FIELD1,F2_FIELD2,...

Now that the task defined, a solution will come shortly

aachave1 · August 11, 2016, 5:36pm

Here a simple example with pseudo file data where the column 1 is the virtual timestamp. file1 time is every 5 seconds and file2 time is every 1 second.

File1:

Header1
1
5
10
15
20

File2:

Header2
1
2
3
4
5
6
7
8
9
10
11
12

Desired  Output:

Header1
1
Header2
1
2
3
4

Header1
5
Header2
5
6
7
8
9

Header1
10
Header2
10
11
12

joker · August 11, 2016, 5:44pm

4 is closer to 5 than to 1. Is 1 nevertheless the correct line for 4?

aachave1 · August 11, 2016, 6:16pm

"4 is closer to 5 than to 1. Is 1 nevertheless the correct line for 4?"

Yes, 4 is closer to 5, but it has to be the nearest preceding file1 timestamp to file2's timestamp - not the timestamp after. So yes, 1 timestamp line is the correct one for 4.

---------- Post updated at 06:16 PM ---------- Previous update was at 05:52 PM ----------

More examples: Sometimes the timestamps will be different. For example below.

File1:

Header1
1
1.2.
1.4
1.6
1.8
2
2.2

Flie2:

Header2
1.1
1.3
1.5
1.7
1.9
2.1
2.3


Output would be:

Header1
1
Header2
1.1
1.3
1.5
1.7
1.9

Header1
2
Header2
2.1
2.3

joker · August 11, 2016, 9:46pm

I hadn't looked at your solution yet. When thinking about it, awk may not be such a good solution here, as it can not read 2 files simultaneously(AFAIK) but only sequentially. So you have to store the data of one file completely into memory. Maybe the amount of data is small, then this is irrelevant.

With a scripting language you can read as needed from either file.

Here's a try in perl:

#!/usr/bin/env perl

use Time::Local;

open($f1,"<",$ARGV[0]);
open($f2,"<",$ARGV[1]);

our $last_printed_data1;

sub read_file {
        $fh=shift;
        my $data = <$fh>;
        my ($y,$m,$d,$H,$M,$S) = $data =~ m#([0-9]{4})/([0-9]{2})/([0-9]{2}) ([0-9]{2}):([0-9]{2}):([0-9]{2})#;
        return timelocal($S,$M,$H,$d,$m,$y), $data if($y);
}

sub out {
        my ($header1, $data1, $header2, $data2) = @_;
        if($last_printed_data1 eq $data1) {
                print("$data2");
        } else {
                print($header1,$data1,$header2,$data2);
                $last_printed_data1 = $data1;
        }
}

$header1=<$f1>; ($time1, $data1)=read_file($f1);
$header2=<$f2>; ($time2, $data2)=read_file($f2);

while(!eof($f2)) {

        if($time1==$time2) {
                out($header1,$data1,$header2,$data2);
                ($time2, $data2) = read_file($f2);
        }elsif($time1>$time2) {
                if($time1_old) {
                        out($header1,$data1_old,$header2,$data2);
                        ($time2, $data2) = read_file($f2);
                } else {
                        out("No preceding f1-value\n","",$header2,$data2);
                        ($time2, $data2) = read_file($f2);
                }
        } else {
                if(eof($f1)) {
                        out($header1,$data1,$header2,$data2);
                        ($time2, $data2) = read_file($f2);
                } else {
                        $time1_old = $time1;
                        $data1_old = $data1;
                        ($time1, $data1) = read_file($f1);
                }
        }
}
out($header1,$data1,$header2,$data2);

Use it like this:

./prog.pl file1 file2

With your data file1...

TIMEFORMATTED,G_CCSDS_VERSION,G_CCSDS_TYPE,G_CCSDS_2HDR_FLAG,G_CCSDS_APID,G_CCSDS_GRP_FLAGS,G_CCSDS_SEQ_COUNT,G_CCSDS_PKT_LEN,G_CCSDS_DOY,G_CCSDS_MSEC
2014/04/07 16:02:55,0,0,1,572,3,0,1917,20550,57775339
2014/04/07 16:03:00,0,0,1,572,3,0,1917,20550,57780339
2014/04/07 16:03:05,0,0,1,572,3,0,1917,20550,57785339
2014/04/07 16:03:10,0,0,1,572,3,0,1917,20550,57790339
2014/04/07 16:03:15,0,0,1,572,3,0,1917,20550,57795339

and file2 ...

TIMEFORMATTED,CCSDS_VERSION,CCSDS_TYPE,CCSDS_2HDR_FLAG,CCSDS_APID,CCSDS_GRP_FLAGS,CCSDS_SEQ_COUNT,CCSDS_PKT_LEN,CCSDS_DOY,CCSDS_MSEC
2014/04/07 16:03:12,0,0,1,544,3,0,985,20550,57788894
2014/04/07 16:03:13,0,0,1,544,3,0,985,20550,57793894
2014/04/07 16:03:14,0,0,1,544,3,0,985,20550,57794894
2014/04/07 16:03:15,0,0,1,544,3,0,985,20550,57795894
2014/04/07 16:03:16,0,0,1,544,3,0,985,20550,57796894
2014/04/07 16:03:17,0,0,1,544,3,0,985,20550,57797894

the output is...

TIMEFORMATTED,G_CCSDS_VERSION,G_CCSDS_TYPE,G_CCSDS_2HDR_FLAG,G_CCSDS_APID,G_CCSDS_GRP_FLAGS,G_CCSDS_SEQ_COUNT,G_CCSDS_PKT_LEN,G_CCSDS_DOY,G_CCSDS_MSEC
2014/04/07 16:03:10,0,0,1,572,3,0,1917,20550,57790339
TIMEFORMATTED,CCSDS_VERSION,CCSDS_TYPE,CCSDS_2HDR_FLAG,CCSDS_APID,CCSDS_GRP_FLAGS,CCSDS_SEQ_COUNT,CCSDS_PKT_LEN,CCSDS_DOY,CCSDS_MSEC
2014/04/07 16:03:12,0,0,1,544,3,0,985,20550,57788894
2014/04/07 16:03:13,0,0,1,544,3,0,985,20550,57793894
2014/04/07 16:03:14,0,0,1,544,3,0,985,20550,57794894
TIMEFORMATTED,G_CCSDS_VERSION,G_CCSDS_TYPE,G_CCSDS_2HDR_FLAG,G_CCSDS_APID,G_CCSDS_GRP_FLAGS,G_CCSDS_SEQ_COUNT,G_CCSDS_PKT_LEN,G_CCSDS_DOY,G_CCSDS_MSEC
2014/04/07 16:03:15,0,0,1,572,3,0,1917,20550,57795339
TIMEFORMATTED,CCSDS_VERSION,CCSDS_TYPE,CCSDS_2HDR_FLAG,CCSDS_APID,CCSDS_GRP_FLAGS,CCSDS_SEQ_COUNT,CCSDS_PKT_LEN,CCSDS_DOY,CCSDS_MSEC
2014/04/07 16:03:15,0,0,1,544,3,0,985,20550,57795894
2014/04/07 16:03:16,0,0,1,544,3,0,985,20550,57796894
2014/04/07 16:03:17,0,0,1,544,3,0,985,20550,57797894

aachave1 · August 12, 2016, 1:37am

Stomp, that seemed so close to working with two of my "real" files, except it missed the very last match in the output file. For example, the file1 timestamp "16:31:20" matches the file2 "16:31:20" just fine, however the file1 time "16:31:25" is missing right above the file2 timestamp "16:31:25" (in red).

I verified that file1 does have a "16:31:25" row, but it was left out.

TIMEFORMATTED,G_CCSDS_VERSION,G_CCSDS_TYPE,G_CCSDS_2HDR_FLAG,G_CCSDS_APID,G_CCSDS_GRP_FLAGS,G_CCSDS_SEQ_COUNT,G_CCSDS_PKT_LEN,G_CCSDS_DOY,G_CCSDS_MSEC
2014/04/07 16:31:20,0,0,1,572,3,0,1917,20550,57795339
TIMEFORMATTED,CCSDS_VERSION,CCSDS_TYPE,CCSDS_2HDR_FLAG,CCSDS_APID,CCSDS_GRP_FLAGS,CCSDS_SEQ_COUNT,CCSDS_PKT_LEN,CCSDS_DOY,CCSDS_MSEC
2014/04/07 16:31:20,0,0,1,544,3,0,985,20550,57795894
2014/04/07 16:31:25,0,0,1,544,3,0,985,20550,57796894

Also, I know you don't have my "real" files, but it would be more accurate sorting off of column 21 (not shown in these snippet examples) of my files since this is a msec column. How would I sort off of column 21 of both of my files in your code? I did it with my awk code above, but not sure how to do it with your pearl code.

Thank you!

joker · August 13, 2016, 5:33pm

If you show your data files which aren't working, I may take a look.

aachave1 · August 13, 2016, 6:09pm

I'm trying to remote to my work PC at the moment and having connection issues.

As far as my files, I found out that it seems to work fine if file1 is smaller than file2, but if file1 is larger than file2, then it misses the last match of the files (like described above).

In my situation, there will be times when file1 is larger than file2 and situations where file2 is larger, so I need it to be adaptable to varying sizes. Thanks!

joker · August 14, 2016, 10:21am

Here another possibillity, a bit more stream oriented ...

perl -MTime::Local -pe '
BEGIN{$fc=1}
s#(([0-9]{4})/([0-9]{2})/([0-9]{2}) ([0-9]{2}):([0-9]{2}):([0-9]{2}))#timelocal($7,$6,$5,$4,$3,$2).",$fc,$1"#e; 
$fc++ if eof;' FILE1 FILE2 | \
\
sort -n | \
\
awk -F, 'FNR<3{print $0;next} 
SRC_OLD==1 && SRC_OLD!=$2 {print LINE_OLD} 
$2==2 {print $0} {SRC_OLD=$2;LINE_OLD=$0}' | \
\
awk -F, 'BEGIN{OFS="\n"} 
FNR==1{H2=$0;next} 
FNR==2{H1=$0;next}  
{FID=$2;sub(/^([^,]+,){2}/,"")} 
FID==1{print H1,$0,H2;next} 
{print}'

Explanation of the steps:

Use Perl to add timestamps as epoch and filenumber at beginning of each line
sort lines using epoch and file nr (Headers stay first, If The Headers change this may break)
remove duplicate file1 lines and keep only the newest
detect the headers and insert them before and after every file1 line and print out the data without the additional fields

...and a bit optimized too(no need to have 2 awk calls)...

perl -MTime::Local -pe '
BEGIN{$fc=1}
s#(([0-9]{4})/([0-9]{2})/([0-9]{2}) ([0-9]{2}):([0-9]{2}):([0-9]{2}))#timelocal($7,$6,$5,$4,$3,$2).",$fc,$1"#e;
$fc++ if eof;' FILE1 FILE2 | \
\
sort -n  | \
\
awk -F, 'BEGIN{OFS="\n"}
FNR==1{H2=$0;next}
FNR==2{H1=$0;next}
{FID=$2;sub(/^([^,]+,){2}/,"")}
FID_OLD==1 && FID!=1 {print H1,LINE_OLD,H2}
FID==2 {print $0}
{FID_OLD=FID;LINE_OLD=$0}'

aachave1 · August 14, 2016, 11:48am

Thanks Stomp for all these options! This latest script above works with the proper matching of data, however, if file2 is larger than file1, The headers are opposite (file1 headers are with file2 data and vice versa). It works fine though if file1 is larger than file2.

I will try to fix code so that it adapts to either file being larger/smaller than the other. I will reply if I have any luck:)

joker · August 14, 2016, 3:11pm

After finally checking out your original try, I'm realizing that creating an epoch timestamp is absolutely not necessary here. The date given in file1/file2 is sortable without transformation.

Ok. I leave some of the fun for you. But it has nothing to do with either file being larger or smaller. It is probably point 2 of the list of my last post

aachave1 · August 14, 2016, 5:00pm

Okay, but why does it only have issues when one file is larger than the other, but works fine the other way around? This is the case for both sets of code that you had. Any hints as to why?

Where can I actually attach my "real" files to this forum?

I will try a few things now. Thanks!

joker · August 14, 2016, 5:17pm

I suppose that's coincidence.

TIMEFORMATTED,CCSDS_VERSION,CCSDS_TYPE,CCSDS_2HDR_FLAG,CCSDS_APID,CCSDS_GRP_FLAGS,CCSDS_SEQ_COUNT,CCSDS_PKT_LEN,CCSDS_DOY,CCSDS_MSEC
TIMEFORMATTED,G_CCSDS_VERSION,G_CCSDS_TYPE,G_CCSDS_2HDR_FLAG,G_CCSDS_APID,G_CCSDS_GRP_FLAGS,G_CCSDS_SEQ_COUNT,G_CCSDS_PKT_LEN,G_CCSDS_DOY,G_CCSDS_MSEC
1399471375,1,2014/04/07 16:02:55,0,0,1,572,3,0,1917,20550,57775339
1399471380,1,2014/04/07 16:03:00,0,0,1,572,3,0,1917,20550,57780339
1399471385,1,2014/04/07 16:03:05,0,0,1,572,3,0,1917,20550,57785339

If you sort the above with numeric sort which I chose(sort -n), the numeric value of both of the first two lines is 0. Then a fallback of a string sort is used and hereby the header line of FILE2(CCSDS_VERSION) is smaller than the header line of FILE1(G_CCSDS_VERSION) which is decided at the colored character(C<G).If the C Character is something greater than G (H,I,....), the header lines are wrongly switched.

The rest of the file is correctly sorted because my generated epoch timestamps should exactly be sorted numerically.

You can verify that, if you for example insert an "A" at the beginning of the second field in your problematic file 2 in the first header line. Then the output should be correct.

If my diagnosis is correct, the obvious question is: What can be done about this error here?

---

You can attach files to your post in "Advanced mode".