Compare first column from two csv files with greater than or equal, and less than

aachave1 · July 19, 2016, 2:57pm

I have two csv files of different sizes. The output file needs to have file1 contents on top of file2 contents where file2 col1 is >= to file1 col1, and file2 col1(same value) is < file1 col1 (next value). So basically, some file2 rows will be matched to the same file1 row because it is the closet match that is not exact. Thank you!

file1:

20160525162519,SN1
20160525162524,SN2
20160525162529,SN3
20160525162534,SN4
20160525162539,SN5
20160525162544,SN6

file2:

20160525162522,464374526,1464193527,206,0,0,0,0,1,1,544,544
20160525162523,464374526,1464193527,206,0,0,0,0,1,1,544,544
20160525162524,464374526,1464193527,206,0,0,0,0,1,1,544,544
20160525162525,464374526,1464193527,206,0,0,0,0,1,1,544,544
20160525162526,464374526,1464193527,206,0,0,0,0,1,1,544,544
20160525162527,464374526,1464193527,206,0,0,0,0,1,1,544,544
20160525162528,464374526,1464193527,206,0,0,0,0,1,1,544,544
20160525162529,464374526,1464193527,206,0,0,0,0,1,1,544,544
20160525162530,464374526,1464193527,206,0,0,0,0,1,1,544,544
20160525162531,464374526,1464193527,206,0,0,0,0,1,1,544,544

Desired output:

20160525162519,SN1
20160525162522,464374526,1464193527,206,0,0,0,0,1,1,544,544

20160525162519,SN1
20160525162523,464374526,1464193527,206,0,0,0,0,1,1,544,544

20160525162524,SN2
20160525162524,464374526,1464193527,206,0,0,0,0,1,1,544,544

20160525162524,SN2
20160525162525,464374526,1464193527,206,0,0,0,0,1,1,544,544

20160525162524,SN2
20160525162526,464374526,1464193527,206,0,0,0,0,1,1,544,544

20160525162524,SN2
20160525162527,464374526,1464193527,206,0,0,0,0,1,1,544,544

20160525162524,SN2
20160525162528,464374526,1464193527,206,0,0,0,0,1,1,544,544

20160525162529,SN3
20160525162529,464374526,1464193527,206,0,0,0,0,1,1,544,544

20160525162529,SN3
20160525162530,464374526,1464193527,206,0,0,0,0,1,1,544,544

20160525162529,SN3
20160525162531,464374526,1464193527,206,0,0,0,0,1,1,544,544

This code almost works, but only displays exact matches.

awk -F, 'NR==FNR { a[$1]=$0;next}; 
(FNR != NR  ) || ( ($1 in a) && ($0 >= a[$1]) && ($0 < a[$1])){print a[$1]; print}' file1.csv file2.csv > temp

RudiC · July 19, 2016, 3:41pm

Not sure I understood your problem correctly, but this at least works on the samples given:

awk -F, '
BEGIN           {CNT+=2
                }
NR == FNR       {a[NR] = $0
                 b[NR] = $1
                 next
                }
$1 >= b[CNT]    {CNT++
                }
$1 <  b[CNT]    {print a[CNT-1]
                 print $0, RS
                }
' file1 file2
20160525162519,SN1
20160525162522,464374526,1464193527,206,0,0,0,0,1,1,544,544 

20160525162519,SN1
20160525162523,464374526,1464193527,206,0,0,0,0,1,1,544,544 

20160525162524,SN2
20160525162524,464374526,1464193527,206,0,0,0,0,1,1,544,544 

20160525162524,SN2
20160525162525,464374526,1464193527,206,0,0,0,0,1,1,544,544 

20160525162524,SN2
20160525162526,464374526,1464193527,206,0,0,0,0,1,1,544,544 

20160525162524,SN2
20160525162527,464374526,1464193527,206,0,0,0,0,1,1,544,544 

20160525162524,SN2
20160525162528,464374526,1464193527,206,0,0,0,0,1,1,544,544 

20160525162529,SN3
20160525162529,464374526,1464193527,206,0,0,0,0,1,1,544,544 

20160525162529,SN3
20160525162530,464374526,1464193527,206,0,0,0,0,1,1,544,544 

20160525162529,SN3
20160525162531,464374526,1464193527,206,0,0,0,0,1,1,544,544

aachave1 · July 19, 2016, 4:50pm

Thank you so much! That worked with my previous example files, however, it needs to be adaptable to these new files where now file1 is larger than file2 and it has 2 or more matching time stamps. It still needs to check for >= and < times, even though there may not be any of one of the scenarios.
It doesn�t seem to work with these column 1 time stamps (format issue with �:� and �/� maybe?).

file1:

TIMEFORMATTED,G_TP01_OPER_ID,G_TP01_OPER_ID(RAW)
2016/05/25 16:23:04,0,0
2016/05/25 16:23:05,0,0
2016/05/25 16:23:06,0,0
2016/05/25 16:23:07,0,0
2016/05/25 16:23:08,0,0
2016/05/25 16:23:09,0,0
2016/05/25 16:23:10,0,0

2016/05/25 16:24:01,0,0
2016/05/25 16:24:02,0,0
2016/05/25 16:24:03,0,0
2016/05/25 16:24:04,0,0
2016/05/25 16:25:05,0,0

2016/05/25 16:25:01,0,0
2016/05/25 16:25:02,0,0
2016/05/25 16:25:03,0,0
2016/05/25 16:25:04,0,0
2016/05/25 16:25:05,0,0

file2:

TIMEFORMATTED,HDR_SYNC,HDR_SEC,HDR_MSEC,CCSDS_VERSION
2016/05/25 16:24:02,464374526,1464193527,206,0
2016/05/25 16:25:05,464374526,1464193527,206,0

Desired output:

2016/05/25 16:24:02,0,0
2016/05/25 16:24:02,464374526,1464193527,206,0

2016/05/25 16:25:05,0,0
2016/05/25 16:25:05,464374526,1464193527,206,0

aachave1 · July 21, 2016, 2:25pm

Does anyone have any ideas how to get this code to work with my previous file examples? RudiC got it to work with my original post, but Iv'e tried many scenarios to have it work more universal on different size files with unpredictable time stamps in column 1. Sometimes file 1 will be larger with more time stamped data and sometimes file 2 may have more time stamped data - either way I still need the >= and < logic to cover all bases. Thank you!!!

wk -F, '
BEGIN           {CNT+=2
                }
NR == FNR       {a[NR] = $0
                 b[NR] = $1
                 next
                }
$1 >= b[CNT]    {CNT++
                }
$1 <  b[CNT]    {print a[CNT-1]
                 print $0, RS
                }
' file1 file2

rdrtx1 · July 21, 2016, 5:08pm

awk 'NR==FNR && NR > 1 {l=$0; s=FS; $0=$0; a[$1]=l; next}
a[$1] {print a[$1]; print $0; print ""}
' FS=, file1 FS=, file2

aachave1 · July 21, 2016, 5:27pm

awk 'NR==FNR && NR > 1 {l=$0; s=FS; $0=$0; a[$1]=l; next}
a[$1] {print a[$1]; print $0; print ""}
' FS=, file1 FS=, file2

This works with the last file examples (#3 post), but not the first file examples (#1 post). The previous code from RudiC works with the first file examples (#1 post), but not the last (#3 post). I need to find something that handles both scenarios. I'll keep tying - thanks!!

Don_Cragun · July 22, 2016, 1:26am

The first rule for producing code that works is to get a clear description of the input format(s) and the desired output format(s). Do not expect us to write code for you that magically guesses correctly at inconsistent input file formats. Note that:

there are no blank lines nor any header lines in the files in post #1, and
there are blank lines in file1 and headers in both input files in post #4.

We might be able to make adjustments for blank or empty lines in your input files (if we can determine whether or not a blank or empty line is a header).

If you give us a clear description of your input file formats that can be used in ALL cases for your various input file formats such that we could be sure programmatically whether or not a line in an input file is a header or data, we might be able to handle that as well. But, I'm not going to attempt to guess at what might appear as data or as headers, how many lines of headers might be present in each input file, what characters might appear in headers, nor what characters might appear in data in other input files you might throw at us later.

Note also that character string comparisons of field 1 in both sets of sample input files happens to work with the date formats used in those files. That might not be true for other date and time formats. (And, if the format used in file1 is not the same format as the date and time format in used in file2 in any pair of input files, the code needed to normalize date and time strings for comparison might be a significant research project unless you pass your script a clear description of the date and time formats used in each input file.)

aachave1 · July 22, 2016, 10:18am

Yes, I understand. I am so sorry for the lack of clarity, but it seemed difficult to explain since my files are huge and these are just little snippets.

Here are some more examples with files that will always be in this format (no spaces). The first file example has file1 larger than file2 and some time stamps match, while some don't. The desired output will have the headers with them, but if too difficult, then I can put them in later. The second file example has file1 smaller that file2, so I need to attach a file1 row that best matches a file2 row (with >= and < logic). As you can see the time in seconds varies from both files so some will match exactly while some are close.

File 1:
TIMEFORMATTED,G_TP01_OPER_ID,G_TP01_OPER_ID(RAW),G_TP02_PROC_NO,G_TP02_PROC_NO(RAW),G_TP03_PROC_REV
2016/05/25 16:25:19,0,0,0,NO_DEF,-2147483647
2016/05/25 16:25:20,0,0,0,NO_DEF,-2147483648
2016/05/25 16:25:21,0,0,0,NO_DEF,-2147483649
2016/05/25 16:25:22,0,0,0,NO_DEF,-2147483650
2016/05/25 16:25:23,0,0,0,NO_DEF,-2147483651
2016/05/25 16:25:24,0,0,0,NO_DEF,-2147483652
2016/05/25 16:25:25,0,0,0,NO_DEF,-2147483653
2016/05/25 16:25:26,0,0,0,NO_DEF,-2147483654
2016/05/25 16:25:27,0,0,0,NO_DEF,-2147483655


File 2:

TIMEFORMATTED,HDR_SYNC,HDR_SEC,HDR_MSEC,G_CCSDS_VERSION,G_CCSDS_VERSION(RAW)
2016/05/25 16:25:22,464374526,1464193527,206,0,0
2016/05/25 16:25:26,464374526,1464193532,206,0,0
2016/05/25 16:25:31,464374526,1464193537,207,0,0


Desired Output:

TIMEFORMATTED,G_TP01_OPER_ID,G_TP01_OPER_ID(RAW),G_TP02_PROC_NO,G_TP02_PROC_NO(RAW),G_TP03_PROC_REV
2016/05/25 16:25:22,0,0,0,NO_DEF,-2147483650
TIMEFORMATTED,HDR_SYNC,HDR_SEC,HDR_MSEC,G_CCSDS_VERSION,G_CCSDS_VERSION(RAW
2016/05/25 16:25:22,464374526,1464193527,206,0,0
TIMEFORMATTED,G_TP01_OPER_ID,G_TP01_OPER_ID(RAW),G_TP02_PROC_NO,G_TP02_PROC_NO(RAW),G_TP03_PROC_REV
2016/05/25 16:25:26,0,0,0,NO_DEF,-2147483654
TIMEFORMATTED,HDR_SYNC,HDR_SEC,HDR_MSEC,G_CCSDS_VERSION,G_CCSDS_VERSION(RAW
2016/05/25 16:25:26,464374526,1464193532,206,0,0
TIMEFORMATTED,G_TP01_OPER_ID,G_TP01_OPER_ID(RAW),G_TP02_PROC_NO,G_TP02_PROC_NO(RAW),G_TP03_PROC_REV
2016/05/25 16:25:27,0,0,0,NO_DEF,-2147483655
TIMEFORMATTED,HDR_SYNC,HDR_SEC,HDR_MSEC,G_CCSDS_VERSION,G_CCSDS_VERSION(RAW
2016/05/25 16:25:31,464374526,1464193537,207,0,0







Second example:

File 1:

TIMEFORMATTED,G_TP01_OPER_ID,G_TP01_OPER_ID(RAW),G_TP02_PROC_NO,G_TP02_PROC_NO(RAW),G_TP03_PROC_REV
2014/04/07 16:00:30,0,0,0,NO_DEF,-2147483647
2014/04/07 16:00:35,0,0,0,NO_DEF,-2147483648
2014/04/07 16:00:40,0,0,0,NO_DEF,-2147483649
2014/04/07 16:00:45,0,0,0,NO_DEF,-2147483650
2014/04/07 16:00:50,0,0,0,NO_DEF,-2147483651
2014/04/07 16:00:55,0,0,0,NO_DEF,-2147483652
2014/04/07 16:00:60,0,0,0,NO_DEF,-2147483653

File 2:

TIMEFORMATTED,CCSDS_VERSION,CCSDS_VERSION(RAW),CCSDS_TYPE,CCSDS_TYPE(RAW),CCSDS_2HDR_FLAG,CCSDS_2HDR_FLAG(RAW),ID
2014/04/07 16:00:43,0,0,0,0,1,1,544
2014/04/07 16:00:45,0,0,0,0,1,3,544
2014/04/07 16:00:47,0,0,0,0,1,1,544
2014/04/07 16:00:49,0,0,0,0,4,1,544
2014/04/07 16:00:51,0,0,0,0,1,1,544
2014/04/07 16:00:53,0,0,0,0,1,7,544
2014/04/07 16:00:55,0,0,0,0,8,1,544
2014/04/07 16:00:57,0,0,0,0,1,2,544
2014/04/07 16:00:59,0,0,0,0,3,1,544
2014/04/07 16:00:61,0,0,0,0,1,1,544
2014/04/07 16:00:63,0,0,0,0,1,9,544
2014/04/07 16:00:65,0,0,0,0,4,1,544
2014/04/07 16:00:67,0,0,0,0,1,1,544


Output: I prefer the headers to be attached like first output example, but I�ll take this if it is easier.

2014/04/07 16:00:40,0,0,0,NO_DEF,-2147483649
2014/04/07 16:00:43,0,0,0,0,1,1,544
2014/04/07 16:00:45,0,0,0,NO_DEF,-2147483650
2014/04/07 16:00:45,0,0,0,0,1,3,544
2014/04/07 16:00:45,0,0,0,NO_DEF,-2147483650
2014/04/07 16:00:47,0,0,0,0,1,1,544
2014/04/07 16:00:45,0,0,0,NO_DEF,-2147483650
2014/04/07 16:00:49,0,0,0,0,4,1,544
2014/04/07 16:00:50,0,0,0,NO_DEF,-2147483651
2014/04/07 16:00:51,0,0,0,0,1,1,544
2014/04/07 16:00:50,0,0,0,NO_DEF,-2147483651
2014/04/07 16:00:53,0,0,0,0,1,7,544
2014/04/07 16:00:55,0,0,0,NO_DEF,-2147483652
2014/04/07 16:00:55,0,0,0,0,8,1,544
2014/04/07 16:00:55,0,0,0,NO_DEF,-2147483652
2014/04/07 16:00:57,0,0,0,0,1,2,544
2014/04/07 16:00:55,0,0,0,NO_DEF,-2147483652
2014/04/07 16:00:59,0,0,0,0,3,1,544
2014/04/07 16:00:60,0,0,0,NO_DEF,-2147483653
2014/04/07 16:00:61,0,0,0,0,1,1,544