Strange situation of file sorting and merging

manas_ranjan · August 8, 2012, 9:33am

I have a strange situation of sorting and merging two files based on similar columns

previusly both files has same count of records so, I made below way which is working fine until they reduced the count of one files .
I.e. some times the count of records of both will same and some times it won't but only the columns will remain unchanged .

 cat /var/tmp/today.csv > /var/tmp/outputfile
                for file in `ls /var/tmp/yday.csv | xargs `
                do
                        cut -d"," -f7,8,9 $file > /var/tmp/tmp
                        paste -d"," /var/tmp/outputfile /var/tmp/tmp >> /var/tmp/final_outputfile
                done

working well for same row count files

and columns we are talking about are 1,2,3,5,6 which are common in both files.

now files has changed
for today

36000807	A	123 	78	0	1	0.1	 0.2 	0.3
36000807	A	123 	79	0	1	-0.1 	 0.2 	-0.3
36000807	A	123 	78	0	5	0.1	 0.2 	0.3
36000807	A	123 	79	0	5	-0.1 	 0.2 	-0.3
36000807	A	123 	78	0	10	0.1	 0.2 	0.3
36000807	A	123 	79	0	10	-0.1 	 0.2 	-0.3

for yday.csv

36000807	A	123 	76	0	1	0.1	 0.2 	0.3
36000807	A	123 	76	0	5	-0.1 	 0.2 	-0.3
36000807	A	123 	76	0	10	-0.1 	 0.2 	-0.3

now final_outputfile
should looklike

36000807	A	123 	78	0	1	0.1	 0.2 	0.3	0.1	0.2	0.3
36000807	A	123 	79	0	1	-0.1 	 0.2 	-0.3	NA	NA	NA
36000807	A	123 	78	0	5	0.1	 0.2 	0.3	-0.1 	 0.2 	-0.3
36000807	A	123 	79	0	5	-0.1 	 0.2 	-0.3	NA	NA	NA
36000807	A	123 	78	0	10	0.1	 0.2 	0.3	-0.1 	 0.2 	-0.3
36000807	A	123 	79	0	10	-0.1 	 0.2 	-0.3	NA	NA	NA

it's possible for yday.csv first row could be missing totally or any other rows
then
final_output should be looklike

36000807	A	123 	78	0	1	0.1	 0.2 	0.3	NA	NA	NA
36000807	A	123 	79	0	1	-0.1 	 0.2 	-0.3	NA	NA	NA
36000807	A	123 	78	0	5	0.1	 0.2 	0.3	-0.1 	 0.2 	-0.3
36000807	A	123 	79	0	5	-0.1 	 0.2 	-0.3	NA	NA	NA
36000807	A	123 	78	0	10	0.1	 0.2 	0.3	-0.1 	 0.2 	-0.3
36000807	A	123 	79	0	10	-0.1 	 0.2 	-0.3	NA	NA	NA

anyone have any idea to work this out?

Corona688 · August 8, 2012, 1:54pm

I am struggling to try and understand the logic which produces that output from that input. What decides which rows get NA'ed and which don't?

RudiC · August 8, 2012, 3:30pm

Your script does not - by all means - produce the output you describe, even with files of equal line count. The best that I could get at with generous interpretation of the command in the script is

36000807    A    123     78    0    1    0.1     0.2     0.3,0.1     0.2     0.3
36000807    A    123     79    0    1    -0.1      0.2     -0.3,-0.1      0.2     -0.3
36000807    A    123     78    0    5    0.1     0.2     0.3,0.1     0.2     0.3
36000807    A    123     79    0    5    -0.1      0.2     -0.3,-0.1      0.2     -0.3
36000807    A    123     78    0    10    0.1     0.2     0.3,0.1     0.2     0.3
36000807    A    123     79    0    10    -0.1      0.2     -0.3,-0.1      0.2     -0.3

When files are of different length, the leftover lines will get no or empty fields pasted. Why should a line with field 4 = 79 in today.csv get NAs when field 4 in yday.csv is 76? Please explain the intended algorithm.

manas_ranjan · August 9, 2012, 7:40am

Hi Corona,

Apology for making a mess .:wall:

My requirement is combine yday and today csv file and create final output file. taking consideration of common columns 1,2,3,5,6 in both yday and today file .

sometimes today file counts are greater than the file counts of yday one.

So merging would only take 1st row of (combined column of 1,2,3,5,6 ) lowest count file (yady) to the first row (combined column of 1,2,3,5,6 ) of highest count file (today) and the very same records (combined column of 1,2,3,5,6 ) for next one in the highest count file (today) should have NA NA NA as there is no record matching for lowest count since 1st row of (combined column of 1,2,3,5,6 ) lowset count file matched above.

please refer below input file for different cases
PS: All field separator is , instead of "TAB". My mistake.

case one
today.csv

36000807	A	123 	78	0	1	0.1	 0.2 	0.3
36000807	A	123 	79	0	1	-0.1 	 0.2 	-0.3
36000807	A	123 	78	0	5	0.1	 0.2 	0.3
36000807	A	123 	79	0	5	-0.1 	 0.2 	-0.3
36000807	A	123 	78	0	10	0.1	 0.2 	0.3
36000807	A	123 	79	0	10	-0.1 	 0.2 	-0.3

yday.csv

36000807	A	123 	76	0	1	0.56	 0.47 	0.39
36000807	A	123 	76	0	5	-0.34 	 0.27 	-0.38
36000807	A	123 	76	0	10	-0.14 	 0.25 	-0.53

merged file

f1.1		f1.2	f1.3	f1.4	f1.5	f1.6	f1.7	f1.8	f1.9	f2.4	f2.7	f2.8	f2.9
36000807	A	123 	78	0	1	0.1	 0.2 	0.3	76	0.56	 0.47 	0.39
36000807	A	123 	79	0	1	-0.1 	 0.2 	-0.3	76	NA	NA	NA
36000807	A	123 	78	0	5	0.1	 0.2 	0.3	76	-0.34 	 0.27 	-0.38
36000807	A	123 	79	0	5	-0.1 	 0.2 	-0.3	76	NA	NA	NA
36000807	A	123 	78	0	10	0.1	 0.2 	0.3	76	-0.14 	 0.25 	-0.53
36000807	A	123 	79	0	10	-0.1 	 0.2 	-0.3	76	NA	NA	NA
--------------------------today.csv------------------------------------------****-----------yday.csv------------

option two

today.csv
36000807	A	123 	78	0	1	0.1	 0.2 	0.3
36000807	A	123 	79	0	1	-0.1 	 0.2 	-0.3
36000807	A	123 	78	0	5	0.1	 0.2 	0.3
36000807	A	123 	79	0	5	-0.1 	 0.2 	-0.3
36000807	A	123 	78	0	10	0.1	 0.2 	0.3
36000807	A	123 	79	0	10	-0.1 	 0.2 	-0.3

yday.csv

36000807	A	123 	76	0	5	-0.34 	 0.27 	-0.38
36000807	A	123 	76	0	10	-0.14 	 0.25 	-0.53

final merged file

f1.1		f1.2	f1.3	f1.4	f1.5	f1.6	f1.7	f1.8	f1.9	f2.4	f2.7	f2.8	f2.9
36000807	A	123 	78	0	1	0.1	 0.2 	0.3	NA	NA	NA	NA
36000807	A	123 	79	0	1	-0.1 	 0.2 	-0.3	NA	NA	NA	NA
36000807	A	123 	78	0	5	0.1	 0.2 	0.3	76	-0.34 	 0.27 	-0.38
36000807	A	123 	79	0	5	-0.1 	 0.2 	-0.3	76	NA	NA	NA
36000807	A	123 	78	0	10	0.1	 0.2 	0.3	76	-0.14 	 0.25 	-0.53
36000807	A	123 	79	0	10	-0.1 	 0.2 	-0.3	76	NA	NA	NA
--------------------------today.csv-----------------------------------------***-----------yday.csv------------

case three
today.csv

36000807	A	123 	78	0	1	0.1	 0.2 	0.3
36000807	A	123 	78	0	5	0.1	 0.2 	0.3
36000807	A	123 	79	0	10	-0.1 	 0.2 	-0.3

yday.csv

36000807	A	123 	76	0	1	0.56	 0.47 	0.39
36000807	A	123 	76	0	5	-0.34 	 0.27 	-0.38
36000807	A	123 	76	0	10	-0.14 	 0.25 	-0.53

final merged one

f1.1		f1.2	f1.3	f1.4	f1.5	f1.6	f1.7	f1.8	f1.9	f2.4	f2.7	f2.8	f2.9
36000807	A	123 	78	0	1	0.1	 0.2 	0.3	76	0.56	 0.47 	0.39
36000807	A	123 	78	0	5	0.1	 0.2 	0.3	76	-0.34 	 0.27 	-0.38
36000807	A	123 	79	0	10	-0.1 	 0.2 	-0.3	76	-0.14 	 0.25 	-0.53
--------------------------today.csv----------------------------------------****-----------yday.csv------------

f1=today.csv
f2=yday.csv
f1.X today.csv's X column number
f2.X yday.csv's X column number

---------- Post updated at 06:40 AM ---------- Previous update was at 06:24 AM ----------

Hi Corona,

I tried to resolve your struggling at my best . Do let me know, if something left out.
Cheers!!!

raj_saini20 · August 9, 2012, 7:53am

You mean to say sixth column will be common in both the file and need to merge on the basis of sixth column.
is the file1 sixth column will be sorted as shown in sample file

RudiC · August 9, 2012, 8:06am

Would this come near what you expect to see:

awk '    BEGIN {OFS="\t"}
    {
     a[7]=a[8]=a[9]="NA";
     if (tmp!=$6) {getline p < "yday";
             split(p,a)}
     tmp=$6
    }
    {print $0, a[7], a[8], a[9]}
    ' today

36000807    A    123     78    0    1    0.1     0.2     0.3    0.56    0.47    0.39
36000807    A    123     79    0    1    -0.1    0.2     -0.3    NA    NA    NA
36000807    A    123     78    0    5    0.1     0.2     0.3    -0.34    0.27    -0.38
36000807    A    123     79    0    5    -0.1    0.2     -0.3    NA    NA    NA
36000807    A    123     78    0    10   0.1     0.2     0.3    -0.14    0.25    -0.53
36000807    A    123     79    0    10   -0.1    0.2     -0.3    NA    NA    NA

---------- Post updated at 02:06 PM ---------- Previous update was at 01:54 PM ----------

This will meet your requirements on line 1:

awk 'BEGIN {OFS="\t"}
    {
     a[4]=a[7]=a[8]=a[9]="NA";
     if (tmp!=$6 && NR>1)
         {getline p < "yday";  split(p,a)}
     tmp=$6
    }
    {print $0, a[4], a[7], a[8], a[9]}
    ' today

Make the OFS a , if you need that as output delimiter.

raj_saini20 · August 9, 2012, 8:19am


 awk 'BEGIN{OFS="\t"}
 NR==FNR{
	a[$6]=$0
	}
 NR!=FNR{
	if(a[$6])
		{	
			split(a[$6],b);
			if(!c[$6])
				{
					c[$6]=1;
					print $0,b[4],b[7],b[8],b[9]
				}
			else
				{
					print $0,b[4],"NA","NA","NA"
				}
		}
	else
		{
			print $0,"NA","NA","NA","NA"
		}
	}' yest.csv today.csv

manas_ranjan · August 9, 2012, 8:41am

thank you all for all your suggestions .
I'm gonna try it. hopefully it's going to work, fingers are crossed...

can I make a combination columns like

$5 and $6

not

$6

from the file

raj_saini20 · August 9, 2012, 8:46am

yes it can be done with combination also

manas_ranjan · August 9, 2012, 9:09am

thanks
the one which you provided works fine or rather as charm if both files have different counts.

But it's doesn't work, if both y'day and today file have same count??

can you help me.
I made little change can you please check what I'm missing for same count.

awk -F"," 'BEGIN{OFS=","}
         NR==FNR{a[$5 FS $6]=$0 } NR!=FNR{
        if(a[$5 FS $6]){split(a[$5 FS $6],b);
        if(!c[$5 FS $6])
        {       c[$5 FS $6]=1;
        print $0,b[4],b[7],b[8],b[9]
        } else {
        print $0,b[4],"NA","NA","NA" }
        } else {
        print $0,"NA","NA","NA","NA"
        }}' yday.csv today.csv

today.csv

36000807	A	123 	78	0	1	0.1	 0.2 	0.3
36000807	A	123 	78	0	5	0.1	 0.2 	0.3
36000807	A	123 	79	0	10	-0.1 	 0.2 	-0.3

yday.csv

36000807	A	123 	76	0	1	0.56	 0.47 	0.39
36000807	A	123 	76	0	5	-0.34 	 0.27 	-0.38
36000807	A	123 	76	0	10	-0.14 	 0.25 	-0.53

final.csv

f1.1		f1.2	f1.3	f1.4	f1.5	f1.6	f1.7	f1.8	f1.9	f2.4	f2.7	f2.8	f2.9
36000807	A	123 	78	0	1	0.1	 0.2 	0.3	76	0.56	 0.47 	0.39
36000807	A	123 	78	0	5	0.1	 0.2 	0.3	76	-0.34 	 0.27 	-0.38
36000807	A	123 	79	0	10	-0.1 	 0.2 	-0.3	76	-0.14 	 0.25 	-0.53
--------------------------today.csv----------------------------------------****-----------yday.csv------------

RudiC · August 9, 2012, 9:12am

Any comments on / errors with my proposal? Works fine on my machine...

manas_ranjan · August 9, 2012, 9:21am

yes, Rudi I have tried your both option taking both file as same count and different count .

in different record count it works fine as expected
but when both of record counts are same then it kind of messed up.

I think some where in the same count, we need little tweak keeping the logic for different record count untouched.

RudiC · August 9, 2012, 10:17am

OK, then be little more specific. yday was half the lines as today in your example. This should not be assumed, I understand; so what be the number of lines in yday? Anything between today and today/2? Or today and zero? And what to do if yday has the same line count for e.g. 3 records and only half the line count starting at record 4? Is it possible yday has more lines for certain records?

manas_ranjan · August 10, 2012, 4:10am

Hi Rudi,

That's exactly I'm trying to say.

sometimes it could be possible that

yday.csv's record count is exactly equals to today.csv
the file I used as an example have same record count 3 and 3
it's purely coincidence that my example of yday and today has 3 records which is exactly half of not same record count file.
There could be possible that yday and today file counts could be 4 and 4.
This is the scenario for same record count.
and for not same record count it could be 6 and 2 or could be 6 and 3 .

please refer example case one and two are example of not same record count
case one
today.csv 6 count

36000807	A	123 	78	0	1	0.1	 0.2 	0.3
36000807	A	123 	79	0	1	-0.1 	 0.2 	-0.3
36000807	A	123 	78	0	5	0.1	 0.2 	0.3
36000807	A	123 	79	0	5	-0.1 	 0.2 	-0.3
36000807	A	123 	78	0	10	0.1	 0.2 	0.3
36000807	A	123 	79	0	10	-0.1 	 0.2 	-0.3

yday.csv - 3 record count

36000807	A	123 	76	0	1	0.56	 0.47 	0.39
36000807	A	123 	76	0	5	-0.34 	 0.27 	-0.38
36000807	A	123 	76	0	10	-0.14 	 0.25 	-0.53

merged file which matched column 5 and 6 of both file and print exactly 1st row with the value and 2nd row with NA NA

f1.1		f1.2	f1.3	f1.4	f1.5	f1.6	f1.7	f1.8	f1.9	f2.4	f2.7	f2.8	f2.9
36000807	A	123 	78	0	1	0.1	 0.2 	0.3	76	0.56	 0.47 	0.39
36000807	A	123 	79	0	1	-0.1 	 0.2 	-0.3	76	NA	NA	NA
36000807	A	123 	78	0	5	0.1	 0.2 	0.3	76	-0.34 	 0.27 	-0.38
36000807	A	123 	79	0	5	-0.1 	 0.2 	-0.3	76	NA	NA	NA
36000807	A	123 	78	0	10	0.1	 0.2 	0.3	76	-0.14 	 0.25 	-0.53
36000807	A	123 	79	0	10	-0.1 	 0.2 	-0.3	76	NA	NA	NA
--------------------------today.csv------------------------------------------****-----------yday.csv------------

option two
today.csv 6 reocrd count

36000807	A	123 	78	0	1	0.1	 0.2 	0.3
36000807	A	123 	79	0	1	-0.1 	 0.2 	-0.3
36000807	A	123 	78	0	5	0.1	 0.2 	0.3
36000807	A	123 	79	0	5	-0.1 	 0.2 	-0.3
36000807	A	123 	78	0	10	0.1	 0.2 	0.3
36000807	A	123 	79	0	10	-0.1 	 0.2 	-0.3

yday.csv - 2 record count missing column 5 and 6 (0 and 1) values

36000807	A	123 	76	0	5	-0.34 	 0.27 	-0.38
36000807	A	123 	76	0	10	-0.14 	 0.25 	-0.53

final merged file have 6 rows and since yday column 5 and 6 missed it's values
so f2.7 f2.8 f2.9 have values NA NA NA

f1.1		f1.2	f1.3	f1.4	f1.5	f1.6	f1.7	f1.8	f1.9	f2.4	f2.7	f2.8	f2.9
36000807	A	123 	78	0	1	0.1	 0.2 	0.3	76	NA	NA	NA
36000807	A	123 	79	0	1	-0.1 	 0.2 	-0.3	76	NA	NA	NA
36000807	A	123 	78	0	5	0.1	 0.2 	0.3	76	-0.34 	 0.27 	-0.38
36000807	A	123 	79	0	5	-0.1 	 0.2 	-0.3	76	NA	NA	NA
36000807	A	123 	78	0	10	0.1	 0.2 	0.3	76	-0.14 	 0.25 	-0.53
36000807	A	123 	79	0	10	-0.1 	 0.2 	-0.3	76	NA	NA	NA
--------------------------today.csv-----------------------------------------***-----------yday.csv------------

and incase of same record count , both files have same record count and each column 5 and 6 have corresponding value for coulmn 5 and 6 in other file
today.csv

36000807	A	123 	78	0	1	0.1	 0.2 	0.3
36000807	A	123 	78	0	5	0.1	 0.2 	0.3
36000807	A	123 	79	0	10	-0.1 	 0.2 	-0.3

yday.csv

36000807	A	123 	76	0	1	0.56	 0.47 	0.39
36000807	A	123 	76	0	5	-0.34 	 0.27 	-0.38
36000807	A	123 	76	0	10	-0.14 	 0.25 	-0.53

and final mergerd file

f1.1		f1.2	f1.3	f1.4	f1.5	f1.6	f1.7	f1.8	f1.9	f2.4	f2.7	f2.8	f2.9
36000807	A	123 	78	0	1	0.1	 0.2 	0.3	76	0.56	 0.47 	0.39
36000807	A	123 	78	0	5	0.1	 0.2 	0.3	76	-0.34 	 0.27 	-0.38
36000807	A	123 	79	0	10	-0.1 	 0.2 	-0.3	76	-0.14 	 0.25 	-0.53
--------------------------today.csv----------------------------------------****-----------yday.csv------------

f1=today.csv
f2=yday.csv
f1.X today.csv's X column number
f2.X yday.csv's X column number

RudiC · August 11, 2012, 5:34am

This proposal hopefully will fulfill all your requirements, file1 being larger, smaller than, or identical to file2. It will compare on field 6 only; you may extend it to comparing field 5 by adding ai[5] and bi[5] to all occurrences of ai[6] and bi[6]. But then, expect bizarre results! Here we go:

awk 'BEGIN {fmt="%10.10s"; aOK=bOK=needa=needb=1;}
        { while (aOK || bOK)
                {if (needa) aOK=getline a<filea;
                 if (needb) bOK=getline b<fileb;
                 if (!(aOK||bOK)) exit;

                 if (aOK) {split(a,ai)} else {ai[6]="NA"};
                 if (bOK) {split(b,bi)} else {bi[6]="NA"};

                 if(ai[6]==bi[6])       {needa=needb=1}
                 else   if(ai[6]>bi[6])
                                        {needa=0; needb=1; ai[4]=ai[5]=ai[6]=ai[7]=ai[8]=ai[9]="NA";}
                        else            {needa=1; needb=0; bi[4]=bi[5]=bi[6]=bi[7]=bi[8]=bi[9]="NA";}
                for (i=1;i<=9;i++) printf fmt, ai; for(i=4;i<=9;i++) printf fmt, bi; print "";
                }
         exit
        }
        ' FS=" " filea=today fileb=yday today

Please adapt the format string fmt in the BEGIN part and the field separator according to your requirements. Give it a thorough test and come back with the results!
It still may require some polishing, e.g. on the repeated assignment of "NA" to the single fields, but I've run out of ideas here...

raj_saini20 · August 13, 2012, 1:24am

 awk 'BEGIN{OFS="\t"}
 NR==FNR{
	a[$6]=$0
	}
 NR!=FNR{
	if(a[$6])
		{	
			split(a[$6],b);
			if(!c[$6])
				{
					c[$6]=1;
					print $0,b[4],b[7],b[8],b[9]
				}
			else
				{
					print $0,b[4],"NA","NA","NA"
				}
		}
	else
		{
			print $0,"NA","NA","NA","NA"
		}
	}' YEST.CSV TODAY.CSV

This is working for all scenario. For the same row count output is

36000807        A       123     78      0       1       0.1      0.2    0.3     76      0.56    0.47    0.39
36000807        A       123     78      0       5       0.1      0.2    0.3     76      -0.34   0.27    -0.38
36000807        A       123     79      0       10      -0.1     0.2    -0.3    76      -0.14   0.25    -0.53

which is as desired by you

RudiC · August 15, 2012, 7:14am

Here's the polished version of my above proposal. Pls. give it a try and comment:

awk -F, 'BEGIN  {OFS=FS;fmt="%s"OFS;
                 aOK=bOK=needa=needb=1;
                 na[4]=na[5]=na[6]=na[7]=na[8]=na[9]="NA";
                 filea=ARGV[1];
                 fileb=ARGV[2];
                }

         function prt (A, B, C) {for (i=1;i<=3;i++) printf fmt, A;
                                 for (i=4;i<=9;i++) printf fmt, B;
                                 for (i=4;i<=9;i++) printf fmt, C;
                                 print "";}

        BEGIN \
         {while (aOK||bOK)
                {if (needa) aOK=getline a<filea;
                 if (needb) bOK=getline b<fileb;

                 if (!(aOK||bOK)) exit;

                 if (aOK) {split(a,ai)} else {ai[6]="NA"};
                 if (bOK) {split(b,bi)} else {bi[6]="NA"};

                 if(ai[6]==bi[6])               {needa=1; needb=1; prt (ai, ai, bi);}
                 else   if(ai[6]>bi[6])         {needa=0; needb=1; prt (ai, na, bi);}
                        else                    {needa=1; needb=0; prt (ai, ai, na);}
                }
         }' today.csv yday.csv