Problem with Join Command

I have two files. File 1 is a daily file with just a column of IDs and a date column. File 2 is a full dump of IDs and their respective costs. I basically want an inner join. When I pick a few rows from these files and join them, it works perfectly fine, but when I join the full files, I get no output at all. Below are the sample files:

File 1 :
01223610248 , 04/07/2009
01223612562 , 08/19/2003
01223617098 , 02/01/2005
01223618661 , 12/13/2005
01223619159 , 05/29/2007
01223620423 , 02/06/2007
01256957092 , 04/22/2003
01256959417 , 12/19/2006
01256959597 , 02/08/2005
01256970382 , 10/12/2004
01256970722 , 03/15/2005
01256972774 , 10/11/2005
01256975064 , 05/02/2006
01256976030 , 03/21/2006
933604300140 , 10/27/2008
933736900872 , 05/06/2016
933604300128 , 03/14/2008
File 2:
933604300140 ,20.64
933736900872 ,18.56
933604300128 ,20.64
67119603398 ,0.64
67261704102 ,0.65
75072313652 ,0.65
02454397033 ,0.70
02454397537 ,0.70
03139824387 ,0.70
03139824388 ,0.70
76218230730 ,0.70
77898802256 ,0.70
88843006240 ,0.70
63410597392 ,0.81
84315600053 ,0.82
63447926391 ,0.97
06004450461 ,0.98
60161702121 ,0.98
79862230382 ,0.98
79862230662 ,0.98

A simple join command works for this. But when I join the original File 2 (around 19 MB on the server) with the original File 1 (96 KB), I get no output at all.

using

join file 2 file1

works for the above sample files.

I have tried the following commands for the full files:

join -t, file2 file1
 join -t, -1 1 -1 2 file2 file1

nothing seems to be working with the original files.

I also tried the following code lines:

awk 'NR=FNR{check[$0];next} $2 in check' File2 File1

and

cat File2 | while read line; do  grep $line File1; done

What am I missing here? Please help!

Thanks

You say nothing seems to be working with the original files. What does that mean? Are the commands completing with no output? Are you killing the commands because no output is produced in an hour (or some other fixed time)?

Your sample data in File 1 (or maybe file1 as it is referenced in your sample code) uses <space><comma><space> as the field separator. Your sample data in File 2 (or maybe file2 or file 2 as it is referenced in your sample code) uses <space><comma> as the field separator. But your code just uses <comma> as the field separator. Are you sure that your real data ALWAYS has exactly one <space> character after the number in the first field in both input files before the <comma> on every line?
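
If the separators really are inconsistent, one way around that (just a sketch, assuming the ID is the first comma-separated field in both files) is to squeeze any spaces around the comma out of both files and sort them on the ID before joining:

# Normalize "<space>,<space>" and "<space>," to a bare comma, then sort on field 1.
sed 's/[[:space:]]*,[[:space:]]*/,/g' file1 | sort -t, -k1,1 > file1.clean
sed 's/[[:space:]]*,[[:space:]]*/,/g' file2 | sort -t, -k1,1 > file2.clean
join -t, -1 1 -2 1 file1.clean file2.clean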

I posted a reply yesterday but I am not sure why it is not showing up. So here it is again:

The files were originally tab-delimited, but I made them comma-delimited to help with the join command. I am now trying again with the tab-delimited files. I also tried some additional options in my join command; this is what I gave:

join -t"  "  -a 2 -a 1 -e 'NULL' -o '0,1.1,1.2,2.1,2.2' File1 File2 | head -100

I am getting a result out of this, which unfortunately means that join is not finding a common key between the files, and that is surprising because there ARE common values between them. This is what a sample of the result looks like:

01635158332	09/09/2016 01635158332	09/09/2016 NULL NULL NULL
01635163349	11/24/2009 01635163349	11/24/2009 NULL NULL NULL
16.11	01635163339 NULL NULL 16.11	01635163339 NULL
16.11	01635163349 NULL NULL 16.11	01635163349 NULL

As you can see above, 01635163349 is a common key between File 1 (which has the dates) and File 2 (which has the costs). So ideally the result should be:

01635163349  11/24/2009  16.11

The command

join -1 1 -2 1 File 1 File 2 

does not give me any result, i.e. no output on the console at all.

This is how file 1 looks:

00033492482     04/11/2006
00033492682     07/14/2009
00033492702     02/09/2010
00076848302     08/10/2010
00881123792     11/07/2000
01130162424     06/12/2007
01130164254     01/29/2008
01130165543     05/16/2011
01130168864     07/14/2009
01635163349     11/24/2009

File 2:

0.00    03139822826
0.00    49246820001
0.00    7621830148
0.00    822004599003
0.11    73379268872
0.64    67119603398
0.65    67261704102
16.11   01635163349

Can there be any other way to achieve an inner join between these files?

Shouldn't you use column 2 in file2? Try

join  -11 -22 file1 file2
01635163349 11/24/2009 16.11

And, moving targets rarely help. Why didn't you post representative samples in the first place?

Sorry, I did

join -11 -22  File1 File2

That came out as a typo, but this is not working either :frowning:

DOS <CR> (0x0D, \r, ^M) line terminators? Where and how did you produce the files?

These files are being sent by the source. There are many other columns in these files; I have manipulated them with awk and sed to remove the unneeded columns and the header. When I view the files in vi, I do not see any ^M characters.

As you can see, it works on copies of the files you posted. So there must be an inherent difference in the original files you are working on. Reduce the files to just the lines in question, try again, and if it still doesn't work, post the output of:

od -tx1c file[12]


You have now shown us 3 different input file formats (tab separated fields, <space><comma><space> separated fields, and <space><comma> separated fields). You have shown us commands using <space>, <comma>, and <tab> as the field separator. And it isn't clear which separators have been used in the files those commands are processing.

More importantly, you have said that your file names are File 1 , File1 , file 1 , and file1 . Since none of your commands have quoted the filenames being passed as arguments, many of them are asking various utilities to work on files named File or file and 1 and 2 (which presumably result in non-existent file diagnostics that you haven't shown us). The name of a file is case sensitive and having a <space> in a filename requires special handling in LOTS of ways that are being ignored in all of your command lines.
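
If the files really do have <space> characters in their names, the names have to be quoted (or the spaces escaped) wherever they are used, for example:

# Quoting keeps a name containing a space as a single argument:
join "File 1" "File 2"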

Then, it is also important to understand that in an awk script, $0 is the contents of the current input line, $2 is the contents of the 2nd field in the current input line, and a command like:

awk 'NR=FNR{check[$0];next} $2 in check' File2 File1

is never going to work unless File2 contains lines that exactly match the 2nd field of some line in File1 (which is not true for any of your sample input file pairs).
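
A corrected version of that idea (untested, and assuming the layout from your last samples: cost then ID in File2, ID then date in File1, with the comparison written as NR==FNR rather than the assignment NR=FNR) would look more like:

# Collect the IDs from field 2 of File2, then print the File1 lines
# whose first field (the ID) was collected.
awk 'NR == FNR {check[$2]; next} $1 in check' File2 File1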

And, the command line:

cat File2 | while read line; do  grep $line File1; done

will only work correctly if there are no <space> or <tab> characters on any line in File2 AND you are trying to find complete lines from File2 that appear as part of a line in File1.
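
If you do want a grep-based approach, something closer to working (again only a sketch under the same layout assumptions, and it only prints the matching File1 lines, without the cost) would be:

# Turn the ID column of File2 into patterns anchored to the start of a line,
# then pull the matching lines out of File1.
awk '{print "^" $2 "[[:blank:]]"}' File2 > ids.pat
grep -f ids.pat File1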

And, the command line:

join file 2 file1

should give you a diagnostic similar to:

usage: join [-a file no | -v file no ] [-e string] [-1 field] [-2 field]
            [-o list] [-t char] file1 file2

not the complete lack of output that you say you get.

If you keep giving us inconsistent data and don't show us what your command lines and/or the output you get from them really are, you make it impossible for us to help you.

Saying things like "this is not working" doesn't help us. Show us the exact diagnostic that was produced!

Saying that the files are sent by the source and that you have manipulated them with awk and sed doesn't give us any indication as to whether or not we are working on UNIX format text files after that manipulation. If, after you have manipulated them, the files are still DOS format text files, there is a good chance that DOS <carriage-return> line terminators are keeping fields from matching, or are causing output sent to your terminal to be obscured by parts of output lines overwriting text already sent to your screen.
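
If you want to check for (and strip) stray <carriage-return> characters yourself, a quick sketch would be:

# DOS line endings show up as a trailing ^M when non-printing characters are made visible.
cat -v File1 | head -5

# If any are found, strip them into cleaned copies and work with those.
tr -d '\r' < File1 > File1.unix
tr -d '\r' < File2 > File2.unix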

Please give us clear answers to the questions we have asked. We are asking for information that will allow us to help you. We are not asking you to do extra work for the fun of it.

Please help us help you!

Sorry for the confusion that I am creating. Let me start from the beginning. These files are tab-delimited. Because there has been confusion over the file names, I will henceforth use the original names: dlya0908.tab (which I was referring to as File1) and tgpr.tab (which I was referring to as File2).

I do not have information on how the source team creates the files. They are FTPed from an external server.

So, dlya0908.tab looks like this :

00033492482     04/11/2006
00033492682     07/14/2009
00033492702     02/09/2010
00076848302     08/10/2010
00881123792     11/07/2000
01130162424     06/12/2007
01130164254     01/29/2008
01130165543     05/16/2011
01130168864     07/14/2009
01635163349     11/24/2009

and tgpr.tab looks like this :

0.00    03139822826
0.00    49246820001
0.00    7621830148
0.00    822004599003
0.11    73379268872
0.64    67119603398
0.65    67261704102
16.11   01635163349

I am trying to join these files like this :

join -t"  "  -11 -22 dlya0908.tab tgpr.tab

The above command does not give me any result.

When I give

join -a 1 -a 2 -e "NULL" -o'0,1.1,2.2' dlya0908.tab tgpr.tab

I get

0.00 NULL 03139822826
0.00 NULL 49246820001
0.00 NULL 7621830148
0.00 NULL 822004599003
0.11 NULL 73379268872
0.64 NULL 67119603398
0.65 NULL 67261704102
00033492482 00033492482 NULL
00033492682 00033492682 NULL
00033492702 00033492702 NULL
00076848302 00076848302 NULL
00881123792 00881123792 NULL
01130162424 01130162424 NULL
01130164254 01130164254 NULL
01130165543 01130165543 NULL
01130168864 01130168864 NULL
01635163349 01635163349 NULL
16.11 NULL 01635163349

This is wrong because there is a common key here -- 01635163349.

So the output I am looking for is :

01635163349  11/24/2009  16.11

I am looking for a way to inner join these files. tgpr.tab is a full dump file while dlya0908.tab is a daily file.

I hope the information I have given is helpful this time :slight_smile:

Files being joined by the join utility must be ordered in the collating sequence of sort -b on the fields on which they are being joined.

The 2nd field in tgpr.tab is NOT in sorted order. And, with the sample data you showed us in post #10, every line in dlya0908.tab sorts before the 1st line in tgpr.tab .
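
In other words, sorting both files on their join fields first (a sketch, assuming blank- or tab-separated fields) should make a plain inner join work:

# Sort each file on its join field: field 1 of dlya0908.tab, field 2 of tgpr.tab.
# -b skips leading blanks, which is what join expects.
sort -b -k1,1 dlya0908.tab > daily.sorted
sort -b -k2,2 tgpr.tab > dump.sorted
join -1 1 -2 2 daily.sorted dump.sorted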

Sorry, no. Nothing new. How about the octal dump?

Your first command, join -t"<TAB>" -11 -22 dlya0908.tab tgpr.tab, with a <TAB> (\t, 0x09) char following the -t option, will print the desired result unless field 1 in dlya0908.tab can't be joined with field 2 in tgpr.tab due to (obviously non-printing) differences.

Your second command, join -a 1 -a 2 -e "NULL" -o'0,1.1,2.2' dlya0908.tab tgpr.tab, without -1/-2 options, will compare field 1 in dlya0908.tab to field 1 in tgpr.tab and won't find identical entries, of course.
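
For reference, one way to hand a literal <TAB> to -t without typing it at the prompt (a sketch; it still requires genuinely tab-delimited files sorted on their join fields):

# "$(printf '\t')" expands to a real tab character in any POSIX shell;
# bash and ksh93 also accept -t $'\t'.
join -t "$(printf '\t')" -1 1 -2 2 dlya0908.tab tgpr.tab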


If dlya0908.tab is in sorted order by field 1, and tgpr.tab is not changing while your script is running, you might want to try:

sort -b -k2,2 tgpr.tab|join -1 1 -2 2 -a 1 -a 2 -e "NULL" -o'0,1.1,2.2' dlya0908.tab -

which, with the data shown in post #10, produces the output:

00033492482 00033492482 NULL
00033492682 00033492682 NULL
00033492702 00033492702 NULL
00076848302 00076848302 NULL
00881123792 00881123792 NULL
01130162424 01130162424 NULL
01130164254 01130164254 NULL
01130165543 01130165543 NULL
01130168864 01130168864 NULL
01635163349 01635163349 01635163349
03139822826 NULL 03139822826
49246820001 NULL 49246820001
67119603398 NULL 67119603398
67261704102 NULL 67261704102
73379268872 NULL 73379268872
7621830148 NULL 7621830148
822004599003 NULL 822004599003

Or, with just:

sort -b -k2,2 tgpr.tab|join -1 1 -2 2 -e "NULL" -o'0,1.1,2.2' dlya0908.tab -

and those same input files, you get the output:

01635163349 01635163349 01635163349

PS: Note, however, that this only works if your input files actually have tab separated fields. The sample files you have provided in this thread use sequences of spaces as field separators (not tabs).
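
As an aside, an awk version of the inner join sidesteps both the sorting requirement and the tab-versus-space question, since awk's default field splitting treats any run of blanks or tabs as a single separator. A sketch using the layouts from post #10:

# Load the costs keyed by ID (field 2 of tgpr.tab), then print
# ID, date and cost for every dlya0908.tab line whose ID has a cost.
awk 'NR == FNR {cost[$2] = $1; next}
     $1 in cost {print $1, $2, cost[$1]}' tgpr.tab dlya0908.tab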


Right! So it was the second column that was creating trouble. It is working fine now. Thank you for your time, Don and RudiC!