Join of files is incomplete?!

Hi folks,

I am using the join command to join two files on a common field as follows:

File1.txt
Adsorption|H01.181.529.047
Adult|M01.060.116
Children|M01.055

File2.txt
5|Adsorption|C0001674
7|Adult|C000001
6|Children|C00002

join -i -t "|" -a 2 -1 1 -2 2 File1.txt File2.txt

This works fine for some lines but not all. Adult is never joined, whatever I try (e.g. converting everything to lower case). The output is:

Adsorption|H01.181.529.047|5|C0001674
7|Adult|C000001
Children|M01.055|6|C00002

What OS are you using? What does -i do in your version of join? My "join" doesn't support -i. But, using your data files without -i...

$ cat File1.txt
Adsorption|H01.181.529.047
Adult|M01.060.116
Children|M01.055
$ cat File2.txt
5|Adsorption|C0001674
7|Adult|C000001
6|Children|C00002
$
$
$ join -t "|" -a 2 -1 1 -2 2 File1.txt File2.txt
Adsorption|H01.181.529.047|5|C0001674
Adult|M01.060.116|7|C000001
Children|M01.055|6|C00002
$

Hmmm, thanks for that.

I am using FedoraCore 2 Linux with join (coreutils) 5.2.1, May 2004.

It must be a problem with my version of join then. What OS are you on?

The -i flag is just for case-insensitive matching.

Cheers

There's this from the 'join' manual at www.gnu.org

'Either file1 or file2 (but not both) can be `-', meaning standard input. file1 and file2 should be already sorted in increasing textual order on the join fields, using the collating sequence specified by the LC_COLLATE locale...'

Another site mentions that:-

'However, as a GNU extension, if the input has no unpairable lines the sort order can be any order that considers two fields to be equal if and only if the sort comparison described above considers them to be equal.'

This suggests to me that experimenting with the LC_COLLATE environment variable may allow the command to work.
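
A minimal sketch of that experiment, assuming GNU coreutils sort and join: force one collating sequence (LC_ALL=C) for both commands and sort each file on its own join field, folding case with sort -f to match join -i. The .sorted file names and the scratch directory are made up for illustration, and nothing in this thread confirms that this cures the behaviour seen on Fedora Core 2.

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample data copied from the original question.
printf '%s\n' 'Adsorption|H01.181.529.047' 'Adult|M01.060.116' 'Children|M01.055' > File1.txt
printf '%s\n' '5|Adsorption|C0001674' '7|Adult|C000001' '6|Children|C00002' > File2.txt

# Use the same collating sequence for sort and join.
export LC_ALL=C

# Sort each file on its join field; -f folds case to match join -i.
sort -f -t '|' -k 1,1 File1.txt > File1.sorted
sort -f -t '|' -k 2,2 File2.txt > File2.sorted

# Same join as in the question, run on the sorted copies.
joined=$(join -i -t '|' -a 2 -1 1 -2 2 File1.sorted File2.sorted)
printf '%s\n' "$joined"
```

If the sorted inputs pair up, the Adult line comes out as Adult|M01.060.116|7|C000001, the same as in the reply above.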

With no -i, it works on HP-UX, Solaris, and even Red Hat 7.2. Red Hat does support the -i option, so I tried that as well. It still works.

Fedora - Linux localhost.localdomain 2.6.11-1.1369_FC4
Works just fine.

System - SunOS 5.9

I am using Unix join to join the following two files.

FileA
_______________
1,-1
3,-1
5,-1
49,-3
51,-1
52,-1
53,-1
54,-1
56,-2
57,-2
61,-1
62,-2
65,-1
66,-2
71,-1
72,-2
81,-3
82,-3
91,-4
99,-1
100,-5

FileB
________
1,2222
3,3222
5,2342
11,2418
15,1890
16,2445
20,2465
21,1889
30,1588
30,1888
31,2887
40,3423
45,4321
49,2345
51,5567
52,5210
53,4444
54,4567
56,1111
57,5678
61,6754
62,6742
65,1231
66,6765
71,1234
71,1991
72,7168
81,7777
82,8765
91,8766
99,9812
99,9998
100,8888
100,8981

First I sort them:

sort -b -n -t ',' +0 FileA > A_sort
sort -b -n -t ',' +0 FileB > B_sort

Then I join them:
join -t ',' -j1 1 -j2 1 -o 0 1.2 2.2 A_sort B_sort

and get:
1,2222,-1
3,3222,-1
5,2342,-1
51,5567,-1
52,5210,-1
53,4444,-1
54,4567,-1
56,1111,-2
57,5678,-2
61,6754,-1
62,6742,-2
65,1231,-1
66,6765,-2
71,1234,-1
71,1991,-1
72,7168,-2
81,7777,-3
82,8765,-3
91,8766,-4
99,9812,-1
99,9998,-1

The following lines are missing:
49,2345,-3
100,8888,-5
100,8981,-5

Why is this happening? Are the keys being treated internally as characters even though I specify -n in sort? What do I need to do? By the way, both LC_COLLATE and LC_CTYPE are set to "". Should I set them to POSIX or C or something?

Many thanks in advance to all the Unix enthusiasts in this forum.
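
One likely cause is that join compares keys as text while sort -n orders them as numbers, so the two disagree about where 49 and 100 belong. A hedged sketch, assuming a sort that accepts -k and using a reduced sample of FileA and FileB: sort both files as text on the first field, join, then put the result back into numeric order. The -o 0,2.2,1.2 list is chosen here to match the column layout shown in the post, and the names A_text and B_text are invented for the example.

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Same collating sequence for sort and join.
export LC_ALL=C

# Reduced sample of the data in the post (not the full files).
printf '%s\n' '1,-1' '5,-1' '49,-3' '51,-1' '100,-5' > FileA
printf '%s\n' '1,2222' '5,2342' '49,2345' '51,5567' '100,8888' '100,8981' > FileB

# Sort as text (no -n), which is the order join expects.
sort -t ',' -k 1,1 FileA > A_text
sort -t ',' -k 1,1 FileB > B_text

# Join on field 1, then restore numeric order for display.
joined=$(join -t ',' -1 1 -2 1 -o 0,2.2,1.2 A_text B_text | sort -t ',' -k 1,1n)
printf '%s\n' "$joined"
```

With this sample the keys 49 and 100 appear in the result alongside the others.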

I believe I have cracked the Da HP Code this time. It's at http://mailgate.supereva.com/comp/comp.sys.hp.hpux/msg26730.html

So awk with a fixed-length zero fill seems to be the way to go!
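
The zero-fill idea might look like the sketch below: pad the key to a fixed width with awk so that text order and numeric order agree, join, then strip the padding again. The width of 5, the helper name pad_key, the file names A_pad and B_pad, and the reduced sample data are assumptions for illustration, not taken from the linked post.

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Same collating sequence for sort and join.
export LC_ALL=C

# Reduced sample of the data in the post (not the full files).
printf '%s\n' '1,-1' '5,-1' '49,-3' '51,-1' '100,-5' > FileA
printf '%s\n' '1,2222' '5,2342' '49,2345' '51,5567' '100,8888' '100,8981' > FileB

# pad_key (hypothetical helper): zero-fill field 1 to 5 digits, then sort as text.
pad_key() {
    awk -F ',' -v OFS=',' '{ $1 = sprintf("%05d", $1); print }' "$1" |
        sort -t ',' -k 1,1
}

pad_key FileA > A_pad
pad_key FileB > B_pad

# Join on the padded key, then turn 00049 back into 49.
joined=$(join -t ',' -1 1 -2 1 -o 0,2.2,1.2 A_pad B_pad |
    awk -F ',' -v OFS=',' '{ $1 = $1 + 0; print }')
printf '%s\n' "$joined"
```

Because the padded keys sort the same way as text and as numbers, the joined output is already in numeric order and needs no second sort.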