join command

penfold · February 21, 2005, 5:08am

Hi,

I'm using the join command and it appears to discard certain fields. Here are the two files i'm comparing:

File1:
1 a
2 b
3 c
4 d
99 f
101 g
999 i
200 j

File 2:
1 e
2 f
3 g
4 h
99 h
101 i
999 j
200 k

> join 1 2
results in:

1 a e
2 b f
3 c g
4 d h
99 f h
999 i j

So 101 and 200 have been ignored....any ideas why?

Cheers

zazzybob · February 21, 2005, 6:15am

The files must be sorted first.

From the join manual page....

So....

$ sort 1 > 1.sorted
$ sort 2 > 2.sorted
$ join 1.sorted 2.sorted
1 a  e
101 g  i
2 b  f
200 j  k
3 c  g
4 d  h
99 f  h
999 i  j

Cheers
ZB

penfold · February 21, 2005, 9:56am

Thanks - you were right it was because it wasn't correctly sorted.

Does anyone know how to add a tab to the output of the join command i.e. the fields that are added should be seperated from the rest with a tab?

e.g.

1 g </tab> h
3 g </tab> i

Cheers

codpilot · February 21, 2005, 10:53am

Well here's my join issue

I have a file in the following format:
id/name/salary/position

and a file in the following format
number,amount,id

if I use sed to replace the different separators with a ":" and try to join them - the join fails to match on id.

However if I type the files in with the : separator already in place - the join works fine.

Sadly I am the instructor - and tested the script with my pre-made files (with the ":") but assigned the work with different separators to make the class use sed to convert them.

Both files are sorted (ascii) and other than the separators are identical. Anyone have a clue as to why this would happen?

vgersh99 · February 21, 2005, 10:59am

looking at 'man join':

codpilot · February 21, 2005, 11:04am

I did that:

join -t: -1 3 -2 1 file1 file2

and if I hand type the files in - no problem. If however I use the original files with the differing separators, use sed to convert them to a ":" then the join fails. Same command, same file (except the separators in one have been manually entered, and the second time they were converted) different results

sigh such a pita

vgersh99 · February 21, 2005, 11:13am

could you post a sample script AND sample input files?

codpilot · February 21, 2005, 11:24am

Sure thing:

sales is the initial sales file
associates is the associates file
SRep is the script

sales2 and assoc are the same files hand typed with the ":" separator. I don't/haven't tried that with the script - but instead run a manual join and they work fine.

Thanks for the help
umm - I selected upload, but the files did not "Attach" hmm well here is a cut and paste of the files:

sales (partial):
104,2,03:07:2004,24
104,3,03:07:2004,24
113,92,12:07:2004,24
104,2,03:07:2004,24
104,1,03:07:2004,26
104,2,03:07:2004,24
112,1,01:07:2004,23
104,9,03:08:2004,21
105,2,03:08:2004,24
111,4,12:07:2004,21
106,2,03:07:2004,24
107,9,03:08:2004,21
108,21,03:07:2004,22
109,12,03:08:2004,24
110,2,03:08:2004,23
104,2,03:07:2004,24
104,2,03:09:2004,24
104,1,03:09:2004,26
104,2,03:09:2004,24
112,1,01:09:2004,23
104,9,03:09:2004,21
104,2,05:09:2004,24
104,1,05:09:2004,26
104,2,05:09:2004,24
112,1,05:09:2004,23
104,9,05:09:2004,21

Associates:
21/John Doe/39000/Clerk
22/Kirk Russell/99000/President
23/Susan Smith/44000/Manager
24/Fast Buck/21000/Stock Boy
25/Scott Brown/99000/Engineer
26/Aimee Woods/88000/Secretary

script SRep:
#!/bin/bash
#=============================================================
# Script: Sales Report - Assignment 2
# Date: February 2005
# Purpose:

#trap " rm t[1] 2>/dev/null; exit" 0

awk '
BEGIN {
{ FS = ":"}
{ print "\t Marine Parts R US\n" }
{ print "\t Sales Associates Ranking\n" }
{ "date" | getline d }
{ printf "\t %s\n",d }
}'

# Select the September sales
awk '/:09:/ { print $0 }' sales > sept
echo "============== September Sales ============"
# cat sept

# Strip the dates since we don't need them anymore
# the sort the file on the ID field
cut -d, -f1,2,4 sept > sept_wo_dates
sort -t, -k 3 < sept_wo_dates -o sept_sort
# cat sept_sort

# make the sales file have a : separator
sed -n 's/,/:/gp' < sept_sort > sept_colon
cat sept_colon
# make the associates file have a consistent : seperator
sed -n 's/[/]/:/gp' < associates > assoc_colon
cat assoc_colon

# join the two files
join -t: -1 3 -2 1 -o 1.1 -o 1.2 -o 2.2 -o 2.4 sept_colon assoc_colon > sept_assoc
cat sept_assoc

the hand made files:
sales2 (partial)
109:1:03:09:2004:24
104:2:03:07:2004:24
104:3:03:07:2004:24
113:92:12:07:2004:24
104:2:03:07:2004:24
104:1:03:07:2004:26
104:2:03:07:2004:24
112:1:01:07:2004:23
104:9:03:08:2004:21
105:2:03:08:2004:24
111:4:12:07:2004:21
106:2:03:07:2004:24
107:9:03:08:2004:21
108:21:03:07:2004:22

assoc:
21:John Doe:39000:Clerk
22:Kirk Russell:99000:President
23:Susan Smith:44000:Manager
24:Fast Buck:21000:Stock Boy
25:Scott Brown:99000:Engineer
26:Aimee Woods:88000:Secretary

vgersh99 · February 21, 2005, 11:38am

well..... usually have sample inputs helps to figure out the problem, but... given the blow:

sales.txt

15:joe:5000:engineer

and

assoc.txt

20:75000:15

and running:

join -t: -1 1 -2 3 sales.txt assoc.txt

produces:

codpilot · February 21, 2005, 11:46am

Sorry I tried to upload the files and that didn't work for some reason, so I edited it with sample input.

You are correct, that works fine. When I create the files with a common separator its perfect, when I use sed to make the separators match - nothing, nada zip. Heck I simply retype the output of the sed command with pico and it works. Rather frustrating. grrr

vgersh99 · February 21, 2005, 11:58am

Here's a hint for you to analyze your data AFTER you sed your file.
here's your sale file BEFOREsed-ing:

what field is your key?

here's your file AFTER sed-ing:

what field is your key NOW?

Hint:
already have fields with embedded ':'. Use different separator when sed-ing so that your key fields from both files saty in their respective places - something like '#'.

codpilot · February 21, 2005, 12:06pm

Even with that, they still won't join. I just modded the script. As it stood I was stripping out the date field that had the same field separator (with the cut command) so the field position in the join was already determined.

vgersh99 · February 21, 2005, 12:13pm

I think you got your command line order incorrect/swapped.
I think it should be :

join -t: -1 1 -2 3 assoc sales

codpilot · February 21, 2005, 12:42pm

In the script it's correct - my typing may be discombobulated though. The whole objective is to take a file, change the separators then join them together. Apparently sed does something so that when you do a replace - it doesn't quite make the grade anymore.

As a matter of fact if you type in a line with the "/" separator then sed it to ";", create a second file with the same line already using a ";" they will not match (try it using comm). I guess the question is why? What is it adding? ah well.

This will work with one line but not with multiple (or at least not consistently)

vgersh99 · February 21, 2005, 12:58pm

change this:

sed -n 's/[/]/:/gp' < associates > assoc_colon

to this:

sed -e 's#/#:#g' < associates > assoc_colon

codpilot · February 21, 2005, 1:18pm

Tried it - no luck with that too.

vgersh99 · February 21, 2005, 1:34pm

join -t: -1 3 -2 1 sept_colon assoc_colon > sept_assoc

or given this and the sample files:

 join -t: -1 3 -2 1 -o 1.1 -o 1.2 -o 2.2 -o 2.4 sept_colon assoc_colon

here's what I get:

104:9:John Doe:Clerk
104:9:John Doe:Clerk
112:1:Susan Smith:Manager
112:1:Susan Smith:Manager
104:2:Fast Buck:Stock Boy
104:2:Fast Buck:Stock Boy
104:2:Fast Buck:Stock Boy
104:2:Fast Buck:Stock Boy
104:1:Aimee Woods:Secretary
104:1:Aimee Woods:Secretary

codpilot · February 21, 2005, 2:11pm

there must be something else at work here. I hand executed each line of my script to get to the _colon files, replacing my command with your command. Still the result of my join is empty

This is really annoying, only because I can find nothing wrong with the syntax. there must be some specification as input to join that I am missing.

On a seperate matter, why would join not output values correctly? Given a join on field one for both files, if I have a:
-o 1.1 -o 1.2 -o 1.3 -o 2.1 -o 2.2, why is it overwriting the first field with the 2.2 and dropping both index fields? I can output either file correctly (all fields) it just fails on putting out both files fields.

Now I know why I moved on to relational DB's LOL

vgersh99 · February 21, 2005, 2:17pm

seems like it's working ok for me .......
the 'sept_assoc' file contains the correct [it seems] join....

codpilot · February 21, 2005, 5:47pm

Well I think the problem is quite simply - the original files have messed up linefeeds/CR anddon't play well. I reentered the files (with a common separator) and they work fine - now I'll try them with different ones and see what happens.

Thanks for everyones help. Now if only I can figure out where the extra chars are sneaking in.

8)