As i suggested join
i will explain how to use it. Some work will be left over for the reader and effort on thread-owners part to solve the problem will be appreciated. The following is a loose translation from the german Wikipedia article which i also wrote:
join
is used to annex information of several (usually two) input data streams (files or pipelines) and output the result. The input should be in some sort of record-format: a table-like structure in which records - separated by newline characters - exist which themselves consist of fields separated by field separators.
Example:
field separators (here tabs)
|
fields +---------+------+
| | | |
| | | |
|-----|-----+---|-+----|----+
| | | | | | |
V V V V V V V
Peter Smith 38 50.000 <--- record
Paul Miller 40 55.000 <--- record
Mary Myers 32 60.000 <--- record
We see a table of persons with some characteristics: surname, family name, age, income. Each person is described in its record and each record consists of several fields, each denoting one such characteristic. Note that we could have put captions as table headers but these would NOT be part of the table.
join
now creates an relation between two (or more) such tables. If a record in one table relates to several records in the other table it will be copied as often as necessary. Here is an example:
A: B: result:
f1 a f1 X f1 a X
f1 b f1 b X
f1 c f1 c X
Let us put all together: Suppose we have a file ("tel") with people and their telephone numbers:
>Name Telephone
Anna 123456-123
Karl 123456-456
Sandra 123457-789
And we have another file ("fax") with people and their Fax-number:
>Name Fax
Anna 345678-997
Leo 345679-998
Sandra 345678-999
Notice that both files are tab-separated again, so that between fields there is always a single tabulator-character. The first try
$ join tel fax
>Name Telephone Fax
Anna 123456-123 345678-997
Sandra 123457-789 345678-999
would per default join over the first fields (the names) and only output the values available in both files. Database people would call that an inner join. Notice also that we have entered captions as a pseudo-record.
But this messy output is perhaps not what we want. join
per default uses any whitespace as field separator but it can specifically be told (the -t
to use a certain character. This character will in turn also be used in the output.
In addition we can specify a certain order of output fields ( -o
) if we don't want all of them to appear. The resulting output looks a lot better now:
$ join -t'<tab>' -o 0,1.2,2.2 tel fax
>Name Telephone Fax
Anna 123456-123 345678-997
Sandra 123457-789 345678-999
Furthermore we can change the default inner join to an outer join (include records not available in both files, -a
) and we can assign a standard filler text for the missing information ( -e
):
$ join -t'<tab>' -a 1 -a 2 -e '(none)' -o 0,1.2,2.2 tel fax
>Name Telephone Fax
Anna 123456-123 345678-997
Karl 123456-456 (none)
Leo (none) 345679-998
Sandra 123457-789 345678-999
At last we can also invert the joining so that only records appear in the output which are NOT present in every file - a list of people having either no phone or no Fax:
$ join -t'<tab>' -v 1 -v 2 -o 0 tel fax
Karl
Leo
A tip at last: all the input files to join
have to be sorted already. In this case "sorted" means: sorted for the fields which will used to join the information. Otherwise some or maybe all records will be mysteriously missing from the output. In my example this was silently done before (this is the reason why i used ">" to mark the captions - it sorts before any character so that the header comes out on top).
The implementation of this is now left to the interested reader who is, by now, surely eager to try his newfound powers on his data. Be sure to post your results so that others can learn from your achievements as well as your mistakes.
I hope this helps.
bakunin