Text file parsing and comparison

I have two files (first.txt and second.txt):

more first.txt 

        cat mammal

        lizard reptile

        Elephant mammal

        ant Insecta
more second.txt 

        ant     termite

        ant     army_ant

        human   man

        human   woman

I want to make a third file that takes the relevant entry in the second column of the first file and paste it as a third column in the second file when the first column in first and second file matches.

more third.txt 
        ant     termite insecta
        ant     army_ant        insecta
        human   man     mammal
        human   woman   mammal

Note: The third file that I include here to illuminate my point shouldn't have human man mammal or human woman mammal . I was wrong. Check the entire thread for explanation.

What have you tried so far and where exactly are you stuck?
And how did you arrive at human man mammal in your desired output from the sample inputs you gave?

1 Like

To clarify, the third.txt file is what I expect as output and I just put that out for illustration purposes. Reg my attempts: I tried capturing the the data in the first file data in an awk associative array, but then I am unable to think of the right conditional statement to print the corresponding entries from the first file as the third column in the second file.

it might help to see the attempt so we could help to straighten it out.
It's a very doable covered many time in these forums req.

First, my suggestion is to re-read the forum rules: Homework/coursework questions are to be posted in a special section of the forum where special rules apply.

Second, i suggest to type

man join

into the next available terminal and read what stands there. This thread is

bakunin

After checking back with the thread owner i learned that i was wrong and this is not homework. My apologies. This thread is

bakunin

As i suggested join i will explain how to use it. Some work will be left over for the reader and effort on thread-owners part to solve the problem will be appreciated. The following is a loose translation from the german Wikipedia article which i also wrote:

join is used to annex information of several (usually two) input data streams (files or pipelines) and output the result. The input should be in some sort of record-format: a table-like structure in which records - separated by newline characters - exist which themselves consist of fields separated by field separators.

Example:

      field separators (here tabs)
            |
    fields  +---------+------+
      |     |         |      |
      |     |         |      |
      |-----|-----+---|-+----|----+
      |     |     |   | |    |    |
      V     V     V   V V    V    V
     Peter      Smith   38      50.000      <--- record
     Paul       Miller  40      55.000      <--- record
     Mary       Myers   32      60.000      <--- record

We see a table of persons with some characteristics: surname, family name, age, income. Each person is described in its record and each record consists of several fields, each denoting one such characteristic. Note that we could have put captions as table headers but these would NOT be part of the table.

join now creates an relation between two (or more) such tables. If a record in one table relates to several records in the other table it will be copied as often as necessary. Here is an example:

A:               B:            result:
     f1 a           f1 X               f1 a X
     f1 b                              f1 b X
     f1 c                              f1 c X

Let us put all together: Suppose we have a file ("tel") with people and their telephone numbers:

>Name	Telephone
Anna	123456-123
Karl	123456-456
Sandra	123457-789

And we have another file ("fax") with people and their Fax-number:

>Name	Fax
Anna	345678-997
Leo	345679-998
Sandra	345678-999

Notice that both files are tab-separated again, so that between fields there is always a single tabulator-character. The first try

$ join tel fax

>Name Telephone Fax
Anna 123456-123 345678-997
Sandra 123457-789 345678-999

would per default join over the first fields (the names) and only output the values available in both files. Database people would call that an inner join. Notice also that we have entered captions as a pseudo-record.

But this messy output is perhaps not what we want. join per default uses any whitespace as field separator but it can specifically be told (the -t to use a certain character. This character will in turn also be used in the output.

In addition we can specify a certain order of output fields ( -o ) if we don't want all of them to appear. The resulting output looks a lot better now:

$ join -t'<tab>' -o 0,1.2,2.2 tel fax

>Name	Telephone	Fax
Anna	123456-123	345678-997
Sandra	123457-789	345678-999

Furthermore we can change the default inner join to an outer join (include records not available in both files, -a ) and we can assign a standard filler text for the missing information ( -e ):

$ join -t'<tab>' -a 1 -a 2 -e '(none)' -o 0,1.2,2.2 tel fax

>Name	Telephone	Fax
Anna	123456-123	345678-997
Karl	123456-456	(none)
Leo	(none)		345679-998
Sandra	123457-789	345678-999

At last we can also invert the joining so that only records appear in the output which are NOT present in every file - a list of people having either no phone or no Fax:

$ join -t'<tab>' -v 1 -v 2 -o 0 tel fax

Karl
Leo

A tip at last: all the input files to join have to be sorted already. In this case "sorted" means: sorted for the fields which will used to join the information. Otherwise some or maybe all records will be mysteriously missing from the output. In my example this was silently done before (this is the reason why i used ">" to mark the captions - it sorts before any character so that the header comes out on top).

The implementation of this is now left to the interested reader who is, by now, surely eager to try his newfound powers on his data. Be sure to post your results so that others can learn from your achievements as well as your mistakes.

I hope this helps.

bakunin

2 Likes

Thanks @bakunin for the detailed explanation regarding join. I also realized that my original question wasn't clear and I didn't really need to compare the files or capture the data in an array. I was able to do a simple "join" after sorting. As I was in a hurry I ended up posting a query that was poorly worded. Anyway just for clarification here is my first file (sorted by the first field):

ant	Insecta	
cat	mammal
Elephant	mammal	
lizard	reptile

Here is my second file sorted by the first field:

ant	termite
ant	army_ant
human	man
human	woman

I can join this by the (default) first field as follows:

join first.txt second.txt

ant Insecta termite
ant Insecta army_ant

[/CODE]

Clearly in my original post, I also erred in my expectation for the �third file� where I erroneously included human man mammal . This was first pointed out by @vgersh99. Thanks!

If you don't want to sort your input files (as required by join ), please show us what you tried with awk (so we can see how close you were able to get to your goal). Maybe we can help you fix your awk code to get the output you want.

Don, @vgersh99 gave me a hint in a PM and using that I think, I have got the output I desired using the following awk code:

awk 'FNR==NR{f1[$1]=$2;next} $1 in f1 {print $0,f1[$1]}' first.txt second.txt

ant	termite Insecta
ant	army_ant Insecta