UNIX joins : facing issue while joining three files

rahul2662 · July 27, 2015, 7:27am

Hello ,

I have three files :

sampleoutput1.txt has columns (in the following order) :

hostname ; available patches , available packages

sampleoutput2.txt has columns (in the following order) :

hostname ; patchwave ; BSID ; Application

sampleoutput3.txt has columns (in the following order) :

boottime ; hostname

I am trying to join three files on the column hostname from the three files.
I want the output records as follows :

hostname;available patches;available packages;boot time;patchwave;BSID;Application

Below is the script. Here I am first trying to join the files : sampleoutput1.txt and sampleoutput3.txt. Then I am trying to join the resultant with sampleoutput2.txt

#!/bin/bash
awk -F";" 'FNR==NR{A[$2]=$1;next} {print $0 OFS A[$1] OFS}' OFS=";" sampleoutput3.txt sampleoutput1.txt > sampleoutput1-3.txt
awk -F";" 'FNR==NR{B[$1]=$1;next} {print $0 OFS B[$2] OFS B[$3] OFS B[$4]}' OFS=";" sampleoutput2.txt sampleoutput1-3.txt > sampleoutput1-2-3.txt
echo "Output is placed in file : sampleoutput1-2-3.txt"

The file sampleoutput1-3.txt is giving correct output but file : sampleoutput1-2-3.txt is not giving correct output.

Please help.

Thanks
Rahul

RudiC · July 27, 2015, 7:45am

I'd be surprised if sampleoutput1-3.txt held the correct output as the field separator is NOT ";" and the files' sequence is wrong.

Given the first file does NOT have a line break before the third field, and that spaces are used consistently around the field contents, and that the DOS/windows <CR> line terminators are removed, this might work:

join -t, -22 file1 file3 | join  -t, - file2
hostname,available patches,available packages,boottime,patchwave,BSID,Application

rahul2662 · July 27, 2015, 7:55am

Hello RudiC , the files are not sorted moreover there are some records for hostname which are present in sampleoutput1.txt and not present in sampleoutput2.txt and sampleoutput3.txt. Also delimiter used is ";". I think the above wont work in this scenario. Please help.

RudiC · July 27, 2015, 8:19am

You should not modify posts after an answer was given referring to its contents. And, you should do the initial specification with a bit more care.

SriniShoo · July 27, 2015, 8:30am

awk -F ';' '
 FILENAME == ARGV[1] {A[$1] = ($2 ";" $3); next}
 FILENAME == ARGV[2] {A[$2] = A[$2] == "" ? (";;" $1) : (A[$2] ";" $1); next}
 FILENAME == ARGV[3] {A[$1] = A[$2] == "" ? (";;;" $2 ";" $3 ";" $4) : (A[$1] ";" $2 ";" $3 ";" $4)}
 END {for(x in A) print x ";" A[x]}' sampleoutput1.txt sampleoutput3.txt sampleoutput2.txt > sampleoutput1-2-3.txt

Scrutinizer · July 27, 2015, 9:15am

After your corrections, I noticed there is a comma where there should be a semicolon, also there are spaces around the semicolons. I do not know if they are there in reality, I adjusted FS to compensate. If they are not there you can just use ";" . Please put more care still into your file sample specifications.

Try:

awk '
  BEGIN {
    FS=" *; *" 
    OFS=";"
  } 
  FNR==1{
    c++
  }
  c==1 { 
    A[$1]=$0 
  } 
  c==2 { 
    A[$2]=A[$2] OFS $1
  }
  c==3 { 
    $1=A[$1]
    print
  }
' sampleoutput1.txt sampleoutput3.txt sampleoutput2.txt

rahul2662 · July 28, 2015, 12:35am

Very Sorry RudiC , I will take more care next time. Thanks for your help.

---------- Post updated at 08:28 AM ---------- Previous update was at 08:24 AM ----------

Thanks a lot Scrutinizer , the code which you provided worked fine. Also I will ensure more care while posting next time. Could you please help me in explaining the script. Also Please provide some sources/books which can help me in learning to master advanced awk and sed.

---------- Post updated at 10:05 AM ---------- Previous update was at 08:28 AM ----------

Hello SriniShoo , The script which you provided went into long running and eventually i stopped it after approx 15 minutes. Thanks a lot for your efforts

RavinderSingh13 · July 28, 2015, 12:54am

Hello Rahul,

Following may help you in same.

 awk '
  BEGIN {                                                    ####### Start begin statement here.
    FS=" *; *"                                               ####### mentioning field seprator here as "*; *"  means it will take look for all text from starting to till semi colon as mentioned in your input file
    OFS=";"                                                  ####### mentioning output field seprator as semi colon ;
  } 
  FNR==1{                                                    ####### Whenever FNR is equal to 1, this condition will be TRUE whenever a new file read. As FNR will be reset whenever a new file is being read and on other hand NR will be keep on increasing it's value till all files being read.
    c++                                                      ####### incrementing the value of a variable named c by one.
  }
  c==1 {                                                     ####### When variable C's value is 1 this will be TRUE after first file is being read.
    A[$1]=$0                                                 ####### Creating an array named A whose index is $1 and value is $0 (complete line).
  } 
  c==2 {                                                     ####### When variable named C's value is 2 which will be TRUE when second file is being read because when read second file's turn will come then FNR==1 conditon will increase variable named C's value to 1 more which will be 2 now.
    A[$2]=A[$2] OFS $1                                       ####### Taking same array named A this time taking index as $2 because $1 of sampleouput1.txt and $2 of sampleoutput2.txt are same. Assigning the value to A{$2] OFS $1 where OFS is Output fielfd seprator.
  }
  c==3 {                                                     ####### When variable named C 's valie is 3 this will be TRUE when 3rd file is being read.
    $1=A[$1]                                                 ####### now making $1's value as A[$1]; because $1 of file named sampleoutput2.txt and $1 of sampleoutput1.txt and $2 of sampleoutput3.txt are same as per input files provided.
    print                                                    ####### Performing print action now, which will print complete line then.
  }
' sampleoutput1.txt sampleoutput3.txt sampleoutput2.txt      ####### mentioning filenames now.

Thanks,
R. Singh

SriniShoo · July 28, 2015, 1:56am

Sorry! That was untested and there was a bug
try below

awk -F ';' '
 FILENAME == ARGV[1] {A[$1] = ($2 ";" $3); next}
 FILENAME == ARGV[2] {A[$2] = A[$2] == "" ? (";;" $1) : (A[$2] ";" $1); next}
 FILENAME == ARGV[3] {A[$1] = A[$1] == "" ? (";;;" $2 ";" $3 ";" $4) : (A[$1] ";" $2 ";" $3 ";" $4)}
 END {for(x in A) print x ";" A[x]}' sampleoutput1.txt sampleoutput3.txt sampleoutput2.txt > sampleoutput1-2-3.txt

Scrutinizer · July 28, 2015, 2:19am

ravindersingh13:

Hello Rahul,

Following may help you in same.

[..]
   FS=" *; *"                                               ####### mentioning field seprator here as "*; *"  means it will take look for all text from starting to till semi colon as mentioned in your input file
filenames now. [..]

Thanks,
R. Singh

Thanks Ravinder. Just one correction:

    FS=" *; *"                                               ####### using the field separator (FS) here as " *; *"  means it will consider spaces around the semicolons also part of the field separator (so that these spaces do not becomes part of the fields)

rahul2662 · July 28, 2015, 9:19am

Thanks a lot Ravinder Sir , Scrutinizer and SriniShoo.