Simple join across multiple files to produce 3 outputs

stateperl · August 28, 2010, 3:59am

sh script file1 filea fileb filec ................filez. >>output1 & output2 &output3

file1

z10     1873    1920    z_number1_E59
z10     2042    2090    z_number2_E59
Z22     2476    2560    z_number3_E59
Z22     2838    2915    z_number4_E59
z1      1873    1920    z_number1_E60
z1      2042    2090    z_number2_E60
z1      2032    2041    z_number2_E20

filea

z10     1873    1920    z_number1_E59
z10     2042    2090    z_number2_E59
Z22     2476    2560    z_number3_E59
Z22     2838    2915    z_number4_E59
z1      1863    1872    z_number1_E60
z1      2032    2041    z_number2_E60
z1      2032    2041    z_number2_E10

fileb

z10     1863    1872    z_number1_E59
z10     2032    2041    z_number2_E59
Z22     2476    2560    z_number3_E59
Z22     2838    2915    z_number4_E59
z1      1873    1920    z_number1_E60
z1      2042    2090    z_number2_E60
z1      2032    2041    z_number2_E10

filec

z10     1873    1920    z_number1_E59
z10     2042    2090    z_number2_E59
Z22     2476    2560    z_number3_E59
Z22     2838    2915    z_number4_E59
z1      1863    1872    z_number1_E60
z1      2032    2041    z_number2_E60
z1      2032    2041    z_number2_E10

output1

Z22     2476    2560    z_number3_E59
Z22     2838    2915    z_number4_E59

output2

z1      2032    2041    z_number2_E10

output3

z1      2032    2041    z_number2_E20

Franklin52 · August 28, 2010, 7:21am

Try this; the variable nfiles holds the number of input files:

awk -v nfiles="4" 'NR==FNR{a[$0]++;next}
$0 in a {a[$0]++; next}
{b[$0]++}
END{
  for(i in a){
    if(a[i]==nfiles) {
      print i > "output1"
    }
    else if(a[i]==1) {
        print i > "output3"
    }
  }
  for(i in b){
    if(b[i]==nfiles-1) {
        print i > "output2"
    }
  }
}' file1 filea fileb filec

stateperl · August 29, 2010, 12:55am

Thank you, it works well. But it throws an error when I run it on 17 files:

Franklin52 · August 29, 2010, 7:56am

stateperl:

Thank you, it works well. But it throws an error when I run it on 17 files:
awk: cmd. line:1: (FILENAME=/file14.bed FNR=250923) fatal: cannot open file `/filej0' for reading (No such file or directory) 

Does the file /filej0 exist in your directory?

gvj · August 29, 2010, 11:25am

Could you please explain your code? I'm confused: how are you finding the lines common to all files?

Franklin52 · August 29, 2010, 3:34pm

Explanation:

$0 in a {a[$0]++; next}		# For a line already seen in file1, count one more occurrence in array a

    if(a[i]==nfiles) {		# If the count for this line == 4 (the number of files)
      print i > "output1"	# the line exists in all 4 files
    }

agama · August 29, 2010, 3:38pm

I've taken the liberty of adding some explanation to Franklin52's code:

awk -v nfiles="4" '
NR==FNR{a[$0]++;next}           # NR is the current record number counting from start of programme
                                # FNR is the current record number counting from start of current file
                                # when FNR == NR it implies the record comes from file 1
                                # thus this statement captures all records from file 1 in the hash named 'a'
                                # the index is the whole input record with the value being a count of files.
                                # next causes the next record to be read and the programme to loop to top

$0 in a {a[$0]++; next}         # when this statement is reached, the record ($0) does not come from the first file
                                # if the current record was seen in the first file increment the counter
                                # maintained in a.  Next causes the next record to be read.

{b[$0]++}                       # this statement is executed when a record from file2...n is encountered
                                # and the record was not seen in file1. A second hash is used to
                                # track all records that weren't in file 1

END{                            # this section of code is driven after the last record is read from file n
   for(i in a){                 # for every record seen in file 1...
      if(a[i]==nfiles) {        # if the record was seen in all files (count in a matches number of files)
         print i > "output1"    # print to the list for seen in all files
      }
      else if(a[i]==1) {        # if the record was only seen in the first file print to output list
         print i > "output3"     # of records just in file 1
      }
  }

  for(i in b){                  # for every record that wasn't seen in the  first file, but was seen
     if(b[i]==nfiles-1) {       # in all other files, print to that list
        print i > "output2"
     }
 }
}' file1 filea fileb filec

The only potential problem with this code is that it will yield a false positive in the case where filex has a duplicate line that is in file1 and is missing from exactly one other input file. Related combinations of duplicates and 'holes' will also fall into this. If this is a concern, an easy solution would be to 'sort -u' each of the input files to remove all duplicate records.
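As an alternative to pre-sorting, the duplicate problem described above can be avoided inside awk by counting each line at most once per file. The following is a minimal sketch, not part of the original answers. The function name classify_lines and the arrays last, count and in_first are introduced here for illustration only. It assumes every input file is passed under a distinct name, and it derives the number of files from ARGC instead of taking it from -v nfiles.

```shell
# Hypothetical helper (name chosen for illustration): classify whole lines by
# how many of the given files contain them, ignoring repeats within one file.
classify_lines() {
  awk '
    # Files are read one after another, so remembering the last file that
    # contained a line is enough to count each line once per file.
    last[$0] != FILENAME {
      last[$0] = FILENAME
      count[$0]++
      if (FILENAME == ARGV[1]) in_first[$0] = 1   # line occurs in the first file
    }
    END {
      nfiles = ARGC - 1                           # assumes all operands are file names
      for (line in count) {
        if (count[line] == nfiles) {
          print line > "output1"                  # present in every file
        } else if ((line in in_first) && count[line] == 1) {
          print line > "output3"                  # present only in the first file
        } else if (!(line in in_first) && count[line] == nfiles - 1) {
          print line > "output2"                  # present in every file except the first
        }
      }
    }
  ' "$@"
}
```

It would be called as classify_lines file1 filea fileb filec. As with the original script, an output file is only created if at least one line is written to it.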

stateperl · August 29, 2010, 10:12pm

Yes, I have all the files in the directory.
After checking manually, I found that the script throws the error when there are more than 9 files.
It starts failing at 10 or more files.

summer_cherry · August 29, 2010, 11:43pm

Files a, b, c and d correspond to your files in the order they appear in the thread (file1, filea, fileb, filec).

output1:

comm -12 <(comm -12 <(sort a) <(sort b)) <(comm -12 <(sort c) <(sort d))

output2:

comm -13 <(sort a) <(comm -12 <(sort b) <(comm -12 <(sort c) <(sort d)))

output3:

comm -23 <(sort a) <(cat b c d | sort)
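The nested comm calls above are written for exactly four files. A minimal sketch of the same idea for any number of files follows. It folds the intersection one file at a time and uses temporary files instead of process substitution, so it should also run under plain sh. The function names common_to_all and three_outputs are hypothetical, and the sketch assumes that sort and comm run under the same locale.

```shell
# Hypothetical helper: print the lines that occur in every file named as an argument.
common_to_all() {
  acc=$(mktemp) || return 1
  next=$(mktemp) || return 1
  sort -u "$1" > "$acc"                 # start with the first file
  shift
  for f in "$@"; do
    # keep only lines that are also in the next file ("-" means stdin)
    sort -u "$f" | comm -12 "$acc" - > "$next"
    mv "$next" "$acc"
  done
  cat "$acc"
  rm -f "$acc" "$next"
}

# Hypothetical driver: the first argument plays the role of file1.
three_outputs() {
  first=$1
  shift
  rest=$(mktemp) || return 1
  common_to_all "$first" "$@" > output1               # in all files
  common_to_all "$@" > "$rest"
  sort -u "$first" | comm -13 - "$rest" > output2     # in all files except the first
  sort -u "$@" > "$rest"
  sort -u "$first" | comm -23 - "$rest" > output3     # only in the first file
  rm -f "$rest"
}
```

It would be called as three_outputs file1 filea fileb filec, with as many further files as needed. Unlike the awk version, all three output files are always created, even when they end up empty.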

stateperl · August 30, 2010, 4:10pm

But this also doesn't work for multiple files. Is there any way to improve Franklin's script to handle more than 9 files?
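The thread does not show how the 17 files were passed to awk, so the following is only a guess at the cause. If the awk command sits in a wrapper script that refers to its arguments as $1, $2, ... $10, the shell reads "$10" as "${1}" followed by a literal 0, which would produce a non-existent name similar to the one in the error message. A sketch under that assumption: pass all arguments with "$@" and take the file count from $#. The function names show_tenth and join_files are illustrative only; the awk body is Franklin52's script with the array subscripts written out.

```shell
# Demonstrates the pitfall: without braces, only the first digit is the parameter number.
show_tenth() {
  echo "$10"      # first argument followed by a literal 0
  echo "${10}"    # the tenth argument
}

# Hypothetical wrapper: works for any number of files because it never
# names an individual positional parameter.
join_files() {
  awk -v nfiles="$#" '
  NR==FNR { a[$0]++; next }          # lines of the first file
  $0 in a { a[$0]++; next }          # later occurrences of those lines
  { b[$0]++ }                        # lines that are not in the first file
  END {
    for (i in a) {
      if (a[i] == nfiles) {
        print i > "output1"          # in all files
      } else if (a[i] == 1) {
        print i > "output3"          # only in the first file
      }
    }
    for (i in b) {
      if (b[i] == nfiles - 1) {
        print i > "output2"          # in all files except the first
      }
    }
  }' "$@"
}
```

With this, join_files file1 filea fileb ... can be given 17 files in the same way as 4. The caveat about duplicate lines within a file still applies.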