How to match fields from different files in Perl

Howdy!

I have multiple files with tab-separated data:

File1_filtered.txt

gnl|Amel_4.0|Group3.29	1	G	R	42	42	60	15	,.AAA.aa,aa.A..	hh00/f//hD/h/hh
gnl|Amel_4.0|Group3.29	2	C	Y	36	36	60	5	T.,T,	LggJh
gnl|Amel_4.0|Group3.29	3	A	R	27	27	60	9	Gg,,.gg.,	B6hcc22_c
File2_filtered.txt

gnl|Amel_4.0|Group3.29	1	C	K	12	56	60	3	TGT	L6L
gnl|Amel_4.0|Group3.29	2	C	Y	63	63	60	5	,$,$tt,	EEZZe
File3_filtered.txt

gnl|Amel_4.0|Group3.29	2	C	Y	36	36	60	5	T.,T,	LggJh
gnl|Amel_4.0|Group3.29	4	A	R	27	27	60	9	Gg,,.gg.,	B6hcc22_c

I created a master list containing all the different rows, based on the first two columns (without duplicates):

masterList.txt

gnl|Amel_4.0|Group3.29	1
gnl|Amel_4.0|Group3.29	2
gnl|Amel_4.0|Group3.29	3
gnl|Amel_4.0|Group3.29	4	

I need to go through each file once, extract the data in column 4, and match it to its corresponding line in the master list based on columns 1 and 2 (they need to match exactly).
If a data file has no entry matching a particular line of the master list, add an asterisk.

Like this:

pos1	pos2	File1	File2	File3
gnl|Amel_4.0|Group3.29	1	R	K	*
gnl|Amel_4.0|Group3.29	2	Y	Y	Y
gnl|Amel_4.0|Group3.29	3	R	*	*
gnl|Amel_4.0|Group3.29	4	*	*	R

In the code I have so far, I load the master list into a hash. Then each data file is loaded into an array of arrays (split by columns).
Everything works except the matching of the hash and the arrays for each file.
As usual, many thanks in advance for any help you may provide.

Cheers!

#!/usr/bin/perl 

use strict;
use warnings;


##dump the results in this file
my $outfile =  ">> matrix.txt";
open (MATRIX,$outfile);

#open the master list
open(MASTER,"folder/MasterList.txt") || die "open MASTER failed";

#load MASTER list into hash of arrays
my %m_hash=();
while(<MASTER>){  
	chomp;
	my @fieldsM = split (/\s|\t/, $_);
	my $scaff = $fieldsM[0];
	my $pos = $fieldsM[1];
	my $key = $scaff.",".$pos;
	my $value= $fieldsM[2];
	$m_hash{$key} = $value;
	#print "$key\t$value\n";
}
close MASTER;

#Load files into an array
my @itemsToUse;
my $directory= "folder";
opendir (DIR, $directory) or die "cant OPEN directory with files!\n";
my @allitems = readdir(DIR);

foreach my $fs (@allitems) {
	if ($fs =~ /filtered.txt/) {
		my $files = $fs;
		push (@itemsToUse, $files);
	}
}

#open the data files
foreach my $fs (@itemsToUse){
	while(<>){ # sequentially read files and do the comparison on the fly
		chomp;
		my @fieldsSNP=split/\s|\t/;    #  split by space or tab
		#print "$fields[1]\n";
		foreach my $i ( 0 .. $#{ $m_hash{$fieldsSNP[0]} } ) { 
			if (($fieldsSNP[0] == $m_hash{$fieldsSNP[0]}) && ($fieldsSNP[1] == $m_hash{$fieldsSNP[1]})){
				print MATRIX "$m_hash{$fieldsSNP[0]}[$i][0] $m_hash{$fieldsSNP[0]}[$i][1]  $fieldsSNP[4]\n";
			}
		}#close if
	}#close foreach
}#close foreachs
close MASTER;
close MATRIX;
exit 0;

For starters:
You're splitting each line of the master list on whitespace and then assigning
$fieldsM[2], which is undefined because your masterList.txt has only 2 columns. Here:

my $value= $fieldsM[2];
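One way around this, sketched below under the assumption that the master list really has just the two columns shown: use the scaffold/position pair itself as the hash key and test for it with exists, so no third column is needed. The sub name load_master_keys and the in-memory sample lines are made up for illustration; they are not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: build a lookup of "scaffold<TAB>position" keys
# from master-list lines that have only two columns.
sub load_master_keys {
    my @lines = @_;
    my %seen;
    for my $line (@lines) {
        chomp $line;
        my ($scaff, $pos) = split /\t/, $line;
        next unless defined $pos;      # skip blank or malformed lines
        $seen{"$scaff\t$pos"} = 1;     # the key itself carries the data
    }
    return \%seen;
}

my $master = load_master_keys(
    "gnl|Amel_4.0|Group3.29\t1\n",
    "gnl|Amel_4.0|Group3.29\t2\n",
);

# Keys are strings, so membership is tested by hash lookup
# rather than with the numeric == operator.
for my $pos (2, 9) {
    my $key = "gnl|Amel_4.0|Group3.29\t$pos";
    printf "%s: %s\n", $pos,
        (exists $master->{$key}) ? "in master" : "not in master";
}
```

Joining the two columns with a tab keeps the key unambiguous, since a tab cannot occur inside a tab-separated field.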

---------- Post updated at 02:44 PM ---------- Previous update was at 01:39 PM ----------

Please try this out:

#!/usr/bin/awk -f

NR==FNR{
    out[$1,$2]=pat[$1,$2]=$1" "$2;  #remember the pattern to match against
    oldind=ARGIND+1; #init helper variables
    colInd=2;
    next;
} 
 {  #for each record in _filtered.txt files
  if(ARGIND!=oldind) #new file taken; fill in '*'s for the previous file first
  {
      colInd++;
      for(i in out) {
        if(split(out[i],a," ") < colInd) { #missing value, append '*'
          out[i]=out[i]" *"
        }
      }
      oldind=ARGIND
  }
  for(i in pat) { #loop through stored patterns
      if($1" "$2==pat[i]) { 
        out[i]=out[i]" "$4;  #match; append 4th column
      }
  } 
 }
 END{   #do the same thing one more time to fill asterisks for last input file
      colInd++;  
      for(i in out) {
        if(split(out[i],a," ") < colInd) {
          out[i]=out[i]" *"
        }
      }

     for(i in out) { #print it all (for-in order is unspecified; sort if needed)
       print out[i]
     }
 }

and invoke it like:

./run.awk folder/masterList.txt *_filtered.txt

This assumes your awk is GNU awk (ARGIND variable); if not, then store the filename (variable FILENAME) and watch when that changes instead.

Thanks for the AWK solution. Will test it.

Yes, I mistakenly deleted a third column from the master file, but the code is still not working properly.

Cheers
Santiago

---------- Post updated at 08:01 AM ---------- Previous update was at 07:58 AM ----------

I would like to find a solution for this problem using Perl. Any takers?
Thanks!!
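
A minimal Perl sketch of the matching described in this thread, not a tested drop-in replacement. It assumes the files are strictly tab-separated, that each data file fits in memory as a hash, and that the master list has the scaffold and position in its first two columns. The sub names read_calls and build_matrix and the header labels are made up for illustration. Each data file is read once into its own hash keyed on columns 1 and 2; the master list is then walked in order and every file's hash is looked up, with '*' substituted where a key is missing.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read one data file and return { "scaffold\tposition" => column-4 value }.
sub read_calls {
    my ($path) = @_;
    my %call;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my @f = split /\t/, $line;
        next if @f < 4;                   # skip short or blank lines
        $call{"$f[0]\t$f[1]"} = $f[3];    # column 4 is index 3
    }
    close $fh;
    return \%call;
}

# Return the matrix as a list of tab-joined lines, in master-list order.
sub build_matrix {
    my ($master_path, @data_paths) = @_;
    my @calls = map { read_calls($_) } @data_paths;

    # Header: two key columns, then one column per data file.
    my @rows = (join "\t", 'scaffold', 'position', @data_paths);

    open my $mh, '<', $master_path or die "Cannot open $master_path: $!";
    while (my $line = <$mh>) {
        chomp $line;
        my ($scaff, $pos) = split /\t/, $line;
        next unless defined $pos;         # skip blank or malformed lines
        my $key = "$scaff\t$pos";
        # '//' (defined-or) needs Perl 5.10 or later; it supplies '*'
        # when a file has no entry for this key.
        push @rows, join "\t", $key, map { $_->{$key} // '*' } @calls;
    }
    close $mh;
    return @rows;
}

# Run only when a master list and at least one data file are given.
if (@ARGV >= 2) {
    print "$_\n" for build_matrix(@ARGV);
}
```

Saved under a name of your choosing, say matrix.pl, it could be invoked as: perl matrix.pl folder/masterList.txt folder/*_filtered.txt > matrix.txt. For the three sample files posted above it should give R, *, * for position 3 and *, *, R for position 4, since only File1 has position 3 and only File3 has position 4.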