Compare 2 columns of 2 different files using a Perl script

vasuki · July 29, 2008, 1:56pm

Hi,

I am new to Perl scripting and am still learning it. I have been asked to write a Perl script that compares one column across two different files. If the column values match, the script should store both lines in two different output files.

These are the files:

file 1:
21767016 226112 char[]
19136520 797355 java.lang.String
17769368 307049 java.lang.Object[]
13981656 582569 java.util.HashMap$Entry
10867240 16650 int[]
9065616 559799 java.lang.String[]
9060192 79626 java.util.HashMap$Entry[]
6969384 23146 byte[]
6857664 285736 java.util.Vector

file 2:
21702192 904258 java.lang.String
20985320 360561 java.lang.Object[]
20524112 209810 char[]
12623280 525970 java.util.HashMap$Entry
10945080 678896 java.lang.String[]
9781432 10871 int[]
8302464 345936 java.util.Vector
8107104 337796 netscape.ldap.util.RDN
7620024 68357 java.util.HashMap$Entry[]
6515152 52272 * ConstMethodKlass

So I have to compare the 3rd column of these two files. For example, java.lang.String is present in both, so I want the script to store the complete line from each file in two different output files. Both input files are big.

Please suggest how this can be done. Someone suggested using a hash table implementation.

thanks,
Vasuki

Annihilannic · July 29, 2008, 11:34pm

Try this perhaps (untested):

awk '
    # load the contents of file1 into a hash indexed by $3
    NR==FNR { file1[$3]=$0; next }
    # check whether $3 in file2 is in the hash; if so, print both lines to files
    $3 in file1 { print file1[$3] >> "file1.both"; print >> "file2.both" }
' file1 file2

era · July 30, 2008, 1:37am

And here is (roughly) the same in Perl:

perl -ane 'BEGIN { open FILE1, ">file1.both"; open FILE2, ">file2.both"; }
  if ($. == ++$n) { $h{$F[2]} = $_; close ARGV if eof; next; }
  if ($h{$F[2]}) { print FILE1 $h{$F[2]}; print FILE2; }' file1 file2

This isn't very idiomatic Perl, but should hopefully be enough to get you started. The trickery with $n and ARGV is to simulate the awk NR==FNR idiom. The eof thing is to reset line numbers in $. when the file changes; see the eof documentation for a brief discussion.

By the way, $3 in 6515152 52272 * ConstMethodKlass is just "*" -- maybe you want to normalize that, rather than change the script.
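
If you do go the normalizing route, a small Perl filter could do it. This is only a sketch under assumptions: the helper name normalize_line is made up for illustration, and squeezing the whitespace out of everything after the second column (so the line above ends in "*ConstMethodKlass") is just one possible convention. It also rewrites the column separators as single spaces.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite a line so that everything after the second
# column becomes a single whitespace-free token.
sub normalize_line {
    my ($line) = @_;
    my ($col1, $col2, $rest) = split ' ', $line, 3;
    # Fewer than three columns: leave the line untouched.
    return $line unless defined $rest && $rest =~ /\S/;
    $rest =~ s/\s+//g;    # "* ConstMethodKlass\n" becomes "*ConstMethodKlass"
    return "$col1 $col2 $rest\n";
}

# Filter mode, only when file names are given on the command line.
if (@ARGV) {
    print normalize_line($_) while <>;
}
```

Saved as, say, normalize.pl (name assumed), it would be run as `perl normalize.pl file2 > file2.normalized` before comparing the files.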

summer_cherry · July 30, 2008, 1:53am

Not sure whether I understand your question correctly.

Anyway, I hope the code below makes some sense.

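use strict;
use warnings;

# Index every line of the first file by its third column.
my %hash;
open(my $fh, '<', 'a') or die "Cannot open a: $!";
while (<$fh>) {
	my @arr = split(" ", $_);
	$hash{$arr[2]} = $_ if defined $arr[2];
}
close($fh);

# For each line of the second file whose third column was seen in the
# first file, print the line from the first file, then the current line.
open($fh, '<', 'b') or die "Cannot open b: $!";
while (<$fh>) {
	my @arr = split(" ", $_);
	if (defined $arr[2] && exists $hash{$arr[2]}) {
		print $hash{$arr[2]};
		print $_;
	}
}
close($fh);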
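
The script above prints both lines to standard output. Since the question asked for the matching lines to go to two separate files, it could be extended roughly like this. Treat it as a sketch: the sub name write_common_lines is made up, and the output names file1.both and file2.both are borrowed from the awk answer. As with the other answers, if a third-column value repeats in the first file, only its last line is kept.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: copy the lines whose third column appears in both
# inputs into two separate output files. Returns the number of matches.
sub write_common_lines {
    my ($in1, $in2, $out1, $out2) = @_;

    # Index the first file by its third whitespace-separated column.
    my %line_for;
    open(my $fh1, '<', $in1) or die "Cannot open $in1: $!";
    while (my $line = <$fh1>) {
        my @fields = split ' ', $line;
        $line_for{$fields[2]} = $line if defined $fields[2];
    }
    close($fh1);

    open(my $fh2,  '<', $in2)  or die "Cannot open $in2: $!";
    open(my $ofh1, '>', $out1) or die "Cannot write $out1: $!";
    open(my $ofh2, '>', $out2) or die "Cannot write $out2: $!";

    my $matches = 0;
    while (my $line = <$fh2>) {
        my @fields = split ' ', $line;
        next unless defined $fields[2] && exists $line_for{$fields[2]};
        print {$ofh1} $line_for{$fields[2]};    # line from the first file
        print {$ofh2} $line;                    # line from the second file
        $matches++;
    }
    close($_) for $fh2, $ofh1, $ofh2;
    return $matches;
}

# Run only when two input files are given on the command line.
if (@ARGV == 2) {
    my $n = write_common_lines(@ARGV, 'file1.both', 'file2.both');
    print "$n matching lines written\n";
}
```

Only the first file is held in memory, so if one of the two files is much smaller, passing it as the first argument keeps the hash small.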