I have 2 input files (tab separated):
file1:
make_A 1990 foo bar
make_B 2010 this that
make_C 2004 these those
file2:
make_X 1970 1995 ref_1:43 ref_2:65
make_A 1970 1995 ref_1:4 ref_2:21 ref_3:18
make_A 1980 2002 ref_1:7 ref_2:7 ref_3:0 ref_4:9
make_B 2007 2009 ref_1:98
make_C 2000 2004 ref_1:34 ref_2:4 ref_3:0
I am trying to append records of file 2 to file 1 if:
1) $1 of file 1 and $1 of file 2 are the same
AND
2) $2 of file 2 $2 of file 1 $3 of file 2
AND
3) file 2 contains the value '0' for ref_3 (i.e. 'ref_3:0')
then to count the number of records in file 2 that matched these criteria.
in order to get:
make_A 1990 foo bar make_A 1980 2002 ref_1:7 ref_2:7 ref_3:0 ref_4:9
make_C 2004 these those make_C 2000 2004 ref_1:34 ref_2:4 ref_3:0
Count
make_X 0
make_A 1
make_B 0
make_C 1
I tried the following, but it returns a blank output and I don't really understand why:
gawk '
BEGIN{FS=OFS="\t"}
NR==FNR{
brand[$1]=$2;
line[$1]=$0;
next
}
{
match($0, /ref_3\:[0-9]+/)
ref_n=split((substr($0,RSTART,RLENGTH)),b,":")
if($1 in brand){
if($2<=brand[$1] && brand[$1]<=$3 && b[ref_n]==0){
ref++
print line[$1] FS $0
}
}
}
END{print "\nCount"; for(i in ref){print ref}}' file1.txt file2.txt
You aren't far off. Using your indentation style and making a few minor changes:
gawk '
BEGIN{FS=OFS=" "}
NR==FNR{
brand[$1]=$2;
line[$1]=$0;
next
}
{
match($0, /ref_3\:[0-9]+/)
ref_n=split((substr($0,RSTART,RLENGTH)),b,":")
ref[$1]
if($1 in brand){
if($2<=brand[$1] && brand[$1]<=$3 && b[ref_n]==0){
ref[$1]++
print line[$1] FS $0
}
}
}
END{print "\nCount"; for(i in ref){print i,ref+0}}' file1.txt file2.txt
seems to do what you want. The problems in your code were:
- The biggest problem is that (even though you said your input files had tab separated fields), there are no tabs in either of your input files. The fields in your input files and in the output you said you wanted are separated by three space characters,
- brands that did not have any matched lines were not added to the
ref[]
array,
- the reference counts array (
ref[]
) was treated as a scalar when you incremented its value, and
- when you printed the counts, you only printed the count, not the array index and the count.
Changes to fix those minor issues are marked in red in the code above.
Note that your specification wasn't clear as to whether there should only be one output line for each brand if there are multiple input lines meeting your constraints or one output line for each input line meeting your constraints. The code above produces one output line for each input line in file2.txt
that meets the constraints.
Note also that the order of the counts at the end of the output is in random order. Additional changes would be required if you need to have the output order of those line match the order in which each brand was first found in file2.txt
(as it was in your sample output specification).
You might also want to compare the above with the following:
gawk '
BEGIN { FS = OFS = " "
}
NR == FNR {
brand[$1] = $2
line[$1] = $0
next
}
{ ref[$1]
if($1 in brand && $2 <= brand[$1] && brand[$1] <= $3 &&
$0 ~ / ref_3:0( |$)/) {
ref[$1]++
print line[$1] FS $0
}
}
END { print "\nCount"
for(i in ref)
print i, ref+0
}' file1.txt file2.txt
which produces the same output using a single if
statement instead of a call to match()
, a call to substr()
, a call to split()
and two if
statements.
1 Like
Thanks for your clear explanation Don Cragun ! I understand now.