awk does not find ids with semi-colon in the name

cmccabe · April 23, 2016, 9:25am

I am using awk to search $5 of the "input" file, using the "list" file as the search criteria. If the id on a line of "list" is found in "input", it is counted in the ids found. If the id is not found in "input", it is printed as missing. The awk below runs and works for most ids, but the ids that share $5 with other ids (separated by a ;) are reported as missing even though they can be found manually in the file. I think the ; needs to be handled somewhere, but I am not sure where to add this. Thank you :).

input

chrX    48933012    48933134    chrX:48933012-48933134    PRAF2;WDR45
chrX    48934078    48934193    chrX:48934078-48934193    PRAF2;WDR45
chrX    48934293    48934422    chrX:48934293-48934422    PRAF2;WDR45
chr17    42426522    42426680    chr17:42426522-42426680    GRN;L01117
chr17    42426783    42426929    chr17:42426783-42426929    GRN;L01117
chr17    30814628    30815572    chr17:30814628-30815572    AK307275;CDK5R1
chr2    234668923    234669807    chr2:234668923-234669807    UGT1A1;UGT1A10;UGT1A3;UGT1A4;UGT1A5;UGT1A6;UGT1A7;UGT1A8;UGT1A9
chr2    234675669    234675821    chr2:234675669-234675821    UGT1A1;UGT1A10;UGT1A3;UGT1A4;UGT1A5;UGT1A6;UGT1A7;UGT1A8;UGT1A9
chr12    9221325    9221448    chr12:9221325-9221448    A2M
chr12    9222330    9222419    chr12:9222330-9222419    A2M

list

PRAF
GRN
CDK5R1
UGT1A1
A2M

current output

1 ids found
CDK5R1 is missing
PRAF is missing
GRN is missing
UGT1A1 is missing

desired output

5 ids found

awk '
    NR==FNR { lookup[$0]++; next }
    ($5 in lookup) { seen[$5]++ } 
    END {
      print length(seen)" ids found"; 
      for (id in seen) delete lookup[id]; 
      for (id in lookup) print id " is missing"
}' list input > count

awk with error

awk '
>     NR==FNR { lookup[$0]+|;++; next }
>     ($5 in lookup) { seen[$5]++ } 
>     END {
>       print length(seen)" ids found"; 
>       for (id in seen) delete lookup[id]; 
>       for (id in lookup) print id " is missing"
> }' list2 input > count
awk: cmd. line:2:     NR==FNR { lookup[$0]+|;++; next }
awk: cmd. line:2:                          ^ syntax error
awk: cmd. line:2:     NR==FNR { lookup[$0]+|;++; next }
awk: cmd. line:2:                              ^ syntax error

Scrutinizer · April 23, 2016, 9:49am

Hi, try this modification to your code:

awk '
    NR==FNR { 
      lookup[$1]++
      next
    }
    { 
      split($5,F,/;/)
      for(i in F)
        if (F[i] in lookup)
          seen[F[i]]++
    } 
    END {
      print length(seen)" ids found"; 
      for (id in lookup) 
        if (!(id in seen)) 
          print id " is missing"
    }
' list input > count

4 ids found
PRAF is missing
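
For illustration only (a minimal sketch, not part of the original reply): the `in` operator tests for an exact array key, so a whole field such as GRN;L01117 never matches the key GRN. Splitting the field on ; and testing each piece is what makes the match work. The function name explain_match and the hard-coded values are assumptions chosen for this demo.

```shell
# Hypothetical helper for this demo; values are taken from the sample files.
explain_match() {
  awk 'BEGIN {
    lookup["GRN"]                      # one id, as read from the "list" file
    field = "GRN;L01117"               # a $5 value from the "input" file
    # The whole field is not a key in lookup[], so this test fails.
    print "whole field matches: " ((field in lookup) ? "yes" : "no")
    # split() returns the number of pieces and fills F[1]..F[n].
    n = split(field, F, /;/)
    for (i = 1; i <= n; i++)
      print F[i] " matches: " ((F[i] in lookup) ? "yes" : "no")
  }'
}

explain_match
```

The same exact-match rule is why PRAF is still reported as missing: the sample input contains PRAF2, not PRAF.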

--

Note: length(array) is a non-standard extension, so not every awk will support it
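
If length(array) is not available, one portable alternative is to count the elements with a loop in the END block. This is a minimal sketch; the function name count_ids and the sample ids piped into it are assumptions for illustration only.

```shell
# Hypothetical helper: counts distinct values of $1 without length(array).
count_ids() {
  awk '
    { seen[$1] }                 # referencing the element creates the key
    END {
      n = 0
      for (id in seen) n++       # one increment per distinct key
      print n " ids found"
    }'
}

printf 'A2M\nGRN\nA2M\n' | count_ids
```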

cmccabe · April 23, 2016, 10:05am

Thank you very much for your help, I really appreciate it :).

Don_Cragun · April 23, 2016, 5:32pm

There really isn't any need to count the number of times you have seen an ID in the lookup[] and seen[] arrays. Assuming that your sample input data isn't really representative of the sizes of your real input files, the following suggestion might be faster or slower than Scrutinizer's suggestion since it handles the list of IDs in lookup[] (from the 1st input file) and the list of IDs in seen[] (from the 2nd input file) differently:

Scrutinizer's code checks lookup[] and adds entries to seen[] as it processes each input line. So seen[] will only contain elements that are also in lookup[].
The following code doesn't look at lookup[] while it's reading the 2nd input file. It adds elements to seen[] for each ID found in field 5 in lines in the 2nd input file. It then makes a single walk through lookup[] at the end, removing entries for IDs that are also found in seen[]. (Note that this might be a little more portable to other versions of awk because it doesn't depend on being able to use length(array), which is an extension not required by the standards.)

You might want to compare the time taken by our two approaches with some of your real data.

awk '
FNR == NR {
	lookup[$1]
	next
}
{	for(i = split($5, F, /;/); i; i--)
		seen[F[i]]
}
END {	for(id in lookup)
		if(id in seen) {
			found++
			delete lookup[id]
		}
	print found, "of", NR - FNR, "ids found"
	for(id in lookup)
		print id, "is missing"
}' list input

which, with your sample input files produces the output:

4 of 5 ids found
PRAF is missing

If you don't want the additional "of 5" shown in the output above, change the first print statement in the above script to: print found, "ids found"