Fgrep or grep or awk help - scanning for delimiters.

Tbox · February 18, 2010, 5:45am

Hi,

I'm struggling a little here, so I figured it's time to ask for help.

I have a file with a list of several hundred IDs (the hit file- "hitfile.txt"), which is newline delimited, and a much bigger (~500Mb) text file, "FASTA.txt" with several thousand entries, delimited by ">". It's the FASTA format, for those interested.

On the same line as the >, several different IDs are contained, delimited by "/". One of them is an internal ID ("internalID" which is not much use) and the other an external ID ("externalID" which is much more useful). The file therefore looks like this:

>internalID1 / externalID1

GATTACA

>internalID2 / externalID2

GATTACA

I have been able to extract the Identifier containing lines and also extract the more useful external ID.

I used:

fgrep -f hitfile.txt FASTA.txt > outfile.txt

With a hitfile of:

internalID1
internalID2

This outputs the lines as:

>internalID1 / externalID1
>internalID2 / externalID2

From which it is trivial to further extract the externalIDs.

Now, I would like to not only pull out single lines, but pull out all lines from the ID (which is always the first item after the >) until the next >, which is the next entry. This will mean I have a file not only of the IDs but also the sequences therein. So with a hitfile of:

internalID1

The output is:

>internalID1 / externalID1

GATTACA

This is where my complete n00bism and lack of bash-fu get me stuck. I have tried a couple of promising looking awk scripts, to no avail...

Any help in this matter will be much, much appreciated.

radoulov · February 18, 2010, 6:12am

awk -F'[>/ ]' 'END { 
  if (_2 in _) print r
 }
NR == FNR { _[$0]; next }
/^>/ { 
  if (_2 in _) print r 
  r = x; _2 = $2 
  }
{ r = r ? r RS $0 : $0 }  
' hitfile.txt FASTA.txt

Use gawk, nawk or /usr/xpg4/bin/awk on Solaris.

drl · February 18, 2010, 6:20am

Hi.

The AT&T cgrep command was designed to extract sections of text within a window:

#!/usr/bin/env bash

# @(#) s1	Demonstrate extraction of context window, cgrep.
# http://www.bell-labs.com/project/wwexptools/cgrep/

echo
set +o nounset
LC_ALL=C ; LANG=C ; export LC_ALL LANG
echo "Environment: LC_ALL = $LC_ALL, LANG = $LANG"
echo "(Versions displayed with local utility \"version\")"
version >/dev/null 2>&1 && version "=o" $(_eat $0 $1) cgrep
set -o nounset

FILE1=data1
FILE2=data2

echo
echo " Data file $FILE1:"
cat $FILE1

echo
echo " Data file $FILE2:"
cat $FILE2

echo
echo " Results:"
cgrep -D +I2 +w '>' -f $FILE2 $FILE1

exit 0

producing:

% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0 
GNU bash 3.2.39
cgrep - (local: ~/executable/cgrep May 29 2009 )

 Data file data1:
>internalID1 / externalID1

GATTACA

>internalID2 / externalID2

GATTACA

 Data file data2:
internalID1

 Results:
>internalID1 / externalID1

GATTACA

See the URL noted in the script. You would need to download and compile the code, but I have done that in 32-bit and 64-bit environments without trouble.

If you are not comfortable with that, then someone may stop by shortly to offer an awk or perl code.

Best wishes ... cheers, drl

Tbox · February 18, 2010, 7:10am

Both wonderful replies. Both work very well. Thank you so much!

ahmad.diab · February 18, 2010, 7:49am

use below:- (use gawk or nawk or /usr/xpg4/bin/awk)

nawk 'NR==FNR {a[$1] ; next} $1 in a{print RS$0}' hitfile.txt RS="\>" FASTA.txt

;);)

space_agent · March 6, 2010, 7:06am

@ahmad.diab wow really great work, a true piece of art. I wonder if there is a similar short and concise way to do that in perl ?

radoulov · March 6, 2010, 9:36am

Note that some awk implementations have problems with long multiline records (high number of bytes in the < blocks, in this case).

With Perl, I'm not able to come up with more concise solution than this one:

perl -ne'
  @_{/([^\n]+)/} = 1 and next if @ARGV;
  if (/^>/ or eof) {
    @_{$x =~ /^>([^ \/]+)/} and print $x;
    $x = "";
    } 
  $x .= $_' hitfile.txt FASTA.txt

You can modify the input record separator $/ with Perl too, but, as far as I know, not the way awk allows you to.

durden_tyler · March 6, 2010, 11:28am

And yet another:

perl -lne 'if ($ARGV eq "hitfile.txt"){chomp; push @x,$_}
           else {local $/=undef; @c=split/>/;
                 foreach $i(@c){$y = $1 if />(.*?) \//; print $i=~/./?">":"",$i if "@x" =~ m/$y/}
           }' hitfile.txt FASTA.txt

Test run:

$ 
$ cat hitfile.txt
internalID1
internalID2
$ 
$ cat FASTA.txt
>internalID1 / externalID1

GATTACA1

>internalID2 / externalID2

GATTACA2

>internalID3 / externalID3

GATTACA3
$ 
$ ##
$ perl -lne 'if ($ARGV eq "hitfile.txt"){chomp; push @x,$_}
>            else {local $/=undef; @c=split/>/;
>                  foreach $i(@c){$y = $1 if />(.*?) \//; print $i=~/./?">":"",$i if "@x" =~ m/$y/}
>            }' hitfile.txt FASTA.txt

>internalID1 / externalID1
>GATTACA1

>internalID2 / externalID2
>GATTACA2
$

tyler_durden

ahmad.diab · March 7, 2010, 8:38am

another try:-

perl -ne '
if ($ARGV eq "hitfile.txt") {chomp; $seen{$_}++ ; next } ;if(/^>(.+)\s+\// or eof){ print $s ; $s="" } ;$seen{$1} and $s .= $_ ;
' hitfile.txt FASTA.txt

;);)