Hi,
I'm struggling a little here, so I figured it's time to ask for help.
I have a file with a list of several hundred IDs (the hit file, "hitfile.txt"), which is newline delimited, and a much bigger (~500 MB) text file, "FASTA.txt", with several thousand entries, each starting with ">". It's the FASTA format, for those interested.
On the same line as the ">" there are several different IDs, delimited by "/". One of them is an internal ID ("internalID", which is not much use) and the other is an external ID ("externalID", which is much more useful). The file therefore looks like this:
>internalID1 / externalID1
GATTACA
>internalID2 / externalID2
GATTACA
I have been able to extract the lines containing the identifiers, and from those the more useful external ID.
I used:
fgrep -f hitfile.txt FASTA.txt > outfile.txt
With a hitfile of:
internalID1
internalID2
This outputs the lines as:
>internalID1 / externalID1
>internalID2 / externalID2
From which it is trivial to further extract the externalIDs.
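For completeness, that last step could look something like the sketch below. It assumes the two IDs are always separated by exactly " / " as in the example above; the demo_ file names are made up for illustration so that nothing real gets overwritten.

```shell
# Demo copy of the fgrep output shown above (hypothetical file name).
printf '%s\n' '>internalID1 / externalID1' '>internalID2 / externalID2' > demo_outfile.txt

# Split each header line on " / " and keep the second field, the external ID.
awk -F' / ' '{ print $2 }' demo_outfile.txt > demo_externalIDs.txt
cat demo_externalIDs.txt
```

If a header carried more than two IDs, the field number would need adjusting.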
Now I would like to pull out not just single lines, but all lines from the ID line (the ID is always the first item after the >) up to, but not including, the next >, which marks the next entry. That way I will have a file containing not only the IDs but also their sequences. So with a hitfile of:
internalID1
The output should be:
>internalID1 / externalID1
GATTACA
This is where my complete n00bism and lack of bash-fu get me stuck. I have tried a couple of promising-looking awk scripts, to no avail...
Any help in this matter will be much, much appreciated.
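One possible approach, as a minimal sketch rather than a tested solution for the real 500 MB file: read the hit file into an awk array first, then walk the FASTA file and switch printing on or off at every ">" line. It assumes the internal ID is the first whitespace-separated item after the ">" and that the hit file is not empty. The demo_ file names are hypothetical stand-ins for hitfile.txt and FASTA.txt.

```shell
# Small demo inputs mirroring the example above (hypothetical file names).
printf '%s\n' '>internalID1 / externalID1' 'GATTACA' \
              '>internalID2 / externalID2' 'GATTACA' > demo_FASTA.txt
printf '%s\n' 'internalID1' > demo_hitfile.txt

# First file (NR == FNR): remember every non-blank ID from the hit file.
# Second file: on each header line, drop the leading ">" from field 1 and
# test it against the wanted IDs; then print every line while keep is set.
awk 'NR == FNR { if (NF) wanted[$1]; next }
     /^>/      { keep = (substr($1, 2) in wanted) }
     keep' demo_hitfile.txt demo_FASTA.txt > demo_hits.fasta
cat demo_hits.fasta
```

Unlike fgrep, which matches substrings, this compares whole IDs, so a hit of internalID1 would not also pull out an entry named internalID10.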