Hello,
I am trying to parse a file in FASTA format and reformat it.
1) Each record starts with ">" followed by words separated by spaces, all on a single line;
2) The sequence follows, possibly spread over multiple rows with spaces inside, until the next ">".
infile.fasta:
>seq01 some description protein
AGCTAC GTACAT
CAGTCGTGT GAT
CGAGC GGG
>seq02 another chloropyll_Rubisco subunit
AGCTAG AGTAG
CGCGCTAGCTAG
CGATGC AA
CGCGGTCGT
>seq03 some other description protein
AGCTAC GTACATG
CAGTCGTGT GATG
CGAGC GGGA
I want to:
1) Keep only the first field (token) of each line that starts with ">" and ignore the rest of that line, i.e. keep the first word after the ">" as the sequence ID;
2) Concatenate the sequence rows into a single string to form the second field.
The final format is a two-column table.
Desired output:
>seq01 AGCTACGTACATCAGTCGTGTGATCGAGCGGG
>seq02 AGCTAGAGTAGCGCGCTAGCTAGCGATGCAACGCGGTCGT
>seq03 AGCTACGTACATGCAGTCGTGTGATGCGAGCGGGA
My code is:
#include <iostream>
#include <fstream>
#include <string>
#include <cstring> //needed for strcpy and strtok
using namespace std;
int main()
{
ifstream inFILE("infile.fasta");
int inGuard = 1; //using a guard variable
while (inFILE.good()) {
string line; //declare string for each line
getline(inFILE, line); //Read the whole line
char *sPtr; //Declare char pointer sPtr for tokens
//Initialize char pointer sArray for conversion of the string to char*
char *sArray = new char[line.length() + 1];
strcpy(sArray, line.c_str());
if (sArray[0] == '>') {
sPtr = strtok(sArray, " "); //Using space as delimiter get the first token.
cout << sPtr << " "; //Print the first token only
continue;
}
else
{
sPtr = strtok(sArray, " "); //Get all the tokens with " " as delimiter.
//For all tokens
while (sPtr != NULL) {
cout << sPtr;
sPtr = strtok(NULL, " ");
}
}
}
cout << endl;
inFILE.close();
return 0;
}
But my output is:
>seq01 AGCTACGTACATCAGTCGTGTGATCGAGCGGG>seq02anotherchloropyll_RubiscosubunitAGCTAGAGTAGCGCGCTAGCTAGCGATGCAACGCGGTCGT>seq03 AGCTACGTACATGCAGTCGTGTGATGCGAGCGGGA
I am stuck and not sure where it goes wrong. Thanks for any help!
---------- Post updated at 06:21 PM ---------- Previous update was at 05:32 PM ----------
Modified the if block, but there is a bug for the first entry, i.e. an extra newline is printed at the beginning!
if (sArray[0] == '>') {
sPtr = strtok(sArray, " "); //Using space as delimiter get the first token.
if (inGuard == 1) {
cout << sPtr << " ";
inGuard++;
}
else
{
cout << endl << sPtr << " "; //Print the first token on a new line
}
continue;
}
output:
>seq01 AGCTACGTACATCAGTCGTGTGATCGAGCGGG
>seq02 AGCTAGAGTAGCGCGCTAGCTAGCGATGCAACGCGGTCGT
>seq03 AGCTACGTACATGCAGTCGTGTGATGCGAGCGGGA
What should I do to fix this? Thanks again!
---------- Post updated 09-12-14 at 03:41 AM ---------- Previous update was 09-11-14 at 06:21 PM ----------
One of the reasons for the old problem is the leading space in some of the entries, like >seq02. But the first newline still bugs me.
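To illustrate the leading-space issue: a header line such as " >seq02 ..." fails the sArray[0] == '>' test, so it falls into the else branch and every word on it is printed glued together. A minimal sketch of one way around it, assuming it is acceptable to ignore leading whitespace before the test. The helper names trimLeft and isHeader are made up for this example and are not part of my program:

```cpp
#include <string>

// Hypothetical helper (not in the original program): drops leading
// spaces, tabs and carriage returns from a line.
std::string trimLeft(const std::string& line) {
    const std::string::size_type start = line.find_first_not_of(" \t\r");
    // npos means the line is empty or contains only whitespace.
    return (start == std::string::npos) ? std::string() : line.substr(start);
}

// Hypothetical helper: true when the line is a FASTA header once
// leading whitespace is ignored, so " >seq02 ..." still counts.
bool isHeader(const std::string& line) {
    const std::string trimmed = trimLeft(line);
    return !trimmed.empty() && trimmed[0] == '>';
}
```

With this, the test in the loop would become isHeader(line) instead of looking at the first character directly.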
---------- Post updated at 11:31 AM ---------- Previous update was at 03:41 AM ----------
Solved the problem with a guard variable. The modified code is highlighted in bold red.
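For anyone who finds this later, here is a rough sketch of the same guard idea written with std::istringstream instead of strtok. It is not the exact code from my program: the function name reformatFasta and the variables firstRecord and anyOutput are made up for this example, and it assumes every header token starts with ">". Because operator>> skips whitespace, a leading space in front of a header no longer matters.

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Sketch only: reformatFasta is a hypothetical name, not from the original code.
// Reads FASTA text from `in` and writes one "<id> <sequence>" row per record.
void reformatFasta(std::istream& in, std::ostream& out) {
    bool firstRecord = true;  // guard: no newline before the very first header
    bool anyOutput = false;   // remembers whether a final newline is needed
    std::string line;
    while (std::getline(in, line)) {      // stops cleanly at end of file
        std::istringstream tokens(line);  // operator>> skips spaces and tabs
        std::string token;
        if (!(tokens >> token)) {
            continue;                     // blank line: nothing to print
        }
        if (token[0] == '>') {
            // Header line: end the previous row, then print only the ID.
            if (!firstRecord) {
                out << '\n';
            }
            firstRecord = false;
            out << token << ' ';
        } else {
            // Sequence line: print every chunk without the spaces.
            do {
                out << token;
            } while (tokens >> token);
        }
        anyOutput = true;
    }
    if (anyOutput) {
        out << '\n';                      // terminate the last row
    }
}
```

From main it can be called with an std::ifstream opened on infile.fasta and std::cout as the output stream. For the sample file above this should give the three-row table shown as the desired output.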
Admin, should this post be deleted since I answered it myself?