C++ getline, parse and take first tokens by condition

yifangt · September 12, 2014, 11:31am

Hello,
Trying to parse a file (in FASTA format) and reformat it.
1) Each record starts with ">" followed by words separated by spaces, all on a single line;
2) The sequence follows, possibly spread over multiple rows and possibly containing spaces, until the next ">".

infile.fasta:
>seq01 some description protein
AGCTAC GTACAT
CAGTCGTGT GAT
CGAGC GGG
>seq02 another chloropyll_Rubisco subunit
AGCTAG AGTAG
CGCGCTAGCTAG
CGATGC AA
CGCGGTCGT
>seq03 some other description protein
AGCTAC GTACATG
CAGTCGTGT GATG
CGAGC GGGA

I want to:
1) Only keep the first field (or token) of the line where ">" is found, ignore the rest of the line; i.e. keep the first word after the ">" as sequence ID;
2) Concatenate the sequences from different rows into a single string to have the second field.
The final format is a two-columns table.

output:
>seq01 AGCTACGTACATCAGTCGTGTGATCGAGCGGG
>seq02 AGCTAGAGTAGCGCGCTAGCTAGCGATGCAACGCGGTCGT
>seq03 AGCTACGTACATGCAGTCGTGTGATGCGAGCGGGA

My code is:

#include <iostream>
#include <fstream>
#include <string>

using namespace std;

int main()
{
    ifstream inFILE("infile.fasta");
    int inGuard = 1;               //using a guard variable
    while (inFILE.good()) {
    string line;        //declare string for each line

    getline(inFILE, line);    //Read the whole line

    char *sPtr;        //Declare char pointer sPtr for tokens
    //Initialize char pointer sArray for conversion of the string to char*
    char *sArray = new char[line.length() + 1];
    strcpy(sArray, line.c_str());

    if (sArray[0] == '>') {
        sPtr = strtok(sArray, " ");    //Using space as delimiter get the first token.
        cout << sPtr << " ";       //Print the first token only
        continue;
    } 
    else 
    {
        sPtr = strtok(sArray, " ");    //Get all the tokens with " " as delimiter.
        //For all tokens
        while (sPtr != NULL) {
        cout << sPtr;
        sPtr = strtok(NULL, " ");
        }
    }
    }

    cout << endl;
    inFILE.close();

    return 0;
}

But my output is:

>seq01 AGCTACGTACATCAGTCGTGTGATCGAGCGGG>seq02anotherchloropyll_RubiscosubunitAGCTAGAGTAGCGCGCTAGCTAGCGATGCAACGCGGTCGT>seq03 AGCTACGTACATGCAGTCGTGTGATGCGAGCGGGA

I am stuck, and not quite clear where it went wrong. Thanks for any help!

---------- Post updated at 06:21 PM ---------- Previous update was at 05:32 PM ----------

Modified the if block, but there is a bug for the first entry, i.e. an extra newline is printed at the beginning!

    if (sArray[0] == '>') {
        sPtr = strtok(sArray, " ");    //Using space as delimiter get the first token.
        if (inGuard == 1) {
            cout << sPtr << " ";
            inGuard++;
        }
        else
        {
            cout << endl << sPtr << " ";      //Print the first token on a new line
        }
        continue;
    }

output:

>seq01 AGCTACGTACATCAGTCGTGTGATCGAGCGGG
>seq02 AGCTAGAGTAGCGCGCTAGCTAGCGATGCAACGCGGTCGT
>seq03 AGCTACGTACATGCAGTCGTGTGATGCGAGCGGGA

What should I do to fix this? Thanks again!

---------- Post updated 09-12-14 at 03:41 AM ---------- Previous update was 09-11-14 at 06:21 PM ----------

One of the reasons for the old problem is the leading space in some of the entries, like >seq02 . But the first newline still bugs me.

---------- Post updated at 11:31 AM ---------- Previous update was at 03:41 AM ----------

Solved the problem with a guard variable; the modified code is the inGuard check in the if block above.
Admin, should this post be deleted as answered by myself?

vbe · September 12, 2014, 11:43am

Thanks for keeping us informed.
This thread will not be removed: someone else may run into the same issue and will find the solution here.

Corona688 · September 12, 2014, 11:52am

Presumably you're using C++ rather than awk for performance reasons?

I've found that C stdio often has higher performance than C++ iostreams, sometimes surprisingly so. Especially when you're doing string-to-array conversion every single loop.

yifangt · September 12, 2014, 12:04pm

I am going back and forth between C and C++ these days, whenever I meet something practical in what I read (too many things to ask here!). I have not forgotten that you helped me with the strtok() function in one of my earlier posts, which was very thorough, but performance is not much of a concern for me at this moment.
Do you mean using awk to do the job? Could you post the script if you have it handy? Thanks a lot!

Corona688 · September 12, 2014, 12:41pm

$ awk '/^>/ { NF=1;$1="\n"$1" " } { $1=$1 } 1 ; END { print "\n" }' ORS="" OFS="" *.fasta

>seq01 AGCTACGTACATCAGTCGTGTGATCGAGCGGG
>seq02 AGCTAGAGTAGCGCGCTAGCTAGCGATGCAACGCGGTCGT
>seq03 AGCTACGTACATGCAGTCGTGTGATGCGAGCGGGA

$

Setting blank output field and record separators causes it to print spaces and newlines only when we ask (the $1=$1 trick strips them from $0).

awk '/^>/ { NF=1;$1="\n"$1" " } # Turf all but field 1, add space and newline if line begins with >
{ $1=$1 } # Get rid of all other spaces between fields
1 # Print all lines
END { print "\n" }' ORS="" OFS="" *.fasta

yifangt · September 12, 2014, 1:01pm

Cool! I did not think of awk to do the job until you mentioned. Thanks a lot!

yifangt · September 18, 2014, 1:01pm

Thought of combining the map<string, string> container with the program.
Store all the combined sequence entries in map< string, string> ; which will be:
1) easier to print and avoid the problem like extra blank line for the first entry;
2) convenient to retrieve part of the sequences by sequence ID (i.e. the key of the map).
Here is my modified code; it compiles, but gives a segmentation fault when run.

#include <iostream> 
#include <fstream> 
#include <string>
#include <map>

using namespace std;  
int main() 
{     
ifstream inFILE("infile.fasta");     
int inGuard = 1;               //using a guard variable
    
    map <string, string>FastaSeq;   //Declare a map to hold each sequence entry

    while (inFILE.good()) {     
    string line;        //declare string for each line      
    string entryID, sequence;    //declare two strings for key and value for map
    getline(inFILE, line);    //Read the whole line      
    char *sPtr;        //Declare char pointer sPtr for tokens     

     //Initialize char pointer sArray for conversion of the string to char*     
     char *sArray = new char[line.length() + 1];     
     strcpy(sArray, line.c_str());

     if (sArray[0] == '>') {         
     sPtr = strtok(sArray, " ");    //Using space as delimiter get the first token.         
     cout << sPtr << " ";       //Print the first token only         
     entryID = sPtr;             //assign the first token as key for the map
     continue;     
}     
 else  {         
     sPtr = strtok(sArray, " ");    //Get all the tokens with " " as delimiter.         
     FastaSeq[entryID] += sPtr;   // assign first part of sequence to map
           
while (sPtr != NULL) {          //For all tokens     
     cout << sPtr;
     FastaSeq[entryID] += sPtr;   // assign more token to sequence
     sPtr = strtok(NULL, " ");         
     }     
   }     
}      
cout << endl;    
 inFILE.close(); 

//print the map    
map <string, string>::const_iterator seq_itr;
if (seq_itr != FastaSeq.end()){
      cout << seq_itr->first << " ";
      cout << seq_itr->second << endl;
}

    return 0; 
}

The part I am not sure about is appending the parsed tokens to the sequence (the value of the map) with FastaSeq[entryID] += sPtr; , which may be the problem in the program. Thanks a lot!

Corona688 · September 18, 2014, 2:13pm

Your code certainly does not compile for me. I'm still fixing compiler errors.

If it is crashing you forgot to include error checking. I'll add that too.

Why bother storing all of them? If your FASTA file happens to be 3 gigabytes that will be quite a lot of memory in your map!

Also, a map is not an array: std::map hands its entries back sorted by key, so you're not guaranteed to get the same order out as you put in.

In the end, I think your first solution was better -- just print what you need to print, when you need to print it, and don't keep anything else.

C version:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define CHUNK 4096
#define TOKEN " \r\n\t" // Includes newlines and \r since fgets includes them

int main(int argc, char *argv[])
{
        int bufsize=CHUNK;
        // Flag variable.  First loop it holds nothing, all other loops,
        // holds "\n" to print a newline before >...
        char *prefix="";
        // Start with a CHUNK-sized buffer that can be enlarged at need
        char *buf=malloc(bufsize);
        FILE *fp;

        if(argc != 2) {
                fprintf(stderr, "No file given\n");
                exit(1);
        }

        fp=fopen(argv[1], "r");
        if(fp == NULL)
        {
                fprintf(stderr, "Can't open %s\n", argv[1]);
                return(1);
        }

        while(fgets(buf, bufsize, fp) != NULL)
        {
                char *tok;
                size_t len=strlen(buf);
                // Increase buffer size if fgets didn't get a complete line
                // complete, as in, ends in '\n'
                while(  buf[len-1] != '\n')
                {
                        buf=realloc(buf, bufsize+=CHUNK); // Make buffer bigger
                        // Read the rest of the line
                        if(fgets(buf+len, CHUNK, fp) == NULL) break;
                        // Count the new length
                        len+=strlen(buf+len);
                }

                tok=strtok(buf, TOKEN);
                if(tok == NULL) // Something strange happened -- no line?
                        continue;

                while(tok != NULL)
                {
                        if(tok[0] == '>')
                        {
                                printf("%s%s ", prefix, tok); // tok is already the header token
                                prefix="\n";
                                break; // Leave the loop to ignore all other tokens on this line
                        }
                        fputs(tok,stdout);
                        tok=strtok(NULL, TOKEN);
                }
        }

        fputc('\n', stdout);
        fclose(fp);
}

Corona688 · September 18, 2014, 3:05pm

Here is your corrected code

#include <iostream>
#include <fstream>
#include <string>
#include <map>

/**
 * You need string.h for strtok and strcpy.  MANDATORY!
 * Not having the right headers can cause a CRASH!
 */
#include <string.h>

using namespace std;

int main()
{
   ifstream inFILE("infile.fasta");
   /* You're not using this */
   //int inGuard = 1;               //using a guard variable

   /**
    * If you put it inside the loop, it goes out of scope every loop.
    * That's good when you want that, and bad when you don't.
    * Since you want the value to stay the same every loop, you don't.
    */
   string entryID;

   map <string, string>FastaSeq;   //Declare a map to hold each sequence entry

   while (inFILE.good()) {
      string line;        //declare string for each line
      /* moved above */
      //string entryID, sequence;    //declare two strings for key and value for map
      char *sPtr;        //Declare char pointer sPtr for tokens
      getline(inFILE, line);    //Read the whole line

      //Initialize char pointer sArray for conversion of the string to char*    
      char *sArray = new char[line.length() + 1];
      strcpy(sArray, line.c_str());

      if (sArray[0] == '>') {
         sPtr = strtok(sArray, " ");    //Using space as delimiter get the first token.
         /**
          * If your program crashes, odds are you won't see anything printed to cout.
          * use cerr for debugging instead, it prints instantly instead of being held for later.
          *
          * Use cerr for errors/debugging, cout for data output.
          */
         //cout << sPtr << " ";       //Print the first token only
         cerr << endl << sPtr << " ";
         entryID = sPtr;             //assign the first token as key for the map
         continue;
     } else  {
         sPtr = strtok(sArray, " ");    //Get all the tokens with " " as delimiter.

         /**
          * Always, always, always check your pointers!
          * Never assume strtok must have worked.
          * This is what broke your last 3 programs.
          */
         //FastaSeq[entryID] += sPtr;   // assign first part of sequence to map

         /**
          * The loop checks for NULL, so inside, sPtr is safe to use.
          */
         while (sPtr != NULL) {          //For all tokens
//            cout << sPtr;
            cerr << sPtr;
            FastaSeq[entryID] += sPtr;   // assign more token to sequence
            sPtr = strtok(NULL, " ");
         }
      }

      delete [] sArray;      /* NOT OPTIONAL! */
   }

   cerr << endl << endl;
   inFILE.close();

   //print the map
   map <string, string>::const_iterator seq_itr;

   /**
    * You made an iterator but didn't point it to anything.
    * This is bad for the same reason an unchecked pointer to
    * nothing is bad.
    *
    * Imagine a loop like for(x=0; x != 10; x++) but it's not an int,
    * instead you use z.begin() and z.end().  ++ still works.
    */
   seq_itr=FastaSeq.begin();

//   if (seq_itr != FastaSeq.end()){
   while(seq_itr != FastaSeq.end()) {
      cout << seq_itr->first << " ";

      /**
       * ???  Not sure what you're trying to do here.
       * You can't print an iterator, just its contents (first, second)
       */
      // cout << seq_itr << seq_itr->second << endl;
      cout << seq_itr->second << endl;

      /**
       * You can call ++ on an iterator, it's effectively i=i.next();
       */
      seq_itr++;
   }

    return 0;
}

yifangt · September 18, 2014, 3:30pm

The reason to save them into a map is for later retrieval.
Say there are millions of entries (that's exactly what any small lab would have!), but only tens or hundreds need to be retrieved.
After some reading, it seems current programs take two steps:
1) index the dataset;
2) retrieve a subsample from the indexed dataset.
It seems to me a hash_map is used. This morning I was reviewing the code we discussed and thought a program could do the job this way:

./prog dataset.file sample.list

where sample.list only has the sequence names, i.e. the keys of the map.
sample.list:

seq01
seq03
seq99 (not in the dataset)

Does this make any sense to you? Or what did I miss?
I tried this with a tab-delimited file, which worked fine, but that is not general. If it is a tab-delimited file, the job can be done with the awk script; even grep can do it easily. However, it does not seem easy with grep for the generic format. Thanks.
---------------
You are so fast! While I was writing your second one popped out. Thanks a lot!

---------- Post updated at 03:30 PM ---------- Previous update was at 03:22 PM ----------

while(seq_itr != FastaSeq.end())
{
   cout << seq_itr->first << " ";
   cout << seq_itr->second << endl;
   /* You can call ++ on an iterator, it's effectively i=i.next(); */
   seq_itr++;
}

I was trying to print each key and value of the map, i.e. the pair of seqID vs. sequence.

Corona688 · September 18, 2014, 3:33pm

I see, I see. Hmmm.

How about, instead of storing the entire file, store the locations you've found things. That's your "index". Then, when asked for that information, seek to that spot in the file and read it.

A map is not a bad data structure for the lookups themselves. std::map is a balanced tree, so with 2 million sequences map["mysequence"] takes on the order of 20 comparisons to tell whether it has it, not a 2-million item loop. If you would rather have a hash, C++11 added std::unordered_map. The cost is in what you store in it, not in finding it.

On the other hand -- if you know what items you want, why not just print them?

yifangt · September 18, 2014, 3:37pm

Understand that:

How about, instead of storing the entire file, store the locations  you've found things. That's your "index". Then, when asked for that  information, seek to that spot in the file and read it.

Isn't that the same to loop/hash the map? And is it do-able?

On the other hand -- if you know what items you want, why not just print them?

Two things there:
1) I do not know whether the entry is in the dataset or not;
2) If it is there, I want the full information for that entry (sequences may be stored in an unknown number of rows!), so that needs a program.

I am aware bioperl/biopython is better for this type of job, but I am learning C++. And C++ is way faster than perl for sure for millions of queries.

Corona688 · September 18, 2014, 4:08pm

Knowing where in 10 gigs of data your information is, and keeping all that 10 gigs of data in memory whether you need it or not, are somewhat different.

OK, now I see the situation.

But I still think you have it backwards. Whenever an idea begins with "store the universe in memory, then use a tiny part of it" my hackles go up. Keep a list of the things you want to find. Scan the file and print only those without storing the universe.

I think I mentioned, long ago, a thread on this forum where the OP was using C++ for text processing. But he kept wanting to do more and more with it -- to the point it had rudimentary expressions. In the end it was still a little faster than awk, but it wasn't that fast.

awk, perl, and python are all written in C or C++. If they're slower than your programs, it's because your program does a whole lot less.

awk honestly sounds great for the job here. If your awk program is short, awk will run fast. It already has a very fast array that's based on a hash or tree.

yifangt · September 18, 2014, 4:40pm

You seem to know every tiny corner of my mind! I am not fluent in any of those programming languages, so my comments on speed do not count at all.
My colleague simply says to me: "You are overthinking it!" or "You are resistant to this approach!" whenever I ask for technical details on things like this.
For this practice, I am struggling to follow the flow of

 sPtr = strtok(NULL, " ")

Regular books seldom address this part in great detail. When I took the CS200 course, the professor always emphasized "C and only C, no OOP allowed!"
Now I realized what he meant, seriously!
The part I am still not sure is:
1) In the line with ">", the first field is stored as one string, except the '>' char which is a separator for each record (like RS in awk).
2) All the rest of the field next to the ">" line are concatenated to have a single string. It is easy for printing, but to track them in memory with

 sPtr = strtok(NULL, " ")

I am not sure at all. Neither am I with this line:

FastaSeq[entryID] += sPtr;   // assign more token to sequence

For example, the entry:

>seq01 some description protein 
AGCTAC GTACAT C
AGTCGTGT GAT 
CGAGC GGG

Only seq01 is picked up for key on the first line, the other part are discarded; from the second row of the entry all is concatenated: AGCTACGTACATCAGTCGTGTGATCGAGCGGG for value of the map (if I insist map be used!)

I seem to understand the syntax, as I can print out the individual field parsed, but do not know how to combine certain fields together if needed. Maybe I should not say I understand the syntax.
How the pointer/reference is manipulated behind is the bottleneck for me to catch the whole point. Can you elaborate that? Thanks!

Corona688 · September 18, 2014, 6:44pm

strtok() is pretty simple once you know what it does, which is why it's so fast.

Which line of what now?

2) All the rest of the field next to the ">" line are concatenated to have a single string. It is easy for printing, but to track them in memory with
 sPtr = strtok(NULL, " ")
I am not sure at all.
FastaSeq[entryID] += sPtr;   // assign more token to sequence
For example, the entry:
>seq01 some description protein 
AGCTAC GTACAT C
AGTCGTGT GAT 
CGAGC GGG
Only seq01 is picked up for key on the first line, the other part are discarded; from the second row of the entry all is concatenated: AGCTACGTACATCAGTCGTGTGATCGAGCGGG for value of the map (if I insist map be used!)

strtok() discards the spaces, replacing them with NUL ('\0') terminators. Instead of printing a space, the string ends early. This even lets it break the string into a bunch of separate mini-strings without copying it anywhere or using any more memory.

Let me illustrate it. What does this code print?

char str[]="abc def ghi jkl";

str[3]='\0';
cout << str << endl;

str[7]='\0';
cout << str+4 << endl;

str[11]='\0';
cout << str+8 << endl;

This is all strtok does, change your spaces into NULs and tell you where it started.

There's much less technically wrong with your programs now, most of your problems are innocent mistakes. But an innocent mistake with a pointer makes your program explode without even telling you where or why.

This leaves you trying to fix your program by wild guessing, which is incredibly frustrating. Let me help you out.

#include <stdio.h>
#include <string.h>
#include <assert.h>

int main() {
         char str[]="abc def ghi";
         char *ptr=strtok(str, " ");

        // printf would crash if we fed it NULL.
        // so we tell assert, "we are assuming ptr != NULL"
        // and if your assumption is incorrect, it dies.
        assert(ptr != NULL);
        printf("%s\n", ptr);

        ptr=strtok(NULL, " ");
        assert(ptr != NULL);
        printf("%s\n", ptr);

        ptr=strtok(NULL, " ");
        assert(ptr != NULL);
        printf("%s\n", ptr);

        ptr=strtok(NULL, " ");
        assert(ptr != NULL);
        printf("%s\n", ptr);

        ptr=strtok(NULL, " ");
        assert(ptr != NULL);
        printf("%s\n", ptr);
}

$ gcc myassert.c
$ ./a.out
abc
def
ghi
a.out: myassert.c:24: main: Assertion `ptr != ((void *)0)' failed.
Aborted

$

You can consider an assert to be a "controlled crash". This should take a lot of the mystery out of your programs because, unlike a segfault, it tells you exactly where and why it broke down. You can dump them wherever you want without changing your program logic.

It might surprise you just how short a function strtok() is. Here's a simplified one for clarity:

#include <stdio.h>

char *last; /* Where my_tok left off on the previous call */

/**
 * A simplified strtok that only uses one char as a delimiter.
 * The real strtok takes a string, and stops at ANY char in it.
 *
 * 'first' points to the first character in the string.
 * If we give it NULL, it continues from 'last'.
 *
 * It remembers wherever it left off in the global variable 'last'.
 */
char *my_tok(char *first, char c)
{
        int pos=0;

        /* If given a string, start over here */
        if(first != NULL)       last=first;

        first=last;     /* Pick up wherever we left off */

        /* Our very first char is NUL?  Nothing left, give up. */
        if(first[0] == '\0') return(NULL);

        /* Increment 'pos' until we find c or the NUL terminator */
        while(first[pos] && (first[pos] != c)) pos++;

        /**
         * If we found a separator, replace it with a NUL terminator
         * The string beginning in 'first' will now stop early, here.
         *
         * A 'while' loop is used to catch several in a row.
         */
        while(first[pos]==c)
        {
                first[pos]='\0';
                pos++;
        }

        // Remember exactly where we left off.
        last += pos;

        // Return a pointer to where we started.
        return(first);
}

int main(void) {
        char buf[128]="abcd  efgh  jklm  nop";
        char *tok=my_tok(buf, ' ');

        while(tok != NULL)
        {
                fprintf(stderr, "tok=%s\n", tok);
                tok=my_tok(NULL, ' ');
        }
}

$ gcc mystr.c
$ ./a.out

tok=abcd
tok=efgh
tok=jklm
tok=nop

$

yifangt · September 19, 2014, 1:47pm

Two questions about the movement of the pointer char *tok and about my planned string map of sequences.
With multiple strings such as:

char buf1[128] = "This is a test";
char buf2[128] = " Second string with a leading space";
char buf3[128] = "";
char buf4[128] = "\n\t\nForth string with leading unprintable chars";

Using your my_tok() function, it is easy to parse each string (char array) and print it on screen as the pointer moves forward.
Question 1: How to save (NOT print) the concatenated strings in memory?

string1="Thisisatest";
string2="Secondstringwithaleadingspace";
string4="Forthstringwithleadingunprintablechars";

Of course string3 will be an empty one, and a master string

string ="ThisisatestSecondstringwithaleadingspaceForthstringwithleadingunprintablechars";

The reason I ask about "saving" is the manipulation of the variable char *tok .
I am quite vague about whether this pointer lives on the stack and/or the heap (if I am not too wrong with the two terms!?)
Question 2: How does the pointer char *tok (and probably some new pointers for the concatenated strings) move back and forth to build those individual concatenated strings and the master string?

Corona688 · September 19, 2014, 3:15pm

Question 1: You worked that out pages ago, string += token; For std::string anyway. For C-strings, it means adding more to the end of an array, so you have to worry about whether there's room, etc.

Question 2: Follow the logic in the function. I've annotated the values of 'last' and 'first' in the comments so you can see which one strtok is using when.

First case: You give it a new string:

char *last;
char *my_tok(char *first, char c)
{
        int pos=0;

        last=first; /* 'last' now points to "abc def ghi", same as 'first' */
        first=last; /* because of the statement above, 'last' is already equal to 'first' */

        /* Increment 'pos' until we find c or NUL */
        while(first[pos] && (first[pos] != c)) pos++;

        // pos will now be 3, because first[3] == ' '

        first[pos]='\0';
        pos++;

        // 'last' currently points to "abc\0def ghi"
        last += pos;
        // Now 4 ahead, pointing to "def ghi".
        // It is no longer equal to 'first'.

        // Return a pointer to where we started, which still points to
        // the same place, but "abc def ghi" has become "abc\0def ghi".
        // The variable 'last' knows where we left off, pointing to "def ghi".
        return(first);
}

int main() {
        char buf[]="abc def ghi";
        char *tok=my_tok(buf, ' ');
}

You get the exact same pointer you put in. This makes sense -- strtok modifies the original and gives it back.

Second case: Getting an additional token from the previous string:

char *my_tok(char *first, char c)
{
        int pos=0;

        //last=first;        /* Since 'first' is NULL, this DOES NOT happen: */

        /* Right now, 'first' points to NULL. */
       first=last;
        /* 'first' now points to "def ghi" instead. */

        /* Increment 'pos' until we find c or NULL */
        while(first[pos] && (first[pos] != c)) pos++;

        // pos is now 3 again, since first[3] == ' '

        first[pos]='\0'; // replace that ' ' with '\0'
        pos++; // Increment once, to include that '\0' in the length
        // pos is now 4.  "def\0" is exactly 4 chars.

        // currently, 'last' points to "def\0ghi"
        last += pos;
        // last now points to "ghi".

        // Return a pointer to where we started, "def\0ghi".
        // 'last' remembers where we left off, four further ahead at "ghi".
        return(first);
}

int main(void) {
        char buf[128]="abc def ghi";
        char *tok=my_tok(buf, ' ');

        while(tok != NULL)
        {
                fprintf(stderr, "tok=%s\n", tok);
                tok=my_tok(NULL, ' ');
        }
}

This time, we get the string from last. It was "def ghi" before, altered to "def\0ghi" to split the token, and a pointer to its start is returned to us. last, on the other hand, is changed, now pointing to "ghi".

yifangt · September 19, 2014, 6:23pm

Is there any special reason you used char as delimiter for your function?

tok=my_tok(buf, ' ');

Whereas normal one is

tok=strtok(buf, " ");

I guess the two are quite different behind the scenes, as ' ' is a char whereas " " is a string. Yours uses a single char as the delimiter and strtok() accepts multiple delimiter chars, right?

Corona688 · September 19, 2014, 7:51pm

I made it as short as I could without calling any string.h functions.

Making it use a string would be a very simple change from (first[pos] != c) to (strchr(t, first[pos]) == NULL) -- or a small loop, if written without strchr:

char *my_tok(char *first, char *t)
{
        int pos=0;

        /* If given a string, start over here */
        if(first != NULL)       last=first;

        first=last;     /* Pick up wherever we left off */

        /* Our very first char is NULL?  Give up. */
        if(first[0] == '\0') return(NULL);

        /* Increment 'pos' until we find a delimiter or the NUL terminator */
        while(first[pos])
        {
                int n;
                // Check for, and stop at, any delimiter character.
                for(n=0; t[n]; n++) if(first[pos]==t[n]) break;

                // If we found a delimiter, t[n] won't be NUL.
                if(t[n]) break; // Stop at delimiter char
                pos++;          // Not a delimiter, keep scanning
        }

        /**
         * If we found a separator, replace it with a NUL terminator
         * The string beginning in 'first' will now stop early, here.
         *
         * A 'while' loop is used to catch several in a row.  It never
         * steps past the NUL that ends the whole string.
         */
        while(first[pos])
        {
                int n;
                for(n=0; t[n]; n++) if(first[pos] == t[n]) break;
                // Leave the loop if this is not a delimiter char
                if(!t[n]) break;
                first[pos]='\0';
                pos++;
        }

        // Remember exactly where we left off.
        last += pos;

        // Return a pointer to where we started.
        return(first);
}

Yes, it uses any of them. strtok(buf, "abe") is telling it "end the token when you find one or more of ANY of these characters". When breaking tokens on space, I also check for tabs, carriage returns, and newlines out of habit. That'll make it work even on the messiest text. (It also eats the newlines fgets includes in the lines it reads, a habit getline does not share.)

The real strtok will also strip off leading characters -- scanning " a b c d e " would find "a", "b", "c", "d", "e", while my "fake" strtok would find "", "a", "b", "c", "d", "e". Another little loop at the beginning would fix that.