Format specifier for sscanf() in C

Hello, I have formatted lines delimited by a colon ":", and I need to parse each line into two parts using sscanf() format specifiers.
infile.txt:

Sample Name: sample1
SNPs                         : 91
MNPs                         : 1
Insertions                   : 5
Deletions                    : 2
Indels                       : 0
Same as reference            : 1
Missing Genotype             : 44
SNP Transitions/Transversions: 1.74 (73/42)
Total Het/Hom ratio          : 2.96 (74/25)

And here is my code fragment:

char line[256];     //for row read from file
char name[32];      //1st part (key part) parsed from each line[]
char str1[8];       //2nd part (value part) parsed from each line[]

sscanf(line, "%[^:] : %s", name, str1);      //scan in two halves delimited by ":", in the 2nd half, only take the first part delimited by space/tab
                                             //e.g. line1: Sample Name-> name; sample1 -> str1 
                                             //line10: Total Het/Hom ratio -> name; 2.96-> str1,  discard (74/25)

There may or may not be a space/tab before or after the colon. I am ignoring the float/integer data types of the numbers for the moment.
For each line, I could not get the first half into name and the first token after the ":" into str1. The problem seems to be that sscanf() does not stop at the end of each line. I spent some time reading the sscanf() manpage and my old post, but could not figure it out myself.
What is wrong with my sscanf() line? Thanks a lot.

Maybe add a newline at the end of your format string?

For me it works: the red phrases land in the blue variables.
But there can be trailing spaces in name because %[^:] matches up to the :

Thanks!
Then there must be something else wrong with my program. I decided to post the whole program here seeking more help to debug. A test file is also attached for trial.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// practice with fgets() + sscanf() to read in multiple lines into struct
typedef struct {
    char ID[16];
    char SNPs[8];
    char MNPs[8];
    char Insertion[8];
    char Deletion[8];
    char Indels[8];                     
    char SameRef[8];            
    char MissingGT[8];             
    char SNPTransTranv[8];
    char TotalHetHomRatio[8];
    char SNPHetHomRatio[8];
    char MNPHetHomRatio[8];
    char InsertionHetHomRatio[8];
    char DeletionHetHomRatio[8];
    char IndelHetHomRatio[8];
    char InsertDeletionRatio[8];
    char Indel_SNPMNPRatio[8];
} RECORD;

int main (int argc, char *argv[])
{
    char line[256];     //for row read from file
    char name[32];      //1st part (key part) parsed from each line[]
    char str1[8];       //2nd part (value part) parsed from each line[]

    FILE* fPtr = fopen(argv[1], "r");
    RECORD record[12];           //test file only has ~ 70 rows
    static int i = 0;            //initialize counter

    while (fgets(line, 256, fPtr) != NULL) {
        //if (!sscanf(line, "%[^\r\n]", name)) continue;      //skip blank line. This line may have problem???
        
    sscanf(line, "%[^:] : %s", name, str1);    //scan in two parts delimited by ":", 
                                             //in the 2nd part, only take the first part delimited by space
    if (strstr(name, "Sample Name") != NULL)
            strcpy(record[i].ID, name);
    else if (strstr(name, "SNPs") != NULL)
            strcpy(record[i].SNPs, str1);
    else if (strstr(name, "MNPs") != NULL)
            strcpy(record[i].MNPs, str1);
    else if (strstr(name, "Insertions") != NULL)
            strcpy(record[i].Insertion, str1);
    else if (strstr(line, "Deletions") != NULL)
            strcpy(record[i].Deletion, str1);
    else if (strstr(line, "Indels") != NULL)
            strcpy(record[i].Indels, str1);
    else if (strstr(line, "Same as reference") != NULL)
            strcpy(record[i].SameRef, str1);
    else if (strstr(line, "Missing Genotype") != NULL)
            strcpy(record[i].MissingGT, str1);
    else if (strstr(line, "SNP Transitions") != NULL)
            strcpy(record[i].SNPTransTranv, str1);
    else if (strstr(line, "Total Het/Hom") != NULL)
            strcpy(record[i].TotalHetHomRatio, str1);
    else if (strstr(line, "SNP Het/Hom ratio") != NULL)
            strcpy(record[i].SNPHetHomRatio, str1);
    else if (strstr(line, "MNP Het/Hom ratio") != NULL)
            strcpy(record[i].MNPHetHomRatio, str1);
    else if (strstr(line, "Insertion Het/Hom ratio") != NULL)
            strcpy(record[i].InsertionHetHomRatio, str1);
    else if (strstr(line, "Deletion Het/Hom ratio") != NULL)
            strcpy(record[i].DeletionHetHomRatio, str1);
    else if (strstr(line, "Indel Het/Hom ratio") != NULL)
            strcpy(record[i].IndelHetHomRatio, str1);
    else if (strstr(line, "Insertion/Deletion ratio") != NULL)
            strcpy(record[i].InsertDeletionRatio, str1);
    else if (strstr(line, "Inde/SNP+MNP ratio") != NULL)
            strcpy(record[i].Indel_SNPMNPRatio, str1);
   
    printf("%s: %s\n", name, str1);         //puts() always adds newline at the end of the string
        //printf("%d\n", i); 
    i++;                                    //increment of record count
    if (i > 8) exit (EXIT_FAILURE);       //truncate the input file, need improved
	}
    fclose(fPtr);
    return 0;
}

My code compiled without problems, but it only handled the first RECORD correctly and then hit a segmentation fault.

./myprog vcfstats.txt

Sample Name: sample1
SNPs                         : 91
MNPs                         : 1
Insertions                   : 5
Deletions                    : 2
Indels                       : 0
Same as reference            : 1
Missing Genotype             : 44
SNP Transitions/Transversions: 1.74
Total Het/Hom ratio          : 2.96
SNP Het/Hom ratio            : 2.79
MNP Het/Hom ratio            : -
Insertion Het/Hom ratio      : 4.00
Deletion Het/Hom ratio       : -
Indel Het/Hom ratio          : -
Insertion/Deletion ratio     : 2.50
Indel/SNP+MNP ratio          : 0.08
Sample Name: sample2
SNPs                         : 73
MNPs                         : 2
Insertions                   : 2
Deletions                    : 3
Indels                       : 0
Same as reference            : 1
Missing Genotype             : 63
SNP Transitions/Transversions: 1.87
Segmentation fault: 11

Thanks a lot again.

Again it worked for me, with the infile.txt from your post #1.
Perhaps it helps to make your strings more robust:

    char line[256];     //for row read from file
    char name[256];     //1st part (key part) parsed from each line[],
                        // entire line if there is no : separator
    char str1[8];       //2nd part (value part) parsed from each line[]
...
    while (fgets(line, sizeof(line), fPtr) != NULL) {
    str1[0]=0;    // clear str1 in case it won't be set

Thanks, did you try the attached file with 4 RECORDs?
I always get a segmentation fault at the same spot in the input, i.e. at the 10th member, TotalHetHomRatio[8], within the 2nd RECORD, sample2.
I also suspect my stupid if-else-if chain is wrong.

The code that you provided stops at line 8 with exit 1:

...
    printf("%s: %s\n", name, str1);         //puts() always adds newline at the end of the string
        //printf("%d\n", i); 
    i++;                                    //increment of record count
    if (i > 8) exit (EXIT_FAILURE);       //truncate the input file, need improved

If I remove that, then I get a segmentation fault.
There are two mistakes in the code that you provided; here is one correction:

typedef struct {
    char ID[32];
...

Further, you mix up line numbers with record numbers.
If variable i is supposed to increment with every record starting with "Sample Name" then you can have

    if (strstr(name, "Sample Name") != NULL) {
        i++;                                    //increment of record count
        strcpy(record[i].ID, name);
    }

and no increment for every line!
And in order to start with index 0 the initialization should be

    static int i = -1;           //initialize counter

Changing the RECORD increment i++; according to your correction fixed the problem!
Thank you so much!
---------------------------------------------------------------------------
Wait! There is still something I must have missed.
1) The last member of each struct RECORD is not parsed correctly;
2) Sample 10 and later are not parsed correctly when there are 10 or more samples.
This brings me back to what I originally wanted to learn: sscanf() with fgets(), and the format string I used.
Here is the reformatted code, which prints a rearranged table of the input file.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// practice with fgets() + sscanf() to read in multiple lines into struct
typedef struct {
    char ID[32];
    char SNPs[8];
    char MNPs[8];
    char Insertion[8];
    char Deletion[8];
    char Indels[8];                     
    char SameRef[8];            
    char MissingGT[8];             
    char SNPTransTranv[8];
    char TotalHetHomRatio[8];
    char SNPHetHomRatio[8];
    char MNPHetHomRatio[8];
    char InsertionHetHomRatio[8];
    char DeletionHetHomRatio[8];
    char IndelHetHomRatio[8];
    char InsertDeletionRatio[8];
    char Indel_SNPMNPRatio[8];
} RECORD;

int    main (int argc, char *argv[])
{
    char line[256];     //for row read from file
    char name[32];      //1st part (key part) parsed from each line[]
    char str1[8];       //2nd part (value part) parsed from each line[]
//    char tail[128];     //rest behind 2nd part

    FILE* fPtr = fopen(argv[1], "r");
    RECORD record[16];            //test file may have 288 ~ 306 rows including blank lines for 16 RECORD
    static int i = -1;            //initialize counter

    while (fgets(line, sizeof(line), fPtr) != NULL) {
        str1[0]='0';
        if ( line[0] == '\n' ) continue;        //skip "blank" lines with e.g. empty or invisible spaces(to be improved!).

        //scan in two parts delimited by ":", in the 2nd part, only take the first part delimited by space
        sscanf(line, "%[^:] : %s", name, str1);   
        if (strstr(name, "Sample Name") != NULL) {
            i++;
            strcpy(record[i].ID, str1);
            printf("%s ", str1);
        }
        else if (strstr(name, "SNPs") != NULL) {
            strcpy(record[i].SNPs, str1);
            printf("%s ", str1);}
        else if (strstr(name, "MNPs") != NULL) {
            strcpy(record[i].MNPs, str1);
            printf("%s ", str1);}
        else if (strstr(name, "Insertions") != NULL) {
            strcpy(record[i].Insertion, str1);
            printf("%s ", str1);}
        else if (strstr(name, "Deletions") != NULL) {        //changed the variable line -> name in the original post, and all the rest after this line
            strcpy(record[i].Deletion, str1);
            printf("%s ", str1);}
        else if (strstr(name, "Indels") != NULL) {
            strcpy(record[i].Indels, str1);
            printf("%s ", str1);}
        else if (strstr(name, "Same as reference") != NULL) {
            strcpy(record[i].SameRef, str1);
            printf("%s ", str1);}
        else if (strstr(name, "Missing Genotype") != NULL) {
            strcpy(record[i].MissingGT, str1);
            printf("%s ", str1);}
        else if (strstr(name, "SNP Transitions") != NULL) {
            strcpy(record[i].SNPTransTranv, str1);
            printf("%s ", str1);}
        else if (strstr(name, "Total Het/Hom") != NULL) {
            strcpy(record[i].TotalHetHomRatio, str1);
            printf("%s ", str1);}
        else if (strstr(name, "SNP Het/Hom ratio") != NULL) {
            strcpy(record[i].SNPHetHomRatio, str1);
            printf("%s ", str1);}
        else if (strstr(name, "MNP Het/Hom ratio") != NULL) {
            strcpy(record[i].MNPHetHomRatio, str1);
            printf("%s ", str1);}
        else if (strstr(name, "Insertion Het/Hom ratio") != NULL) {
            strcpy(record[i].InsertionHetHomRatio, str1);
            printf("%s ", str1);}
        else if (strstr(name, "Deletion Het/Hom ratio") != NULL) {
            strcpy(record[i].DeletionHetHomRatio, str1);
            printf("%s ", str1);}
        else if (strstr(name, "Indel Het/Hom ratio") != NULL) {
            strcpy(record[i].IndelHetHomRatio, str1);
            printf("%s ", str1);}
        else if (strstr(name, "Insertion/Deletion ratio") != NULL) {
            strcpy(record[i].InsertDeletionRatio, str1);
            printf("%s ", str1);}
        else if (strstr(name, "Inde/SNP+MNP ratio") != NULL) {
            strcpy(record[i].Indel_SNPMNPRatio, str1);
            printf("%s ", str1); }
        else printf("Sthwrong!\n"); 
        // printf("%s: %s\n", name, str1);        
    }
    puts("END");        //For debug. puts() always adds newline at the end of the string
    fclose(fPtr);
    return 0;
}
./prog infile

The attached infile simply repeats the first 4 RECORDs with unique sample IDs, for testing.
The wrong output is:

sample1 91 1 5 2 0 1 44 1.74 2.96 2.79 - 4.00 - - 2.50 Sthwrong!
Sthwrong!
sample2 73 2 2 3 0 1 63 1.87 2.59 2.50 1.00 - - - 0.67 Sthwrong!
Sthwrong!
sample3 87 1 4 2 0 1 42 1.74 2.96 2.79 - 2.00 - - 1.25 Sthwrong!
Sthwrong!
sample4 83 1 2 3 0 4 65 1.87 2.59 2.50 1.00 - - - 0.67 Sthwrong!
Sthwrong!
sample5 91 1 5 2 0 1 44 1.74 2.96 2.79 - 4.00 - - 2.50 Sthwrong!
Sthwrong!
sample6 73 2 2 3 0 1 63 1.87 2.59 2.50 1.00 - - - 0.67 Sthwrong!
Sthwrong!
sample7 87 1 4 2 0 1 42 1.74 2.96 2.79 - 2.00 - - 1.25 Sthwrong!
Sthwrong!
sample8 83 1 2 3 0 4 65 1.87 2.59 2.50 1.00 - - - 0.67 Sthwrong!
Sthwrong!
sample9 91 1 5 2 0 1 44 1.74 2.96 2.79 - 4.00 - - 2.50 Sthwrong!
Sthwrong!
Sthwrong!
73 2 2 3 0 1 63 1.87 2.59 2.50 1.00 - - - 0.67 Sthwrong!
Sthwrong!
Sthwrong!
87 1 4 2 0 1 42 1.74 2.96 2.79 - 2.00 - - 1.25 Sthwrong!
Sthwrong!
Sthwrong!
83 1 2 3 0 4 65 1.87 2.59 2.50 1.00 - - - 0.67 Sthwrong!
Sthwrong!
Sthwrong!
91 1 5 2 0 1 44 1.74 2.96 2.79 - 4.00 - - 2.50 Sthwrong!
Sthwrong!
Sthwrong!
73 2 2 3 0 1 63 1.87 2.59 2.50 1.00 - - - 0.67 Sthwrong!
Sthwrong!
Sthwrong!
87 1 4 2 0 1 42 1.74 2.96 2.79 - 2.00 - - 1.25 Sthwrong!
Sthwrong!
Sthwrong!
83 1 2 3 0 4 65 1.87 2.59 2.50 1.00 - - - 0.67 Sthwrong!
Sthwrong!
END

But, what is expected is:

sample1 91 1 5 2 0 1 44 1.74 2.96 2.79 - 4.00 - - 2.50 0.08
sample2 73 2 2 3 0 1 63 1.87 2.59 2.50 1.00 - - - 0.67 0.07
sample3 87 1 4 2 0 1 42 1.74 2.96 2.79 - 2.00 - - 1.25 0.08
sample4 83 1 2 3 0 4 65 1.87 2.59 2.50 1.00 - - - 0.67 0.07
sample5 91 1 5 2 0 1 44 1.74 2.96 2.79 - 4.00 - - 2.50 0.08
sample6 73 2 2 3 0 1 63 1.87 2.59 2.50 1.00 - - - 0.67 0.07
sample7 87 1 4 2 0 1 42 1.74 2.96 2.79 - 2.00 - - 1.25 0.08
sample8 83 1 2 3 0 4 65 1.87 2.59 2.50 1.00 - - - 0.67 0.07
sample9 91 1 5 2 0 1 44 1.74 2.96 2.79 - 4.00 - - 2.50 0.08
sample10 73 2 2 3 0 1 63 1.87 2.59 2.50 1.00 - - - 0.67 0.07
sample11 87 1 4 2 0 1 42 1.74 2.96 2.79 - 2.00 - - 1.25 0.08
sample12 83 1 2 3 0 4 65 1.87 2.59 2.50 1.00 - - - 0.67 0.07
sample13 91 1 5 2 0 1 44 1.74 2.96 2.79 - 4.00 - - 2.50 0.08
sample14 73 2 2 3 0 1 63 1.87 2.59 2.50 1.00 - - - 0.67 0.07
sample15 87 1 4 2 0 1 42 1.74 2.96 2.79 - 2.00 - - 1.25 0.08
sample16 83 1 2 3 0 4 65 1.87 2.59 2.50 1.00 - - - 0.67 0.07
 END

I believe the string size is correct, as the last member record[i].Indel_SNPMNPRatio is never more than 3 digits. Any help is greatly appreciated.

The str1 variable is too small. "sample10" is 8 characters, so str1 must be at least 8+1=9 bytes wide.

    char name[256];      //1st part (key part) parsed from each line[]
    char str1[12];       //2nd part (value part) parsed from each line[]

Also, I was wrong in my previous post; it should have been

typedef struct {
    char ID[12];
    char SNPs[8];
...

Because ID only needs to store str1.

Thanks!
Changing the char array sizes to ID[12] and str1[12] resolved the ID bug!

Last bug (hopefully!): the last variable is printed twice, which I really cannot understand.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// practice with fgets() + sscanf() to read in multiple lines into struct
typedef struct {
    char ID[12];
   /*omit other members for this post */
    char Indel_SNPMNPRatio[8];
} RECORD;

int main (int argc, char *argv[])
{
    char line[512];      //for row read from file
    char name[32];      //1st part (key part) parsed from each line[]
    char str1[12];       //2nd part (value part) parsed from each line[]

    FILE* fPtr = fopen(argv[1], "r");
    RECORD record[16];            //test file may have 288 ~ 306 rows including blank lines
    static int i = -1;               //initialize counter

    while (fgets(line, sizeof(line), fPtr) != NULL) {
        str1[0]='0';
        if ( line[0] == '\n' ) continue;        //skip "blank" lines with e.g. empty or invisible spaces(to be improved!).

        sscanf(line, "%[^:] : %s", name, str1);   
        if (strstr(name, "Sample Name") != NULL) {
            i++;
            strcpy(record[i].ID, str1);
            printf("\n");
        }
        printf("%s ", str1);
    }

    puts("\nEND");      //For debug.
    fclose(fPtr);
    return 0;
}
 

./prog vcfstats.txt

sample1 91 1 5 2 0 1 44 1.74 2.96 2.79 - 4.00 - - 2.50 0.08 0.08 
sample2 73 2 2 3 0 1 63 1.87 2.59 2.50 1.00 - - - 0.67 0.07 0.07 
...... 
sample15 87 1 4 2 0 1 42 1.74 2.96 2.79 - 2.00 - - 1.25 0.08 0.08 
sample16 83 1 2 3 0 4 65 1.87 2.59 2.50 1.00 - - - 0.67 0.07 0.07 
END

I want to understand all the details of these bugs.
How come the last variable gets printed twice?
Thanks again.

The empty line prints str1.
Because nothing was matched for it in sscanf(), str1 still holds the value from the previous line. That is the reason for the str1 assignment: it should set str1 to "" with one of
str1[0]=0 or str1[0]='\0' or, with a bit more overhead, strcpy(str1, "").
But you have str1[0]='0', which overwrites the first character of the previous value with the digit character '0'.

I was so stupid: when you first pointed out that I should add a line to empty str1 with str1[0]=0; I thought you had made a typo, so I simply added the single quotes: str1[0]='0'; Last night I actually tried str1[0]='\0'; without realizing that it was the right way, and I did not understand why until your last reply. It finally works now and I understand it better.
Thank you so much for all your time and patience with me!

BTW in many cases a trim function comes in handy.
For example, the fuzzy strstr(), which can match a portion of the string, can be replaced with an exact strcmp():

...
// http://www.martinbroadhurst.com/trim-a-string-in-c.html
char *rtrim(char *str, const char *seps)
{
    int i;
    if (seps == NULL) {
        seps = "\t\n\v\f\r ";
    }
    i = strlen(str) - 1;
    while (i >= 0 && strchr(seps, str[i]) != NULL) {
        str[i] = '\0';
        i--;
    }
    return str;
}

int main (int argc, char *argv[])
{
    char line[512];      //for row read from file
    char name[32];      //1st part (key part) parsed from each line[]
    char str1[12];       //2nd part (value part) parsed from each line[]

    FILE* fPtr = fopen(argv[1], "r");
    RECORD record[16];            //test file may have 288 ~ 306 rows including blank lines
    static int i = -1;               //initialize counter

    while (fgets(line, sizeof(line), fPtr) != NULL) {
        if ( line[0] == '\n' ) continue;        //skip "blank" lines with e.g. empty or invisible spaces(to be improved!).
        str1[0]='\0';
        sscanf(line, "%[^:] : %s", name, str1);   
        rtrim(name, NULL);             //strip the trailing spaces
        if (strcmp(name, "Sample Name") == 0) {
            i++;
            strcpy(record[i].ID, str1);
            printf("\n");
        }
...

Note that the variable i in rtrim() is local to the function and does not conflict with the variable i in main().

Thanks!
Corona688 in this forum had similar code, but I skipped that part so as not to make my question too branchy.