Correct way to read data of different formats into same struct

yifangt · April 6, 2015, 12:25am

I was wondering what the correct way is to read data in "one-part-per-line" format, as compared with "one-record-per-line" format, into the same structure in C.

format1.dat:

Zacker  244.00  244.00  542.00
Lee     265.00  265.00  456.00
Walter  235.00  235.00  212.00
Zena    323.00  215.45  332.50

format2.dat:

Zacker  
244.00 
244.00 
542.00
Lee    
265.00    
265.00    
456.00
Walter    
235.00    
235.00    
212.00
Zena    
323.00    
215.45    
332.50
Mira    
285.00    
285.00    
415.00

Using the same structure as:

typedef struct info
{
  char name[20];
  double test;
  double quiz;
  double English;
} Info;

To process data in format1.dat, I have:

int main ()
{
  int n = 0;
  Info record[N];

  FILE *INFILE = fopen ("format1.dat", "r");
  while (fscanf (INFILE, "%s %lf %lf %lf", 
               record[n].name, &record[n].test,
               &record[n].quiz, &record[n].English) == 4)
    {
      printf ("%s\t%.2lf\t%.2lf\t%.2lf\n", 
              record[n].name, record[n].test,
              record[n].quiz, record[n].English);
      ++n;
    }
fclose (INFILE);

return 0;
}

How do I read the data in format2.dat into the same struct? In particular, what goes in the while () condition:

while (    ...   )  {
      printf ("%s\t%.2lf\t%.2lf\t%.2lf\n", 
               record[n].name, record[n].test,
               record[n].quiz, record[n].English);
      ++n;
    }

I was comparing this situation with awk, which by default processes a file row by row, or lets you set the RS/FS separators if fields are on different rows.
Thanks a lot!

Don_Cragun · April 6, 2015, 1:46am

I guess I don't see your problem.

Changing your program slightly so it will compile, read from standard input instead of from a hardcoded filename, and not overwrite memory following your array if you overflow the array:

#include <stdio.h>

#define	NMAX	7

typedef struct info
{
  char name[20];
  double test;
  double quiz;
  double English;
} Info;

int main()
{
  int	n = 0;
  int	ret;
  Info	record[NMAX];

  while (n < NMAX && (ret = fscanf (stdin, "%s %lf %lf %lf", 
               record[n].name, &record[n].test,
               &record[n].quiz, &record[n].English)) == 4)
    {
      printf ("%s\t%.2lf\t%.2lf\t%.2lf\n", 
              record[n].name, record[n].test,
              record[n].quiz, record[n].English);
      ++n;
    }
  printf("%d records processed.\n", n);
  printf("return code from last fscanf() call: %d\n", ret);

  return 0;
}

and running it in a directory containing format1.dat and format2.dat from post #1 in this thread as follows:

$ ./a.out < *1.dat
Zacker	244.00	244.00	542.00
Lee	265.00	265.00	456.00
Walter	235.00	235.00	212.00
Zena	323.00	215.45	332.50
4 records processed.
return code from last fscanf() call: -1
$ ./a.out < *2.dat
Zacker	244.00	244.00	542.00
Lee	265.00	265.00	456.00
Walter	235.00	235.00	212.00
Zena	323.00	215.45	332.50
Mira	285.00	285.00	415.00
5 records processed.
return code from last fscanf() call: -1
$ cat *.dat | ./a.out
Zacker	244.00	244.00	542.00
Lee	265.00	265.00	456.00
Walter	235.00	235.00	212.00
Zena	323.00	215.45	332.50
Zacker	244.00	244.00	542.00
Lee	265.00	265.00	456.00
Walter	235.00	235.00	212.00
7 records processed.
return code from last fscanf() call: 4

migurus · April 6, 2015, 6:16pm

I think asking how to read this set of data is not exactly the right question here; the better question is how to write those files. If you have the option to choose how the data is given to you, I'd much prefer one-record-per-line, since the end of line then serves naturally as a record separator. With the one-part-per-line approach there is an extra burden on the reader program to always keep track of which field is being processed, which record is current, etc. Not to mention that line-per-record is much more readable than the other format, which is always helpful.

yifangt · April 6, 2015, 11:01pm

Thank you!

The major reason I posted this question is to understand the "flow of the data" being processed. I was comparing this situation with awk, which by default processes a file row by row, or lets you set the RS/FS separators if fields are on different rows. In practice I come across one-part-per-line more often than one-record-per-line, especially in output from other equipment, and it is not unusual to have more than 50 million records (= 200 million lines).
Another case where this format makes sense is when each field is a string containing spaces. It is better to have the fields on different lines. For example:

struct bookInfo {
char book_name[100];
char author[100]; 
int pulish_year;                    //Corrected publish_year
char press_name[60];
};

Blank lines are inserted between records only to make them easier to read.

data.txt

C Programming Language  
B. W. Kernighan & D. M. Ritchie
1988
Prentice Hall

C Programming: A Modern Approach 
K. N. King
2008
W. W. Norton & Company

Absolute Beginner's Guide To C 
Greg Perry
1994
Sams Publishing

C Primer Plus 
Stephen Prata
2004
Sams Publishing

Expert C Programming: Deep C Secrets 
Peter V. Linden 
1994
Prentice Hall

I can't imagine them on a single line with a mixture of spaces, tabs, quotes, etc. A clear delimiter is needed, but that will be a new thread, as I do have difficulty reading this type of data into a structure in C.

Coming back to the code in my first post of this thread: I did not realize the issue was related to fscanf(), which I was not sure of, but I AM SURE that my problem is related to the data "stream" from STDIN or a FILE; that's why I used awk RS/FS as a reference.
So, my question is: how does fscanf() process the second scenario (format2.dat), i.e. when the members of a record are spread across different lines instead of being on a single line?

Don_Cragun · April 7, 2015, 2:13am

Using fscanf() is great when you have any type of whitespace as a delimiter and machine produced data. But, if you have data that might be missing a field, it is sometimes hard to detect where the error happened and resync to a known good record boundary.

If you're going to be dealing with data that is delimited by <newline>s, it is usually safer reading into a line buffer (e.g., fgets() ) and parsing data from the appropriate lines into the appropriate fields in your structures ( strncpy() and sscanf() ). It is then easy to detect empty lines as record boundaries, guarantee that you don't overflow the ends of fields in your structures, and verify the data is the correct type for each field as you gather it.
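A rough sketch of that line-buffer approach, applied to the four-line book records shown earlier. The `Book` type, its field names, and the helpers `read_field()` and `read_book()` are made up for illustration; the sketch assumes no input line is longer than 255 bytes and that blank lines only appear between records.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical record type; the field names are illustrative only. */
typedef struct {
    char title[100];
    char author[100];
    int  year;
    char press[60];
} Book;

/* Read one line, strip the trailing CR/LF, and copy at most size-1 bytes.
   Returns 1 on success, 0 at end of input. */
static int read_field(FILE *in, char *dst, size_t size)
{
    char line[256];

    if (!fgets(line, sizeof line, in))
        return 0;
    line[strcspn(line, "\r\n")] = '\0';
    snprintf(dst, size, "%s", line);    /* truncates instead of overflowing */
    return 1;
}

/* Read one four-line record.
   Returns 1 on success, 0 at a clean end of input,
   -1 if the record is incomplete or the year is not a number. */
static int read_book(FILE *in, Book *b)
{
    char year[32];

    /* Skip blank separator lines in front of the title. */
    do {
        if (!read_field(in, b->title, sizeof b->title))
            return 0;
    } while (b->title[0] == '\0');

    if (!read_field(in, b->author, sizeof b->author)) return -1;
    if (!read_field(in, year, sizeof year))           return -1;
    if (!read_field(in, b->press, sizeof b->press))   return -1;

    /* Verify the type of the one numeric field. */
    if (sscanf(year, "%d", &b->year) != 1)
        return -1;
    return 1;
}
```

A caller would loop with something like `while (read_book(stdin, &b) == 1)`. Because every field arrives as a whole line, a bad year is reported for the record it belongs to and the next call starts at the following line. Trailing spaces on a line are kept as part of the field.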

P.S.: I strongly encourage you to change the name of the third field in your bookInfo structure. Many programmers would misspell references to that field as publish_year .

yifangt · April 7, 2015, 4:47pm

Thanks Don:
I am aware of what you meant by using fgets() + sscanf() + strncpy() to parse input. At the moment I am working with fscanf() and assuming the data has no missing fields.
According to this reference:

int fscanf ( FILE * stream, const char * format, ... );
Reads data from the stream and stores them according to the parameter format into the locations pointed by the additional arguments.
......
On success, the function returns the number of items of the argument list successfully filled.

The part that is unclear to me: can the items for the argument list be separated by newlines, i.e. be on different rows? In other words, does fscanf() keep looking for the specified items until all of them are found, no matter whether those items are separated by spaces, tabs or newlines?
To illustrate my confusion, take this code fragment:

  rewind (pFile);                  //Line 11
  fscanf (pFile, "%f", &f);        //Line 12
  fscanf (pFile, "%s", str);       //Line 13

Line 11: set position of pFile stream to the beginning
Line 12: Look for a float number in one line (e.g. line 1)
Line 13: Look for a string in next line(line 2), or the same line(line 1)?

  rewind (pFile);                          //Line 11
  fscanf (pFile, "%f %s", &f, str);        //Line 12

This time, would Line 12 look for a float number AND a string on the same line (line 1)?
Sorry for this naive question, which bugs me. Thanks a lot!

Don_Cragun · April 7, 2015, 8:33pm

Isn't this obvious from the code I presented in post #2 in this thread where the fscanf() format string "%s %lf %lf %lf" was able to read four values from one or two lines from format1.dat and able to read four values from four or five lines from format2.dat ? And the format string "%s%lf%lf%lf" would have produced the same results but isn't as easy for some people to read.

Have you read the man page for fscanf() recently? Look at it closely. (Characters that are classified as space characters by isspace() are ignored between strings being matched against conversion specifications other than for conversions with a conversion specifier [ , c , C , or n .)
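The whitespace rule can be checked without any data files by handing sscanf() the same record laid out both ways. This is only a small sketch: `parse_record()` is a hypothetical helper, and the `%19s` width is an addition that keeps the name inside its 20-byte array.

```c
#include <stdio.h>

typedef struct info {
    char name[20];
    double test;
    double quiz;
    double English;
} Info;

/* Parse one record from a string.
   Returns the number of fields converted (4 on success). */
static int parse_record(const char *text, Info *r)
{
    /* %s and %lf both skip any leading whitespace, newlines included,
       so the layout of the record does not matter. */
    return sscanf(text, "%19s %lf %lf %lf",
                  r->name, &r->test, &r->quiz, &r->English);
}
```

Calling `parse_record("Zacker  244.00  244.00  542.00\n", &r)` and `parse_record("Zacker\n244.00\n244.00\n542.00\n", &r)` should both return 4 and store the same values, because spaces, tabs and newlines are all just whitespace to these conversions.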

Did you try your fscanf() calls in a program? Or, are you just looking at the statements and wondering what they would do? (You could have easily answered this question yourself by putting your code in a program and trying it. And, you could probably have had the results in less time than it took you to type in your post.)

yifangt · April 8, 2015, 12:35am

Thanks Don!
Yes, I tried your code of post #2, which worked of course. Before my post I tried this version for format2.dat as I thought C takes input by line:

fscanf (INFILE, "%s", record[n].name);
fscanf (INFILE, "%lf", &record[n].test);
fscanf (INFILE, "%lf", &record[n].quiz);
fscanf (INFILE, "%lf", &record[n].English);
printf ("%s\t%.2lf\t%.2lf\t%.2lf\n", record[n].name, record[n].test, record[n].quiz, record[n].English);
++n;

and this code:

fscanf (INFILE, "%s %lf", record[n].name, &record[n].test);
fscanf (INFILE, "%lf", &record[n].quiz);
fscanf (INFILE, "%lf", &record[n].English);
printf ("%s\t%.2lf\t%.2lf\t%.2lf\n", record[n].name, record[n].test, record[n].quiz, record[n].English);
++n;

both of which worked, but look very weird even to me! Along with your version, I have three ways to do the same thing! That confusion is what prompted my post.
The books and tutorials I read use format1.dat as the example to fill a struct. The man page is not very clear to me without a concrete example:

A sequence of white-space characters (space, tab, newline, etc.; see isspace(3)).  This directive matches any amount of white  space, including none, in the input.

Handling streams (including parsing) in C is quite challenging for me; it is better now with fscanf(). Thanks a lot again!

Corona688 · April 8, 2015, 11:50am

Space in fscanf crosses any whitespace including newlines, so you could do:

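fscanf(INFILE, " %[^\r\n] %lf %lf %lf", record[n].name, &record[n].test, &record[n].quiz, &record[n].English);

The %s is changed to something that won't stop at the first whitespace: %[^\r\n] accepts all text up to but not including a carriage return or newline. Unlike %s, %[ does not skip leading whitespace by itself, so the format starts with a space to step over the newline left behind by the previous record.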

...except this is dangerous. Without a field width (e.g. %19[^\r\n] for a 20-byte name) nothing prevents a buffer overrun, error recovery is awkward, and it will choke at the first mistake in your input. Don't use fscanf. Use fgets and sscanf like I keep telling you.
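For what it's worth, here is a bounded variant of that scanset, sketched with sscanf() so it can be tried on a string. `parse_named_scores()` is a hypothetical helper; it assumes the name sits on its own line, as in format2.dat, since the scanset runs to the end of the line.

```c
#include <stdio.h>

/* Parse a name that may contain spaces, followed by three numbers.
   Returns the number of fields converted (4 on success). */
static int parse_named_scores(const char *text, char name[20], double score[3])
{
    /* Leading space: skip whitespace left over from a previous record.
       %19[^\r\n]: store at most 19 bytes, stopping before CR or LF. */
    return sscanf(text, " %19[^\r\n] %lf %lf %lf",
                  name, &score[0], &score[1], &score[2]);
}
```

With the width in place an over-long name cannot run past the array. The conversion stops after 19 bytes, the following %lf then fails on the rest of the name, and the return value is less than 4, so the caller can at least see that the record was bad.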

Corona688 · April 8, 2015, 12:09pm

scanf/fscanf/sscanf scan the input and stop as soon as the input doesn't match what you've specified. It's kind of like a regex -- you're creating a string which matches the data you want, with various kinds of wildcards. But it's very different from a regex in that it doesn't stop at the end of the line; it stops wherever the format runs out or the match fails. You can control what a string conversion accepts with %[charstoaccept] / %[^charstonotaccept], a %s-equivalent I first saw in a config file loader which suddenly had to cope with carriage returns when ported to Windows. (And people wonder why I'm paranoid about those.)

Suppose you're scanning

abcde 3.14159


c d e

with %s %lf %s . It starts with %s, and accepts 'a', 'b', 'c', 'd', 'e' into the string. Then it sees whitespace and decides the string is over.

The space between %s and %lf tells it "whitespace is acceptable", so it skips past the whitespace.

After that, it gets to %lf, scans '3', '.', '1', '4', '1', '5', '9', and converts that into a double, then hits \n, which tells it the number's now over.

The space between %lf and %s tells it whitespace is acceptable. So it skips past all three newlines and starts reading 'c' into a string, and stops immediately because it's hit some whitespace. Since it's completely finished the command string "%s %lf %s", it feels no need to do anything with that whitespace and leaves it there for next time.

...Which means the next time you scan "%s %lf %s", it picks up in the middle of that line: %s skips the space and takes 'd', then %lf fails on 'e' and the call returns 1 instead of 3. The point is, fscanf leaves the file position in an unpredictable place unless things go exactly as planned.
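The walk-through can be reproduced on a string with sscanf() and the %n directive, which records how many bytes have been consumed so far. `scan_once()` is a hypothetical helper written just for this sketch.

```c
#include <stdio.h>

/* Apply "%s %lf %s" to text and report how many bytes were consumed.
   Returns the number of fields converted. *used is only written when
   all three conversions succeed, because %n comes last in the format. */
static int scan_once(const char *text, int *used)
{
    char s1[32], s2[32];
    double d;

    return sscanf(text, "%31s %lf %31s%n", s1, &d, s2, used);
}
```

Given the sample input `"abcde 3.14159\n\n\nc d e\n"`, the first call converts three fields and stops right after the 'c', with the space before 'd' still unread. A second call starting from that position converts only one field, since %lf cannot make a number out of 'e'.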

This is why I always read entire lines. sscanf() can't choke and leave stdin in a weird place, because the entire line has already been read. If you want sscanf to read several entire lines, it can do that too.

#include <stdio.h>
#include <string.h>

typedef struct abc {
  char a[64], b[64], c[64];
} abc;

int main(void)
{
        char buf[4096];
        abc a;

        while(!feof(stdin)) {
                int bpos=0;

                // Read three lines into the same buffer.
                if(!fgets(buf+bpos, 4096-bpos, stdin)) break;
                bpos = strlen(buf); // could be more efficient but you get the idea
                if(!fgets(buf+bpos, 4096-bpos, stdin)) break;
                bpos = strlen(buf);
                if(!fgets(buf+bpos, 4096-bpos, stdin)) break;

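                // Field widths keep each match inside its 64-byte array; skip malformed records.
                if(sscanf(buf, "%63[^\r\n] %63[^\r\n] %63[^\r\n]", a.a, a.b, a.c) != 3) continue;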

                printf("a: %s\nb: %s\nc: %s\n", a.a, a.b, a.c);
        }
}

$ printf "a b c\nd e f\ng h i\n" | ./a.out

a: a b c
b: d e f
c: g h i

$

yifangt · April 8, 2015, 12:19pm

I panicked when I saw these phrases:

A solid understanding of the functions fscanf(), sscanf(), fgets() is needed to use them properly.
Use fgets and sscanf like I keep telling you

I was trying to understand why, and before that I need to grasp how the "flow of the data" (sometimes I think of it as a stream) is read into memory, or handled by the program. I borrowed RS/FS to show my understanding, since awk handles formatted data with these two variables, which apparently is not appropriate here. With so many options, I need to understand the right choice. This part, "Read three lines into the same buffer", is hard for me at first sight:

                // Read three lines into the same buffer.
                if(!fgets(buf, 4096-bpos, stdin)) break;          //To help my understanding, can I do this, as bpos is set 0 before the loop?
                bpos = strlen(buf);                               // could be more efficient but you get the idea
                if(!fgets(buf+bpos, 4096-bpos, stdin)) break;
                bpos = strlen(buf);
                if(!fgets(buf+bpos, 4096-bpos, stdin)) break;

Thanks a lot!

Corona688 · April 8, 2015, 12:37pm

I'm sorry, I've been editing my posts a lot. I've now included an explanation of exactly what fscanf does and why it's so annoying -- it might read halfway through a line and stop there.

Corona688 · April 8, 2015, 12:46pm

   // Read three lines into the same buffer.
   if(!fgets(buf, 4096-bpos, stdin)) break;          //To help my understanding, can I do this, as bpos is set 0 before the loop?
   bpos = strlen(buf);                               // could be more efficient but you get the idea
   if(!fgets(buf+bpos, 4096-bpos, stdin)) break;
   bpos = strlen(buf);
   if(!fgets(buf+bpos, 4096-bpos, stdin)) break;

Yes, you can. I have 3 lines exactly the same because I was going to put them into a loop, but didn't, to make it clearer.
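For reference, the loop form might look roughly like this. `read_lines()` is a hypothetical helper, and the sketch assumes each line fits in the space still left in the buffer; a longer line would be cut off and still counted.

```c
#include <stdio.h>
#include <string.h>

/* Append up to nlines lines from in to buf, which holds size bytes.
   Returns the number of lines actually read. */
static int read_lines(FILE *in, char *buf, size_t size, int nlines)
{
    size_t pos = 0;
    int got = 0;

    if (size == 0)
        return 0;
    buf[0] = '\0';
    while (got < nlines && pos + 1 < size) {
        if (!fgets(buf + pos, (int)(size - pos), in))
            break;                      /* end of input or read error */
        pos += strlen(buf + pos);       /* only measure the new part */
        ++got;
    }
    return got;
}
```

The three fgets() calls then become `if (read_lines(stdin, buf, sizeof buf, 3) != 3) break;`, followed by the same sscanf() on buf.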