Print lines matching a specific pattern in C++

Input file:

@HWI-BRUNOP1_header_1
GACCAATAAGTGATGATTGAATCGCGAGTGCTCGGCAGATTGCGATAAAC
+HWI-BRUNOP1_header_1
TNTTJTTTETceJSP__VRJea`_NfcefbWe[eagggggfgdggBBBBB
@HWI-BRUNOP1_header_2
CAGCAGACGCTTTGATTGCTCGATCTCTTGGTAAATACGGCATCATCTGC
+HWI-BRUNOP1_header_2
TJTTTJFFFFa`TWNMPJGTbZSTPZJHHGT^I^H^SKZeeeeeeb``RT

Desired output file:

>HWI-BRUNOP1_header_1
GACCAATAAGTGATGATTGAATCGCGAGTGCTCGGCAGATTGCGATAAAC
>HWI-BRUNOP1_header_2
CAGCAGACGCTTTGATTGCTCGATCTCTTGGTAAATACGGCATCATCTGC

Rules to follow when writing the C++ program:

  1. Print each line that starts with "@HWI" and the line immediately below it;
  2. Change "@HWI" into "#HWI" and save the output to another file.

Command I tried on the 14 GB input file:
[home@cpp]time grep -A1 '@HWI' input_file.txt | sed -e 's/--//g' -e 's/@HWI/#HWI/g' | sed '/^$/d' > output_file.txt
real    7m13.413s
user    5m25.382s
sys     1m37.668s
[1]+  Done       time grep -A1 '@HWI' input_file.txt | sed -e 's/--//g' -e 's/@HWI/#HWI/g' | sed '/^$/d' > output_file.txt           
[home@cpp]cat output_file.txt
>HWI-BRUNOP1_header_1
GACCAATAAGTGATGATTGAATCGCGAGTGCTCGGCAGATTGCGATAAAC
>HWI-BRUNOP1_header_2
CAGCAGACGCTTTGATTGCTCGATCTCTTGGTAAATACGGCATCATCTGC

Desired format to run the c++ program:

cplusplus_program_name input_file_name output_file_name

Thanks for any advice.

You're already getting 30 megabytes per second translation rate, which comes to 60 megs/second transfer rate considering you're reading AND writing. Just how fast is your disk? Will a hardcoded solution actually be faster?
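For reference, the roughly 30 MB/s figure follows from the timing in the question. This is only a sketch of the arithmetic, assuming "14Gb" means 14 x 1024^3 bytes and using the reported wall-clock time of 7m13.413s (433.413 seconds). The function name mib_per_second is made up for this illustration.

```c
#include <stdio.h>

/* Hypothetical helper: converts a byte count and an elapsed wall-clock
   time in seconds into MiB per second. */
double mib_per_second(double bytes, double seconds)
{
    return bytes / (1024.0 * 1024.0) / seconds;
}
```

Calling mib_per_second(14.0 * 1024 * 1024 * 1024, 433.413) gives roughly 33 MiB/s of input read.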

But if you really want a hardwired solution:

#include <stdio.h>
#include <string.h>

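int main(void)
{
       char buf[4096];

       /* Read one line per iteration; only "@HWI" lines trigger output. */
       while(fgets(buf, 4096, stdin))
       {
               if(strncmp(buf, "@HWI", 4) == 0)
               {
                       buf[0]='#';             /* "@HWI..." becomes "#HWI..." */
                       fputs(buf, stdout);
                       /* Read and print the line that follows the header. */
                       if(fgets(buf, 4096, stdin) == NULL) continue;
                       fputs(buf, stdout);
               }
       }
       return 0;
}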

works pretty close to how you want:

./cname < infile > outfile

You can also do it in awk with

awk '/^@HWI/ { sub("@", "#", $1); print; getline ; print }' < infile > outfile

The C version seems faster than the awk one admittedly!

Hi Corona688,

Many thanks for your program.
I really appreciate it.
Based on your experience, should I edit your program to use "argv", "f_read", and "f_write" in order to get the desired invocation format:

cplusplus_program_name input_file_name output_file_name

It really doesn't matter whether the shell or the program does the file-opening for a simple program that always has one input file and one output file, but since you insist:

#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
       char buf[4096];
       FILE *fin, *fout;

       if(argc != 3)
       {
               fprintf(stderr, "Usage:  %s filein fileout\n", argv[0]);
               return(1);
       }

       fin=fopen(argv[1], "r");
       if(fin == NULL)
       {
              fprintf(stderr, "Couldn't open %s\n", argv[1]);
              return(1);
       }


       fout=fopen(argv[2], "w");
       if(fout == NULL)
       {
              fprintf(stderr, "Couldn't open %s\n", argv[2]);
              return(1);
       }

       while(fgets(buf, 4096, fin))
       {
               if(strncmp(buf, "@HWI", 4) == 0)
               {
                       buf[0]='#';
                       fputs(buf, fout);
                       if(fgets(buf, 4096, fin) == NULL) continue;
                       fputs(buf, fout);
               }
       }

       fclose(fin);
       fclose(fout);
       return(0);
}

Many thanks, Corona688.
I'm just curious what "strncmp(buf, "@HWI", 4) == 0" and "fputs(buf, fout);" mean in your source code.
I don't really understand how the string compare function is used in this case.
Many thanks for any advice.

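strncmp compares at most the first n characters of two strings (here the first 4) and returns 0 when they match, so the test is true for lines that start with "@HWI". fputs writes a string to a stream, here the output file. 'man strcmp' and 'man fputs' may help.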
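A minimal sketch of those two calls in isolation, in case it helps. The helper names starts_with_hwi and print_if_header are invented for this example and do not appear in the program above.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 when line begins with the four
   characters "@HWI", else 0. strncmp returns 0 when the compared
   characters are equal, hence the "== 0". */
int starts_with_hwi(const char *line)
{
    return strncmp(line, "@HWI", 4) == 0;
}

/* Hypothetical helper: writes line to out only if it is a header line.
   Returns 1 if the line was written, 0 otherwise. fputs does not append
   a newline, so the newline kept by fgets is what ends the line. */
int print_if_header(const char *line, FILE *out)
{
    if (!starts_with_hwi(line))
        return 0;
    return fputs(line, out) != EOF;
}
```

With the sample input, starts_with_hwi is true for the "@HWI-BRUNOP1_header_1" line and false for the sequence, "+HWI", and quality lines.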

Many thanks, Corona688.
After reading 'man strcmp' and 'man fputs', I understand why you used 'strncmp' and 'fputs' in the code :)
I'm just wondering why "if(fgets(buf, 4096, fin)== NULL ) continue;" prints only the line right after "@HWI" and not the other lines.
As I understand it, "fgets" gets a string from a stream, reading through the input file line by line.

E.g.:
[home@cpp]cat input_file.txt
@HWI-BRUNOP1_header_1
GACCAATAAGTGATGATTGAATCGCGAGTGCTCGGCAGATTGCGATAAAC
+HWI-BRUNOP1_header_1
TNTTJTTTETceJSP__VRJea`_NfcefbWe[eagggggfgdggBBBBB

Could you explain why "if(fgets(buf, 4096, fin)== NULL ) continue;" prints only the second line of the example above and not the others? Many thanks.

It reads a line every loop, but never prints anything unless it finds @HWI. When it does, it prints both that line (after subbing # for @) and the line after it (by reading and immediately printing another line).

So any lines that aren't @HWI or don't immediately follow @HWI get read but not printed.
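The same logic can be sketched as a function, which makes it easier to see which lines are written. This is only an illustration: copy_records is a hypothetical name, and the replacement character is a parameter because the rules in the question ask for '#' while the sample output shows '>'.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not from the program above: copies each "@HWI"
   line and the line after it from in to out, replacing the leading '@'
   with marker. Returns the number of lines written. */
long copy_records(FILE *in, FILE *out, char marker)
{
    char buf[4096];
    long written = 0;

    while (fgets(buf, sizeof buf, in)) {
        if (strncmp(buf, "@HWI", 4) != 0)
            continue;               /* line was read but is never printed */

        buf[0] = marker;            /* "@HWI..." becomes e.g. "#HWI..." */
        fputs(buf, out);
        written++;

        if (fgets(buf, sizeof buf, in) == NULL)
            break;                  /* the header was the last line */
        fputs(buf, out);            /* the line right after the header */
        written++;
    }
    return written;
}
```

For the eight-line sample input in the question, this writes four lines: the two headers and the two sequence lines. The "+HWI" and quality lines are consumed by the fgets in the while condition and dropped.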

Many thanks for the detailed explanation.
I still need time to 'digest' it :)
I'll ask again if I run into problems.
Thanks in advance.

From 'man fgets' :

       while(fgets(buf, 4096, fin)) //keep reading line-by-line into buffer
       {
               if(strncmp(buf, "@HWI", 4) == 0) //if buffer starts with '@HWI' 
               {
                       buf[0]='#';  //set the first char to '#'
                       fputs(buf, fout);   //and print the whole line
                       if(fgets(buf, 4096, fin) == NULL) continue;   //read next line; check for failure or eof
                       fputs(buf, fout);  //print the line 
               }
       }

This piece of code calls fputs() twice: the first call prints the modified '@HWI' line and the second prints the line after it. The loop then moves on to the next iteration and keeps reading lines until the if condition is satisfied, that is, until the next '@HWI' line is encountered. The third and fourth lines of your input are consumed by the fgets() in the while condition, fail the if test, and are therefore never printed.

@Corona: Any particular reason why you chose to use buffer of size 4096?

Not really. 4096 is just a "nice round number" in computing terms, a power of two, and plenty long enough for most text lines. 'Most' may not include yours, so adjust to taste. Anything under a couple of megabytes will usually work as a stack array. If you need really large buffers, you have to allocate them with malloc:

#include <stdlib.h>

size_t size = 1024UL * 1024UL * 64UL; // 64 megabytes
char *buf = malloc(size);             // NULL on failure; free(buf) when done

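You use buf the same way as you would a stack variable like char buf[64];, since an array and a pointer amount to the same thing in the end as far as the processor is concerned: a starting point in memory. One difference to watch for: sizeof(buf) on a pointer gives the size of the pointer, not of the buffer, so pass size to fgets rather than sizeof(buf).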
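Here is a rough sketch of a heap buffer used with fgets, including a check for lines that do not fit. The function first_line_length and its return convention are assumptions made for this example only.

```c
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: reads the first line of in into a heap buffer of
   cap bytes. Returns the line length without its trailing newline, or
   -1 on a bad cap, allocation failure, read failure, or when the line
   could not be confirmed to fit in the buffer. */
long first_line_length(FILE *in, size_t cap)
{
    char *buf;
    long len = -1;

    if (cap < 2 || cap > INT_MAX)   /* fgets takes its size as an int */
        return -1;

    buf = malloc(cap);
    if (buf == NULL)
        return -1;

    if (fgets(buf, (int)cap, in) != NULL) {
        size_t n = strlen(buf);
        if (n > 0 && buf[n - 1] == '\n')
            len = (long)(n - 1);    /* complete line */
        else if (feof(in))
            len = (long)n;          /* last line, no trailing newline */
        /* otherwise fgets stopped because the buffer was full */
    }

    free(buf);
    return len;
}
```

A missing '\n' at the end of what fgets returns is the usual sign that the buffer was too small for the line, unless the end of the file was reached.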
