Wildcard Pattern Matching In C

I've been having problems lately trying to do pattern matching in C while implementing wildcards. Take for instance the following code:

#include <sys/types.h>
#include <sys/stat.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <dirent.h>
#include <string.h>
#include <time.h>

void grepwc(char *b) {

    FILE *fp;
    fp = fopen("/var/log/apache2/other_vhosts_access.log", "r");
    char line[100];
    unsigned int i = 0;

    while(fgets(line, sizeof(line), fp)) {
        if (strstr(line, b) != NULL) {
            i++;
        }
    }

    printf("%s %d\n", b, i);
    fclose(fp);
}

int main() {

    time_t current = time(NULL);
    char date_time[10];
    char newhold[34];

/*  strftime(date_time, sizeof(date_time), "%d", localtime(&current));
    strncat( newhold, date_time, 10 );
    strncat( newhold, "?", 2 );
    strftime(date_time, sizeof(date_time), "%b", localtime(&current));
    strncat( newhold, date_time, 10 );
    strncat( newhold, "*", 2 ); */
    strncat( newhold, "pattern", 10 );

    grepwc(newhold);

    return 0;
}

With what I have commented out it looks for the word "pattern" in a log file and counts each match:

$ ./test
pattern 2

I should note that the output may look a little broken depending on your processor architecture. This is the file I am reading from right now:

$ cat /var/log/apache2/other_vhosts_access.log
01 Jul pattern
pattern

However, if we uncomment these lines:

                    strftime(date_time, sizeof(date_time), "%d", localtime(&current));         
                    strncat( newhold, date_time, 10 );         
                    strncat( newhold, "?", 2 );         
                    strftime(date_time, sizeof(date_time), "%b", localtime(&current));         
                    strncat( newhold, date_time, 10 );         
                    strncat( newhold, "*", 2 ); 

The date is not matched:

$ ./test 
01?Jul*pattern 0

I've searched through lots of pattern matching tutorials for C and even tried applying some regex. The best I got was to get this to work with only one wildcard, but I need it to work with two or more. Any suggestions are greatly appreciated.

Looking at strstr's man page, I can't see that it would accept any wildcard characters or regexes. So you might need to build your own grep routine?

I looked at this man page just now. I did not see anything saying it would or would not accept wildcards or regex on my OS. I also noticed these:

SEE ALSO
       index(3), memchr(3), rindex(3), strcasecmp(3),  strchr(3),  string(3),
       strpbrk(3), strsep(3), strspn(3), strtok(3), wcsstr(3)

I looked at the man pages for most of these too, but found no mention of wildcards or regex. Am I missing something in these man pages? Or does anyone know a function that would work for this purpose?

You won't find anything in those libraries. Check into regex.h ...

There are two basic kinds of pattern matching: file names and strings.

fnmatch() is used to match wildcards like ? and * in file name patterns.
regcomp(), regexec(), regfree() are called in that order to build, then execute, then release resources for grep- and egrep-like pattern matching.

Generally you are better off using these library calls than rolling your own. If you already know ls-style (shell) pattern matching, the fnmatch call is easy to use.
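For reference, here is a minimal sketch of the fnmatch() route. The helper name line_matches_glob and the 256-byte buffer are made up for illustration. Two details tend to trip people up: fnmatch() has to match the whole string rather than a substring, so a pattern meant to be found anywhere in a line needs a leading and trailing *, and the third argument is a bitmask of FNM_* flags (0 for none), not a length.

```c
#include <fnmatch.h>
#include <string.h>

/* Hypothetical helper (not part of any library): returns 1 if the
 * shell-style pattern matches the WHOLE line, 0 if it does not, and
 * -1 if fnmatch() reports an error. The trailing newline that fgets()
 * leaves behind is dropped first. Lines longer than the local buffer
 * are truncated, which is an assumption made to keep the sketch short. */
int line_matches_glob(const char *pattern, const char *line)
{
    char buf[256];
    size_t len = strcspn(line, "\n");   /* length without the newline */
    int rc;

    if (len >= sizeof(buf))
        len = sizeof(buf) - 1;
    memcpy(buf, line, len);
    buf[len] = '\0';

    rc = fnmatch(pattern, buf, 0);      /* 0 = no FNM_* flags */
    if (rc == 0)
        return 1;
    if (rc == FNM_NOMATCH)
        return 0;
    return -1;
}
```

With a helper like this, line_matches_glob("*03?Jul*pattern*", line) would be the wildcard counterpart of a strstr() test inside the fgets() loop.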

The code structure for emulating what the grep command does is a little more complex.
If you remember, grep and egrep have a lot of options. Since they are implemented by the regex family of calls, those calls are more complex too. regcomp() (regular expression compile) supports several of the options as flags when it builds the compiled expression; regexec() supports the others.

There is also the PCRE library, which implements Perl-compatible regular expressions. If you are a Perl user, consider that library.

Don't try to roll your own if you've never gotten fully acquainted with a regex library. If you must, read Russ Cox to get an idea how to proceed.

Implementing Regular Expressions

Site has howtos

Don't get me wrong, regex is a great and wonderful tool. However, from what I understand regex in C is re-compiled every time it runs. That would not be so bad if I was not planning to call this code later many times with threads on a system where resources are very tight. For that reason I'm trying to stay away from them if possible.

I tried looking at fnmatch earlier. I changed my while loop to this, but the pattern was not matched:

    while(fgets(line, sizeof(line), fp)) {
        if (fnmatch(newhold, line, 100) == 0){
        i++;
      }
    }

I looked at the man page and some examples online, and it appears the second parameter to fnmatch() needs to be a constant or a struct value. Either that, or I'm doing something wrong that I'm not aware of?

If you don't need too many different regexes but could repeatedly (re)use a handful of them on many different strings, compile each of them once into a new pattern buffer, and run all the comparisons with these few pattern buffers.
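A minimal sketch of the compile-once idea, assuming POSIX <regex.h>: the helper name count_matching is hypothetical. regcomp() runs a single time, and the resulting regex_t is then passed to regexec() for every string, so the compile cost is not paid per comparison.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compiles the ERE once, then runs it against each
 * of the n strings. Returns the number of matching strings, or -1 if
 * the pattern does not compile. */
int count_matching(const char *pattern, const char *const *lines, size_t n)
{
    regex_t re;
    int count = 0;
    size_t i;

    /* REG_NOSUB: only a yes/no answer is needed, no match offsets. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    for (i = 0; i < n; i++) {
        if (regexec(&re, lines[i], 0, NULL, 0) == 0)
            count++;
    }

    regfree(&re);   /* release the pattern buffer when finished */
    return count;
}
```

In a long-running program the regex_t could instead be kept around (for example in a static or a struct) and freed only at shutdown.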

You can "statically" pre-compile to a text file: use the regcomp command - not a C call. It creates the compiled buffer you need in a file. I typically use maybe a dozen pre-compiles in a simple app for validating input. You put them in an include file.

And Rudic is correct - you can compile once and then run the compiled buffer multiple times internally -- as well as pre-compile.

Well, after no luck with fnmatch() and other pattern matching functions I found online I decided to give regex a try:

#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <time.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <regex.h>

int main() {

    char newhold[40];

    regex_t re;
    time_t current = time(NULL);
    char day[10];
    char mon[10];
    int retval = 0;

    strftime(day, sizeof(day), "%d", localtime(&current));
    strncat( newhold, day, 10 );
    strncat( newhold, "?", 2 );
    strftime(mon, sizeof(mon), "%b", localtime(&current));
    strncat( newhold, mon, 10 );
    strncat( newhold, "*", 2 );
    strncat( newhold, "pattern", 10 );

    if(regcomp(&re , newhold, REG_EXTENDED) != 0 ){
        return 1;   /* main() returns int, so a bare "return;" is not valid here */
    }

    FILE *fp;
    fp = fopen("/var/log/apache2/other_vhosts_access.log", "r");
    char line[100];
    unsigned int i = 0;

    while(fgets(line, sizeof(line), fp)) {
        if ((retval = regexec(&re, line, 0, NULL, 0)) == 0){
            i++;
        }
    }
    printf("%s %d\n", newhold, i );
    fclose(fp);

    return 0;
}

Obviously this doesn't work. I know in the following part that "newhold" would normally have a constant defined instead of a char array. I could do that with the "pattern" section of this regex and with the "?" and "*" wildcards. However, the variables "day" and "mon" are going to be checked by the system every time the code runs. So a constant wouldn't work in this case.

Perhaps I'm going wrong in other aspects as well, but that's the biggest problem I see at the moment. Anyone know any tricks to pass variables into regex for this? I tried searching that online as well with no success. Maybe my Google-fu is just lacking?

Also, I'm not finding much on creating a file from the regcomp command either. I'd love to see that if anyone can provide an example.

What is in /var/log/apache2/other_vhosts_access.log ?

Please explain in English what you are hoping the extended regular expression you have created in newhold[] will match.

Which of the lines in /var/log/apache2/other_vhosts_access.log do you hope will be matched by your ERE?

Take the contents of the string variable (newhold) you constructed, print it, and try it as a pattern for grep as a console command. You also need to check the return code from regexec in case something else biffed. It should be: REG_NOMATCH or zero. Anything else is a fatal error. regerror() is your friend.
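A minimal sketch of that return-code checking, assuming POSIX <regex.h>: the helper name match_or_report and the 128-byte message buffer are made up for illustration. regerror() turns the code returned by regcomp() or regexec() into a readable message.

```c
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 on a match, 0 on REG_NOMATCH, and -1
 * on any other error, after printing the regerror() text to stderr. */
int match_or_report(const char *pattern, const char *text)
{
    regex_t re;
    char msg[128];
    int rc;

    rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);
    if (rc != 0) {
        /* The pattern did not compile: ask regerror() why. */
        regerror(rc, &re, msg, sizeof(msg));
        fprintf(stderr, "regcomp(\"%s\"): %s\n", pattern, msg);
        return -1;
    }

    rc = regexec(&re, text, 0, NULL, 0);
    if (rc != 0 && rc != REG_NOMATCH) {
        /* Neither "match" nor "no match": a real failure. */
        regerror(rc, &re, msg, sizeof(msg));
        fprintf(stderr, "regexec: %s\n", msg);
    }
    regfree(&re);

    if (rc == 0)
        return 1;
    return (rc == REG_NOMATCH) ? 0 : -1;
}
```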

Sorry, I thought I had provided the contents of the file earlier. Here it is again with just the following for testing:

$ cat /var/log/apache2/other_vhosts_access.log
03/Jul blah pattern

I did not know regex in C and grep/egrep were so closely related. I did as Jim suggested and saw nothing was matched:

$ egrep "03?Jul*pattern" /var/log/apache2/other_vhosts_access.log | wc -l
0

I changed this to the following and it was matched:

$ egrep "03.Jul.*pattern" /var/log/apache2/other_vhosts_access.log | wc -l
1
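The difference between the two commands comes down to syntax: in an extended regular expression, ? and * are repetition operators on the preceding item, so "03?Jul*pattern" reads as "0", an optional "3", "Ju", zero or more "l", then "pattern". The shell-style meanings map to . (any one character) and .* (any run of characters). A rough sketch of that translation follows. The helper name glob_to_ere is hypothetical, and it deliberately ignores bracket expressions and backslash escapes in the input.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: rewrites shell-style '?' and '*' as ERE '.' and
 * '.*', and backslash-escapes the remaining ERE special characters so
 * they match literally. Returns 0 on success, -1 if out is too small. */
int glob_to_ere(const char *glob, char *out, size_t outsz)
{
    size_t o = 0;

    if (outsz == 0)
        return -1;

    for (; *glob != '\0'; glob++) {
        char tmp[3];
        const char *rep;
        size_t n;

        if (*glob == '?') {
            rep = ".";                       /* any single character */
        } else if (*glob == '*') {
            rep = ".*";                      /* any run of characters */
        } else if (strchr(".[\\()+{|^$", *glob) != NULL) {
            tmp[0] = '\\';                   /* escape ERE metacharacter */
            tmp[1] = *glob;
            tmp[2] = '\0';
            rep = tmp;
        } else {
            tmp[0] = *glob;                  /* ordinary character */
            tmp[1] = '\0';
            rep = tmp;
        }

        n = strlen(rep);
        if (o + n + 1 > outsz)               /* keep room for the '\0' */
            return -1;
        memcpy(out + o, rep, n);
        o += n;
    }
    out[o] = '\0';
    return 0;
}
```

The result is unanchored, so regexec() will find it anywhere in a line. Wrapping it in ^ and $ would give fnmatch-style whole-string matching instead.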

Still not getting a match when I try applying this to C though:

#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <time.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <regex.h>

int main() {

    char newhold[40];

    regex_t re;
    time_t current = time(NULL);
    char day[10];
    char mon[10];
    int retval = 0;

    strftime(day, sizeof(day), "%d", localtime(&current));
    strncat( newhold, day, 10 );
    strncat( newhold, ".", 2 );
    strftime(mon, sizeof(mon), "%b", localtime(&current));
    strncat( newhold, mon, 10 );
    strncat( newhold, ".*", 4 );
    strncat( newhold, "pattern", 10 );

    if(regcomp(&re , newhold, REG_NOMATCH) != 0 ){
        return 1;
    }

    FILE *fp;
    fp = fopen("/var/log/apache2/other_vhosts_access.log", "r");
    char line[100];
    unsigned int i = 0;

    while(fgets(line, sizeof(line), fp)) {
        if ((retval = regexec(&re, line, 0, NULL, 0)) == 0){
            i++;
        }
    }
    printf("%s %d\n", newhold, i );
    fclose(fp);
    return 0;
}

I also tried working with regerror(), but had trouble applying the few examples I was able to find.

You're very close... In the line:

     if(regcomp(&re , newhold, REG_NOMATCH) != 0 ){

REG_NOMATCH is defined to be a return code for regexec() indicating that it did not find a match; it is not defined to be a flag to be passed to regcomp(). If you change the above line in your code to:

     if(regcomp(&re , newhold, 0) != 0 ){

and rebuild your code, running it produces the output:

03.Jul.*pattern 1

I would, however, suggest changing:

        strftime(day, sizeof(day), "%d", localtime(&current));
        strncat( newhold, day, 10 );
        strncat( newhold, ".", 2 );
        strftime(mon, sizeof(mon), "%b", localtime(&current));
        strncat( newhold, mon, 10 );
        strncat( newhold, ".*", 4 );
        strncat( newhold, "pattern", 10 );

to:

	strftime(newhold, sizeof(newhold), "%d/%b.*pattern",
	    localtime(&current));

which gets rid of several chances to overflow the size of newhold[] and chances to unnecessarily truncate text being added in intermediate strncat() calls. If you do this and rebuild your code again, it will produce the output:

03/Jul.*pattern 1

And, then, of course, you can also get rid of the day[] and mon[] arrays.
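Pulling those suggestions together, here is a hedged sketch rather than a drop-in replacement. The helper names build_date_pattern and count_file_matches and the 1024-byte line buffer are assumptions made for illustration. It builds the pattern with a single strftime() call, checks that the result fit, and checks fopen() and regcomp() before using them.

```c
#include <regex.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: writes "DD/Mon.*pattern" for the given date into
 * buf. Returns 0 on success, -1 if buf is too small (strftime() returns
 * 0 when the result does not fit). */
int build_date_pattern(char *buf, size_t bufsz, const struct tm *tm)
{
    return strftime(buf, bufsz, "%d/%b.*pattern", tm) == 0 ? -1 : 0;
}

/* Hypothetical helper: counts the lines in the file at path that match
 * the regular expression. Returns -1 if the file cannot be opened or
 * the pattern does not compile. Lines longer than the buffer are read
 * (and tested) in pieces, a simplification for this sketch. */
long count_file_matches(const char *path, const char *pattern)
{
    regex_t re;
    char line[1024];
    long count = 0;
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        return -1;
    if (regcomp(&re, pattern, REG_NOSUB) != 0) {
        fclose(fp);
        return -1;
    }
    while (fgets(line, sizeof(line), fp) != NULL) {
        line[strcspn(line, "\n")] = '\0';   /* drop the trailing newline */
        if (regexec(&re, line, 0, NULL, 0) == 0)
            count++;
    }
    regfree(&re);
    fclose(fp);
    return count;
}
```

A caller would pass localtime(&current) to build_date_pattern() and then hand the resulting string, together with the log file path, to count_file_matches().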

Thank you Don Cragun! That does work correctly and is a lot cleaner!

I'm still interested in statically pre-compiling the regex to a text file that jim mcnamara mentioned. If anyone has any examples of that let me know.

The man page for the regcomp utility (try man 1 regcomp) on your system should tell you everything you need to know. If you get back something similar to:

No entry for regcomp in section 1 of the manual

(which is what I get on OS X) then your system does not provide any way to do this. You could try building your own regcomp utility, but there is no guarantee that a compiled RE is included in its entirety in the regex_t that is returned by the regcomp() function (and, even if it is, there is no guarantee that that structure can be loaded at a different location in your address space, or copied into the address space of a different process, and still work).

Maybe Jim can give us more details about the system he is using that provides the regcomp utility.
