Building an argc/argv style structure from a string (char*)

Hello All,

First post. I've been struggling with the following:
Given a char* string, I need to construct an "int argc, char *argv[]" style structure. What I'm struggling with most is handling escaped-whitespace and quotes.

e.g. the string:
char *s = "hello world 'my name is simon' foo\\ bar";
should produce an argc/argv of:
argc = 4
argv[0] = "hello"
argv[1] = "world"
argv[2] = "my name is simon"
argv[3] = "foo bar"

Make sense?

Anyway, after struggling for a while, I thought I would look at the source for bash/csh to see how they do it, but their setup is way more complicated (since they handle their own shell languages in those strings).

Anyone have any clear/simple example of how to do this, or know of a place/application where I could find one?

Many Thanks

You need to parse your line of text into fields.

strtok() allows the use of multiple field delimiters. It has drawbacks, but you can call it on a temporary string several times to get the field breakout you need.

You can also use lex/yacc to parse fields. An intermediate approach is to use the regex engine. See man regcmp and/or man regex.

Now - the obvious question - why can you not simply use argc, argv? i.e., write a child process that is "called" with your string:

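/* mychild.c */
#include <stdio.h>
int main(int argc, char **argv)
{
      int i;
      /* print each argument exactly as the shell delivered it */
      for (i = 0; i < argc; i++)
      {
           /* i is incremented in the loop header: reading i and using
              argv[i++] in the same call is undefined behavior */
           printf("argv[%d]=%s\n", i, argv[i]);
      }
      printf("argc=%d\n", argc);
      return 0;
}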

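Call your child code with popen:

#include <stdio.h>

void foo(char *stringtoparse)
{
char tmp[256]={0x0};
char cmdstr[256]={0x0};
FILE *cmd=NULL;

/* snprintf cannot overflow cmdstr; an over-long string is truncated */
snprintf(cmdstr, sizeof(cmdstr), "./mychild %s", stringtoparse);
cmd=popen(cmdstr, "r");
if (cmd==NULL)
{
   perror("popen");
   return;
}
/* echo every line the child prints */
while(fgets(tmp, sizeof(tmp), cmd)!=NULL)
   printf("%s", tmp);
pclose(cmd);
}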

Hi Jim,

Thanks for your reply.
I had considered using a separate program popen'd from the parent, but decided against it (at least for now) since it would require installing two programs instead of one ... although it is an easy solution.

I've been trying to figure out a way to popen the binary that's running (or accomplish something similar) so that I could handle all of this in one program. Do you, or anyone else, happen to know if there is a reliable way to get the absolute pathname of the currently running binary? If I could get that, this might work.

Many Thanks again,
-Ryan

Are you looking at using /proc?

.... you "own" the binary in question. This whole thing is very confusing, as currently explained. As a rule of thumb, this usually happens when somebody decides how to solve a problem and is missing something important. Not that what you seem to ask is impossible. Just messy.

Please explain -
What EXACTLY are you trying to do -- not how you think it should be done.
An example answer might be - I'm trying to get the command line arguments for a process I do not own, from binary code I cannot change.

Hi Jim,

Ok, I'll start from scratch with what I'm doing and perhaps there's an easier way you might see.

I'm writing an application that uses vi-like keybindings and has a command-mode similar to vi. Specifically, in command mode one can do a "write" just like in vi
:w
-or- to save the current playlist (this is a music application), one can do
:w filename

If they specify a filename such as
:w foo\ bar
or,
:w "foo bar"

In these cases, I'd obviously like to be able to parse the parameters correctly (i.e. recognize that "foo\ bar" is one string, not two).

There are a few other commands I have (and I'm currently working on a few more) that take multiple parameters, and I'd like to be able to handle spaces and quoting correctly for them.

About my current setup:
All of the command functions take two params, "int argc" and "char *argv[]", just like a regular "main" function. I then have an array of strings and function pointers (to these functions) that behaves essentially like a path.
When a user is in command mode and enters a string, I parse that string into a (bad) argc/argv structure, and then search the path for a matching named record, and if found, execute the function with the argc/argv that I built.

Does this help make clear at least my setup and what I'm asking about? I can point you to code if you would like.

Thanks again,
-Ryan

I got it. You really do need a parser. But the shell has one - modern shells that is.

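#include <stdio.h>
#include <string.h>
/* using the default shell parser - no extra code required */
/* each argv[i] must already point to a caller-allocated buffer of at least 256 bytes */
void make_args(char **argv, int *argc, char *string)
{
		char tmp[256]={0x0};
		FILE *cmd=NULL;
		int i=0;
		char *p=NULL;

		*argc=0;
		/* "$i" is quoted so the shell does not re-split or glob each argument */
		snprintf(tmp, sizeof(tmp),
		        "set - %s && for i in %c$@%c;\n do\n echo %c$i%c\ndone",
		        string, '"', '"', '"', '"');
		cmd=popen(tmp, "r");
		if (cmd==NULL) return;
		/* the shell prints one parsed argument per line */
		while (fgets(tmp, sizeof(tmp), cmd)!=NULL)
		{
		    p=strchr(tmp, '\n');
		    if (p!=NULL) *p=0x0;
		    strcpy(argv[i++], tmp);
		}
		pclose(cmd);
		*argc=i;
}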

Jim (or others interested),

I've been working on a parser, and believe I may have it. Though if you (or others) have a suggestion for an easier or more obvious solution, I'd love a good smack of the clue-stick!

If you're interested, and wouldn't mind, I'd love comments.

The code below is a bit rough, but it includes a driver program that prompts the user to enter a string and then runs it through my parser. Afterwards, it outputs my argc/argv structure.

(right now it just builds a global argc/argv structure...once I'm convinced this proof-of-concept works, it will obviously be updated)

Many Thanks again Jim for your comments,
-Ryan

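#include <err.h>      /* err(), errx() (BSD/glibc extension) */
#include <stdlib.h>
#include <stdio.h>
#include <stdbool.h>
#include <string.h>   /* strlen() */
#include <strings.h>  /* bzero() */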

/* for debugging */
#define STATUS(format, args...) \
   do { printf("here: %d. ", __LINE__); printf(format, ## args); printf("\n"); fflush(stdout); } while (0)


/* currently building the argc/argv stuff in a global context */
#define ARGV_MAX  255
#define ARGV_TOKEN_MAX  255
int    _argc;
char  *_argv[ARGV_MAX];
char  *_argv_token;

/* initialize empty argc/argv struct */
void
argv_init()
{
   _argc = 0;
   if ((_argv_token = calloc(ARGV_TOKEN_MAX, sizeof(char))) == NULL)
      err(1, "argv_init: failed to calloc");
   bzero(_argv_token, ARGV_TOKEN_MAX * sizeof(char));
}

/* add a character to the current token */
void
argv_token_addch(int c)
{
   int n;

   n = strlen(_argv_token);
   if (n == ARGV_TOKEN_MAX - 1)
      errx(1, "argv_token_addch: reached max token length (%d)", ARGV_TOKEN_MAX);

   _argv_token[n] = c;
}

/* finish the current token: copy it into _argv and setup next token */
void
argv_token_finish()
{
   if (_argc == ARGV_MAX)
      errx(1, "argv_token_finish: reached max argv length (%d)", ARGV_MAX);

/*STATUS("finishing token: '%s'\n", _argv_token);*/
   _argv[_argc++] = _argv_token;
   if ((_argv_token = calloc(ARGV_TOKEN_MAX, sizeof(char))) == NULL)
      err(1, "argv_token_finish: failed to calloc");
   bzero(_argv_token, ARGV_TOKEN_MAX * sizeof(char));
}

/* main parser */
void
str2argv(char *s)
{
   bool in_token;
   bool in_container;
   bool escaped;
   char container_start;
   char c;
   int  len;
   int  i;

   container_start = 0;
   in_token = false;
   in_container = false;
   escaped = false;

   len = strlen(s);

   argv_init();
   for (i = 0; i < len; i++) {
      c = s[i];

      switch (c) {
         /* handle whitespace */
         case ' ':
         case '\t':
         case '\n':
            if (!in_token)
               continue;

            if (in_container) {
               argv_token_addch(c);
               continue;
            }

            if (escaped) {
               escaped = false;
               argv_token_addch(c);
               continue;
            }

            /* if reached here, we're at end of token */
            in_token = false;
            argv_token_finish();
            break;

         /* handle quotes */
         case '\'':
         case '\"':

            if (escaped) {
               argv_token_addch(c);
               escaped = false;
               continue;
            }

            if (!in_token) {
               in_token = true;
               in_container = true;
               container_start = c;
               continue;
            }

            if (in_container) {
               if (c == container_start) {
                  in_container = false;
                  in_token = false;
                  argv_token_finish();
                  continue;
               } else {
                  argv_token_addch(c);
                  continue;
               }
            }

            /* XXX in this case, we:
             *    1. have a quote
             *    2. are in a token
             *    3. and not in a container
             * e.g.
             *    hell"o
             *
             * what's done here appears shell-dependent,
             * but overall, it's an error.... i *think*
             */
            printf("Parse Error! Bad quotes\n");
            break;

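         case '\\':

            if (in_container && s[i+1] != container_start) {
               argv_token_addch(c);
               continue;
            }

            if (escaped) {
               /* an escaped backslash is a literal backslash */
               escaped = false;
               argv_token_addch(c);
               continue;
            }

            escaped = true;
            break;

         default:
            if (!in_token) {
               in_token = true;
            }

            /* any other character consumes a pending escape */
            escaped = false;
            argv_token_addch(c);
      }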
   }

   if (in_container)
      printf("Parse Error! Still in container\n");
   else if (in_token)
      argv_token_finish();   /* flush a final token that had no trailing whitespace */

   if (escaped)
      printf("Parse Error! Unused escape (\\)\n");
}

/* simple driver */
int
main(int argc, char *argv[])
{
   char  s[255];
   int   i;

   while (fgets(s, sizeof(s), stdin) != NULL) {

      printf("parsing...\n");
      fflush(stdout);

      str2argv(s);

      for (i = 0; i < _argc; i++)
         printf("\t_argv[%d] = '%s'\n", i, _argv[i]);
      fflush(stdout);
   }

   return 0;
}

Hi Jim,
I actually just posted a response while you were posting that.
It's awaiting moderation because, I suppose, there is a link in it (to the source I was currently working on).

Anyway, thanks! I was looking for something like that in bash and csh source, but failed to find it! Just out of curiosity, where did you find that?

Comparing it to my parser, it appears I have re-invented a flat wheel.

-Ryan

Whoops, I responded before looking through the code entirely.

Nice, you're using popen+set! This is very, very nice.

Thanks again,
-Ryan

If you want to, you can use a specific shell rather than the system default shell. For example, here is how to use ksh93 via libshell's sh_init()/sh_trap() mechanism.

#include <shell.h>
#include <stdio.h>
#include <nval.h>

int 
call_shell(int argc, char* argv[], char *str)
{
    Namval_t *np, *np_sub;
    char tmp[512];

    /* snprintf cannot overflow tmp; an over-long str is truncated */
    snprintf(tmp, sizeof(tmp), "set - %s && for i in \"$@\";\n do\n aname+=(\"$i\")\ndone", str);

    Shell_t *shp = sh_init(argc, argv, 0);
    sh_trap(tmp, 0);

    np = nv_open("aname", shp->var_tree, 0);
    nv_putsub(np, NULL, ARRAY_SCAN);
    np_sub = np;

    do {
        // copy out the arguments to wherever here. 
        fprintf(stderr, "%p: subscript='%s' value='%s'\n", (void *)np_sub, nv_getsub(np_sub), nv_getval(np_sub));
    } while (np_sub && nv_nextsub(np_sub));

    nv_close(np);

    return(0);
}

int 
main(int argc, char *argv[])
{
    char string[] = "hello world 'my name is simon' foo\\ bar";

    call_shell(argc, argv, string);
}

When compiled and run, the output is

... value='hello'
... value='world'
... value='my name is simon'
... value='foo bar'

I like fpMurphy's approach better, but shells support

set -  parm1 parm2 parm3

as a way to redefine the parameters for the current invocation. This just uses the shell's built-in parser. You can change the behavior of the parser by defining the IFS variable.

Thanks Jim. However, it is not for the fainthearted. You have to know the internal structures of whichever shell you choose to use.