Reg Ex question

garric · December 2, 2008, 4:08pm

Hi All,

Suppose I have a string that combines plain text and quoted text, for example:

String: This "sentence is" a combination of "multiple words"

How can I write a regex that splits the above string into the following?

result[0] = This
result[1] = sentence is
result[2] = a
result[3] = combination
result[4] = of
result[5] = multiple words

Any help is welcome, thanks.

Regards,
garric

Ikon · December 2, 2008, 4:21pm

I don't see why you would want to do that for that sentence.

Is this a homework question?

garric · December 2, 2008, 4:24pm

No. That was just an example. Basically, I want to split a string on /\s+/ but do not want to split strings within quotes.

I guess it's too tough for homework, or at least I feel so. Anyway, I'm too old for homework.

vgersh99 · December 2, 2008, 4:42pm

echo 'This "sentence is" a combination of "multiple words"' | nawk -f garric.awk

garric.awk:

# setcsv(str, sep) - parse CSV (MS specification) input
# str, the string to be parsed. (Most likely $0.)
# sep, the separator between the values.
#
# After a call to setcsv the parsed fields are found in $1 to $NF.
# setcsv returns 1 on success and 0 on failure.
#
# By Peter Strömberg aka PEZ.
# Based on setcsv by Adrian Davis. Modified to handle a separator
# of choice and embedded newlines. The basic approach is to take the
# burden off of the regular expression matching by replacing ambiguous
# characters with characters unlikely to be found in the input. For
# this the character "\035" is used.
#
# Note 1. Prior to calling setcsv you must set FS to a character which
#         can never be found in the input. (Consider SUBSEP.)
# Note 2. If setcsv can't find the closing double quote for the string
#         in str it will consume the next line of input by calling
#         getline and call itself until it finds the closing double
#         quote or no more input is available (considered a failure).
# Note 3. Only the "" representation of a literal quote is supported.
# Note 4. setcsv will probably misbehave if sep used as a regular
#         expression can match anything else than a call to index()
#         would match.
BEGIN { FS=SUBSEP; OFS="|" }

{
  result = setcsv($0, " ")
  for(i=1;i<=NF;i++)
    printf("result[%d] = %s\n", i-1, $i)
  #print
}

function setcsv(str, sep, i) {
  gsub(/""/, "\035", str)
  gsub(sep, FS, str)

  while (match(str, /"[^"]*"/)) {
    middle = substr(str, RSTART+1, RLENGTH-2)
    gsub(FS, sep, middle)
    str = sprintf("%.*s%s%s", RSTART-1, str, middle,
      substr(str, RSTART+RLENGTH))
  }

  if (index(str, "\"")) {
    return ((getline) > 0) ? setcsv(str (RT != "" ? RT : RS) $0, sep) : !setcsv(str "\"", sep)
  } else {
    gsub(/\035/, "\"", str)
    $0 = str

    for (i = 1; i <= NF; i++)
      if (match($i, /^"+$/))
        $i = substr($i, 2)

    $1 = $1 ""
    return 1
  }
}

garric · December 2, 2008, 4:59pm

Thanks, but I was looking for a simpler regex, in Perl or Java.
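
For the Java side of that request, one possible sketch is to match tokens instead of splitting: each match is either a double-quoted run or a run of non-whitespace. The class name QuotedSplit and the method tokenize are made up for illustration, and the sketch assumes quotes are balanced and never escaped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedSplit {
    // Either a double-quoted run (group 1 holds the text between the quotes)
    // or a run of non-whitespace characters (group 2).
    private static final Pattern TOKEN = Pattern.compile("\"([^\"]*)\"|(\\S+)");

    // Hypothetical helper: returns the tokens in order, with quotes stripped.
    public static List<String> tokenize(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // group(1) is non-null only when the quoted alternative matched.
            result.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "This \"sentence is\" a combination of \"multiple words\"";
        List<String> result = tokenize(s);
        for (int i = 0; i < result.size(); i++) {
            System.out.println("result[" + i + "] = " + result.get(i));
        }
    }
}
```

Run on the example string, this prints the six results listed in the original question. Putting the quoted alternative first matters: otherwise \S+ would swallow the opening quote together with the first word.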

Ikon · December 2, 2008, 5:11pm

This is about the simplest you can get, I believe:

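my $string = 'This "sentence is" a combination of "multiple words"';

my %items;

# Each match is either a quoted run (quotes included) or a run of
# non-whitespace characters.
while ($string =~ /(".*?"|\S+)/g) {
    my $token = $1;  # copy first: the match below can clobber $1
    push @{ $items{ $token =~ /"/ ? 'quoted' : 'unquoted' } }, $token;
}

print 'Quoted: ',
 join(', ', @{$items{'quoted'}}),
  "\n";

print 'Unquoted: ',
 join(', ', @{$items{'unquoted'}}),
  "\n";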

Ikon · December 2, 2008, 5:18pm

New one:

#!/usr/bin/perl

$string = 'This "sentence is" a combination of "multiple words"';

@list = split/\s+/,$string;
foreach my $w (@list) {
   if ($w =~ /^"([^"]+)"$/) { # starts and ends with double-quotes
      print "$1\n";
   }
   elsif ($w =~ /^"([^"]+)$/) {  # only starts with a double quote
      print "$1 ";
   }
   elsif ($w =~ /^([^"]+)"$/) { # only ends with a double-quote
      print "$1\n";
   }
   else { # no quotes at all (fall-through condition)
      print "$w\n";
   }
}

Ikon · December 2, 2008, 5:24pm

The previous one only works if there are at most two words inside each pair of quotes.

This one should work with any number of words:

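$str = 'This "sentence is" a combination of "multiple words"';

while ($str =~ /"([^"]+)"/g) { push @quoted, $1; }  # fetch quoted phrases
foreach (@quoted) {
    $word = $new = $_;
    $new =~ s/\s/+/g;             # protect inner whitespace with '+'
    # \Q...\E matches the phrase literally; matching the quotes too drops them
    $str =~ s/"\Q$word\E"/$new/;
}
@str1 = split(/\s+/, $str);
# restore the protected spaces (note: a literal '+' in the input is also turned into a space)
s/\+/ /g foreach (@str1);

print "$_\n" foreach (@str1);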
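
If a literal split is wanted, as in the original question, one option is a separator pattern with a lookahead: whitespace is a split point only when an even number of double quotes follows it. The Java sketch below illustrates that idea; QuoteAwareSplit and its split method are names invented here, and it assumes balanced, unescaped quotes. Because the lookahead rescans the rest of the string at every whitespace run, it is best suited to short inputs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class QuoteAwareSplit {
    // Whitespace is a separator only when an even number of double quotes
    // lies between it and the end of the input, i.e. it is outside quotes.
    private static final Pattern SEPARATOR =
            Pattern.compile("\\s+(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");

    // Hypothetical helper: splits on unquoted whitespace, then strips quotes.
    public static List<String> split(String input) {
        List<String> result = new ArrayList<>();
        String trimmed = input.trim();
        if (trimmed.isEmpty()) {
            return result;  // avoid the single empty token split() would return
        }
        for (String token : SEPARATOR.split(trimmed)) {
            // Drop one pair of surrounding quotes, if present.
            if (token.length() >= 2 && token.startsWith("\"") && token.endsWith("\"")) {
                token = token.substring(1, token.length() - 1);
            }
            result.add(token);
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "This \"sentence is\" a combination of \"multiple words\"";
        for (String token : split(s)) {
            System.out.println(token);
        }
    }
}
```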