How to match fields surrounded by double quotes with commas?

Hello to all,

I'm trying to match only fields surrounded by double quotes that have one or more commas inside.

The text is like this

"one, t2o",334,"tst,982-0",881,"kmk 9-l","kkd, 115-001, jj-3",5

The matches should be

"one, t2o"
"tst,982-0"
"kkd, 115-001, jj-3"

I'm trying the regex below, but it matches all fields surrounded by double quotes, even those that don't have any comma inside.

".+?"

Thanks in advance for any help

If your tool allows PCRE:

"[\w]+,[-, \w]*"

Otherwise

"[a-z0-9]+,[-, a-z0-9]*"

Maybe even this if the content varies from what's shown:

"[-, a-z0-9]+,[-, a-z0-9]+"

Essentially, I am trying to avoid picking up ",334," and ",881,", which also sit between a pair of double quotes. :wink:

Assuming your text is in a file named file, try:

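awk -F '"' '
# With " as the field separator, even-numbered fields are the quoted strings.
{	for(i = 2; i <= NF; i += 2)
		if(index($i, ","))	# keep only fields that contain a comma
			printf("\"%s\"\n", $i)
}' file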

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk.
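
For readers who would rather prototype the same idea outside awk, here is a minimal Python sketch of the split-on-quotes approach. The function name quoted_fields_with_commas is made up for illustration, and the sketch assumes, as the awk script does, that double quotes always come in pairs and are never escaped.

```python
def quoted_fields_with_commas(line):
    """Return the double-quoted fields of `line` that contain a comma."""
    parts = line.split('"')
    # After splitting on the quote character, parts[1], parts[3], ...
    # hold the text that sat between each pair of quotes.
    return ['"%s"' % p for p in parts[1::2] if "," in p]


line = '"one, t2o",334,"tst,982-0",881,"kmk 9-l","kkd, 115-001, jj-3",5'
for field in quoted_fields_with_commas(line):
    print(field)
```

Like the awk version, this never looks at the text between fields, so ",334," and ",881," cannot be picked up by accident.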

Hello Aia and Don,
I actually only need the regex, and I'm trying one of yours:

"[-, a-z0-9]+,[- .a-z0-9]+"

It works for the text I posted, but if I change the "-" in "jj-3" to "%", the match fails.
I mean changing from this:

"one, t2o",334,"tst,982-0",881,"kmk 9-l","kkd,115-001,jj-3",5

to this

"one, t2o",334,"tst,982-0",881,"kmk 9","kkd,115-001,jj%3",5,"kkd,"

The matches should be the quoted fields that contain at least one comma.

I was trying to change your regex to match any field that has anything (letters, numbers, spaces and any other symbols) and at least one comma, doing something like this:

"[. \w]+,[. \w]+"

But it is not working, even though the dot is supposed to match almost any character.

Thanks again for any help.
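
A note on the dot: inside a bracket expression such as [. \w] or [.], the dot is an ordinary character and matches only a literal ".", which is why the attempt above does not behave like "any symbol". The short sketch below shows the difference using Python's re module as a stand-in (an assumption; it is not the tool used in this thread), with a made-up sample string.

```python
import re

sample = "jj%3.x"  # hypothetical sample text

# Inside brackets the dot is an ordinary character, so only "." matches.
dot_in_class = re.findall(r"[.]+", sample)

# Outside brackets the dot matches any character except a newline.
bare_dot = re.findall(r".+", sample)

# A negated class is the usual way to say "anything except these characters".
not_quote_or_comma = re.findall(r'[^",]+', "kkd,115-001,jj%3")

print(dot_in_class)
print(bare_dot)
print(not_quote_or_comma)
```

A negated class like [^",] accepts "%", "-", and every other symbol without listing them one by one.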

"[\w]+,[-, \w%]*"
"[-, a-z0-9]+,[-, a-z0-9%]*"

Any of those?

I'm very confused.

There are lots of different kinds of regular expressions. What utility is going to process the RE that you want? Show us the code you're using where this RE will be plugged in!

What operating system and shell are you using?

Does the input you want to process always start with a quoted string?

Can we safely assume that double-quotes always appear in pairs (i.e., that your input will never contain an escaped double-quote in a double-quoted string)? If not, what is the escape mechanism?

Hello Don/Aia,

Actually I want to use the Regex in a VBA application under Windows, but I'm testing the Regex in an online regex tester (Regex Tester).

The thing is that it is a CSV file with a comma as the field delimiter. Since there could be commas inside fields, double quotes are used as the text qualifier to avoid confusion between commas that separate fields and commas that are inside fields. So each field could contain any text (alphanumerics, spaces, symbols, etc.), and I want to match only those that have at least one comma. If that is too complicated, then I'd like to match only the fields that are surrounded by double quotes.

Each line of the CSV could look like:
any text,"any text","any, text","any,text",,,2,45,any text

Where any text could be a combination of alphanumerics, spaces, symbols, etc. Because of that, I was trying to match any character with [.]+,[.]+ but it doesn't work.

Double quotes always appear in pairs since they surround the fields (some of them), and there is no escape character.

Hope that makes sense.

Regards

So why ask for help in a UNIX/Linux forum?

Anyways, as Don already pointed out, parsing based on double quotes is what I'd do. But if you just want a straight regex solution, the following needle may meet your needs:

(^|,)\K"[^",]*,.*?"

Hello jethrow, thanks for the answer.

Your regex is almost working; it only picks up one extra comma before the match.

Try your regex (without the \K) at regexpal.com with this text and you'll see.

"one, t2o",334,"tst,982-0",881,"kmk 9","kkd,115-001,jj%3",5,"kkd,"

I know there are many people here with a lot of knowledge about this matter :).

Thanks again

From what I could find in "Microsoft Beefs Up VBScript with Regular Expressions", it looks like VBA REs are UNIX EREs with a few extensions. Try (totally untested):

"[^",]+,[^"]*"

Hello Don,

That regex works just perfect!

Many thanks for the help.
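
For anyone who wants to check Don's pattern without a VBA environment, here is a minimal sketch using Python's re module as a stand-in. That substitution is an assumption: the pattern only uses bracket expressions and quantifiers, which should behave the same way in the VB RegExp object. The two sample lines are the ones posted earlier in this thread.

```python
import re

# Don's pattern from the post above, unchanged.
pattern = re.compile(r'"[^",]+,[^"]*"')

line1 = '"one, t2o",334,"tst,982-0",881,"kmk 9","kkd,115-001,jj%3",5,"kkd,"'
line2 = 'any text,"any text","any, text","any,text",,,2,45,any text'

matches1 = pattern.findall(line1)
matches2 = pattern.findall(line2)

for m in matches1 + matches2:
    print(m)
```

Because [^",]+ must consume at least one character right after the opening quote, a match cannot start at a closing quote that is followed by a field-separating comma, so the extra-comma problem does not appear.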

It's worth noting that Don's example won't match if a comma is the first character inside the quotes. Since \K doesn't seem to work with the VB Regexp object, you could use a look-ahead assertion to verify that the end quote is followed by a comma or the end of the line:

"[^",]*,[^"]*"(?=,|$)

I did test this with the VB Regexp object.
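
The difference between the two patterns can be seen in a small sketch. It again uses Python's re module as a stand-in for the VB Regexp object (an assumption), and the sample strings and the helper name quoted_with_comma are made up for illustration. One case that may be worth checking in your own data is a comma-free quoted field immediately followed by a field that starts with a comma: in Python, the look-ahead pattern then matches the text between the two fields. Matching every quoted field first and filtering afterwards is one possible way around that.

```python
import re

dons = re.compile(r'"[^",]+,[^"]*"')
lookahead = re.compile(r'"[^",]*,[^"]*"(?=,|$)')

leading = '5,",lead",7'   # quoted field that starts with a comma
tricky = '"a",",b"'       # comma-free field followed by such a field


def quoted_with_comma(line):
    """Match every quoted field first, then keep those containing a comma."""
    return [f for f in re.findall(r'"[^"]*"', line) if "," in f]


print(dons.findall(leading))       # Don's pattern on the leading-comma field
print(lookahead.findall(leading))  # look-ahead pattern on the same text
print(lookahead.findall(tricky))   # look-ahead pattern on the tricky line
print(quoted_with_comma(tricky))   # two-step alternative on the tricky line
```

The two-step version relies on the guarantee given earlier in the thread that quotes always come in pairs and are never escaped.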