Regular expression for the split function in Perl

jghoshal · February 3, 2011, 12:49am

Hi,

Below is an example of a record that I wish to split using Perl's split function and load into an array. I am having a tough time figuring out the exact regex to perform the split.

Given record:

"a","xyz",0,2,48,"abcd","lmno,pqrR, stv",300,"abc",20,

The delimiter that separates the fields is "," (comma). Fields in quotation marks are strings, and fields without quotation marks are integers.

The problem with this record is that a quoted string can itself contain commas (for example "lmno,pqrR, stv"). Those commas should not be treated as field delimiters, because they lie within the quotation marks.

Therefore, I would like to build a regex within the split function that follows either of the algorithms below:

Algo A:
split (/\,|dont split if there is [:alpha:]\,[:alpha:]|dont split if there is [:alpha:]\,|dont split if there is ,[:alpha:]/, $givenrec)

--------------OR---------------

Algo B:
treat text as a field if it both starts and ends with " (for strings) OR contains no quotation marks at all (for integers)

I really appreciate the help in advance.

Please let me know if you require any further explanations.

The result should be as follows, retaining the quotation marks on the fields that are strings:

"a"
"xyz"
0
2
48
"abcd"
"lmno,pqrR, stv"
300
"abc"
20

birei · February 3, 2011, 2:20am

Hi,

Try the following script:

$ perl -MText::ParseWords -e '@fields = quotewords( ",", 1, q/"a","xyz",0,2,48,"abcd","lmno,pqrR, stv",300,"abc",20,/); print "$_\n" foreach( @fields );'
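For processing a whole file, the same idea could be written as a short script along these lines. This is only a minimal sketch: the helper name split_record is made up for illustration, and stripping the trailing comma before parsing is an assumption about the desired output (without it, the parser may return an extra undefined final element). Note also that Text::ParseWords treats single quotes and backslashes specially, which may matter for other data.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

# Hypothetical helper: split one record on commas, honoring double quotes.
sub split_record {
    my ($line) = @_;
    chomp $line;
    $line =~ s/,\s*\z//;    # assumption: drop the record's trailing comma
    # Second argument 1 = keep the quotation marks around string fields.
    return parse_line(',', 1, $line);
}

my $rec = q/"a","xyz",0,2,48,"abcd","lmno,pqrR, stv",300,"abc",20,/;
print "$_\n" for split_record($rec);
```

To run it over a file, the last two lines would be replaced by a loop such as `while (my $line = <>) { my @fields = split_record($line); ... }`.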

Regards,
Birei

jghoshal · February 3, 2011, 11:36pm

This is great. Thank you very much.

One problem with this is that the computation time is longer. For example, parsing a 1 GB file with 1.4M records takes 10 minutes, versus 2 minutes with the "broken" split function.

Any suggestions for reducing the computation time would be really appreciated.

Thanks once again.

---------- Post updated 02-04-11 at 12:36 AM ---------- Previous update was 02-03-11 at 12:33 PM ----------

split (/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/, $str)

where $str is the given record. On the file described above, the computation time appears to be about 4 minutes.
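For readers following along, the pattern splits on a comma only when an even number of quotation marks remains between that comma and the end of the record, which means the comma lies outside any quoted string. A commented form of the same pattern, written with the /x modifier, might read as below; the variable names are illustrative only.

```perl
use strict;
use warnings;

# Same pattern as above, spelled out with the x modifier.
my $comma_outside_quotes = qr/
    ,                            # a comma
    (?=                          # that is followed by
        (?: [^"]* " [^"]* " )*   # zero or more complete pairs of quotes
        (?! [^"]* " )            # and then no further quote up to the end
    )
/x;

my $str = q/"a","xyz",0,2,48,"abcd","lmno,pqrR, stv",300,"abc",20,/;

# split drops the empty field produced by the trailing comma.
my @fields = split $comma_outside_quotes, $str;
print "$_\n" for @fields;
```

Because the lookahead rescans the rest of the record at every comma, the work per record grows faster than linearly with its length, which may account for part of the extra run time.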

Thanks.

durden_tyler · February 4, 2011, 12:33pm

$
$
$ cat f5
"a","xyz",0,2,48,"abcd","lmno,pqrR, stv",300,"abc",20,
$
$
$ perl -lne 'while(/,*(".*?")|(\d+)/g) {print $1||$2}' f5
"a"
"xyz"
0
2
48
"abcd"
"lmno,pqrR, stv"
300
"abc"
20
$
$
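The same match-the-fields idea can be written so that unquoted fields need not be purely numeric. The following is a sketch under stated assumptions rather than a full CSV parser: quoted strings contain no embedded or escaped quotation marks, and empty fields are skipped instead of being returned. The sub name fields_of is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the fields of one record by matching them,
# rather than by splitting on the delimiter.
sub fields_of {
    my ($line) = @_;
    chomp $line;
    # Either a complete quoted string (commas allowed inside),
    # or a run of characters that are neither commas nor quotes.
    return $line =~ /("[^"]*"|[^,"]+)/g;
}

my $rec = q/"a","xyz",0,2,48,"abcd","lmno,pqrR, stv",300,"abc",20,/;
print "$_\n" for fields_of($rec);
```

This scans each record once from left to right, with no lookahead. Whether it is actually faster on the 1 GB file would need to be measured.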

tyler_durden

rdcwayx · February 8, 2011, 6:45am

The shell can handle this easily. Why not call the system function and use the tr command directly?

tr "," "\n" < infile
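One caveat: replacing every comma, as tr does, also breaks up the quoted field that contains commas. A quick check illustrates this, using Perl's own tr operator as a stand-in for the shell command (an assumption made only to keep the example self-contained):

```perl
use strict;
use warnings;

my $rec = q/"a","xyz",0,2,48,"abcd","lmno,pqrR, stv",300,"abc",20,/;

# Perl's tr operator, standing in for: tr "," "\n" < infile
(my $out = $rec) =~ tr/,/\n/;

my @pieces = split /\n/, $out;
print scalar(@pieces), " pieces\n";    # 12 pieces, not the 10 wanted
print "$_\n" for @pieces;
```

The string "lmno,pqrR, stv" comes out as three separate lines, so this approach only suits records whose quoted strings never contain commas.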