Perl regex to remove a segment in a line

gary_w · August 20, 2012, 3:26pm

Hello, ksh on Sun5.8 here. I have a pipe-delimited, variable length record file with sub-segments identified with a tilda that we receive from a source outside of our control. The records are huge, and Perl seems to be the only shell that can handle the huge lines. I am new to Perl, and am trying to come up with a regex to find segments > 15 and remove them. Some of these segments have sub-segments that should be ignored. i.e. ~DRG segments can have multiple ~DCT segments, and are followed by other segments, some of which are optional..

Here's a sample BEFORE:

|~DRG|15|qwe|qwe|qwe|~DCT|efs|efs|243545|~DRG|16|qwe|qwe|qwe|~DCT|efs|efs|243545|~DRG|17|fgh|fgg|dfg|~DCT|fgg|fhh|`123|~MSP|etc|

And the desired AFTER:

|~DRG|15|qwe|qwe|qwe|~DCT|efs|efs|243545|~MSP|etc|

What I need to do is match ~DRG segments where the next field is > 15, up to the next non- ~DRG or non-~DCT segment. I believe I am getting caught up using negative search vs a read-ahead method, etc.

I have tried many ways, with this one being the closest:

 $str =~ s/\|~DRG\|(1[6-9]|2[0-9]).*?\((?!~DRG)|(?!~DCT)\)/\1/g;

But this is not going all the way to the next non- ~DRG or non-~DCT segment. In the output below, ~DRG 15 only has one ~DCT but the match is not going all the way to the ~MSP segment:

Output (bad as it shows a ~DCT from one of the ~DRG's > 15)
(Lines wrapped for readability)

|~DRG|15|03|599942600|DECYL METHYL SULFOXIDE|0.060|I|0|O|
DECYL METHYL SULFOXIDE POWDER|1|99
|MISCELLANEOUS|U6W|BULK CHEMICALS|960000
|PHARMACEUTICAL AIDS|O||MISCELL.|POWDER|89.8 %|0|N||||~DCT|STD|0.00|01|AWPA|AWPA|38.5000016G||0|N
||||~DCT|STD|0.00|09||AWPA|2.50000
|~MSP|1|93392900|~MSP|2|72900
|~MSP|3|7512900
|~MSP|4|964850|~MSP|5|96500
|~MSP|6|96802900
|~MSP|7|6610000|~MSP|8|967900|~MSP|9|9932900
|~MSP|10|9680002900|~MSP|11|9662900
|~MSP|12
|79403800|~MSP|13|964900|~MSP|14|96700
|~MSP|15|9640|~MSP|16|96200|~MSP|17|96200037

If you have a suggestion on the regex or if there is a better approach, I will be grateful!

Gary

alister · August 20, 2012, 3:48pm

Your first sample shows a single line record, but the second, larger sample appears to span multiple lines. Can a single record span multiple lines? If so, how is the end of record determined?

In the second sample you highlight a segment that according to your explanation should not be modified. The field after ~DRG is not greater than 15. Also, there do appear to be two ~DCT segments highlighted in blue, but your text mentions only one.

Edit: Hmm. Perhaps ther was a ~DRG|16 in that second sample that was deleted, and the second ~DCT belongs to it. Not sure. This is why it's good to show both the before and after with sample data.

In addition to answers for those questions, it would help if for each data sample you showed us the before and the (desired) after.

Regards,
Alister

gary_w · August 20, 2012, 4:02pm

My apologies for not being clear enough.

The first record is just a sample showing the DRG and DCT layout.

The second example output is split to multiple lines for readability. The actual records are separated by carriage returns and have TONS of columns so this is just a sample of the relevant part of the record. DRG 15 has only one DCT but it is showing another DCT from one of the records >15. My regex is not going all the way to the MSP record.

I will update my first example to show a before and desired after. The record is too huge to show the whole thing.

alister · August 20, 2012, 4:25pm

To confirm that your AWK can't handle these records, does the following fail to print the number of fields in each record?

awk -F\| '{print NF}' file

On Solaris, make sure to test with /usr/xpg4/bin/awk.

Also, do the records always begin and/or end with a pipe symbol? If yes, be specific, begin or end or both.

Regards,
Alister

gary_w · August 20, 2012, 4:42pm

This works:

/usr/xpg4/bin/awk -F\| '{print NF}' file

Some records have > 800 columns!

Records do not start with a pipe nor end with one; however the segments inside the record do start/end with a pipe.

alister · August 20, 2012, 5:08pm

This works with several variations I created from your sample data:

perl -lpe 's/\|~DRG\|(\d+).*?(?=\|~(?!DCT))/$1>15?"":$&/ge' file

Regards,
Alister

gary_w · August 20, 2012, 5:31pm

Sweet! It works on my test file. I would be grateful if you could give an explanation on the regex? I need to do some similar operations on other parts of the file and want to understand it.

alister · August 20, 2012, 5:51pm

It starts off very much like your original attempt.

\|~DRG\|(\d+) : Begins by looking for |~DRG|<some number> . The number is captured.

.*? : ... followed by a non-greedy wildcard, so it doesn't skip anything we don't want to skip ...

(?=\|~(?!DCT)) ... until a |~ is found that is not followed by DCT . This uses a negative look-ahead assertion nested within a positive look-ahead assertion, so that we don't consume this portion of the record, leaving it for the substitution's next iteration.

/$1>15 ? "" : $&/ge : The e flag indicates that the replacement text should be treated as a perl expression. We test if the number captured by the first parenthetical, $1, is greater than 15. If it is, everything that was matched is replaced with an empty string. If $1 is not greater than 15, $& replaces the text with itself.

Regards,
Alister

gary_w · August 20, 2012, 6:03pm

Actually I think I have it figured out. Please correct me if not.

-l = chomp each record (strips newlines)
-p = Automatically loop through each line and print it
-e = define code to be executed

s/\|~DRG\| = Search for the patterm "|~DRG|"
(\d+) = followed by a positive number (group it so we can refer to it as $1)
.*? = followed by any character, any number of characters stopping at the first match on the line of the second group:
(?=\|~(?!DCT)) = match where the pattern equals "|~" followed by a string that is NOT "DCT"
/$1>15?"":$& = replace that with the results of a test. If the first group (the positive nbr) is > 15, replace with an empty string. Else replace with the results of the last match (the segment itself).
/g = perform the replacements on every occurrence on the line
e = evaluate the replacement expression as if it is a Perl expression, rather than just a regular expression

Whew

---------- Post updated at 06:03 PM ---------- Previous update was at 06:02 PM ----------

You beat me to it! At least I was right in figuring it out.

Thanks a million!