Removing characters from end of line (length unknown)

dirtyd0ggy · January 5, 2012, 6:27am

Hi

I have a file that contains malformed XML. There are some garbage characters at the end of each line that I want to get rid of. Example:

<request type="product" ><attributes><pair><name>q</name><value><![CDATA[LOL]]></value></pair><pair><name>start</name><value>1</value></pair></attributes></request>J IiYY'z3uJ5}#Q/k;!9Q){_F
<request type="product"><attributes><pair><name>q</name><value><![CDATA[LOL2]]></value></pair><pair><name>start</name><value>1</value></pair></attributes></request>4/lITl'cO{;_?(>YmP

How can I remove the garbage characters after </request>? In other words, how do I remove the string between </request> and the next <request>?

Please note that everything from <request> to </request> is on a single line, so

awk '/<request t/ , /<\/request>/' test.txt

does not work.

My purpose is to extract the value when the name is "q" (LOL and LOL2 in this case). So if that can be done easily, I am not bothered about removing the junk characters.

Thank you for your time.

Skrynesaver · January 5, 2012, 7:00am

perl -e ' while(<>){print "$1\n" if (/name>q<\/name><value><(?:!\[CDATA\[)?([^\]]+)\]\]><\/value/);}' test.txt
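For the question as literally asked in the title (dropping whatever follows the closing tag), a sed substitution is one possible sketch. It assumes each record sits on a single line, as in the sample, and that the first </request> on the line is where the record ends. The function name strip_trailing_junk is made up for illustration.

```shell
# Hypothetical helper: delete everything after the first </request> on each line.
# LC_ALL=C makes sed treat the input as raw bytes, so junk that is not valid
# text in the current locale is still matched by ".*".
strip_trailing_junk() {
    LC_ALL=C sed 's|\(</request>\).*$|\1|' "$@"
}

# Example run on a shortened line in the shape of the sample data.
printf '%s\n' '<request type="product"><attributes></attributes></request>4/lITl{;_?(>YmP' |
    strip_trailing_junk
```

Lines that contain no </request> pass through unchanged, since the substitution simply does not match them.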

dirtyd0ggy · January 5, 2012, 7:12am

You Sir, are awesome. 1000 internets to you.

balajesuri · January 5, 2012, 7:23am

Sorry, my bad.. Didn't read the question completely. Deleted my erroneous solution.

dirtyd0ggy · January 6, 2012, 3:14am

Hi

Just trying to understand your solution, some questions:

1) Why did you use "?:" before !\[CDATA

2) What is the reason for putting "(?:!\[CDATA\[)" in parentheses i.e. "(" and ")"

3) What does "?" in the middle do?

4) What does ([^\]]+) do?

Sorry, I am still learning regular expressions. Someday I want to be as good as you. Please help.

I have made the characters in bold for your convenience.

Thank you.

if (/name>q<\/name><value><(?:!\[CDATA\[)?([^\]]+)\]\]><\/value/);

Skrynesaver · January 6, 2012, 4:22am

/name>q<\/name><value> # literal string
(?: # non-capturing parenthesis
<!\[CDATA\[)? # this block is optional (allows for cases where the data isn't CDATA escaped)
( # begin capture
[^\]<]+ # one or more characters that are neither ] nor < (the match is greedy, so it captures as many as possible)
) # end of capture
(?:\]\]>)? # what I should have said ;) to make the CDATA wrapper genuinely optional
<\/value # literal string
/x # allow comments in regexes so the maintainer doesn't hunt you down and kill you

The contents matched by the capturing parentheses are then available as $1.

dirtyd0ggy · January 6, 2012, 4:34am

Thanks for taking the time to explain things. Is $1 not (?:!\[CDATA\[), even though it is in parentheses, because it is followed by "?"?

Or in other words, what is the reason $1 is not set to (?:!\[CDATA\[) even though that is the first expression inside parenthesis?

Skrynesaver · January 6, 2012, 5:02am

As I mentioned above, that's the syntax for a non-capturing block: (?:$pattern)$modifier lets you group a pattern so that a modifier applies to the pattern as a whole rather than just to the last character, without the memory overhead of a capture. The "?:" right after the opening parenthesis is what switches capturing off; the "?" after the closing parenthesis only makes the group optional.
I intended that the regex allow for the possibility of a value block which was not CDATA escaped...
Non-capturing blocks are especially useful where you are doing something like @results = $variable =~ /(?:$repeated_pattern_im_not_interested_in)*($target)/g;