Hi
I have a file that contains malformed XML. There are some garbage characters at the end of each line that I want to get rid of. Example:
<request type="product" ><attributes><pair><name>q</name><value><![CDATA[LOL]]></value></pair><pair><name>start</name><value>1</value></pair></attributes></request>J IiYY'z3uJ5}#Q/k;!9Q){_F
<request type="product"><attributes><pair><name>q</name><value><![CDATA[LOL2]]></value></pair><pair><name>start</name><value>1</value></pair></attributes></request>4/lITl'cO{;_?(>YmP
How can I remove the garbage characters after </request>? In other words, how do I remove the string between </request> and the next <request>?
Please note that everything from <request> to </request> is on a single line, so
awk '/<request t/ , /<\/request>/' test.txt
does not work.
My goal is to extract the value when the name is "q" (LOL and LOL2 in this case). So if that can be done easily, I am not bothered about removing the junk characters.
Thank you for your time.
perl -e ' while(<>){print "$1\n" if (/name>q<\/name><value><(?:!\[CDATA\[)?([^\]]+)\]\]><\/value/);}' test.txt
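For the other half of the question (removing the junk itself), here is a minimal sketch, assuming each line holds exactly one request and that the garbage never contains the text </request>. The helper name strip_junk is made up for illustration; the sample line is the second one from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: keep everything up to and including the first
# </request> on the line and drop whatever follows it.
sub strip_junk {
    my ($line) = @_;
    # /s lets .* also swallow a trailing newline, so the caller adds its own.
    $line =~ s{(</request>).*}{$1}s;
    return $line;
}

# Second sample line from the question, garbage included.
my $sample = q|<request type="product"><attributes><pair><name>q</name><value><![CDATA[LOL2]]></value></pair><pair><name>start</name><value>1</value></pair></attributes></request>4/lITl'cO{;_?(>YmP|;

print strip_junk($sample), "\n";
```

The same substitution should also work as a one-liner: perl -pe 's{(</request>).*}{$1}' test.txt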
You Sir, are awesome. 1000 internets to you.
Sorry, my bad.. Didn't read the question completely. Deleted my erroneous solution.
Hi
Just trying to understand your solution, some questions:
1) Why did you use "?:" before !\[CDATA?
2) What is the reason for putting "(?:!\[CDATA\[)" in parentheses, i.e. "(" and ")"?
3) What does the "?" in the middle do?
4) What does ([^\]]+) do?
Sorry, I am still learning regular expressions. Someday I want to be as good as you. Please help.
I have made the characters in bold for your convenience.
Thank you.
if (/name>q<\/name><value><(?:!\[CDATA\[)?([^\]]+)\]\]><\/value/);
/name>q<\/name><value><   # literal string
(?:                       # non-capturing parenthesis
!\[CDATA\[)?              # this block is optional (allows for cases where the data isn't CDATA escaped)
(                         # begin capture
[^\]]+                    # one or more characters which aren't a ] (the match is greedy, so it will capture as many as possible)
)                         # end of capture
(?:\]\])?                 # what I should have said ;) instead of a mandatory \]\], so the closing brackets are optional too
><\/value                 # literal string
/x                        # allow comments in regexes so the maintainer doesn't hunt you down and kill you
The contents matched by the capturing parentheses are then available as $1.
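One caveat: the regex above still requires a literal < right after <value> and a literal > before </value>, so a plain <value>1</value> would not match. A minimal sketch of a version where the whole CDATA wrapper is optional follows. It assumes the value itself contains neither ] nor <, and the helper name q_value is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the value paired with <name>q</name>,
# or undef when the line has no such pair.
sub q_value {
    my ($line) = @_;
    return $line =~ m{
        <name>q</name><value>   # literal text
        (?:<!\[CDATA\[)?        # optional CDATA opener, not captured
        ([^\]<]+)               # capture: one or more chars that are neither ] nor <
        (?:\]\]>)?              # optional CDATA closer, not captured
        </value>                # literal text
    }x ? $1 : undef;
}

print q_value('<pair><name>q</name><value><![CDATA[LOL]]></value></pair>'), "\n";
print q_value('<pair><name>q</name><value>LOL2</value></pair>'), "\n";
```

Using m{...} instead of /.../ avoids escaping the slashes in the closing tags.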
Thanks for taking the time to explain things. Is $1 not set to (?:!\[CDATA\[), even though it is in parentheses, because it is followed by "?"?
In other words, why is $1 not set to (?:!\[CDATA\[) even though that is the first expression inside parentheses?
As I mentioned above, that's the syntax for a non-capturing block: (?:$pattern)$modifier
It allows you to group a pattern so that the modifier (such as ? or *) applies to the pattern as a whole rather than to the last character, without the memory overhead of capture. Because the group does not capture, it is not numbered, so $1 is the first capturing group after it.
I intended that the regex allow for the possibility of a value block which was not CDATA escaped...
Non-capturing blocks are especially useful where you are doing something like @results = $variable =~ /(?:$repeated_pattern_im_not_interested_in)*($target)/g;
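To make the difference concrete, here is a small sketch comparing a capturing and a non-capturing group, followed by the /g list-context idiom. The sample strings are made up for illustration.

```perl
use strict;
use warnings;

my $s = 'abababX';

# Capturing group: (ab) is group 1, so X ends up in group 2.
my @with_capture = $s =~ /(ab)*(X)/;

# Non-capturing group: (?:ab) is not numbered, so X is group 1.
my @without_capture = $s =~ /(?:ab)*(X)/;

print "capturing:     @with_capture\n";
print "non-capturing: @without_capture\n";

# With /g in list context, every match adds its captures to the list.
# The <name> element is grouped but not captured, so only values come back.
my $line = '<name>q</name><value>LOL</value><name>start</name><value>1</value>';
my @values = $line =~ m{(?:<name>[^<]*</name>)<value>([^<]+)}g;

print "values: @values\n";
```

In the first match the list holds two entries (the last "ab" and "X"); in the second it holds only "X", which is why $1 is the value and not the CDATA prefix.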