get text between two tags in bash (awk)

rkoziol7 · February 23, 2011, 5:51am

Hi,

I have a sample text file:

<category name="Temp1">something1</category><!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
</TD></TR></TABLE></BODY></HTML>
<category name="Temp2">something2
</category>

New lines in the file may or may not occur.

I would like to get only those parts of the file which are between the closest 'category' tags, so in this example:

<category name="Temp1">something1</category><category name="Temp2">something2</category>

I am trying to get awk to do this with:

awk -F "</?category.*>" '{ print $1 }' file.txt

But this command gives me only:

</TD></TR></TABLE></BODY></HTML>

Could anyone point out how to write the command properly?

Regards,
Robert

ctsgnb · February 23, 2011, 6:22am

What do you get if you replace your $1 with $2, like this?

awk -F "</?category.*>" '{ print $2 }' file.txt

pravin27 · February 23, 2011, 6:25am

Try this:

awk -F">" '/category/{printf "%s%s", $1 FS, ($2 ~ /<\/category/ ? $2 FS : $2)}' infile

rkoziol7 · March 4, 2011, 3:43am

Thanks for your answers. The last answer does work with my file.

My mistake: I cut my sample down too much.

Consider another, slightly more complex text file:

<category name="Temp1">something1<blah>some<test>aa</test></blah></category>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
</TD></TR></TABLE></BODY></HTML>
<category name="Temp2">something2<cat><test1>aa</test1>ww</cat></category>
<category name="Temp1">something1<blah>some<test>aa</test></blah></category> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"> </TD></TR></TABLE></BODY></HTML> <category name="Temp2">something2<cat><test1>aa</test1>ww</cat></category>

I would like to get:

<category name="Temp1">something1<blah>some<test>aa</test></blah></category><category name="Temp2">something2<cat><test1>aa</test1>ww</cat></category><category name="Temp1">something1<blah>some<test>aa</test></blah></category><category name="Temp2">something2<cat><test1>aa</test1>ww</cat></category>

from it. Could you tell me how to rewrite the command?

ctsgnb · March 4, 2011, 4:18am

Assuming your file does not contain the | character, I can use it as a stand-in for the "category" string. I use this trick so that, in a pattern like /category>.*<category, the .* part can be written as [^|]* and therefore matches only text that does NOT contain any other "category" string:

echo "$(tr -d '\n' <infile)" | sed 's/category/|/g;s:/|>[^|]*<|:/|><|:g;s/|/category/g'
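The placeholder trick can be tried step by step. This is only a sketch on a small made-up sample: the temporary file and its contents are assumptions for illustration, not the poster's real data, and it still assumes the input contains no | character.

```shell
# Hypothetical sample input, created only for this illustration.
infile=$(mktemp)
cat >"$infile" <<'EOF'
<category name="Temp1">something1<blah>some</blah></category>
<!DOCTYPE HTML>
</TD></TR>
<category name="Temp2">something2
</category>
EOF

# 1. Join all lines into one (echo restores a final newline for sed).
# 2. Swap the word "category" for the single character "|".
# 3. Delete whatever sits between a closing "/|>" and the next opening "<|";
#    [^|]* cannot run across another tag, so each match stays between neighbours.
# 4. Put "category" back.
result=$({ tr -d '\n' <"$infile"; echo; } |
  sed -e 's/category/|/g' -e 's:/|>[^|]*<|:/|><|:g' -e 's/|/category/g')

printf '%s\n' "$result"
rm -f "$infile"
```

Note that the substitution only removes text between a closing tag and the next opening tag, so anything before the first <category or after the last </category> is left in place.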