Pattern replace from a text file using sed

I have a text file in the format shown below:

<Text Text_ID="10155645315851111_10155645333076543" From="460350337461111" Created="2011-03-16T17:05:37+0000" use_count="123">This is the first text</Text>
<Text Text_ID="10155645315851111_10155645317023456" From="1626711840902323" Created="2011-03-16T17:01:02+0000" use_count="234">This is the second text</Text>
<Text Text_ID="10155645315851111_10155645320006543" From="1481727095384343" Created="2011-03-16T17:02:04+0000" use_count="3456">This is the third text 
If counted  
GOT IT... </Text>
<Text Text_ID="10155645315851111_10155645326222345" From="411021195696789" Created="2011-04-16T17:03:44+0000" use_count="5433">This is the fourth text........</Text>

The file contains many entries like these. Which script would be suitable for extracting only the MESSAGE between the markers, i.e., <Text ...> MESSAGE </Text>? Please note that the MESSAGE can span multiple lines and can include special characters, as in the third text message. Can someone help me out with a sample script? Thanks in advance.

If the file contains tags other than <Text> and you want to print only the contents of the <Text> elements, try:

awk '/<Text/ {P=1E9; sub(/<Text[^>]*>/,_)} /<\/Text>/ {P=NR; sub(/<\/Text>/,_)} P>=NR' file
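For anyone puzzling over the one-liner: P holds the last line number that should still be printed, and _ is just an unset variable standing in for the empty string. A more verbose sketch of the same idea is below; extract_text is a hypothetical function name, and like the one-liner it assumes at most one <Text> element starts on any given line.

```shell
# extract_text FILE: print only the contents of <Text> elements.
# (extract_text is an illustrative name, not part of the original reply.)
extract_text() {
  awk '
    # Opening tag: start printing and strip the tag itself.
    /<Text/    { printing = 1; sub(/<Text[^>]*>/, "") }
    # Closing tag: strip it, print what is left, then stop printing.
    /<\/Text>/ { sub(/<\/Text>/, ""); if (printing) print; printing = 0; next }
    # Any other line is printed only while inside a <Text> element.
    printing
  ' "$1"
}
```

Call it as extract_text file; lines that belong to other tags are skipped because printing stays 0 outside a <Text> element.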

Given the text format above, I would now like to filter the messages by sender, for example to extract all the messages that match From="460350337461111" or From="411021195696789" in a single pass using a sed script. Can someone help me out with a sample sed script? Thanks in advance.

Regular expressions are not really suited to parsing HTML or XML; you should consider using a Perl module such as XML::Simple.

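That said, if you want a sed script, this one strips everything within the angle brackets <...>. Note that \+ is a GNU sed extension, and that the command removes every tag rather than selecting messages by their From value:
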
sed 's/<[^>]\+>//g' ~/tmp/tmp.txt
This is the first text
This is the second text
This is the third text
If counted
GOT IT... 
This is the fourth text........
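
To actually select by From, a sed address range is awkward here: when the opening and closing tags sit on the same line, sed only starts looking for the end of the range on the following line, so the next message is pulled in as well. A rough awk sketch that handles both one-line and multi-line messages is below. The function name filter_by_sender is made up for illustration, and the two IDs from the question are hard-coded; it assumes each <Text ...> opening tag fits on one line.

```shell
# filter_by_sender FILE: print the messages of the two senders from the question.
# (filter_by_sender is a hypothetical name; adjust the IDs in the regex as needed.)
filter_by_sender() {
  awk '
    # On an opening tag, decide whether this element is wanted.
    /<Text/ { inside = 1; keep = ($0 ~ /From="(460350337461111|411021195696789)"/) }
    # Print wanted lines with all tags stripped.
    inside && keep { line = $0; gsub(/<[^>]*>/, "", line); print line }
    # A closing tag ends the current element.
    /<\/Text>/ { inside = 0 }
  ' "$1"
}
```

Run against the sample in the question as filter_by_sender file, this should leave only the first and fourth messages.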