How can I delete the content between all occurrences of two strings using sed or awk?

Hi. I have to delete the content between all occurrences of a pair of XML tags in a single file.

For example:

  • The tags <script>.....................</script> occur more than once in the same file.

  • The file follows the usual tagging rules: every start tag is followed by an end tag, and there are never two consecutive opening tags.

  • But the tags are not necessarily on separate lines.

I used the script below, which deleted just the first occurrence in the file.

sed -e "s/ <script>*?<\/script>//g" $INF > $OUTF

Please help me with this. Since I have to process a huge amount of data, a more efficient method would be better.

If there is any other way apart from sed and awk, that would be fine too.
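
A note on why the sed attempt above does not do what was intended, as far as I can tell: sed's basic regular expressions have no lazy quantifier. In `<script>*?` the `*` repeats only the `>` before it and the `?` is matched as a literal question mark; the pattern also requires a space before `<script>`. Below is a minimal sketch for the simplest case only, assuming both tags sit on the same line and there is no `<` between them. The function name strip_same_line is made up for illustration.

```shell
# Hypothetical helper: removes <script>...</script> pairs that open and
# close on the same line. Assumes no "<" appears between the two tags.
strip_same_line() {
  sed 's/<script>[^<]*<\/script>//g' "$@"
}

printf '%s\n' 'client side<script>java script</script>java scripting' | strip_same_line
# prints: client sidejava scripting
```

This does not handle elements that span several lines, which is what the replies below deal with.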

Something like this?

awk '/<script>/{p=1} /<\/script>/{p=0; next}!p' file

It looks like it's deleting from the beginning of the lines where <script> and </script> are located.

test.file

this is <script> the first line in the file1
and the second </script> line
and the third line in the file

When I execute the script
awk '/<script>/{p=1} /<\/script>/{p=0; next}!p' test.file
I get

and the third line in the file

I am assuming what Mr. satheeshkumar wants is

this is  
line
and the third line in the file

Thanks.

Thanks, Frank. It handles every occurrence of the <script> and </script> tags, but it removes the complete line on which those tags are present.

Could you please help me remove only the content between those tags, instead of everything on the line?

Example:

Below is the output that I get when I execute your command.

input file content:

client side<script>java script</script>java scripting is.......
server side<scipt>classic asp</script>ASP is a microsoft technology.......

satheesh here

output:

satheesh here

But I want the output as:

client side java scripting is.......
server side ASP is a microsoft technology.......

---------- Post updated at 09:28 AM ---------- Previous update was at 09:24 AM ----------

You are correct, jville. That's exactly what I need. :)

Try:

awk '
/<script>/ && /<\/script>/{
  sub("<script>.*</script>",x)
  if($0){print}
  next
}
/<script>/{p=1} /<\/script>/{
  p=0; next
}
!p' file

Frank, it works fine for the first line. But when it comes to the next line of input, it makes the same error: it removes the complete line instead of just removing the content between the <script> and </script> tags.

This time i get the output as

client side java scripting is.......

satheesh here

instead of

client side java scripting is.......
server side ASP is a microsoft technology.......
satheesh here

The second line of input has been deleted as a whole. It seems that the conditional deletion works only for the first occurrence.

Thanks
Satheesh
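
One possible explanation, going only by the sample as posted: the second input line opens with `<scipt>` rather than `<script>`. If the real file has the same spelling, only the closing-tag rule fires on that line and the whole line is skipped. The sketch below reproduces this with the script from the previous reply, wrapped in a function whose name (second_try) is made up for the demo.

```shell
# The awk script from the previous reply, unchanged, inside a
# hypothetical wrapper function so it can read from a pipe.
second_try() {
  awk '
  /<script>/ && /<\/script>/{
    sub("<script>.*</script>",x)
    if($0){print}
    next
  }
  /<script>/{p=1} /<\/script>/{
    p=0; next
  }
  !p' "$@"
}

# Sample input as posted, including the "<scipt>" spelling on line 2.
printf '%s\n' \
  'client side<script>java script</script>java scripting is.......' \
  'server side<scipt>classic asp</script>ASP is a microsoft technology.......' \
  '' \
  'satheesh here' | second_try
# Output:
# client sidejava scripting is.......
# (empty line)
# satheesh here
```

With the tag spelled `<script>` on the second line, the first rule matches that line too and it comes out as "server sideASP is a microsoft technology.......".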

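awk '
# Opening and closing tag on the same line: cut everything from the
# first <script> to the last </script> and print what is left.
/<script>/ && /<\/script>/{
  sub("<script>.*</script>",x)
  if($0){print}
  next
}
# Only the opening tag: keep the text before it, then start skipping.
/<script>/{
  sub("<script>.*",x)
  if($0){print}
  p=1
}
# Only the closing tag: keep the text after it and stop skipping.
/<\/script>/{
  sub(".*</script>",x)
  if($0){print}
  p=0
  next
}
# Print the remaining lines unless they are empty or inside a <script>
# element (awk also treats a line that is just 0 as empty here).
$0 && !p' file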

:) That is working great, Frank. Thank you very much.

I am new to shell scripting and am just trying to understand how you have written this.

Thank you very much again.


testdata:

12345
<script>------------------------</script>
23456<script>-----------</script><script>
-----------------------------------------
--------------------------------</script>
<script>----------------</script><script>
-----------------------------------------
---------------------------</script>34567
4<script>--</script>567<script></script>8
56<script>-------------------</script>789

Expected output:

12345
23456
34567
45678
56789

descriptinator.sed:

#n

:top
/<script>/ {
        # Change the first <script> in the line to \n.
        # We can be certain that there will not be a newline in the
        # initial pattern space, so this is unambiguous.
        s//\
/
        # If the closing </script> is on the same line, change it also
        # to \n and delete everything in between, newlines inclusive.
        # If the result is an empty line, print nothing.
        /<\/script>/ {
                s//\
/
                s/\n.*\n//
                b empty?
        }

        # The </script> element is not on the same line as its <script>.
        # Before moving on in search of it, delete from the newline to the
        # end of line.  Print only if the line is not empty.
        s/\n.*//
        /./p

        # Discard lines until closing </script> is found.
        :next
        n
        /<\/script>/! b next

        # Change </script> to \n and delete preceding text.
        s//\
/
        s/.*\n//

        :empty?
        # If the line has been left empty, do not print a blank line.
        /./!d

        # In case there's another <script> element later in the line.
        b top
}
p

Descriptinator Test run:

$ sed -f descriptinator.sed testdata 
12345
23456
34567
45678
56789

Regards,
Alister

Thank you, Alister. It is working great. Most importantly, it works for most cases, especially when the starting and ending tags are on the same line.

Thanks again.

Regards
Satheesh
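
For comparison, the same idea (remember whether we are inside an element, and handle several elements per line) can also be sketched in awk. This is only a sketch under the assumptions stated at the top of the thread: tags are always paired and never nested. The function name strip_script is made up, and the tag strings are hard-coded. It uses index() and substr() rather than regular expressions, so nothing is matched greedily.

```shell
# Hypothetical helper: delete everything from each <script> to the next
# </script>, whether or not the two tags are on the same line.
strip_script() {
  awk '
  {
    rest = $0
    out = ""
    touched = inside            # did this line start inside an element?
    while (rest != "") {
      if (inside) {
        i = index(rest, "</script>")
        if (i == 0) break                  # element continues on the next line
        rest = substr(rest, i + 9)         # 9 = length of "</script>"
        inside = 0
      } else {
        i = index(rest, "<script>")
        if (i == 0) { out = out rest; break }
        out = out substr(rest, 1, i - 1)   # keep the text before the tag
        rest = substr(rest, i + 8)         # 8 = length of "<script>"
        inside = 1
        touched = 1
      }
    }
    # Untouched lines are printed as they are; edited lines are printed
    # only if something is left, so no new blank lines appear.
    if (out != "" || !touched) print out
  }' "$@"
}

printf '%s\n' '4<script>--</script>567<script></script>8' | strip_script
# prints: 45678
```

Run as `strip_script testdata` on the test data above, it should give the same five lines as the expected output. It reads the file line by line and keeps only the current line in memory.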