How can I delete the content between all occurrences of two strings using sed or awk?

satheeshkumar · August 17, 2011, 8:51am

Hi. I have to delete the content between all occurrences of a pair of XML tags in a single file.

For example:

The tags <script>.....................</script> occur more than once in the same file.
The file follows the usual tagging rules: every start tag is followed by an end tag, and there are never two consecutive opening tags.
However, the tags are not necessarily on separate lines.

I used the script below, which deleted only the first occurrence in the file.

sed -e "s/ <script>*?<\/script>//g" $INF > $OUTF

Please help me with this. Since I have to process a huge amount of data, a more efficient method would be better.

Any other approach apart from sed and awk would also be welcome.

Franklin52 · August 17, 2011, 9:45am

Something like this?

awk '/<script>/{p=1} /<\/script>/{p=0; next}!p' file

jville · August 17, 2011, 10:18am

Looks like it's deleting from the beginning of the line where the <script> and </script> are located.

test.file

this is <script> the first line in the file1
and the second </script> line
and the third line in the file

when I execute the script
awk '/<script>/{p=1} /<\/script>/{p=0; next}!p' test.file
I get

and the third line in the file

I assume what Mr satheeshkumar wants is

this is  
line
and the third line in the file

Thanks.

satheeshkumar · August 17, 2011, 10:28am

Thanks Frank. It handles every occurrence of the <script> and </script> tags, but it removes the complete line on which the tags are present.

Could you please help me remove only the content between those tags, instead of everything on the line?

Example:

Below is the output that I get when I execute your command.

input file content:

client side<script>java script</script>java scripting is.......
server side<scipt>classic asp</script>ASP is a microsoft technology.......

satheesh here

output:

satheesh here

But I want the output as:

client side java scripting is.......
server side ASP is a microsoft technology.......

You are correct, jville. That's exactly what I need.

Franklin52 · August 17, 2011, 11:14am

Try:

awk '
/<script>/ && /<\/script>/{
  sub("<script>.*</script>",x)
  if($0){print}
  next
}
/<script>/{p=1} /<\/script>/{
  p=0; next
}
!p' file

satheeshkumar · August 17, 2011, 11:33am

Frank, it works fine for the first line. But on the next line of input it makes the same error: it removes the complete line instead of just the content between the <script> and </script> tags.

This time I get the output as

client side java scripting is.......

satheesh here

instead of

client side java scripting is.......
server side ASP is a microsoft technology.......
satheesh here

The second line of input has been deleted as a whole. It seems that the conditional deletion works only for the first occurrence.

Thanks
Satheesh

Franklin52 · August 17, 2011, 12:39pm

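awk '
# Opening and closing tag on the same line: cut the tagged span out of the line.
/<script>/ && /<\/script>/{
  sub("<script>.*</script>",x)
  if($0){print}
  next
}
# Only the opening tag: keep the text before it and start skipping (p=1).
/<script>/{
  sub("<script>.*",x)
  if($0){print}
  p=1
}
# Only the closing tag: keep the text after it and stop skipping (p=0).
/<\/script>/{
  sub(".*</script>",x)
  if($0){print}
  p=0
  next
}
# Print the remaining non-empty lines that lie outside a tagged span.
$0 && !p' file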
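
One caveat with `sub("<script>.*</script>", x)`: awk regular expressions are greedy, so on a line that holds two pairs the match runs from the first <script> to the last </script>, and the text between the pairs is lost as well. A minimal sketch of a same-line workaround with index() and substr() follows. `strip_same_line` is a made-up name, and the sketch assumes literal tags without attributes. It does not touch elements that span several lines.

```shell
# strip_same_line is a hypothetical helper; it only handles pairs that open
# and close on the same line.
strip_same_line() {
  awk '{
    # Cut from the first <script> to the nearest </script> that follows it,
    # and repeat until no complete pair is left on the line.
    while ((s = index($0, "<script>")) > 0) {
      rest = substr($0, s + 8)             # text after "<script>" (8 chars)
      e = index(rest, "</script>")
      if (e == 0) break                    # no closing tag on this line
      $0 = substr($0, 1, s - 1) substr(rest, e + 9)   # "</script>" is 9 chars
    }
    print
  }' "$@"
}

line='a<script>x</script>b<script>y</script>c'
printf '%s\n' "$line" | awk '{ sub("<script>.*</script>", ""); print }'  # greedy: ac
printf '%s\n' "$line" | strip_same_line                                   # abc
```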

satheeshkumar · August 17, 2011, 2:14pm

That is working great, Frank. Thank you very much.

I am new to shell scripting and am just trying to understand how you have written this.

Thank you very much again.

alister · August 17, 2011, 6:01pm

testdata:

12345
<script>------------------------</script>
23456<script>-----------</script><script>
-----------------------------------------
--------------------------------</script>
<script>----------------</script><script>
-----------------------------------------
---------------------------</script>34567
4<script>--</script>567<script></script>8
56<script>-------------------</script>789

Expected output:

12345
23456
34567
45678
56789

descriptinator.sed:

#n

:top
/<script>/ {
        # Change the first <script> in the line to \n.
        # We can be certain that there will not be a newline in the
        # initial pattern space, so this is unambiguous.
        s//\
/
        # If the closing </script> is on the same line, change it also
        # to \n and delete everything in between, newlines inclusive.
        # If the result is an empty line, print nothing.
        /<\/script>/ {
                s//\
/
                s/\n.*\n//
                b empty?
        }

        # The </script> element is not on the same line as its <script>.
        # Before moving on in search of it, delete from the newline to the
        # end of line.  Print only if the line is not empty.
        s/\n.*//
        /./p

        # Discard lines until closing </script> is found.
        :next
        n
        /<\/script>/! b next

        # Change </script> to \n and delete preceding text.
        s//\
/
        s/.*\n//

        :empty?
        # If the line has been left empty, do not print a blank line.
        /./!d

        # In case there's another <script> element later in the line.
        b top
}
p

Descriptinator Test run:

$ sed -f descriptinator.sed testdata 
12345
23456
34567
45678
56789

Regards,
Alister
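
For comparison, the same idea can be sketched in awk as a small state machine that remembers from line to line whether it is inside a <script> element. This is an illustration under assumptions, not a drop-in replacement for the sed script: `strip_script_stream` and the variable names are made up, and the tags are taken to be the literal `<script>` and `</script>`, case-sensitive and without attributes, as in the test data above.

```shell
# strip_script_stream is a hypothetical name; it reads the files named as
# arguments, or standard input when none are given.
strip_script_stream() {
  awk -v otag='<script>' -v ctag='</script>' '
  {
    line = $0; out = ""
    touched = inside                 # did a tag affect this line?
    while (line != "") {
      if (inside) {
        i = index(line, ctag)
        if (i == 0) break            # still inside: drop the rest of the line
        line = substr(line, i + length(ctag)); inside = 0
      } else {
        i = index(line, otag)
        if (i == 0) { out = out line; break }
        out = out substr(line, 1, i - 1)
        line = substr(line, i + length(otag)); inside = 1; touched = 1
      }
    }
    # Keep untouched lines as they are; skip lines emptied by the removal.
    if (out != "" || !touched) print out
  }' "$@"
}
```

With the variables from the original question it could be called as `strip_script_stream "$INF" > "$OUTF"`. It reads one line at a time and keeps only a single flag between lines, so memory use does not grow with the size of the file.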

satheeshkumar · August 18, 2011, 10:46am

Thank you, Alister. It is working great. Most importantly, it works for most cases, especially when the starting and ending tags are on the same line.

Thanks again.

Regards
Satheesh