Break a file into smaller pieces with same search string

nadeemrafikhan · September 24, 2018, 9:30pm

Please help me break a file into smaller pieces using the same search string, and delete the matched lines.
For example, I have the file below:

<expression>
some text
........
..........
</expression>
<expression>
some text
........
..........
</expression>
<expression>
some text
........
..........
</expression>
<expression>
some text
........
..........
</expression>
......................
.......................
.......................

Now each block ending with "/expression" should be split into a separate file.

Don_Cragun · September 25, 2018, 12:54am

What operating system are you using?

What shell are you using?

What have you tried to solve this problem on your own?

What is the name of your input file?

What names should be used for the output files you hope to create?

nadeemrafikhan · September 25, 2018, 1:33am

Here is the information you asked for.

What operating system are you using?
Linux 7

What shell are you using?
Bash

What have you tried to solve this problem on your own?

I tried some commands but did not get the expected output.

What is the name of your input file?
input.xml

What names should be used for the output files you hope to create?
output.ranx.xml

Don_Cragun · September 25, 2018, 4:23am

I am extremely disappointed that after observing us helping other users for more than 2.5 years and helping you with two other problems, you are unwilling to clearly show us what output(s) you're trying to produce, to show us what you have tried, and to show us the output(s) that your attempt(s) have produced.

If I am correctly interpreting your request to be to copy lines from a file named input.xml into a file named output.ranx.xml with all sequences of lines in your input file starting with a line containing the string <expression> up to and including a line containing the string </expression> deleted, you could try something like:

sed '/<expression>/,/<\/expression>/d' input.xml > output.ranx.xml

which with your given sample input would create a file named output.ranx.xml containing the text:

......................
.......................
.......................
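
To reproduce this without touching real data, the following sketch rebuilds a shortened version of the sample and runs the same sed command on it. The scratch directory from mktemp and the shortened sample are assumptions made for illustration only.

```shell
#!/bin/bash
# Work in a throwaway directory (assumed for this demo) so no real files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Shortened sample: two <expression> blocks followed by trailing lines.
cat > input.xml <<'EOF'
<expression>
some text
........
</expression>
<expression>
some text
........
</expression>
......................
.......................
EOF

# Delete every range of lines from <expression> through </expression>.
sed '/<expression>/,/<\/expression>/d' input.xml > output.ranx.xml

# Only the lines outside the blocks should remain.
cat output.ranx.xml
```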

When I first read your post #1, I had guessed that you wanted to create multiple output files with each of those output files containing one matched set of lines starting with a line containing the string <expression> and ending with the string </expression> but since you have specified that there is to only be one output file, that obviously can't be what you're trying to do. And re-reading your first post, I also see that this would not delete the matched lines as requested, so that can't be what you want.

nadeemrafikhan · September 25, 2018, 4:42am

Apologies, I think you either missed my point or I didn't convey my message correctly. I have one input file, "input.xml", that contains the above text. It should search for the first expression and write it to the file 1_input_ranx.xml, then the second expression to 2_input_ranx.xml, and so on, and remove the expression receptively. So we need to create multiple output files from one input file using the same expression. I hope that explains it better now. Thanks!

Don_Cragun · September 25, 2018, 5:16am

We obviously have a little bit of a language barrier here. I have no idea what "remove the expression receptively" means. So, I do not know what output you want to place into any of your output files.

Since you can't do arithmetic in sed, you'll probably want to process your input file directly in the shell or use something like awk or perl.

Can we assume that every input line containing <expression> will have a line following it containing </expression>?

Please try to come up with something that does what you want and show us what you have tried to get there. And, please show us exactly what output you hope to produce in each of the output files you want to produce from your sample input!
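
The pairing question above can be checked mechanically before any splitting is attempted. This is only a sketch: the function name check_pairs and the two temporary test files are made up for illustration, and the check is line-based, so it assumes each tag sits on its own line.

```shell
#!/bin/bash
# check_pairs FILE (hypothetical helper): report <expression> and </expression>
# lines that do not pair up. Returns 0 when balanced, 1 otherwise.
check_pairs() {
  awk '
  /<\/expression>/ {                      # closing tag
      if (!inblock) { printf "line %d: </expression> without an open block\n", NR; bad = 1 }
      inblock = 0
      next
  }
  /<expression>/ {                        # opening tag
      if (inblock) { printf "line %d: <expression> while the block from line %d is still open\n", NR, start; bad = 1 }
      inblock = 1
      start = NR
  }
  END {
      if (inblock) { printf "line %d: <expression> is never closed\n", start; bad = 1 }
      exit bad + 0
  }' "$1"
}

# Two small test files (assumed data): one balanced, one with a missing close.
good=$(mktemp)
bad=$(mktemp)
printf '%s\n' '<expression>' 'a' '</expression>' > "$good"
printf '%s\n' '<expression>' 'a' '<expression>' 'b' '</expression>' > "$bad"

check_pairs "$good" && echo "good file: balanced"
check_pairs "$bad"  || echo "bad file: unbalanced"
```

If the check reports nothing, a simple state-flag approach (copy on at the opening tag, off at the closing tag) should be safe to use on that file.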

nadeemrafikhan · September 25, 2018, 5:43am

Below is the input file:

<BEAM-PATTERN-LIST>
<bean pattern>
<M>0.0,0.00</M>
<M>1.0,0.00</M>
<M>2.0,0.00</M>
<M>3.0,0.00</M>
<M>4.0,0.00</M>
<M>5.0,0.10</M>
<M>6.0,0.10</M>
<M>7.0,0.10</M>
<M>8.0,0.20</M>
<M>9.0,0.20</M>
</bean pattern>
<bean pattern>
<M>170.0,26.50</M>
<M>171.0,26.40</M>
<M>172.0,26.30</M>
<M>173.0,26.30</M>
<M>174.0,26.20</M>
<M>175.0,26.10</M>
<M>176.0,26.10</M>
<M>177.0,26.10</M>
<M>178.0,26.00</M>
<M>179.0,26.00</M>
</bean pattern>
<bean pattern>
<M>190.0,26.00</M>
<M>191.0,26.10</M>
<M>192.0,26.10</M>
<M>193.0,26.10</M>
<M>194.0,26.20</M>
<M>195.0,26.30</M>
<M>196.0,26.30</M>
<M>197.0,26.40</M>
<M>198.0,26.50</M>
<M>199.0,26.50</M>
</bean pattern>
---------------
---------------
------so on-----------
-----------------
</BEAM-PATTERN-LIST>

Output files
First output file:

<M>0.0,0.00</M>
<M>1.0,0.00</M>
<M>2.0,0.00</M>
<M>3.0,0.00</M>
<M>4.0,0.00</M>
<M>5.0,0.10</M>
<M>6.0,0.10</M>
<M>7.0,0.10</M>
<M>8.0,0.20</M>
<M>9.0,0.20</M>

Second output file:

<M>170.0,26.50</M>
<M>171.0,26.40</M>
<M>172.0,26.30</M>
<M>173.0,26.30</M>
<M>174.0,26.20</M>
<M>175.0,26.10</M>
<M>176.0,26.10</M>
<M>177.0,26.10</M>
<M>178.0,26.00</M>
<M>179.0,26.00</M>

Third output file:

<M>190.0,26.00</M>
<M>191.0,26.10</M>
<M>192.0,26.10</M>
<M>193.0,26.10</M>
<M>194.0,26.20</M>
<M>195.0,26.30</M>
<M>196.0,26.30</M>
<M>197.0,26.40</M>
<M>198.0,26.50</M>
<M>199.0,26.50</M>

---------------
-------nth output file-----------
------------------

I hope this makes it clear.

Don_Cragun · September 25, 2018, 9:53am

The following seems to do what you want:

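awk '
# Opening tag: start copying and pick the next numbered output file.
$0 == "<bean pattern>" {
	copy = 1
	outFN = ++count "_input_ranx.xml"
	next
}
# Closing tag: stop copying and close the file just written.
$0 == "</bean pattern>" {
	copy = 0
	close(outFN)
	next
}
# Any other line inside a block goes to the current output file.
copy {
	print > outFN
}' input.xml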
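
Note that the exact comparison against $0 only matches when the tag is the whole line. If the real file has indented tags or Windows line endings (an assumption, since the posted sample shows neither), a variant that compares a trimmed copy of the line may be needed. The sketch below tries that on a cut-down sample in a scratch directory; the variable name tag and the sample data are made up for illustration.

```shell
#!/bin/bash
# Scratch directory (an assumption for this demo) so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Cut-down sample whose tags are indented, unlike the sample in the thread.
cat > input.xml <<'EOF'
<BEAM-PATTERN-LIST>
  <bean pattern>
<M>0.0,0.00</M>
<M>1.0,0.00</M>
  </bean pattern>
  <bean pattern>
<M>170.0,26.50</M>
<M>171.0,26.40</M>
  </bean pattern>
</BEAM-PATTERN-LIST>
EOF

awk '
{
	tag = $0
	sub(/\r$/, "", tag)                 # drop a trailing carriage return, if any
	gsub(/^[ \t]+|[ \t]+$/, "", tag)    # trim leading and trailing blanks
}
tag == "<bean pattern>" {
	copy = 1
	outFN = ++count "_input_ranx.xml"
	next
}
tag == "</bean pattern>" {
	copy = 0
	close(outFN)
	next
}
copy {
	print > outFN                       # data lines are written unchanged
}' input.xml

# List what was created.
ls
```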

MadeInGermany · September 25, 2018, 3:41pm

The same with bash builtins:

#!/bin/bash
x=0 copy=0
# -r keeps backslashes in the input lines intact
while IFS= read -r line
do
  case $line in
  "<bean pattern>")
    filename=$((++x))_output_ranx.xml
    copy=1
    echo "writing $filename ..."
    exec 3> "$filename"
    continue
  ;;
  "</bean pattern>")
    copy=0
  ;;
  esac
  if [ "$copy" -eq 1 ]
  then
    # printf is safe even for lines that look like echo options
    printf '%s\n' "$line" >&3
  fi
done < input.xml
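
Two edge cases may matter with a loop like this: a last line that has no trailing newline, and leaving descriptor 3 open after a block ends. The following is only a sketch of one way to cover both. The function name split_blocks and the test data are assumptions made for illustration.

```shell
#!/bin/bash
# split_blocks FILE (hypothetical helper): write the lines between each
# "<bean pattern>" and "</bean pattern>" pair to 1_output_ranx.xml,
# 2_output_ranx.xml, and so on, in the current directory.
split_blocks() {
  local line filename x=0 copy=0
  # The "|| [ -n ... ]" part also processes a final line that lacks a newline.
  while IFS= read -r line || [ -n "$line" ]
  do
    case $line in
    "<bean pattern>")
      filename=$((++x))_output_ranx.xml
      exec 3> "$filename"
      copy=1
      continue
      ;;
    "</bean pattern>")
      copy=0
      exec 3>&-                 # close the output file for this block
      continue
      ;;
    esac
    if [ "$copy" -eq 1 ]; then
      printf '%s\n' "$line" >&3
    fi
  done < "$1"
  # If the input ended inside a block, close the descriptor anyway.
  if [ "$copy" -eq 1 ]; then exec 3>&-; fi
}

# Demo in a scratch directory; the last line deliberately has no newline.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' '<bean pattern>' '<M>0.0,0.00</M>' '</bean pattern>' \
              '<bean pattern>' '<M>170.0,26.50</M>' > input.xml
printf '%s' '</bean pattern>' >> input.xml

split_blocks input.xml
ls
```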

nadeemrafikhan · September 26, 2018, 3:13am

Thanks everyone, I appreciate the help, especially Don's. I finally got it working, but this way:

dlt=$(awk '/PATTERN-LIST/,/\/PATTERN-LIST/' xmll | grep -v "PATTERN-LIST"|wc -l)
awk '/PATTERN-LIST/,/\/PATTERN-LIST/' xmll | grep -v "PATTERN-LIST" >tmp_file.xml
#echo $dlt
vdlt=$(expr $dlt/736)
vdlt="$(( (dlt) / 736))"
for ((i=0;i<$vdlt;i++));
do
echo $i"_tmp_file.xml";
sed -n '/PATTERN/{p; :loop n; p; /\/PATTERN/q; b loop}' tmp_file.xml >$i"_tmp_file.xml";
vrm=$(sed -n '/PATTERN/{p; :loop n; p; /\/PATTERN/q; b loop}' tmp_file.xml| wc -l);
sed -i -e "1,${vrm}d" tmp_file.xml;
done

RudiC · September 26, 2018, 3:57am

If the above does what it is supposed to do, fine. If you are happy with it, OK.
But, be aware that it is very inefficient compared to the solutions proposed before in this thread. It creates (6 + 4 * n) processes to run this number of commands (n being the number of lines in the file divided by 736) as opposed to one in Don Cragun's proposal and zero in MadeInGermany's as it's only using bash builtins.
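
As a worked example of that count: the 6 appears to come from the one-time commands in the script above (awk, grep, wc, then awk, grep, then expr) and the 4 from each loop iteration (sed, sed, wc, sed). The line count used below is a made-up figure chosen only to make the arithmetic concrete.

```shell
#!/bin/bash
# Hypothetical size: 7360 lines between the PATTERN-LIST tags (an assumed figure).
lines=7360
n=$(( lines / 736 ))        # number of loop iterations in the script above
procs=$(( 6 + 4 * n ))      # 6 one-time commands plus 4 per iteration
echo "n=$n processes=$procs"
# prints: n=10 processes=46
```

The same input would need a single awk process with Don Cragun's script, however many blocks it contains.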

nadeemrafikhan · September 26, 2018, 4:38am

Yes, Don's proposal also works!