Perl or Awk script to copy a part of text file.

asandy1234 · October 30, 2009, 11:42am

Hi Gurus,
I'm a total newbie to Perl and Awk scripting. Let me explain the scenario: there is a DB2 table with 5 columns, and one of the columns is a CLOB datatype containing XML. We need the other 4 columns in full but only a portion of the string from the XML column.
We decided to export the DB2 table to a .del file and process it using a Perl or Awk script. I need a script to process the .del file so that I have column1, column2, column3, then for column 4, which is the XML, just the strings between <text> and </text> (there may be multiple occurrences, so they can be separated by number), plus column 5.
I know it will be a piece of cake for the experts.

Thanks,

frans · October 30, 2009, 5:14pm

try this

I=0
while read LINE
do
    TEXT[$I]=$(echo "$LINE" | grep -o '<text>.*</text>' | sed -e 's/<text>//' -e 's/<\/text>//')
    (( I ++ ))
done < $FILENAME

That creates an array where each item is the text contained between <text> and </text>.
Hope this helps

asandy1234 · October 30, 2009, 5:36pm

Thanks, Frans, for your prompt reply, but I have a couple of questions that may sound stupid to you.
1) What is the extension of the file I should save? Is it .awk or .ksh?
2) Where do I replace the input and output file name in the code?

Thanks,

frans · October 30, 2009, 5:49pm

It's bash scripting, so on the first line of the script you write #!/bin/bash (if the path of your shell is /bin/bash, of course).
The full code:

#!/bin/bash
while read LINE
do
    echo $LINE | grep -o '<text>.*</text>' | sed -e 's/<text>//' -e 's/<\/text>//'
    (( I ++ ))
done < INPUTFILE > OUTPUTFILE

You could use variables for the input and output files if they have to be re-used later; otherwise just replace 'INPUTFILE' and 'OUTPUTFILE' with your own file names.

P.S. The extension doesn't matter! Make it executable with chmod +x and go.

asandy1234 · November 2, 2009, 11:46am

Thank you very much Frans, I'll test and let you know.

---------- Post updated at 12:46 PM ---------- Previous update was at 12:05 PM ----------

Hi Frans,
There is no /bin/bash on the unix box, but there is /usr/bin/bsh. Are these two the same? I tried to run the script but I get the error below:

grep: Not a recognized flag: o
Usage: grep [-r] [-R] [-H] [-L] [-E|-F] [-c|-l|-q] [-insvxbhwy] [-p[parasep]] -e pattern_list...
[-f pattern_file...] [file...]
Please clarify.

Thanks.

frans · November 2, 2009, 1:28pm

bsh is the Bourne shell and bash is the Bourne Again shell; the grep on your system doesn't seem to accept the same options.
The -o option tells grep to output only the matching part of the line, which is helpful here.
Tell me what the output is when you use grep without that option.
I believe there's only one occurrence of <text>....</text> in each line.
To try things faster, don't redirect to the output file ( > OUTPUTFILE ), so you see directly what happens.
One possibility is to do the extraction like this:

LINE="lbla bla bla jcjfd<text>what i want</text>flh%(j blablabla" # for testing
LINE=$(echo "$LINE" | grep '<text>.*</text>') # Returns the line if it contains the match.
echo $LINE
LINE=${LINE##*<text>} # Deletes the matching from the beginning
echo $LINE
LINE=${LINE%%</text>*} # Deletes the matching from the end
echo $LINE

and see what happens
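
Since the grep on that box rejects -o, a sed substitution is one possible stand-in. This is only a minimal sketch: the function name extract_text is made up for illustration, and it assumes the opening and closing tags sit on the same line. If a line holds several <text>...</text> pairs, the greedy leading .* means only the last pair is printed.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): print what lies between
# <text> and </text> on each input line; lines without the pair are skipped.
extract_text() {
    # -n suppresses default output; the trailing p prints only lines
    # where the substitution succeeded.
    sed -n 's/.*<text>\(.*\)<\/text>.*/\1/p'
}

printf '%s\n' 'bla bla<text>what i want</text>more bla' | extract_text
```

It only uses basic regular expressions, so it should not depend on GNU extensions.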

asandy1234 · November 2, 2009, 1:37pm

No, there are multiple occurrences of this pattern in each record.

frans · November 2, 2009, 3:00pm

What a challenge to script that in shell!
Here, I've coded something which works:

LINE="lbla bla bla jcjfd<text>what i want</text>flh%(j blablabla<text>second occurence</text>juhgfiuhf<text>third occ</text>jkhgq"
I=1
LINE=${LINE#*<text>}	# Removes everything from the beginning up to the first "<text>"
LINE=${LINE%</text>*}	# Removes all from the end down to the last "</text>"
while echo $LINE | grep -q "<text>"	# more than one field
do	TEXT[$I]=${LINE%%</text>*}
	LINE=${LINE#*<text>}
	(( I ++ ))
done
TEXT[$I]=${LINE%</text>*}	# for the last one
{	# To see what we've done
	N=$I
	for I in $(seq $N)
	do	echo "TEXT[$I] = ${TEXT[$I]}"
	done
}

See if it works for you. If so, it can be embedded in the appropriate code to parse your files.
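
For anyone following along, the script leans on the four prefix/suffix removal operators. The sketch below shows what each one leaves behind on a made-up sample string (the variable names are illustrative only). The tags are quoted inside the expansions as a precaution for shells that stumble over a bare < there.

```shell
#!/bin/sh
# Sample string with two <text>...</text> pairs (made up for illustration).
S='a<text>one</text>b<text>two</text>c'

after_first_open=${S#*"<text>"}      # shortest matching prefix removed
after_last_open=${S##*"<text>"}      # longest matching prefix removed
before_last_close=${S%"</text>"*}    # shortest matching suffix removed
before_first_close=${S%%"</text>"*}  # longest matching suffix removed

echo "$after_first_open"     # one</text>b<text>two</text>c
echo "$after_last_open"      # two</text>c
echo "$before_last_close"    # a<text>one</text>b<text>two
echo "$before_first_close"   # a<text>one
```

The loop in the script alternates %% (to cut off everything after the first closing tag) with # (to step past the next opening tag).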

asandy1234 · November 2, 2009, 4:04pm

I get a syntax error
"0403-057 Syntax error at line 10 : `<' is not expected."

frans · November 2, 2009, 4:10pm

try to put quotes around <text> and </text>
like "<text>" "</text>" or '<text>' '</text>'

asandy1234 · November 3, 2009, 11:29am

I get different error now.
Document.bash[15]: I ++ : 0403-053 Expression is not complete; more tokens expected

frans · November 3, 2009, 11:35am

The ++ operator doesn't seem to be implemented in your shell version.
(( I ++ )) means "I=I+1", so try this instead:

let I=I+1
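
If let is not available either, two other ways to add one are sketched below. Whether the shell on that box accepts the first form is an assumption; the second relies only on the external expr utility.

```shell
#!/bin/sh
I=1
I=$((I + 1))          # POSIX arithmetic expansion
I=$(expr "$I" + 1)    # external expr command; the spaces around + are required
echo "$I"             # 3
```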

asandy1234 · November 3, 2009, 12:44pm

I changed it but I'm getting something like this:
seq: Not found

frans · November 3, 2009, 1:04pm

try

# in place of
for I in $(seq $N)
# write
for (( I = 1; I <= N; I++ ))

or the appropriate code to make a counting loop with your shell

asandy1234 · November 4, 2009, 11:09am

I changed it but I'm getting a syntax error

Document.bash[18]: 0403-057 Syntax error at line 20 : `(' is not expected.

frans · November 4, 2009, 11:25am

It would be nice if I could get a sample of the file you want to parse.
I've modified the code:

LINE="lbla bla bla jcjfd<text>what i want</text>flh%(j blablabla<text>second occurence</text>juhgfiuhf<text>third occ</text>jkhgq"
I=1
LINE=${LINE#*<text>}	# Removes everything from the beginning up to the first "<text>"
LINE=${LINE%</text>*}	# Removes all from the end down to the last "</text>"
while echo $LINE | grep -q "<text>"	# more than one field
do	TEXT[$I]=${LINE%%</text>*}
	LINE=${LINE#*<text>}
	let I=I+1
done
TEXT[$I]=${LINE%</text>*}	# for the last one
{	# To see what we've done
	N=$I
	I=1
	while [ $I -le $N ]
	do
		echo "TEXT[$I] = ${TEXT[$I]}"
		let I=I+1
	done
}

asandy1234 · November 4, 2009, 11:58am

Thanks, Frans, this code is working; I just changed <text> to "<text>".
Can you let me know how to modify this code to accommodate an input and output file?

Thanks.

frans · November 4, 2009, 12:22pm

Revised script, without comments, reading from a file and writing to a file:

I=1
while read LINE
do
	LINE=${LINE#*"<text>"}
	LINE=${LINE%"</text>"*}
	while echo "$LINE" | grep -q "<text>"
	do
		echo "$I ${LINE%%"</text>"*}"
		LINE=${LINE#*"<text>"}
		let I=I+1
	done
	echo "$I ${LINE%"</text>"*}"
	let I=I+1
done < inputfile > outputfile

The output gives numbered lines with the text. If you don't want the numbers, remove the $I in the echo.
If this doesn't work, send me a big sample of the XML to process.
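
Since the thread title asks for Awk, here is a rough awk equivalent of the loop above, wrapped in a shell function. It is a sketch only: extract_all is a made-up name, each <text>...</text> pair is assumed to open and close on the same line, and the other columns of the .del file are not handled.

```shell
#!/bin/sh
# Hypothetical helper: print every <text>...</text> body on each input
# line, prefixed by a running number.
extract_all() {
    awk '
    {
        rest = $0
        # index() is a plain substring search, so no regex escaping is needed
        while ((start = index(rest, "<text>")) > 0) {
            rest = substr(rest, start + 6)      # skip past "<text>" (6 chars)
            stop = index(rest, "</text>")
            if (stop == 0) break                # no closing tag on this line
            print ++n, substr(rest, 1, stop - 1)
            rest = substr(rest, stop + 7)       # skip past "</text>" (7 chars)
        }
    }'
}

printf '%s\n' 'x<text>first</text>y<text>second</text>z' | extract_all
```

Unlike the shell loop, lines that contain no <text> at all produce no output here.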

asandy1234 · November 4, 2009, 1:14pm

Wow, great, Frans, the code works. Thank you very much for your help.

---------- Post updated at 02:14 PM ---------- Previous update was at 01:33 PM ----------

Hi Frans, one more issue.
The code works fine if the file is one line but if the xml is multi line the code doesn't work.
Please find below the xml

frans · November 4, 2009, 2:24pm

A version with numbering

LINE=`cat inputfile`
I=1
{
    LINE=${LINE#*"<text>"}    # Removes everything from the beginning up to the first "<text>"
    LINE=${LINE%"</text>"*}    # Removes everything from the end back to the last "</text>"
    while echo "$LINE" | grep -q "<text>"    # more than one field
    do
        echo "$I ${LINE%%"</text>"*}"
        LINE=${LINE#*"<text>"}
        let I=I+1
    done
    echo "$I ${LINE%"</text>"*}"    # for the last one
} > outputfile

Or without numbering (simpler)

LINE=`cat inputfile`
{
    LINE=${LINE#*"<text>"}    # Removes everything from the beginning up to the first "<text>"
    LINE=${LINE%"</text>"*}    # Removes everything from the end back to the last "</text>"
    while echo "$LINE" | grep -q "<text>"    # more than one field
    do
        echo "${LINE%%"</text>"*}"
        LINE=${LINE#*"<text>"}
    done
    echo "${LINE%"</text>"*}"    # for the last one
} > outputfile

Note that the characters around cat inputfile in the first line are backticks [AltGr-7], not single quotes!
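
The same whole-file idea can also be sketched in awk, which avoids holding the file in a shell variable. Everything here is an assumption for illustration: extract_multiline is a made-up name, the file is assumed small enough to buffer in memory, and line breaks inside one <text> body are flattened to spaces so each body comes out on a single line.

```shell
#!/bin/sh
# Hypothetical helper: collect the whole input, then print each
# <text>...</text> body on its own line, even if it spanned several lines.
extract_multiline() {
    awk '
    { buf = buf (NR > 1 ? "\n" : "") $0 }       # append every line to one buffer
    END {
        while ((start = index(buf, "<text>")) > 0) {
            buf = substr(buf, start + 6)        # skip past "<text>"
            stop = index(buf, "</text>")
            if (stop == 0) break                # unterminated tag: stop
            body = substr(buf, 1, stop - 1)
            gsub(/\n/, " ", body)               # flatten line breaks in the body
            print body
            buf = substr(buf, stop + 7)         # skip past "</text>"
        }
    }' "$@"
}

printf '%s\n' '<doc><text>line one' 'line two</text>' '<text>next</text></doc>' | extract_multiline
```

It can be run as extract_multiline inputfile > outputfile, or fed from a pipe as in the example.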
