Perl or Awk script to copy a part of text file.

Hi Gurus,
I'm a total newbie to Perl and Awk scripting. Let me explain the scenario: there is a DB2 table with 5 columns, and one of the columns is a CLOB datatype containing XML. We need the other 4 columns in full but only a portion of the string from the XML column.
We decided to export the DB2 table to a .del file and process it with a Perl or Awk script. I need a script that processes the .del file so that I get column 1, column 2, column 3, then from column 4 (the XML) only the string between <text> and </text> (there may be multiple occurrences, so they can be separated by a number), plus column 5.
I know it will be a piece of cake for the experts.

Thanks,

try this

I=0
while read LINE
do
    TEXT[$I]=$(echo "$LINE" | grep -o '<text>.*</text>' | sed -e 's/<text>//' -e 's/<\/text>//')
    (( I ++ ))
done < $FILENAME

That creates an array where each item is the text contained between <text> and </text>.
Hope this helps

Thanks Frans for your prompt reply. I have a couple of questions that may sound stupid to you:
1) What extension should I save the file with? Is it .awk or .ksh?
2) Where do I put the input and output file names in the code?

Thanks,

It's bash scripting, so on the first line of the script you write #!/bin/bash (if the path of your shell is /bin/bash, of course).
The full code:

#!/bin/bash
while read LINE
do
    echo $LINE | grep -o '<text>.*</text>' | sed -e 's/<text>//' -e 's/<\/text>//'
    (( I ++ ))
done < INPUTFILE > OUTPUTFILE

You could use variables for the input and output files if they have to be reused later; otherwise just replace 'INPUTFILE' and 'OUTPUTFILE' with your own file names.

P.S. The extension doesn't matter! Make the script executable with chmod +x and run it.
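
To illustrate the remark about variables, here is a minimal sketch that wraps the same loop in a function and takes both file names as arguments. The names extract_text, INPUT and OUTPUT are made up for this example, and it still assumes a grep that supports -o and at most one <text>...</text> pair per line.

```shell
#!/bin/bash
# extract_text, INPUT and OUTPUT are hypothetical names chosen for this sketch.
extract_text() {
    INPUT=$1     # file to read
    OUTPUT=$2    # file to write
    while read -r LINE
    do
        # Print only the <text>...</text> part of the line, then strip the two tags.
        printf '%s\n' "$LINE" | grep -o '<text>.*</text>' | sed -e 's/<text>//' -e 's/<\/text>//'
    done < "$INPUT" > "$OUTPUT"
}

# Usage (file names are placeholders): extract_text INPUTFILE OUTPUTFILE
```

Keeping the file names in variables means they only have to be changed in one place if the script grows.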

Thank you very much Frans, I'll test and let you know.

---------- Post updated at 12:46 PM ---------- Previous update was at 12:05 PM ----------

Hi Frans,
There is no /bin/bash on the Unix box, but there is /usr/bin/bsh. Are these two the same? I tried to run the script, but I get the error below:

grep: Not a recognized flag: o
Usage: grep [-r] [-R] [-H] [-L] [-E|-F] [-c|-l|-q] [-insvxbhwy] [-p[parasep]] -e pattern_list...
[-f pattern_file...] [file...]
Please clarify.

Thanks.

bsh is the Bourne shell and bash is the Bourne Again shell, so they are not the same. The error itself comes from grep: the grep on your system doesn't support the -o option.
The -o option tells grep to output only the matching part of the line, which is helpful here.
Tell me what the output is when you use grep without that option.
I believe there's only one occurrence of <text>....</text> in each line.
To test faster, don't redirect to the output file ( > OUTPUTFILE ), so you see directly what happens.
One possibility is to extract the text with parameter expansion, like this:

LINE="lbla bla bla jcjfd<text>what i want</text>flh%(j blablabla" # for testing
LINE=$(echo "$LINE" | grep '<text>.*</text>') # Keeps the line only if it contains the match.
echo $LINE
LINE=${LINE##*<text>} # Deletes everything from the beginning through the last <text>
echo $LINE
LINE=${LINE%%</text>*} # Deletes everything from the first </text> to the end
echo $LINE

and see what happens
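
If grep has no -o on your system, sed alone can print just the captured part. This is only a minimal sketch, assuming at most one <text>...</text> pair per line (with several pairs on a line, the greedy .* keeps only the last one). The function name first_text is made up for the example.

```shell
# first_text is a hypothetical helper name; $1 is the file to read.
first_text() {
    # -n suppresses the default output; the p flag prints only the lines
    # where the substitution matched, reduced to the captured group \1.
    sed -n 's/.*<text>\(.*\)<\/text>.*/\1/p' "$1"
}

# Usage (file name is a placeholder): first_text INPUTFILE
```

Lines without a <text>...</text> pair produce no output at all, which is the same behaviour grep -o would give.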

No, there are multiple occurrences of this pattern in each record.

What a challenge to script that in shell!
Here, I've coded something which works:

LINE="lbla bla bla jcjfd<text>what i want</text>flh%(j blablabla<text>second occurence</text>juhgfiuhf<text>third occ</text>jkhgq"
I=1
LINE=${LINE#*<text>}	# Removes everything from the beginning up to the first "<text>"
LINE=${LINE%</text>*}	# Removes everything from the last "</text>" to the end
while echo $LINE | grep -q "<text>"	# more than one field
do	TEXT[$I]=${LINE%%</text>*}
	LINE=${LINE#*<text>}
	(( I ++ ))
done
TEXT[$I]=${LINE%</text>*}	# for the last one
{	# To see what we've done
	N=$I
	for I in $(seq $N)
	do	echo "TEXT[$I] = ${TEXT[$I]}"
	done
}

See if it works for you. If so, it can be embedded in the appropriate code to parse your files.
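
The script above relies on the difference between the single and double forms of # and %. A small sketch to show what each one removes; the variable names S, A, B, C and D and the sample string are made up for the example:

```shell
S='a<text>one</text>b<text>two</text>c'

A=${S#*"<text>"}      # shortest match removed from the start
B=${S##*"<text>"}     # longest match removed from the start
C=${S%"</text>"*}     # shortest match removed from the end
D=${S%%"</text>"*}    # longest match removed from the end

printf '%s\n' "$A"    # one</text>b<text>two</text>c
printf '%s\n' "$B"    # two</text>c
printf '%s\n' "$C"    # a<text>one</text>b<text>two
printf '%s\n' "$D"    # a<text>one
```

So # and % peel off one tag at a time, while ## and %% jump straight to the last or first tag in the string.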

I get a syntax error
"0403-057 Syntax error at line 10 : `<' is not expected."

Try putting quotes around <text> and </text>,
like "<text>" "</text>" or '<text>' '</text>'.

I get a different error now.
Document.bash[15]: I ++ : 0403-053 Expression is not complete; more tokens expected

It seems that (( I ++ )) is not implemented in your shell version.
(( I ++ )) means "I=I+1", so try this instead:

let I=I+1
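
If let is not accepted either, there are two other common ways to add one. A minimal sketch; which forms work depends on your shell, so treat it as something to try rather than a guarantee:

```shell
I=1
I=$((I + 1))         # POSIX arithmetic expansion
I=`expr "$I" + 1`    # external expr command, for shells without arithmetic expansion
echo "$I"            # prints 3
```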

I changed it, but now I'm getting something like this:
seq: Not found

try

# in place of
for I in $(seq $N)
# write
for (( I = 1; I <= N; I++ ))	# bash syntax

or the appropriate code to make a counting loop with your shell

I changed it but I'm getting a syntax error

Document.bash[18]: 0403-057 Syntax error at line 20 : `(' is not expected.

It would be nice if I could get a sample of the file you want to parse.
I've modified the code:

LINE="lbla bla bla jcjfd<text>what i want</text>flh%(j blablabla<text>second occurence</text>juhgfiuhf<text>third occ</text>jkhgq"
I=1
LINE=${LINE#*<text>}	# Removes everything from the beginning up to the first "<text>"
LINE=${LINE%</text>*}	# Removes everything from the last "</text>" to the end
while echo $LINE | grep -q "<text>"	# more than one field
do	TEXT[$I]=${LINE%%</text>*}
	LINE=${LINE#*<text>}
	let I=I+1
done
TEXT[$I]=${LINE%</text>*}	# for the last one
{	# To see what we've done
	N=$I
	I=1
	while [ $I -le $N ]
	do
		echo "TEXT[$I] = ${TEXT[$I]}"
		let I=I+1
	done
}

Thanks Frans, this code is working; I just changed <text> to "<text>".
Can you let me know how to modify this code to accommodate an input and an output file?

Thanks.

Revised script, without comments, that reads from a file and writes to a file:

I=1
while read LINE
do
	LINE=${LINE#*<text>}
	LINE=${LINE%</text>*}
	while echo $LINE | grep -q "<text>"
	do
		echo "$I ${LINE%%</text>*}"
		LINE=${LINE#*<text>}
		let I=I+1
	done
	echo "$I ${LINE%</text>*}"
	let I=I+1
done < inputfile > outputfile

The output is numbered lines with the text. If you don't want the numbers, remove the $I from the echo commands.
If this doesn't work, send me a big sample of the XML to process.

Wow, great Frans, the code works. Thank you very much for your help.

---------- Post updated at 02:14 PM ---------- Previous update was at 01:33 PM ----------

Hi Frans, one more issue.
The code works fine if the file is a single line, but if the XML spans multiple lines the code doesn't work.
Please find the XML below.

A version with numbering

LINE=`cat inputfile`
I=1
{
    LINE=${LINE#*<text>}    # Removes everything from the beginning up to the first "<text>"
    LINE=${LINE%</text>*}    # Removes everything from the last "</text>" to the end
    while echo $LINE | grep -q "<text>"    # more than one field
    do
        echo "$I ${LINE%%</text>*}"
        LINE=${LINE#*<text>}
        let I=I+1
    done
    echo "$I ${LINE%</text>*}"    # for the last one
} > outputfile

Or without numbering (simpler):

LINE=`cat inputfile`
{
    LINE=${LINE#*<text>}    # Removes everything from the beginning up to the first "<text>"
    LINE=${LINE%</text>*}    # Removes everything from the last "</text>" to the end
    while echo $LINE | grep -q "<text>"    # more than one field
    do
        echo "${LINE%%</text>*}"
        LINE=${LINE#*<text>}
    done
    echo "${LINE%</text>*}"    # for the last one
} > outputfile

Note that on the first line, the characters around cat inputfile are backticks [AltGr-7], not single quotes!
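
Since the original question asked about Awk, here is a minimal awk sketch of the same idea that reads the file line by line and copes with text spread over several lines. It assumes the tags are written exactly as <text> and </text> (no attributes, not nested) and that a tag is never split across two lines. The function name extract_all_text is made up for the example, and it only extracts the text; it does not deal with the other columns of the .del file.

```shell
# extract_all_text is a hypothetical helper name; $1 is the file to read.
extract_all_text() {
    awk '
    {
        # Append the current line to whatever is still pending from earlier lines.
        if (buf == "") {
            buf = $0
        } else {
            buf = buf "\n" $0
        }
        while ((s = index(buf, "<text>")) > 0) {
            rest = substr(buf, s + 6)      # everything after the opening tag
            e = index(rest, "</text>")
            if (e == 0) {                  # closing tag not read yet
                buf = substr(buf, s)       # keep from the opening tag onward
                break
            }
            t = substr(rest, 1, e - 1)     # text between the two tags
            gsub(/\n/, " ", t)             # put multi-line text on one output line
            n++
            print n, t                     # occurrence number, then the text
            buf = substr(rest, e + 7)      # continue after the closing tag
        }
        if (s == 0) buf = ""               # no opening tag pending, drop the rest
    }' "$1"
}

# Usage (file names are placeholders): extract_all_text inputfile > outputfile
```

Because only the unfinished part of the record is kept between lines, this avoids loading the whole file into one shell variable.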