Hi Gurus,
I'm a total newbie to Perl and Awk scripting. Let me explain the scenario: there is a DB2 table with 5 columns, and one of the columns is a CLOB containing XML. We need the other 4 columns in full, but only a portion of the string from the XML column.
We decided to export the DB2 table to a .del file and process it with a Perl or Awk script. I need a script that processes the .del file so that I get column 1, column 2, column 3, then from column 4 (the XML) only the strings between <text> and </text> (there may be multiple occurrences, so they can be separated by a number), plus column 5.
I know it will be a piece of cake for the experts.
Thanks Frans for your prompt reply, but I have a couple of questions that may sound stupid to you.
1) What is the extension of the file I should save? Is it .awk or .ksh?
2) Where do I replace the input and output file name in the code?
It's a bash script, so the first line of the script should be #!/bin/bash (provided your shell is at /bin/bash, of course).
The full code:
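#!/bin/bash
# Print the text of every <text>...</text> element, one per line.
while read -r LINE
do
    # [^<]* keeps each match within a single element
    echo "$LINE" | grep -o '<text>[^<]*</text>' | sed -e 's/<text>//' -e 's/<\/text>//'
done < INPUTFILE > OUTPUTFILE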
You could use variables for the input and output files if they have to be reused later; otherwise just replace 'INPUTFILE' and 'OUTPUTFILE' with your own file names.
P.S. The extension does not matter! Make the script executable with chmod +x and run it.
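As a rough sketch of the variable approach, the loop can be wrapped in a function that receives the two file names as arguments. The function name extract_text and the file names in the usage comment are placeholders of mine, and the [^<]* pattern assumes the wanted text itself contains no '<'. It still relies on a grep that supports -o.

```shell
#!/bin/bash
# Sketch: extract_text INFILE OUTFILE
# Writes the content of every <text>...</text> element to OUTFILE, one per line.
extract_text() {
    infile=$1
    outfile=$2
    while read -r LINE
    do
        printf '%s\n' "$LINE" |
            grep -o '<text>[^<]*</text>' |
            sed -e 's/<text>//' -e 's/<\/text>//'
    done < "$infile" > "$outfile"
}

# Usage (placeholder file names):
#   extract_text export.del texts.out
```

Keeping the names in variables means the same script can be pointed at another export without editing the loop.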
Thank you very much Frans, I'll test and let you know.
---------- Post updated at 12:46 PM ---------- Previous update was at 12:05 PM ----------
Hi Frans,
There is no /bin/bash on the Unix box, but there is /usr/bin/bsh. Are these two the same? I tried to run the script, but I get the error below:
grep: Not a recognized flag: o
Usage: grep [-r] [-R] [-H] [-L] [-E|-F] [-c|-l|-q] [-insvxbhwy] [-p[parasep]] -e pattern_list...
[-f pattern_file...] [file...]
Please clarify.
bsh is the Bourne shell and bash is the Bourne Again shell, so they are not the same. The error itself comes from grep, though: the grep on your system does not seem to support the -o option.
The -o option tells grep to output only the matching part of the line, which is what makes it useful here.
Tell me what the output is when you run grep without that option.
I believe there is only one occurrence of <text>....</text> in each line.
To test faster, leave out the redirection to the output file ( > OUTPUTFILE ) so that you see directly what happens.
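A quick way to find out whether the local grep understands -o is to probe it on a throwaway string. This is only a sketch; the variable name GREP_HAS_O is made up for the example.

```shell
# Probe whether the local grep accepts the -o option.
if printf '%s\n' 'a<text>x</text>b' | grep -o '<text>.*</text>' >/dev/null 2>&1
then
    GREP_HAS_O=yes
else
    GREP_HAS_O=no
fi
echo "grep -o supported: $GREP_HAS_O"
```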
Another possibility is to extract the text with parameter expansion, like this:
LINE="lbla bla bla jcjfd<text>what i want</text>flh%(j blablabla" # for testing
LINE=$(echo "$LINE" | grep '<text>.*</text>') # Keeps the line only if it contains the match
echo "$LINE"
LINE=${LINE##*<text>} # Deletes everything from the beginning through "<text>"
echo "$LINE"
LINE=${LINE%%</text>*} # Deletes everything from "</text>" to the end
echo "$LINE"
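If grep -o is not available, sed can do the same job for the single-occurrence case. The sketch below assumes at most one <text>...</text> pair per line (with several pairs it would return only the last one); the function name extract_one is mine.

```shell
# Print the text between <text> and </text> for each line that contains the pair.
# Lines without the pair produce no output (-n suppresses them, p prints matches).
extract_one() {
    sed -n 's/.*<text>\(.*\)<\/text>.*/\1/p' "$@"
}

# Usage:
#   extract_one inputfile
#   some_command | extract_one
```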
What a challenge to script that in shell!
Here, I've coded something which works:
LINE="lbla bla bla jcjfd<text>what i want</text>flh%(j blablabla<text>second occurrence</text>juhgfiuhf<text>third occ</text>jkhgq"
I=1
LINE=${LINE#*<text>} # Removes everything from the beginning up to the first "<text>"
LINE=${LINE%</text>*} # Removes everything from the last "</text>" to the end
while echo "$LINE" | grep -q "<text>" # more than one field left
do
    TEXT[$I]=${LINE%%</text>*}
    LINE=${LINE#*<text>}
    (( I++ ))
done
TEXT[$I]=${LINE%</text>*} # for the last one
{ # To see what we've done
    N=$I
    for I in $(seq "$N")
    do
        echo "TEXT[$I] = ${TEXT[$I]}"
    done
}
See whether it works for you. If so, it can be embedded in the appropriate code to parse your files.
It would help if I could get a sample of the file you want to parse.
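For reference, here is a small sketch of the four trimming operators the script depends on, applied to a made-up string S. I've quoted the tag inside the pattern so it is always taken literally; that is my choice, not something the script above requires.

```shell
S='a<text>one</text>b<text>two</text>c'

cut_first_open=${S#*"<text>"}    # shortest prefix removed: one</text>b<text>two</text>c
cut_last_open=${S##*"<text>"}    # longest prefix removed:  two</text>c
cut_last_close=${S%"</text>"*}   # shortest suffix removed: a<text>one</text>b<text>two
cut_first_close=${S%%"</text>"*} # longest suffix removed:  a<text>one

printf '%s\n' "$cut_first_open" "$cut_last_open" "$cut_last_close" "$cut_first_close"
```

A single # or % removes the shortest match, and a doubled ## or %% removes the longest, which is why the loop uses %% to isolate the first field and # to step past it.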
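I've modified the code to use let and a plain while loop instead of (( )) and seq, which may not be available in your shell: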
LINE="lbla bla bla jcjfd<text>what i want</text>flh%(j blablabla<text>second occurrence</text>juhgfiuhf<text>third occ</text>jkhgq"
I=1
LINE=${LINE#*<text>} # Removes everything from the beginning up to the first "<text>"
LINE=${LINE%</text>*} # Removes everything from the last "</text>" to the end
while echo "$LINE" | grep -q "<text>" # more than one field left
do
    TEXT[$I]=${LINE%%</text>*}
    LINE=${LINE#*<text>}
    let I=I+1
done
TEXT[$I]=${LINE%</text>*} # for the last one
{ # To see what we've done
    N=$I
    I=1
    while [ "$I" -le "$N" ]
    do
        echo "TEXT[$I] = ${TEXT[$I]}"
        let I=I+1
    done
}
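The loop condition above starts an echo and a grep on every pass. As a sketch, the same "does the string still contain <text>?" test can be done inside the shell with a case pattern; the function name has_text_tag is hypothetical.

```shell
# Return success if the argument contains the literal string <text>.
has_text_tag() {
    case $1 in
        *"<text>"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage inside the loop:
#   while has_text_tag "$LINE"
#   do ...
#   done
```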
Revised script, without comments, reading from an input file and writing to an output file:
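I=1
while read -r LINE
do
    # Skip lines without <text>; otherwise they would be printed whole
    case "$LINE" in
        *"<text>"*) ;;
        *) continue ;;
    esac
    LINE=${LINE#*<text>}
    LINE=${LINE%</text>*}
    while echo "$LINE" | grep -q "<text>"
    do
        echo "$I ${LINE%%</text>*}"
        LINE=${LINE#*<text>}
        let I=I+1
    done
    echo "$I ${LINE%</text>*}"
    let I=I+1
done < inputfile > outputfile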
The output is a list of numbered lines with the text. If you don't want the numbers, remove the $I from the echo commands.
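Since the original question also mentioned Awk, here is a sketch of the same numbered output using awk's index and substr functions. It makes the same assumptions as the shell loop (every <text> has a matching </text> on the same line); the function name number_texts is mine.

```shell
# Print "N text" for every <text>...</text> element found in the input.
number_texts() {
    awk '{
        rest = $0
        while ((s = index(rest, "<text>")) > 0) {
            rest = substr(rest, s + 6)       # skip past "<text>" (6 characters)
            e = index(rest, "</text>")
            if (e == 0) break                # no closing tag on this line
            print ++n, substr(rest, 1, e - 1)
            rest = substr(rest, e + 7)       # skip past "</text>" (7 characters)
        }
    }' "$@"
}

# Usage:
#   number_texts inputfile > outputfile
```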
If this doesn't work, send me a larger sample of the XML to process.
LINE=`cat inputfile`
I=1
{
    LINE=${LINE#*<text>} # Removes everything from the beginning up to the first "<text>"
    LINE=${LINE%</text>*} # Removes everything from the last "</text>" to the end
    while echo "$LINE" | grep -q "<text>" # more than one field left
    do
        echo "$I ${LINE%%</text>*}"
        LINE=${LINE#*<text>}
        let I=I+1
    done
    echo "$I ${LINE%</text>*}" # for the last one
} > outputfile
Or without the numbering (simpler):
LINE=`cat inputfile`
{
    LINE=${LINE#*<text>} # Removes everything from the beginning up to the first "<text>"
    LINE=${LINE%</text>*} # Removes everything from the last "</text>" to the end
    while echo "$LINE" | grep -q "<text>" # more than one field left
    do
        echo "${LINE%%</text>*}"
        LINE=${LINE#*<text>}
    done
    echo "${LINE%</text>*}" # for the last one
} > outputfile
Note that the characters around cat inputfile on the first line are backticks [AltGr-7], not single quotes!
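To illustrate the difference, here is a small sketch that uses a temporary file (created with mktemp purely for the demonstration). Backticks run the command and capture its output, and $(...) does the same in newer shells, though it may not be available in an old Bourne shell. Single quotes just store the literal text.

```shell
tmpfile=$(mktemp)
printf '%s\n' 'a<text>x</text>b' > "$tmpfile"

with_backticks=`cat "$tmpfile"`   # command substitution: holds the file content
with_dollar=$(cat "$tmpfile")     # same result, newer syntax
with_quotes='cat "$tmpfile"'      # plain string: the command is never run

rm -f "$tmpfile"

echo "backticks:     $with_backticks"
echo "dollar-paren:  $with_dollar"
echo "single quotes: $with_quotes"
```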