Using sed/awk to extract XML at end of file

Hello everyone,

Firstly, I do not require a lot of help. I am right at the end of finishing my script but cannot find a solution to the last part.

What I need to do is prompt the user for a file to work with, which I have done,
and prompt the user for an output file, which is also done.

#!/bin/bash
echo "Get my XML"
echo -n "Enter the source file name : "
read infile
echo -n "Enter output file name : "
read outfile
sed -n 1433,1615p $infile >> $outfile
echo "Data should be in $outfile if this compiled correctly"

The file is a .txt and is massive. I only need the last 200 lines or so, which are XML. I know I can use sed to specify which line numbers to extract to the output file, but not all documents that use this script will need the last 200 lines; it could be the middle 50.

Which leads me on to my problem: using sed or awk, I would like to extract all the XML after 'Sending XML', which is consistent across all documents, up until the words ' Message sending ended.'

I have been reading various articles and forums, which have helped and led me to my current example. However, using line numbers is not feasible because they will differ across the documents, whereas the words above are always present.

I really hope someone can help, as I have spent far too much time on this!

Thanks!

H

sed -n '/Sending XML/,/Message sending ended/p' "${infile}" > "${outfile}"
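
For anyone following along, here is a small self-contained sketch of how that address range behaves. The sample file and its contents are made up for illustration; only the two marker strings come from the question above.

```shell
#!/bin/bash
# Build a small sample file (hypothetical content) containing both markers.
sample=$(mktemp)
printf '%s\n' \
  'header noise' \
  ' Sending XML          :' \
  ' <document><a>1</a>' \
  ' </document>' \
  ' Message sending ended.' \
  'trailer noise' > "$sample"

# -n turns off sed's automatic printing; p prints only lines in the range.
# The range opens on the first line matching "Sending XML" and closes on
# the next line matching "Message sending ended" (both lines are included).
extracted=$(sed -n '/Sending XML/,/Message sending ended/p' "$sample")
printf '%s\n' "$extracted"

rm -f "$sample"
```

Note that the two marker lines are part of the output, so they may need to be dropped afterwards if only the XML is wanted.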

I cannot thank you enough. I have posted on so many forums and no one ever gets back to me! I will be using this more often!

I now have another issue: the XML I have been left with has line breaks, see below:

 Sending XML          :

 <document> <docRequestID>2010-10-22-11.57.22.903813</docRequestID><docStylesh

 eet>Thunderhead</docStylesheet><requestType>claim</requestType><level0Object>

  <objectType>transaction</objectType><objectID>900</objectID><objectSeq>1</ob
 Line break has affected tag
 jectSeq><level1Object> <objectType>lifelite</objectType><objectID>901</object

As you can see, the line break has split this tag: half of it is on the line below, and the XML cannot be used like that. I would like to remove the line break and the space at the start of the next line, so that the tag reads: </objectSeq>

Thank you so much for your help!!

nawk '/Sending XML/,/Message sending ended/' ${infile} | nawk 'NF && /^ *</{printf("%s%c", $0, (/ *<.*>$/)?ORS:"");next}NF' > ${outfile}

Thank you so much, I will try it shortly! I don't have to install nawk, do I? Is it a standard tool?

I tried running the code you gave me and I get an error saying that it cannot revert the file... Unexpected error: Invalid UTF-8 sequence in input.

I'm having a look online to see what that means. If you have any ideas, let me know.

Cheers

What OS are you on?
If you have nawk/gawk - use either one.

I'm on Ubuntu. I tried using both nawk and gawk, and neither worked. The awk version is 3.1.6.

Can you check if there are any non-printable characters in the XML portion? Especially around the line break that has affected the XML tag.

tyler_durden

I don't think there are any... just a whitespace that separates the tags in some instances, not all.

Cheers

What's the output of this command?

sed -n '/Sending XML/,/Message sending ended/p' your_file | od -bc

tyler_durden
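
For readers unfamiliar with od, a tiny sketch of what to look for in its output. The sample input here is invented; it simply contains a DOS-style line ending followed by a space.

```shell
#!/bin/bash
# A two-line sample whose first line ends in CR LF (hypothetical data).
dump=$(printf 'ab\r\n c\n' | od -bc)
printf '%s\n' "$dump"
# Each byte is shown twice: as a 3-digit octal number (-b) and, on the
# line below it, as a character (-c). A carriage return appears as
# 015 / \r, a line feed as 012 / \n, and a plain space as 040.
```

Bytes that are not plain ASCII show up as octal values of 200 or higher, which is one way to spot the kind of character being asked about.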

Hello, I added your code into my script. I'm not sure what file you were referring to, so I have attached what I used.

#!/bin/bash
echo "getXML"

echo -n "Enter the source file name WITH extension : "
read infile 
echo "Processing... : " 
sleep 1 
echo -n "Enter output file name (extension not applicable) : "
read outfile
sed -n '/Sending XML/,/Message sending ended/p' $outfile | od -bc
echo "Processing XML... : "
sleep 1
echo "Success..Data should be in '$outfile' if compiled correctly"

The outcome...
"Unexpected error: Incomplete multibyte sequence in input" when I open the outfile created.

On the terminal I got loads of different numbers flying across the screen. I'm not sure if they are even related to the infile I have. A sample is attached below:

e   l   d   I   D   >   <   f   i   e   l   d   N   a   m
0031640 145 076 144 141 164 145 117 015 012 040 146 102 151 162 164 150
          e   >   d   a   t   e   O  \r  \n       f   B   i   r   t   h
0031660 074 057 146 151 145 154 144 116 141 155 145 076 074 146 151 145
          <   /   f   i   e   l   d   N   a   m   e   >   <   f   i   e
0031700 154 144 126 141 154 165 145 057 076 074 057 157 142 152 145 143
          l   d   V   a   l   u   e   /   >   <   /   o   b   j   e   c
0031720 164 106 151 145 154 144 076 074 157 142 152 145 143 164 106 151
          t   F   i   e   l   d   >   <   o   b   j   e   c   t   F   i
0031740 145 154 144 076 040 074 146 151 145 154 144 111 104 076 061 065
          e   l   d   >       <   f   i   e   l   d   I   D   >   1   5
0031760 061 067 074 057 146 151 145 015 012 040 154 144 111 104 076 074
          1   7   <   /   f   i   e  \r  \n       l   d   I   D   >   <
0032000 146 151 145 154 144 116 141 155 145 076 154 151 146 145 164 151
          f   i   e   l   d   N   a   m   e   >   l   i   f   e   t   i
0032020 155 145 123 154 141 101 155 157 165 156 164 074 057 146 151 145
          m   e   S   l   a   A   m   o   u   n   t   <   /   f   i   e
0032040 154 144 116 141 155 145 076 074 146 151 145 154 144 126 141 154
          l   d   N   a   m   e   >   <   f   i   e   l   d   V   a   l
0032060 165 145 076 061 070 060 060 060 060 060 074 057 146 151 145 154

Thanks,

H
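
The dump above shows each wrap as \r \n followed by a single space (octal 015 012 040), landing in the middle of a tag. Note also that the script above passes $outfile to sed rather than $infile, and pipes the result to od instead of writing it to a file. What follows is a minimal sketch, not a tested fix for the real trace file: it reads from the source file, drops the two marker lines, strips the carriage returns and the one leading space, and joins what is left into a single line. The function name extract_xml is hypothetical, and the sketch assumes every wrapped line starts with exactly one padding space and that the file contains only one such block.

```shell
#!/bin/bash
# extract_xml INFILE OUTFILE   (hypothetical helper, see assumptions above)
extract_xml() {
    # 1. keep only the block between the two markers from the thread
    # 2. drop the marker lines themselves
    # 3. strip carriage returns (octal 015 in the dump)
    # 4. strip the single padding space that follows each wrap
    # 5. join the pieces into one line
    sed -n '/Sending XML/,/Message sending ended/p' "$1" |
        sed -e '/Sending XML/d' -e '/Message sending ended/d' |
        tr -d '\r' |
        sed 's/^ //' |
        tr -d '\n' > "$2"
    # finish the output file with a newline
    echo >> "$2"
}
```

In the original script this could be called as extract_xml "$infile" "$outfile" after the two read commands. Removing only one leading space is deliberate: where a line starts with two spaces, the second one may be a real separator between tags.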

What I wanted was for you to execute my command at your command prompt (the Linux dollar-prompt).

The file I was referring to was the source file. That is, the one that is being read in your Bash script.

Since you are going to test your Bash script, I am sure you know the name of the source file that you'll enter at the prompt above. That file name will be assigned to the variable "infile" in your script.

Now, let's say the source file name you have in mind is "abc.txt".

This file has some XML stuff embedded in it. My hunch is that there are Unicode characters in that XML stuff.

Try this on your Linux dollar-prompt -

perl -lne 'binmode(STDOUT, ":utf8"); while(/(.)/g){print $.,"\t",$1,"\t",ord($1) if ord($1) > 255}' abc.txt

Replace the string "abc.txt" by the actual name of your source file name.

tyler_durden

I tried that and replaced the file with my source file, which in my case was trace.txt. I am not sure where the output file is, though. I checked trace.txt and it was the same doc. Do I not need to specify where the output goes?

Sorry if I'm being slow, I only started learning three weeks ago.

No, you do not need to specify the output file name. The output will be displayed right after your command.

(A) If you have Ubuntu Gnome, then open up "Gnome Terminal" or "Terminal".

(B) If you have Ubuntu KDE (Kubuntu?), then open up "Konsole".

You'll see a dollar prompt in the terminal window.

Type in the following command at the prompt, in a single line.

$ perl -lne 'binmode(STDOUT, ":utf8"); while(/(.)/g){print $.,"\t",$1,"\t",ord($1) if ord($1) > 255}' trace.txt

Don't type that $ symbol. That's just for you to know that the stuff from "perl -lne .... " has to be typed at the $ prompt.

You could, alternatively, copy+paste the perl command from this webpage.

When you press the Enter or Return key after "trace.txt" the output will be displayed right there on the terminal window - right below your command.

Copy your command and the output from the terminal window and post them over here.

(Put that Bash script aside for the time being. You'd want to investigate the contents of the trace.txt source file first.)

tyler_durden

OK, done that. It just goes to the next line and returns no output whatsoever:

27@ubuntu:~/xml_test$ perl -lne 'binmode(STDOUT, ":utf8"); while(/(.)/g){print $.,"\t",$1,"\t",ord($1) if ord($1) > 255}' trace.txt
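
One possible reason for the empty output, offered as a sketch rather than a diagnosis: by default perl reads the file as raw bytes, so ord($1) can never exceed 255 and the test in that command cannot fire, whatever the file contains. A byte-level check avoids the problem. The two function names below are hypothetical; trace.txt is the file name used in the thread.

```shell
#!/bin/bash
# Count bytes that are neither printable ASCII nor tab, LF or CR.
# LC_ALL=C makes tr work on single bytes regardless of the user's locale.
count_odd_bytes() {
    LC_ALL=C tr -d '\11\12\15\40-\176' < "$1" | wc -c | tr -d ' '
}

# Count carriage returns (octal 015), i.e. DOS-style line endings.
count_cr_bytes() {
    LC_ALL=C tr -cd '\15' < "$1" | wc -c | tr -d ' '
}

# Example use, assuming trace.txt is in the current directory.
if [ -f trace.txt ]; then
    echo "odd bytes: $(count_odd_bytes trace.txt)"
    echo "CR bytes:  $(count_cr_bytes trace.txt)"
fi
```

Since the od dump earlier already shows \r \n pairs, the CR count is expected to be non-zero. A non-zero first count would point to bytes outside plain ASCII, which may be what the editor is objecting to.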