Insert text after match in XML file

Having a little trouble getting this to work just right.

I have XML files in which I want to split some data.

There are two <Name> tags within each file.

I would like to take only the first tag and split its data.

Tag example. From this:

TAB<Name>smith, john</Name>

to

TAB<Name>smith, john</Name>
TAB<LastName>smith</LastName>
TAB<FirstName>john</FirstName>

I can get the replace and the tab to work, but it adds the FirstName and LastName lines after both matches of </Name> instead of just the first match.

Any help would be greatly appreciated.
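(For reference: GNU sed, which is what cygwin ships, can limit a substitution to the first match with the 0,/regex/ address range. A minimal sketch of that approach, not the poster's actual code:

# GNU sed only: the 0,/regex/ range ends at the first matching line,
# so the s/// below can fire at most once per input.
printf '\t<Name>smith, john</Name>\n\t<Name>doe, jane</Name>\n' |
sed '0,/<\/Name>/ s|<Name>\([^,<]*\), *\([^<]*\)</Name>|&\n\t<LastName>\1</LastName>\n\t<FirstName>\2</FirstName>|'

Only the first <Name> line gains LastName/FirstName lines; the second is left alone.)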

Please tell us what operating system and shell you're using, AND show us the code you have that is replacing all occurrences of the <Name> tag data.

Running cygwin on windows.

Bash shell.

Here is my not-so-elegant but gets-the-job-done code.

ls *.xml > /cygdrive/x/$$tmp
while read filename ; do
	# Get the Name tag, first occurrence
	a=$(less "$filename" | grep Name | head -n1)
	# Get only lastname, firstname
	b=$(echo "$a" | cut -d"<" -f2 | cut -d">" -f2)
	# Lastname
	d=$(echo "$b" | cut -d"," -f1)
	# Firstname
	e=$(echo "$b" | cut -d"," -f2)
	# Remove the space before firstname
	e=$(sed -e 's/^[[:space:]]*//' <<<"$e")
	f="<LastName>$d</LastName>"
	g="<FirstName>$e</FirstName>"
	sed -bi "/tagaftername/i\    $f" "$filename"
	sed -bi "/tagaftername/i\    $g" "$filename"
done < /cygdrive/x/$$tmp
rm -f /cygdrive/x/$$tmp
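
(As an aside, the temporary file and the ls parsing can be avoided by letting the shell expand the glob itself. A sketch of the same loop skeleton, assuming GNU grep's -m1 and GNU sed, with tagaftername still standing in for the poster's unique tag:

for filename in *.xml; do
	# First <Name> line, reduced to "lastname, firstname"
	b=$(grep -m1 '<Name>' "$filename" | cut -d'<' -f2 | cut -d'>' -f2)
	d=${b%%,*}            # lastname: text before the first comma
	e=${b#*,}; e=${e# }   # firstname: text after the comma, leading space stripped
	sed -bi -e "/tagaftername/i\    <LastName>$d</LastName>" \
	        -e "/tagaftername/i\    <FirstName>$e</FirstName>" "$filename"
done

Quoting "$filename" guards against spaces, and one sed invocation with two -e expressions replaces the two separate sed passes per file.)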

I couldn't get the first-occurrence code to work, so I decided to reverse things and go up instead of down: I anchor on the tag right after <Name>, which happens to be unique (tagaftername in the code is a placeholder for it).
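
(Side note on why that works: sed's i\ inserts before the matching line, so anchoring on the unique tag that follows </Name> lands the new lines right after the Name data. A tiny demo, with UNIQUETAG standing in for that tag:

printf '<Name>smith, john</Name>\n<UNIQUETAG/>\n' |
sed '/UNIQUETAG/i\    <LastName>smith</LastName>'

This prints the Name line, then the indented LastName line, then the UNIQUETAG line.)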

I would be tempted to try a different approach. The following invokes awk once no matter how many XML files you have to process. This should be a lot faster than invoking ls once, and invoking less, grep, and head once per file processed, cut four times per file processed, and sed three times per file processed. Try:

#!/bin/bash
awk -F'[<>]' '
function copyback(filename) {
	if(filename == "")
		return
	for(i = 1; i <= lc; i++)
		print line[i] > filename
	close(filename)
	lc = 0
}
FNR == 1 {
	copyback(lastfile)
	print "Processing " FILENAME
	found = 0
	lastfile = FILENAME
}
found {	line[++lc] = $0
	next
}
$2 == "Name" && $4 == "/Name" {
	line[++lc] = $0
	n = split($3, names, /, */)
	line[++lc] = sprintf("\t<LastName>%s</LastName>", names[1])
	if(n >= 2)
		line[++lc] = sprintf("\t<FirstName>%s</FirstName>", names[2])
	found = 1
}
END {	copyback(lastfile)
}' *.xml
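
(In case the field splitting is unclear: with -F'[<>]', both < and > act as field separators, which is why the script tests $2 and $4 and splits $3. A quick look, separate from the script above:

printf '\t<Name>smith, john</Name>\n' |
awk -F'[<>]' '{ for (i = 1; i <= NF; i++) printf "$%d = [%s]\n", i, $i }'

This prints $1 as the leading tab, $2 = Name, $3 = smith, john, $4 = /Name, and an empty $5.)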

If someone else wants to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.

One issue, kind of a big one: your code cuts everything out of the XML file before the first instance of the Name tag. So if the Name tag was at line 50, everything before that line gets deleted in the new file.

Also, is there a way to keep the line endings as CRLF, similar to sed -b?

I'll grant that the code is very fast: 5K files took 26 seconds vs. 15 min for my method.
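
(For anyone checking their own files: od -c shows exactly what the line endings are, \r \n for CRLF and a bare \n for LF; file.xml below is a placeholder name:

od -c file.xml | head -n 3
)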

Sorry about that. If you had given us a sample input file and the output that should be produced from it, I would have caught early on that lines at the start of input files were being dropped. There is nothing in my code that would remove <carriage-return> characters from existing lines in the file, but it didn't put <carriage-return>s in the lines it adds (and if you wanted DOS-format text files, that is something we would have expected you to state explicitly in your requirements). Does the following replacement come closer to meeting your requirements?

#!/bin/bash
awk -F'[<>]' '
function copyback(filename) {
	if(filename == "")
		return
	for(i = 1; i <= lc; i++)
		print line[i] > filename
	close(filename)
	lc = 0
}
FNR == 1 {
	copyback(lastfile)
	print "Processing " FILENAME
	found = 0
	lastfile = FILENAME
}
{	line[++lc] = $0
}
found {	next
}
$2 == "Name" && $4 == "/Name" {
	# The line[++lc] = $0 that was here moved to the unconditional block above.
	n = split($3, names, /, */)
	line[++lc] = sprintf("\t<LastName>%s</LastName>\r", names[1])
	if(n >= 2)
		line[++lc] = sprintf("\t<FirstName>%s</FirstName>\r", names[2])
	found = 1
}
END {	copyback(lastfile)
}' *.xml

Changes from the previous version: every input line is now saved unconditionally (so nothing before the first Name tag is dropped), and a <carriage-return> is appended to the two inserted lines.


Works great. I do see your additions use CRLF line endings, but all the other lines get changed to LF.

I've just added a unix2dos pass at the very end once processing is done; it only added 1 min for 5K files.
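
(In case it helps anyone else, that cleanup pass is just a sketch like this, run after the awk script finishes; cygwin's unix2dos converts the files in place:

unix2dos *.xml
)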

To recap, converting 5,000 XML files:

My code: 15 min
Your code: 1 min
unix2dos: 1 min

I really appreciate your assistance.
Thank you.

There is nothing in the code that I gave you that would remove a <carriage-return> from an input file when copying it back to the output file. (Maybe that is being done by the cygwin version of awk.) If your input files are UNIX-format text files, or if cygwin awk is stripping off the <carriage-return> when reading lines, and you want to create DOS text files, change the line:

 {	line[++lc] = $0

in my last suggestion to:

 {	line[++lc] = $0 "\r"

and you can then get rid of the time spent running unix2dos.
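
(A quick way to confirm the result without running the conversion: cygwin's file command reports DOS text files as "ASCII text, with CRLF line terminators":

file *.xml | head -n 3
)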


That did the trick.

Thank you.