Converting Text File into XML using Unix Shell Scripts

Hi everyone,

If someone out there could help me out with this problem, I would really appreciate it.

I am trying to convert a file into xml format using Unix shell scripts.

The file has fields with each field having a certain number of bytes, but the fields are not delimited by anything (e.g. whitespace). I need to get those fields into some sort of data structure so that I can use them to generate the XML using a simple for loop. The following is an example of what I am looking at:

Jack Johnson 90980288Harv 9090998
Joe Joie 8989 Sed 99488

I can't use whitespace as a delimiter.

Thank you guys very much,
I really appreciate it.

What kind of data structure do you require?
You can make it comma delimited by executing:

sed 's/ /,/g' inputfile.lst > outputfile.lst
This will convert all whitespace in inputfile.lst to commas in outputfile.lst.

The reason I cannot use a comma as a delimiter or anything like that is because the fields are not separated by a delimiter; each field is allocated a certain number of bytes.

For example:

Firstname (10) Last Name (10)

MickelsonJackonson (you will have fields that take up all the space in the field, so you won't have any commas in there)

I am thinking of using the substr function from awk, which will be able to read the exact bytes from each line and then put them into an array or something to process.
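Something along these lines is what I have in mind (the column offsets and tag names below are placeholders until I pin down the real layout):

awk '{
    first = substr($0, 1, 10);    # columns 1-10: first name (example offsets only)
    last  = substr($0, 11, 10);   # columns 11-20: last name
    id    = substr($0, 21, 8);    # columns 21-28: id number
    print "<record>";
    print "  <first>" first "</first>";
    print "  <last>" last "</last>";
    print "  <id>" id "</id>";
    print "</record>";
}' inputfile.lst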

Hmm, this sounds similar to something I am working on myself. I need to judge the varying length of a string and create independent results. I don't think substr will work for me, though, but it sounds like it may be one solution for your quandary.

If your "fields" are of fixed width, you can take advantage of gawk's "FIELDWIDTHS" internal variable for processing.

If you don't have gawk installed on your system, take a look at this thread on comp.lang.awk about how to "simulate" FIELDWIDTHS in other awks: FIELDWIDTH
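For example (the widths below are just a guess from the sample lines; adjust them to the real layout):

gawk 'BEGIN { FIELDWIDTHS = "10 10 8 10" }    # assumed widths, not confirmed
{
    # $1..$4 are now the fixed-width fields, still space-padded
    print "<record><name>" $1 $2 "</name><id>" $3 "</id></record>";
}' inputfile.lst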

Thanks very much, guys. I was able to use the substr function to read in the fixed columns. I am also considering gawk's FIELDWIDTHS, but would that affect performance? I will be working on around 800,000 records, so it also needs to be pretty efficient.

Also, is there a quick function in awk that lets you strip leading and trailing whitespace from a string? I was thinking of using the 'split' function to store the string in an array and then printing out the array, but that seems like overkill. Any help would be appreciated.

Please let me know
Laud

function trim(str)
{
    sub("^[ ]*", "", str);   # strip leading spaces
    sub("[ ]*$", "", str);   # strip trailing spaces
    return str;
}
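A quick way to try it out (the echoed string is just a stand-in for one of your padded fields):

echo "   Mickelson   " | awk '
function trim(str)
{
    sub("^[ ]*", "", str);
    sub("[ ]*$", "", str);
    return str;
}
{ print "[" trim($0) "]" }'

which prints [Mickelson].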

That depends on the number of characters in the field; ASCII is an encoding where each character is represented as exactly one byte.

It is clear from the sample data you have given that there is a set pattern from alpha characters to digits.

C has isalpha() and isdigit(), and most other languages have something similar.

If you use regular expressions, [A-Z] or [a-z] and [0-9] will work.

You have not given enough information about your sample data, but you might try splitting on whitespace or at the transition from an alpha character to a digit.
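As a rough sketch of the alpha-to-digit idea in awk (purely illustrative; it assumes one run of letters followed by digits):

awk '{
    if (match($0, /[0-9]/)) {              # locate the first digit
        name = substr($0, 1, RSTART - 1);  # everything before it
        num  = substr($0, RSTART);         # the digits onward
        print name "|" num;
    }
}' inputfile.lst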

Do you see anything wrong with this code? I cannot get it to run. I am basically trying to run some awk code within a Korn shell script. I want to run the trim function, but I get a complaint when I run the script.

#!/bin/ksh

awk '

function trim(str)
{
sub("^[ ]*", "", str);
sub("[ ]*$", "", str);
return str;
}

gref = substr($0,2,32);
gref = trim(gref);

}'

  1. [as always] if on Solaris, use nawk instead of awk.
  2. where is your input to awk?
  3. what is your ACTION? - you have mis-matched '{}' for your action. I think you want something like this:
#!/bin/ksh

nawk '
function trim(str)
{
   sub("^[ ]*", "", str);
   sub("[ ]*$", "", str);
   return str;
}

{
  gref = trim(substr($0,2,32));
  print gref
}' path2YourInputFileGoesHere

Thanks vgersh99.

I am using Solaris.

My input to awk is a file. I did have the braces messed up, but I got it to work. I am very grateful.

Regards,
Laud