Converting Text File into XML using Unix Shell Scripts

Hi everyone,

If someone out there could help me out with this problem, I would really appreciate it.

I am trying to convert a file into xml format using Unix shell scripts.

The file has fields with each field having a certain number of bytes, but the fields are not delimited by anything (e.g. whitespace). I need to get those fields into some sort of data structure so that I can use them to generate the XML using a simple for loop. The following is an example of what I am looking at:

Jack Johnson 90980288Harv 9090998
Joe Joie 8989 Sed 99488

I can't use whitespace as a delimiter.

Thank you guys very much,
I really appreciate it.

What kind of data structure do you require?
You can make it comma delimited by executing:

sed 's/ /,/g' inputfile.lst > outputfile.lst
This will convert all whitespace in inputfile.lst to commas in outputfile.lst.

The reason I cannot use a comma as a delimiter or anything like that is because the fields are not separated by a delimiter; each field is allocated a certain number of bytes.

For example:

Firstname (10) Last Name (10)

MickelsonJackonson (you will have fields that take up all the space in the field, so you won't have any commas in there)

I am thinking of using the substr function from awk, which will be able to read the exact bytes from each line and then put them into an array or something to process.
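Something along these lines is what I have in mind (the column offsets and tag names below are placeholders until I pin down the real layout):

awk '{
    first = substr($0, 1, 10);    # columns 1-10: first name (example offsets only)
    last  = substr($0, 11, 10);   # columns 11-20: last name
    id    = substr($0, 21, 8);    # columns 21-28: id number
    print "<record>";
    print "  <first>" first "</first>";
    print "  <last>" last "</last>";
    print "  <id>" id "</id>";
    print "</record>";
}' inputfile.lst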

Hmm, this sounds similar to something I am working on myself. I need to judge the varying length of a string and create independent results. I don't think substr will work for me, though, but it sounds like it may be one solution for your quandary.

If your "fields" are of fixed width, you can take advantage of gawk's "FIELDWIDTHS" internal variable for processing.

If you don't have gawk installed on your system, take a look at this thread on comp.lang.awk about how to "simulate" FIELDWIDTHS in other awks: FIELDWIDTH
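For example (the widths below are just a guess from the sample lines; adjust them to the real layout):

gawk 'BEGIN { FIELDWIDTHS = "10 10 8 10" }    # assumed widths, not confirmed
{
    # $1..$4 are now the fixed-width fields, still space-padded
    print "<record><name>" $1 $2 "</name><id>" $3 "</id></record>";
}' inputfile.lst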

Thanks very much, guys. I was able to use the substr function to read in the fixed columns. I am also considering gawk's FIELDWIDTHS, but would that affect performance? I will be working on around 800,000 records, so it also needs to be pretty efficient.

Also, is there a quick function in awk that lets you strip leading and trailing whitespace from a string? I was thinking of using the 'split' function to store the string in an array and then printing out the array, but that seems like overkill. Any help would be appreciated.

Please let me know
Laud

function trim(str)
{
    sub("^[ ]*", "", str);   # strip leading spaces
    sub("[ ]*$", "", str);   # strip trailing spaces
    return str;
}
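A quick way to try it out (the echoed string is just a stand-in for one of your padded fields):

echo "   Mickelson   " | awk '
function trim(str)
{
    sub("^[ ]*", "", str);
    sub("[ ]*$", "", str);
    return str;
}
{ print "[" trim($0) "]" }'

which prints [Mickelson].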

That depends on the number of characters in the field; ASCII is an encoding where each character is represented as exactly one byte.

It is clear from the sample data you have given that there is a set pattern from alpha characters to digits.

C has isalpha() and isdigit(), and most other languages have something similar.

If you use regular expressions, [A-Z] or [a-z] and [0-9] will work.

You have not given enough information about your sample data, but you might try splitting on whitespace or at the transition from an alpha character to a digit.
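As a rough sketch of the alpha-to-digit idea in awk (purely illustrative; it assumes one run of letters followed by digits):

awk '{
    if (match($0, /[0-9]/)) {              # locate the first digit
        name = substr($0, 1, RSTART - 1);  # everything before it
        num  = substr($0, RSTART);         # the digits onward
        print name "|" num;
    }
}' inputfile.lst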

Do you see anything wrong with this code? I cannot get it to run. I am basically trying to run some awk code within a Korn shell script. I want to run the trim function, but I get a complaint when I run the script.

#!/bin/ksh

awk '

function trim(str)
{
sub("^[ ]*", "", str);
sub("[ ]*$", "", str);
return str;
}

gref = substr($0,2,32);
gref = trim(gref);

}'

  1. [as always] if on Solaris, use nawk instead of awk.
  2. where is your input to awk?
  3. what is your ACTION? - you have mis-matched '{}' for your action. I think you want something like this:
#!/bin/ksh

nawk '
function trim(str)
{
   sub("^[ ]*", "", str);
   sub("[ ]*$", "", str);
   return str;
}

{
  gref = trim(substr($0,2,32));
  print gref
}' path2YourInputFileGoesHere

Thanks vgersh99.

I am using Solaris.

My input to awk is a file. I did have the braces messed up, but I got it to work. I am very grateful.

Regards,
Laud