UNIX/PERL script to convert XML file to pipe delimited format

Hello, I need to extract a few values from an XML file and write them to another file in pipe-delimited format. The header and footer of the pipe-delimited file will be constant.

Below is my sample XML file. I need to pull the values between the XML tags <Operator_info> and </Operator_info>. The values are NAME, PROFILE, ENABLE STATUS (if the status is ENABLED, leave it blank; if DISABLED, write the first character, D) and the LASTSIGNON date.

XML INPUT FILE:

<?xml version="1.0" encoding="UTF-8"?>
<OpList xmlns="urn:swift:saa:xsd:extractor">
<Operator_info xmlns="urn:swift:saa:xsd:extractor">
<!-- *** Extracted Data for Operators *** -->
 <ns5:OperatorDefn xmlns="urn:swift:saa:xsd:operator" xmlns:ns2="urn:swift:saa:xsd:authenticationservergroup" xmlns:ns3="urn:swift:saa:xsd:operatorprofile" xmlns:ns4="urn:swift:saa:xsd:unit" xmlns:ns5="urn:swift:saa:xsd:extractor" xmlns:ns6="urn:swift:saa:xsd:licenseddestination">
    <ns5:Operator>
        <Identifier>
            <Name>HELLO123</Name>
        </Identifier>
        <Description>Virat, Kholi</Description>
        <OperatorType>HUMAN</OperatorType>
        <AuthenticationType>LOCAL</AuthenticationType>
        <Profile>
            <ns3:Name>PROFILE1</ns3:Name>
        </Profile>
        <Unit>
            <ns4:Name>None</ns4:Name>
        </Unit>
        </ns5:Operator>
    <ns5:EnableStatus>ENABLED</ns5:EnableStatus>
    <ns5:ReEnableDate>n/a</ns5:ReEnableDate>
    <ns5:ApprovalStatus>APPROVED</ns5:ApprovalStatus>
    <ns5:LastChanged>25/07/14 20:15:35</ns5:LastChanged>
    <ns5:LastSignOn>18/10/15</ns5:LastSignOn>
    <ns5:LastEnabled>18/01/12 15:27:13</ns5:LastEnabled>
</ns5:OperatorDefn>
 <ns5:OperatorDefn xmlns="urn:swift:saa:xsd:operator" xmlns:ns2="urn:swift:saa:xsd:authenticationservergroup" xmlns:ns3="urn:swift:saa:xsd:operatorprofile" xmlns:ns4="urn:swift:saa:xsd:unit" xmlns:ns5="urn:swift:saa:xsd:extractor" xmlns:ns6="urn:swift:saa:xsd:licenseddestination">
    <ns5:Operator>
        <Identifier>
            <Name>HELLO12</Name>
        </Identifier>
        <Description>SACHIN,TEN</Description>
        <OperatorType>HUMAN</OperatorType>
        <AuthenticationType>LOCAL</AuthenticationType>
        <Profile>
            <ns3:Name>PROFILE2</ns3:Name>
        </Profile>
          <Profile>
            <ns3:Name>PROFILE3</ns3:Name>
        </Profile>
               <Unit>
            <ns4:Name>None</ns4:Name>
        </Unit>
       </ns5:Operator>
    <ns5:EnableStatus>DISABLED</ns5:EnableStatus>
    <ns5:ReEnableDate>n/a</ns5:ReEnableDate>
    <ns5:ApprovalStatus>APPROVED</ns5:ApprovalStatus>
    <ns5:LastChanged>14/02/12 17:34:35</ns5:LastChanged>
    <ns5:LastSignOn>n/a</ns5:LastSignOn>
    <ns5:LastEnabled>18/01/12 15:26:55</ns5:LastEnabled>
</ns5:OperatorDefn>
</Operator_info>
</OpList>

Expected Output:

20151027 GLOBAL USER GROUP  --> Header Record Constant
ACR|HELLO123|PROFILE1| |20151018|HELLO123
ACR|HELLO12|PROFILE2|D||HELLO12
ACR|HELLO12|PROFILE3|D||HELLO12
NUMBER OF DETAIL RECORDS:3  --> Footer Constant and should give the total record number

ACR is a constant value and should be the first field of each record.

Please use code tags as required by forum rules!

Any attempts from your side?

I'm new to development, so I have only just started on the code.

Try

awk '
BEGIN                   {print "20151027 GLOBAL USER GROUP"
                        }
/<.?Operator_info/      {ON = (substr ($1, 2, 1) == "O")
                        }

!ON                     {next
                        }

/<ns5:.*(EnableStat|SignOn)/ ||
/<(Identif|Profile)/    {IX = toupper (substr ($1, 6, 1))
                         if (IX ~ /[IT]/) getline
                         gsub (/^<[^>]*>|<[^<]*$/, "")
                         T[IX] = $0
                         if (IX == "L")         {printf "ACR|%s|%s|%s|%s|%s\n", T["T"], T["I"], (T["E"]~/^D/)?"D":"", T["L"], T["T"]
                                                 CNT++
                                        }
                        }
END                     {print "NUMBER OF DETAIL RECORDS: ", CNT
                        }
' file
20151027 GLOBAL USER GROUP
ACR|HELLO123|PROFILE1||18/10/15|HELLO123
ACR|HELLO12|PROFILE3|D|n/a|HELLO12
NUMBER OF DETAIL RECORDS:  2

It uses the last profile found for the same identifier; handling of several profiles per identifier is not implemented.

Thanks for your assistance, but I'm really struggling to understand the code and its flow. If possible, could you please explain it to me?

Here is RudiC's script with some slight modifications:

  • added comments,
  • added tracing to show which input lines are being processed, and what data is being captured from those lines (to make it easier for you to follow what the code is doing),
  • capture data from multiple <Profile> tags,
  • look for tags that do not appear at the start of a line (needed since you didn't originally use CODE tags when you posted your sample input), and
  • slightly reformat the trailer to match your expected output.

Note that neither of our scripts reformats the data found in the <ns5:LastSignOn> tags from DD/MM/YY to YYYYMMDD or changes n/a to an empty string. If that is important to you, try changing the code to do that on your own. If you can't get it to work, show us what you tried and the output it produced (in CODE tags) and we'll try to help fix it.
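As a starting point for that exercise, here is a minimal sketch of the date conversion, written as an awk function wrapped in a small shell function so it can be tried on its own. The names signon_to_yyyymmdd and fmt_date are hypothetical and appear in neither script; the sketch assumes the values are DD/MM/YY with years between 2000 and 2099, and that anything without exactly two slashes (such as n/a) should become an empty string.

```shell
#!/bin/sh
# Hypothetical helper, not part of either script in this thread.
# Reads one LastSignOn value per line and prints it as YYYYMMDD.
# Assumption: input is DD/MM/YY and every year is 2000-2099.
signon_to_yyyymmdd() {
    awk '
    function fmt_date(s,    p) {
        # split returns the number of pieces; anything other than
        # three pieces (for example n/a) is treated as "no date"
        if (split(s, p, "/") != 3) return ""
        return sprintf("20%02d%02d%02d", p[3], p[2], p[1])
    }
    { print fmt_date($0) }
    '
}

# Try it on the two values that appear in the sample input.
printf '%s\n' '18/10/15' 'n/a' | signon_to_yyyymmdd
```

If the function definition were copied into the awk script, printing fmt_date(T["L"]) instead of T["L"] should be enough to get the reformatted field. User-defined functions need a POSIX awk, which is the same reason for using /usr/xpg4/bin/awk or nawk on Solaris.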

# Use awk to run the following script with the variable trace set to 0.
awk -v trace=0 '	
# Before reading lines from the input file, print the header.
BEGIN {	print "20151027 GLOBAL USER GROUP"
}
# Look for lines containing "<Operator_info" or "</Operator_info".
/<.?Operator_info/ {
	# If the 2nd character of the 1st field is "O", set ON to 1; otherwise
	# (i.e., if the 2nd character is "/") set ON to 0.
	ON = (substr ($1, 2, 1) == "O")
}

# If ON is 0 (or has not yet been set), skip to next input line and ignore the
# following sections of this script for the current line.
!ON {	next
}

# Look for lines containing:
#	"<ns5:" followed by "EnableStat" or by "SignOn"
#	"<Identif"
# or	"<Profile"
/<ns5:.*(EnableStat|SignOn)/ || /<(Identif|Profile)/ {
	# Set IX to the uppercase version of the 6th character in the 1st field:
	# i.e.,	E for <ns5:"E"nableStatus
	#	I for <Prof"i"le
	#	L for <ns5:"L"astSignOn
	# or	T for <Iden"t"ifier>
	IX = toupper (substr ($1, 6, 1))
	# If trace is set to a non-0, non-empty-string value, print the current
	# line number and contents.
	if(trace) printf("line %d:%s\n", NR, $0)
	# If IX is "I" or "T" replace the current input line with the next
	# input line and continue processing.
	if (IX ~ /[IT]/) {
		getline
		# And, if trace is set to a non-0, non-empty-string value, print
		# the current line number and contents.
		if(trace) printf("Line %d:%s\n", NR, $0)
	}
	# Throw away everything from the start of the current input line up to
	# and including the 1st ">" and everything from the last "<" to the end
	# of the line.
	gsub (/^[^>]*>|<[^<]*$/, "")
	# If we are processing a <Profile> tag increment the number of <Profile>
	# tags we have seen and save the remaining data from the current line
	# in the array T with the subscript being the number of <Profile> tags
	# we have seen, otherwise, save the remaining data from the current line
	# in the array T with the subscript being the current value saved in IX.
	if(IX == "I")
		T[++pcnt] = $0
	else	T[IX] = $0
	# If trace is set to a non-0, non-empty-string value, print the array
	# element we just initialized.
	if(trace) printf("T[%s]=%s\n", (IX=="I") ? pcnt : IX, $0)
	# If we are processing an <ns5:LastSignOn> tag, print the results we
	# have accumulated for this <Identifier> tag.
	if(IX == "L") {
		# Print one line for each <Profile> tag we have seen.
		for(i = 1; i <= pcnt; i++) {
			# Note that if the data saved for the <ns5:EnableStatus>
			# flag was "DISABLED" (or, actually, started with "D"),
			# print "D" for that field; otherwise, print a <space>
			# for that field.
			printf("ACR|%s|%s|%s|%s|%s\n",
			    T["T"], T[i], (T["E"]~/^D/)?"D":" ", T["L"], T["T"])
			# Increment the number of detail records we have
			# printed.
			CNT++
		}
		# Clear the number of <Profile> tags we have seen.
		pcnt = 0
	}
}

# When we hit end-of-file on the last input file, print the trailer line.
END {	print "NUMBER OF DETAIL RECORDS:" CNT
}
# End the script to be run by awk and list the input files to be processed.
' file

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk .

When run as shown above (with tracing turned off) and with the sample data you supplied being contained in a file named file , it produces the output:

20151027 GLOBAL USER GROUP
ACR|HELLO123|PROFILE1| |18/10/15|HELLO123
ACR|HELLO12|PROFILE2|D|n/a|HELLO12
ACR|HELLO12|PROFILE3|D|n/a|HELLO12
NUMBER OF DETAIL RECORDS:3

If you change the line:

awk -v trace=0 '

to:

awk -v trace=1 '

to enable tracing, it produces the output:

20151027 GLOBAL USER GROUP
line 7:        <Identifier>
Line 8:            <Name>HELLO123</Name>
T[T]=HELLO123
line 13:        <Profile>
Line 14:            <ns3:Name>PROFILE1</ns3:Name>
T[1]=PROFILE1
line 20:    <ns5:EnableStatus>ENABLED</ns5:EnableStatus>
T[E]=ENABLED
line 24:    <ns5:LastSignOn>18/10/15</ns5:LastSignOn>
T[L]=18/10/15
ACR|HELLO123|PROFILE1| |18/10/15|HELLO123
line 29:        <Identifier>
Line 30:            <Name>HELLO12</Name>
T[T]=HELLO12
line 35:        <Profile>
Line 36:            <ns3:Name>PROFILE2</ns3:Name>
T[1]=PROFILE2
line 38:          <Profile>
Line 39:            <ns3:Name>PROFILE3</ns3:Name>
T[2]=PROFILE3
line 45:    <ns5:EnableStatus>DISABLED</ns5:EnableStatus>
T[E]=DISABLED
line 49:    <ns5:LastSignOn>n/a</ns5:LastSignOn>
T[L]=n/a
ACR|HELLO12|PROFILE2|D|n/a|HELLO12
ACR|HELLO12|PROFILE3|D|n/a|HELLO12
NUMBER OF DETAIL RECORDS:3

#!/usr/bin/perl

use strict;
use warnings;

# get the xml to work on from command line
my $fname = $ARGV[0] or die "Usage: $0 file.xml\n";

# open xml for reading
open my $in, '<', $fname or die "Cannot open $fname: $!\n";

# to keep detail record count
my $count = 0;

# translator for disabled and enabled status
my %status = ("DISABLED" => "D", "ENABLED" => " ",);

# custom block separator
$/ = "<\/ns5:OperatorDefn>";

# display header record constant
print "20151027 GLOBAL USER GROUP\n";

# start processing chunks from the xml file
while(<$in>) {
     my %record = (); # to catalog record details

     # collect only the wanted details
     while(/<((?:ns[35]:)?(?:Name|EnableStatus|LastSignOn))>(.*)<\/\1>/g){

         # add to catalog in the current loop
         push @{$record{$1}}, $2;
     }
     # display a piped-formatted record for each profile found
     for my $profile (@{$record{"ns3:Name"}}){
         # status n/a gets translated to empty display
         $record{"ns5:LastSignOn"}->[0] = "" if $record{"ns5:LastSignOn"}->[0] eq "n/a";

         # produce the formatted record
         printf "ACR|%s|%s|%s|%s|%s\n", $record{"Name"}->[0],
                                        $profile,
                                        $status{$record{"ns5:EnableStatus"}->[0]},
                                        $record{"ns5:LastSignOn"}->[0],
                                        $record{"Name"}->[0];
        $count++;  # another record processed
     }
}
close $in; # dismiss the xml file handle

# display footer constant tally
print "NUMBER OF DETAIL RECORDS: $count\n";

Save as karthi.pl
Run as perl karthi.pl karthi.xml

Don, Aia, RudiC: thanks for your input and ideas. I will work on it and let you know the result. I will post again if I face any issues. Thanks once again.

---------- Post updated 11-03-15 at 06:25 PM ---------- Previous update was 11-02-15 at 09:11 PM ----------

Hello Aia,

I'm getting the output written to a file when I run the following from the command line:
"perl my.pl file.xml > output.txt". But the command does not run when I call it from a ksh script. I need the output to be written to a file.

Moreover, for "LastSignOn" I need only the date, in the format YYYYMMDD. Could you please advise?

How are you invoking it inside the ksh script? Could you show it?

Make the following modifications to accommodate the new date format requirement.

#!/usr/bin/perl

use strict;
use warnings;

my $fname = $ARGV[0] or die "Usage: $0 file.xml\n";
open my $in, '<', $fname or die "Cannot open $fname: $!\n";

my $count = 0;
my %status = ("DISABLED" => "D", "ENABLED" => " ",);
$/ = "<\/ns5:OperatorDefn>";

print "20151027 GLOBAL USER GROUP\n";

while(<$in>) {
     my %record = ();
     while(/<((?:ns[35]:)?(?:Name|EnableStatus|LastSignOn))>(.*)<\/\1>/g){
         push @{$record{$1}}, $2;
     }

     for my $profile (@{$record{"ns3:Name"}}){
        #$record{"ns5:LastSignOn"}->[0] = "" if $record{"ns5:LastSignOn"}->[0] eq "n/a";

         printf "ACR|%s|%s|%s|%s|%s\n", $record{"Name"}->[0],
                                        $profile,
                                        $status{$record{"ns5:EnableStatus"}->[0]},
                                        format_day($record{"ns5:LastSignOn"}->[0]),
                                        $record{"Name"}->[0];
        $count++;
     }
}
close $in;

print "NUMBER OF DETAIL RECORDS: $count\n";

# subroutine to re-arrange date (naive)
sub format_day {
    my $date = shift;
    my @date_bits = split '/', $date;
    if (scalar @date_bits != 3){
        return "";
    }
    $date = join '/', ($date_bits[2]+2000, @date_bits[1,0]);
    return $date;
}

#!/usr/bin/ksh
BIN_DIR=$HOME/scripts
DATA_DIR=$HOME/scripts/config/data
CONFIG_DIR=$HOME/scripts/config
FILE=$DATA_DIR/SAMPLE.xml
DATE=`date -u +%Y%m%d`
MAILUSERS="abcd@gmail.com"
cd $DATA_DIR
touch SAMPLE_Query_$DATE.log
sample_query -operator -outputfile $FILE >> SAMPLE_Query_$DATE.log
if [ $? -eq 0 ]; then
   echo "SAMPLE Operator Report Created Successfully at `date`" >> SAMPLE_Query_$DATE.log
else
   echo "Error in creating SAMPLE Operator FILE" | mail -s "report failed @ $(hostname) at `date`" $MAILUSERS >> SAMPLE_Query_$DATE.log
exit
fi
#find  $DATA_DIR/SAMPLE_Query*.log -mtime +2 -exec rm {} \; -print >> SAMPLE_Query_$DATE.log
perl $BIN_DIR/karthi.pl $FILE > $DATA_DIR/ZA1P.NDM.SAMPLE.INTERFACE.GRP(+1)

In the above script I'm calling the perl script and using the ">" symbol to redirect the output.

perl $BIN_DIR/karthi.pl $FILE > $DATA_DIR/ZA1P.NDM.SAMPLE.INTERFACE.GRP(+1)

What do you think the part highlighted in red, (+1), does? Doesn't your script bark a syntax error at you?

Please, try:

perl $BIN_DIR/karthi.pl $FILE > $DATA_DIR/ZA1P.NDM.SAMPLE.INTERFACE.GRP.sample
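To see why the original line is rejected, note that unquoted parentheses are shell syntax in ksh and other Bourne-style shells, so they cannot appear bare in a file name. The sketch below only parses two candidate command lines without running them; syntax_ok is a hypothetical helper and GRP(+1) is a shortened stand-in for the real file name. It also shows that quoting the name would let the line parse, should a literal (+1) ever be wanted.

```shell
#!/bin/sh
# syntax_ok is a hypothetical helper: it asks sh to parse, but not
# run (-n), the command line passed as its first argument.
syntax_ok() {
    sh -n -c "$1" 2>/dev/null
}

# First candidate: bare parentheses in the redirection target.
# Second candidate: the same target inside double quotes.
for cmd in 'echo data > GRP(+1)' 'echo data > "GRP(+1)"'; do
    if syntax_ok "$cmd"; then
        echo "parses:       $cmd"
    else
        echo "syntax error: $cmd"
    fi
done
```

Running sh -n (or ksh -n) on the whole script is a quick way to catch this kind of error before the script is scheduled.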

Yep, thanks, that is working. One last correction: I don't need the "/" separator in the LastSignOn date. YYYYDDMM without "/" is fine.

---------- Post updated at 11:52 PM ---------- Previous update was at 11:35 PM ----------

I'm getting the below error but output is created.

Argument "12 22:14:36" isn't numeric in addition (+) at /home/all_adm/scripts/karthi.pl line 73, <$in> chunk 7.
Argument "12 21:52:45" isn't numeric in addition (+) at /home/all_adm/scripts/karthi.pl line 73, <$in> chunk 18.
Argument "12 04:31:52" isn't numeric in addition (+) at /home/all_adm/scripts/karthi.pl line 73, <$in> chunk 28.

In your original input the only example of `LastSignOn' with a date was <ns5:LastSignOn>18/10/15</ns5:LastSignOn> ; the others were `n/a', which gets converted to an empty field, as your output requires.
Therefore my code expects DD/MM/YY, which it rearranges as 2015/10/18 ; the two-digit year 15 is added to 2000, on the assumption that every date falls in the current century.

It appears that the `LastSignOn' values in your current input do not follow the originally posted example, so the code fails to format the date properly. To fix it, the subroutine format_day() needs to know every date format that the XML file can contain.
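One way to find out which formats actually occur is to classify the values before converting them. The following is a rough sketch in shell, not part of the Perl script; classify_signon and its four category names are made up for illustration, and the patterns only check the general shape of a value, not whether it is a valid calendar date.

```shell
#!/bin/sh
# classify_signon is a hypothetical helper. The category names
# (na, date, datetime, unknown) are assumptions for this sketch.
classify_signon() {
    case $1 in
        n/a)
            echo "na" ;;
        [0-3][0-9]/[01][0-9]/[0-9][0-9])
            echo "date" ;;
        [0-3][0-9]/[01][0-9]/[0-9][0-9]\ [0-2][0-9]:[0-5][0-9]:[0-5][0-9])
            echo "datetime" ;;
        *)
            echo "unknown" ;;
    esac
}

# Sample values: the first three shapes appear in this thread,
# the last one is invented to show the fallback branch.
for v in '18/10/15' '27/05/14 16:50:52' 'n/a' '2015-10-18'; do
    printf '%s -> %s\n' "$v" "$(classify_signon "$v")"
done
```

Any value reported as unknown would point to a format that the conversion routine does not yet handle.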

Apologies, my LastSignOn tag has both date and time. I need only the date, in the format YYYYDDMM, without any separator. Only 8 characters are allowed.

<ns5:LastSignOn>27/05/14 16:50:52</ns5:LastSignOn>

Please replace the format_day() subroutine with the following:

sub format_day {
    my $date = shift;
    my @date_bits = split '/|\s+', $date;
    if (scalar @date_bits < 3){
        return "";
    }
    return join "", ($date_bits[2]+2000, @date_bits[0,1]);
}

Hi Aia, thanks a lot for your help. I got the output I required. Thanks once again. This thread is completed now.