Moving XML tag/contents after specific XML tag within same file

pchang · August 14, 2018, 4:18pm

Hi Forum.

I need to move the <AdditionalAccountHolders> tag and its contents so that they come right after the <accountHolderName> tag within the same XML file, but I'm not sure how to accomplish this in a Unix script.

Any feedback will be greatly appreciated.

Thanks.

Before Data:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<holders xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

<holder>
<property>
  <accountHolderName>aaa</accountHolderName>
  <payeeName></payeeName>
  <jointAccountRelationshipCode>xx</jointAccountRelationshipCode>
  <accountInstrumentNumber>1234567890</accountInstrumentNumber>
  <balanceType>6</balanceType>
  <dateOfLastTransaction>20071217</dateOfLastTransaction>
  <balanceAmount>4076</balanceAmount>
  <streetAddress1>55 West Road</streetAddress1>
  <streetAddress2></streetAddress2>
  <city>Chilliwack</city>
  <province>BC</province>
  <postalCode>V2L 3S8</postalCode>
  <countryCode>CA</countryCode>   
  <AdditionalAccountHolders>
    <additionalAccountHolderName>bbb</additionalAccountHolderName>
  </AdditionalAccountHolders>
</property>
</holder>

After Data:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<holders xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

<holder>
<property>
  <accountHolderName>aaa</accountHolderName>
  <AdditionalAccountHolders>
    <additionalAccountHolderName>bbb</additionalAccountHolderName>
  </AdditionalAccountHolders>
  <payeeName></payeeName>
  <jointAccountRelationshipCode>xx</jointAccountRelationshipCode>
  <accountInstrumentNumber>1234567890</accountInstrumentNumber>
  <balanceType>6</balanceType>
  <dateOfLastTransaction>20071217</dateOfLastTransaction>
  <balanceAmount>4076</balanceAmount>
  <streetAddress1>55 West Road</streetAddress1>
  <streetAddress2></streetAddress2>
  <city>Chilliwack</city>
  <province>BC</province>
  <postalCode>V2L 3S8</postalCode>
  <countryCode>CA</countryCode>   
</property>
</holder>

vgersh99 · August 14, 2018, 4:54pm

a bit verbose, but...

awk '
FNR==NR {
   if (/<AdditionalAccountHolders/) {s=$0;f++;next}
   if (f) s= s ORS $0
   if (/<[/]AdditionalAccountHolders/) f=0
   next
}
/<accountHolderName/ {print $0 ORS s;next}
/<AdditionalAccountHolders/,/<[/]AdditionalAccountHolders/ {next}
1' myXMLfile myXMLfile

OR a bit less verbose:

awk '
  FNR==NR && /<AdditionalAccountHolders/,/<[/]AdditionalAccountHolders/ {s=(s)?s ORS $0:$0;next}
  /<AdditionalAccountHolders/,/<[/]AdditionalAccountHolders/ {next}
  /<accountHolderName/ {print $0 ORS s;s="";next}
  1' myXMLfile myXMLfile

pchang · August 15, 2018, 9:33am

Hi vgersh99.

Thank you for the awk suggestions. I tried both versions with 2 records in my XML file:

1) code #1 doesn't return any records
2) code #2 does return records, but it puts the <AdditionalAccountHolders> information from record #1 into record #2, and record #1 has no <AdditionalAccountHolders> tag in the new file. Please see the sample data below:

Before Data:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<holders xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

<holder>
<property>
  <accountHolderName>aaa</accountHolderName>
  <payeeName></payeeName>
  <jointAccountRelationshipCode>aa</jointAccountRelationshipCode>
  <accountInstrumentNumber>1234567890</accountInstrumentNumber>
  <balanceType>6</balanceType>
  <dateOfLastTransaction>20071217</dateOfLastTransaction>
  <balanceAmount>4076</balanceAmount>
  <streetAddress1>55 West Road</streetAddress1>
  <streetAddress2></streetAddress2>
  <city>Chilliwack</city>
  <province>BC</province>
  <postalCode>V2L 3S8</postalCode>
  <countryCode>CA</countryCode>   
  <AdditionalAccountHolders>
    <additionalAccountHolderName>bbb</additionalAccountHolderName>
  </AdditionalAccountHolders>
</property>
</holder>


<holder>
<property>
  <accountHolderName>ddd</accountHolderName>
  <payeeName></payeeName>
  <jointAccountRelationshipCode>xx</jointAccountRelationshipCode>
  <accountInstrumentNumber>999999990</accountInstrumentNumber>
  <balanceType>6</balanceType>
  <dateOfLastTransaction>20081217</dateOfLastTransaction>
  <balanceAmount>556</balanceAmount>
  <streetAddress1>50 Roadside</streetAddress1>
  <streetAddress2></streetAddress2>
  <city>Ontario</city>
  <province>ON</province>
  <postalCode>P3L 3S8</postalCode>
  <countryCode>CA</countryCode>   
  <AdditionalAccountHolders>
    <additionalAccountHolderName>eee</additionalAccountHolderName>
  </AdditionalAccountHolders>
</property>
</holder>

After Data:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<holders xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

<holder>
<property>
  <accountHolderName>aaa</accountHolderName>
  <payeeName></payeeName>
  <jointAccountRelationshipCode>aa</jointAccountRelationshipCode>
  <accountInstrumentNumber>1234567890</accountInstrumentNumber>
  <balanceType>6</balanceType>
  <dateOfLastTransaction>20071217</dateOfLastTransaction>
  <balanceAmount>4076</balanceAmount>
  <streetAddress1>55 West Road</streetAddress1>
  <streetAddress2></streetAddress2>
  <city>Chilliwack</city>
  <province>BC</province>
  <postalCode>V2L 3S8</postalCode>
  <countryCode>CA</countryCode>   
</property>
</holder>


<holder>
<property>
  <accountHolderName>ddd</accountHolderName>
  <AdditionalAccountHolders>
    <additionalAccountHolderName>bbb</additionalAccountHolderName>
  </AdditionalAccountHolders>
  <payeeName></payeeName>
  <jointAccountRelationshipCode>xx</jointAccountRelationshipCode>
  <accountInstrumentNumber>999999990</accountInstrumentNumber>
  <balanceType>6</balanceType>
  <dateOfLastTransaction>20081217</dateOfLastTransaction>
  <balanceAmount>556</balanceAmount>
  <streetAddress1>50 Roadside</streetAddress1>
  <streetAddress2></streetAddress2>
  <city>Ontario</city>
  <province>ON</province>
  <postalCode>P3L 3S8</postalCode>
  <countryCode>CA</countryCode>   
</property>
</holder>

vgersh99 · August 15, 2018, 9:42am

well... your original sample data contained only ONE holder record - not 2 as in the new sample.
You should be more descriptive in the future...
Let me rework the suggestion with the NEW sample.

pchang · August 15, 2018, 9:53am

My XML file will contain many records - some records will have no additional holder entries and some may have 1 to 3.

vgersh99 · August 15, 2018, 9:54am

how about:

awk '
  /<accountHolderName/ {accH=FNR}
   FNR==NR {
     if (/<AdditionalAccountHolders/) {s[accH]=$0;f++;next}
     if (f) s[accH]= s[accH] ORS $0
     if (/<[/]AdditionalAccountHolders/) f=0
     next
   }
   /<accountHolderName/ {print $0 ORS s[accH];next}
   /<AdditionalAccountHolders/,/<[/]AdditionalAccountHolders/ {next}
   1' myXMLfile myXMLfile

joker · August 15, 2018, 10:22am

Hello,

XML Block warden speaking here

The question suggests that you are using XML in an inefficient way. The ordering of an XML file is irrelevant, undefined in terms of standardization, can change spontaneously, and with the right tools ordering isn't needed at all.

So scripts that try to impose an order on elements are likely to break at the slightest difference in the XML layout.

Regards,
Stomp

Update: Some examples of how to read data from that XML file:

Read the text of all accountHolderName elements

xmllint --xpath "//accountHolderName/text()" data.xml

Read the text of all additionalAccountHolderName elements

xmllint --xpath "//additionalAccountHolderName/text()" data.xml

Note

xmllint complains that this data is not well-formed XML, because the closing </holders> tag is missing. I added a </holders> tag at the end of your XML data to fix it.
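
The fix described in the note can also be scripted. This is only a minimal sketch: the sample data and the file name data.xml are made up for the demo, and it assumes the closing tag sits on a line of its own. It appends </holders> only when the last non-blank line is not already that tag.

```shell
#!/bin/sh
# Sketch: append the closing </holders> tag when it is missing.
# The sample data and the file name are assumptions made for this demo.
tmp=$(mktemp -d)
printf '%s\n' '<holders>' '<holder>' '</holder>' '' > "$tmp/data.xml"

fixed=$(awk '
  { print }
  NF { last = $0 }        # remember the last non-blank line
  END { if (last !~ /<\/holders>/) print "</holders>" }
' "$tmp/data.xml")

echo "$fixed"
rm -rf "$tmp"
```

If xmllint is installed, running "xmllint --noout" on the result is a quick way to confirm that the parser no longer complains.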

pchang · August 15, 2018, 10:28am

Hi Stomp.

I realize that the order of tags does not matter, but unfortunately we are building the XML file for an external client and they are requesting that the <AdditionalAccountHolders> info follows right after <accountHolderName>.

Thanks.

vgersh99 · August 15, 2018, 10:43am

Was the last sample file representative of all the cases or not?
If not, provide small representative sample.

Have you tried the latest suggestion in post 6?

joker · August 15, 2018, 10:58am

Well, that's bad. If possible, tell them how to do it more efficiently.

Using XML in such a way is a constant source of trouble.

Of course it's a matter of company policy whether your own company will do what the client wants, even if it's total bullshit.

To put it clearly

If you do it correctly according to the standards, you have lots of tools which will help you with the task. If you do not obey the standards, your client is on his own, probably writing very bad and difficult-to-maintain code.

There are lots of good XML tools out there, and libraries are widespread in a lot of programming languages.

pchang · August 15, 2018, 11:19am

vgersh99:

how about:

awk '
  /<accountHolderName/ {accH=FNR}
   FNR==NR {
   if (/<AdditionalAccountHolders/) {s[accH]=$0;f++;next}
   if (f) s[accH]= s[accH] ORS $0
   if (/<[/]AdditionalAccountHolders/) f=0
   next
   }
   /<accountHolderName/ {print $0 ORS s[accH];next}
   /<AdditionalAccountHolders/,/<[/]AdditionalAccountHolders/ {next}
   1' myXMLfile

Sorry vgersh99 - the above awk code did not return any records - Please have a look.

myXMLfile contains the following good representative data:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<holders xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

<holder>
<property>
  <accountHolderName>aaa</accountHolderName>
  <payeeName></payeeName>
  <jointAccountRelationshipCode>aa</jointAccountRelationshipCode>
  <accountInstrumentNumber>1234567890</accountInstrumentNumber>
  <balanceType>6</balanceType>
  <dateOfLastTransaction>20071217</dateOfLastTransaction>
  <balanceAmount>4076</balanceAmount>
  <streetAddress1>55 West Road</streetAddress1>
  <streetAddress2></streetAddress2>
  <city>Chilliwack</city>
  <province>BC</province>
  <postalCode>V2L 3S8</postalCode>
  <countryCode>CA</countryCode>   
  <AdditionalAccountHolders>
    <additionalAccountHolderName>bbb</additionalAccountHolderName>
  </AdditionalAccountHolders>
</property>
</holder>


<holder>
<property>
  <accountHolderName>ddd</accountHolderName>
  <payeeName></payeeName>
  <jointAccountRelationshipCode>xx</jointAccountRelationshipCode>
  <accountInstrumentNumber>999999990</accountInstrumentNumber>
  <balanceType>6</balanceType>
  <dateOfLastTransaction>20081217</dateOfLastTransaction>
  <balanceAmount>556</balanceAmount>
  <streetAddress1>50 Roadside</streetAddress1>
  <streetAddress2></streetAddress2>
  <city>Ontario</city>
  <province>ON</province>
  <postalCode>P3L 3S8</postalCode>
  <countryCode>CA</countryCode>   
  <AdditionalAccountHolders>
    <additionalAccountHolderName>eee</additionalAccountHolderName>
    <additionalAccountHolderName>fff</additionalAccountHolderName>
  </AdditionalAccountHolders>
</property>
</holder>

<holder>
<property>
  <accountHolderName>ggg</accountHolderName>
  <payeeName></payeeName>
  <jointAccountRelationshipCode>gg</jointAccountRelationshipCode>
  <accountInstrumentNumber>999999990</accountInstrumentNumber>
  <balanceType>6</balanceType>
  <dateOfLastTransaction>20081217</dateOfLastTransaction>
  <balanceAmount>556</balanceAmount>
  <streetAddress1>50 Albert</streetAddress1>
  <streetAddress2></streetAddress2>
  <city>Ontario</city>
  <province>ON</province>
  <postalCode>P3L 3S8</postalCode>
  <countryCode>CA</countryCode>     
</property>
</holder>

------ Post updated at 11:19 AM ------

I'm dealing with a branch of government.

vgersh99 · August 15, 2018, 12:31pm

a slight mod:

awk '
  /<accountHolderName/ {accH=FNR}
  FNR==NR {
     if (/<AdditionalAccountHolders/) {s[accH]=$0;f++;next}
     if (f) s[accH]= s[accH] ORS $0
     if (/<[/]AdditionalAccountHolders/) f=0
     next
  }
  /<accountHolderName/ && s[accH] {print $0 ORS s[accH];next}
  /<AdditionalAccountHolders/,/<[/]AdditionalAccountHolders/ {next}
  1' myXMLfile myXMLfile

Given your latest sample, I get:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<holders xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

<holder>
<property>
  <accountHolderName>aaa</accountHolderName>
  <AdditionalAccountHolders>
    <additionalAccountHolderName>bbb</additionalAccountHolderName>
  </AdditionalAccountHolders>
  <payeeName></payeeName>
  <jointAccountRelationshipCode>aa</jointAccountRelationshipCode>
  <accountInstrumentNumber>1234567890</accountInstrumentNumber>
  <balanceType>6</balanceType>
  <dateOfLastTransaction>20071217</dateOfLastTransaction>
  <balanceAmount>4076</balanceAmount>
  <streetAddress1>55 West Road</streetAddress1>
  <streetAddress2></streetAddress2>
  <city>Chilliwack</city>
  <province>BC</province>
  <postalCode>V2L 3S8</postalCode>
  <countryCode>CA</countryCode>
</property>
</holder>


<holder>
<property>
  <accountHolderName>ddd</accountHolderName>
  <AdditionalAccountHolders>
    <additionalAccountHolderName>eee</additionalAccountHolderName>
    <additionalAccountHolderName>fff</additionalAccountHolderName>
  </AdditionalAccountHolders>
  <payeeName></payeeName>
  <jointAccountRelationshipCode>xx</jointAccountRelationshipCode>
  <accountInstrumentNumber>999999990</accountInstrumentNumber>
  <balanceType>6</balanceType>
  <dateOfLastTransaction>20081217</dateOfLastTransaction>
  <balanceAmount>556</balanceAmount>
  <streetAddress1>50 Roadside</streetAddress1>
  <streetAddress2></streetAddress2>
  <city>Ontario</city>
  <province>ON</province>
  <postalCode>P3L 3S8</postalCode>
  <countryCode>CA</countryCode>
</property>
</holder>

<holder>
<property>
  <accountHolderName>ggg</accountHolderName>
  <payeeName></payeeName>
  <jointAccountRelationshipCode>gg</jointAccountRelationshipCode>
  <accountInstrumentNumber>999999990</accountInstrumentNumber>
  <balanceType>6</balanceType>
  <dateOfLastTransaction>20081217</dateOfLastTransaction>
  <balanceAmount>556</balanceAmount>
  <streetAddress1>50 Albert</streetAddress1>
  <streetAddress2></streetAddress2>
  <city>Ontario</city>
  <province>ON</province>
  <postalCode>P3L 3S8</postalCode>
  <countryCode>CA</countryCode>
</property>
</holder>
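
A side note on the earlier "did not return any records" reports: one possible cause is passing the file name only once. The sketch below is an assumption about what happened, not something confirmed in the thread. It runs the same two-pass program against a made-up sample (demo.xml), first with one file argument and then with two. With a single argument FNR==NR is true for every line, so every line ends in "next" and nothing is printed. The closing-tag pattern is written as <\/ here instead of <[/]; both are meant to match a literal slash.

```shell
#!/bin/sh
# Sketch: the two-pass awk idiom run with one and with two file arguments.
# The file name and the tiny sample are assumptions made for this demo.
tmp=$(mktemp -d)
cat > "$tmp/demo.xml" <<'EOF'
<property>
  <accountHolderName>aaa</accountHolderName>
  <city>Chilliwack</city>
  <AdditionalAccountHolders>
    <additionalAccountHolderName>bbb</additionalAccountHolderName>
  </AdditionalAccountHolders>
</property>
EOF

prog='
  /<accountHolderName/ {accH=FNR}
  FNR==NR {
     if (/<AdditionalAccountHolders/) {s[accH]=$0;f++;next}
     if (f) s[accH]= s[accH] ORS $0
     if (/<\/AdditionalAccountHolders/) f=0
     next
  }
  /<accountHolderName/ && s[accH] {print $0 ORS s[accH];next}
  /<AdditionalAccountHolders/,/<\/AdditionalAccountHolders/ {next}
  1'

# One argument: the first pass is the only pass, so there is no output.
one_arg=$(awk "$prog" "$tmp/demo.xml")
# Two arguments: pass 1 collects the blocks, pass 2 prints the reordered file.
two_args=$(awk "$prog" "$tmp/demo.xml" "$tmp/demo.xml")

echo "one file argument: [$one_arg]"
echo "two file arguments:"
echo "$two_args"
rm -rf "$tmp"
```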

pchang · August 15, 2018, 5:19pm

Thank you very much vgersh99.

Your code works great for all scenarios.

------ Post updated at 05:19 PM ------

Hi vgersh99.

Sorry to bother you again - My data requirement has changed slightly (it's one <AdditionalAccountHolders> tag per <additionalAccountHolderName> tag) and the current awk code does not work as expected anymore.

I have highlighted the data changes below for record#2. Would you kindly check the awk code? Thank you.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<holders xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

<holder>
<property>
  <accountHolderName>aaa</accountHolderName>
  <payeeName></payeeName>
  <jointAccountRelationshipCode>aa</jointAccountRelationshipCode>
  <accountInstrumentNumber>1234567890</accountInstrumentNumber>
  <balanceType>6</balanceType>
  <dateOfLastTransaction>20071217</dateOfLastTransaction>
  <balanceAmount>4076</balanceAmount>
  <streetAddress1>55 West Road</streetAddress1>
  <streetAddress2></streetAddress2>
  <city>Chilliwack</city>
  <province>BC</province>
  <postalCode>V2L 3S8</postalCode>
  <countryCode>CA</countryCode>   
  <AdditionalAccountHolders>
    <additionalAccountHolderName>bbb</additionalAccountHolderName>
  </AdditionalAccountHolders>
</property>
</holder>


<holder>
<property>
  <accountHolderName>ddd</accountHolderName>
  <payeeName></payeeName>
  <jointAccountRelationshipCode>xx</jointAccountRelationshipCode>
  <accountInstrumentNumber>999999990</accountInstrumentNumber>
  <balanceType>6</balanceType>
  <dateOfLastTransaction>20081217</dateOfLastTransaction>
  <balanceAmount>556</balanceAmount>
  <streetAddress1>50 Roadside</streetAddress1>
  <streetAddress2></streetAddress2>
  <city>Ontario</city>
  <province>ON</province>
  <postalCode>P3L 3S8</postalCode>
  <countryCode>CA</countryCode>

  <AdditionalAccountHolders>
    <additionalAccountHolderName>eee</additionalAccountHolderName>
  </AdditionalAccountHolders>
  <AdditionalAccountHolders>
    <additionalAccountHolderName>fff</additionalAccountHolderName>
  </AdditionalAccountHolders>
</property>
</holder>

<holder>
<property>
  <accountHolderName>ggg</accountHolderName>
  <payeeName></payeeName>
  <jointAccountRelationshipCode>gg</jointAccountRelationshipCode>
  <accountInstrumentNumber>999999990</accountInstrumentNumber>
  <balanceType>6</balanceType>
  <dateOfLastTransaction>20081217</dateOfLastTransaction>
  <balanceAmount>556</balanceAmount>
  <streetAddress1>50 Albert</streetAddress1>
  <streetAddress2></streetAddress2>
  <city>Ontario</city>
  <province>ON</province>
  <postalCode>P3L 3S8</postalCode>
  <countryCode>CA</countryCode>     
</property>
</holder>

awk code:

awk '
  /<accountHolderName/ {accH=FNR}
  FNR==NR {
     if (/<AdditionalAccountHolders/) {s[accH]=$0;f++;next}
     if (f) s[accH]= s[accH] ORS $0
     if (/<[/]AdditionalAccountHolders/) f=0
     next
  }
  /<accountHolderName/ && s[accH] {print $0 ORS s[accH];next}
  /<AdditionalAccountHolders/,/<[/]AdditionalAccountHolders/ {next}
  1' myXMLfile myXMLfile

vgersh99 · August 15, 2018, 6:23pm

what's expected based on the latest sample?

------ Post updated at 06:23 PM ------

how about:

awk '
  /<accountHolderName/ {accH=FNR}
  FNR==NR {
     if (/<AdditionalAccountHolders/) {s[accH]=(s[accH])?s[accH] ORS $0:$0;f++;next}
     if (f) s[accH]= s[accH] ORS $0
     if (/<[/]AdditionalAccountHolders/) f=0
     next
  }
  /<accountHolderName/ && s[accH] {print $0 ORS s[accH];next}
  /<AdditionalAccountHolders/,/<[/]AdditionalAccountHolders/ {next}
  1' myXMLfile myXMLfile

joker · August 16, 2018, 9:26am

Rant

Well. I'm always interested in pitting other languages against awk. I obviously cannot compete in brevity so far (which is very impressive when I see the solutions presented in this forum, though they sometimes twist my brain, and coming back to such a solution later can be a horror: WTF did I think when I wrote that pile of crazy code?), but I try to compete in maintainability, efficiency (I/O requests and memory economy) and runtime speed.

I don't know if you are even able to use Ruby, but here's my suggestion in Ruby (just for the fun of learning).

/Rant

#!/usr/bin/env ruby

$handle = File.open(ARGV[0],"r")
$current_line = ""

$ah     ="^\s*<accountHolderName>.*<\/accountHolderName>\s*\n"
$aah    ="^\s*<additionalaccountHolders>.*<\/additionalaccountHolders>\s*\n"

def chunks()     
        Enumerator.new do |chunk|
             loop do
                     new = get_chunk()
                     if !new then break end
                     chunk << new
              end
        end
end

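def get_chunk()
        current_chunk = nil
        while !$handle.eof do
                current_chunk = (current_chunk || "") + $current_line
                $current_line = $handle.readline
                if $current_line.match(/holder>/) then
                        return current_chunk
                end
        end
        # At EOF: flush what is still pending, including the last line read,
        # which would otherwise be dropped from the output.
        rest = (current_chunk || "") + $current_line
        $current_line = ""
        return rest.empty? ? nil : rest
end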

def reorder(chunk)
        ah_current=chunk.match(/(#{$ah})/im)
        aah_current=chunk.match(/#{$aah}/im)
        chunk.gsub!(/#{$aah}/im,"")
        chunk.gsub!(/#{$ah}/,"#{ah_current}#{aah_current}")
        return chunk
end

chunks.each {|c| puts reorder(c)}

Use it like this:

reorder.rb data.xml

------ Post updated at 01:56 PM ------

Or here with OOP:

#!/usr/bin/env ruby

class ChunkCollection
        def initialize(file)
                @handle = File.open(file,"r")
                @current_line=@data=""
        end
        def chunks()
                Enumerator.new do |c|
                     loop do
                             if data = get_chunk() then
                                     c << Chunk.new(data) 
                             else 
                                     break
                             end
                      end
                end
        end
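        def get_chunk()
                current_chunk = nil
                while !@handle.eof do
                        current_chunk = (current_chunk || "") + @current_line
                        @current_line = @handle.readline
                        return current_chunk if @current_line.match(/holder>/)
                end
                # At EOF: flush what is still pending, including the last line read.
                rest = (current_chunk || "") + @current_line
                @current_line = ""
                return rest.empty? ? nil : rest
        end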
        def reorder() 
                chunks.each {|c| @data+=c.reorder()}
                return self
        end
        def show() puts @data end
end     

class Chunk
        def initialize(data) 
                @ah     ="^\s*<accountHolderName>.*<\/accountHolderName>\s*\n"
                @aah    ="^\s*<additionalaccountHolders>.*<\/additionalaccountHolders>\s*\n"
                @chunk = data
        end
        def reorder()
                ah_current=@chunk.match(/(#{@ah})/im)
                aah_current=@chunk.match(/#{@aah}/im)
                return @chunk.gsub(/#{@aah}/im,"").
                        gsub(/#{@ah}/,"#{ah_current}#{aah_current}")
        end
end

ChunkCollection.new(ARGV[0]).reorder.show

------ Post updated at 03:26 PM ------

I would suggest this little change to vgersh solution:

vgersh99:

awk '
  /<accountHolderName/ {accH=FNR}
  FNR==NR {
   if (/<AdditionalAccountHolders/) {s[accH]=(s[accH])?s[accH] ORS $0:$0;f++;next}
   if (f) s[accH]= s[accH] ORS $0
   if (/<[/]AdditionalAccountHolders/) f--
   next
  }
  /<accountHolderName/ && s[accH] {print $0 ORS s[accH];next}
  /<AdditionalAccountHolders/,/<[/]AdditionalAccountHolders/ {next}
  1' myXMLfile myXMLfile

Edit

My change is not needed. Even the unmodified version prior to the input data specification change (without f++ but f=1) works.

And well: That awk solution is not really that complicated.... :rolleyes:

pchang · August 16, 2018, 9:59am

Thanks to both Stomp and vgersh99 for spending your time and helping me out. This latest awk code works magically great and the results obtained are as expected:

awk '
  /<accountHolderName/ {accH=FNR}
  FNR==NR {
     if (/<AdditionalAccountHolders/) {s[accH]=(s[accH])?s[accH] ORS $0:$0;f++;next}
     if (f) s[accH]= s[accH] ORS $0
     if (/<[/]AdditionalAccountHolders/) f--
     next
  }
  /<accountHolderName/ && s[accH] {print $0 ORS s[accH];next}
  /<AdditionalAccountHolders/,/<[/]AdditionalAccountHolders/ {next}
  1' myXMLfile myXMLfile

Peasant · August 16, 2018, 11:05am

As a systems guy, I would always choose awk, if possible.
Though perhaps not for parsing XML...

Examine the strace -c <code> reports :
For awk :

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 27.26    0.000558          13        42           write
 18.61    0.000381          21        18           read
 11.87    0.000243          16        15           openat
  5.72    0.000117          39         3           munmap
  5.28    0.000108           5        20           fstat
  4.49    0.000092          46         2           lseek
  4.35    0.000089           3        26           mmap
  4.25    0.000087          15         6           brk
  3.71    0.000076           5        15           close
  3.13    0.000064          64         1           stat
  2.54    0.000052          10         5           rt_sigaction
  2.00    0.000041           2        18           mprotect
  1.91    0.000039           7         6           fcntl
  1.27    0.000026           9         3         2 ioctl
  0.73    0.000015           8         2           getgroups
  0.44    0.000009           9         1           sigaltstack
  0.39    0.000008           8         1           getpgrp
  0.34    0.000007           7         1           getpid
  0.34    0.000007           7         1           getuid
  0.34    0.000007           7         1           getgid
  0.34    0.000007           7         1           geteuid
  0.34    0.000007           7         1           getegid
  0.34    0.000007           7         1           getppid
  0.00    0.000000           0        10        10 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00    0.002047                   202        12 total

And for ruby :

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 37.68    0.003260           7       459       320 openat
 18.30    0.001583           4       361         7 lstat
 12.39    0.001072           6       186           read
  6.60    0.000571           4       141           close
  4.91    0.000425           3       141           fstat
  2.84    0.000246           3        76         1 stat
  2.53    0.000219           3        74        71 ioctl
  2.35    0.000203           3        61           fcntl
  2.24    0.000194           4        47           getegid
  2.09    0.000181           5        38           brk
  1.83    0.000158           3        46           getgid
  1.68    0.000145           3        46           getuid
  1.41    0.000122           3        47           geteuid
  1.33    0.000115          19         6           getdents
  0.89    0.000077           2        38           mmap
  0.47    0.000041           2        27           mprotect
  0.43    0.000037           2        24           lseek
  0.02    0.000002           2         1           sysinfo
  0.00    0.000000           0         5           write
  0.00    0.000000           0         4           munmap
  0.00    0.000000           0        18           rt_sigaction
  0.00    0.000000           0         3           rt_sigprocmask
  0.00    0.000000           0        12        12 access
  0.00    0.000000           0         3           getpid
  0.00    0.000000           0         1           clone
  0.00    0.000000           0         3         1 execve
  0.00    0.000000           0         1           getcwd
  0.00    0.000000           0         1           sigaltstack
  0.00    0.000000           0         2           arch_prctl
  0.00    0.000000           0         1           futex
  0.00    0.000000           0         1           sched_getaffinity
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         1           clock_gettime
  0.00    0.000000           0         1           set_robust_list
  0.00    0.000000           0         2           pipe2
  0.00    0.000000           0         3           prlimit64
  0.00    0.000000           0         2           getrandom
------ ----------- ----------- --------- --------- ----------------
100.00    0.008651                  1884       412 total

The difference will not be noticed on a system when parsing one file.
But in a situation where you need to parse tens of thousands ...

Possibly the ruby code can be written better, but it is doubtful it will ever surpass the awk code in performance.
That is to say, if two ideal coders each write a program, one in ruby and one in awk, to do one thing as well as it can be done, and then start forking it.

So, to conclude, in my opinion higher-level languages are to be used in situations where your program needs many libraries to ease the job - connecting to multiple API endpoints, databases, versioning systems, complex math and such.

You could do all that in awk, but tremendous effort would be required and it would defeat the purpose of short programs which do one thing quickly and efficiently.

This is just my rant
Regards
Peasant.

joker · August 16, 2018, 11:51am

Noted that the awk solution is far ahead in this case, even though it is used in an inefficient way, reading the input file twice...

Thanks for "strace -c". I had never used that before. Good bloat indicator.
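
On the point about reading the input twice: a single-pass variant is possible if each <property> block is buffered until its closing tag. The following is only a sketch, not the solution used in the thread. The variable names (inrec, buf, moved, placed) and the sample file are made up, and it assumes every tag sits on its own line as in the posted data.

```shell
#!/bin/sh
# Sketch: single-pass reordering that buffers one <property> block at a time.
# Sample data, file name and variable names are assumptions for this demo.
tmp=$(mktemp -d)
cat > "$tmp/in.xml" <<'EOF'
<holders>
<holder>
<property>
  <accountHolderName>ddd</accountHolderName>
  <city>Ontario</city>
  <AdditionalAccountHolders>
    <additionalAccountHolderName>eee</additionalAccountHolderName>
  </AdditionalAccountHolders>
  <AdditionalAccountHolders>
    <additionalAccountHolderName>fff</additionalAccountHolderName>
  </AdditionalAccountHolders>
</property>
</holder>
<holder>
<property>
  <accountHolderName>ggg</accountHolderName>
  <city>Ontario</city>
</property>
</holder>
</holders>
EOF

result=$(awk '
  # Start of a record: reset the buffer and the collected blocks.
  !inrec && /<property>/ { inrec = 1; n = 0; moved = ""; placed = 0; print; next }
  # End of a record: replay the buffer, emitting the collected blocks
  # right after the accountHolderName line.
  inrec && /<\/property>/ {
      for (i = 1; i <= n; i++) {
          print buf[i]
          if (!placed && buf[i] ~ /<accountHolderName/) { printf "%s", moved; placed = 1 }
      }
      if (!placed) printf "%s", moved   # no accountHolderName: keep the blocks anyway
      inrec = 0; print; next
  }
  # Inside a record: collect AdditionalAccountHolders blocks, buffer the rest.
  inrec {
      if ($0 ~ /<AdditionalAccountHolders/) inblock = 1
      if (inblock) {
          moved = moved $0 ORS
          if ($0 ~ /<\/AdditionalAccountHolders/) inblock = 0
          next
      }
      buf[++n] = $0
      next
  }
  { print }
' "$tmp/in.xml")

echo "$result"
rm -rf "$tmp"
```

Memory use is bounded by the largest single record rather than by the whole file, at the price of a longer program than the two-pass version.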

------ Post updated at 05:51 PM ------

Well. I thought I had written my Ruby script fairly well, but it's an absolute disaster. I generated an XML data file of just 10 MB. This is the result.

AWK Resources

0.50user 0.00system 0:00.56elapsed 89%CPU (0avgtext+0avgdata 7344maxresident)k
 2592inputs+0outputs (10major+1130minor)pagefaults 0swaps

7860 system calls

Ruby procedural

5.29user 0.06system 0:05.38elapsed 99%CPU (0avgtext+0avgdata 13604maxresident)k
0inputs+0outputs (0major+2523minor)pagefaults 0swaps

4902 system calls

Ruby OOP

41.35user 36.62system 1:18.19elapsed 99%CPU (0avgtext+0avgdata 254192maxresident)k
352inputs+0outputs (3major+22449045minor)pagefaults 0swaps

 37602 system calls

I assume the string-concatenation is really bad here.

Ruby OOP(without storing the result in memory)

4.93user 0.31system 0:05.57elapsed 94%CPU (0avgtext+0avgdata 13264maxresident)k
0inputs+0outputs (0major+2611minor)pagefaults 0swaps

  4901 system calls

So if one wants speed and a low memory footprint, one can tune a lot with [high-level-programming-language] or just take awk.

In terms of system calls Ruby won here (probably because awk reads the file twice), but it's 10 times slower. I think Ruby has a fat base whereas awk is very lean, so the more complex the task is, the less relevant the basic bloat becomes.
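
For anyone who wants to repeat a measurement like this, a larger input can be produced by repeating one record. This is a hedged sketch, not the generator used for the numbers above: the record contents, the count n=1000 and the file name big.xml are arbitrary choices, and n would need to be raised until the file reaches the size you want.

```shell
#!/bin/sh
# Sketch: build a larger test file by repeating one <holder> record.
# Record contents, record count and file name are assumptions for this demo.
tmp=$(mktemp -d)
out="$tmp/big.xml"

awk -v n=1000 'BEGIN {
    print "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>"
    print "<holders>"
    for (i = 1; i <= n; i++) {
        print "<holder>"
        print "<property>"
        printf "  <accountHolderName>name%d</accountHolderName>\n", i
        print "  <city>Ontario</city>"
        print "  <AdditionalAccountHolders>"
        printf "    <additionalAccountHolderName>extra%d</additionalAccountHolderName>\n", i
        print "  </AdditionalAccountHolders>"
        print "</property>"
        print "</holder>"
    }
    print "</holders>"
}' > "$out"

records=$(grep -c '<holder>' "$out")
bytes=$(wc -c < "$out")
echo "$records records, $bytes bytes"
rm -rf "$tmp"
```

The resource lines quoted above look like the default output of GNU time, so running the awk or Ruby command under /usr/bin/time with its output sent to /dev/null should give comparable figures, assuming that tool is available on your system.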

Peasant · August 16, 2018, 1:44pm

When measuring, there are a couple of things to consider.

Today's operating systems and filesystems are smart: they cache, prefetch and do similar math magic that falls into the probability and combinatorics division.
So far and deep in the hardware that they give you other users' data when asked nicely.
Filesystems will cache the first 10 MB read, so the second read will be amazingly fast(er).

Be sure to take the above into consideration during testing.

This was not done to compare ruby and awk per se, just to point out that you should not limit yourself to a certain path, but use the right tool for the task.

As for the strace options, I had read the manual a bit beforehand to find an option, since I was sure GNU stuff has that nicely formatted without effort.

Regards
Peasant.

joker · August 16, 2018, 2:34pm

Of course. I assume the cache makes any normal read times insignificant here. I can create additional processing overhead by reading in portions that are too small, or improve performance by reading larger chunks. This is good, because now the times here are processing times only.

My curiosity here is NOT "the right tool for the right job" at the moment. My point is: is [some high-level-programming-language] too bloated to compete with awk in this single task in terms of speed? And if it cannot compete, how far behind is it?

I already tested the same algorithm that the awk code uses here in Ruby. It's roughly 3 times faster (still 2-3 times slower than awk), but far less elegant than the awk code. That's a first interesting insight, along with the other realization that line-based processing seems to be a lot faster than my chunk-based processing. I also have an idea which parts of my code are the worst, and it is good to see how much difference those "little" things actually make.