Remove duplicate XML records in a file under Unix

happyv · September 20, 2006, 5:49am

Hi,

If I have a file in XML format, I would like to remove the duplicated records and save the result to a new file. Is it possible to write a script to do this?

tayyabq8 · September 20, 2006, 6:15am

Try

uniq inputfile
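
For context, uniq compares whole lines and only collapses repeats that are adjacent, so on its own it does not treat a multi-line record as a unit. A minimal sketch with made-up data (the file name lines.txt is only an illustration):

```shell
# Hypothetical four-line input: the last two lines are adjacent repeats.
printf '%s\n' a b a a > lines.txt

# Only the adjacent repeat is dropped; the earlier "a" stays.
uniq lines.txt
```

The non-adjacent "a" is kept, which is why plain uniq usually needs sorted input and line-sized records.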

Yogesh_Sawant · September 20, 2006, 6:17am

I don't know if it's possible in shell or not, but it's possible in Perl. Do consider that option if you can.

happyv · September 20, 2006, 6:34am

Can Perl run under ksh on Unix?

Also, the records are a bit different. They look like this:

  record1:
       this is testing
       my id is 2001
  end:
  record2:
       this is testing2
       my id is 2002
  end:
  record3:
        this is testing
        my id is 2002
  end:
  record4:
        this is testing2
        my id is 2002
  end:

In the example above, records 2 and 4 are duplicates, because both the "testing2" line and the "id" line are the same. If only one line matches, the records do not count as duplicates.

Can anyone help with a script, in Perl or anything else?

ranj1 · September 20, 2006, 7:53am

I haven't tested this, but please check it:

paste -s -d"\t\t\t\n" filename|sort -u |tr "\t" "\n"
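
To see what the paste step does, here is a small sketch. It assumes every record is exactly four lines long and uses a hypothetical file, joined_demo.txt, in which records 2 and 3 share the same body:

```shell
# Hypothetical sample file: records 2 and 3 share a body but not a label.
cat > joined_demo.txt <<'EOF'
record1:
this is testing
my id is 2001
end:
record2:
this is testing2
my id is 2002
end:
record3:
this is testing2
my id is 2002
end:
EOF

# paste -s joins lines serially and cycles through the delimiter list
# (tab, tab, tab, newline), so every four-line record becomes one line.
paste -s -d"\t\t\t\n" joined_demo.txt > joined.txt
cat joined.txt

# sort -u compares whole lines here, label included, so all three survive.
sort -u joined.txt | wc -l
```

Because the label is part of each joined line, two records with the same body but different labels still compare as different, so the sort key would have to skip the first field.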

aigles · September 20, 2006, 8:15am

You can try awk.
Create the following awk script, uniq.awk:

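/^end:/ {
   # End of record: print it only if this body has not been seen before.
   if (! (Record in Records)) {
      Records[Record];
      print RecordLabel ":";
      print Record;
      print $0;
   }
   # Reset after every record, so a skipped duplicate does not leak into the next one.
   Record = "";
   next;
}
$1 ~ /^.*:/ {
   # Start of record: keep the label (the text before ':').
   sub(/:.*/, "", $1);
   RecordLabel = $1;
   next;
}
{
   # Record body: append the line, separated by a newline.
   Record = (Record ? Record "\n" : "") $0;
}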

and execute it:

$ awk -f uniq.awk filename
record1:
this is testing
my id is 2001
end:
record2:
this is testing2
my id is 2002
end:
record3:
this is testing
my id is 2002
end:
$

Jean-Pierre.

nervous · September 20, 2006, 9:03am

Dear Sir,

It would be a great help if you could describe this code in detail. I have just started to learn awk, and understanding it clearly would help me a lot in the future.

Thanks in advance.

aigles · September 20, 2006, 1:17pm

/^end:/ { ... ; next }
Selects the end-of-record line (a line starting with 'end:'), executes the '...' code, and then reads the next line.

if (! (Record in Records)) { ... }
If the record definition has not yet been stored in the Records array, execute the '...' code.
That code prints the full record (label, definition, end:) and stores the record definition.

Records[Record];
Creates an element in the Records array (in awk, merely referencing an element creates it). The index of this element is the record definition.

print RecordLabel ":"; print Record; print $0;
Prints the full record: label, definition, and end line.

Record = "";
Reset the Record definition.

$1 ~ /^.*:/ { ... ; next }
Selects the start-of-record line (a line whose first field contains ':', such as 'record1:'), executes the '...' code, and then reads the next line.

sub(/:.*/, "", $1);
RecordLabel = $1;
The record label is stored in the RecordLabel variable.
It consists of all the characters before the first ':' in field 1.

{ ... }
Selects every remaining line (the record definition lines) and executes the '...' code.

Record = (Record ? Record "\n" : "") $0;
Appends the current line, $0, to the Record variable, where the previous lines of the record are stored.
A newline separator is inserted first if a line has already been stored.

Jean-Pierre.
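
For anyone who wants to experiment, the steps explained above can be condensed into a self-contained sketch. It is not the original script: it assumes that records may be indented and that a duplicate may be followed by further records, so it strips leading whitespace first and resets the accumulated body at every 'end:' line. The names records.txt, unique.txt, seen, body and label, as well as the extra record5, are made up for the illustration.

```shell
# Hypothetical input: record4 repeats the body of record2.
cat > records.txt <<'EOF'
  record1:
       this is testing
       my id is 2001
  end:
  record2:
       this is testing2
       my id is 2002
  end:
  record3:
        this is testing
        my id is 2002
  end:
  record4:
        this is testing2
        my id is 2002
  end:
  record5:
        this is testing5
        my id is 2005
  end:
EOF

awk '
{ sub(/^[ \t]+/, "") }            # drop indentation so it cannot affect matching
/^end:/ {                         # end of record
    if (!(body in seen)) {        # first time this body appears
        seen[body]                # referencing the element creates it
        print label
        print body
        print $0
    }
    body = ""                     # always reset, even after a duplicate
    next
}
/^[^ \t]+:$/ {                    # label line such as "record1:"
    label = $0
    next
}
{                                 # body line: append with a newline separator
    body = (body == "" ? "" : body "\n") $0
}
' records.txt > unique.txt

cat unique.txt
```

With this input, record4 is dropped as a repeat of record2, and record5 is still printed on its own because the body is cleared after the skipped duplicate.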

anbu23 · September 20, 2006, 1:36pm

Here is a modification of ranj@chn's code that should work for you:

paste -s -d"\t\t\t\n" f | sort -u -k2 | sort -k1 |tr "\t" "\n"
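
A sketch of the same pipeline on the sample data, with the tab named explicitly as the field separator so that the spaces inside each line cannot shift the sort keys. It assumes four-line records without indentation; the file names records4.txt and unique4.txt and the variable tab are only illustrations.

```shell
# Hypothetical copy of the sample records, one field per line.
cat > records4.txt <<'EOF'
record1:
this is testing
my id is 2001
end:
record2:
this is testing2
my id is 2002
end:
record3:
this is testing
my id is 2002
end:
record4:
this is testing2
my id is 2002
end:
EOF

tab=$(printf '\t')

# 1. Join each record into one tab-separated line.
# 2. Keep one line per distinct body (fields 2 to the end).
# 3. Put the survivors back in label order.
# 4. Split the lines into the original layout again.
paste -s -d"\t\t\t\n" records4.txt |
  sort -t "$tab" -u -k2 |
  sort -t "$tab" -k1,1 |
  tr "\t" "\n" > unique4.txt

cat unique4.txt
```

Two caveats: which of the duplicate labels survives is up to the sort implementation, and the final ordering is lexical, so a label such as record10 would sort before record2.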