Hi,
If I have a file in XML format, I would like to remove duplicated records and save the result to a new file. Is it possible to write a script to do it?
Try
uniq inputfile
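One caveat worth checking before relying on this: uniq compares whole lines and only drops a line when it is identical to the line directly before it, so it does not treat a multi-line record as a unit. A small sketch with made-up input lines:

```shell
# uniq removes a line only when it equals the immediately preceding line.
# Here the repeated line is not adjacent to its twin, so nothing is removed.
out=$(printf '%s\n' 'this is testing' 'my id is 2002' 'this is testing' | uniq)
printf '%s\n' "$out"
```

All three lines come back unchanged, which is why a record-aware approach is needed for this kind of file.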
I don't know if it's possible in shell or not, but it's possible in Perl. Do consider that option if you can.
Can Perl run under ksh on Unix?
Also, the records are a bit different. They look like this:
record1:
this is testing
my id is 2001
end:
record2:
this is testing2
my id is 2002
end:
record3:
this is testing
my id is 2002
end:
record4:
this is testing2
my id is 2002
end:
In the above, records 2 and 4 are duplicates, because both the "testing2" line and the "id" line are the same. If only one line is the same, the records do not count as duplicates.
Can anyone help with a script, in Perl or otherwise?
I haven't tested this, but please check it:
paste -s -d"\t\t\t\n" filename|sort -u |tr "\t" "\n"
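To see what the paste step does on its own, here is a minimal sketch. It assumes every record is exactly four lines (label, two data lines, end:), and sample.txt is just a hypothetical file name:

```shell
# Build a two-record sample file (hypothetical name: sample.txt)
printf '%s\n' \
  'record1:' 'this is testing'  'my id is 2001' 'end:' \
  'record2:' 'this is testing2' 'my id is 2002' 'end:' > sample.txt

# The delimiter list tab,tab,tab,newline is reused cyclically, so each
# group of four input lines becomes one tab-separated output line.
joined=$(paste -s -d '\t\t\t\n' sample.txt)
printf '%s\n' "$joined"
```

Note that the record label is part of each joined line, so any comparison of records has to skip the first field.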
You can try to use awk.
Create the following awk script uniq.awk :
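/^end:/ {
    # End of record: print it only if this definition has not been seen yet
    if (! (Record in Records)) {
        Records[Record];
        print RecordLabel ":";
        print Record;
        print $0;
    }
    Record = "";    # always reset, so a skipped duplicate does not leak into the next record
    next;
}
# Start of record: remember the label (the text before ':')
$1 ~ /^.*:/ {
    sub(/:.*/, "", $1);
    RecordLabel = $1;
    next;
}
# Any other line belongs to the record definition
{
    Record = (Record ? Record "\n" : "") $0;
}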
and execute it :
$ awk -f uniq.awk filename
record1:
this is testing
my id is 2001
end:
record2:
this is testing2
my id is 2002
end:
record3:
this is testing
my id is 2002
end:
$
Jean-Pierre.
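Since the original question also asks to save the result to a new file, the output can simply be redirected. Below is a minimal self-contained sketch of the same idea written inline. The file names records.txt and deduped.txt, the variable names, and the assumption that labels look like record<number>: are all illustrative and not part of the script above:

```shell
# Sample input (hypothetical file name: records.txt)
printf '%s\n' \
  'record1:' 'this is testing'  'my id is 2001' 'end:' \
  'record2:' 'this is testing2' 'my id is 2002' 'end:' \
  'record3:' 'this is testing'  'my id is 2002' 'end:' \
  'record4:' 'this is testing2' 'my id is 2002' 'end:' > records.txt

awk '
/^end:/ {
    if (!(rec in seen)) {          # first time this record body is seen
        seen[rec]
        print label; print rec; print
    }
    rec = ""                       # reset for every record, duplicate or not
    next
}
/^record[0-9]*:$/ { label = $0; next }   # assumed label format
{ rec = (rec == "" ? $0 : rec "\n" $0) } # collect the body lines
' records.txt > deduped.txt        # redirect the result to a new file
```

With this sample, deduped.txt should hold records 1 to 3, and record4 is dropped because its body matches record2.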
Dear Sir,
It would be a great help if you could describe this code in detail. I have just started to learn awk, and a clear understanding of this code would help me a lot in future.
Thanks in advance.
/^end:/ { ... ; next }
Select the end-of-record line (a line starting with 'end:'), execute the '...' code and read the next line.
if (! (Record in Records)) { ... }
If the record definition has not yet been memorized in the Records array, execute the '...' code.
That code prints the full record (label, definition, end:) and memorizes the record definition.
Records[Record];
Create an element in the Records array. The index of this element is the record definition.
print RecordLabel ":"; print Record; print $0;
Print the full record: label, definition and end line.
Record = "";
Reset the record definition.
$1 ~ /^.*:/ { ... ; next }
Select the start of a record (a line whose field 1 contains ':', such as 'record1:'), execute the '...' code and read the next line.
sub(/:.*/, "", $1);
RecordLabel = $1;
The record label is memorized in the RecordLabel variable.
It is equal to all the characters before ':' in field 1.
{ ... }
Select a record definition line (any other line) and execute the '...' code.
Record = (Record ? Record "\n" : "") $0;
Append the line just read ($0) to the Record variable, where the previous lines are memorized.
A line separator is added first if a line has already been memorized.
Jean-Pierre.
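The "in" test and the bare array reference explained above are a general awk idiom for remembering what has already been seen. A minimal sketch on single lines instead of whole records (the array name seen and the input words are made up for illustration):

```shell
# Print each distinct line once, in order of first appearance.
# "seen[$0];" creates the array element just by referencing it,
# and "($0 in seen)" tests for the element without creating it.
out=$(printf '%s\n' apple pear apple plum pear |
      awk '{ if (!($0 in seen)) { seen[$0]; print } }')
printf '%s\n' "$out"
```

This prints apple, pear and plum, one per line. The record script does the same thing, only with the whole record definition as the array index.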
Here is a modification of ranj@chn's code that should work for you:
paste -s -d"\t\t\t\n" f | sort -u -k2 | sort -k1 |tr "\t" "\n"
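A quick way to try this pipeline on the sample data from the question is sketched below. It assumes four-line records and writes the sample to a file named f, as in the command above. Which of the duplicate labels survives depends on the sort implementation, so the sketch only counts the results:

```shell
# Sample data from the question, four lines per record
printf '%s\n' \
  'record1:' 'this is testing'  'my id is 2001' 'end:' \
  'record2:' 'this is testing2' 'my id is 2002' 'end:' \
  'record3:' 'this is testing'  'my id is 2002' 'end:' \
  'record4:' 'this is testing2' 'my id is 2002' 'end:' > f

# 1. join each record into one tab-separated line
# 2. keep one line per distinct key, where the key starts at field 2 (label ignored)
# 3. sort again on the whole line so the records come out in label order
# 4. turn the tabs back into newlines
out=$(paste -s -d '\t\t\t\n' f | sort -u -k2 | sort -k1 | tr '\t' '\n')
printf '%s\n' "$out"
```

Three records (12 lines) remain, and the "this is testing2" body appears only once.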