Delete lines in file containing duplicate strings, keeping longer strings

raidzero · September 16, 2011, 4:07pm

The question is not as simple as the title... I have a file that looks like this:

<string name="string1">RZ-LED</string>
<string name="string2">2.0</string>
<string name="string2">Version 2.0</string>
<string name="string3">BP</string>

I would like to check for duplicate entries of string2, keeping the longer of two lines...

output would ideally be

<string name="string1">RZ-LED</string>
<string name="string2">Version 2.0</string>
<string name="string3">BP</string>

Is this possible using GNU tools?

radoulov · September 16, 2011, 4:13pm

Are the duplicate lines always consecutive?
What should happen if more than one line has the same length?

Corona688 · September 16, 2011, 4:15pm

Assuming the XML is as you've shown it and not some slightly different arrangement:

$ awk -v FS="\"" '{
        # Remember the order tokens come in
        if(!L[$2]) { C[N++]=$2; L[$2]=1; }
        # Save the longest
        if(length($3) > length(A[$2])) { A[$2]=$3; B[$2]=$0 }
}

END { for(M=0; M<N; M++) print B[C[M]] }' < data
<string name="string1">RZ-LED</string>
<string name="string2">Version 2.0</string>
<string name="string3">BP</string>
$

raidzero · September 16, 2011, 4:39pm

duplicate lines are not always consecutive, and the item names can vary

string1 may be defined at line 18, and then string1 might be defined again at line 818...

---------- Post updated at 04:39 PM ---------- Previous update was at 04:36 PM ----------

but you know what? corona, your solution seems to work

my awk-fu is weak

Corona688 · September 16, 2011, 4:51pm

I kind of cheated. I split on " to get string1/string2/string3 directly (as $2). As long as there's no " anywhere else, $3 is the entire rest of the line, which I use to compare the lengths. I also store the entire line for printing later, and use the C array to remember the order.
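
To see what that split produces, here is a small sketch run against one line from the sample data. The variable names `line` and `fields` are only for illustration and do not come from the thread.

```shell
# One line from the sample file in the first post.
line='<string name="string2">Version 2.0</string>'

# Split on the double quote: $2 is the name, $3 is everything after the closing quote.
fields=$(printf '%s\n' "$line" | awk -F'"' '{
    print "key=" $2
    print "rest=" $3
    print "len=" length($3)
}')

printf '%s\n' "$fields"
```

The third field still carries the leading `>` and the closing tag. As long as the lines being compared use the same tag, those extras add the same amount to each, so comparing lengths still picks the longer string.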

raidzero · September 16, 2011, 5:29pm

The string lengths can vary a lot. Actually it causes issues with long strings, it creates new lines in the file, which doesn't fly.

---------- Post updated at 05:02 PM ---------- Previous update was at 05:01 PM ----------

new lines in the strings is what I meant*

---------- Post updated at 05:04 PM ---------- Previous update was at 05:02 PM ----------

here is an example string that gets mangled:

%1$s\n\nFrom: %2$s\n\nTo: %3$s

---------- Post updated at 05:11 PM ---------- Previous update was at 05:04 PM ----------

and the reason it is mangled is because of those newline characters in the string... the awk script interprets the newlines when in fact the newline is not supposed to show up until application runtime

---------- Post updated at 05:13 PM ---------- Previous update was at 05:11 PM ----------

ignoring the "\n"'s would be ideal, can that be done? I don't really understand any of your function...

---------- Post updated at 05:29 PM ---------- Previous update was at 05:13 PM ----------

I got around the newline thing with sed: sed -i -e 's/\\/\\\\/g'

Corona688 · September 16, 2011, 6:19pm

I did ask if the text was always as shown; apparently not. This is why xml is so hard to awk...

Something like that would've been my suggestion to fix it anyway, though

I don't understand how that string would cause awk to mess up, though! Can you show the actual XML surrounding it?
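
For reference, a quick check suggests awk leaves backslash sequences in its input alone and only expands them in places such as string literals and `-v` assignments. This is a minimal sketch, not a reproduction of the mangling described above, whose cause is not shown in the thread. The variable names `from_input` and `from_assignment` are made up for the example.

```shell
# Backslashes that arrive as input data pass through awk unchanged.
from_input=$(printf '%s\n' '%1$s\n\nFrom: %2$s\n\nTo: %3$s' | awk '{ print }')
printf '%s\n' "$from_input"

# The same kind of text passed with -v has its escapes expanded,
# so each \n becomes a real newline in the output.
from_assignment=$(awk -v s='From: %2$s\n\nTo: %3$s' 'BEGIN { print s }')
printf '%s\n' "$from_assignment"
```

If the file content is ever handed to awk through `-v` or pasted into the program text, the second case applies, which is one place where doubling the backslashes would help.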

raidzero · September 19, 2011, 6:04pm

The sed command took care of awk interpreting the newline escapes. I have been trying to modify your awk script, but I keep failing. Could you show how to add some logic to ignore items with # in their content (color codes, which are always the same length)?

Thanks!

Chubler_XL · September 19, 2011, 8:43pm

This should do it for you:

awk -v FS="\"" '$3 !~ "#" {
        # Remember the order tokens come in
        if(!L[$2]) { C[N++]=$2; L[$2]=1; }
        # Save the longest
        if(length($3) > length(A[$2])) { A[$2]=$3; B[$2]=$0 }
}
END { for(M=0; M<N; M++) print B[C[M]] }' infile

or this if you still want # lines in the output:

awk -v FS="\"" '$3 !~ "#" {
        # Remember the order tokens come in
        if(!L[$2]) { C[N++]=$2; L[$2]=1; }
        # Save the longest
        if(length($3) > length(A[$2])) { A[$2]=$3; B[$2]=$0 }
        next
}
{ C[N++]=$0; B[$0]=$0 }
END { for(M=0; M<N; M++) print B[C[M]] }' infile

raidzero · September 19, 2011, 11:54pm

I apologize for not being clear... this is what is desired:

<color name="color1">#ff00aabb</color>
<color name="color2">#ff000000</color>
<color name="color1">#ffbbaa00</color>

I'd like to get rid of the second occurrence of color1 and keep the first.

Chubler_XL · September 20, 2011, 12:17am

Isn't that what Corona688's original solution does?

$ echo '<color name="color1">#ff00aabb</color>
<color name="color2">#ff000000</color>
<color name="color1">#ffbbaa00</color>' | awk -v FS="\"" '{
        # Remember the order tokens come in
        if(!L[$2]) { C[N++]=$2; L[$2]=1; }
        # Save the longest
        if(length($3) > length(A[$2])) { A[$2]=$3; B[$2]=$0 }
}
END { for(M=0; M<N; M++) print B[C[M]] }'
<color name="color1">#ff00aabb</color>
<color name="color2">#ff000000</color>

raidzero · September 20, 2011, 11:00am

huh. You're right. It just didn't work on a file with strings, colors, and a bunch of other xml stuff in it. I made a separate xml just for colors and ran that through it and it works.
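
One plausible reason for the mixed-file failure, though the thread does not confirm it, is that the script keys only on the quoted name. A string and a color that share a name then compete with each other, and lines with no quoted attribute all share an empty key and collapse into a single blank line. Below is a rough sketch under that assumption: it keys on the tag plus the name and passes every other line through untouched. The function name `dedupe_resources` and the array names are hypothetical.

```shell
# Reads XML-like lines on stdin; for each (tag, name) pair keeps the longest line,
# preferring the first one seen when lengths are equal.
dedupe_resources() {
    awk -F'"' '
    NF < 3 {
        # No quoted attribute on this line: keep it as is, under a key unique to the line.
        key = SUBSEP NR
        order[n++] = key
        best[key] = $0
        next
    }
    {
        tag = $1
        sub(/^[ \t]+/, "", tag)          # ignore indentation, e.g. "<string name="
        key = tag SUBSEP $2
        if (!(key in best)) {
            order[n++] = key             # remember first-seen order
            best[key] = $0
            len[key] = length($3)
        } else if (length($3) > len[key]) {
            best[key] = $0               # strictly longer replaces; ties keep the first
            len[key] = length($3)
        }
    }
    END { for (i = 0; i < n; i++) print best[order[i]] }'
}
```

It would be run as `dedupe_resources < infile > outfile`. Like the original, it assumes one element per line and the name as the first quoted attribute.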