[uniq + awk?] How to remove duplicate blocks of lines in files?

raidzero · September 20, 2011, 11:53am

Hello again, I want to remove all duplicate blocks of XML code in a file. This is an example:

input:

<string-array name="threeItems">
<item>item1</item>
<item>item2</item>
<item>item3</item>
</string-array>
<string-array name="twoItems">
<item>item1</item>
<item>item2</item>
</string-array>
<string-array name="threeItems">
<item>item1</item>
<item>item2</item>
<item>item3</item>
</string-array>

I cannot have two arrays by the name of "threeItems". This is the desired output:

<string-array name="threeItems">
<item>item1</item>
<item>item2</item>
<item>item3</item>
</string-array>
<string-array name="twoItems">
<item>item1</item>
<item>item2</item>
</string-array>

The arrays can come out in a different order, but each array's contents need to stay intact. I have been using awk to pull arrays and all their elements from several files into one, like this:

awk '/string-array name/,/string-array>/' "$file"

but it seems that the same array appears in more than one file ><

Thanks for any tips!

Corona688 · September 20, 2011, 12:05pm

Is the data actually as shown? One tag per line? Or do the contents sometimes splay across multiple lines?

---------- Post updated at 10:05 AM ---------- Previous update was at 09:57 AM ----------

You could use awk's record-separator feature, make each <string-array the beginning of a record and split fields on newlines:

$ cat 3items.awk
BEGIN { RS="<string-array";     FS="\n";        OFS="\n";       }

{
        if($1 ~ /name=/)
        {
                gsub(/ *name=\"|\">/, "", $1);
                if(!ARR[$1])
                {
                        ARR[$1]=1;
                        $1="<string-array name=\"" $1 "\">";
                        print;
                }
        }
}
$ cat data
<string-array name="threeItems">
<item>item1</item>
<item>item2</item>
<item>item3</item>
</string-array>
<string-array name="twoItems">
<item>item1</item>
<item>item2</item>
</string-array>
<string-array name="threeItems">
<item>item1</item>
<item>item2</item>
<item>item3</item>
</string-array>

$ awk -f 3items.awk < data # Use nawk/gawk outside Linux
<string-array name="threeItems">
<item>item1</item>
<item>item2</item>
<item>item3</item>
</string-array>

<string-array name="twoItems">
<item>item1</item>
<item>item2</item>
</string-array>
$

vgersh99 · September 20, 2011, 12:36pm

nawk -F'"' '/string-array/ && NF>1 {f=($(NF-1) in a)?0:1;if(f)a[$(NF-1)]} f' myFile
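Unpacked with comments, the one-liner above reads roughly as follows. This is only a sketch: it assumes plain `awk` behaves like `nawk` here, it uses a temporary file holding the data from the first post, and the variable names `keep` and `seen` are mine rather than the original `f` and `a`.

```shell
# Hypothetical sample file holding the data from the first post.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<string-array name="threeItems">
<item>item1</item>
<item>item2</item>
<item>item3</item>
</string-array>
<string-array name="twoItems">
<item>item1</item>
<item>item2</item>
</string-array>
<string-array name="threeItems">
<item>item1</item>
<item>item2</item>
<item>item3</item>
</string-array>
EOF

result=$(awk -F'"' '
    # Opening tag: with " as the field separator, the name is in $(NF-1).
    /string-array/ && NF > 1 {
        name = $(NF-1)
        keep = !(name in seen)   # 1 for the first block with this name, else 0
        seen[name]               # merely referencing the element records the name
    }
    # keep retains its value for the remaining lines of the block.
    keep
' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

The closing tag also matches /string-array/, but it contains no double quote, so NF is 1 and the flag is left alone.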

ahamed101 · September 20, 2011, 12:42pm

wicked solution vgersh99!

--ahamed

sk1418 · September 20, 2011, 12:43pm

 awk '{printf "%s", $0}' yourFile | sed 's#</string-array>#&\n#g' | awk '!a[$0]++'
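The pipeline above joins the whole file into one line, starts a new line after every closing tag, and then keeps the first copy of each line. Note that it compares whole blocks, not just names. As a sketch of the same idea in a single awk process (my variation, not sk1418's code), the following avoids the `\n` in the sed replacement, which is a GNU sed feature. It assumes each closing tag ends its input line, as in the posted data.

```shell
# Hypothetical sample file holding the data from the first post.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<string-array name="threeItems">
<item>item1</item>
<item>item2</item>
<item>item3</item>
</string-array>
<string-array name="twoItems">
<item>item1</item>
<item>item2</item>
</string-array>
<string-array name="threeItems">
<item>item1</item>
<item>item2</item>
<item>item3</item>
</string-array>
EOF

result=$(awk '
    { buf = buf $0 }                 # append the line without its newline
    /<\/string-array>/ {             # a closing tag completes one block
        if (!seen[buf]++) print buf  # print a block only on first sight
        buf = ""
    }
    END { if (buf != "") print buf } # anything left after the last closing tag
' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

Each surviving array ends up on one line, so the result would still need re-splitting if the original layout matters.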

raidzero · September 20, 2011, 12:49pm

Thanks again, Corona! However, the script produces this

<string-array name="emptyarray</string-array>">

from this input

<string-array name="emptyarray"></string-array>

and this

<string-array name="arrayWithPipe</string-array>">

from this input

        <string-array name="arrayWithPipe">
                <item>item1|item2</item>
        </string-array>

It seems like the pipe is interfering, and the same thing happens when an array has no elements?

I need a book on awk I think

---------- Post updated at 12:49 PM ---------- Previous update was at 12:47 PM ----------

wow guys I just saw all your replies.. I will try them out

this forum is great!

Corona688 · September 20, 2011, 12:50pm

It works with the data you posted. Please post more comprehensive input data.

ahamed101 · September 20, 2011, 12:52pm

It's amazing how people think! Nice one, sk1418.
Cheers!

--ahamed

Corona688 · September 20, 2011, 12:52pm

Perhaps this:

BEGIN { RS="<string-array";     FS="\n";        OFS="\n";       }

{
        if($1 ~ /name=\"[^\"]+\">$/)
        {
                gsub(/ *name=\"|\">/, "", $1);
                if(!ARR[$1])
                {
                        ARR[$1]=1;
                        $1="<string-array name=\"" $1 "\">";
                        print;
                }
        }
        else    print RS $0;
}

I haven't found any problem with records containing pipes.
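For anyone following along, the mangled empty array reported earlier in the thread can be reproduced in isolation. The sketch below feeds awk the text that remains of such a record once it has been split on "<string-array", then applies the same gsub and reassembly as the first script. The variable names are only for illustration.

```shell
# What is left of an empty array after awk splits on RS="<string-array":
first_field=' name="emptyarray"></string-array>'

mangled=$(printf '%s\n' "$first_field" | awk '{
    gsub(/ *name="|">/, "")                  # same substitution as the first script
    print "<string-array name=\"" $0 "\">"   # same reassembly of the opening tag
}')

printf '%s\n' "$mangled"
```

The closing tag shares the first line with the opening tag, so it survives the gsub and gets wrapped inside the rebuilt name attribute. The stricter test in the revised script, which requires the first line to end right after the name, skips such records and prints them unchanged.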

raidzero · September 21, 2011, 10:52am

Thanks everyone.

The XML file contains arrays, some of which are empty. They look like this:

<array name="array1"></array>

some also look like this

<style name="style1" parent="@android:style/mainStyle"></style>

Sometimes the arrays have elements and sometimes they do not. The contents of the arrays can be anything: special characters, newlines (I have been using sed to escape the newlines), and regular characters.

Is it possible to include these items in the awk search-delete routine?

vgersh99 · September 21, 2011, 11:01am

please provide a sample file 'rich' enough with different flavors of 'arrays'.

meanwhile, try this:

nawk -F'"' '$1 ~ "=$" && NF>1 {f=($(NF-1) in a)?0:1;if(f)a[$(NF-1)]} f' myFile

raidzero · September 21, 2011, 1:18pm

<string-array name="stringArray1">
<item>Element1|Element2<item>
<item>@android:color/black<item>
</string-array>
<style name="style1" parent="@android:style/mainStyle"></style>
<style name="style1" parent="@android:style/mainStyle">
<item name="android:textColor">@drawable/black</item>
<item name="android:typeface">sans</item>
<item name="android:textStyle">bold</item>
</style>
<string-array name="stringArray1">
<item>Element1|Element2<item>
<item>@android:color/black<item>
</string-array>
<string-array name="stringArray2"></string-array>

I think that about covers it.

vgersh99 · September 21, 2011, 1:27pm

looks good to me!

raidzero · September 21, 2011, 2:05pm

This command is nice. It works very well for putting everything between <array> and </array> on one line, which makes processing easier. However, it does not remove duplicate definitions.

vgersh99 · September 21, 2011, 2:07pm

have you tried what I'd posted with 'nawk'?

raidzero · September 21, 2011, 2:29pm

Yes I did, it produced the same output as the first nawk suggestion.

With each array and its elements on one line it might make this easier. Is there a way to delete lines based on the comparison result of the contents between the first < and > characters?
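One possible way to do that, as a sketch only: once every block sits on one line, awk can cut a key out of each line (the text between the first < and the first >) and print a line only the first time its key appears. The sample lines and the names `key` and `seen` below are made up for illustration.

```shell
# Hypothetical input: one complete block per line.
oneline=$(mktemp)
cat > "$oneline" <<'EOF'
<string-array name="a"><item>1</item></string-array>
<style name="s1" parent="p"></style>
<string-array name="a"><item>1</item></string-array>
<style name="s1" parent="p"><item name="x">y</item></style>
<string-array name="b"></string-array>
EOF

result=$(awk '
    {
        key = $0
        sub(/^[^<]*</, "", key)   # drop everything up to and including the first <
        sub(/>.*$/, "", key)      # drop the first > and everything after it
    }
    !seen[key]++                  # print only the first line carrying this key
' "$oneline")

printf '%s\n' "$result"
rm -f "$oneline"
```

Be aware that two blocks with the same opening tag but different contents count as duplicates here, so the populated style in the sample is dropped in favor of the empty one that came first.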

binlib · September 21, 2011, 2:31pm

Or even cuter (though it only works with the original posted data):

gawk 'BEGIN{ORS=RS="\n</string-array>\n"}!a[$0]++' infile

vgersh99 · September 21, 2011, 2:35pm

given your sample file myFile:

<string-array name="stringArray1">
<item>Element1|Element2<item>
<item>@android:color/black<item>
</string-array>
<style name="style1" parent="@android:style/mainStyle"></style>
<style name="style1" parent="@android:style/mainStyle">
<item name="android:textColor">@drawable/black</item>
<item name="android:typeface">sans</item>
<item name="android:textStyle">bold</item>
</style>
<string-array name="stringArray1">
<item>Element1|Element2<item>
<item>@android:color/black<item>
</string-array>
<string-array name="stringArray2"></string-array>

and the code:

nawk -F'"' '$1 ~ "=$" && NF>1 {f=($(NF-1) in a)?0:1;if(f)a[$(NF-1)]} f' myFile

I get:

<string-array name="stringArray1">
<item>Element1|Element2<item>
<item>@android:color/black<item>
</string-array>
<style name="style1" parent="@android:style/mainStyle"></style>
<item name="android:textColor">@drawable/black</item>
<item name="android:typeface">sans</item>
<item name="android:textStyle">bold</item>
</style>
<string-array name="stringArray2"></string-array>

looks fine to me. Anything wrong you can identify?
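One thing does stand out in that output: the second style1 opening tag is gone, but its <item> lines and closing </style> are still there, because each <item name=...> line also passes the test and switches printing back on. A block-aware variant could look like the sketch below. It is not the code posted above, and it rests on assumptions of mine: blocks are identified by tag plus name attribute, the first definition wins, opening tags start a line, and blocks of the same tag are never nested.

```shell
# Hypothetical copy of the sample file from this thread.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<string-array name="stringArray1">
<item>Element1|Element2<item>
<item>@android:color/black<item>
</string-array>
<style name="style1" parent="@android:style/mainStyle"></style>
<style name="style1" parent="@android:style/mainStyle">
<item name="android:textColor">@drawable/black</item>
<item name="android:typeface">sans</item>
<item name="android:textStyle">bold</item>
</style>
<string-array name="stringArray1">
<item>Element1|Element2<item>
<item>@android:color/black<item>
</string-array>
<string-array name="stringArray2"></string-array>
EOF

result=$(awk '
    # Outside a block, a line starting with <tag name="..." opens a new block.
    !inblock && match($0, /^[ \t]*<[A-Za-z-]+[ \t]+name="[^"]*"/) {
        head = substr($0, RSTART, RLENGTH)
        tag = head;  sub(/^[ \t]*</, "", tag);  sub(/[ \t].*$/, "", tag)
        name = head; sub(/^[^"]*"/, "", name);  sub(/"$/, "", name)
        inblock = 1
        keep = !seen[tag, name]++   # keep only the first block per tag and name
    }
    # Print lines outside any block, and lines of blocks being kept.
    !inblock || keep
    # The matching closing tag ends the block, kept or not.
    inblock && index($0, "</" tag ">") { inblock = 0 }
' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

With first-definition-wins, the empty style1 is kept and the populated one is dropped as a whole. If the populated one should win instead, the selection rule would need to change.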

vgersh99 · September 21, 2011, 2:37pm

Based on the OP's previous explanation, one cannot hard-wire the array "names" as they differ.

raidzero · September 21, 2011, 2:59pm

I have it wrapped in a function that takes the array name as an argument, and the function is run once per name. It takes several input files (one each for string-arrays, styles, plurals, dimens, strings, colors, drawables, etc., covering all the Android resources) and produces two final XML files: one for strings and colors, with each item on one line, and one called arrays.xml, which is what I am working with now. I hope that clears it up.

---------- Post updated at 02:59 PM ---------- Previous update was at 02:47 PM ----------

I figured it out...

here is my final function

dupArrayDelete() {
	echo "removing duplicate"
	arrayName=$1
	echo "$arrayName"
	# Get each array and its contents onto one line:
	# the first awk prints the file without newlines, putting the whole file on one line
	# (printf "%s" keeps any % in the data from being read as a format specifier);
	# sed inserts a newline after each closing </style>, </string-array>, etc.
	# (\n in the replacement is a GNU sed feature);
	# the second awk keeps only the first line for each distinct column 2
	awk '{printf "%s", $0}' "$2" | sed 's#</'"$arrayName"'>#&\n#g' | awk '!A[$2]++' >> "$3"
}

$1 is the array name, $2 is the source file and $3 is the destination

thanks everyone!