[uniq + awk?] How to remove duplicate blocks of lines in files?

Hello again. I want to remove all duplicate blocks of XML code in a file. This is an example:

input:

<string-array name="threeItems">
<item>item1</item>
<item>item2</item>
<item>item3</item>
</string-array>
<string-array name="twoItems">
<item>item1</item>
<item>item2</item>
</string-array>
<string-array name="threeItems">
<item>item1</item>
<item>item2</item>
<item>item3</item>
</string-array>

I cannot have two arrays by the name of "threeItems". This is the desired output:

<string-array name="threeItems">
<item>item1</item>
<item>item2</item>
<item>item3</item>
</string-array>
<string-array name="twoItems">
<item>item1</item>
<item>item2</item>
</string-array>

The arrays can be output in any order, but each array's contents need to stay intact. I have been using awk to pull arrays and all their elements from several files into one, like this:

awk '/string-array name/,/string-array>/' "$file"

but it seems that the same array appears in more than one file ><

Thanks for any tips!

Is the data actually as shown? One tag per line? Or do the contents sometimes splay across multiple lines?

You could use awk's record-separator feature, making each <string-array the beginning of a record and splitting fields on newlines:

$ cat 3items.awk
BEGIN { RS="<string-array";     FS="\n";        OFS="\n";       }

{
        if($1 ~ /name=/)
        {
                gsub(/ *name=\"|\">/, "", $1);
                if(!ARR[$1])
                {
                        ARR[$1]=1;
                        $1="<string-array name=\"" $1 "\">";
                        print;
                }
        }
}
$ cat data
<string-array name="threeItems">
<item>item1</item>
<item>item2</item>
<item>item3</item>
</string-array>
<string-array name="twoItems">
<item>item1</item>
<item>item2</item>
</string-array>
<string-array name="threeItems">
<item>item1</item>
<item>item2</item>
<item>item3</item>
</string-array>

$ awk -f 3items.awk < data # Use nawk/gawk outside Linux
<string-array name="threeItems">
<item>item1</item>
<item>item2</item>
<item>item3</item>
</string-array>

<string-array name="twoItems">
<item>item1</item>
<item>item2</item>
</string-array>
$

nawk -F'"' '/string-array/ && NF>1 {f=($(NF-1) in a)?0:1;if(f)a[$(NF-1)]} f' myFile
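For anyone new to awk, the one-liner above can be spelled out roughly as follows. This is only a sketch of the same logic, not a different method: the function name dedupe_blocks and the variable names seen and keep are made up for illustration, and it assumes plain awk behaves like nawk here.

```shell
# Hypothetical wrapper: prints each named <string-array> block once.
dedupe_blocks() {
    awk -F'"' '
        # Splitting on the double quote gives an opening tag such as
        # <string-array name="x"> three fields; a closing tag has one.
        /string-array/ && NF > 1 {
            name = $(NF - 1)        # text inside the last pair of quotes
            keep = !(name in seen)  # 1 the first time a name appears
            seen[name] = 1
        }
        keep                        # print lines while keep is 1
    ' "$@"
}
```

Call it as dedupe_blocks myFile, or pipe data into it. Note that the key is whatever sits inside the last pair of quotes on the opening line, so a tag with several attributes is keyed on its last attribute.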

Wicked solution, vgersh99!

--ahamed

 awk '{printf "%s", $0}' yourFile | sed 's#</string-array>#&\n#g' | awk '!a[$0]++'

Thanks again, corona! However, it is turning this

<string-array name="emptyarray"></string-array>

into this

<string-array name="emptyarray</string-array>">

and this

        <string-array name="arrayWithPipe">
                <item>item1|item2</item>
        </string-array>

into this

<string-array name="arrayWithPipe</string-array>">

It seems like the pipe is interfering, and the same thing happens when an array has no elements?

I think I need a book on awk.

Wow guys, I just saw all your replies. I will try them out.

This forum is great!

It works with the data you posted. Please post more comprehensive input data.

It's amazing how people think! Nice one, sk1418.
Cheers!

--ahamed

Perhaps this:

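BEGIN { RS="<string-array";     FS="\n";        OFS="\n";       }

{
        # Only records whose first line ends in name="...">
        if($1 ~ /name=\"[^\"]+\">$/)
        {
                gsub(/ *name=\"|\">/, "", $1);
                if(!ARR[$1])
                {
                        ARR[$1]=1;
                        $1="<string-array name=\"" $1 "\">";
                        print;
                }
        }
        # Pass anything else through; record 1 was not preceded by RS
        else if(NR > 1) print RS $0;
        else            printf "%s", $0;
}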

I haven't found any problem with records containing pipes.

Thanks everyone.

The XML file contains arrays, and some of them are empty. They look like this:

<array name="array1"></array>

Some also look like this:

<style name="style1" parent="@android:style/mainStyle"></style>

Sometimes the arrays have elements and sometimes they do not. The contents of the arrays can be anything: special characters, newlines (I have been using sed to escape the newlines), and regular characters.

Is it possible to include these items in the awk search-delete routine?

Please provide a sample file 'rich' enough to cover the different flavors of 'arrays'.

Meanwhile, try this:

nawk -F'"' '$1 ~ "=$" && NF>1 {f=($(NF-1) in a)?0:1;if(f)a[$(NF-1)]} f' myFile

<string-array name="stringArray1">
<item>Element1|Element2<item>
<item>@android:color/black<item>
</string-array>
<style name="style1" parent="@android:style/mainStyle"></style>
<style name="style1" parent="@android:style/mainStyle">
<item name="android:textColor">@drawable/black</item>
<item name="android:typeface">sans</item>
<item name="android:textStyle">bold</item>
</style>
<string-array name="stringArray1">
<item>Element1|Element2<item>
<item>@android:color/black<item>
</string-array>
<string-array name="stringArray2"></string-array>

I think that about covers it.

Looks good to me!

This command is nice. It works very well to put everything between <array> and </array> on one line, which makes it easier to process. However, it does not remove duplicate definitions.

Have you tried what I posted with 'nawk'?

Yes, I did. It produced the same output as the first nawk suggestion.

With each array and its elements on one line, this might be easier. Is there a way to delete lines based on comparing the contents between the first < and > characters?
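One possible way to do exactly that, as a minimal sketch: it assumes every block has already been joined onto a single line, and the function name dedupe_by_first_tag is made up for illustration.

```shell
# Hypothetical helper: keep the first line for each distinct opening tag.
dedupe_by_first_tag() {
    # With < and > as field separators, $1 is whatever precedes the first <
    # and $2 is the text between the first < and the first >.
    awk -F'[<>]' '!seen[$2]++' "$@"
}
```

Run it as dedupe_by_first_tag arrays.xml. Because the whole opening tag is the key, two tags with the same name but different attributes count as different, and lines with no < at all share one empty key, so only the first of those survives.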

Or even cuter (though it only works with the original posted data):

gawk 'BEGIN{ORS=RS="\n</string-array>\n"}!a[$0]++' infile
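POSIX leaves a multi-character RS unspecified, so the line above depends on gawk. A rough line-based sketch of the same idea (compare whole blocks and print each distinct block once) could look like this; the function name dedupe_whole_blocks is made up, and like the gawk version it only suits data where every block ends with a </string-array> line.

```shell
# Hypothetical helper: drop blocks whose full text has been seen before.
dedupe_whole_blocks() {
    awk '
        { block = block $0 "\n" }          # collect the current block
        /<\/string-array>/ {               # closing tag ends the block
            if (!(block in seen)) {
                seen[block] = 1
                printf "%s", block
            }
            block = ""
        }
        END { printf "%s", block }         # trailing lines, if any
    ' "$@"
}
```

Since the comparison is on the full text of a block, two arrays with the same name but different items are both kept.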

Given your sample file myFile:

<string-array name="stringArray1">
<item>Element1|Element2<item>
<item>@android:color/black<item>
</string-array>
<style name="style1" parent="@android:style/mainStyle"></style>
<style name="style1" parent="@android:style/mainStyle">
<item name="android:textColor">@drawable/black</item>
<item name="android:typeface">sans</item>
<item name="android:textStyle">bold</item>
</style>
<string-array name="stringArray1">
<item>Element1|Element2<item>
<item>@android:color/black<item>
</string-array>
<string-array name="stringArray2"></string-array>

and the code:

nawk -F'"' '$1 ~ "=$" && NF>1 {f=($(NF-1) in a)?0:1;if(f)a[$(NF-1)]} f' myFile

I get:

<string-array name="stringArray1">
<item>Element1|Element2<item>
<item>@android:color/black<item>
</string-array>
<style name="style1" parent="@android:style/mainStyle"></style>
<item name="android:textColor">@drawable/black</item>
<item name="android:typeface">sans</item>
<item name="android:textStyle">bold</item>
</style>
<string-array name="stringArray2"></string-array>

Looks fine to me. Can you identify anything wrong?
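One thing that does stand out in that output: the second <style> opening line is gone, but its <item> lines and the closing </style> are still printed. That appears to be because <item name="..."> lines also pass the $1 ~ "=$" test, and $(NF-1) picks up the last quoted attribute (parent) rather than name. A possible sketch that keys on the tag plus its name attribute and only looks for a new block once the previous one has closed is below. The function name dedupe_named_blocks is made up, and it assumes blocks are not nested inside a block of the same tag.

```shell
# Hypothetical helper: keep the first block for each tag/name pair.
dedupe_named_blocks() {
    awk '
        # Outside a block, print lines by default.
        !inblock { keep = 1 }
        # A line that starts an element while no block is open begins a block.
        !inblock && match($0, /<[A-Za-z][A-Za-z0-9_-]*/) {
            tag = substr($0, RSTART + 1, RLENGTH - 1)
            closer = "</" tag ">"
            key = tag
            if (match($0, /name="[^"]*"/))
                key = tag " " substr($0, RSTART, RLENGTH)
            keep = !(key in seen)
            seen[key] = 1
            inblock = 1
        }
        keep
        # The matching closing tag ends the block (it may be on the same line).
        inblock && index($0, closer) { inblock = 0 }
    ' "$@"
}
```

With the sample file above, this keeps the first stringArray1 block, the first (empty) style1 element, and stringArray2. The first definition of a name always wins, so if the populated style1 should be the one kept, the policy would need changing.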

Based on the OP's previous explanation, one cannot hard-wire the array "names" as they differ.

I have it wrapped in a function that takes the array name as an argument, and the function is run once per name. It takes several input files, one each for string-arrays, styles, plurals, dimens, strings, colors, drawables, etc. (all the Android resources), and produces two final XML files: one for strings and colors with each item on one line, and one called arrays.xml, which is what I am working with now. I hope that clears it up.

I figured it out...

Here is my final function:

dupArrayDelete() {
	echo "removing duplicate"
	arrayName=$1
	echo "$arrayName"
	# Put each array and its contents on a single line:
	# the first awk prints the file without newlines, so the whole file becomes one line;
	# sed inserts a newline after each closing </style>, </string-array>, etc.;
	# the second awk keeps only the first line for each distinct second field
	awk '{printf "%s", $0}' "$2" | sed 's#</'"$arrayName"'>#&\n#g' | awk '!A[$2]++' >> "$3"
}

$1 is the array name, $2 is the source file, and $3 is the destination.
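The \n in the sed replacement is a GNU sed extension, so the pipeline may not behave the same with other sed implementations. A rough single-awk sketch with the same three arguments is below. It is not the function above: the name dup_array_delete_awk is made up, it assumes each closing tag ends its line, and it keys on the first name="..." attribute in the block instead of the second whitespace-separated field.

```shell
# Hypothetical variant: $1 = element name, $2 = source file, $3 = destination.
dup_array_delete_awk() {
    awk -v tag="$1" '
        BEGIN { closer = "</" tag ">" }
        {
            block = block $0                 # join lines, as printf did
            if (index($0, closer)) {         # closing tag completes a block
                key = block
                if (match(block, /name="[^"]*"/))
                    key = substr(block, RSTART, RLENGTH)
                if (!(key in seen)) {
                    seen[key] = 1
                    print block
                }
                block = ""
            }
        }
        END { if (block != "") print block } # leftover text, if any
    ' "$2" >> "$3"
}
```

For example, dup_array_delete_awk string-array arrays_in.xml arrays.xml appends one line per distinct string-array name to arrays.xml (both file names here are placeholders).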

Thanks, everyone!