Removing duplicate items from an array

Hello Experts,

I am trying to write a shell script to find duplicate items in an array, this is what i have tried :

#!/bin/bash
echo "This is another sample Progg to remove duplicate items from an array"
echo "How many number do you want to insert"
read n
for (( i=0; i<$n; i++ ))
do
   echo "Enter Number"
   read num
   arr=$num
done
 echo "Insertion is done, Below are the element of the array"
 echo ${arr[@]}

for (( i=0; i<$n; i++ ))
do
    echo "------------i here is ${arr}-------" 
    for (( j=0; j<$n; j++ ))
  do
       echo "j here is ${arr[j]}"
      if [[ $i -eq $j ]]
        then
           echo "Skip this iteration"
            continue
        else
           if [[ arr -eq arr[j] ]]
             then
                 echo "There is a duplicate!!"
                 arr=(${arr[@]:0:$j} ${arr[@]:(( $j+1))})
                 break
           fi
      fi
  done
done
echo ${arr[@]}

The script is working properly and removing the duplicate element as expected, the problem is since i am using a counter (variable n) equivalent to number of items in array to loop, the number of iteration is always constant even though I am popping out the duplicate element as soon as it is encountered.

Any help on how to optimise the script to remove the iteration that are not required?

I don't think your code works as well as you think it does. Try running your script again telling it that you want it to process 5 numbers and then enter the same number 5 times. Do you get the results you wanted? Then try running your script again telling it that you want it to process 5 numbers and then enter 1, 2, 1, 3, and 1. Would you expect the final result to be 2 3 1 or 1 2 3 ?

See if this works better for you:

#!/bin/bash
echo "This is a sample Prog to remove duplicate integer items from an array"
echo "How many numbers do you want to insert"
read n
for (( i=0; i<$n; i++ ))
do
	echo "Enter Number"
	read num
	arr=$num
done
echo "Insertion is done, Below are the elements of the array"
echo ${arr[@]}

for (( i=0; i<$((n-1)); i++ ))
do
	echo "------------arr here is ${arr}-------" 
	for (( j=$((i+1)); j<$n; j++ ))
	do
		echo "arr[j] here is ${arr[j]}"
		if [[ arr -eq arr[j] ]]
		then
			echo "There is a duplicate!!"
			arr=(${arr[@]:0:$j} ${arr[@]:$((j+1))})
			((j--))
			((n--))
		fi
	done
done
echo ${arr[@]}

This seems to work OK with both bash version 3.2.57(1)-release (x86_64-apple-darwin16) and with ksh version (AT&T Research) 93u+ 2012-08-01 . I assume it will also work with more recent releases of both of these shells.

3 Likes

Hi Don,

Thanks for your reply, the script is now working perfectly.

Can you please help to explain below part, the script seems to be working fine without decreasing the counter as well.

((j--)) 

The script is decreasing the counters.

After removing an element from the array of n items, the arithmetic command ((n--)) decreases the value of n. Note that n-1 is also the upper limit on the outer loop (indexed by i) and n is the upper limit on the inner loop (indexed by j).

When we remove an element from the array (i.e. remove ${array[$j]} by moving higher elements of the array down one position), the ((j--)) forces the next iteration of the inner loop to look at the same element (after moving down higher elements). Without the ((j--)) the first element that was moved down won't be examined to see if it is also a duplicate of the element indexed by the outer loop.

P.S.: Note also that your script compared every element of the array to every element of the array (skipping the cases where it was comparing an item to itself). My script compares element 1 in the array to elements 2 through n, then compares element 2 to elements 3 through n, and so on until it compares element n-1 to element n on the last iteration through the nested loops. Doing it this way eliminates the need to check for comparing an element to itself and gets rid of the duplicated comparisons with pairs of elements that have already been compared.

Don, your solution was 100% correct, as always!
As a note, within $(( )) and (( )) evaluation you do not need further $ evaluations. So the following works as well

for (( i=0; i<n; i++ ))
for (( i=0; i<(n-1); i++ ))

Last but not least, in bash-4 you can use an associative array to automatically sort out duplicates:

#!/bin/bash
echo "This is a sample Prog to remove duplicate integer items from an array"
echo "How many numbers do you want to insert"
read n
declare -A hash
for (( i=0; i<n; i++ ))
do
        echo "Enter Number"
        read num
        # we address the array by $num, and do not need a value
        hash[$num]=""
done
echo "Insertion is done, Below are the elements of the array"
# dump the keys
printf "%s\n" "${!hash[@]}"
echo "now loop through the keys"
for i in "${!hash[@]}"
do
   echo "do something with $i"
done
1 Like

I avoided associative arrays in my suggestion, because I only have access to bash version 3.2.57(1)-release (x86_64-apple-darwin16) and it doesn't have associative arrays. I normally use ksh instead of bash (and it has had associative arrays since 1993). Unfortunately, bash and ksh declare associative arrays incompatibly. As you have shown above, bash declares an associative array with:

declare -A arrayname...

while ksh uses:

typeset -A arrayname...

with that one line change to your code, it will work with a 1993 or later ksh instead of a version 4 or later bash .

I agree that using an associative array is the easy way to avoid duplicated values when creating an array. But note that you when you say that it "automatically sort out duplicates", the sort order produced by ${!arrayname[@]} is an alphanumeric sort of the indices; not a numeric sort. So if the values provided as the array indices are 5, 20, and 300; the output from:

echo ${!arrayname[@]}

will be:

20 300 5

not the:

5 20 300

that would be produced by a numeric sort.

The OP for this thread never said whether or not maintaining the order of the input values was desirable or important; nor whether sorted output would be a nicety or a nuisance. If numerically sorted output is important, the user entering values would either need to supply leading zeros so all numeric value are the same width, or the list of values will have to be sorted after duplicates are removed (either with an associative array or with the indexed array used in the original post in this thread).