Removing duplicate files from list with different path

I have a list of all the jar files shipped with the product I work on. Some jar files appear in this list more than once, but under different folders.
My input file looks like this:

/path/1/to a.jar
/path/2/to a.jar
/path/1/to b.jar
/path/1/to c.jar
/path/1/to d.jar
/path/2/to c.jar

Now I need to remove the duplicate entries, i.e. the extra a.jar, c.jar and so on.

The final list should look like this:

/path/1/to a.jar
/path/1/to b.jar
/path/2/to c.jar
/path/1/to d.jar

This is the script I have so far:

#! /bin/sh
cp jar.txt jarnew.txt
for file in $(cat jar.txt)
do
        FILE1=`basename $file`
        for dup in $(cat jar.txt)
        do
                FILE2=`basename $dup`
                if [ "$file" != "$dup" -a "$FILE1" == "$FILE2" ] ; then
                        sed -e '/($dup)/ d' <jarnew.txt >jarnew.txt.tmp
                        echo "$FILE1 $FILE2"
                        mv jarnew.txt.tmp jarnew.txt
                fi
        done
done

But the main functionality of the script is not working, i.e. the if block does not do what I expect. It could be the sed command or the logic.

jarnew.txt comes out identical to jar.txt.

Any pointers on how to proceed?

Vino

Try this script:

#!/bin/sh
> final.jar
while read line; do

    FILE=`basename $line`
    DIR=`echo $line | awk '{ print $(NF-1) }'`
    if [[ $FILE == "c.jar" && $DIR == "2" ]]
    then
        echo $line >> final.jar
    elif [[ $DIR == "1" && $FILE != "c.jar" ]]
    then
        echo $line >> final.jar
    fi

done < jar.txt

You will get the result. Check it.

On the assumption that you only have filenames in the list ...

sort -t"/" -u +3 jar.txt > jarnew.txt
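The obsolete `+POS` field syntax (zero-based) can also be spelled with the standard `-k` option (one-based); an equivalent sketch, still assuming the jar name is always the fourth `/`-separated field:

```shell
# -t/ : split fields on "/"
# -k4 : compare from the fourth field (the jar name) to the end of the line
# -u  : keep only one line per distinct key
sort -t/ -u -k4 jar.txt > jarnew.txt
```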

Muthu,

I don't intend to hard-code any particular jar file the way you have in

if [[ $FILE == "c.jar" && $DIR == "2" ]]

Rather, I would like it generalized.

How do you go about that ?

Vino

Generally, scripts are written around a pattern. In your sample input, only c.jar is taken from path/2/, so I simulated your input to produce the required output.

Since your input and output did not show a general rule, the script was written using specific filenames.

There are too many jar files for that. If I could collect these jar files manually, I might as well do away with the script.

I need to get the jar files from the list, dynamically.

Vino

JustIce,

I don't think sort is a possible solution. The path length varies, i.e. the directory structure differs between files. Some of them have a depth of 3, others a depth of more than 3.

Vino

#! /bin/ksh

jarlist=`awk -F"/" '{print $NF}' jar.txt | sort -u`
for jarfile in $jarlist
do
    grep $jarfile jar.txt | uniq
done > jarnew.txt

exit 0

JustIce,

Just tried out your solution. It sorts the entries, but I can still see the duplicate entries.

Vino

Try this one ...

#! /bin/ksh

# unique list of jar basenames (last "/"-separated field)
jarlist=`awk -F"/" '{print $NF}' jar.txt | sort -u`
for jarfile in $jarlist
do
    # keep only the first path that matches each jar name
    grep $jarfile jar.txt | head -1
done > jarnew.txt

exit 0
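The loop above can also be collapsed into a single awk pass; a sketch that, like the `grep ... | head -1` loop, keeps the first path seen for each jar basename:

```shell
# seen[] is an awk associative array keyed on the last "/"-separated
# field (the jar name); the post-increment makes the pattern true only
# the first time each basename appears, so only that line is printed.
awk -F/ '!seen[$NF]++' jar.txt > jarnew.txt
```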

JustIce,

That works perfectly!

Thanks !