List Duplicates

Hi All,
This is not a class assignment. I would like to know how to write an awk script that
lists all the duplicate names from a file. Have a look below:
Sl No Name Dt of birth Location
1 aaa 1/01/1975 delhi
2 bbb 2/03/1977 mumbai
3 aaa 1/01/1976 mumbai
4 bbb 2/03/1975 chennai
5 aaa 1/01/1975 kolkatta
6 bbb 2/03/1977 bangalore

Here is what I would like: if the DOB is the same and the name is the same, then print all the details. I have tried the command "uniq -D" in the awk script
but could not succeed.
Thanks in advance for your guidance!
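As an aside, "uniq -D" (a GNU extension) cannot do this on its own: uniq compares whole lines, or at best skips a fixed number of leading fields with -f, and here the serial number and the location differ even when name and DOB match. A two-pass read of the same file is one way around that; a minimal sketch, with inputfile as a placeholder name:

awk 'NR == FNR { cnt[$2, $3]++; next }    # pass 1: count each (name, DOB) pair
     cnt[$2, $3] > 1' inputfile inputfile # pass 2: print records whose pair repeats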

You can do something like this:

sort -k2,3 inputfile | \
awk '
   BEGIN { first_duplicate = 1 }
   {
     name = $2;
     dob  = $3;
     if (name == prv_name && dob == prv_dob) {
         if (first_duplicate)
            print "\n" prv_rec;
         print $0;
         first_duplicate = 0;
     } else {
        prv_name = name;
        prv_dob  = dob;
        prv_rec  = $0;
        first_duplicate = 1;
     }
   }
'

Output for your sample data:

1 aaa 1/01/1975 delhi
5 aaa 1/01/1975 kolkatta

2 bbb 2/03/1977 mumbai
6 bbb 2/03/1977 bangalore
nawk '{
  idx = $2 SUBSEP $3
  arr[idx] = (idx in arr) ? arr[idx] ORS $0 : $0
  arrCnt[idx]++
}
END {
  for (i in arr)
     if (arrCnt[i] > 1) print arr[i]
}' myInputFile
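Note that SUBSEP is awk's built-in subscript separator (by default "\034"); $2 SUBSEP $3 builds the same composite key that the arr[$2, $3] form would. A one-liner to convince yourself:

awk 'BEGIN { a["aaa", "1/01/1975"] = 1; print (("aaa" SUBSEP "1/01/1975") in a) }'   # prints 1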

The user asked for:

In other words, he wants to output the lines when both the date
and the name are the same.

I have the following test data:

1 aaa 1/01/1975 delhi
2 bbb 2/03/1977 mumbai
3 aaa 1/01/1976 mumbai
4 bbb 2/03/1975 chennai
5 aaa 1/01/1975 kolkatta
6 xxx 1/01/1976 mumbai
7 bbb 2/03/1977 bangalore
8 aaa 1/01/1976 mumbai

Based on the requirement, the correct output should be:

3 aaa 1/01/1976 mumbai
6 xxx 1/01/1976 mumbai
8 aaa 1/01/1976 mumbai

Running Aigles' code:

1 aaa 1/01/1975 delhi
5 aaa 1/01/1975 kolkatta

3 aaa 1/01/1976 mumbai
8 aaa 1/01/1976 mumbai

2 bbb 2/03/1977 mumbai
7 bbb 2/03/1977 bangalore

Running vgersh99's code:

2 bbb 2/03/1977 mumbai
7 bbb 2/03/1977 bangalore
1 aaa 1/01/1975 delhi
5 aaa 1/01/1975 kolkatta
3 aaa 1/01/1976 mumbai
8 aaa 1/01/1976 mumbai

Shell_Life,
'DOB and name' - not 'DOB and location'. 'Second and Third' - not 'Third and Fourth' fields.

Vgersh,
Thanks for clarifying.
I was under the impression that 'name' was Mumbai, Kolkatta, etc.
Great catch!
Cheers.

When I read the question, I had in mind a solution using arrays, like that of vgersh99.
Then I tried to see whether it was easy to do without arrays, and that is the solution I posted.
vgersh99's solution is simpler and more readable.

I wanted to see the difference in performance between the two solutions for a large volume of data.
To that end, I adapted the two solutions to count the duplicate file names on my system.

I built a file containing the list of all the files (field 1: directory path, field 2: name of the file).
The resulting file contains approximately 64,000 duplicate file names.

# find / | sed 's!/\([^/]*\)$!/ \1!' > files.txt
# wc files.txt
  534733 1069473 34359804 files.txt
# head -10 files.txt
/ 
/ lost+found
/ home
/home/ lost+found
/home/ guest
/home/guest/ .sh_history
/home/ gseyjr
/home/gseyjr/ .profile
/home/ usertest
/home/usertest/ .profile
#

The solution with arrays:

$ cat dup1.sh
awk '
   { 
      Files[$2] = ($2 in Files) ? Files[$2] ORS $0 : $0; 
      FilesCnt[$2]++ 
   }
   END { 
      for (f in Files) {
         if (FilesCnt[f] > 1) {
            print Files[f];
            duplicates++;
         }
      }
      print "\nDuplicates : " duplicates;
   }
' files.txt
$ time dup1.sh > /dev/null
real    0m27.22s
user    0m26.74s
sys     0m0.40s
$

The solution without arrays:

The -T option of the sort command was required because there wasn't sufficient space available for work files on the current filesystem.

$ cat dup2.sh
sort -T /refiea/tmp -k2,2 files.txt |
awk '
   BEGIN { first_duplicate = 1 }
   {
     file = $2;
     if (file == prv_file) {
         if (first_duplicate) {
            print prv_rec;
            duplicates++
         }
         print $0;
         first_duplicate = 0;
     } else {
        prv_file = file;
        prv_rec  = $0;
        first_duplicate = 1;
     }
   }
   END {
      print "Duplicates : " duplicates;
   }
'
$ time dup2.sh > /dev/null
real    0m39.85s
user    0m2.92s
sys     0m0.10s
$

In fact, the sort itself takes more time to run than the complete solution with arrays.

$ time sort -T /refiea/tmp -k2,2 files.txt > /dev/null
real   33.06
user   32.28
sys    0.73
$

Conclusion:

The arrays win the contest.

Awk arrays are your friends.
They are easy to use and powerful.
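For anyone new to them, the counting idiom used throughout this thread fits in one line (the file name is illustrative):

awk '{ cnt[$0]++ } END { for (l in cnt) if (cnt[l] > 1) print cnt[l], l }' somefile   # each repeated line with its count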

Nicely done analysis, aigles!

I created a test file with 114,688 records:

    1 aaa 1/01/1975 delhi
    2 bbb 2/03/1977 mumbai
    3 aaa 1/01/1976 mumbai
    4 bbb 2/03/1975 chennai
    5 aaa 1/01/1975 kolkatta
    6 xxx 1/01/1976 mumbai
    7 aaa 1/01/1975 delhi
...
114686 xxx 1/01/1976 mumbai
114687 bbb 2/03/1977 bangalore
114688 aaa 1/01/1976 mumbai

Running vgersh99's array solution:

>time nawk_array.sh > /dev/null
real    9m2.96s
user    9m2.67s
sys     0m0.08s

Running Aigles' sort solution:

>time nawk_dups.sh > /dev/null
real    0m10.22s
user    0m2.55s
sys     0m0.03s

Guys, the 'sort' command is highly optimized.
Arrays work great for a small number of occurrences or when constants must be used.
Otherwise, arrays should be used with caution, especially when several
thousand occurrences are involved.

Strange... your results are the opposite of mine.

I ran my test scripts again under AIX with the same result; my input file contains 534,733 records for a total size of 34 MB.

The solution with arrays is faster than the solution with sort (and again, the sort alone takes longer than the entire solution with arrays).

Any idea to explain this mystery?

Aigles,
Try to create a data set similar to the one I tested against, as follows:

1) Create a file 'A' with:
aaa 1/01/1975 delhi
bbb 2/03/1977 mumbai
aaa 1/01/1976 mumbai
bbb 2/03/1975 chennai
aaa 1/01/1975 kolkatta
xxx 1/01/1976 mumbai
aaa 1/01/1975 delhi
xxx 1/01/1976 mumbai
bbb 2/03/1977 bangalore
aaa 1/01/1976 mumbai

2) Keep repeating the following process until you get to over 110,000 records:
cp A B
cat B >> A

3) After you have the number of records you want:
cat -n A > B
sed 's/^I/ /' B > A    ### This removes the ctl-I (tab) that cat -n inserts after each number; ^I is a literal tab (type Ctrl-V then Tab).
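A small loop can automate steps 2 and 3 (a sketch; it assumes the step 1 records are already in A, and uses tr instead of typing a literal tab):

while [ $(wc -l < A) -lt 110000 ]; do
   cp A B
   cat B >> A          # append the copy, doubling the line count
done
cat -n A > B           # prefix each record with a line number
tr '\t' ' ' < B > A    # replace the tab that cat -n inserts after each number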

Dear Gurus,
What a great forum I am in!!! I really feel proud to be a member of it!

I thank aigles, Shell_Life and vgersh99 from the bottom of my heart, because
I had been trying with the code (also by carrying the previous record along) but to no
avail, and totally forgot about arrays. Once again, my sincere thanks to
all of you for sparing your time and providing me with the solution. One more
clarification I would like, though I have yet to test:
Will arrays work with a large volume?
Or will the simple script do the job? Will you please provide some detail?
Something for my background knowledge.

I confirm your results:

$ wc -l vdup.txt
  163840 vdup.txt
$ time vdup_noarrays.sh

real    0m14,68s
user    0m7,45s
sys     0m0,04s
$ time vdup_arrays.sh  

real    16m51,15s
user    16m41,74s
sys     0m0,21s
$ 

I think the problem doesn't come from the number of elements in the array.
In this test the array contains only 5 elements, but the elements are very large (up to 1,300 KB) and modified very often.

The situation was the opposite in my previous test:
there were more than 100,000 elements, with a maximum size of 200 KB and a low update rate.
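That would explain it: each arr[idx] = arr[idx] ORS $0 assignment rebuilds a fresh copy of the whole accumulated string, so with few keys and huge values the total work grows roughly quadratically with group size. A minimal illustration of the two access patterns (illustrative only, not from the thread's tests):

awk 'BEGIN { for (i = 1; i <= 20000; i++) s = s ORS "line " i }'   # quadratic: every assignment recopies the accumulated string
awk 'BEGIN { for (i = 1; i <= 20000; i++) a[i] = "line " i }'      # linear: each record gets its own element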

A little modification to vgersh99's solution (store each record under its own subscript instead of appending to one growing string) and arrays win!

nawk '{
  idx = $2 SUBSEP $3
  arr[idx, ++arrCnt[idx]] = $0
}
END {
  for (i in arrCnt)
     if (arrCnt[i] > 1)
        for (c = 1; c <= arrCnt[i]; c++)
           print arr[i, c];
}' vdup.txt > /dev/null
$ time vdup_noarrays.sh 

real   6.11
user   2.75
sys    0.03
$ time vdup_arrays.sh

real   1008.69
user   1001.02
sys    0.21
$ time vdup_arrays2.sh    # Modified solution

real   5.74
user   5.55
sys    0.15
$

Aigles,
As I said:

Dear Guru
Thanks lot for your efforts and love to see the result and surely have devoted good time for my question.Hats of to you for your in deapth
reply which will not only increase my knowledge but my respect for
you and this Forum .

inputfile:

1 aaa 1/01/1975 delhi
2 bbb 2/03/1977 mumbai
3 ccc 2/03/1977 mumbai
4 ddd 1/01/1975 chennai
5 aaa 1/01/1975 kolkatta
6 bbb 2/03/1977 bangalore

Program 1:

sort -k2,3 inputfile | \
awk '
   BEGIN { first_duplicate = 1 }
   {
     name = $2;
     dob  = $3;
     if (name == prv_name && dob == prv_dob) {
         if (first_duplicate)
            print "\n" prv_rec;
         print $0;
         first_duplicate = 0;
     } else {
        prv_name = name;
        prv_dob  = dob;
        prv_rec  = $0;
        first_duplicate = 1;
     }
   }
'

Result:

1 aaa 1/01/1975 delhi
5 aaa 1/01/1975 kolkatta

2 bbb 2/03/1977 mumbai
6 bbb 2/03/1977 bangalore

Questions:

How do I direct the result to an output file, given the code in program 1?

------------------------------------------------------------------------

Program 2:

# The sort is now: sort -k3,3 inputfile
# and the condition (name == prv_name && dob == prv_dob)
# becomes (name != prv_name && dob == prv_dob)

sort -k3,3 inputfile | \
awk '
   BEGIN { first_duplicate = 1 }
   {
     name = $2;
     dob  = $3;
     if (name != prv_name && dob == prv_dob) {
         if (first_duplicate)
            print "\n" prv_rec;
         print $0;
         first_duplicate = 0;
     } else {
        prv_name = name;
        prv_dob  = dob;
        prv_rec  = $0;
        first_duplicate = 1;
     }
   }
'

Result:


1 aaa 1/01/1975 delhi
4 ddd 1/01/1975 chennai

2 bbb 2/03/1977 mumbai
3 ccc 2/03/1977 mumbai

Questions:

What code changes are needed for program 3 to give results similar to those of programs 1 and 2?

Experts, please help!

Program 3 code:

nawk '{
   idx = $2 SUBSEP $3
   arr[idx] = (idx in arr) ? arr[idx] ORS $0 : $0
   arrCnt[idx]++
}
END {
   for (i in arr)
      if (arrCnt[i] > 1) print arr[i]
}' myInputfile
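For question 2, one possible adaptation of program 3 (a sketch, not the thread's answer) is to group on the DOB alone and print the groups that contain more than one distinct name. Note that this prints every record of a qualifying DOB group, so it is close to, but not identical to, program 2's adjacent-line logic (here it would also print record 5, since its DOB group holds two distinct names):

nawk '{
   grp[$3] = ($3 in grp) ? grp[$3] ORS $0 : $0                 # collect records by DOB only
   if (!(($3, $2) in seen)) { seen[$3, $2] = 1; names[$3]++ }  # count distinct names per DOB
}
END {
   for (d in grp)
      if (names[d] > 1) print grp[d]
}' myInputfile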

Program 1: after the last ' add:

 > newfile
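That is, redirect the standard output of the whole pipeline; schematically (with a trivial awk body standing in for the program 1 body):

sort -k2,3 inputfile | \
awk '{ print }' > newfile   # whatever the last stage of the pipeline prints lands in newfile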

I suppose your requirement is this:
input(a):

1 aaa 1/01/1975 delhi
2 bbb 2/03/1977 mumbai
3 aaa 1/01/1976 mumbai
4 bbb 2/03/1975 chennai
5 aaa 1/01/1975 kolkatta
6 bbb 2/03/1977 bangalore

output:

2 bbb 2/03/1977 mumbai
6 bbb 2/03/1977 bangalore

1 aaa 1/01/1975 delhi
5 aaa 1/01/1975 kolkatta

code:

awk '{
  num[$2$3]++                       # count records per (name, DOB) key
  str[$2$3] = str[$2$3] "\n" $0     # accumulate the records for that key
}
END {
  for (i in num)
     if (num[i] >= 2)
        print str[i]                # the leading "\n" yields a blank line before each group
}' a
#! /opt/third-party/bin/perl

# Accumulate records keyed on name + DOB; records sharing a key
# are joined with a "##" separator.
open(FILE, "<", "input") or die "Cannot open input: $!";

while(<FILE>) {
  chomp;
  my @arr = split(/ /);
  if( defined($fileHash{$arr[1].$arr[2]}) ) {
    $fileHash{$arr[1].$arr[2]} .= ("##" . $_);
  }
  else {
    $fileHash{$arr[1].$arr[2]} = $_;
  }
}

close(FILE);

# A value containing "##" holds more than one record: print them all.
foreach my $k ( keys %fileHash ) {
  $v = $fileHash{$k};
  if( $v =~ /##/ ) {
    my @arr = split(/##/, $v);
    foreach my $a ( @arr ) {
      print "$a\n";
    }
  }
}

exit 0;

This should be even faster.