Extract and count the number of duplicate rows

Hi All,

I need to extract duplicate rows from a file and write these bad records into another file, and I need a count of these bad records.
I have this command:

awk '
{s[$0]++}
END {
  for(i in s) {
    if(s[i]>1) {
      print i
    }
  }
}' ${TMP_DUPE_RECS}>>${TMP_BAD_DATA_DUPE_RECS}

but this doesn't solve my problem.

Input:
A
  A
  A
  B
  B
  C
Desired Output:
  
A
  A
  B

Count of bad records=3
But when I run my script I get output as:
A
B
Count of bad records=2, which is not correct.
As always any help appreciated.

I hope that this is what you want:

awk '
{s[$0]++}
END {
  for(i in s) {
    for(j=1;j<s[i];j++){
      print i;
    }
  }
}' ${TMP_DUPE_RECS}>${TMP_BAD_DATA_DUPE_RECS}

Yes man, I tested and it's working.
Thanks very much for the code. Can you please explain what it basically does? The for loop, specifically.

Thanks again for the help!

awk '
{s[$0]++}              # this populates an array: its indices are the distinct values in the file (A B C)
END {                  # and each value is the count of that element: e.g. if i=A --> s[i]=3
  for(i in s) {        # for each distinct value i in s
    for(j=1;j<s[i];j++){ # s[i] is the count of element i: in this way we
      print i;           # print the element i s[i]-1 times
    }
  }
}' ${TMP_DUPE_RECS}>${TMP_BAD_DATA_DUPE_RECS}
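Putting it together, the fixed script can be run against the sample data, with the bad-record count taken from the output file. A sketch, using illustrative file names (note that `for (i in s)` visits the array in an unspecified order, so the duplicates may not come out in input order):

```shell
# Sample input from the thread (leading whitespace stripped for simplicity).
printf 'A\nA\nA\nB\nB\nC\n' > dupe_recs.txt

# For each distinct line i, print it s[i]-1 times: one copy per duplicate.
awk '
{s[$0]++}
END {
  for (i in s)
    for (j = 1; j < s[i]; j++)
      print i
}' dupe_recs.txt > bad_recs.txt

# The count of bad records is just the number of output lines.
echo "Count of bad records=$(wc -l < bad_recs.txt | tr -d ' ')"
```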

I don't see the need for the END clause for this problem. Doesn't:

awk 'c[$0]++{print}' ${TMP_DUPE_RECS}>${TMP_BAD_DATA_DUPE_RECS}

produce the same output?
While reading each record, if that record has been seen before, print it then.
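That one-liner can also keep the count in the same pass, avoiding the END loop entirely. A sketch with illustrative file names (`n` is just a running tally):

```shell
printf 'A\nA\nA\nB\nB\nC\n' > dupe_recs.txt

# c[$0]++ is 0 (false) the first time a line appears and non-zero afterwards,
# so only repeat occurrences are printed; n counts them as we go.
awk '
c[$0]++ { print; n++ }
END     { printf "Count of bad records=%d\n", n > "/dev/stderr" }
' dupe_recs.txt > bad_recs.txt
```

Unlike the END-loop version, this preserves the input order of the duplicates.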

But, looking at it again, this is the same as the script you initially provided that you said was not working.
If what you want is each distinct input line printed once (duplicates removed), that would be:

awk 'c[$0]++==0{print}' ${TMP_DUPE_RECS}>${TMP_BAD_DATA_DUPE_RECS}

which produces the output:

A
  A
  B
  C

which is not what was originally requested.

If there is only one word on each input line, and you want to print lines that are duplicates of previous lines (ignoring leading whitespace), try:

awk 'c[$1]++{print}' ${TMP_DUPE_RECS}>${TMP_BAD_DATA_DUPE_RECS}

which produces the output:

  A
  A
  B

but this still isn't the output originally requested. Please explain in more detail what it is that you want AND give us sample input and output that match your description.

Sounds like you want to know:

1) identity of duplicated (bad) rows.
2) count of duplicated (bad) rows.

What about the much simpler:

$ uniq -c temp.x | grep -v " 1 "
      3 A
      2 B

If you want to change 2 -> 1 and 3 -> 2 in a further step, that would be easy.
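That further step is a one-line awk filter over the `uniq -c` output. A sketch (sorting first, since `uniq` only collapses adjacent duplicates; `temp.x` holds the sample data):

```shell
printf 'A\nA\nA\nB\nB\nC\n' > temp.x

# $1 is the occurrence count from uniq -c; $1-1 is the number of bad copies.
sort temp.x | uniq -c | awk '$1 > 1 { print $1-1, $2 }'
```

which prints `2 A` and `1 B`: the per-line counts of bad copies, summing to the overall count of 3.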