How to get Duplicate rows in a file

Hi all,

I have written a shell script. The output file of this script contains SQL output.

From that file, I want to extract the rows that have multiple entries (duplicate rows).
For example, the output file looks like this:

===============================================================
<SH12_MC30_CE_VS_NY_HIST_T>

397 44847
400 33653
401 46455

<SH12_MC30_CE_VS_NY_HIST_T_BKP>

397 44847
398 40107
399 39338
400 33653

In this output, I want only the duplicate numeric rows. The file also contains separator and header lines, and a naive duplicate check would count those as duplicates too. So I want only the entries that appear more than once and that consist of numbers.

Can anyone please tell me the command?
Thanks in advance.

Regards,
Raghu.

cat file1 file2 | \
   grep -v -e '^=' -e '^<' -e '^$' | \
   awk '{ arr[$0]++ } END { for (i in arr) if (arr[i] > 1) print i }' > newfile

cat pipes the files into grep so grep's output carries no filename prefixes; grep removes the header, separator, and blank lines; awk counts each remaining line and prints those seen more than once.
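
For example, if the sample output above is saved as file1 (name assumed), running the same pipeline without the redirect shows the repeated numeric rows; note that awk's for (i in arr) does not guarantee any particular output order:

> cat file1 | grep -v -e '^=' -e '^<' -e '^$' | awk '{ arr[$0]++ } END { for (i in arr) if (arr[i] > 1) print i }'
397 44847
400 33653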

Try this

#!/bin/ksh
# sort so that duplicate lines become adjacent, then print each line that matches the previous one
sort "$1" > sortedfile
nawk 'BEGIN { while ((getline line < "sortedfile") > 0) { if (line == prev && line != last) { print line; last = line }; prev = line } }'
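
A minimal run, assuming the script above is saved as finddups.ksh and the SQL output is in sqlout.txt (both names hypothetical). Filtering to numeric lines first keeps repeated blank or separator lines out of the result:

> grep "^[0-9]" sqlout.txt > numeric.tmp
> ./finddups.ksh numeric.tmp
397 44847
400 33653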

Hi Jim,

I could understand your command up to the second line.
I couldn't understand the awk part, because I don't know awk's features.
But it is working, thank you very much for that. awk is so nice.
Can you give me another way to do it, without using awk?

Thanks & Regards,
Raghunadh.

nawk '/^[0-9]/ {a[$0]++} END {for (i in a) if (a[i] > 1) print i}' myOutputFile
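
Spelled out with comments, this is the same one-liner expanded (myOutputFile is whatever file your script writes):

nawk '
/^[0-9]/ { a[$0]++ }           # count each line that starts with a digit
END {                          # after the whole file has been read
    for (i in a)               # walk the distinct lines
        if (a[i] > 1)          # keep only those seen more than once
            print i
}' myOutputFile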

Hi vgersh99,

Thank you very much for your reply.
The 'nawk' command is nice, but I don't know awk's functionality, so if I put this command in my script I can't explain it to anyone. Could you please give me a command that does not use 'awk' or 'nawk'?

Thanks in advance,

Regards,
Raghu.

I used awk at the end only to tidy the output format. This could be done with a cut command as well, although extra care is needed with field positioning.

> cat file9
===============================================================
<SH12_MC30_CE_VS_NY_HIST_T>
===============================================================
397 44847
400 33653
401 46455
===============================================================
<SH12_MC30_CE_VS_NY_HIST_T_BKP>
===============================================================
397 44847
398 40107
399 39338
400 33653

> grep "^[0-9]" file9 | sort | uniq -cd
      2 397 44847
      2 400 33653

> grep "^[0-9]" file9 | sort | uniq -cd | awk '{print $2" "$3}'
397 44847
400 33653

and, if you really don't want awk

> grep "^[0-9]" file9 | sort | uniq -cd | tr -s " " | cut -d" " -f3-4
397 44847
400 33653

Added a quicker way:

> grep "^[0-9]" file9 | sort | uniq -d 
397 44847
400 33653
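
For sorted input, uniq -d prints exactly one copy of each line that occurs more than once, so this last version needs no awk, tr, or cut at all.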

Hi joeyg,

Thank you very much.

Regards,
Raghu.