Extract lines that have duplicates and count them

Dear friends

I have a big file and I want to export the file with a new column counting, for each line, the occurrences of its first-column value, ex:

-bash-3.00$ cat INTCONT-IS.CSV
M205-00-106_AMDRN:1-0-6-22,12-662-4833,intContact,2016-11-15 02:32:16,50
M205-00-106_AMDRN:1-0-23-17,12-616-0462,intContact,2016-11-15 02:32:23,50
M205-00-106_AMDRN:1-0-6-22,12-621-0646,intContact,2016-11-15 01:19:01,50
M213-00-312_BJWRM:1-0-8-12,12-621-3479,intContact,2016-11-15 01:19:17,50
M213-00-312_BJWRM:1-0-8-29,12-216-5205,intContact,2016-11-15 01:19:30,50
M213-00-312_BJWRM:1-0-12-28,12-621-7122,intContact,2016-11-15 01:19:44,50
M205-00-106_AMDRN:1-0-6-22,\N,intContact,2016-11-15 01:19:55,50
M205-00-106_AMDRN:1-0-6-22,12-574-4566,intContact,2016-11-15 07:46:00,50
V_TARTEABH_TARU013-A:1-1-1-32,13-823-5712,intContact,2016-11-15 22:46:22,50

The ideal output shall be the same original file with a new column for the repetition count of the first column, ex:

-bash-3.00$ cat INTCONT-IS.CSV
M205-00-106_AMDRN:1-0-6-22,12-662-4833,intContact,2016-11-15 02:32:16,50,4
M205-00-106_AMDRN:1-0-23-17,12-616-0462,intContact,2016-11-15 02:32:23,50,1
M205-00-106_AMDRN:1-0-6-22,12-621-0646,intContact,2016-11-15 01:19:01,50,4
M213-00-312_BJWRM:1-0-8-12,12-621-3479,intContact,2016-11-15 01:19:17,50,1
M213-00-312_BJWRM:1-0-8-29,12-216-5205,intContact,2016-11-15 01:19:30,50,1
M213-00-312_BJWRM:1-0-12-28,12-621-7122,intContact,2016-11-15 01:19:44,50,1
M205-00-106_AMDRN:1-0-6-22,\N,intContact,2016-11-15 01:19:55,50,4
M205-00-106_AMDRN:1-0-6-22,12-574-4566,intContact,2016-11-15 07:46:00,50,4
V_TARTEABH_TARU013-A:1-1-1-32,13-823-5712,intContact,2016-11-15 22:46:22,50,1

Another question: what will be the command if I make this based on the 3rd column, not the first column?

Thanks a lot

Hello is2_Egypt,

Welcome to the forums. Regarding your question, as per your expected output, could you please try the following and let me know if this helps.

awk -F, 'FNR==NR{A[$1]++;next} ($1 in A){print $0 FS A[$1]}'   Input_file  Input_file
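
Spelled out with comments, the one-liner reads (the logic is unchanged):

awk -F, '
FNR==NR { A[$1]++; next }        # 1st pass: count how often each first-field value occurs
($1 in A) { print $0 FS A[$1] }  # 2nd pass: append that count as a new, comma-separated column
' Input_file Input_file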

Also, for your 2nd query where you have mentioned the 3rd field is to be checked, could you please provide more details on it, because the field separator could be : or , but the expected output is achieved only when we take , as the field separator. So kindly clarify this for us.

Thanks,
R. Singh


Hello Singh

Thanks a lot.
I have only one separator, which is (,). For example, the following is one entry: M205-00-106_AMDRN:1-0-6-22

So is the above still valid, or is there a change to be made?

Thanks a lot

For the third field to be counted, replace every occurrence of $1 in RavinderSingh13's proposal with $3. As $3 in every line is "intContact", the count added will be 9 for all lines.
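
That is, the command becomes:

awk -F, 'FNR==NR{A[$3]++;next} ($3 in A){print $0 FS A[$3]}'   Input_file  Input_file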

Hello friends

I ran it and got the below error (sorry, I am still a beginner):

-bash-3.00$ awk -F, 'FNR==NR{A[$1]++;next} ($1 in A){print $0 FS A[$1]}'   INTCONT-IS.CSV  intwithcount.CSV
awk: syntax error near line 1
awk: bailing out near line 1
-bash-3.00$

What are your OS and awk version?

And, you NEED to repeat the identical input file as the program does two iterations on it. To produce an output file, use shell redirection.
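
If this is Solaris (the "bailing out" message is typical of the old /usr/bin/awk there), using nawk or /usr/xpg4/bin/awk in place of awk should help, e.g.:

nawk -F, 'FNR==NR{A[$1]++;next} ($1 in A){print $0 FS A[$1]}' INTCONT-IS.CSV INTCONT-IS.CSV > intwithcount.CSV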


Hello dears

It seems to be working OK now; below is the command:

-bash-3.00$ nawk -F, 'FNR==NR{A[$1]++;next} ($1 in A){print $0 FS A[$1]}'   INTCONT-IS.CSV INTCONT-IS.CSV > newintwithcount.CSV

Here is an example of the output:

IP202ROWS-R:1-1-11-17,12-669-1626,intContact,2016-11-15 19:46:00,50,10
IP202ROWS-R:1-1-13-26,12-660-7710,intContact,2016-11-15 00:00:00,50,5
IP202ROWS-R:1-1-14-2,12-660-5834,intContact,2016-11-15 00:00:00,50,8
IP215SULI-I:1-1-1-10,12-252-2488,intContact,2016-11-15 16:46:00,50,2

I am exporting the output file and will confirm with a manual check and advise back. Thanks a lot for your great support.


Hello dears, it is working perfectly now, thanks a lot.

If I want, in one step on the original file, to export only the lines that have more than 5 and fewer than 12 duplicates, how can I do that?

I'd propose you check your other thread, adapt the solution given there and post it here for discussion.
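
For reference, one way to adapt the two-pass approach above (a sketch only; the output file name and the appended count column are assumptions):

awk -F, 'FNR==NR{A[$1]++; next} A[$1]>5 && A[$1]<12 {print $0 FS A[$1]}' INTCONT-IS.CSV INTCONT-IS.CSV > filtered.CSV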

Thanks, RudiC and all friends who replied; the answers are perfect and I have it clear now.

Since we are reading the same file twice, the test for ($1 in A) is superfluous:

awk -F, 'FNR==NR{A[$1]++; next}{print $0 FS A[$1]}' file file
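
If reading the input twice is inconvenient (for example, when it comes from a pipe), a single-pass variant that buffers the lines in memory is also possible. A sketch, assuming the file fits in memory:

awk -F, '{line[NR]=$0; key[NR]=$1; cnt[$1]++} END{for(i=1;i<=NR;i++) print line[i] FS cnt[key[i]]}' file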