Duplicate records

jiam912 · August 19, 2016, 5:27am

Gents,

I have a file which contains duplicate records in column 1, but the values in column 2 are different.

3099753489 3
3099753489 5
3101954341 12
3101954341 14
3102153285 3
3102153285 5
3102153297 3
3102153297 5

I would like to get something like this:

Desired output

3099753489 3 5
3101954341 12 14
3102153285 3 5
3102153297 3 5

I am trying this code, but it does not work.

awk '{
			D[$1]}{key[$1,$2]++}
			END{
			for (i in key) {
			split (i, T, SUBSEP)
			print T[1],key, T[2]}}' file

Can you please help me?

RudiC · August 19, 2016, 5:30am

Is your input ALWAYS two records per key? In sequence?

RavinderSingh13 · August 19, 2016, 5:36am

Hello jiam912,

Assuming there are always 2 fields in your Input_file, the following may help.
I- If you do not need the output in the same order as the Input_file:

awk '{A[$1]=A[$1]?A[$1] OFS $2:$2} END{for(i in A){print i OFS A[i]}}'  Input_file

Output will be as follows.

3099753489 3 5
3102153285 3 5
3101954341 12 14
3102153297 3 5

II- If you need the output in the same order as the Input_file, the following may help.

awk 'FNR==NR{A[$1]=A[$1]?A[$1] OFS $2:$2;next} ($1 in A){print $1 OFS A[$1];delete A[$1]}'  Input_file  Input_file

Output will be as follows.

3099753489 3 5
3101954341 12 14
3102153285 3 5
3102153297 3 5

Thanks,
R. Singh

jiam912 · August 19, 2016, 6:14am

Hi RudiC.

Sometimes there can be more than 2 records in sequence.

Hi RavinderSingh13.

Thanks a lot

RavinderSingh13 · August 19, 2016, 6:27am

Hello jiam912,

If you have more than 2 fields in your Input_file, the following may help. Let's say the Input_file is as follows.

cat Input_file
3099753489 3
3099753489 5
3101954341 12 21 31  34 56 78
3101954341 14
3102153285 3
3102153285 5
3102153297 3
3102153297 5

Then the code is as follows.

awk 'FNR==NR{Q="";for(i=2;i<=NF;i++)Q=Q (i>2?OFS:"") $i;if($1 in A)A[$1]=A[$1] OFS Q;else A[$1]=Q;next} ($1 in A){print $1 OFS A[$1];delete A[$1]}'  Input_file  Input_file

Output will be as follows.

3099753489 3 5
3101954341 12 21 31 34 56 78 14
3102153285 3 5
3102153297 3 5

Thanks,
R. Singh

RudiC · August 19, 2016, 6:38am

Try also

awk 'NR == 1 || $1 != LAST {printf "%s%s", NR==1?"":RS, LAST = $1} {printf " %s", $2} END {print _} ' file
3099753489 3 5
3101954341 12 14 16 24
3102153285 3 5
3102153297 3 5

jiam912 · August 19, 2016, 6:42am

Dear R. Singh

Thanks a lot

And for this case?

input

3099753489 3
3099753489 5
3099753489 7
3101954341 12
3101954341 14
3102153285 3
3102153285 5
3102153297 3
3102153297 5
3102153297 8

output

3099753489 3 5 7
3101954341 12 14
3102153285 3 5
3102153297 3 5 8

RavinderSingh13 · August 19, 2016, 7:19am

Hello jiam912,

Yes, it works for 2 fields too, as shown below. I only added the 2nd solution in case you have more than 2 fields in your Input_file.

awk 'FNR==NR{Q="";for(i=2;i<=NF;i++)Q=Q (i>2?OFS:"") $i;if($1 in A)A[$1]=A[$1] OFS Q;else A[$1]=Q;next} ($1 in A){print $1 OFS A[$1];delete A[$1]}'  Input_file  Input_file

Output will be as follows.

3099753489 3 5 7
3101954341 12 14
3102153285 3 5
3102153297 3 5 8

If you have any queries, please let us know in detail.

Thanks,
R. Singh

jiam912 · August 19, 2016, 7:34am

Dear R. Singh

Appreciate your help.

It works perfectly.

Don_Cragun · August 19, 2016, 7:38am

Hi jiam912,
Did you try the code RudiC suggested in post #6 in this thread? Or, if his suggestion gave you a syntax error:

awk '
NR == 1 || $1 != LAST {
	printf "%s%s", (NR==1?"":RS), LAST = $1
}
{	printf " %s", $2
}
END {	print _
}' file

which should work as long as there are only two fields per input line and all lines with the same first field value are adjacent in your input file. If your real input files (like your samples) meet the above requirements, this should be faster than RavinderSingh13's suggestion because it only needs to read your input file once.
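
If the lines sharing a key are not guaranteed to be adjacent, one possible workaround is to group them with a stable sort first and then run the same single-pass awk. This is only a sketch under assumptions: it relies on the -s (stable) option, which GNU and BSD sort provide but POSIX does not require, and the function name merge_sorted is made up for illustration. Note that the output then comes out in sorted key order rather than in the order of the input file.

```shell
# merge_sorted is a hypothetical helper name, not something from this thread.
merge_sorted() {
	# -k1,1 sorts on the first field only; -s (an extension) keeps the
	# original order of lines that share the same key.
	sort -s -k1,1 "$@" |
	awk '
	NR == 1 || $1 != LAST {
		printf "%s%s", (NR == 1 ? "" : RS), (LAST = $1)
	}
	{	printf " %s", $2
	}
	END {	if (NR) print ""
	}'
}

# Keys 1 and 2 are interleaved here on purpose.
printf '%s\n' '2 b' '1 a' '2 c' '1 d' | merge_sorted
```

With the four sample lines above, this should print "1 a d" and "2 b c" on separate lines.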

RavinderSingh13 · August 19, 2016, 7:58am

Hello Don/jiam912,

I am not sure how fast the following solution is, but it:
I- Reads the Input_file only once.
II- Keeps the output in the same order as the Input_file.
III- Handles lines with more than 2 fields for a given value of the 1st field.

awk '{Q="";for(i=2;i<=NF;i++)Q=Q (i>2?OFS:"") $i;if($1 in A)A[$1]=A[$1] OFS Q;else{D[++j]=$1;A[$1]=Q}} END{for(k=1;k<=j;k++)print D[k] OFS A[D[k]]}'  Input_file

Output will be as follows.

3099753489 3 5
3101954341 12 21 31 34 56 78 14
3102153285 3 5
3102153297 3 5

Where Input_file is as follows.

cat Input_file
3099753489 3
3099753489 5
3101954341 12 21 31  34 56 78
3101954341 14
3102153285 3
3102153285 5
3102153297 3
3102153297 5

EDIT: Adding a non-one-liner form of the solution too.

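awk '{
        # Join fields 2..NF of the current line into Q.
        Q=""
        for(i=2;i<=NF;i++){
                Q=Q (i>2?OFS:"") $i
        }
        if($1 in A){
                # Key seen before: append the new values.
                A[$1]=A[$1] OFS Q
        }
        else{
                # First time this key appears: remember its position.
                D[++j]=$1
                A[$1]=Q
        }
     }
     END{
        # Print the keys in order of first appearance.
        for(k=1;k<=j;k++){
                print D[k] OFS A[D[k]]
        }
     }
     '   Input_file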

Thanks,
R. Singh

RudiC · August 19, 2016, 8:09am

In case you have more than 2 fields, try

awk 'NR == 1 || $1 != LAST {printf "%s%s", (NR==1?"":RS), LAST = $1} {sub ("^" LAST, _); printf "%s", $0} END {print _} ' file

Akshay_Hegde · August 19, 2016, 8:16am

Try

Input

[akshay@localhost tmp]$ cat f
3099753489 3
3099753489 5
3099753489 7
3101954341 12
3101954341 14
3102153285 3
3102153285 5
3102153297 3
3102153297 5
3102153297 8

Output

[akshay@localhost tmp]$ awk '$1 in A{A[$1]=A[$1] OFS $2; next}{ O[++o]=$1; A[$1]=$2}END{for(i=1; i in O; i++)print O[i],A[O[i]]}' f
3099753489 3 5 7
3101954341 12 14
3102153285 3 5
3102153297 3 5 8

Readable version

awk '
$1 in A{
       A[$1]=A[$1] OFS $2 
       next
}
{ 
       O[++o]=$1; 
       A[$1]=$2
}
END{
      for(i=1; i in O; i++)
            print O[i],A[O[i]]
}' f

Don_Cragun · August 19, 2016, 8:45am

With any shell using POSIX shell syntax, you could also do it without awk :

#!/bin/ksh
{	read -r last rest
	printf '%s %s' "$last" "$rest"
	while read -r key rest
	do	[ "$key" = "$last" ] && printf ' %s' "$rest" ||
		    printf '\n%s %s' "$key" "$rest"
		last="$key"
	done
	echo
} < input

which also works with two or more fields/line as long as all lines in the input file with a given key are adjacent.

Akshay_Hegde · August 19, 2016, 8:54am

ravindersingh13:

Hello Akshay,

The above code works well when there are 2 fields in Input_file. When there are more than 2 fields, a small change to your code does the trick.
Let's say Input_file is as follows.
cat Input_file
3099753489 3
3099753489 5
3101954341 12 21 31  34 56 78
3101954341 14
3102153285 3
3102153285 5
3102153297 3
3102153297 5
Then the following is the code (edited from your last post):
awk '{for(i=2;i<=NF;i++){W=W?W OFS $i:$2}}$1 in A{A[$1]=A[$1] OFS W;W=""; next}{ O[++o]=$1; A[$1]=W;W=""}END{for(i=1; i in O; i++)print O[i],A[O[i]]}'  Input_file
Output will then be as follows.
3099753489 3 5
3101954341 12 21 31 34 56 78 14
3102153285 3 5
3102153297 3 5
Thanks,
R. Singh

Hello RavinderSingh13 !

The user has not mentioned more than 2 fields anywhere in this thread. Why do you assume more than is required and confuse others too?

Please note records != fields
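
To illustrate the distinction with awk's built-in variables: NR counts records (lines, by default) and NF counts the fields in the current record. A small sketch using two lines from the samples in this thread; the function name show_counts is made up for illustration.

```shell
# show_counts is a hypothetical helper name, not something from this thread.
show_counts() {
	# NR = number of the current record, NF = number of fields in it.
	awk '{ printf "record %d has %d fields\n", NR, NF }' "$@"
}

printf '%s\n' '3099753489 3' '3101954341 12 21 31  34 56 78' | show_counts
```

This should report 2 fields for the first record and 7 fields for the second, even though both are single records.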

RavinderSingh13 · August 19, 2016, 9:18am

My apologies to Akshay and all. My intention in these forums is to learn, learn and only learn, and to try to help, nothing else. I have deleted my post now.

Thanks,
R. Singh

jiam912 · August 19, 2016, 9:34am

Thanks to all.

Appreciate your help.
