Duplicate records

Gents,

I have a file which contains duplicate records in column 1, but the values in column 2 are different.

3099753489 3
3099753489 5
3101954341 12
3101954341 14
3102153285 3
3102153285 5
3102153297 3
3102153297 5

I would like to get something like this:

output desired

3099753489 3 5
3101954341 12 14
3102153285 3 5
3102153297 3 5

I am trying with this code, but it does not work.

awk '{
			D[$1]}{key[$1,$2]++}
			END{
			for (i in key) {
			split (i, T, SUBSEP)
			print T[1],key, T[2]}}' file

Can you please help me?

Is your input ALWAYS two records? In sequence?

Hello jiam912,

Assuming there are always 2 fields per line in your Input_file, the following may help.
I- If you do not need the output in the same order as Input_file:

awk '{A[$1]=A[$1]?A[$1] OFS $2:$2} END{for(i in A){print i OFS A[i]}}'  Input_file

Output will be as follows.

3099753489 3 5
3102153285 3 5
3101954341 12 14
3102153297 3 5

II- If you need the output in the same order as Input_file, the following may help.

awk 'FNR==NR{A[$1]=A[$1]?A[$1] OFS $2:$2;next} ($1 in A){print $1 OFS A[$1];delete A[$1]}'  Input_file  Input_file

Output will be as follows.

3099753489 3 5
3101954341 12 14
3102153285 3 5
3102153297 3 5

Thanks,
R. Singh

Hi RudiC.

There can sometimes be more than 2 records in sequence.

Hi RavinderSingh13.

Thanks a lot

Hello jiam912,

In case you have more than 2 fields in your Input_file, the following may help. Let's say the following is the Input_file.

cat Input_file
3099753489 3
3099753489 5
3101954341 12 21 31  34 56 78
3101954341 14
3102153285 3
3102153285 5
3102153297 3
3102153297 5

Then the following code handles it.

awk 'FNR==NR{Q=$2;for(i=3;i<=NF;i++)Q=Q OFS $i;A[$1]=($1 in A)?A[$1] OFS Q:Q;next} ($1 in A){print $1 OFS A[$1];delete A[$1]}'  Input_file  Input_file

Output will be as follows.

3099753489 3 5
3101954341 12 21 31 34 56 78 14
3102153285 3 5
3102153297 3 5

Thanks,
R. Singh

Try also

awk 'NR == 1 || $1 != LAST {printf "%s%s", NR==1?"":RS, LAST = $1} {printf " %s", $2} END {print _} ' file
3099753489 3 5
3101954341 12 14 16 24
3102153285 3 5
3102153297 3 5

Dear R. Singh

Thanks a lot

And what about this case?

input

3099753489 3
3099753489 5
3099753489 7
3101954341 12
3101954341 14
3102153285 3
3102153285 5
3102153297 3
3102153297 5
3102153297 8

output

3099753489 3 5 7
3101954341 12 14
3102153285 3 5
3102153297 3 5 8

Hello jiam912,

Yes, it works for 2 fields as well; I only added the 2nd solution in case you have more than 2 fields in your Input_file.

awk 'FNR==NR{Q=$2;for(i=3;i<=NF;i++)Q=Q OFS $i;A[$1]=($1 in A)?A[$1] OFS Q:Q;next} ($1 in A){print $1 OFS A[$1];delete A[$1]}'  Input_file  Input_file

Output will be as follows.

3099753489 3 5 7
3101954341 12 14
3102153285 3 5
3102153297 3 5 8

If you have any questions, please let us know in detail.

Thanks,
R. Singh

Dear R. Singh

Appreciate your help.

It works perfectly.

Hi jiam912,
Did you try the code RudiC suggested in post #6 in this thread? Or, if his suggestion gave you a syntax error:

awk '
NR == 1 || $1 != LAST {
	printf "%s%s", (NR==1?"":RS), LAST = $1
}
{	printf " %s", $2
}
END {	print _
}' file

which should work as long as there are only two fields per input line and all lines with the same first field value are adjacent in your input file. If your real input files (like your samples) meet the above requirements, this should be faster than RavinderSingh13's suggestion because it only needs to read your input file once.
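
If the input cannot be relied on to keep equal keys adjacent, one way to meet that requirement is to sort the file before the single-pass awk step. The sketch below assumes two numeric fields per line; the name `group_adjacent` is made up for illustration. Note that sorting on both fields means the values of each key come out in ascending order rather than in input order.

```shell
#!/bin/sh
# group_adjacent: hypothetical wrapper name, not something from this thread.
# Sorts "key value" lines numerically by key and then by value, so that equal
# keys become adjacent, then joins the values of each key in one awk pass.
group_adjacent() {
    sort -k1,1n -k2,2n | awk '
        # First line, or the key changed: finish the previous output line
        # (if there is one) and start a new line with the new key.
        NR == 1 || $1 != last {
            if (NR > 1) printf "\n"
            printf "%s", $1
            last = $1
        }
        # Every line contributes its second field to the current output line.
        { printf " %s", $2 }
        # Terminate the last output line, unless the input was empty.
        END { if (NR > 0) printf "\n" }'
}

# Demonstration with keys that are not adjacent in the input.
printf '%s\n' '3102153297 5' '3099753489 5' '3102153297 3' '3099753489 3' |
    group_adjacent
```

The sort costs extra time on large files, so for input that is already grouped the plain single-pass command remains the cheaper choice.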

Hello Don/jiam912,

I am not sure how fast the following solution is, but it will:
I- Read the Input_file only once.
II- Keep the output in the same order as Input_file.
III- Handle lines that have more than 2 fields for a given value of the 1st field.

awk '{Q=$2;for(i=3;i<=NF;i++)Q=Q OFS $i;A[$1]=($1 in A)?A[$1] OFS Q:Q;if(!C[$1]++){D[++j]=$1}} END{for(k=1;k<=j;k++){print D[k] OFS A[D[k]]}}'  Input_file

Output will be as follows.

3099753489 3 5
3101954341 12 21 31 34 56 78 14
3102153285 3 5
3102153297 3 5

Where Input_file is as follows.

cat Input_file
3099753489 3
3099753489 5
3101954341 12 21 31  34 56 78
3101954341 14
3102153285 3
3102153285 5
3102153297 3
3102153297 5

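EDIT: Adding a multi-line form of the solution as well.

awk '{
        # Join every field after the key into Q, starting fresh on each line.
        Q=$2
        for(i=3;i<=NF;i++)
                Q=Q OFS $i
        # Append the values of this line to what is already stored for the key.
        A[$1]=($1 in A)?A[$1] OFS Q:Q
        # Remember each key only once, in order of first appearance.
        if(!C[$1]++)
                D[++j]=$1
}
END {
        # Print the keys in the order in which they were first seen.
        for(k=1;k<=j;k++)
                print D[k] OFS A[D[k]]
}' Input_file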

Thanks,
R. Singh

In case you have more than 2 fields, try

awk 'NR == 1 || $1 != LAST {printf "%s%s", (NR==1?"":RS), LAST = $1} {sub ("^" LAST, _); printf "%s", $0} END {print _} ' file

Try

Input

[akshay@localhost tmp]$ cat f
3099753489 3
3099753489 5
3099753489 7
3101954341 12
3101954341 14
3102153285 3
3102153285 5
3102153297 3
3102153297 5
3102153297 8

Output

[akshay@localhost tmp]$ awk '$1 in A{A[$1]=A[$1] OFS $2; next}{ O[++o]=$1; A[$1]=$2}END{for(i=1; i in O; i++)print O[i],A[O[i]]}' f
3099753489 3 5 7
3101954341 12 14
3102153285 3 5
3102153297 3 5 8

Readable version

awk '
$1 in A{
       A[$1]=A[$1] OFS $2 
       next
}
{ 
       O[++o]=$1; 
       A[$1]=$2
}
END{
      for(i=1; i in O; i++)
            print O[i],A[O[i]]
}' f

With any shell using POSIX shell syntax, you could also do it without awk:

#!/bin/ksh
{	read -r last rest
	printf '%s %s' "$last" "$rest"
	while read -r key rest
	do	[ "$key" = "$last" ] && printf ' %s' "$rest" ||
		    printf '\n%s %s' "$key" "$rest"
		last="$key"
	done
	echo
} < input

which also works with two or more fields/line as long as all lines in the input file with a given key are adjacent.

Hello RavinderSingh13!

The user has not mentioned more than 2 fields anywhere in the current thread. Why do you assume more than is required and confuse others too?

Please note records != fields
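
To make the distinction concrete: in awk a record is one input line (counted by NR) and a field is one whitespace-separated column within that line (counted by NF). A minimal sketch follows; the function name `count_fields` is only illustrative.

```shell
#!/bin/sh
# count_fields: hypothetical helper name, used only for this illustration.
# For every input record (line), print its record number and its field count.
count_fields() {
    awk '{ print "record " NR ": " NF " fields" }'
}

# Two records: the first has 2 fields, the second has 4.
printf '%s\n' '3099753489 3' '3101954341 12 21 31' | count_fields
```

So "more than 2 records" for a key means more input lines with that key, while "more than 2 fields" means more columns on a single line.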

My apologies to Akshay and all. My intention in these forums is to learn, learn and only learn, and to try to help, nothing else. I have deleted my post now.

Thanks,
R. Singh

Thanks to all.

Appreciate your help.
