Remove duplicated records and update last line record counts

Hi Gurus,

I need to remove duplicate lines in a file and update the TRAILER (last line) record count. The file is comma delimited; field 2 is the key used to identify duplicated records.

I can use the command below to remove the duplicates, but I don't know how to replace the last line's 2nd field with the new count.

awk -F"," '{if($2 in a);else {print $0}{a[$2]=$0}}' file.CSV

Below is a sample file. Before removing duplicate records the total record count is 6; after removing the duplicated record it is 5.

before removing

D,1693,20000101,0.480
D,1694,20000101,0.80
D,1695,20000101,0.480
D,1695,20000101,0.480
D,2001,20000101,0.007486
D,2002,20000101,0.0098
T,6, 9020, 330

after removing duplicates

D,1693,20000101,0.480
D,1694,20000101,0.80
D,1695,20000101,0.480
D,2001,20000101,0.007486
D,2002,20000101,0.0098
T,5, 9020, 330

Thanks in advance.

Your description and code are not clear enough for me to be sure that this is what you want, but the following works with the sample data provided:

awk '
BEGIN {	FS = OFS = ","	# read and write comma-separated fields
}
$1 == "D" {
	if($2 in a)	# skip "D" lines whose key was already seen
		next
	a[$2]		# remember this key
	printed++	# count unique "D" lines
}
$1 == "T" {
	$2 = printed	# overwrite the trailer count with the unique "D" count
}
1' file.CSV

Clearly field #2 alone is not the key for determining duplicate records; at the least, it is field #2 only when field #1 is "D". And, since you are storing the entire line in the a[] array for some reason, maybe you only want to delete identical lines instead of deleting lines with identical keys???

The above code assumes you just want to delete lines with identical keys, where the key is field #2 on lines whose field #1 is "D". On the line whose field #1 is "T", field #2 is replaced with the number of unique "D" lines seen before that line. All lines whose field #1 is neither "D" nor "T" are copied to the output without being counted.
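
For example, a hypothetical header line starting with "H" (not in your sample data) would be copied through untouched and left out of the trailer count:

printf 'H,header\nD,1,x\nD,1,x\nT,9\n' | awk '
BEGIN {	FS = OFS = ","
}
$1 == "D" {
	if($2 in a)
		next
	a[$2]
	printed++
}
$1 == "T" {
	$2 = printed
}
1'

produces:

H,header
D,1,x
T,1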

You should always tell us what operating system and shell you're using when you start a new thread in this forum. The behavior of many utilities varies from operating system to operating system and the features provided by shells vary from shell to shell.

If you want to try the above code on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.


Thanks, Don Cragun.
The result is exactly what I want.
Sorry, I didn't explain my request in enough detail. You are right: in fact, the whole line is identical whenever field #2 is identical.
My OS is Solaris/SunOS. I will include my OS info next time.
Thank you again.

I'm always glad to have helped. With the sample data you provided, the following would also work:

/usr/xpg4/bin/awk '
BEGIN {	FS = OFS = ","	# read and write comma-separated fields
}
$1 == "D" {
	if($0 in a)	# skip "D" lines identical to a line already seen
		next
	a[$0]		# remember this entire line
	printed++	# count unique "D" lines
}
$1 == "T" {
	$2 = printed	# overwrite the trailer count with the unique "D" count
}
1' file.CSV

Please use this code if you want to delete identical lines. Please use the code in post #2 if you want to delete lines with duplicate field #2 values. (In both cases, only lines whose field #1 is "D" are considered.)
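
The difference only shows up with data like the following made-up pair (not in your sample), where two "D" lines share field #2 but differ in field #4:

D,1695,20000101,0.480
D,1695,20000101,0.999
T,2, 9020, 330

The post #2 code keys on field #2, so it keeps only the first of those two lines and writes 1 into the trailer; the code above keys on the whole line, so it keeps both and writes 2.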

awk -F, '/^T/ {for(i in A) sum+=(A[i]-1); $2=$2-sum} !A[$0]++' file

Hi nezabudka,
Nice approach. My code counts the number of lines output and ignores the value originally found in the "T" line field #2; your code subtracts the number of duplicates found.

If there were input files with multiple "T" lines, mine would output all of them, each containing the number of unique "D" lines seen up to that point, while yours would only print the first one found. I assume that an input file will only contain one "T" line, so this difference shouldn't matter.

If there are lines other than "D" and "T" lines, my code will copy them to the output but not include them in the count written to the "T" line; your code will include every non-duplicated non-"D" line (other than the first "T" line itself) in its calculations. I have no idea whether or not the actual data to be processed might contain header lines that should not be counted in the "T" line output. If header lines are present and should be ignored in the "T" line count, that should have been mentioned in the requirements.

Note that your code replaces the commas in the "T" line output with <space>s because you didn't set OFS to a comma.
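
A minimal sketch of that fix, setting OFS in a BEGIN block so the rebuilt "T" line keeps its commas:

awk 'BEGIN {FS = OFS = ","} /^T/ {for(i in A) sum += A[i]-1; $2 -= sum} !A[$0]++' file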


Hi Don, thanks for the explanation.

awk 'BEGIN {FS=OFS=","} /^T/ {$2=length(A)} !A[$0]++'

Hi nezabudka,
Always glad to help.

This is another interesting way to do it. Unfortunately, the standards do not specify the behavior of the awk length built-in function when it is given an array name as an argument. This use is described on the GNU gawk man page, and it works in the BSD awk version 20070501 installed on macOS Mojave (version 10.14.3), although it is not documented in the BSD awk man page.

I have no idea whether or not this will work (as an undocumented feature) on green_k's Solaris system in /usr/xpg4/bin/awk or nawk. I also do not know if gawk is installed on green_k's system.
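
If length(array) turns out to be unsupported there, a sketch of a portable variant that maintains its own counter instead (assuming, as in the sample, a single "T" record at the end) would be:

/usr/xpg4/bin/awk '
BEGIN {	FS = OFS = ","
}
/^T/ {	$2 = n		# n = number of unique lines seen so far
}
!A[$0]++ {		# true only for the first occurrence of a line
	n++
	print
}' file.CSV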


On top of what Don Cragun said, the last approach would not account for "duplicate duplicates".

Illogic nonsense... please disregard.

Hi RudiC,
I'm not sure what you mean. I don't see any reason why the code shown in post #7 should fail as long as all of the following are true:

  1. There are only "D" and "T" records in the input file.
  2. There is only one "T" record in the input file.
  3. The "T" record is the last record in the input file.
  4. The awk being used returns the number of elements in the array when length(array_name) is called.

The first three are true in the sample data provided in this thread. The fourth is true with gawk starting with version 3.1.5 according to the Linux 2.6 gawk man page available in the UNIX and Linux Man Pages repository. By experiment, it also works on the awk version 20070501 provided with macOS Mojave version 10.14.3.
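
If in doubt, a quick way to check whether a given awk accepts an array argument to length() is a throwaway one-liner:

awk 'BEGIN {a[1]; a[2]; print length(a)}'

An awk that supports it prints 2; one that does not will typically die with a syntax or usage error instead.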

Unlike the code in post #5, this code is not subtracting the number of duplicates found, it is directly setting the number of unique elements found.

Am I missing something?


Hi Don Cragun, sorry for posting that nonsense. My logic seems to require some lubrication. I may need some sleep. Post withdrawn.

Hi RudiC,
I know the feeling. I'm just up this late because I checked to see what was going on here after resetting all of the clocks in the house. (Daylight Saving time kicked in here this morning when the clock should have hit 2am. I hate Daylight Saving time!)

Sleep tight.

  • Don