How to delete 'duplicated' column values and make a delimited file too?

newbie_01 · August 31, 2016, 2:42am

Hi,

I have the following output from an Oracle SQL statement and I want to remove duplicated column values.

I know it is possible using Oracle analytical/statistical functions but unfortunately I don't know how to use any of those.

So now, I've gone to PLAN B using awk/sed maybe or any other UNIX string tips/tricks.

The original output is as below:

     CHANGE REQUESTOR            START               END                 STATUS           SERVICE_NO GROUP                     RESOURCE_PERSON
----------- -------------------- ------------------- ------------------- --------------- ----------- ------------------------- --------------------
     153281 User AAA             2016-07-21 23:00:00 2016-07-22 01:00:00 Closed               466814 Support Number 1          Mars
     153282 User ABCDE           2016-07-28 10:00:00 2016-07-28 11:00:00 Closed               466875 Linux                     Martian 01
     153282 User ABCDE           2016-07-28 10:00:00 2016-07-28 11:00:00 Closed               466876 DBA                       Earthling 01
     153283 User BBB             2016-07-28 12:00:00 2016-07-28 15:00:00 Closed               467055 Storage                   Jupiter
     153286 User WXYZ            2016-07-28 18:00:00 2016-08-02 20:00:00 Closed               466877 DBA                       Earthling 02
     153286 User WXYZ            2016-07-28 18:00:00 2016-08-02 20:00:00 Closed               467105 Unix                      Martian 02
     153287 User ABCDEF          2016-08-01 10:00:00 2016-08-01 11:00:00 Closed               466923 Linux                     Martian 01
     153287 User ABCDEF          2016-08-01 10:00:00 2016-08-01 11:00:00 Closed               466924 DBA                       Earthling 01
     153288 User XXX123456       2016-08-12 10:00:00 2016-08-12 11:00:00 Closed               466812 Linux                     Martian 01
     153288 User XXX123456       2016-08-12 10:00:00 2016-08-12 11:00:00 Closed               466813 DBA                       Earthling 01
     153290 User XXXYYYZZZ       2016-08-15 18:30:00 2016-08-15 19:30:00 Closed               467098 Linux                     Martian 01
     153290 User XXXYYYZZZ       2016-08-15 18:30:00 2016-08-15 19:30:00 Closed               467099 DBA                       Earthling 01

Below is the desired output. There are instances where the first five columns are repeated values and where they are repeated values, I want to display those five column values on its first occurrence only.

Desired output below:

     CHANGE REQUESTOR            START               END                 STATUS           SERVICE_NO GROUP                     RESOURCE_PERSON
----------- -------------------- ------------------- ------------------- --------------- ----------- ------------------------- --------------------
     153281 User AAA             2016-07-21 23:00:00 2016-07-22 01:00:00 Closed               466814 Support Number 1          Mars
     153282 User ABCDE           2016-07-28 10:00:00 2016-07-28 11:00:00 Closed               466875 Linux                     Martian 01
                                                                                              466876 DBA                       Earthling 01
     153283 User BBB             2016-07-28 12:00:00 2016-07-28 15:00:00 Closed               467055 Storage                   Jupiter
     153286 User WXYZ            2016-07-28 18:00:00 2016-08-02 20:00:00 Closed               466877 DBA                       Earthling 02
                                                                                              467105 Unix                      Martian 02
     153287 User ABCDEF          2016-08-01 10:00:00 2016-08-01 11:00:00 Closed               466923 Linux                     Martian 01
                                                                                              466924 DBA                       Earthling 01
     153288 User XXX123456       2016-08-12 10:00:00 2016-08-12 11:00:00 Closed               466812 Linux                     Martian 01
                                                                                              466813 DBA                       Earthling 01
     153290 User XXXYYYZZZ       2016-08-15 18:30:00 2016-08-15 19:30:00 Closed               467098 Linux                     Martian 01
                                                                                              467099 DBA                       Earthling 01

A final thing that I am wanting to do if possible is to have the desired output to be a delimited file, i.e. pipe or comma delimited that I can open from a spreadsheet program. In this case, the repeated column values would have to be replaced by the delimiter character instead.

Any advice much appreciated. Thanks in advance.

itkamaraj · August 31, 2016, 2:52am

awk '{if(!a[$1 $2 $3 $4 $5]++){print}else{$1=$2=$3=$4=$5=$6=$7=$8="";print}}' filename

RavinderSingh13 · August 31, 2016, 3:11am

Hello newbie_01,

Could you please try following.

awk '{if(!A[$1,$2,$3,$4,$5,$6,$7,$8]++){print;next} else {print ""}}'  Input_file

Output will be as follows.

     CHANGE REQUESTOR            START               END                 STATUS           SERVICE_NO GROUP                     RESOURCE_PERSON
----------- -------------------- ------------------- ------------------- --------------- ----------- ------------------------- --------------------
     153281 User AAA             2016-07-21 23:00:00 2016-07-22 01:00:00 Closed               466814 Support Number 1          Mars
     153282 User ABCDE           2016-07-28 10:00:00 2016-07-28 11:00:00 Closed               466875 Linux                     Martian 01
      
     153283 User BBB             2016-07-28 12:00:00 2016-07-28 15:00:00 Closed               467055 Storage                   Jupiter
     153286 User WXYZ            2016-07-28 18:00:00 2016-08-02 20:00:00 Closed               466877 DBA                       Earthling 02
      
     153287 User ABCDEF          2016-08-01 10:00:00 2016-08-01 11:00:00 Closed               466923 Linux                     Martian 01
      
     153288 User XXX123456       2016-08-12 10:00:00 2016-08-12 11:00:00 Closed               466812 Linux                     Martian 01
      
     153290 User XXXYYYZZZ       2016-08-15 18:30:00 2016-08-15 19:30:00 Closed               467098 Linux                     Martian 01

As addition, in case you don't want to print the new line and want to print only unique values then following may help.

awk '!A[$1,$2,$3,$4,$5,$6,$7,$8]++'   Input_file

Thanks,
R. Singh

Don_Cragun · August 31, 2016, 4:11pm

Hi newbie_01,
Maybe something like this would come closer to what you want. The other suggestions don't seem to convert the output to CSV format and don't seem to maintain the fixed-width fields either.

#!/bin/ksh
awk '
# Function to extract fields from an input line into an array.
function ef(line,	i) {
	# Extract fileds based on column positions and strip leading and
	# trailing whitespace.
	for(i = 1; i <= nf; i++) {
		array = substr(line, P, L)
		gsub(/^[[:space:]]*|[[:space:]]*$/, "", array)
	}
}

# Function to print a line from an array replacing initial fields that
# duplicate data printed on the previous output line with an empty field.
function pl(	dup, i) {
	# Initialize...
	dup = 1	# Set clear duplicates flag.
	for(i = 1; i <= nf; i++) {
		if(dup && array == last)
			printf("%s", (i == nf) ? array ORS : OFS)
		else {	printf("%s%s", array, (i == nf) ? ORS : OFS)
			last = array
			dup = 0
		}
	}
}
BEGIN {	# Set output field separator before reading any input lines.
	OFS = ","
}
NR == 1 {
	# Save the heading on the 1st line of the file for later processing.
	hl = $0
	next
}
NR == 2 {
	# Calculate the number of fields, their starting positions, and their
	# lengths from the 2nd header line.
	P[1] = 1
	for(i = 1; i <= NF; i++)
		P[i + 1] = P + 1 + (L = length($i))
	nf = NF

	# Extract the headers from the saved header line.
	ef(hl)
	# Print header.
	pl()
	next
}
{	# Extract data from the current input line and print it.
	ef($0)
	pl()
}' file

This will skip printing any leading fields that duplicate data found on the previous line (except the last field on an input line will always be printed).

With your sample input file, this produces the output:

CHANGE,REQUESTOR,START,END,STATUS,SERVICE_NO,GROUP,RESOURCE_PERSON
153281,User AAA,2016-07-21 23:00:00,2016-07-22 01:00:00,Closed,466814,Support Number 1,Mars
153282,User ABCDE,2016-07-28 10:00:00,2016-07-28 11:00:00,Closed,466875,Linux,Martian 01
,,,,,466876,DBA,Earthling 01
153283,User BBB,2016-07-28 12:00:00,2016-07-28 15:00:00,Closed,467055,Storage,Jupiter
153286,User WXYZ,2016-07-28 18:00:00,2016-08-02 20:00:00,Closed,466877,DBA,Earthling 02
,,,,,467105,Unix,Martian 02
153287,User ABCDEF,2016-08-01 10:00:00,2016-08-01 11:00:00,Closed,466923,Linux,Martian 01
,,,,,466924,DBA,Earthling 01
153288,User XXX123456,2016-08-12 10:00:00,2016-08-12 11:00:00,Closed,466812,Linux,Martian 01
,,,,,466813,DBA,Earthling 01
153290,User XXXYYYZZZ,2016-08-15 18:30:00,2016-08-15 19:30:00,Closed,467098,Linux,Martian 01
,,,,,467099,DBA,Earthling 01

You haven't said what operating system you're using. If you want to run this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk . (Note that neither awk nor nawk on a Solaris system will not work with this script.)

Scrutinizer · August 31, 2016, 4:57pm

To preserve column width, you could try something like this:

$ awk '{p=sprintf("%.88s",$0); q=sprintf("%88s",x)} A[p]++{sub(p,q)}1' file

or better because not vulnerable to regex interpretation:

awk '{p=substr($0,1,88); q=substr($0,89); print (A[p]++?sprintf("%88s",x):p) q}' file

The column width can be parameterized:

awk -v w=88 '{p=sprintf("%." w "s",$0); q=sprintf("%" w "s",x)} A[p]++{sub(p,q)}1' file

or

awk -v w=88 '{p=substr($0,1,w); q=substr($0,w+1); print (A[p]++?sprintf("%" w "s",x):p) q}' file

respectively...

output

 
     CHANGE REQUESTOR            START               END                 STATUS           SERVICE_NO GROUP                     RESOURCE_PERSON
----------- -------------------- ------------------- ------------------- --------------- ----------- ------------------------- --------------------
     153281 User AAA             2016-07-21 23:00:00 2016-07-22 01:00:00 Closed               466814 Support Number 1          Mars
     153282 User ABCDE           2016-07-28 10:00:00 2016-07-28 11:00:00 Closed               466875 Linux                     Martian 01
                                                                                              466876 DBA                       Earthling 01
     153283 User BBB             2016-07-28 12:00:00 2016-07-28 15:00:00 Closed               467055 Storage                   Jupiter
     153286 User WXYZ            2016-07-28 18:00:00 2016-08-02 20:00:00 Closed               466877 DBA                       Earthling 02
                                                                                              467105 Unix                      Martian 02
     153287 User ABCDEF          2016-08-01 10:00:00 2016-08-01 11:00:00 Closed               466923 Linux                     Martian 01
                                                                                              466924 DBA                       Earthling 01
     153288 User XXX123456       2016-08-12 10:00:00 2016-08-12 11:00:00 Closed               466812 Linux                     Martian 01
                                                                                              466813 DBA                       Earthling 01
     153290 User XXXYYYZZZ       2016-08-15 18:30:00 2016-08-15 19:30:00 Closed               467098 Linux                     Martian 01
                                                                                              467099 DBA                       Earthling 01

RudiC · September 1, 2016, 3:23am

Brilliant!
To save a few CPU cycles, the assignment of a constant to q could be done in a BEGIN section. Wouldn't anchoring p to the begin-of-line reduce the regex vulnerability?

Please be aware that - should a page break repeat the header and underline line - these two would be pruned as well.