Format DATA

Input File

AU01NAS002,FCNVX133800117,AU01_Melbourne_Australia,ATT,Internal,NAS SILVER,12287.99,3293.98,6946.02
AU01NAS002,FCNVX133800117,AU01_Melbourne_Australia,ATT,Internal,NAS ARCHIVE,12287.99,3327.12,6912.87
II18NAS001,CK200110400822,II18_Mumb,MFFi(COD),Internal,clar_r5_performance,6667.88,2187.03,4254.13
II18NAS001,CK200110400822,II18_Mumb,MFFi(COD),Internal,clarata_archive,7337.19,4681.66,2655.54
II18NAS001,CK200110400822,II18_Mumb,MFFi(COD),Internal,clarata_sas,7557.19,4681.66,2600.54
 

Output File

 
AU01NAS002,FCNVX133800117,AU01_Melbourne_Australia,ATT,Internal,NAS SILVER,12287.99,3293.98,6946.02
,,,,,NAS ARCHIVE,12287.99,3327.12,6912.87
II18NAS001,CK200110400822,II18_Mumb,MFFi(COD),Internal,clar_r5_performance,6667.88,2187.03,4254.13
,,,,,clarata_archive,7337.19,4681.66,2655.54
,,,,,clarata_sas,7557.19,4681.66,2600.54

 

Please help!!
Basically, replace the common columns with just "," (the field separator) on every record except the first one in each group.

For example, in the data above, the first 5 columns are common to the first 2 records and to the next 3 records.

awk  'x[$1,$2,$3,$4,$5]++{$1=$2=$3=$4=$5=""}1' FS=, OFS=, infile
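A note on the trailing FS=, OFS=, operands: awk treats var=value operands as variable assignments that take effect just before the next file operand is read, so both separators are set before infile is opened. A quick illustration (the demo file name and its contents are made up):

cat demo
a,b,c
d,e,f
awk '{ $1 = "X" } 1' FS=, OFS=- demo
X-b-c
X-e-f

Reassigning a field forces awk to rebuild $0 with OFS between the fields, which is why the cleared fields in the solution above still leave their separators behind.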

This awk solution works if the patterns don't repeat further down the file. For example, given

AU01NAS002,FCNVX133800117,AU01_Melbourne_Australia,ATT,Internal,NAS SILVER,12287.99,3293.98,6946.02
AU01NAS002,FCNVX133800117,AU01_Melbourne_Australia,ATT,Internal,NAS ARCHIVE,12287.99,3327.12,6912.87
II18NAS001,CK200110400822,II18_Mumb,MFFi(COD),Internal,clar_r5_performance,6667.88,2187.03,4254.13
II18NAS001,CK200110400822,II18_Mumb,MFFi(COD),Internal,clarata_archive,7337.19,4681.66,2655.54
II18NAS001,CK200110400822,II18_Mumb,MFFi(COD),Internal,clarata_sas,7557.19,4681.66,2600.54
AU01NAS002,FCNVX133800117,AU01_Melbourne_Australia,ATT,Internal,NAS SILVER,12287.99,3293.98,6946.02
AU01NAS002,FCNVX133800117,AU01_Melbourne_Australia,ATT,Internal,NAS ARCHIVE,12287.99,3327.12,6912.87

it would yield

AU01NAS002,FCNVX133800117,AU01_Melbourne_Australia,ATT,Internal,NAS SILVER,12287.99,3293.98,6946.02
,,,,,NAS ARCHIVE,12287.99,3327.12,6912.87
II18NAS001,CK200110400822,II18_Mumb,MFFi(COD),Internal,clar_r5_performance,6667.88,2187.03,4254.13
,,,,,clarata_archive,7337.19,4681.66,2655.54
,,,,,clarata_sas,7557.19,4681.66,2600.54
,,,,,NAS SILVER,12287.99,3293.98,6946.02
,,,,,NAS ARCHIVE,12287.99,3327.12,6912.87

Try

awk     '!x[$1,$2,$3,$4,$5]     {delete x}
         x[$1,$2,$3,$4,$5]++    {$1=$2=$3=$4=$5=""}
         1
        ' FS=, OFS=, file

Thanks RudiC, you are right. But if all the fields are the same, the output should not print the record again at all. For example, for the above input the output should be

AU01NAS002,FCNVX133800117,AU01_Melbourne_Australia,ATT,Internal,NAS SILVER,12287.99,3293.98,6946.02
,,,,,NAS ARCHIVE,12287.99,3327.12,6912.87
II18NAS001,CK200110400822,II18_Mumb,MFFi(COD),Internal,clar_r5_performance,6667.88,2187.03,4254.13
,,,,,clarata_archive,7337.19,4681.66,2655.54
,,,,,clarata_sas,7557.19,4681.66,2600.54

but if the input is like this (one or more fields after $5 are different)

 
AU01NAS002,FCNVX133800117,AU01_Melbourne_Australia,ATT,Internal,NAS SILVER,12287.99,3293.98,6946.02
AU01NAS002,FCNVX133800117,AU01_Melbourne_Australia,ATT,Internal,NAS ARCHIVE,12287.99,3327.12,6912.87
II18NAS001,CK200110400822,II18_Mumb,MFFi(COD),Internal,clar_r5_performance,6667.88,2187.03,4254.13
II18NAS001,CK200110400822,II18_Mumb,MFFi(COD),Internal,clarata_archive,7337.19,4681.66,2655.54
II18NAS001,CK200110400822,II18_Mumb,MFFi(COD),Internal,clarata_sas,7557.19,4681.66,2600.54
AU01NAS002,FCNVX133800117,AU01_Melbourne_Australia,ATT,Internal,NAS BRONZE,12287.99,3293.98,6946.02
 

then the output should be

 
AU01NAS002,FCNVX133800117,AU01_Melbourne_Australia,ATT,Internal,NAS SILVER,12287.99,3293.98,6946.02
,,,,,NAS ARCHIVE,12287.99,3327.12,6912.87
,,,,,NAS BRONZE,12287.99,3293.98,6946.02
II18NAS001,CK200110400822,II18_Mumb,MFFi(COD),Internal,clar_r5_performance,6667.88,2187.03,4254.13
,,,,,clarata_archive,7337.19,4681.66,2655.54
,,,,,clarata_sas,7557.19,4681.66,2600.54
 

thanks!

Hello greycells,

Could you please try the following and let us know if it helps you.
Let's say we have an input file as follows.

cat testt1
AU01NAS002,FCNVX133800117,AU01_Melbourne_Australia,ATT,Internal,NAS SILVER,12287.99,3293.98,6946.02
AU01NAS002,FCNVX133800117,AU01_Melbourne_Australia,ATT,Internal,NAS ARCHIVE,12287.99,3327.12,6912.87
II18NAS001,CK200110400822,II18_Mumb,MFFi(COD),Internal,clar_r5_performance,6667.88,2187.03,4254.13
II18NAS001,CK200110400822,II18_Mumb,MFFi(COD),Internal,clarata_archive,7337.19,4681.66,2655.54
II18NAS001,CK200110400822,II18_Mumb,MFFi(COD),Internal,clarata_sas,7557.19,4681.66,2600.54
AU01NAS002,FCNVX133800117,AU01_Melbourne_Australia,ATT,Internal,NAS BRONZE,12287.99,3293.98,6946.02
sort -k1,1 testt1 | awk -F, '!X[$1,$2,$3,$4,$5] {delete X} X[$1,$2,$3,$4,$5]++ {$1=$2=$3=$4=$5=""} 1' OFS=,

Output will be as follows.

AU01NAS002,FCNVX133800117,AU01_Melbourne_Australia,ATT,Internal,NAS ARCHIVE,12287.99,3327.12,6912.87
,,,,,NAS BRONZE,12287.99,3293.98,6946.02
,,,,,NAS SILVER,12287.99,3293.98,6946.02
II18NAS001,CK200110400822,II18_Mumb,MFFi(COD),Internal,clar_r5_performance,6667.88,2187.03,4254.13
,,,,,clarata_archive,7337.19,4681.66,2655.54
,,,,,clarata_sas,7557.19,4681.66,2600.54

EDIT: Just to add a point: I have used the sort utility in the solution, which sorts the file's contents by the first column before the grouping is applied. Kindly let us know if you have any other requirements or queries about this.

Thanks,
R. Singh


I'm trying to understand how awk works with arrays and am keen to learn. Can anyone explain how this solution works? Reference material would also be helpful.

sort -k1,1 testt1 | awk -F, '!X[$1,$2,$3,$4,$5] {delete X} X[$1,$2,$3,$4,$5]++ {$1=$2=$3=$4=$5=""} 1' OFS=,

Thanks in advance

A great way to learn about any utility is to read the manual page for that utility. In this case that could be done by looking at the output from the commands:

man awk

and:

man sort

The code Ravinder suggested (reformatted with comments added) is:

sort -k1,1 file |			# Sort file with the 1st field as the
					# primary sort key using sequences of
					# blanks as the field separators.
awk -F, ' 				# Use awk to process the sorted data
					# with comma as the input field
					# separator.
!X[$1,$2,$3,$4,$5] {delete X}		# If the element of array X with index
					# set to the 1st 5 fields on the line
					# separated by the contents of the
					# SUBSEP variable has the value zero,
					# delete all elements from the array X.
					# Since array elements are initialized
					# to zero if no value has been stored,
					# this will happen on the 1st line of a
					# set of lines with the same strings in
					# the 1st five fields on the line.
X[$1,$2,$3,$4,$5]++ {$1=$2=$3=$4=$5=""}	# Increment the value of the element of
					# X corresponding to this line.  If the
					# element of X corresponding to this
					# line had a value greater than zero
					# before it was incremented, set the
					# first five fields to the empty string.
1' OFS=, 				# Print the (possibly updated) line.
					# Set the output field separator to a
					# comma.
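To make the comma subscripts above concrete: an index such as X[$1,$2,$3,$4,$5] is stored under a single string key built by joining the values with the contents of SUBSEP (by default "\034", a character unlikely to occur in data). A minimal sketch with made-up subscripts:

awk 'BEGIN {
	x["a", "b"] = 1                      # the key is really "a" SUBSEP "b"
	if (("a", "b") in x) print "found"   # a parenthesized list tests the same key
	n = split("a" SUBSEP "b", part, SUBSEP)
	print n, part[1], part[2]            # prints: 2 a b
	print !x["no", "such"]               # unset elements compare as zero, so this prints 1
}'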

Note that the sort utility uses a default field separator of any combination of blanks (i.e., spaces and tabs), while the input uses a comma as its field separator. And, since there are five fields used to determine which lines are to be grouped, all five of those fields should be included in the primary sort key. That would be:

sort -t, -k1,5 file

But, since the primary key is the first five fields on the line and the variable-length numeric fields are not part of the key, specifying a field separator and sort key is redundant; the default behavior of sort provides the desired order.

Note that delete X is not required by the standards, but is available in some versions of awk. Note also that the statement:

!X[$1,$2,$3,$4,$5] {delete X}

could be removed and still get the same output. But, doing so will cause the amount of memory used by awk to increase slightly for each new group of lines. If there are millions of groups in the input file being processed, this could significantly slow down processing. If there are a few hundred groups, the difference might not be noticed at all.
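If your awk rejects delete X without a subscript, the classic portable workaround is to clear the array one element at a time; a sketch of the same program with that one substitution:

awk     '!x[$1,$2,$3,$4,$5]     {for (i in x) delete x[i]}
         x[$1,$2,$3,$4,$5]++    {$1=$2=$3=$4=$5=""}
         1
        ' FS=, OFS=, file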

That said, I don't see the need for arrays here at all. If you're going to destroy the entire array every time you create a new array element, creating and destroying the array is just overhead. I would simplify the code to:

sort file | awk -F, '
{	if(p == $1 FS $2 FS $3 FS $4 FS $5)
		$1 = $2 = $3 = $4 = $5 = ""
	else	p = $1 FS $2 FS $3 FS $4 FS $5
}
1' OFS=,

which produces exactly the same output (unless your implementation of awk gives you a syntax error for delete array_name) and doesn't depend on non-standard awk features.
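And if completely duplicated records should be dropped outright, as requested earlier in the thread, the same approach extends naturally. A sketch, relying on sort placing identical lines next to each other:

sort file | awk -F, '
{	if ($0 == prev)
		next			# the whole record repeats: print nothing
	prev = $0			# remember the line before any fields are cleared
	if(p == $1 FS $2 FS $3 FS $4 FS $5)
		$1 = $2 = $3 = $4 = $5 = ""
	else	p = $1 FS $2 FS $3 FS $4 FS $5
}
1' OFS=,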
