Script to code every 2 consecutive entries as a single entry

ks_reddy · April 8, 2015, 10:51pm

All,

I came across the requirement below, and searching the previous posts did not turn up any matches.
I have a CSV file whose first column looks like the sample below, and I need to add an additional column based on the string count in the first column.

Given column, Required column, Other columns
A, 1, ...leave all other columns in the file as it is ....
A, 1, ...leave all other columns in the file as it is ....
A, 1, ...leave all other columns in the file as it is ....
B, 1, ...leave all other columns in the file as it is ....
B, 1, ...leave all other columns in the file as it is ....
C, 2, ...leave all other columns in the file as it is ....
C, 2, ...leave all other columns in the file as it is ....
D, 2, ...leave all other columns in the file as it is ....
E, 3, ...leave all other columns in the file as it is ....
E, 3, ...leave all other columns in the file as it is ....
F, 3, ...leave all other columns in the file as it is ....
and so on

Basically, every 2 consecutive keys in column 1 should be coded as one single key. Also, the number of rows for each key in column 1 will vary (as the counts of A, B, C and D show here).
My actual file has thousands of rows.

Thanks in advance..
Sidda

Don_Cragun · April 8, 2015, 11:26pm

Your explanation of what you are trying to do leaves me completely baffled. You seem to want to combine the rows with keys A and B to produce some unspecified output; combine the rows with keys C and D to produce some unspecified output; combine the rows with keys E and F to produce some unspecified output; ...

What is a "string count in first column"?

Please give us a CLEAR description of what you are trying to do and show us the output that is supposed to be produced from your small sample input file.

ks_reddy · April 9, 2015, 12:57am

Hi Don,
Here is my requirement, stated more clearly.
Column 1 is my given data. Each distinct string can appear in one or more rows, and all rows for a given string are consecutive.
For example, 'A' appears in 3 rows and nowhere else later in the file. Similarly, 'B' appears 2 times, but 'D' appears only once, etc.
As for the Required Index column: every 2 consecutive keys in the first column should share a single index, and that index should auto-increment (1, 2, 3, and so on), as shown in the example below.

Given column, Required Index 
A, 1
A, 1
A, 1
B, 1
B, 1
C, 2
C, 2
D, 2
E, 3
E, 3
F, 3
etc. (thousands of rows)

I hope my requirement is clear this time.

Regards
Sidda

Don_Cragun · April 9, 2015, 1:33am

Perhaps something like:

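awk '
BEGIN {	FS = OFS = "," }
# On each new key in field 1, recompute the index: keys 1-2 get 1, keys 3-4 get 2, ...
$1 != last { last = $1; x = int(1 + c++ / 2) }
# Overwrite field 2 with the index; the nonzero result also triggers the default print.
$2 = x
' file.csv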

which with your original sample input in a file named file.csv produces the output:

A,1, ...leave all other columns in the file as it is ....
A,1, ...leave all other columns in the file as it is ....
A,1, ...leave all other columns in the file as it is ....
B,1, ...leave all other columns in the file as it is ....
B,1, ...leave all other columns in the file as it is ....
C,2, ...leave all other columns in the file as it is ....
C,2, ...leave all other columns in the file as it is ....
D,2, ...leave all other columns in the file as it is ....
E,3, ...leave all other columns in the file as it is ....
E,3, ...leave all other columns in the file as it is ....
F,3, ...leave all other columns in the file as it is ....

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk.

ks_reddy · April 9, 2015, 2:00am

Thanks Don.
Your script works very well, but it doesn't print the remaining columns as they are. I am trying to modify that. It would be great if you have a smarter solution that prints $0 as is, in addition to columns 1 (Given column) and 2 (generated Index column).

Regards
Sidda

Don_Cragun · April 9, 2015, 2:34am

I don't get it.

In post #1 in this thread you showed us a sample input file. In post #3, you showed us the first two columns of the output you wanted. All of the input fields were preserved from the sample input you provided in post #1, with the contents of the 2nd field replaced by the numbers you said you wanted in that field in the output.

How is that different from the output produced by my script???

RudiC · April 9, 2015, 7:30am

Would this modification of Don Cragun's proposal do what you need?

awk '
BEGIN           {FS = OFS = "," }
$1 != last      { last = $1; x = int(1 + c++ / 2) }
$2 = x OFS $2
' file.csv
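
A minimal sketch of the difference between overwriting field 2 and the `$2 = x OFS $2` form used above, on a made-up one-line input (the values A, 10 and 20 and the fixed index 1 are assumptions for illustration only):

```shell
# Hypothetical input line: a key followed by two data columns.
line='A,10,20'

# Overwrite: the old contents of field 2 are lost.
overwritten=$(printf '%s\n' "$line" |
    awk 'BEGIN { FS = OFS = "," } { $2 = 1; print }')

# Insert: the index is glued in front of the old field 2, so nothing is lost.
inserted=$(printf '%s\n' "$line" |
    awk 'BEGIN { FS = OFS = "," } { $2 = 1 OFS $2; print }')

echo "$overwritten"   # A,1,20
echo "$inserted"      # A,1,10,20
```

Note that awk does not re-split the record after the assignment, so inside the same program the old third column is still $3; the extra comma only becomes a field boundary when the line is read again.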

ks_reddy · April 12, 2015, 6:29pm

Thanks RudiC.
Your script is almost okay.
Here is my final script.

awk '
BEGIN           {FS = OFS = "," }
$1 != last      { last = $1; x = int((1 + c++) / 2) }
$2 = x OFS $2
' Input.csv | awk -F, -v row=1 -v col=2 'FNR==row{$col="Index"}1' OFS=,

Input Data

GivenColumn,C2,C3
1,2,3
1,4,5
2,3,2
3,4,5
6,7,8

Output Data

GivenColumn,Index(Generated Column),C2,C3
1,1,2,3
1,1,4,5
2,1,3,2
3,2,4,5
6,2,7,8

Thanks once again Rudi and Don.
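
For readers following along, here is a sketch of what the first awk stage of the pipeline above emits on the posted input before the second stage relabels the header. The temporary file and the variable name stage1 are assumptions for illustration.

```shell
# Posted sample input, written to a temporary file.
input=$(mktemp)
cat > "$input" <<'EOF'
GivenColumn,C2,C3
1,2,3
1,4,5
2,3,2
3,4,5
6,7,8
EOF

# First stage only: the header line counts as the first key change.
stage1=$(awk '
BEGIN           { FS = OFS = "," }
$1 != last      { last = $1; x = int((1 + c++) / 2) }
$2 = x OFS $2
' "$input")
rm -f "$input"

echo "$stage1"
# GivenColumn,0,C2,C3
# 1,1,2,3
# 1,1,4,5
# 2,1,3,2
# 3,2,4,5
# 6,2,7,8
```

With int((1 + c++) / 2) the header gets 0 and the data keys then pair up as 1, 1, 2, 2, which is why this formula only gives the wanted numbering when a header line is present.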

RudiC · April 13, 2015, 9:52am

That could be done in one awk, but some inconsistencies need to be removed first. Try:

awk -v row=1 -v col=2 '
BEGIN           {FS = OFS = "," }
$1 != last      { last = $1; x = int((++c) / 2) }
FNR==row        {x="Index"}
$col = x OFS $col
' file

Output:

GivenColumn,Index,C2,C3
1,1,2,3
1,1,4,5
2,1,3,2
3,2,4,5
6,2,7,8