A cleaner way to rearrange columns

Hello,

I have some tab-delimited text data,

index   name    chg_p   chg_m
1       name,1  1       0
2       name,2  1       1
3       name,3  1       0
4       name,4  1       0
5       name,5  1       1

I need to duplicate the "index" column, call it "id" and insert it after the "name" column. Then I need to print all of the remaining columns. For the example above it would look like,

index   name    id      chg_p   chg_m
1       name,1  1       1       0
2       name,2  2       1       1
3       name,3  3       1       0
4       name,4  4       1       0
5       name,5  5       1       1

I have run something in awk like,

awk -F'\t' 'BEGIN{OFS="\t"} {print $1, $2, $1, $3, $4}' input > output

This appears to work, but there are some issues. First, I end up with two columns named "index", so I have to add an additional command to find the second one and rename it to "id". Second, I don't see a clever way to print from column 3 to the end the way you would with cut -f3-. The only thing I have gotten to work is to hard-code columns $3-$50 (there are that many columns in this particular file).

Is there a way to do this with fewer steps and more elegant code? What I have now is a bit kludgy, even for me.

Thanks,

LMHmedchem

Hello LMHmedchem,

Could you please try the following and let us know if it helps.

awk 'NR==1{$2=$2 "\t" "id";print;next} {$2=$2 OFS $1} 1' OFS="\t"  Input_file

Output will be as follows.

index   name    id      chg_p   chg_m
1       name,1  1       1       0
2       name,2  2       1       1
3       name,3  3       1       0
4       name,4  4       1       0
5       name,5  5       1       1
 

Thanks,
R. Singh


In addition to what RavinderSingh13 suggested, you could also try:

awk 'BEGIN { FS = OFS = "\t" } { $2 = $2 OFS (NR > 1 ? $1 : "id") } 1' input > output

As always, if you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.
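On the "cut -f3-" part of the question: the usual awk idiom for "from column 3 to the end" is a loop over NF, so nothing has to be hard-coded even with 50 columns. A minimal sketch on the sample data (the inline printf input is just for demonstration):

```shell
# Duplicate $1 after $2 (header gets "id"), then print fields 3..NF in a loop
printf 'index\tname\tchg_p\tchg_m\n1\tname,1\t1\t0\n' |
awk 'BEGIN { FS = OFS = "\t" }
{
    printf "%s%s%s%s%s", $1, OFS, $2, OFS, (NR > 1 ? $1 : "id")
    for (i = 3; i <= NF; i++) printf "%s%s", OFS, $i
    print ""                     # terminate the output record
}'
```

This prints the same result as the one-liners above; the loop just makes the "rest of the columns" explicit, the way cut -f3- would.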


Hi.

I'll skip the usual posting of the entire demonstration script.

Input:
a  b  c  d  e  f  g  h  i
1,1 1,2 1,3 1,4 1,5 1,6 1,7 1,8 1,9
2,1 2,2 2,3 2,4 2,5 2,6 2,7 2,8 2,9
3,1 3,2 3,3 3,4 3,5 3,6 3,7 3,8 3,9

-----
 Results for sequence "recut 1,2,1,4,6-":
a       b       a       d       f       g       h       i
1,1     1,2     1,1     1,4     1,6     1,7     1,8     1,9
2,1     2,2     2,1     2,4     2,6     2,7     2,8     2,9
3,1     3,2     3,1     3,4     3,6     3,7     3,8     3,9

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-4-amd64, x86_64
Distribution        : Debian 8.6 (jessie) 
bash GNU bash 4.3.30
recut - ( local: RepRev 1.1, ~/bin/recut, 2010-06-10 )
csvtool - ( /usr/bin/csvtool, 2014-08-06 )

Missing textutils:
http://www1.cuni.cz/~obo/textutils/ verified (2016.08)

The recut command allows dash-notation, replication of fields, etc. The details for recut are:

recut   Process fields like cut, allow repetitions and re-ordering. (what)
Path    : ~/bin/recut
Version : - ( local: RepRev 1.1, ~/bin/recut, 2010-06-10 )
Length  : 56 lines
Type    : Perl script, ASCII text executable
Shebang : #!/usr/bin/perl
Modules : (for perl codes)
 Getopt::Long   2.42

See the webpage noted above for 80 or so similar commands.

The command csvtool can also do this:

 Results for sequence "csvtool -t " " -u " " col 1,2,1,4-5,6-":
a b a d e f g h i
1,1 1,2 1,1 1,4 1,5 1,6 1,7 1,8 1,9
2,1 2,2 2,1 2,4 2,5 2,6 2,7 2,8 2,9
3,1 3,2 3,1 3,4 3,5 3,6 3,7 3,8 3,9

Best wishes ... cheers, drl


Thank you for the suggestions.

I have tried the code posted by both RavinderSingh13 and Don Cragun and both work with the examples I have tested on openSuse 13.2 x86_64.

There doesn't seem to be any difference in performance.

I haven't been able to try the csvtool and recut suggestions posted by drl because I don't have either tool installed. Both of these tools appear to have an elegant and intuitive syntax for this kind of thing. It looks like csvtool is available in the default repositories but recut is not. I generally lean towards awk and sed because they are always available. It looks like you need all of texlive to get csvtool, which is too bad because texlive is about 1500 packages.

---------- Post updated at 01:30 PM ---------- Previous update was at 12:23 PM ----------

It looks like I also need to add another column to the end of the file. The column would be named "target" and would have the same value for every row. The row values would be floating point numbers like 0.0 or 1.0.

For example, add a column "target" to the end

index   name    id      chg_p   chg_m
1       name,1  1       1       0
2       name,2  2       1       1
3       name,3  3       1       0
4       name,4  4       1       0
5       name,5  5       1       1

To look like this

index   name    id      chg_p   chg_m   target
1       name,1  1       1       0       1.0
2       name,2  2       1       1       1.0
3       name,3  3       1       0       1.0
4       name,4  4       1       0       1.0
5       name,5  5       1       1       1.0

The only way I can think of to do this is to generate a second file with the new column and then paste the files together.

# get the number of lines in the file that the column will be added to
lines_in_file=$(wc -l < "$temp_output")

# add header to new file
echo "target" > temp_output2

# add a dummy target value for each data row (the header accounts for one line)
for (( c=1; c<lines_in_file; c++ ))
do
   echo "1.0" >> temp_output2
done

# combine the files
paste "$temp_output" temp_output2 > output_file

This more or less works, but are there any suggestions for a better way?

LMHmedchem

I presume you are using awk anyhow, as proposed by RavinderSingh13 and Don Cragun? How about adapting their code, like

awk 'BEGIN { FS = OFS = "\t" } { $2 = $2 OFS (!(NR-1)?"id":$1); $(NF+1) = !(NR-1)?"target":1.3 } 1'  file
index	name	id	chg_p	chg_m	target
1	name,1	1	1	0	1.3
2	name,2	2	1	1	1.3
3	name,3	3	1	0	1.3
4	name,4	4	1	0	1.3
5	name,5	5	1	1	1.3
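If the constant for the last column changes between runs, it can be passed in with awk's -v option instead of editing the script; a sketch along the same lines (the variable name tgt is my own, and file stands in for the input):

```shell
# Same insert/append idea, with the constant column value supplied via -v
awk -v tgt=1.0 'BEGIN { FS = OFS = "\t" }
{ $2 = $2 OFS (NR == 1 ? "id" : $1); $(NF+1) = (NR == 1 ? "target" : tgt) } 1' file
```

Assigning to $(NF+1) appends a new last field, and rebuilding the record inserts OFS between all fields automatically.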

Hi.

Here are additional results with tools available from repositories or from the net:
Code snippet:

pl " Results for sequence \"arrange -f 1 2 1 4-5 7\":"
arrange -f '1 2 1 4-5 7' $FILE |
align
pe " (\"arrange\" is a work in progress)"

pl " Results for sequence \"csvtool -t \" \" -u \" \" col 1,2,1,4-5,6-\":"
csvtool -t " " -u " " col 1,2,1,4-5,6- $FILE

pl " Results for spit - create fields of data:"
spit -r 3 -v "2.71828"

pl " Results for \"csvfix -f 1,2,1,4-5\":"
csvfix read_dsv -s " " -f 1,2,1,4:5 $FILE |
tee f1 |
csvfix put -v "1.0" |
tee f2 |
csvfix write_dsv -s " " |
tee f3

pl " Results for add_field, part of CRUSH toolset:"
add_field -l "target" -v "3.14" -A "i" -d " " $FILE

pl " Results for add_field, part of CRUSH toolset:"
add_field -l "target" -v "3.14" -f 100 -d " " $FILE

producing:
-----

 Results for sequence "arrange -f 1 2 1 4-5 7":
a   b   a   d   e   g
1,1 1,2 1,1 1,4 1,5 1,7
2,1 2,2 2,1 2,4 2,5 2,7
3,1 3,2 3,1 3,4 3,5 3,7
 ("arrange" is a work in progress)

-----
 Results for sequence "csvtool -t " " -u " " col 1,2,1,4-5,6-":
a b a d e f g h i
1,1 1,2 1,1 1,4 1,5 1,6 1,7 1,8 1,9
2,1 2,2 2,1 2,4 2,5 2,6 2,7 2,8 2,9
3,1 3,2 3,1 3,4 3,5 3,6 3,7 3,8 3,9

-----
 Results for spit - create fields of data:
2.71828 
2.71828 
2.71828 

-----
 Results for "csvfix -f 1,2,1,4-5":
a b a d e 1.0
1,1 1,2 1,1 1,4 1,5 1.0
2,1 2,2 2,1 2,4 2,5 1.0
3,1 3,2 3,1 3,4 3,5 1.0

-----
 Results for add_field, part of CRUSH toolset:
a b c d e f g h i target
1,1 1,2 1,3 1,4 1,5 1,6 1,7 1,8 1,9 3.14
2,1 2,2 2,3 2,4 2,5 2,6 2,7 2,8 2,9 3.14
3,1 3,2 3,3 3,4 3,5 3,6 3,7 3,8 3,9 3.14

-----
 Results for add_field, part of CRUSH toolset:
a b c d e f g h i target
1,1 1,2 1,3 1,4 1,5 1,6 1,7 1,8 1,9 3.14
2,1 2,2 2,3 2,4 2,5 2,6 2,7 2,8 2,9 3.14
3,1 3,2 3,3 3,4 3,5 3,6 3,7 3,8 3,9 3.14

recut   Process fields like cut, allow repetitions and re-ordering. (what)
Path    : ~/bin/recut
Version : - ( local: RepRev 1.1, ~/bin/recut, 2010-06-10 )
Length  : 56 lines
Type    : Perl script, ASCII text executable
Shebang : #!/usr/bin/perl
Home    : http://www1.cuni.cz/~obo/textutils/
Modules : (for perl codes)
 Getopt::Long   2.42

arrange Arrange fields, like cut, but in user-specified order. (what)
Path    : ~/bin/arrange
Version : 1.15
Length  : 355 lines
Type    : Perl script, ASCII text executable
Shebang : #!/usr/bin/perl
Modules : (for perl codes)
 warnings       1.23
 strict 1.08
 Carp   1.3301
 Getopt::Euclid 0.4.5

spit    Generate, create structured data sequence, -r rows of -v values in -c columns. (what)
Path    : ~/bin/spit
Version : - ( local: RepRev 1.10, ~/bin/spit, 2015-08-13 )
Length  : 80 lines
Type    : awk or perl script, ASCII text
Shebang : #!/usr/bin/env bash

csvtool tool for performing manipulations on CSV files from sh... (man)
Path    : /usr/bin/csvtool
Version : - ( /usr/bin/csvtool, 2014-08-06 )
Type    : ELF 64-bit LSB executable, x86-64, version 1 (SYSV ...)
Help    : probably available with --help
Home    : https://github.com/Chris00/ocaml-csv

csvfix  Manipulate csv files, file:///home/dennisl/src/csvfix/csvfix16/csvfix.html?Introduction.html (local html) (doc)
Path    : ~/executable/csvfix
Version : - ( local: ~/executable/csvfix, 2014-05-17 )
Type    : ELF 64-bit LSB executable, x86-64, version 1 (SYSV ...)
Home    : http://neilb.bitbucket.org/csvfix/

add_field       adds a field to a string-delimited log file. (man)
Path    : /usr/local/bin/add_field
Version : - ( local: /usr/local/bin/add_field, 2016-11-28 )
Length  : 273 lines
Type    : a /usr/bin/perl -w script, ASCII text executable
Shebang : #!/usr/bin/perl -w
Help    : probably available with [     ]-h,--help
Home    : https://github.com/google/crush-tools
Modules : (for perl codes)
 strict 1.08
 Getopt::Long   2.42
 Carp   1.3301

Of interest may be csvfix which allows a sub-command put to create a new field.

The local spit creates fields of values (or location in a matrix).

The CRUSH toolset includes add_field with a name and value.

The details above show the homes for the toolsets.

Best wishes ... cheers, drl


Yes, I was able to modify the above code to insert another column in the second position with the header "group" and the value "V".

awk 'BEGIN { FS = OFS = "\t" } { $1 = $1 OFS (!(NR-1)?"group":"V") } 1' temp_output1 > temp_output2

Though I was able to modify the script to get what I needed, I still don't fully understand the syntax.

As far as I can tell,
{ FS = OFS = "\t" } means that the input and output field separators are both tab.

{ $1 = $1 OFS (!(NR-1)?"group":"V") } appears to concatenate field 1, the output separator, and "group" for the first record (or "V" if it is not the first record), and assign all of that back to field 1. This effectively makes $1 two fields wide: the original column 1 plus the new column. I tried to look up !(NR-1)?, but since Google now seems to think it knows better than you what a search should return, searching on "!(NR-1)?" in quotes returns no results that actually contain that string. I really miss the "+" in Google search.

I don't see where the rest of the fields are printed out, unless the trailing 1 does that.

Do I have this correct, more or less?

LMHmedchem

Yes, awk sees the (not necessarily trailing) 1 as a "pattern" (a logical expression) that is always TRUE, and with the "action" missing it performs the default action, which is "print".

NR is an awk internal variable counting input records starting from 1, so (NR-1) starts from 0, which is equivalent to FALSE. So !(NR-1) is TRUE only for the first input line. The question mark (?) that follows is part of the "conditional operator" (condition ? value1 : value2), which yields one of two values depending on a boolean test.
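Both pieces are easy to see in isolation with throwaway input:

```shell
# "1" is a pattern that is always true; the missing action defaults to "print"
printf 'a\nb\n' | awk '1'
# prints:
# a
# b

# !(NR-1) is true only for the first record; ?: then picks one of two strings
printf 'a\nb\n' | awk '{ print (!(NR-1) ? "first" : "later") }'
# prints:
# first
# later
```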