Data manipulation using shell

littlewenwen · January 12, 2013, 11:58pm

Dear all

I have a dataset (in text format, tab-delimited) which has 100 variables (say, var0-var99) and more than 100,000 observations. I want to do the following:

For variables var0-var49, I want to add "00" in front of each value (for example, "1" would become "001").
For variables var50-var99, I want to add an underscore _ in front of each value (for example, "1" would become "_1").

How should I write the script?

Don_Cragun · January 13, 2013, 12:23am

Please give us a concrete example of your input file format. (Use CODE tags.)

littlewenwen · January 13, 2013, 12:32pm

Thanks.

The raw data is like:

Var0 Var1 Var2 ... Var50 Var51 ... Var99
1 22 53 ... 3 76 ... 82
.
.
.
.
22 78 65 ... 89 7 ... 12

and I hope that, after running the code, the data will look like:

Var0 Var1 Var2 ... Var50 Var51 ... Var99
001 0022 0053 ... _3 _76 ... _82
.
.
.
.
0022 0078 0065 ... _89 _7 ... _12

elixir_sinari · January 13, 2013, 1:25pm

Assuming that your actual data file has no headers:

awk -F'\t' '{for(i=1;i<=50 && i<=NF;i++) $i="00"$i;for(;i<=NF;i++) $i="_"$i}1' OFS='\t' file
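If the file does carry a header row, as in the sample posted above, a minimal sketch of the same idea that passes the first line through unchanged might look like this. The function name add_prefixes and its first argument (how many leading columns get "00") are illustrative assumptions, not part of the original answer.

```shell
# Hypothetical helper: prefix the first N tab-separated fields with "00"
# and all remaining fields with "_", leaving the header (line 1) untouched.
add_prefixes() {
    # $1 = number of leading fields that get "00" (50 in this thread)
    # $2 = input file
    awk -F'\t' -v n="$1" '
        NR == 1 { print; next }                 # header row: print as is
        {
            for (i = 1; i <= NF; i++)
                $i = (i <= n ? "00" : "_") $i   # pick the prefix by column
            print
        }
    ' OFS='\t' "$2"
}
```

For the data described in the question this would be called as add_prefixes 50 file. Passing the split point in with -v keeps the 50 out of the awk program, which makes it easy to try on a small sample first.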

Yoda · January 13, 2013, 4:32pm

awk -F'\t' '{ for(i=1;i<=NF;i++) (i<=50)?$i="00"$i:$i="_"$i; }1' OFS='\t' file

RudiC · January 13, 2013, 6:09pm

This is certainly not as elegant as I wanted it to be, nor as elegant as the proposals above:

$ sed 's/\t\|^/&_/g; s/_/X/51; h; s/X.*$//; s/_/00/g; G; s/\n.*X/_/' file

I was erroneously thinking the s///NUMBER flag would allow for ranges like 1-50, but it doesn't, does it? So the entire thing ended up clumsy...

littlewenwen · January 14, 2013, 7:51pm

Sorry for the late reply and thank you all for great help.

Scrutinizer · January 15, 2013, 1:17am

@RudiC, it looks like you are using GNU sed, since \| and \t are extensions. GNU sed would allow something like this:

sed 's/\b[0-9]/_&/51g; s//00&/g' file
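To see what the combined number-plus-g flag does, here is a small sketch on a made-up five-column line, with 3 standing in for 51. It assumes GNU sed, since \b and the combined Ng flag are GNU extensions; the function name prefix_demo and the sample values are only for illustration.

```shell
# Assumes GNU sed. With "3g" the first s command leaves matches 1-2 alone
# and prepends "_" from the 3rd match onward. The empty regex in the second
# s command reuses \b[0-9]; there is no word boundary between "_" and a
# digit, so only the first two fields still match and receive "00".
prefix_demo() {
    printf '1\t22\t53\t3\t76\n' | sed 's/\b[0-9]/_&/3g; s//00&/g'
}
prefix_demo
```

On the sample line this prints 001, 0022, _53, _3 and _76, separated by tabs. The approach relies on every field being a plain unsigned integer: a value such as 1.5 contains a second word boundary and would be counted as two matches.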

RudiC · January 15, 2013, 8:13am

That's what I had in mind - didn't know you can combine flags to the s command. It's certainly an advantage if you can read: info sed :