Data manipulation using shell

Dear all

I have a dataset (in text format, tab-delimited) which has 100 variables (say, var0-var99) and more than 100,000 observations. I want to do the following:

  1. for variables var0-var49, I want to add "00" in front of each value (for example, "1" would become "001")

  2. for variables var50-var99, I want to add an underscore _ in front of each value (for example, "1" would become "_1")

How should I write the script?

Please give us a concrete example of your input file format. (Use CODE tags.)

Thanks.

The raw data looks like this:

Var0 Var1 Var2 ... Var50 Var51 ... Var99
1 22 53 ... 3 76 ... 82
.
.
.
.
22 78 65 ... 89 7 ... 12

and I hope that, after running the code, the data will look like:

Var0 Var1 Var2 ... Var50 Var51 ... Var99
001 0022 0053 ... _3 _76 ... _82
.
.
.
.
0022 0078 0065 ... _89 _7 ... _12

Assuming that your actual data file has no headers:

awk -F'\t' '{for(i=1;i<=50 && i<=NF;i++) $i="00"$i;for(;i<=NF;i++) $i="_"$i}1' OFS='\t' file
awk -F'\t' '{ for(i=1;i<=NF;i++) (i<=50)?$i="00"$i:$i="_"$i; }1' OFS='\t' file
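
If the real file does have a header row like the sample above, a variant along these lines might help. It is only a sketch: the function name prefix_fields and its optional second argument (the number of leading columns that get "00", default 50) are made up for illustration, and it assumes the header is exactly the first line.

```shell
# Hypothetical helper, not from the posts above.
# $1 = tab-delimited file, $2 = how many leading columns get "00" (default 50).
prefix_fields() {
    awk -v n="${2:-50}" '
        BEGIN { FS = OFS = "\t" }
        NR == 1 { print; next }                # copy the header line unchanged
        {
            for (i = 1; i <= NF; i++)
                $i = (i <= n ? "00" : "_") $i  # "00" for columns 1..n, "_" after
            print
        }' "$1"
}
```

Something like prefix_fields file > file.new would then write the converted data to a new file and leave the original alone.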

This is certainly not as elegant as I wanted it to be, nor as elegant as the proposals above:

$ sed 's/\t\|^/&_/g; s/_/X/51; h; s/X.*$//; s/_/00/g; G; s/\n.*X/_/' file

I was erroneously thinking the s///NUMBER flag would allow for ranges like 1-50, but it doesn't, does it? So the entire thing ended up clumsy...
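
For anyone trying to follow the one-liner, here is the same sequence spelled out step by step on a four-field line, with the split point moved from the 51st field to the 3rd so the effect is visible. This is a sketch only: it needs GNU sed (for \t and \|), it assumes the data itself contains no "_" or "X", and the function name hold_space_demo is just a label for this example.

```shell
# Reads lines on stdin; the first 2 fields get "00", the rest get "_".
hold_space_demo() {
    sed '
        # put "_" in front of every field (line start and after each tab)
        s/\t\|^/&_/g
        # mark the 3rd "_" as the split point (51 in the original command)
        s/_/X/3
        # save a copy of the whole line in the hold space
        h
        # keep only the part before the mark ...
        s/X.*$//
        # ... and turn its "_" prefixes into "00"
        s/_/00/g
        # append a newline plus the saved copy
        G
        # drop the duplicated head up to the mark and restore the "_"
        s/\n.*X/_/
    '
}

printf '1\t22\t53\t3\n' | hold_space_demo
```

With that input the result should be the four fields 001, 0022, _53 and _3, still separated by tabs.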

Sorry for the late reply and thank you all for great help.

@RudiC, it looks like you are using GNU sed, since \| and \t are extensions. GNU sed would allow something like this:

sed 's/\b[0-9]/_&/51g; s//00&/g' file
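
To see what the combined flag does, here is the same idea on a five-field line with the cutoff lowered. Again just a sketch for GNU sed: the wrapper name prefix_with_sed and its argument (the number of leading fields that get "00") are hypothetical, and it assumes every field is a plain unsigned integer.

```shell
# $1 = how many leading fields get "00" (default 50); data is read from stdin.
prefix_with_sed() {
    n=${1:-50}
    # "<N>g" = substitute the Nth match and all later ones (GNU extension).
    # The second s reuses the same regex; "_" is a word character, so \b no
    # longer matches in front of the digits that were already prefixed.
    sed 's/\b[0-9]/_&/'"$((n + 1))"'g; s//00&/g'
}

printf '1\t22\t53\t3\t76\n' | prefix_with_sed 2
```

Here the first two fields should come out as 001 and 0022, and the remaining three as _53, _3 and _76.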

That's what I had in mind - didn't know you can combine flags to the s command. It's certainly an advantage if you can read: info sed :