Field extract w/o - and adding formats

Daniel_Gate · August 24, 2012, 2:50pm

I have a txt file like below:

Accession  Orderable  Unique ID  Subject ID  Timepoint  Colletion Date/Time  
12-170-0185  Glucose  1756487  1  D-07_H00.03  18-JUN-2012 06:03:00  
12-170-0185  Insulin  1756487  1  D-07_H00.03  18-JUN-2012 06:03:00  
12-170-0200  Glucose  1756487  1  D-07_H00.05  18-JUN-2012 06:05:00  
12-170-0200  Insulin  1756487  1  D-07_H00.05  18-JUN-2012 06:05:00  
12-170-0215  Glucose  1756487  1  D-07_H00.15  18-JUN-2012 06:15:00  
......

Accession  Orderable  Unique ID  Subject ID  Timepoint  Colletion Date/Time  
12-170-0220  Glucose  1759452  3  D-07_H00.15  18-JUN-2012 06:15:00  
12-170-0220  Insulin  1759452  3  D-07_H00.15  18-JUN-2012 06:15:00  
12-170-0181  Insulin  1751183  4  D-07_H00.03  18-JUN-2012 06:03:00  
......

The field I need to extract is the first one called "Accession".

I need to convert the Accession Number in the following format with length 18.

i.e.

12-198-0212 => "000002012198000212",
12-198-0214 => "000002012198000214",
12-198-0216 => "000002012198000216",
12-198-0218 => "000002012198000218",
12-198-0220 => "000002012198000220",

i.e.

is added in front of

12-198

with

- (dash)

taken out, and

is added in front of

-0212

with

- (dash)

taken out as well.

"000002012198000212"

has

" " ,

, and the double-quotes and comma are added for text delimiter file.

Other than "Accession" field, I don't need the rest of fields on the extracted output.

Please advise.

msabhi · August 24, 2012, 3:00pm

hope this helps

 perl -alne 'if(/^\d+-\d+-\d+/) {@acc=split(/-/,$F[0]);printf("12-198-%s => \"00020%s19800%s\",\n",$acc[2],$acc[0],$acc[2]);}' input_file

Daniel_Gate · August 24, 2012, 3:11pm

Sorry, if I am not clear on i.e.

12-198

It is not static. It can be

12-170

or else.

i.e.

12-170-0185  Glucose  1756487  1  D-07_H00.03  18-JUN-2012 06:03:00  
12-170-0185  Insulin  1756487  1  D-07_H00.03  18-JUN-2012 06:03:00

Please advise.

msabhi · August 24, 2012, 3:18pm

I thought it wouldn't be constant and had given the code for the same at first..later after going through your req again, i re-edited it since it seemed to be constant:) ..cool..here it goes

perl -alne 'if(/^\d+-\d+-\d+/) {@acc=split(/-/,$F[0]);printf("%s => \"00020%s%s00%s\",\n",$F[0],@acc);}'  input_file

spacebar · August 24, 2012, 3:18pm

This is one way to do it:

$ cat t
Accession  Orderable  Unique ID  Subject ID  Timepoint  Colletion Date/Time
12-170-0185  Glucose  1756487  1  D-07_H00.03  18-JUN-2012 06:03:00
12-170-0185  Insulin  1756487  1  D-07_H00.03  18-JUN-2012 06:03:00
12-170-0200  Glucose  1756487  1  D-07_H00.05  18-JUN-2012 06:05:00
12-170-0200  Insulin  1756487  1  D-07_H00.05  18-JUN-2012 06:05:00
12-170-0215  Glucose  1756487  1  D-07_H00.15  18-JUN-2012 06:15:00
12-170-0220  Glucose  1759452  3  D-07_H00.15  18-JUN-2012 06:15:00
12-170-0220  Insulin  1759452  3  D-07_H00.15  18-JUN-2012 06:15:00
12-170-0181  Insulin  1751183  4  D-07_H00.03  18-JUN-2012 06:03:00

$ sed -e '1d' -e 's/^\([[:digit:]]\{2\}\)-\([[:digit:]]\{3\}\)-\([[:digit:]]\{4\}\).*/"0000020\1\200\3"/g' t
"000002012170000185"
"000002012170000185"
"000002012170000200"
"000002012170000200"
"000002012170000215"
"000002012170000220"
"000002012170000220"
"000002012170000181"

Daniel_Gate · August 24, 2012, 3:29pm

It works great!

Two things I like to ask:

1) The output shows as below:

12-170-0185 => "0002012170000185",
12-170-0185 => "0002012170000185",
12-170-0200 => "0002012170000200",
12-170-0200 => "0002012170000200",
12-170-0215 => "0002012170000215",

i.e.

12-170-0185 =>

is not needed, just

"0002012170000185", "0002012170000185", etc

. So, I just need the converted Accession number.

2) If I can have another script showing the converted Accession Number along with the other fields, that will be great!

So, the output will look like;

"0002012170000185",  Glucose  1756487  1  D-07_H00.03  18-JUN-2012 06:03:00  
"0002012170000185",  Insulin  1756487  1  D-07_H00.03  18-JUN-2012 06:03:00  
"0002012170000200",  Glucose  1756487  1  D-07_H00.05  18-JUN-2012 06:05:00

The reason I need to have two scripts (one with only Accession Number; the other with all fileds beginning with Accession Number) is for verifications as it will be very helpful to go through the rows.

Please advise.

msabhi · August 24, 2012, 3:36pm

Req 1>

perl -alne 'if(/^\d+-\d+-\d+/) {@acc=split(/-/,$F[0]);printf("\"00020%s%s00%s\",\n",@acc);}' input_file

Req 2>

perl -alne 'if(/^\d+-\d+-\d+/) {@acc=split(/-/,$F[0]);$F[0]=sprintf("\"00020%s%s00%s\",",@acc); print "@F"}'

input_file

Daniel_Gate · August 24, 2012, 3:41pm

spacebar:

This is one way to do it:

$ cat t
Accession  Orderable  Unique ID  Subject ID  Timepoint  Colletion Date/Time
12-170-0185  Glucose  1756487  1  D-07_H00.03  18-JUN-2012 06:03:00
12-170-0185  Insulin  1756487  1  D-07_H00.03  18-JUN-2012 06:03:00
12-170-0200  Glucose  1756487  1  D-07_H00.05  18-JUN-2012 06:05:00
12-170-0200  Insulin  1756487  1  D-07_H00.05  18-JUN-2012 06:05:00
12-170-0215  Glucose  1756487  1  D-07_H00.15  18-JUN-2012 06:15:00
12-170-0220  Glucose  1759452  3  D-07_H00.15  18-JUN-2012 06:15:00
12-170-0220  Insulin  1759452  3  D-07_H00.15  18-JUN-2012 06:15:00
12-170-0181  Insulin  1751183  4  D-07_H00.03  18-JUN-2012 06:03:00

$ sed -e '1d' -e 's/^\([[:digit:]]\{2\}\)-\([[:digit:]]\{3\}\)-\([[:digit:]]\{4\}\).*/"0000020\1\200\3"/g' t
"000002012170000185"
"000002012170000185"
"000002012170000200"
"000002012170000200"
"000002012170000215"
"000002012170000220"
"000002012170000220"
"000002012170000181"

This works well.

Is there a way to remove the header

Accession  Orderable  Unique ID  Subject ID  Timepoint  Colletion Date/Time

and blank rows?

Please advise.

msabhi · August 24, 2012, 6:24pm

@Daniel : Just in case if you missed the answer, please refer to my last post...

Daniel_Gate · August 24, 2012, 7:49pm

Yes, that works fantastic! Really appreciate it!