Change Parse Script

Roga_Danar · August 30, 2011, 9:49pm

I have a question about changing the way we currently parse a file:

input FILE123

TAGA01: 01
TAG02: daadsf
TAG03: adfasdf
TAGBBB04: 35
TAG05: asdfa
TAG07: adfd
TAG07: adfa3
TAG07: 234234
TAGCC08: 3525df
TAG09: adsfa
TAG10: 245
TAG11: nnnn
EOR:
TAGA01: 02
TAG02: abas
TAG03: asdfasd
TAGBBB04: E
TAG05: asdfasd
TAG07: acvasc
TAG07: czcvc
TAG07: 22
TAGCC08: adsfasd
TAG09: Y
TAG11: yyyy
EOR:
.
.
.

Note that some tags may not be in a record, and some tags may repeat in the same record.

I need to convert this to the following inline format (the delimiter doesn't matter, and I can change it if the data in other files contains the delimiter) and trim each value so the tag doesn't appear:

Format:
TAGA01 TAGCC08 TAGBBB04 TAG09 TAG11

output.file:
01 3525df 35 adsfa nnnn
02 adsfasd E Y yyyy
.
.
.

Here is what is currently used (from memory, so the syntax isn't correct, but the idea is there):

cat FILE123 | egrep "^TAGA01 ^TAGBBB04 ^TAGCC08 ^TAG09 ^TAG11" | awk -F. -f awkfile.awk > output.file

where awkfile.awk contains if statements and a printf output statement (again, the syntax and the substring numbers are not correct, but the idea is there):

if ($1==TAGA01) {pTAGA01=substr($1,3)}
.
.
.
if ($1==TAG11) {
   pTAG11=substr($1,4)
   printf pTAGA01 ... pTAG11
}

I wanted to see different ideas for two reasons: first, to see whether this could be more efficient, since every line is tested against multiple ifs every time; and second, simply to learn something new.

Thanks for your time!
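
One alternative to the chain of ifs, offered only as a minimal sketch: let awk store every "TAG: value" line of a record in an array keyed by tag name, and print the wanted tags when the EOR: line arrives. The function name parse_tags and the sample data are assumptions made for illustration. A tag that is missing from a record prints as an empty field, and a repeated tag keeps only its last value (which is fine here only because the wanted tags do not repeat in the sample).

```shell
# Hypothetical helper: collect the tags of each record in an awk array
# and print the selected ones when the "EOR:" line is reached.
parse_tags() {
    awk '
        # "EOR:" closes a record: print the selected tags, then reset.
        $1 == "EOR:" {
            print rec["TAGA01"], rec["TAGCC08"], rec["TAGBBB04"], rec["TAG09"], rec["TAG11"]
            split("", rec)      # portable way to empty the array
            next
        }
        # Any other "TAG: value" line: strip the tag, keep the value.
        /^[^:]+:/ {
            tag = $1
            sub(/:$/, "", tag)
            val = $0
            sub(/^[^:]+:[ \t]*/, "", val)
            rec[tag] = val      # a repeated tag keeps its last value
        }
    ' "$@"
}

# Small sample record fed on standard input; prints: 01 3525df 35 adsfa nnnn
printf '%s\n' 'TAGA01: 01' 'TAG02: daadsf' 'TAGBBB04: 35' \
    'TAGCC08: 3525df' 'TAG09: adsfa' 'TAG11: nnnn' 'EOR:' | parse_tags
```

With a real file the call would look like parse_tags FILE123 > output.file. Each input line gets one array assignment instead of a series of comparisons, and the egrep pre-filter is no longer needed; whether that is measurably faster would have to be timed on real data.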

yazu · August 30, 2011, 10:25pm

If you change each EOR: line to a blank line, you can use perl in its paragraph mode:

sed 's/^EOR:$//' INPUTFILE | 
perl -00 -ne '/
TAGA01:\s+(.*?)\n
.*
TAGCC08:\s+(.*?)\n
# and so on
/xs && print "$1 $2\n"'

Roga_Danar · August 31, 2011, 9:38am

Thanks, yazu. If you get a moment, could you explain a few things to me?
I understand what the sed command does, but not all of the syntax. From the man page: s/regexp/replacement/. Why the ^ (I was also wondering about this character in the egrep command in the original script) and the $ before the second / (is that the newline character)? And does the third / coming right after it mean "replace with a blank line", since no replacement text is shown?
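
The anchors asked about here can be tried in isolation. The following is a small sketch; the function name blank_eor and the three sample lines are made up for the demonstration.

```shell
# ^ anchors the match to the start of a line and $ to the end, so
# s/^EOR:$// only touches lines that consist of exactly "EOR:".
# Nothing sits between the last two slashes, so the matched text is
# replaced with nothing; the newline is not part of the match, which
# is why an empty line is left behind.
blank_eor() {
    sed 's/^EOR:$//' "$@"
}

# The middle line becomes blank; the third line is left alone because
# its "EOR:" is neither at the start nor the whole line.
printf '%s\n' 'TAG11: nnnn' 'EOR:' 'TAG02: has EOR: inside' | blank_eor
```

The ^ in an egrep pattern has the same meaning: a pattern such as ^TAGA01 only matches lines that begin with TAGA01.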

All of this is piped into the perl script.

Again, from the man page, -00 will 'slurp' the stream in paragraph mode, and -ne starts a while (<>) loop around the command. I'm not sure about the perl command itself. Why the first /? The command searches for a tag followed by a colon, but I'm not sure what \s+(.*?) does, or what the .* between the tag searches does. I'm also not sure what /xs does, or what $1 and $2 refer to in the print line.

And I assume all of this can just be redirected with > into an output file, or is there a better way to do that with perl?

Thanks again!