Read tags in text file

AKD · July 20, 2010, 5:12am

Hello Team,

I am writing a script that reads a text file (say 1.txt) in which each line has the form Number State Label, for example:

2 s2 a+bb
3 s3 a+bb
4 s4 a+bb

There is another text file (say 2.txt) with sample data like this:

~x "a+bb"
<BEGIN>
<TOTAL> 3
<STATE> 1
~y "S_2"
<STATE> 2
~y "S_6"
<STATE> 3
~y "S_4"
~z "Z_t"
<END>
~y "S_2"
<FIRST> 4
 5.66 5.66 6.66 7.33
<SECOND> 4
 1.23 4.55 4.55 4.55
~y "S_6"
<FIRST> 4
 5.66 5.66 6.66 7.33
<SECOND> 4
 1.23 4.55 4.55 4.55
~y "S_4"
<FIRST> 4
 5.66 5.66 6.66 7.33
<SECOND> 4
 1.23 4.55 4.55 4.55

On reading 1.txt, the script should search 2.txt for the label "a+bb" (an exact match, not a pattern that would also match a+bb+c), go to <STATE> 2, and read that state's <FIRST> and <SECOND> values 2 times to give <FIRST><SECOND><FIRST><SECOND>, i.e. 5.66,5.66,6.66,7.33,1.23,4.55,4.55,4.55,5.66,5.66,6.66,7.33,1.23,4.55,4.55,4.55. The values should all be comma separated, like an array, which I will use later in my program.
After this, it reads the second line of 1.txt (3 s3 a+bb), searches for the label "a+bb" again, goes to <STATE> 3 (from s3 in 1.txt), and appends <FIRST><SECOND><FIRST><SECOND><FIRST><SECOND>, i.e. 3 repetitions (from the 3 in column 1 of 1.txt), to the previous array. This repeats until every line of 1.txt has been processed.

I am very much stuck on this part of my program. If anyone can help me out, I would be very thankful.

Thanks.

radoulov · July 20, 2010, 5:52am

Could you post an example of the desired output?

AKD · July 20, 2010, 6:02am

For 1.txt my output should be:

5.66,5.66,6.66,7.33,1.23,4.55,4.55,4.55,5.66,5.66,6.66,7.33,1.23,4.55,4.55,4.55,5.66,5.66,6.66,7.33,1.23,4.55,4.55,4.55,5.66,5.66,6.66,7.33,1.23,4.55,4.55,4.55,5.66,5.66,6.66,7.33,1.23,4.55,4.55,4.55,5.66,5.66,6.66,7.33,1.23,4.55,4.55,4.55,5.66,5.66,6.66,7.33,1.23,4.55,4.55,4.55,5.66,5.66,6.66,7.33,1.23,4.55,4.55,4.55,5.66,5.66,6.66,7.33,1.23,4.55,4.55,4.55

That can be read as 2 times state s2 of a+bb in 2.txt, followed by 3 times state s3 of a+bb in 2.txt and 4 times state s4 of a+bb in 2.txt. I hope that is clear.

radoulov · July 20, 2010, 6:26am

Your sample data includes a single label. If the label is important, you should post a bigger sample of your data so that we can understand how the different labels should be treated.

AKD · July 20, 2010, 7:24am

I am pasting a bigger sample of my data file (2.txt); the original file runs to about 5MB.

~b "ST_ah_2_2"
<FIRST> 
 6.42 2.53
<SECOND> 2
 1.8 6.29
~b "ST_ah_3_6"
<FIRST> 2
 6.61 1.02
<SECOND> 2
 1.51 6.33
~b "ST_ah_4_9"
<FIRST> 2
 6.33 1.02
<SECOND> 2
 2.61 2.42
~b "ST_ih_2_2"
<FIRST> 2
 6.66 1.01
<SECOND> 2
 2.01 1.08
~b "ST_ih_3_3"
<FIRST> 2
 6.63 1.20
<SECOND> 2
 2.29 1.02
 ~b "ST_ih_4_4"
<FIRST> 2
 6.87 9.01
<SECOND> 2
 3.45 4.06
~b "ST_er_2_5"
<FIRST> 2
 6.89 1.20
<SECOND> 2
 2.16 4.22
~b "ST_er_3_5"
<FIRST> 2
 6.01 9.20
<SECOND> 2
 6.16 4.22
 ~b "ST_er_4_5"
<FIRST> 2
 6.89 1.20
<SECOND> 2
 6.36 2.42
~a "aa-ah"
<BEGIN>
<STATES> 3
<STATE> 1
~b "ST_ah_2_2"
<STATE> 2
~b "ST_ah_3_6"
<STATE> 3
~b "ST_ah_4_9"
~z "Z_ah"
<END>
~a "iy-ih"
<BEGIN>
<STATES> 3
<STATE> 1
~b "ST_ih_2_2"
<STATE> 2
~b "ST_ih_3_3"
<STATE> 3
~b "ST_ih_4_4"
~z "Z_ih"
<END>
~a "ey+er"
<BEGIN>
<STATES> 3
<STATE> 1
~b "ST_er_2_5"
<STATE> 2
~b "ST_er_3_5"
<STATE> 3
~b "ST_er_4_5"
~z "Z_er"
<END>

If my 1.txt happened to be like this:

2 s1 ey+er
1 s2 ey+er
1 s3 ey+er
1 s1 iy-ih
1 s2 iy-ih
2 s3 iy-ih

I am expecting this output:

6.89 1.20 2.16 4.22 6.89 1.20 2.16 4.22 6.01 9.20 6.16 4.22 6.89 1.20 6.36 2.42 6.66 1.01 2.01 1.08 6.63 1.20 2.29 1.02 6.87 9.01 3.45 4.06 6.87 9.01 3.45 4.06

The output can be understood as follows:

Read 1.txt. Line 1 is 2 s1 ey+er, i.e. 2 times <STATE> 1 of label ey+er in 2.txt.
Search 2.txt for "ey+er" (an exact match, not a pattern, as there might be labels like "ey+er+t"). Go to tag <STATE> 1, then combine the elements of its <FIRST> and <SECOND> tags, i.e. 6.89 1.20 2.16 4.22. Since column 1 in 1.txt is 2, write this 2 times to a file, i.e.:

6.89 1.20 2.16 4.22 6.89 1.20 2.16 4.22

Then read the 2nd line of 1.txt, i.e. 1 s2 ey+er.
Search for <STATE> 2 (i.e. s2) of label ey+er in 2.txt and combine the elements of its <FIRST> and <SECOND> tags, i.e.:

6.01 9.20 6.16 4.22

Since column 1 of 1.txt is 1, I append this only once to the new text file.
The combined output is now:

6.89 1.20 2.16 4.22 6.89 1.20 2.16 4.22 6.01 9.20 6.16 4.22

This process repeats until 1.txt is finished.

radoulov · July 20, 2010, 9:44am

Something like this:

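awk 'END { print RS }                # end the output line
NR == FNR {                          # first file (2.txt): collect the data
  # remember the most recent ~b state name (quotes included)
  if (/~b/) state_name = $2
  # numeric row: strip leading blanks and append it to that state
  if (/^ *[0-9.]/) {
    sub(/^  */, "")
    state_values[state_name] = state_values[state_name] ? \
      state_values[state_name] FS $0 : $0
  }
  # ~a starts a label block
  if (/~a/) label = $2
  # list the ~b names seen under the current label, in order
  if (/~b/) state_names[label] = state_names[label] ? \
            state_names[label] SUBSEP $2 : $2
  next
}
{                                    # second file (1.txt): count, state, label
  split(state_names[qq $3 qq], t, SUBSEP)
  # sN selects the Nth name listed under the label; print it $1 times
  for (i = 1; i <= $1; i++)
    printf "%s ", state_values[t[substr($2, 2)]]
}' qq='"' 2.txt 1.txt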

rdcwayx · July 20, 2010, 9:55am

After seeing radoulov's code, mine is useless.

removed

radoulov · July 20, 2010, 10:07am

Your code may not be useless: I suspect mine has plenty of problems, and I don't have time to test it properly.

AKD · July 20, 2010, 12:27pm

Dear radoulov,

Thanks for your wonderful input. I had been struggling with this for weeks. There is one small issue (I hope). As I said, the label lookup in 2.txt must be an exact match, not a pattern search. If I alter my 2.txt to contain both labels ey+er+s and ey+er:

~b "ST_ah_2_2"
<FIRST> 
 6.42 2.53
<SECOND> 2
 1.8 6.29
~b "ST_ah_3_6"
<FIRST> 2
 6.61 1.02
<SECOND> 2
 1.51 6.33
~b "ST_ah_4_9"
<FIRST> 2
 6.33 1.02
<SECOND> 2
 2.61 2.42
~b "ST_ih_2_2"
<FIRST> 2
 6.66 1.01
<SECOND> 2
 2.01 1.08
~b "ST_ih_3_3"
<FIRST> 2
 6.63 1.20
<SECOND> 2
 2.29 1.02
 ~b "ST_ih_4_4"
<FIRST> 2
 6.87 9.01
<SECOND> 2
 3.45 4.06
~b "ST_er_2_5"
<FIRST> 2
 6.89 1.20
<SECOND> 2
 2.16 4.22
~b "ST_er_3_5"
<FIRST> 2
 6.01 9.20
<SECOND> 2
 6.16 4.22
 ~b "ST_er_4_5"
<FIRST> 2
 6.89 1.20
<SECOND> 2
 6.36 2.42
~a "ey+er+s"
<BEGIN>
<STATES> 3
<STATE> 1
~b "ST_ah_2_2"
<STATE> 2
~b "ST_ah_3_6"
<STATE> 3
~b "ST_ah_4_9"
~z "Z_ah"
<END>
~a "iy-ih"
<BEGIN>
<STATES> 3
<STATE> 1
~b "ST_ih_2_2"
<STATE> 2
~b "ST_ih_3_3"
<STATE> 3
~b "ST_ih_4_4"
~z "Z_ih"
<END>
~a "ey+er"
<BEGIN>
<STATES> 3
<STATE> 1
~b "ST_er_2_5"
<STATE> 2
~b "ST_er_3_5"
<STATE> 3
~b "ST_er_4_5"
~z "Z_er"
<END>

Then the output is:

6.42 2.53 1.8 6.29 6.61 1.02 1.51 6.33 6.33 1.02 2.61 2.42 6.66 1.01 2.01 1.08 6.63 1.20 2.29 1.02 6.87 9.01 3.45 4.06 6.89 1.20 2.16 4.22 6.01 9.20 6.16 4.22 6.89 1.20 6.36 2.42 6.42 2.53 1.8 6.29 6.61 1.02 1.51 6.33 6.33 1.02 2.61 2.42 6.66 1.01 2.01 1.08 6.63 1.20 2.29 1.02 6.87 9.01 3.45 4.06 6.89 1.20 2.16 4.22 6.01 9.20 6.16 4.22 6.89 1.20 6.36 2.42 6.42 2.53 1.8 6.29 6.61 1.02 1.51 6.33 6.33 1.02 2.61 2.42 6.66 1.01 2.01 1.08 6.63 1.20 2.29 1.02 6.87 9.01 3.45 4.06 6.89 1.20 2.16 4.22 6.01 9.20 6.16 4.22 6.89 1.20 6.36 2.42 6.42 2.53 1.8 6.29 6.61 1.02 1.51 6.33 6.33 1.02 2.61 2.42 6.66 1.01 2.01 1.08 6.63 1.20 2.29 1.02 6.87 9.01 3.45 4.06 6.89 1.20 2.16 4.22 6.01 9.20 6.16 4.22 6.89 1.20 6.36 2.42 6.42 2.53 1.8 6.29 6.61 1.02 1.51 6.33 6.33 1.02 2.61 2.42 6.66 1.01 2.01 1.08 6.63 1.20 2.29 1.02 6.87 9.01 3.45 4.06 6.89 1.20 2.16 4.22 6.01 9.20 6.16 4.22 6.89 1.20 6.36 2.42 6.42 2.53 1.8 6.29 6.61 1.02 1.51 6.33 6.33 1.02 2.61 2.42 6.66 1.01 2.01 1.08 6.63 1.20 2.29 1.02 6.87 9.01 3.45 4.06 6.89 1.20 2.16 4.22 6.01 9.20 6.16 4.22 6.89 1.20 6.36 2.42 6.42 2.53 1.8 6.29 6.61 1.02 1.51 6.33 6.33 1.02 2.61 2.42 6.66 1.01 2.01 1.08 6.63 1.20 2.29 1.02 6.87 9.01 3.45 4.06 6.89 1.20 2.16 4.22 6.01 9.20 6.16 4.22 6.89 1.20 6.36 2.42 6.42 2.53 1.8 6.29 6.61 1.02 1.51 6.33 6.33 1.02 2.61 2.42 6.66 1.01 2.01 1.08 6.63 1.20 2.29 1.02 6.87 9.01 3.45 4.06 6.89 1.20 2.16 4.22 6.01 9.20 6.16 4.22 6.89 1.20 6.36 2.42

Just in case, you can look at my original data files.
Original data for 2.txt: Pastebin.com paste dF9Eng8z (titled "AKD 1.txt")

Original data for 1.txt: Pastebin.com paste vrHfUqTD (titled "AKD 2.txt")

Thanks
Aditya

---------- Post updated at 11:27 AM ---------- Previous update was at 10:13 AM ----------

Thanks radoulov, that was a mistake on my part while changing your code. Thanks for the solution.
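
On the exact-label requirement discussed in this thread, a small sketch of why an array subscript lookup (as in state_names[qq $3 qq]) behaves as an exact match. In awk, array subscripts are compared as whole strings, whereas a regex match treats "+" as a quantifier and a substring search also finds longer labels. The variable names below are illustrative only.

```shell
#!/bin/sh
# Two labels that differ only by a suffix, as in the thread.
labels='ey+er+s
ey+er'
want='ey+er'

# Regex match: "+" is a quantifier here, so neither label matches.
regex_hits=$(printf '%s\n' "$labels" | awk -v want="$want" '$0 ~ want')

# Substring match: finds ey+er, but also ey+er+s.
substr_hits=$(printf '%s\n' "$labels" | awk -v want="$want" 'index($0, want)')

# String equality, which is what an array subscript lookup uses: only ey+er.
exact_hits=$(printf '%s\n' "$labels" | awk -v want="$want" '$0 == want')

printf 'regex:  [%s]\nsubstr: [%s]\nexact:  [%s]\n' \
  "$regex_hits" "$substr_hits" "$exact_hits"
```

So as long as the label from 1.txt is used as an array key (or compared with ==) rather than as a pattern, labels such as ey+er+s should not be picked up by mistake.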