Concatenating and appending strings based on a specific pattern match

Input

#GEO-1-type-1-fwd-Initial  890 1519
OPKHIJEFVTEFVHIJEFVOPKHIJTOPKEFVHIJTEFVOPKOPKHIJHIJHIJTTOPKHIJHIJEFVEFVOPKHIJOPKHIJOPKEFVEFVOPKHIJHIJEFVHIJHIJEFVTHIJOPKOPKTEFVEFVEFVOPKHIJOPKOPKHIJTTEFVEFVTEFV

#GEO-1-type-2-fwd-Terminal  1572 2030
HIJOPKHIJEFVTOPKOPKTTOPKHIJOPKHIJEFVOPKTOPKTOPKHIJHIJTEFVOPKTOPKTOPKEFVOPKOPKEFVEFVTEFVOPKHIJEFVEFVOPKHIJOPKOPKHIJHIJEFVEFVHIJEFVEFVTOPKEFVOPKTHIJTTHIJOPK

#GEO-2-type-1-rev-Terminal  2734 2475
EFVTEFVTTOPKTOPKTEFVOPKHIJTEFVTTTOPKEFVTEFVOPKTTOPKTHIJTTTOPKEFVTOPKTEFVEFVEFVTHIJEFVHIJOPKEFVHIJOPKHIJEFVEFVHIJEFVEFVEFVTHIJEFVHIJOPKTHIJ

#GEO-2-type-2-rev-Internal  3041 2804
TEFVEFVOPKHIJTEFVHIJHIJHIJOPKOPKTTOPKHIJTOPKTOPKEFVEFVEFVEFVOPKHIJEFVTEFVTHIJTOPKHIJEFVOPKOPKTHIJEFVHIJHIJOPKOPKHIJHIJTTEFVEFVOPKTTEFVEFVOPKHIJOPKOPKOPK

#GEO-2-type-3-rev-Terminal  4050 3990
IJTOPKHIJEFVOPKOPKTHIJEFVHIJHIJOPKOPKHIJHIJTTEFVEFVOPKTTEFVEFVOPKHIJOPKOPKOPK

Output

#GEO-1-fwd 890 1519 1572 2030 
OPKHIJEFVTEFVHIJEFVOPKHIJTOPKEFVHIJTEFVOPKOPKHIJHIJHIJTTOPKHIJHIJEFVEFVOPKHIJOPKHIJOPKEFVEFVOPKHIJHIJEFVHIJHIJEFVTHIJOPKOPKTEFVEFVEFVOPKHIJOPKOPKHIJTTEFVEFVTEFVHIJOPKHIJEFVTOPKOPKTTOPKHIJOPKHIJEFVOPKTOPKTOPKHIJHIJTEFVOPKTOPKTOPKEFVOPKOPKEFVEFVTEFVOPKHIJEFVEFVOPKHIJOPKOPKHIJHIJEFVEFVHIJEFVEFVTOPKEFVOPKTHIJTTHIJOPK

#GEO-2-rev 4050 3990 3041 2804 2734 2475
IJTOPKHIJEFVOPKOPKTHIJEFVHIJHIJOPKOPKHIJHIJTTEFVEFVOPKTTEFVEFVOPKHIJOPKOPKOPKTEFVEFVOPKHIJTEFVHIJHIJHIJOPKOPKTTOPKHIJTOPKTOPKEFVEFVEFVEFVOPKHIJEFVTEFVTHIJTOPKHIJEFVOPKOPKTHIJEFVHIJHIJOPKOPKHIJHIJTTEFVEFVOPKTTEFVEFVOPKHIJOPKOPKOPKEFVTEFVTTOPKTOPKTEFVOPKHIJTEFVTTTOPKEFVTEFVOPKTTOPKTHIJTTTOPKEFVTOPKTEFVEFVEFVTHIJEFVHIJOPKEFVHIJOPKHIJEFVEFVHIJEFVEFVEFVTHIJEFVHIJOPKTHIJ

I would like to concatenate the string content of records based on their header description. For headers that contain "fwd", the content should be appended in ascending order. For headers that contain "rev", the content should be appended in descending order. I am trying awk and perl to achieve this now. Thanks a lot for any advice and suggestions.

Straightforward approach:

awk -F '[ -]' '{if (NF>1){r=$1"-"$2"-"$5; m=$5;
                   if (m=="fwd"){A[r]=A[r]" "$8" "$9}
                   else if (m=="rev"){A[r]=$8" "$9" "A[r]} }
                else if (!/^$/){
                  if (m=="fwd") {B[r]=B[r]$1}
                  else {if (m=="rev") B[r]=$1B[r]} } }
                END{for (i in A) {print i, A[i]; print B[i] }}' infile

Thanks a lot, Scrutinizer.
Your code works perfectly.
But it prints the groups in this order:

#GEO-2-rev 4050 3990 3041 2804 2734 2475
IJTOPKHIJEFVOPKOPKTHIJEFVHIJHIJOPKOPKHIJHIJTTEFVEFVOPKTTEFVEFVOPKHIJOPKOPKOPKTEFVEFVOPKHIJTEFVHIJHIJHIJOPKOPKTTOPKHIJTOPKTOPKEFVEFVEFVEFVOPKHIJEFVTEFVTHIJTOPKHIJEFVOPKOPKTHIJEFVHIJHIJOPKOPKHIJHIJTTEFVEFVOPKTTEFVEFVOPKHIJOPKOPKOPKEFVTEFVTTOPKTOPKTEFVOPKHIJTEFVTTTOPKEFVTEFVOPKTTOPKTHIJTTTOPKEFVTOPKTEFVEFVEFVTHIJEFVHIJOPKEFVHIJOPKHIJEFVEFVHIJEFVEFVEFVTHIJEFVHIJOPKTHIJ

#GEO-1-fwd 890 1519 1572 2030 
OPKHIJEFVTEFVHIJEFVOPKHIJTOPKEFVHIJTEFVOPKOPKHIJHIJHIJTTOPKHIJHIJEFVEFVOPKHIJOPKHIJOPKEFVEFVOPKHIJHIJEFVHIJHIJEFVTHIJOPKOPKTEFVEFVEFVOPKHIJOPKOPKHIJTTEFVEFVTEFVHIJOPKHIJEFVTOPKOPKTTOPKHIJOPKHIJEFVOPKTOPKTOPKHIJHIJTEFVOPKTOPKTOPKEFVOPKOPKEFVEFVTEFVOPKHIJEFVEFVOPKHIJOPKOPKHIJHIJEFVEFVHIJEFVEFVTOPKEFVOPKTHIJTTHIJOPK

By the way, can I ask what A[r] and B[r] mean, and what $9 represents in your awk code?
As I understand it, the header only has fields $1 to $8, right?
Thanks again, Scrutinizer.

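my (%hash, @order, $key, $dir);
while(<DATA>){
	chomp;
	next unless /\S/;	# skip blank lines
	if(/^#/){
		# header fields: #GEO, n, "type", k, fwd|rev, kind, start, end
		my @tmp = split /[- ]+/;
		$dir = $tmp[4];
		$key = "$tmp[0]-$tmp[1]-$dir";
		unless(exists $hash{$key}){
			push @order, $key;	# remember first-seen order
			$hash{$key} = { POS => "", DATA => "" };
		}
		# fwd appends the coordinates, rev prepends them
		if($dir eq "rev"){ $hash{$key}->{POS} = " $tmp[6] $tmp[7]".$hash{$key}->{POS}; }
		else{ $hash{$key}->{POS} .= " $tmp[6] $tmp[7]"; }
	}
	else{
		# sequence lines follow the same rule as the coordinates
		if($dir eq "rev"){ $hash{$key}->{DATA} = $_.$hash{$key}->{DATA}; }
		else{ $hash{$key}->{DATA} .= $_; }
	}
}
foreach my $key (@order){
	print $key, $hash{$key}->{POS}, "\n";
	print $hash{$key}->{DATA}, "\n";
}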
__DATA__
#GEO-1-type-1-fwd-Initial  890 1519
OPKHIJEFVTEFVHIJEFVOPKHIJTOPKEFVHIJTEFVOPKOPKHIJHIJHIJTTOPKHIJHIJEFVEFVOPKHIJOPKHIJOPKEFVEFVOPKHIJHIJEFVHIJHIJEFVTHIJOPKOPKTEFVEFVEFVOPKHIJOPKOPKHIJTTEFVEFVTEFV

#GEO-1-type-2-fwd-Terminal  1572 2030
HIJOPKHIJEFVTOPKOPKTTOPKHIJOPKHIJEFVOPKTOPKTOPKHIJHIJTEFVOPKTOPKTOPKEFVOPKOPKEFVEFVTEFVOPKHIJEFVEFVOPKHIJOPKOPKHIJHIJEFVEFVHIJEFVEFVTOPKEFVOPKTHIJTTHIJOPK

#GEO-2-type-1-rev-Terminal  2734 2475
EFVTEFVTTOPKTOPKTEFVOPKHIJTEFVTTTOPKEFVTEFVOPKTTOPKTHIJTTTOPKEFVTOPKTEFVEFVEFVTHIJEFVHIJOPKEFVHIJOPKHIJEFVEFVHIJEFVEFVEFVTHIJEFVHIJOPKTHIJ

#GEO-2-type-2-rev-Internal  3041 2804
TEFVEFVOPKHIJTEFVHIJHIJHIJOPKOPKTTOPKHIJTOPKTOPKEFVEFVEFVEFVOPKHIJEFVTEFVTHIJTOPKHIJEFVOPKOPKTHIJEFVHIJHIJOPKOPKHIJHIJTTEFVEFVOPKTTEFVEFVOPKHIJOPKOPKOPK

#GEO-2-type-3-rev-Terminal  4050 3990
IJTOPKHIJEFVOPKOPKTHIJEFVHIJHIJOPKOPKHIJHIJTTEFVEFVOPKTTEFVEFVOPKHIJOPKOPKOPK

Hi Patrick,

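That is because in awk the order in which for (i in A) visits the elements of an associative array is undefined. On my computer the groups get printed in the right order, but that is by chance. If the order is important, we have to do something to ensure it. A[r] collects the coordinates and B[r] the sequence for the group r (for example #GEO-1-fwd). As for $9: single spaces and - are both used as separator characters, and the two spaces before the coordinates produce an empty field, so there are more fields than you counted, hence the $9. We could improve the robustness by using * in the -F specification, so that a run of separators counts as one, and then using the proper field numbers.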
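
To see where the field numbers come from, it can help to print every field of a header next to its number. This is only a small sketch, assuming a POSIX shell and any awk; show_fields is a made-up helper name, not part of the scripts in this thread. The second call uses + rather than the * mentioned above, as an assumption on my side, because "one or more separators" cannot match an empty string and so should behave the same in every awk.

```shell
# Hypothetical helper: print each field of a line with its field number.
#   $1 = field separator regex passed to awk -F
#   $2 = the line to split
show_fields() {
    printf '%s\n' "$2" |
        awk -F "$1" '{ for (n = 1; n <= NF; n++) printf "$%d=[%s]\n", n, $n }'
}

header='#GEO-1-type-1-fwd-Initial  890 1519'

# Every single space or hyphen separates: the double space before 890
# produces an empty $7, so the coordinates land in $8 and $9.
show_fields '[ -]' "$header"

echo

# A run of separators counts as one: the coordinates are now $7 and $8.
show_fields '[ -]+' "$header"
```

With the first separator the header splits into 9 fields, with the second into 8, which is why the later scripts switch from $8/$9 to $7/$8.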

Thanks for your explanation, Scrutinizer.
I get what you mean now :)
I will try to fix the problem by making sure they come out in the right order.
I ran your script a few times just now.
Every run gives the "rev" result first and only then "fwd" :(
Thanks again, Scrutinizer.

Perhaps you could give this a try then:

awk -F '[ -]*' '{ if (NF>1){
                    r=$1"-"$2"-"$5; m=$5;
                    if (!A[r]) O[i++]=r
                    if (m=="fwd") A[r]=A[r]" "$7" "$8
                    else if (m=="rev") A[r]=$7" "$8" "A[r]
                  }
                  else if (NF>0)
                    if (m=="fwd") B[r]=B[r]$1
                    else if (m=="rev") B[r]=$1B[r]
                 }
                 END{for (j=0;j<i;j++) {k=O[j];print k, A[k]; print B[k] }}' infile

---------- Post updated 16-12-09 at 00:24 ---------- Previous update was 15-12-09 at 11:47 ----------

Slightly simplified:

awk -F '[ -]*' 'NF>1  { r=$1"-"$2"-"$5; m=$5; if (!A[r]) O[i++]=r
                        if (m=="fwd") A[r]=A[r]" "$7" "$8
                        else if (m=="rev") A[r]=$7" "$8" "A[r] }
                NF==1 { if (m=="fwd") B[r]=B[r]$1
                        else if (m=="rev") B[r]=$1B[r] }
                END   { for (j=0;j<i;j++) {k=O[j];print k, A[k]; print B[k]} }' infile
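
A gawk alternative (asort() is a gawk extension; this version assumes a single fwd group and a single rev group):

gawk '/^#.*fwd.*/{
   o=$0
   gsub(/-type.*/,"",o)
   fh=o
   fstr=$(NF-1) OFS $NF OFS fstr
   getline
   fl=fl $0                 # fwd: append the sequence
}
/^#.*rev.*/{
   o=$0
   gsub(/-type.*/,"",o)
   rh=o
   rstr=$(NF-1) OFS $NF OFS rstr
   getline
   rl=$0 rl                 # rev: prepend the sequence
}
END{
   split(fstr,F," ")
   fidx=asort(F,farr)       # sort the fwd coordinates ascending
   for(i=1;i<=fidx;i++){
        fs=fs OFS farr[i]
   }
   print fh"-fwd" fs
   print fl
   print ""
   split(rstr,R," ")
   ridx=asort(R,rarr)
   for(i=ridx;i>=1;i--){    # walk the sorted array backwards for descending order
        rs=rs OFS rarr[i]
   }
   print rh"-rev" rs
   print rl
}' infile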