Putting together substrings if pattern is matched

What I would like to do is if the lines with % have the same name, then combine the last 9 letters of the string underneath the last occurrence of that ID with the first 9 letters of the string underneath the first occurrence of that ID.

I have a file that looks like this:

%GOGG
doggocatcatFUNFUNGOGOHIHI
%SONSON
byetailfunfungo
%GOGG
hellobyebyetailfunfungoson
%SONSON
funfungogo

The output should look like this:

%GOGG
nfungosondoggocatc
%SONSON
unfungogobyetailfu

Notice how %GOGG now has "nfungoson" put together with "doggocatc" and the same with %SONSON. It's worth mentioning that the ID patterns can exist more than 2 times.

Is this classwork/homework?
It can be perfectly done with awk, using its associative arrays and its substr and getline functions.
What have you tried already?
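
As a rough illustration of the substr arithmetic hinted at above (a sketch only; the helper names first9 and last9 are made up here and do not come from the thread): substr(s, 1, 9) gives the first 9 characters, and substr(s, length(s) - 8) starts 8 characters before the last one, which gives the last 9.

```sh
# Hypothetical helpers, named for illustration only.
first9() { printf '%s\n' "$1" | awk '{ print substr($0, 1, 9) }'; }
last9()  { printf '%s\n' "$1" | awk '{ print substr($0, length($0) - 8) }'; }

first9 doggocatcatFUNFUNGOGOHIHI     # prints doggocatc
last9  hellobyebyetailfunfungoson    # prints nfungoson
```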

This is not homework, actually. It's just an annoying problem I've run into and can't figure out. What I have tried is the following, but it is nowhere near useful in my opinion.

cat file | egrep -A 2 "%" | head -c15 > file.out

An awk solution:

awk '
        {
                if ( $1 ~ /^%/ )
                {
                        A[$1]++
                        i = $1
                }
                else
                        R[A[i],i] = $0
        }
        END {
                for ( k in A )
                {
                        print k
                        print substr(R[A[k],k],length(R[A[k],k])-8) substr(R[1,k],1,9)
                }
        }
' file

Another one

awk '
$1~/^%/ {
  key=$1
  next
}
{
  if (key in s1) {
    s2[key]=substr($0,length($0)-8)
  } else {                     
    s1[key]=substr($0,1,9)
  }                  
}
END {               
  for (key in s1) {
    print key
    print s2[key] s1[key]
  }
}
' file

Yet one more:

awk '
  /^%/{
    i=$1
    getline s
    A[i]=(i in A)?substr(s, length(s)-8) substr(A[i],1,9):s
  }
  END{
    for (i in A) {
      print i
      print A[i]
    }
  }
' file

Scrutinizer, yours seems to work only when a string ID occurs multiple times, not when it occurs only once. It should behave the same way in that case too.

Yes, I had interpreted it to mean that you wanted to leave labels that appear only once untouched, and that labels would appear once or twice.

Yoda's suggestion does something similar, but cuts off 8 characters at the end of single entries.

It seems to me MadeInGermany's approach is closest to what you are looking for.

Your solution is actually what I was looking for, except that the output for singly occurring %IDs is as long as the input. So something like this would be the same in the output as in the input.

Could you post an input sample that also includes single and maybe triple entries (?), along with what the output should look like?

An example for single occurrence looks like this.
input:

%TESGO
dfkjsdfgogogocatcatdogtwelvetwentygogogo

output:

%TESGO
ntygogogodfkjsdfgo

for a triple occurrence (or even an occurrence of 4 or 5 etc..) it would be great if the %ID was modified to label what combination was being used:

input:

%TESGO
dfkjsdfgogogocatcatdogtwelvetwentygogogo
%TESGO
gogoCatDoggobye
%TESGO
byenowsoso

output:

%TESGO_A_C
yenowsosodfkjsdfgo
%TESGO_A_B
tDoggobyedfkjsdfgo

In all cases, the last 9 characters of each output line come from the start of the first occurrence of that ID.
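
A minimal sketch of that labelled variant, not a definitive solution. The function name label_combos is made up, and several things are assumed rather than stated in the thread: exactly one sequence line under each %ID line, at most 26 occurrences per ID (letters A to Z), and no label suffix when an ID occurs only once or twice, as in the earlier samples.

```sh
# Sketch only: label_combos is a hypothetical name.
label_combos() {
  awk '
    /^%/ {
      id = $1
      if (!(id in n)) {
        n[id] = 0
        order[++ids] = id            # remember first-seen order of IDs
      }
      next
    }
    id != "" {
      str[id, ++n[id]] = $0          # keep only the line right under the ID
      id = ""
    }
    END {
      letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
      for (x = 1; x <= ids; x++) {
        id = order[x]
        if (n[id] == 0) continue     # ID line with nothing under it
        head = substr(str[id, 1], 1, 9)   # start of the first occurrence
        if (n[id] <= 2) {            # assumed: no label for 1 or 2 entries
          s = str[id, n[id]]
          print id
          print substr(s, length(s) - 8) head
          continue
        }
        for (j = n[id]; j >= 2; j--) {    # last occurrence first, as in the sample
          s = str[id, j]
          print id "_A_" substr(letters, j, 1)
          print substr(s, length(s) - 8) head
        }
      }
    }
  ' "$@"
}
```

Called as label_combos file, it prints IDs in the order they first appear instead of the unspecified order of an awk for-in loop.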

This should work with single and double cases and if there are more than two, it takes the value of the last one, as mentioned in post #1:

awk '
  /^%/{
    getline s
    A[$1]=A[$1] s
  }                                           
  END{
    for (i in A) {
      print i
      print substr(A[i], length(A[i])-8) substr(A[i],1,9)
    }           
  }
' file

Suppose we have, say, 4 occurrences of the same ID and want to combine the end of occurrence 4 with the beginning of occurrence 4, the beginning of occurrence 3, and the beginning of occurrence 2, but not with the beginning of occurrence 1. How can this be done?

Basically, the combinations should look like this.

end of 4 with beginning of 4
            with beginning of 3
            with beginning of 2
end of 3 with beginning of 3
            with beginning of 2
end of 2 with beginning of 2
end of 1 with beginning of 1

input:

%dog
aaaaaaaaaaAAAAAAAAA
%dog
bbbbbbbbbbBBBBBBBBB
%dog
cccccccccccCCCCCCCCC
%dog
xxxxxxxxxxXXXXXXXXX

the output should look like this:

%dog
XXXXXXXXXxxxxxxxxx
%dog
XXXXXXXXXccccccccc
%dog
XXXXXXXXXbbbbbbbbb
%dog
CCCCCCCCCccccccccc
%dog
CCCCCCCCCbbbbbbbbb
%dog
BBBBBBBBBbbbbbbbbb
%dog
AAAAAAAAAaaaaaaaaa

I am trying to do something like this, but I am not sure how to print all of these combinations:

awk '{
      if ( $1 ~ /^%/ )
      {
                A[$1]++
                i = $1
      }
      else
                {for (j = 2; j <=i; j++)
                A = newA
                F=1
                G[F]=$1
                        }
                        END {
                                for (k in A)
                                {
                                        print k
                                        print substr(R[A[k],k], length(R[A[k])-8) substr(R[1,k],1,9)
                        }
                }}' file
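
One possible way to get those combinations, offered as a sketch and not a tested answer from the thread. It assumes one sequence line under each %ID line and stores every occurrence per ID; the function name all_combos and the variable names are made up. For occurrence j it pairs the last 9 characters with the first 9 characters of occurrences j down to 2, and occurrence 1 is paired only with itself.

```sh
# Sketch only: all_combos is a hypothetical name.
all_combos() {
  awk '
    /^%/ {
      id = $1
      if (!(id in n)) {
        n[id] = 0
        order[++ids] = id            # remember first-seen order of IDs
      }
      next
    }
    id != "" {
      str[id, ++n[id]] = $0          # keep only the line right under the ID
      id = ""
    }
    END {
      for (x = 1; x <= ids; x++) {
        id = order[x]
        for (j = n[id]; j >= 1; j--) {         # ends: last occurrence first
          tail = substr(str[id, j], length(str[id, j]) - 8)
          lo = (j == 1) ? 1 : 2                # occurrence 1 pairs only with itself
          for (m = j; m >= lo; m--) {          # beginnings: j down to lo
            print id
            print tail substr(str[id, m], 1, 9)
          }
        }
      }
    }
  ' "$@"
}
```

Run as all_combos file. Keeping the strings in str[id, occurrence] instead of overwriting them is what makes every end/beginning pair available in the END block.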