Putting together substrings if pattern is matched

verse123 · March 13, 2014, 2:27pm

If lines starting with % have the same name, I would like to combine the last 9 letters of the string underneath the last occurrence of that ID with the first 9 letters of the string underneath the first occurrence of that ID.

I have a file that looks like this:

%GOGG
doggocatcatFUNFUNGOGOHIHI
%SONSON
byetailfunfungo
%GOGG
hellobyebyetailfunfungoson
%SONSON
funfungogo

The output should look like this:

%GOGG
nfungosondoggocatc
%SONSON
unfungogobyetailfu

Notice how %GOGG now has "nfungoson" put together with "doggocatc" and the same with %SONSON. It's worth mentioning that the ID patterns can exist more than 2 times.

MadeInGermany · March 13, 2014, 4:09pm

Is this classwork/homework?
It can be perfectly done with awk, using its associative arrays and its substr and getline functions.
What have you tried already?
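
For reference, a minimal sketch of the two substr calls such a solution relies on. The function name ends9 is only illustrative, and every line is assumed to be at least 9 characters long:

```shell
# Hypothetical helper: print the last 9 and the first 9 characters of each line.
# Assumption: every input line has at least 9 characters.
ends9() {
  awk '{
    first = substr($0, 1, 9)             # characters 1 to 9
    last  = substr($0, length($0) - 8)   # from position len-8 to the end
    print last, first
  }' "$@"
}

# Prints: nfungoson hellobyeb
printf '%s\n' 'hellobyebyetailfunfungoson' | ends9
```

An associative array keyed on the %ID line can then hold these pieces until the END block.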

verse123 · March 13, 2014, 4:52pm

This is not homework, actually. It's just an annoying problem I've run into and can't figure out. What I have tried is the following, but it is nowhere near useful in my opinion.

cat file | egrep -A 2 "%" | head -c15 > file.out

Yoda · March 13, 2014, 5:12pm

An awk solution:

awk '
        {
                if ( $1 ~ /^%/ )
                {
                        A[$1]++
                        i = $1
                }
                else
                        R[A[i],i] = $0
        }
        END {
                for ( k in A )
                {
                        print k
                        print substr(R[A[k],k],length(R[A[k],k])-8) substr(R[1,k],1,9)
                }
        }
' file

MadeInGermany · March 13, 2014, 5:28pm

Another one

awk '
$1~/^%/ {
  key=$1
  next
}
{
  if (key in s1) {
    s2[key]=substr($0,length($0)-8)
  } else {                     
    s1[key]=substr($0,1,9)
  }                  
}
END {               
  for (key in s1) {
    print key
    print s2[key] s1[key]
  }
}
' file

Scrutinizer · March 13, 2014, 6:40pm

Yet one more:

awk '
  /^%/{
    i=$1
    getline s
    A[i]=(i in A)?substr(s, length(s)-8) substr(A[i],1,9):s
  }
  END{
    for (i in A) {
      print i
      print A[i]
    }
  }
' file

verse123 · March 13, 2014, 9:15pm

Scrutinizer, yours seems to work only when there are multiple occurrences of the string ID and not in events when the string ID exists only once. It should still perform the same for those situations.

Scrutinizer · March 14, 2014, 12:41am

Yes I had interpreted it such that you wanted to leave labels that appear only once untouched, and that labels would appear once or twice.

Yoda's suggestion does something similar, but cuts off 8 characters at the end of single entries.

It seems to me MadeInGermany's approach is closest to what you are looking for.

verse123 · March 14, 2014, 2:24am

Your solution is actually what I was looking for, except that the lines for single-occurring %IDs stay as long as the input. So something like this would be the same in the output as in the input.

Scrutinizer · March 14, 2014, 3:41am

Could you post an input sample with also single and maybe triple entries (?) and what the output should look like?

verse123 · March 14, 2014, 4:14am

An example for single occurrence looks like this.
input:

%TESGO
dfkjsdfgogogocatcatdogtwelvetwentygogogo

output:

%TESGO
ntygogogodfkjsdfgo

for a triple occurrence (or even an occurrence of 4 or 5 etc..) it would be great if the %ID was modified to label what combination was being used:

input:

%TESGO
dfkjsdfgogogocatcatdogtwelvetwentygogogo
%TESGO
gogoCatDoggobye
%TESGO
byenowsoso

output:

%TESGO_A_C
yenowsosodfkjsdfgo
%TESGO_A_B
tDoggobyedfkjsdfgo

In all cases the last 9 characters of each output line come from the first occurrence of that ID.
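
One possible way to sketch the labelling described above, under several assumptions that are not stated in the thread: occurrences are lettered A, B, C, ... in input order (so at most 26 per ID), the suffix is only added when an ID occurs three or more times, and IDs are printed in order of first appearance. The function name combine_labeled is made up for illustration:

```shell
# Hypothetical helper; a sketch under the assumptions listed above.
combine_labeled() {
  awk '
    /^%/ {
      id = $1
      if (!(id in n)) order[++ids] = id        # remember first-appearance order
      k = ++n[id]                              # occurrence number of this ID
      if ((getline s) <= 0) s = ""             # string underneath the ID line
      if (k == 1) head[id] = substr(s, 1, 9)   # first 9 of the first occurrence
      tail[id, k] = substr(s, length(s) - 8)   # last 9 of this occurrence
    }
    END {
      letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"   # assumption: at most 26 occurrences
      for (x = 1; x <= ids; x++) {
        id = order[x]
        if (n[id] <= 2) {                      # single or double: plain ID
          print id
          print tail[id, n[id]] head[id]
          continue
        }
        for (k = n[id]; k >= 2; k--) {         # last occurrence first, as in the sample
          print id "_A_" substr(letters, k, 1)
          print tail[id, k] head[id]
        }
      }
    }
  ' "$@"
}
```

It can be called as combine_labeled file, or fed the input on stdin.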

Scrutinizer · March 14, 2014, 4:52am

This should work with single and double cases and if there are more than two, it takes the value of the last one, as mentioned in post #1:

awk '
  /^%/{
    getline s
    A[$1]=A[$1] s
  }                                           
  END{
    for (i in A) {
      print i
      print substr(A[i], length(A[i])-8) substr(A[i],1,9)
    }           
  }
' file

verse123 · March 22, 2014, 10:38pm

Assuming we have say 4 occurrences of the same ID and want to combine the end of ID 4 with the beginning of ID4, the beginning of ID 3, and the beginning of ID 2, but not with the beginning of ID1. How can this be done?

Basically, the combinations should look like this.

end of 4 with beginning of 4
            with beginning of 3
            with beginning of 2
end of 3 with beginning of 3
            with beginning of 2
end of 2 with beginning of 2
end of 1 with beginning of 1

input:

%dog
aaaaaaaaaaAAAAAAAAA
%dog
bbbbbbbbbbBBBBBBBBB
%dog
cccccccccccCCCCCCCCC
%dog
xxxxxxxxxxXXXXXXXXX

the output should look like this:

%dog
XXXXXXXXXxxxxxxxxx
%dog
XXXXXXXXXccccccccc
%dog
XXXXXXXXXbbbbbbbbb
%dog
CCCCCCCCCccccccccc
%dog
CCCCCCCCCbbbbbbbbb
%dog
BBBBBBBBBbbbbbbbbb
%dog
AAAAAAAAAaaaaaaaaa

I am trying to do something like this but not sure how to print all of these combinations

awk '{
      if ( $1 ~ /^%/ )
      {
                A[$1]++
                i = $1
      }
      else
                {for (j = 2; j <=i; j++)
                A = newA
                F=1
                G[F]=$1
                        }
                        END {
                                for (k in A)
                                {
                                        print k
                                        print substr(R[A[k],k], length(R[A[k],k])-8) substr(R[1,k],1,9)
                        }
                }}' file
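
A minimal sketch of one way to print the combinations listed above, assuming the rule is: the first occurrence pairs only with itself, and occurrence k (k >= 2) pairs with the beginnings of occurrences k down to 2. The function name combine_all and the array names are made up for illustration, and IDs are printed in order of first appearance:

```shell
# Hypothetical helper; a sketch of the pairing rule described above.
combine_all() {
  awk '
    /^%/ {
      id = $1
      if (!(id in n)) order[++ids] = id        # remember first-appearance order
      k = ++n[id]                              # occurrence number of this ID
      if ((getline s) <= 0) s = ""             # string underneath the ID line
      head[id, k] = substr(s, 1, 9)            # first 9 characters
      tail[id, k] = substr(s, length(s) - 8)   # last 9 characters
    }
    END {
      for (x = 1; x <= ids; x++) {
        id = order[x]
        for (k = n[id]; k >= 1; k--) {         # ends, last occurrence first
          low = (k == 1) ? 1 : 2               # occurrence 1 pairs only with itself
          for (m = k; m >= low; m--) {         # beginnings, from k down to low
            print id
            print tail[id, k] head[id, m]
          }
        }
      }
    }
  ' "$@"
}
```

Called as combine_all file, this prints one ID line and one combined line per pairing, in the same order as the list of combinations above.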