Find the characters that match across all lines

cwzkevin · September 24, 2012, 12:44am

I have a file with 22 lines. Each line contains only 5 different characters, has no white space, and is 3,278,824 characters long. The 5 characters are "-", "A", "B", "C", "D".
Below are the first 25 characters of the first four lines of the file, as an example.

-----ABCDA--CD-BBBBB----D
--A--ABCD--DCD-BBBBC-----
A-A--ABCD---CD-BBBB------
--A--ABCDA-D-D-BBBBC----D

My desired output from the above example is:
(1) The number of alphabet characters that match across all lines: 9. These are "ABCD" at columns 6~9, "D" at column 14, and "BBBB" at columns 16~19, for a total of 9 fully matched characters. Note that "-" does not count.
(2) The fully matched alphabet characters: ABCDDBBBB
(3) For each line, an output file of its unmatched alphabet characters:
line1: ACBD
line2: ADCC
line3: AAC
line4: AADCD

The tools I could use include bash, awk, sed, Python, Perl, R, MySQL, Java, C, etc. I just couldn't find a way to do it.
Please help, thanks in advance!

delugeag · September 24, 2012, 2:00am

I think you can do it with awk?
For (1):

awk '
/^.........pattern/ {n+=1}
/^.....pattern/ {m+=1}
END {print n ; print m}
' file

Each "." skips one column, so six "." put the pattern at the 7th column.

pamu · September 24, 2012, 3:07am

For your second requirement..

assuming you have same number of characters per line..

try this..

sed -e 's/.\{1\}/& /g' file | awk '{ for(i=1;i<=NF;i++){a[NR,i]=$i;max=NF}maN=NR;}END{
for (i=1;i<=max;i++){
k=0;p="";
for (j=1;j<=maN;j++){
if(p){if(p != a[j,i]){k=1}}else{p=a[j,i];};
}
if(k != 1 && p != "-"){ print p;}
}
}'

cwzkevin · September 24, 2012, 8:49am

Thanks!
My bad! The example I gave above is only an example. The matches actually follow no pattern at all: I don't know where they occur, how many there are, or how long each one is. And each line is 3,278,824 characters long...

pamu · September 24, 2012, 9:16am

Please provide some extra info about your input file.

cwzkevin · September 24, 2012, 11:14am

Thanks, pamu, your code works on my example, but it prints the matching characters one per line. I need to append something like

 | tr -d '\n'

to remove the newlines and get everything printed on one line. I still need to test it on my real file, but it should work. Thanks.
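
For comparison, the same column-by-column check can be sketched in Python, which joins the matched characters onto one line without a separate `tr` step. This is only an illustrative sketch, not pamu's code: the function name `fully_matched` is made up here, and it assumes every line has the same length.

```python
def fully_matched(lines):
    """Return the characters that are identical in every line, ignoring '-'."""
    matched = []
    # zip(*lines) yields one tuple per column, holding that column's
    # character from every line.
    for column in zip(*lines):
        first = column[0]
        if first != "-" and all(ch == first for ch in column):
            matched.append(first)
    return "".join(matched)


# The four sample lines from post #1.
sample = [
    "-----ABCDA--CD-BBBBB----D",
    "--A--ABCD--DCD-BBBBC-----",
    "A-A--ABCD---CD-BBBB------",
    "--A--ABCDA-D-D-BBBBC----D",
]
result = fully_matched(sample)
print(len(result), result)  # prints: 9 ABCDDBBBB
```

For the real file, the list could be built with something like `[line.rstrip("\n") for line in open("file")]`, where `file` stands for the actual input name.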

cwzkevin · September 24, 2012, 12:03pm

Attached are the first 8192 chars of the first 4 lines. I just need to test the script you provided thoroughly.

RudiC · September 24, 2012, 1:08pm

22 lines of 3,278,824 chars to be compared char by char!? Wouldn't it be much easier if we could transpose that matrix (though I don't know how right now, off the top of my head)?

cwzkevin · September 24, 2012, 1:36pm

Yes, transposing it and using awk with NF=22, NR=3278824 is the direction I had in mind. Actually, pamu's solution effectively does the transposition in awk and does the counting in awk as well. GREAT!
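
To make the transposition idea concrete, here is a small Python sketch of what the transposed matrix looks like: one row per original column, one space-separated field per original line. The helper name `transpose` and the tiny three-line input are made up for illustration and are not from the real file.

```python
def transpose(lines):
    """Turn N lines of M characters into M rows of N space-separated fields."""
    return [" ".join(column) for column in zip(*lines)]


# Hypothetical mini input: 3 lines of 4 characters each.
rows = transpose(["-A-B", "-AC-", "AA-B"])
for row in rows:
    print(row)
# prints:
# - - A
# A A A
# - C -
# B - B
```

With the real data this would give 3,278,824 rows of 22 fields each, which is the shape awk can then process one record at a time.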

RudiC · September 25, 2012, 11:10am

The sample zip file you provided is not a very good example. I managed to transpose it, but the lines seem identical at first sight. Could you provide a sample with 22 lines and, say, a few thousand chars per line?

cwzkevin · September 25, 2012, 11:40am

Thank you, Rudi.

I think Pamu's solution works. (Thank you Pamu.)

RudiC · September 25, 2012, 12:04pm

This works (on Linux with bash and GNU tools!) for your four-line, 25-char example from post #1:

$ cat sedfile
1 {s/\(.\)/\1\n/g;w1.tmp
  }
2 {s/\(.\)/\1\n/g;w2.tmp
  }
3 {s/\(.\)/\1\n/g;w3.tmp
  }
4 {s/\(.\)/\1\n/g;w4.tmp
  }
$ sed -nf sedfile infile
$ paste -d" " ?.tmp >filetransposed
$ awk  '!NF {next}                # skip the empty last row left by the sed split
         {split ($0, b); nf=NF     # b[i] = char of line i in this column
          L=1
          for (i=1;i<NF;i++) {L=L && b[i]==b[i+1] && b[i]!="-"; if (!L) break}
         }
         L {n++; printf "%s", b[1]}
         !L {for (i=1;i<=NF;i++) if (b[i] != "-") printf "%s",b[i] >"line"i}
         END {print "\t",n; for (i=1;i<=nf;i++) print "" > "line"i}
        ' filetransposed
ABCDDBBBB     9
$ cat line?
ACBD
ADCC
AAC
AADCD

OK, but does it meet your requirement 3)?
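
For checking requirement (3) independently of the sed/awk pipeline, a Python sketch along the following lines could be used. It is an illustration only: the function name `unmatched_per_line` is hypothetical, it assumes equal-length lines, and it returns the strings instead of writing the per-line files.

```python
def unmatched_per_line(lines):
    """For each line, collect its non-'-' characters from columns
    that are not fully matched across all lines."""
    leftovers = [[] for _ in lines]
    for column in zip(*lines):
        first = column[0]
        if first != "-" and all(ch == first for ch in column):
            continue  # fully matched column: belongs to requirement (2)
        for row, ch in enumerate(column):
            if ch != "-":
                leftovers[row].append(ch)
    return ["".join(chars) for chars in leftovers]


# The four sample lines from post #1.
sample = [
    "-----ABCDA--CD-BBBBB----D",
    "--A--ABCD--DCD-BBBBC-----",
    "A-A--ABCD---CD-BBBB------",
    "--A--ABCDA-D-D-BBBBC----D",
]
for number, text in enumerate(unmatched_per_line(sample), start=1):
    print(f"line{number}: {text}")
# prints:
# line1: ACBD
# line2: ADCC
# line3: AAC
# line4: AADCD
```

Each returned string could then be written to its own file if separate output files are needed, as in the `line1` ... `line4` files above.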

cwzkevin · September 27, 2012, 6:01pm

Thanks, Rudi. I got all my solutions now. Thank you!
