Find the characters that match across all lines

I have a file with 22 lines. Each line contains only 5 different chars, no white space, and is 3,278,824 chars long. The 5 chars are "-", "A", "B", "C", "D".
Below is an example of the first 25 chars of the first four lines of the file.

-----ABCDA--CD-BBBBB----D
--A--ABCD--DCD-BBBBC-----
A-A--ABCD---CD-BBBB------
--A--ABCDA-D-D-BBBBC----D

My desired output from the above example is:
(1) the number of fully matched alphabet characters, i.e. characters that are identical in the same column on all lines: 9. These are "ABCD" at columns 6~9, "D" at column 14, and "BBBB" at columns 16~19, for a total of 9 fully matched chars. Note that "-" does not count.
(2) the fully matched alphabet characters: ABCDDBBBB
(3) for each line, an output file of its alphabet characters that are not fully matched:
line1: ACBD
line2: ADCC
line3: AAC
line4: AADCD

The tools I could use include bash shell, awk, sed, python, perl, R, mysql, java, c etc. I just couldn't find a way to do it. :wall:
Please help, thanks in advance~!

I think you can do it with awk?
For 1):

awk '
/^.........pattern/ {n+=1}
/^.....pattern/ {m+=1}
END {print n ; print m}
' file 

Each "." skips one column, so 6 "." put the pattern at the 7th column.
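
The same fixed-column test can be sketched in Python, which the original poster listed as available. This is only a minimal illustration: the function name count_at_column and the 1-based column convention are assumptions, and like the awk version it only helps when the pattern and its column are already known.

```python
def count_at_column(lines, pattern, column):
    """Count the lines in which `pattern` starts exactly at 1-based `column`."""
    start = column - 1  # Python strings are 0-based
    return sum(1 for line in lines if line.startswith(pattern, start))


# The four 25-char example lines from the first post.
example = [
    "-----ABCDA--CD-BBBBB----D",
    "--A--ABCD--DCD-BBBBC-----",
    "A-A--ABCD---CD-BBBB------",
    "--A--ABCDA-D-D-BBBBC----D",
]

print(count_at_column(example, "ABCD", 6))    # 4
print(count_at_column(example, "BBBBC", 16))  # 2
```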


For your second requirement...

Assuming you have the same number of characters on every line...

Try this:

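sed -e 's/.\{1\}/& /g' file | awk '
{ for (i = 1; i <= NF; i++) a[NR, i] = $i; max = NF; maN = NR }   # store every char by (line, column)
END {
  for (i = 1; i <= max; i++) {            # walk the columns
    k = 0; p = ""
    for (j = 1; j <= maN; j++) {          # compare this column across all lines
      if (p) { if (p != a[j, i]) k = 1 } else p = a[j, i]
    }
    if (k != 1 && p != "-") print p       # same non-"-" char on every line
  }
}'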
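
For comparison, here is a hedged Python sketch of the same column-by-column check that avoids keeping a two-dimensional array in memory: it holds only one running "consensus" row and overwrites a position with "-" as soon as any line disagrees there. The function name fully_matched is made up for illustration, and the sketch assumes, as above, that all lines have the same length.

```python
def fully_matched(lines):
    """Return the letters that are identical in the same column on every line.

    Assumes all lines have equal length; "-" never counts as a match.
    """
    it = iter(lines)
    consensus = list(next(it).rstrip("\n"))  # start from the first line
    for line in it:
        for i, ch in enumerate(line.rstrip("\n")):
            if consensus[i] != ch:
                consensus[i] = "-"           # column is no longer a full match
    return "".join(ch for ch in consensus if ch != "-")


example = [
    "-----ABCDA--CD-BBBBB----D",
    "--A--ABCD--DCD-BBBBC-----",
    "A-A--ABCD---CD-BBBB------",
    "--A--ABCDA-D-D-BBBBC----D",
]

matched = fully_matched(example)
print(len(matched), matched)  # 9 ABCDDBBBB
```

Because the result is built as a single string, no extra step is needed to join the output onto one line. Passing an open file object instead of a list should also work, since the function only iterates over lines.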

Thanks!
My bad! The example above is just an illustration. The matching actually follows no pattern at all: I don't know where the matches occur, how many of them there are, or how long each one is. And each line is 3,278,824 chars long...

Please provide some extra info about your input file.

Thanks, your code works on my example. But it outputs the matching chars one per line. I need to add something like

 | tr -d '\n' 

to remove the newlines and get everything printed on one line. Thanks, I need to test it on my real file. Should work. Thanks.


The attachment contains the first 8192 chars of the first 4 lines. I just need to test the script you provided through and through.

22 lines of 3,278,824 chars to be compared char by char!? Wouldn't it be much easier if we could transpose that matrix (though I don't know how right now, off the top of my head)?


Yes, transposing it and using awk with NF=22, NR=3278824 is the direction I have in mind for a solution. Actually, pamu's solution does the transposing in awk and also does the counting in awk!!! GREAT!
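
To make the transposition idea concrete, here is a small Python sketch (Python being one of the tools listed in the first post). The built-in zip(*rows) yields one tuple per column, which plays the role of one row of the transposed matrix with 22 fields. The name matched_columns and the choice to report 1-based column numbers are assumptions for illustration only.

```python
def matched_columns(lines):
    """Return (column, char) pairs for columns whose char is the same
    non-"-" letter on every line. Columns are numbered from 1."""
    rows = [line.rstrip("\n") for line in lines]
    result = []
    # zip(*rows) walks the data column by column, i.e. the transposed matrix.
    for number, column in enumerate(zip(*rows), start=1):
        if column[0] != "-" and len(set(column)) == 1:
            result.append((number, column[0]))
    return result


example = [
    "-----ABCDA--CD-BBBBB----D",
    "--A--ABCD--DCD-BBBBC-----",
    "A-A--ABCD---CD-BBBB------",
    "--A--ABCDA-D-D-BBBBC----D",
]

print([n for n, _ in matched_columns(example)])
# [6, 7, 8, 9, 14, 16, 17, 18, 19]
```

Note that zip stops at the shortest row, so lines of unequal length would be silently truncated rather than reported.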

The sample zip file you provided is not a very good example. I managed to transpose it, but the lines seem identical at first sight. Could you provide a sample with 22 lines and, say, a few thousand chars per line?

Thank you, Rudi.

I think Pamu's solution works. (Thank you Pamu.)

This works (on linux with bash and GNU tools!) for your four-line, 25-char example from post #1:

$ cat sedfile
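# s/\n$// drops the newline after the last char, so no blank row is written
1 {s/\(.\)/\1\n/g;s/\n$//;w1.tmp
  }
2 {s/\(.\)/\1\n/g;s/\n$//;w2.tmp
  }
3 {s/\(.\)/\1\n/g;s/\n$//;w3.tmp
  }
4 {s/\(.\)/\1\n/g;s/\n$//;w4.tmp
  }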
$ sed -nf sedfile infile
$ paste -d" " ?.tmp >filetransposed
$ awk  '{split ($0, b)
          L=1
          for (i=1;i<NF;i++) {L=L && b[i]==b[i+1] && b[i]!="-"; if (!L) break}
         }
         L {n++; printf "%s", b[1]}
         !L {for (i=1;i<=NF;i++) if (b[i] != "-") printf "%s",b[i] >("line" i)}
         END {print "\t",n; for (i=1;i<=NF;i++) print "" > ("line" i)}
        ' filetransposed
ABCDDBBBB     9
$ cat line?
ACBD
ADCC
AAC
AADCD
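
For readers who prefer Python, requirement 3) can be sketched along the same lines: walk the columns, skip those that are a full match, and append every remaining letter to the buffer of the line it came from. The function name unmatched_per_line is hypothetical, and the sketch returns strings instead of writing the line1, line2, ... files, which keeps it easy to check.

```python
def unmatched_per_line(lines):
    """For each line, collect its letters from columns that are NOT a full match.

    A column is a full match when every line has the same non-"-" letter there.
    """
    rows = [line.rstrip("\n") for line in lines]
    leftovers = [[] for _ in rows]
    for column in zip(*rows):
        if column[0] != "-" and len(set(column)) == 1:
            continue                      # fully matched column: skip it
        for k, ch in enumerate(column):
            if ch != "-":
                leftovers[k].append(ch)   # keep the letter for line k+1
    return ["".join(chars) for chars in leftovers]


example = [
    "-----ABCDA--CD-BBBBB----D",
    "--A--ABCD--DCD-BBBBC-----",
    "A-A--ABCD---CD-BBBB------",
    "--A--ABCDA-D-D-BBBBC----D",
]

for number, text in enumerate(unmatched_per_line(example), start=1):
    print(f"line{number}: {text}")
# line1: ACBD
# line2: ADCC
# line3: AAC
# line4: AADCD
```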

---------- Post updated at 06:04 PM ---------- Previous update was at 05:58 PM ----------

OK, but does it meet your requirement 3)?


Thanks, Rudi. I got all my solutions now. Thank you!