I have a file with 22 lines. Each line contains only 5 different chars, no white space, and each line is 3,278,824 chars long. The 5 chars are "-", "A", "B", "C", "D".
Below is an example of the first 25 chars of the first four lines of the file.
My desired output from the above example is:
(1) the number of alphabet characters that match fully across all lines: 9. These are "ABCD" at columns 6~9, "D" at column 14, and "BBBB" at columns 16~19, for a total of 9 fully matched chars. Note that "-" does not count.
(2) the fully matched alphabet characters: ABCDDBBBB
(3) for each line, an output file of its unmatched alphabet characters:
line1: ACBD
line2: ADCC
line3: AAC
line4: AADCD
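The three outputs listed above can be sketched in Python. This is only a minimal sketch, assuming all lines have equal length and that a column counts as "fully matched" when every line has the same letter there (never "-"). The names `compare_lines`, `run`, `input.txt` and `lineN.txt` are made up for illustration.

```python
def compare_lines(lines):
    """Return (count, matched, unmatched_per_line) for equal-length lines."""
    if len(set(map(len, lines))) != 1:
        raise ValueError("all lines must have the same length")
    # mask[i] is True when column i holds the same letter (not "-") on every line.
    mask = [col[0] != "-" and len(set(col)) == 1 for col in zip(*lines)]
    # Output (2): the fully matched letters, read off the first line.
    matched = "".join(ch for ch, m in zip(lines[0], mask) if m)
    # Output (3): per line, the letters in non-matching columns, skipping "-".
    unmatched = [
        "".join(ch for ch, m in zip(line, mask) if not m and ch != "-")
        for line in lines
    ]
    # Output (1) is simply the length of the matched string.
    return len(matched), matched, unmatched


def run(path="input.txt"):  # hypothetical input file name
    with open(path) as fh:
        lines = [line.rstrip("\n") for line in fh]
    count, matched, unmatched = compare_lines(lines)
    print(count)
    print(matched)
    # One output file per input line (hypothetical names line1.txt, line2.txt, ...).
    for i, text in enumerate(unmatched, start=1):
        with open(f"line{i}.txt", "w") as out:
            out.write(text + "\n")
```

On a tiny made-up input, `compare_lines(["A-BCA", "AABCD", "A-BC-"])` returns `(3, "ABC", ["A", "AD", ""])`. The mask is built once and each line is then filtered on its own, so only one unmatched string is being assembled at a time.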
The tools I can use include bash shell, awk, sed, python, perl, R, mysql, java, c etc. I just couldn't find a way to do it.
Please help, thanks in advance~!
Thanks!
My bad! The example up there is just an example. The matches actually follow no pattern at all: I don't know where they occur, how many there are, or how long each one is. Since each line is 3,278,824 in length...
22 lines of 3,278,824 chars to be compared char by char!? Wouldn't it be much easier if we could transpose that matrix (though I don't know how right now, off the top of my head)?
Yes, transposing it and using awk with NF=22, NR=3278824 is the direction of the solution I have in mind. Actually, pamu's solution transposes it with awk and also does the counting with awk!!! GREAT!
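For anyone who wants to see the transpose idea outside awk, here is a rough Python analogue. It is not pamu's code, just a sketch assuming the lines are already read into a list of equal-length strings; `transpose` and `count_full_matches` are hypothetical helper names.

```python
def transpose(lines):
    """Turn N lines of length M into M rows of N chars (one row per column)."""
    return ["".join(col) for col in zip(*lines)]


def count_full_matches(rows):
    """Count transposed rows made of one repeated letter, ignoring "-" rows."""
    return sum(1 for row in rows if row[0] != "-" and row == row[0] * len(row))


# Tiny made-up input: 2 lines of 3 chars become 3 rows of 2 chars.
rows = transpose(["AB-", "ABC"])
print(rows)
print(count_full_matches(rows))
```

With the real file each transposed row would hold 22 chars, which corresponds to the NF=22, NR=3278824 layout mentioned above.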
The sample zip file you provided is not a very good example. I managed to transpose it, but the lines seem identical at first sight. Could you provide a sample with 22 lines and, say, a few thousand chars per line?