How to remove words that contain 3+ of the same character in a row?

colinireland · August 9, 2012, 10:36am

Hello,

I am looking for a way to remove words from a list that contain 3 or more of the same character.

For example, let's say the full list is as follows:

ABCDEF
ABBHJK
AAAHJD
KKPPPP
NAUJKS

AAAHJD & KKPPPP should be removed from this list as obviously they contain AAA and PPPP respectively.

My first attempt at this was to use

grep -v '\([[:alpha:]]\)\1' filename

but this removes words with 2 or more of the same character in a row.

grep -v '\([[:alpha:]][[:alpha:]]\)\1' filename will remove 4+

My knowledge of Awk/Sed is quite weak. Can anyone lend some advice as to where I should look from here?

Regards,
Colin

Don_Cragun · August 9, 2012, 11:00am

You almost had it the first time. Try:

grep -v '\([[:alpha:]]\)\1\1' filename
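
As a quick check, here is a minimal sketch that runs the command above against the sample list from the first post. The temporary file and the variable names (tmp, result) are only for illustration.

```shell
# Write the sample list from the first post to a temporary file.
tmp=$(mktemp)
printf '%s\n' ABCDEF ABBHJK AAAHJD KKPPPP NAUJKS > "$tmp"

# \1\1 asks for the captured letter two more times, i.e. three in a row.
result=$(grep -v '\([[:alpha:]]\)\1\1' "$tmp")
printf '%s\n' "$result"
rm -f "$tmp"
```

With that input, only ABCDEF, ABBHJK and NAUJKS should be printed.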

elixir_sinari · August 9, 2012, 11:00am

If you need to remove lines with 3 or more occurrences of a character NOT in succession, try

awk '{p=1;for(i=1;i<=length;i++) if(gsub(substr($0,i,1),"&")>=3) {p=0;break}}p' file

This will also remove lines with 3 or more occurrences of a character in succession.

Don_Cragun · August 9, 2012, 11:13am

Ambiguous request. The reply I posted assumed you want to delete lines with three adjacent occurrences of a character. The reply elixir_sinari posted assumed you want to delete any line with three occurrences of a character whether or not they are adjacent. The input you gave will give the same results for either interpretation. What was it that you wanted?
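
To make the difference concrete, here is a small sketch using made-up input (not from the original post) where the two interpretations disagree. ABACAD has three A's that are not adjacent; AAABCD has three that are.

```shell
# Hypothetical test lines chosen so the two interpretations differ.
lines=$(printf '%s\n' ABACAD AAABCD ABCDEF)

# Interpretation 1: drop lines with three adjacent occurrences of a letter.
adjacent=$(printf '%s\n' "$lines" | grep -v '\([[:alpha:]]\)\1\1')

# Interpretation 2: drop lines with three occurrences anywhere on the line
# (the awk one-liner posted above, reading from standard input).
anywhere=$(printf '%s\n' "$lines" |
    awk '{p=1;for(i=1;i<=length;i++) if(gsub(substr($0,i,1),"&")>=3) {p=0;break}}p')

printf 'adjacent only:\n%s\n' "$adjacent"
printf 'anywhere:\n%s\n' "$anywhere"
```

The first filter keeps ABACAD and ABCDEF; the second keeps only ABCDEF.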

alister · August 9, 2012, 11:27am

elixir_sinari:

If you need to remove lines with 3 or more occurrences of a character NOT in succession, try
awk '{p=1;for(i=1;i<=length;i++) if(gsub(substr($0,i,1),"&")>=3) {p=0;break}}p' file
This will also remove lines with 3 or more occurrences of a character in succession.

That approach isn't very robust. The first argument to gsub is an extended regular expression. If the line contains a . , it will match every character. If there's a ? , + , * , or some other metacharacter, there may be a runtime regular expression compilation failure.

What you're attempting can be done easily with grep and a single regular expression:

grep -v '\(.\).*\1.*\1' file
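
If awk is still preferred, one way to sidestep the regular expression problem is to count characters in an array instead of passing them to gsub. This is only a sketch; the function name count_filter and the test lines are made up for illustration.

```shell
# Print only the lines in which no single character occurs 3 or more times.
# Characters are compared literally, so ., ?, + and * are not special here.
count_filter() {
    awk '{
        split("", seen)                  # clear the per-line counts
        keep = 1
        for (i = 1; i <= length($0); i++) {
            c = substr($0, i, 1)
            if (++seen[c] >= 3) { keep = 0; break }
        }
        if (keep) print
    }' "$@"
}

# A.B.C has only two dots and should survive; a.b.c. has three and should not.
printf '%s\n' 'A.B.C' 'ABCDEF' 'a.b.c.' | count_filter
```

With the gsub version, A.B.C would be dropped as well, because the . is treated as a regular expression that matches all five characters on the line.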

Regards,
Alister

elixir_sinari · August 9, 2012, 10:49pm

alister:

That approach isn't very robust. The first argument to gsub is an extended regular expression. If the line contains a . , it will match every character. If there's a ? , + , * , or some other metacharacter, there may be a runtime regular expression compilation failure.

What you're attempting can be done easily with grep and a single regular expression:
grep -v '\(.\).*\1.*\1' file
Regards,
Alister

I did foresee that possibility while writing the solution, but I assumed that the file would contain only alphabetic characters.

alister · August 9, 2012, 10:53pm

Looking at the first post, that seems a reasonable assumption given the sample data and the use of the [:alpha:] class.

I'll leave my post as is just in case it's of any use (as I'm sure you know, sometimes the sample data isn't representative).

Regards,
Alister

elixir_sinari · August 9, 2012, 10:57pm

I agree with you, alister. That grep with backrefs is much better suited.