How to grep � symbol?

Hello,
I have multiple text files and I need to know which of them have character issues.
The command below is not working. Maybe I should replace that weird character with its ASCII code.

grep -A0 "�" file.txt

Thank you
Boris

That character, or a similar one, is a placeholder for any non-printing character. Where and how did you find it? What are "character issues"? It could also stand for a multi-byte character. Please post a hexdump of your data file.
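
In case it helps with the hexdump request, here is a minimal sketch using od, which is part of POSIX. The function name show_bytes is only illustrative. If the file really contains U+FFFD stored as UTF-8, it shows up as the three bytes ef bf bd; a lone byte such as e4 or fc would instead suggest Latin-1 text being read as UTF-8.

```shell
# Print the first 64 bytes of a file as hex.
# -An drops the offset column, -tx1 prints one byte per column, -N limits the length.
show_bytes() {
    od -An -tx1 -N 64 "$1"
}
# Usage: show_bytes file.txt
```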

This will work on Linux systems.
-P uses PCRE, the Perl regex library; -n shows the line number; --color highlights the match(es). The pattern finds every character above 127, so it is of little use on UTF-8 text, for example, where legitimate non-ASCII characters will match too.

grep --color='auto' -P -n "[^\x00-\x7F]"  myfile.txt

It always helps to include your OS and shell; this will not work on HP-UX, for example. Because you used -A, I guessed you are on Linux.

Edit: Rudi beat me to it.


Hello Rudic and Jim,
It is a SubRip (.srt) file, and Jim's answer is very helpful for my case.
Marked as solved.

Thank you!
Boris

Be aware that the above will also match (and so flag or eliminate) locale characters, e.g. äöüÄÖÜß in German.
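
If the goal is only the replacement character itself, a narrower search avoids flagging äöü and friends. This is a sketch under two assumptions: the files are UTF-8, and they literally contain U+FFFD (bytes EF BF BD). If � is merely how a viewer renders some invalid byte, there is nothing for this pattern to match. The name grep_fffd is made up for the example.

```shell
# U+FFFD encoded as UTF-8 is the three bytes EF BF BD (octal 357 277 275).
fffd=$(printf '\357\277\275')

# Print "line-number:line" for every line of the given file that contains U+FFFD.
# LC_ALL=C makes grep compare raw bytes, so this works whatever the current locale is.
grep_fffd() {
    LC_ALL=C grep -n -- "$fffd" "$@"
}
# Usage: grep_fffd file.txt
```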

Hello,
I am back again with the same question.
I am able to detect whether U+FFFD occurs somewhere in the files, but I do not know which files have the issue.

I run:

printf '%b' "$(printf '\\U%x' {128..131})" | grep -oP "[^\x00-\x7F]"

output:

�
�
�
�

How can I find which files are affected?
PS:

printf '%b' "$(printf '\\U%x' {128..131})" | grep -HoP "[^\x00-\x7F]"

gives the output below:

(standard input):�
(standard input):�
(standard input):�
(standard input):�

printf '%b' "$(printf '\\U%x' {128..131})" | grep -loP "[^\x00-\x7F]"

gives only one line of output:

(standard input):�

thank you
Boris

Not quite sure I understand what failed. grep's option -H prefixes every match with its filename, while -l prints each matching filename just once, which would satisfy your request: identify all files containing non-ASCII characters.
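
The reason both options print (standard input) in the tests above is that grep is reading the piped printf output, not the files. A minimal sketch with the files passed as arguments instead, reusing the -P pattern from earlier in the thread (so it assumes a GNU grep built with PCRE support). list_non_ascii is only an illustrative name.

```shell
# Print the name of each given file that contains at least one non-ASCII character.
# -l stops at the first match in a file and prints just the file name.
list_non_ascii() {
    grep -lP '[^\x00-\x7F]' -- "$@"
}
# Usage: list_non_ascii *.srt
```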

Thank you Rudic,
Run this way, it gives an "invalid range end" error:

printf '%b' "$(printf '\\U%x' {128..131})" | grep -l "[^\x00-\x7F]"

Somehow, I am printing the filenames now, but the information is not correct.
sniff.sh

for file in *.srt
do
printf '%b' "$(printf '\\U%x' {128..131})" $file '\n'
done

Output:

����1.hr.srt
,����JohnnyEnglishStrikesAgain2018.el.srt
����JohnnyEnglishStrikesAgain2018.en.srt

Normally there is no � inside *.en.srt.

The answer seems to be related to another case. Maybe the files should be converted to UTF-16 encoding before running this script.

I am closing this thread as solved.

Kind regards
Boris
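
A note on the sniff.sh loop above: it never opens the files. printf simply prints the four test characters and then each file name, which is why every name comes out prefixed with ����. A rough sketch of a loop that actually inspects each file, assuming the files are UTF-8 and the thing to find is a literal U+FFFD; count_fffd_lines is a hypothetical name.

```shell
# U+FFFD encoded as UTF-8 is the three bytes EF BF BD (octal 357 277 275).
fffd=$(printf '\357\277\275')

# For each given file, print "name: N", where N is the number of lines
# containing U+FFFD. Files without any such line are skipped.
count_fffd_lines() {
    for file in "$@"; do
        # grep -c prints the count and exits non-zero when the count is 0
        # (or the file cannot be read), in which case we move on.
        n=$(LC_ALL=C grep -c -- "$fffd" "$file") || continue
        printf '%s: %s\n' "$file" "$n"
    done
}
# Usage: count_fffd_lines *.srt
```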

Try

printf '%b' "$(printf '\\U%x' {128..131})" | LC_ALL=C grep -lo '[^\x00-\x7F]'
(standard input)

which should be exactly what you need...?
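
One caveat worth checking here: without -P, grep takes a backslash inside a bracket expression literally, so '[^\x00-\x7F]' is not a hex range at all. In the C locale it is accepted, but it then matches ordinary ASCII letters as well, which may explain why every file ends up listed. A sketch of an alternative that needs no PCRE; list_non_ascii_bytes is an illustrative name, and the class trick relies on the C locale.

```shell
# Demonstration: a pure-ASCII line still matches the non-PCRE pattern,
# so grep -c reports a non-zero count here.
printf 'abc\n' | LC_ALL=C grep -c '[^\x00-\x7F]'

# In the C locale, [:cntrl:] and [:print:] together cover exactly the ASCII
# range, so the negated class matches only bytes >= 0x80.
list_non_ascii_bytes() {
    LC_ALL=C grep -l -- '[^[:cntrl:][:print:]]' "$@"
}
# Usage: list_non_ascii_bytes *.srt
```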

That command only prints "(standard input)".
As far as I can tell, all the commands I have tested so far just print every .srt file and do not search for the character in question. I then supposed the reason was that other languages were not installed on my computer, so I added some (for example, Greek) with the locale-gen command.
The issue is not related to keyboard language settings.

Using the iconv command, I also checked whether the original file size and the converted file size were different. Had I found any difference, I would have added a size comparison function to the script, but no luck.
PS: I have also compared the character length of each line.

Kind regards
Boris

Hello,
Solved permanently. The trick is du -b, not du -sh.

Thank you
Boris
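
For anyone landing here later: du -sh reports disk usage rounded to human-readable units, so a difference of a few bytes is invisible, whereas GNU du -b reports the exact size in bytes. Below is a rough sketch of the size comparison described above, not necessarily the script that was actually used. It swaps in wc -c for du -b to stay portable and assumes the files are meant to be UTF-8. Note that it only catches byte sequences that are invalid UTF-8; a literal U+FFFD is valid and survives the round trip. has_invalid_utf8 is a made-up name.

```shell
# Succeeds (exit status 0) if the file loses bytes when passed through iconv.
# iconv -c drops any sequence that is invalid in the source encoding, so a
# smaller byte count afterwards means the file contained invalid bytes.
has_invalid_utf8() {
    before=$(wc -c < "$1")
    after=$(iconv -c -f UTF-8 -t UTF-8 < "$1" 2>/dev/null | wc -c)
    [ "$before" -ne "$after" ]
}
# Usage: for f in *.srt; do has_invalid_utf8 "$f" && printf '%s\n' "$f"; done
```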