Hello,
I have multiple text files and I need to know which of them have character issues.
The command below is not working. Maybe instead of that weird string, I should replace it with an ASCII code.
grep -A0 "�" file.txt
Thank you
Boris
That - or a similar - character is a placeholder for any non-printing character. Where and how did you find it? What are "character issues"? Also, it could represent a multi-byte character. Please post a hexdump of your data file.
This will work on Linux systems: the -P uses PCRE, the Perl regex library; it shows the line number (-n) and highlights the problem(s) (--color). It finds characters greater than 127 and so will not work on UTF-8, for example:
grep --color='auto' -P -n "[^\x00-\x7F]" myfile.txt
It always helps to include your OS and shell; this will not work on HP-UX, for example, and because you used -A I guessed.
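To check many files at once, the same pattern can be pointed at a set of files in one call; the sample filenames below are made up for illustration:

```shell
# Two throwaway sample files: one pure ASCII, one containing UTF-8 "café"
printf 'plain ascii line\n' > good.txt
printf 'caf\303\251 line\n' > bad.txt

# -H prefixes each hit with the filename, -n adds the line number,
# -P enables the \x hex escapes in the bracket expression;
# only bad.txt is reported
grep -PHn '[^\x00-\x7F]' good.txt bad.txt
```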
Edit: Rudi beat me to it.
Hello Rudic and Jim,
It is a SubRip file and Jim's answer is very helpful for my case.
Marked as solved.
Thank you!
Boris
Be aware that the above will also match / identify / eliminate locale characters, e.g. äöüÄÖÜß in the German language.
Hello,
I am back again with the same question.
I am able to detect whether a file has U+FFFD inside, but I do not know which files have got this issue.
I run:
printf '%b' "$(printf '\\U%x' {128..131})" | grep -oP "[^\x00-\x7F]"
output:
�
�
�
�
How may I find it?
PS:
printf '%b' "$(printf '\\U%x' {128..131})" | grep -HoP "[^\x00-\x7F]"
gives the output below:
(standard input):�
(standard input):�
(standard input):�
(standard input):�
printf '%b' "$(printf '\\U%x' {128..131})" | grep -loP "[^\x00-\x7F]"
gives only one line of output:
(standard input):�
Thank you
Boris
Not quite sure I understand what failed. grep's option -H gives the filename for every pattern occurrence; -l prints any matching filename just once, which would satisfy your request: identify all files containing non-ASCII characters.
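A tiny demonstration of the difference between the two options, with made-up filenames:

```shell
# multi.srt holds a non-ASCII character; clean.srt is pure ASCII
printf 'na\303\257ve\n'     > multi.srt
printf 'all ascii here\n'   > clean.srt

# -H: every match is printed, prefixed with its filename
grep -oPH '[^\x00-\x7F]' multi.srt clean.srt

# -l: each file containing at least one match is named exactly once
grep -lP '[^\x00-\x7F]' multi.srt clean.srt    # prints: multi.srt
```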
Thank you Rudic,
This way gives an invalid range end error (without -P, grep does not interpret the \x hex escapes):
printf '%b' "$(printf '\\U%x' {128..131})" | grep -l "[^\x00-\x7F]"
Somehow, I am printing the filenames now, but the info is not correct.
sniff.sh
for file in *.srt
do
printf '%b' "$(printf '\\U%x' {128..131})" $file '\n'
done
Output:
����1.hr.srt
,����JohnnyEnglishStrikesAgain2018.el.srt
����JohnnyEnglishStrikesAgain2018.en.srt
Normally there is no � inside *.en.srt.
The answer seems to be related to another case. Maybe the files should be converted to UTF-16 encoding prior to running this script.
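For what it's worth, the loop in sniff.sh never actually searches the files: printf reuses its '%b' format string for each extra argument, so it just prints the probe characters followed by the filename. A sketch of what the loop was presumably meant to do:

```shell
# Demo files (names are made up for illustration)
printf 'caf\303\251\n'  > sample1.srt   # contains non-ASCII
printf 'ascii only\n'   > sample2.srt

# List every *.srt file that contains at least one non-ASCII character;
# grep -q searches quietly and succeeds on the first match.
for file in *.srt
do
    if grep -qP '[^\x00-\x7F]' "$file"
    then
        printf '%s\n' "$file"
    fi
done
```

The loop is equivalent to the single call grep -lP '[^\x00-\x7F]' *.srt.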
I am closing this thread as solved.
Kind regards
Boris
Try
printf '%b' "$(printf '\\U%x' {128..131})" | LC_ALL=C grep -lo '[^\x00-\x7F]'
(standard input)
which should be exactly what you need...?
That command gives only (standard input) as output.
In my understanding, all the commands I have tested so far only print all the srt files but do not search for the related character string. Then I supposed the reason was not having other language options on my computer, and added other languages (e.g. Greek) with the locale-gen command.
This issue is not related to keyboard language settings.
By using the iconv command, I also checked whether the original file size and the converted file size were different. If I had found any difference, I would have thought of adding a size-comparison function inside the script, but no luck.
PS: I have also compared the character length at each line.
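A possibly simpler check than comparing sizes: iconv itself rejects invalid input, so converting UTF-8 to UTF-8 and watching the exit status flags broken files directly. A sketch, with made-up filenames:

```shell
# ok.srt is valid UTF-8; broken.srt has 0xC3 with no continuation byte
printf 'good: caf\303\251\n' > ok.srt
printf 'bad: \303(\n'        > broken.srt

# iconv exits non-zero when the input contains an invalid byte sequence
for f in ok.srt broken.srt
do
    if ! iconv -f UTF-8 -t UTF-8 "$f" > /dev/null 2>&1
    then
        printf '%s: invalid UTF-8\n' "$f"
    fi
done
```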
Kind regards
Boris
Hello,
Solved permanently ... The trick is du -b, not du -sh.
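The likely reason this works: du -sh rounds to human-readable units (and counts whole disk blocks), so two files that differ by a few bytes can show the same size, while GNU du -b reports the exact byte count:

```shell
printf 'abc\n'  > a.txt    # 4 bytes
printf 'abcd\n' > b.txt    # 5 bytes

du -sh a.txt b.txt    # both typically show the same rounded block size
du -b  a.txt b.txt    # exact apparent sizes: 4 and 5 bytes
```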
Thank you
Boris