Filter ONLY lines with non-printing characters

JSKOBS · December 7, 2016, 2:48am

I have a file that contains data with non-printing characters. I have used

cat -v filename

to display the whole file with the non-printing characters made visible.
However, I need the lines that contain non-printing characters written to a separate file. My file is huge, and it looks like I would have to find those lines manually using

cat -v filename | more

Any help on how to implement a script for this?

RudiC · December 7, 2016, 3:56am

How about

grep '[^[:print:]]' filename

?

Don_Cragun · December 7, 2016, 5:21am

You haven't really described what you consider non-printing characters. In most locales, RudiC's suggestion will select any line that contains a character that is not <space> and is not in class alpha, digit, or punct (not counting the line-terminating <newline> character). For many text files, you probably don't want a line selected just because it contains a <tab> character. If that is true in your case, you might try this slight modification to RudiC's suggestion:

grep '[^[:print:][:blank:]]' filename
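
Since the goal was to get the matching lines into a separate file, here is a minimal sketch of how that could look. The file names sample.txt, flagged.txt and clean.txt are made up for the demonstration, and forcing the C locale is my assumption to keep the character classes predictable. Be aware that in the C locale every byte above 0x7F also counts as non-printing, so drop LC_ALL=C if the file legitimately holds text in your own locale's encoding.

```shell
# Build a small sample file (hypothetical name): line 2 holds a BEL (\007),
# line 3 holds only a <tab>, line 4 holds an ESC (\033).
printf 'plain text\nbell\007here\ntab\there\nesc\033here\n' > sample.txt

# Lines containing a non-printing character other than <tab> go to flagged.txt.
LC_ALL=C grep '[^[:print:][:blank:]]' sample.txt > flagged.txt

# -v inverts the match, so all remaining lines go to clean.txt.
LC_ALL=C grep -v '[^[:print:][:blank:]]' sample.txt > clean.txt

# -c prints the number of selected lines instead of the lines themselves.
LC_ALL=C grep -c '[^[:print:][:blank:]]' sample.txt
```

Adding -n to the first grep would prefix each selected line with its line number, which may help when you need to locate the lines in the original file afterwards.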

bakunin · December 7, 2016, 3:15pm

To add a different take: if by "non-printable" characters you mean those that cannot be displayed because of your locale (for instance, UTF-8 characters in the "C" locale), you can use sed's l command (which displays characters in a visually unambiguous form) to show them.

For instance, the following file content (German umlauts):

xx Ä xx
xx ö xx

would be displayed as:

# sed -n 'l' umlaut.file
xx \303\204 xx$
xx \303\266 xx$

Note that this is NOT a translation: the l command writes directly to the output and leaves the pattern space unchanged, so you cannot go on to transform the resulting 3-digit octal codes within the same sed script. Save the output to a file instead and then run a new sed (or grep, ...) command to process it further, e.g. to select all the lines with umlauts:

# sed -n 'l' umlaut.file > result
# grep '\\[0-7][0-7][0-7]' result
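
The two steps can also be joined with a pipe. What follows is only a sketch under a few assumptions: the sample file is generated with printf from the octal byte values shown above, an extra "plain ascii" line is added for contrast, and LC_ALL=C is set so that sed treats the UTF-8 bytes as non-printable regardless of your own locale.

```shell
# Recreate the sample file; \303\204 and \303\266 are the UTF-8 bytes
# of the two umlauts, and the middle line is plain ASCII for contrast.
printf 'xx \303\204 xx\nplain ascii\nxx \303\266 xx\n' > umlaut.file

# In the C locale the l command prints each non-ASCII byte as \ooo;
# grep then keeps only the lines that contain such an escape.
LC_ALL=C sed -n 'l' umlaut.file | grep '\\[0-7][0-7][0-7]' > result

cat result
```

Two caveats: result holds the escaped form of the lines, not the original bytes, and a line that contains a literal backslash followed by three octal digits is selected as well, because l prints a backslash as \\. Very long lines are also folded by l, so the line count of result need not match the input.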

I hope this helps.

bakunin