Redirecting records with non-printable characters

Hi,

I have a huge file (50 million rows) that contains some non-printable ASCII characters. I am cleaning the file by deleting those characters with the following command:

tr -cd '\11\12\15\40-\176' < unclean_file > clean_file

Please note that the command keeps tab, linefeed, carriage return, and all printable keyboard characters (octal 40 through 176). Everything else is deleted.
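
To make the octal codes concrete: \11 is tab, \12 is linefeed, \15 is carriage return, and \40-\176 runs from space through tilde. The following is only a small sketch on a made-up sample (the sample file names are hypothetical, and LC_ALL=C is an assumption so that tr works on single bytes):

```shell
# Work in a scratch directory so nothing real is overwritten.
demo=$(mktemp -d)

# Three lines: one clean, one containing BEL (octal 007), one containing a tab.
printf 'plain line\nbad\007line\ntab\there\n' > "$demo/sample.txt"

# Same filter as the command above: delete every byte that is NOT
# TAB (\11), LF (\12), CR (\15) or in the range space..tilde (\40-\176).
LC_ALL=C tr -cd '\11\12\15\40-\176' < "$demo/sample.txt" > "$demo/sample_clean.txt"

# Dump the result byte by byte; the BEL is gone, the tab and newlines remain.
od -c "$demo/sample_clean.txt"
```

Note that tr only removes the offending bytes. It never drops or reports the affected lines, which is why a second command is needed to capture them.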

However, besides cleaning the file with the above command, I also need to identify the rows that contain these non-printable ASCII characters and redirect them to another file.

Can anyone please advise how I can capture these rows (the ones with non-printable characters) in another file?

Thanks

grep '[^[:print:]]' myFile >nonPrintFile
tr -cd '[:print:]' < unclean_file > clean_file

Hi Vgersh99,

Will the command you suggested also redirect rows containing linefeeds, carriage returns and tabs?

grep '[^[:print:]]' myFile >nonPrintFile

I do not want to redirect rows just because they contain linefeeds, carriage returns or tabs.

Please advise.

Thanks

You are right, the [:print:] character set does not have tabs and newlines.
Improvements:

tr -cd '[:print:]\t\n' < unclean_file > clean_file
awk '/[^[:print:]\t]/' unclean_file > nonPrint_lines

A CR is an unusual character in a Unix text file. Nevertheless, you can add \r to both character sets if you want to keep it.
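
For a file of this size it may be worth reading it only once. The following is a sketch of an alternative, not the commands above: a single awk pass that copies offending lines to one file and writes the cleaned text to another. It assumes an awk that supports POSIX character classes (gawk, or a recent mawk), a C locale, and a scratch directory created just for the demo:

```shell
# Scratch directory with a tiny stand-in for the real unclean_file.
demo=$(mktemp -d)
printf 'ok line\nhas\001ctrl\nhas\ttab\n' > "$demo/unclean_file"

# One pass over the input:
#  - a line with any byte outside [:print:] and TAB is copied, unchanged,
#    to the file named by the awk variable "bad", then stripped with gsub;
#  - every line (stripped or untouched) is printed to standard output.
LC_ALL=C awk -v bad="$demo/nonPrint_lines" '
    /[^[:print:]\t]/ { print > bad; gsub(/[^[:print:]\t]/, "") }
    { print }
' "$demo/unclean_file" > "$demo/clean_file"

# Inspect both results byte by byte.
od -c "$demo/nonPrint_lines"
od -c "$demo/clean_file"
```

If no line matches, awk never opens the "bad" file, so nonPrint_lines will not exist afterwards. Add \r to both bracket expressions if carriage returns should be kept.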

Thanks @MadeInGermany.

Shouldn't I also include \n in the command? Otherwise, wouldn't every line in the file qualify as having a non-printable character, since newline is also a non-printable character?

awk '/[^[:print:]\t\r\n]/' unclean_file > nonPrint_lines

Please correct me if I am wrong.

Thanks again !

With default record separators, <newline> characters are stripped from $0 when each line is read and the default print command (used when the condition evaluates to TRUE and there is no action section specified) will add a <newline> to the output. So, the two commands:

awk '/[^[:print:]\t\r\n]/' unclean_file > nonPrint_lines
awk '/[^[:print:]\t\r]/' unclean_file > nonPrint_lines

produce exactly the same output for any input file. (But, the results are unspecified if the last character in a non-empty input file is not a <newline> character.)
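
This is easy to check on a small sample. The sketch below uses a made-up four-line input and hypothetical output file names, and assumes a C locale and an awk with POSIX character classes. It runs both patterns and compares the results with cmp:

```shell
# Scratch directory and a sample that ends with a newline, as required above.
demo=$(mktemp -d)
printf 'ok\nbell\007here\ntab\tonly\ncr at end\r\n' > "$demo/unclean_file"

# Pattern with \n in the bracket expression.
LC_ALL=C awk '/[^[:print:]\t\r\n]/' "$demo/unclean_file" > "$demo/with_newline_in_set"

# Pattern without \n; $0 never contains the record separator anyway.
LC_ALL=C awk '/[^[:print:]\t\r]/' "$demo/unclean_file" > "$demo/without_newline_in_set"

# cmp -s is silent and exits 0 only when the two files are byte-identical.
if cmp -s "$demo/with_newline_in_set" "$demo/without_newline_in_set"; then
    echo "identical output"
else
    echo "outputs differ"
fi
```

Only the line containing the BEL character is selected by either command; the tab line and the line ending in a carriage return are left alone because both characters are in the excluded set.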

And, as MadeInGermany said, <carriage-return> is not a normal character in a UNIX/Linux text file. Unless you're processing DOS format text files, you probably want to copy lines containing <carriage-return> characters from unclean_file to nonPrint_lines .
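
If it is not clear whether the file is in DOS format, counting the lines that contain a carriage return can help decide. This is only a sketch on a made-up sample; the variable names are hypothetical. When nearly every line has a CR, the file is probably DOS format and \r belongs in the excluded set; when only a few do, those lines are probably worth flagging.

```shell
# Scratch directory with a three-line sample; only the middle line ends in CR LF.
demo=$(mktemp -d)
printf 'unix line\ndos line\r\nanother unix line\n' > "$demo/unclean_file"

# Put a literal carriage return in a variable (portable across POSIX shells;
# command substitution strips trailing newlines only, so the CR survives).
cr=$(printf '\r')

# Count the lines that contain at least one CR, and the total number of lines.
cr_lines=$(grep -c "$cr" "$demo/unclean_file")
total_lines=$(wc -l < "$demo/unclean_file" | tr -d ' ')

echo "$cr_lines of $total_lines lines contain a carriage return"
```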