Non-ASCII char prevents conversion of manpage to plain text

Hello,

I would like to export manual pages to plain text files.

man CommandName | col -bx > CommandName.txt

The above command works on Mac OS X. However, it often fails on my old Linux. The problem occurs if the source file of the manpage contains an escape sequence for a non-ASCII character, such as "\(co" for the copyright sign (0xA9).

Whenever "col -bx" encounters a non-ASCII character (0x80 through 0xFF), it aborts and displays the error message "Invalid or incomplete multibyte or wide character".

The man command on Mac OS X automatically converts non-ASCII characters into ASCII equivalents such as "(C)" for the copyright character. Therefore, col does not receive non-ASCII characters, and the job successfully completes.

On the other hand, the man command on my old Linux does not convert non-ASCII characters into ASCII equivalents. Therefore, col receives non-ASCII characters, and the job fails.

Please suggest an appropriate solution to this problem.

Is it possible to force the man command on my old Linux to convert non-ASCII characters into ASCII equivalents? Or, is it possible to force the col command to accept non-ASCII characters?

Here are some examples of commands whose manpages failed, along with the non-ASCII characters that caused the failures.

find (acute accent displayed as a curly quote, 0xB4)
hexdump (middle dot, 0xB7)
ln (copyright char, 0xA9)
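
One way to check which bytes are tripping col is to filter the man output down to its non-ASCII bytes. Here is a minimal sketch, assuming a POSIX shell with tr, od, sed and sort available. The function name list_non_ascii is made up for illustration.

```shell
# list_non_ascii: print each distinct non-ASCII byte (0x80-0xFF) found on
# stdin as two hex digits, one per line. Hypothetical helper name.
list_non_ascii() {
    LC_ALL=C tr -d '\000-\177' |   # drop every ASCII byte
        od -An -v -tx1 |           # hex-dump whatever is left
        tr -s ' ' '\n' |           # put one byte on each line
        sed '/^$/d' |              # drop the empty lines
        sort -u                    # keep each byte value once
}

# Example (not run here): man ln | list_non_ascii
```

If the function prints nothing for a given manpage, the output is pure ASCII and col should not complain about it.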

Many thanks in advance.

TERM=lpr man ls >ls.txt

---------- Post updated at 10:08 AM ---------- Previous update was at 10:04 AM ----------

Aack...

LANG=C TERM=lpr man ln >ln.txt

The code presented by cjcox does not work at all. Setting the environment variable TERM to lpr does not seem to change the behavior of man. The man command still outputs non-ASCII characters. Furthermore, because he omitted col, the resultant ln.txt is illegible when opened with KWrite. I wonder if cjcox was just kidding.

Try:

LANG=C man ln > ln.txt

Well.. I wasn't kidding. It really does depend on which system we're looking at, though. Not everything is well written everywhere, and man on Linux (again, there ARE multiple implementations) generally has some integration with the locale and terminal. That is arguably the wrong thing, but man would be a weird thing without some kind of terminal consideration, given that its end user is typically a human or at least something terminal-like.

So... knowing the OS and version in particular could really help in diagnosing this.

I have found a solution. The configuration file /usr/lib/man.conf needs to be modified as I will explain below.

The man command internally calls nroff and/or groff. It also calls geqn or eqn. The file "man.conf" contains lines that define how man calls nroff, groff, geqn and/or eqn, and with which options. The "-Tlatin1" option in man.conf allows man to output non-ASCII characters (0x80 through 0xFF). After replacing "-Tlatin1" with "-Tascii", man no longer outputs non-ASCII characters and instead converts them into ASCII equivalents such as "(C)" for the copyright character.

Before the modification:
NROFF /usr/bin/nroff -Tlatin1 -mandoc
NEQN /usr/bin/geqn -Tlatin1

After the modification:
NROFF /usr/bin/nroff -Tascii -mandoc
NEQN /usr/bin/geqn -Tascii
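
For anyone who prefers not to edit the file by hand, the same substitution can be sketched with sed. This is only an illustration: the function name ascii_man_conf is hypothetical, it writes the modified configuration to stdout rather than touching the original, and it assumes the only change needed is "-Tlatin1" to "-Tascii" as in the lines above.

```shell
# ascii_man_conf: print a copy of the given man.conf with every
# "-Tlatin1" replaced by "-Tascii". The input file is left untouched.
# Hypothetical helper name; pass the path of your own man.conf.
ascii_man_conf() {
    sed 's/-Tlatin1/-Tascii/g' "$1"
}

# Example (not run here):
#   ascii_man_conf /usr/lib/man.conf > "$HOME/man-ascii.conf"
# Review the result before installing it in place of the original.
```

Some man implementations accept a -C option to point at an alternative configuration file, which would avoid changing the system-wide one. Check your own man page before relying on that.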

After the modification to "man.conf", the following command line successfully exports manual pages to plain text files.

man CommandName | col -bx > CommandName.txt

By the way, piping to "col" is necessary. Without "col", the resultant text would not be plain, because the direct output from man is not plain text. The fact that man's output on the terminal contains bold-face and underlined letters already suggests this.

To see it, redirect the output from man to a file without "col" as shown below, and open the file with a GUI-based text editor (e.g., TextEdit on Mac OS X, KWrite on KDE-equipped Linux, NotePad on Windows).

man ln > ln.txt

Then, you will see a bunch of illegible strings like the following, proving that the output is not plain text.

N[]NA[]AM[]ME[]E
S[]SY[]YN[]NO[]OP[]PS[]SI[]IS[]S
D[]DE[]ES[]SC[]CR[]RI[]IP[]PT[]TI[]IO[]ON[]N

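The man command outputs lots of backspace characters, which do not belong in plain text. It produces bold text by overstriking: each character is printed, followed by a backspace and the same character again. Underlining works the same way, with an underscore, a backspace, and then the character. In GUI-based text editors, you will see square characters in place of the backspace characters. Since it is difficult to display square characters on this web page, I used [] to represent a square character.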

The "col" command converts the above illegible strings into the following plain text.

NAME
SYNOPSIS
DESCRIPTION

Thus, "col" is absolutely necessary to obtain plain text from "man".
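
To illustrate what "col -b" does to those overstrike sequences, here is a rough sketch that strips every "character + backspace" pair with sed. It is not a replacement for col: the function name strip_overstrike is made up, it only handles single-byte characters, and unlike "col -x" it does not convert tabs to spaces.

```shell
# strip_overstrike: remove every "character + backspace" pair from stdin,
# keeping only the last character written at each position. This mimics
# the effect of "col -b" on man's bold and underline sequences.
# Hypothetical helper name; single-byte characters only.
strip_overstrike() {
    bs=$(printf '\b')              # a literal backspace (0x08)
    LC_ALL=C sed "s/.${bs}//g"
}

# Example (not run here): man ln | strip_overstrike > ln.txt
```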

I thank cjcox and Franklin52 for trying to help me. I apologize to cjcox for mistaking his generous help for a joke.
