get rid of non-alphanumeric characters

mjomba · November 29, 2010, 3:04am

Hi!
Could anyone kindly help me with code to remove words from a text file? The file was built by collecting and merging several web pages, and I want to drop every word (string) that contains a character other than letters, digits, underscore, and punctuation marks (i.e. anything outside a-zA-Z0-9, underscore, and punctuation).

Thanks a lot to anyone who replies!
mjomba from Tanzania

michaelrozar17 · November 29, 2010, 3:32am

This might help:

sed 's/[^a-zA-Z0-9_:]/ /g' inputfile
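A quick way to see what that sed expression does is to run it on one sample line. This is only a sketch: the function name blank_disallowed and the sample text are made up for illustration, and LC_ALL=C is an added assumption so that the a-z ranges mean plain ASCII. Note that it replaces the offending characters with spaces and leaves the rest of each word in place.

```shell
#!/bin/sh
# blank_disallowed (hypothetical name): replace every character outside
# a-z, A-Z, 0-9, underscore and colon with a single space.
blank_disallowed() {
    LC_ALL=C sed 's/[^a-zA-Z0-9_:]/ /g'
}

# The "@" is turned into a space; "c_d:e" passes through unchanged.
printf 'a@b c_d:e\n' | blank_disallowed
```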

animbane · November 29, 2010, 4:48am

Try this out ...

cat inputFile|sed 's/ [a-zA-Z0-9]*[^a-zA-Z0-9][^a-zA-Z0-9]*[a-zA-Z0-9]* / /g'|sed 's/[ ][a-zA-Z0-9][^a-zA-Z0-9 ][^a-zA-Z0-9 ]*[a-zA-Z0-9]//g'

Scrutinizer · November 29, 2010, 5:59am

awk '{for(i=1;i<=NF;i++)if($i~/[^[:graph:]]/)$i=x}1' file

or

awk '{for(i=1;i<=NF;i++)if($i~/[^[:alnum:]._]/)$i=x}1'

if, for example, you only want to allow . and _ in addition to alphanumeric characters.
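For anyone who wants the offending words dropped without leaving doubled spaces behind, here is a variation on the same idea, as a sketch rather than a definitive version. The function name drop_words and the sample input are invented for illustration. It uses explicit ASCII ranges under LC_ALL=C instead of [:alnum:], on the assumption that some older awk implementations do not handle POSIX character classes.

```shell
#!/bin/sh
# drop_words (hypothetical name): print each line with every word removed
# that contains a character outside a-z, A-Z, 0-9, dot and underscore.
drop_words() {
    LC_ALL=C awk '{
        out = ""
        for (i = 1; i <= NF; i++)
            if ($i !~ /[^a-zA-Z0-9._]/)      # keep only fully "clean" words
                out = out (out == "" ? "" : " ") $i
        print out
    }'
}

# "b@d" and the UTF-8 encoded "caf\303\251" are dropped as whole words.
printf 'good b@d fine caf\303\251 v1.0_a\n' | drop_words
```

Unlike assigning an empty string to $i, this rebuilds the line from the surviving words, so runs of whitespace in the input are also collapsed to single spaces.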

methyl · November 29, 2010, 7:34am

A representative example would help. It depends on what you mean by "word", "punctuation" etc., and on whether you want to retain the line terminator.

One example of removing every character except those listed is:

cat oldfile | tr -cd '[:alnum:][:punct:][:space:]' > newfile

For definitions of the various character classes, see "Regex Tutorial - POSIX Bracket Expressions".

If you actually need to work on "words", you first need a clear definition of what constitutes a "word".
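To make the character-versus-word distinction concrete, here is a small sketch of the tr approach on one sample line. The function name strip_chars and the sample text are made up for illustration, and LC_ALL=C is an added assumption so that only ASCII letters and digits count as [:alnum:].

```shell
#!/bin/sh
# strip_chars (hypothetical name): delete every byte that is not an ASCII
# alphanumeric, punctuation or whitespace character.
strip_chars() {
    LC_ALL=C tr -cd '[:alnum:][:punct:][:space:]'
}

# Only the two bytes of the UTF-8 "e-acute" are deleted; the rest of the
# word ("caf") stays, and the newline survives because of [:space:].
printf 'caf\303\251 ok_1!\n' | strip_chars
```

So this removes unwanted characters, not the whole words that contain them.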

mjomba · December 17, 2010, 6:12am

Thank you very much!
It does what I needed:

cat oldfile | tr -cd '[:alnum:][:punct:][:space:]' > newfile

mjomba