mjomba
November 29, 2010, 3:04am
1
Hi!
Could anyone kindly help me with code to remove, from a text file obtained by collecting and merging several web pages, every word (string) that contains a character other than alphabetic, numeric, or punctuation characters (i.e. anything that is not a-zA-Z0-9, underscore, or a punctuation mark)?
Thanks a lot to anyone who replies!
mjomba from Tanzania
This might help:
sed 's/[^a-zA-Z0-9_:]/ /g' inputfile
Try this out ...
cat inputFile|sed 's/ [a-zA-Z0-9]*[^a-zA-Z0-9][^a-zA-Z0-9]*[a-zA-Z0-9]* / /g'|sed 's/[ ][a-zA-Z0-9][^a-zA-Z0-9 ][^a-zA-Z0-9 ]*[a-zA-Z0-9] //g'
awk '{for(i=1;i<=NF;i++)if($i~/[^[:graph:]]/)$i=x}1' file
or
awk '{for(i=1;i<=NF;i++)if($i~/[^[:alnum:]._]/)$i=x}1' file
if, for example, you only want to allow . and _ in addition to letters and digits.
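For comparison, here is a minimal sketch of the same word-level idea that rebuilds each line from the words it keeps, so no empty fields or doubled spaces are left behind. It assumes a "word" is anything separated by whitespace, that the C locale is acceptable (every non-ASCII byte then counts as unwanted), and it uses the ASCII range `!-~` as a stand-in for `[:graph:]`. The function name `drop_bad_words` is made up for this illustration.

```shell
# Hypothetical helper: drop every whitespace-separated word that contains
# a byte outside the printable ASCII range '!' to '~' (letters, digits,
# punctuation). Assumption: LC_ALL=C, so bytes are compared by ASCII value.
drop_bad_words() {
    LC_ALL=C awk '{
        out = ""
        for (i = 1; i <= NF; i++)
            if ($i !~ /[^!-~]/)                       # word is clean: keep it
                out = out (out == "" ? "" : " ") $i
        print out                                     # line terminator is kept
    }' "$@"
}

# The second word contains a UTF-8 encoded e-acute (bytes \303\251).
printf 'good w\303\251b word_1 ok.\n' | drop_bad_words
```

Called with a file name (drop_bad_words file) it reads that file instead of standard input. Note that runs of spaces or tabs inside a line are collapsed to single spaces.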
methyl
November 29, 2010, 7:34am
5
A representative example would help. It depends on what you mean by "word", "punctuation", etc., and on whether you want to retain the line terminator.
One example of removing every character except those listed is:
cat oldfile | tr -cd '[:alnum:][:punct:][:space:]' > newfile
See "Regex Tutorial - POSIX Bracket Expressions" for definitions of the various character classes.
If you actually need to work on "words", you first need a clear definition of what constitutes a "word".
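To make the difference concrete, here is a small sketch of what the tr filter above does to a word that contains an unwanted character: only the offending bytes are deleted and the rest of the word stays. The wrapper name `strip_bad_chars` is made up for this illustration, and LC_ALL=C is an assumption so that every non-ASCII byte is treated as unwanted.

```shell
# Hypothetical wrapper around the tr filter: delete every byte that is not
# alphanumeric, punctuation, or whitespace. Newlines are whitespace, so the
# line terminators are retained.
strip_bad_chars() {
    LC_ALL=C tr -cd '[:alnum:][:punct:][:space:]'
}

# The word "caf\303\251" loses its last two bytes but is not removed.
printf 'caf\303\251 ok\n' | strip_bad_chars
```

So this works per character, not per word. If whole words must go, a per-field test such as the awk ones earlier in the thread is closer to that.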
mjomba
December 17, 2010, 6:12am
6
Thank you very much!
It does what I needed:
cat oldfile | tr -cd '[:alnum:][:punct:][:space:]' > newfile
mjomba