- The problem statement, all variables and given/known data:
I have a file containing a fragment of a novel. I have to strip its punctuation and write all the words it contains, one per line, sorted and without duplicates, to a file called "palabras".
Here is a fragment of the input file:
Don Quijote de la Mancha, Cervantes
Capítulo II
Que trata de la notable pendencia
[*] que Sancho Panza tuvo con la sobrina y ama de don Quijote, con otros sujetos graciosos
And here is a fragment of what the file palabras should look like:
ama
Capítulo
Cervantes
con
Don
...
- Relevant commands, code, scripts, algorithms:
- The attempts at a solution (include all code and scripts):
After searching the web for information, I have only managed to strip the punctuation and put one word on each line, with the following code:
{gsub("[-.,:;�[\*\]\?]","");}
{RS=" ";}
{print > "palabras";}
I call it from the terminal with this: cat novela | awk -f p4
p4 is the name of the file that contains my code.
When I then run this command from the terminal: sort -u palabras > palabras2, it generates the file I want (if I use palabras > palabras instead, it generates a blank file).
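The blank file is most likely the shell's doing rather than sort's: with sort -u palabras > palabras, the redirection truncates palabras before sort ever opens it for reading. A minimal sketch of sorting the file in place instead, assuming a POSIX sort with the -o option (the sample words below are made up for illustration):

```shell
# Hypothetical sample data standing in for the real word list
printf '%s\n' con ama con Don ama > palabras

# -o names the output file; it may be one of the input files,
# because sort reads all of its input before writing the result
sort -u -o palabras palabras
```

After this, palabras holds each word once, in the sort order of the current locale.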
The question is: how can I achieve my goal within the same awk program? I tried this:
{gsub("[-.,:;�[\*\]\?]","");}
{RS=" ";}
{print > "palabras";}
END {sort -u > palabras2;}
I tried it with and without END, with sort -u > palabras2 and with sort -u palabras; however, the generated file is always the same, unsorted and with the duplicated words still in it.
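One direction that might work: inside an awk action, sort -u > palabras2 is not run as a command. awk parses it as an arithmetic comparison between (empty) variables, so it silently does nothing. awk can, however, pipe print output into an external command. The following is only a sketch under some assumptions: it loops over the fields instead of changing RS, it writes the sorted result straight to palabras, the variable name cmd is my own, and the character class leaves out the one character that appears garbled in the code above. The novela file created here is just the sample fragment from this post.

```shell
# Sample input taken from the fragment quoted in the post
cat > novela <<'EOF'
Don Quijote de la Mancha, Cervantes
Capítulo II
Que trata de la notable pendencia
[*] que Sancho Panza tuvo con la sobrina y ama de don Quijote, con otros sujetos graciosos
EOF

# The awk program file (p4), written out here so the sketch is self-contained
cat > p4 <<'EOF'
BEGIN { cmd = "sort -u > palabras" }    # shell command that receives the words
{
    gsub(/[][*?.,:;-]/, "")             # strip punctuation; $0 is re-split into fields
    for (i = 1; i <= NF; i++)           # one word per output line
        print $i | cmd
}
END { close(cmd) }                      # flush the pipe and wait for sort to finish
EOF

cat novela | awk -f p4
```

The command string has to be identical on every print so that awk reuses a single pipe; keeping it in a variable is one way to make sure of that.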
I would really appreciate any ideas, because I have been stuck on this problem for days. Ideally, I would like a solution that I can still call the way I described above (cat novela | awk -f p4).
Thank you in advance.
- Complete Name of School (University), City (State), Country, Name of Professor, and Course Number (Link to Course):
ITESM Campus Monterrey, Monterrey, Mexico
Professor: Juan Jose Icaza
Course: Laboratorio de Sistemas Operativos