Shell script/awk to sort text

ektorzoza · February 25, 2014, 1:51am

The problem statement, all variables and given/known data:

I have a file containing a fragment of a novel. I have to strip the punctuation and write every word it contains, sorted, one per line and without duplicates, to a file called "palabras".

Here is a fragment of the input file:

Don Quijote de la Mancha, Cervantes 

Capítulo II

Que trata de la notable pendencia 
[*] que Sancho Panza tuvo con la sobrina y ama de don Quijote, con otros sujetos graciosos

And here is a fragment of what the file palabras should look like:

ama
Capítulo
Cervantes
con
Don
...

Relevant commands, code, scripts, algorithms:
The attempts at a solution (include all code and scripts):

After searching the web for information, I have only managed to strip the punctuation and put one word on each line, with the following code:

{gsub("[-.,:;�[\*\]\?]","");}
{RS=" ";}
{print > "palabras";}

I call it from the terminal like this: cat novela | awk -f p4

p4 is the name of the file containing my code.

When I then run this command from the terminal: sort -u palabras>palabras2 it generates the file I want (if I use palabras>palabras it generates a blank file).

The question is: how can I achieve my goal within the same awk program? I tried this:

{gsub("[-.,:;�[\*\]\?]","");}
{RS=" ";}
{print > "palabras";}
END {sort -u > palabras2;}

I tried it with and without END, with sort -u > palabras2 and with sort -u palabras, but the generated file is always the same: unsorted and with the duplicated words still in it.

I would really appreciate any ideas because I have been stuck on this problem for days. If possible, please suggest solutions that let me call awk the way I described above (cat novela | awk -f p4).

Thank you in advance.

Complete Name of School (University), City (State), Country, Name of Professor, and Course Number (Link to Course):

ITESM Campus Monterrey, Monterrey, Mexico
Profesor: Juan Jose Icaza
Course: Laboratorio de Sistemas Operativos

Scrutinizer · February 25, 2014, 3:46am

You can use a pipeline ( | ) to feed the awk output into sort, both inside awk and outside awk using the shell. You do not need an END part.

Lucas_0418 · February 25, 2014, 4:05am

Hi,
Your awk version matters: I don't know whether your awk has the built-in function 'asort'. If not, it will be hard to implement this without 'sort'.
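
To illustrate the 'asort' route, here is a minimal sketch. It assumes GNU awk (gawk) is installed, since asort is a gawk extension and other awk implementations may not provide it. The word list and the output file name palabras_asort are made up for the example, and the order asort produces depends on gawk's string comparison rules, so it is not shown here.

```shell
# Only run the example if gawk is available; asort() is a gawk extension.
if command -v gawk >/dev/null 2>&1; then
    # Hypothetical input: one word per line, with duplicates.
    printf '%s\n' con ama con Don ama |
    gawk '
        { seen[$0] = $0 }                 # array keys are unique, so duplicates collapse
        END {
            n = asort(seen, sorted)       # copy the values into sorted[1..n], sorted
            for (i = 1; i <= n; i++)
                print sorted[i]
        }
    ' > palabras_asort
    cat palabras_asort
else
    echo "gawk not found: asort() is not available in this awk" >&2
fi
```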

ektorzoza · February 25, 2014, 4:25pm

Scrutinizer: Could you give me an example of code that sorts from inside awk itself? I'm new to this language, because this is just one lab session out of several in the semester, so it would be very helpful.

And Lucas_0418, I don't really know the version. How can I find it out? I just create a file, write the code in it, and then execute it from the terminal, on a fresh Ubuntu installation.

Scrutinizer · February 25, 2014, 5:09pm

For example, here sort is called as an external program from within awk:

awk '{print | "sort"}' file

And here the output of awk is run through sort using a shell pipeline:

awk '{print}' file | sort
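
Applied to the p4 script from the first post, the first variant might look like the sketch below. The file names novela, p4 and palabras come from the question; the sample text is made up, and the bracket expression is a simplified stand-in for the original gsub pattern. Instead of setting RS inside an action (which only takes effect for records read after that point), the sketch loops over the fields of each line, which awk splits on blanks by default. The order of the result depends on the locale that sort runs under.

```shell
# Hypothetical sample input standing in for the real "novela" file.
printf '%s\n' \
    'Don Quijote de la Mancha, Cervantes' \
    'Que trata de la notable pendencia que Sancho Panza tuvo con la sobrina, con otros.' > novela

# p4: strip punctuation, then send one word per line into a single sort pipe.
cat > p4 <<'EOF'
{
    gsub(/[][,.;:?!*-]/, "")               # delete the listed punctuation characters
    for (i = 1; i <= NF; i++)              # $0 was modified, so the fields are re-split
        print $i | "sort -u > palabras"    # same command string, so one pipe is reused
}
EOF

cat novela | awk -f p4
cat palabras
```

Because sort only writes palabras after it has read all of its input, no END block is needed: awk closes the pipe and waits for sort when it exits.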