Extracting Words from Text

eldeingles · May 19, 2012, 7:54am

Hi there, Unix Gurus

Back in September last year you helped me find a way to extract the words in brackets in a textfile to a new one.

In that case my textfile was made up of sentences containing only one bracketed word per sentence/line:

If the boss's son had been [kidnapped], someone would have asked for money by now.
Look, I haven't [committed] a crime, so why can't you let me go?
....

But this time I am trying in vain to do the same on a file full of different texts, not sentences.

...Many astronauts [have] travelled [in] space, but now, ordinary people [are] travelling [in] space too. Dennis Tito [is] over 60 years old, [but] he [hasn't] stopped working yet. In fact, [he] is very active, and [in] 2001, he [did] something amazing. He [became] the world's first [space] tourist. So ... [who] is Dennis Tito? Where [does] he come [from] ? How [did] he become [a] space [tourist] ? Tito [comes] from [the] United States. He was [born] in New York, but [he] has [been] [living] in California [for] many years. He [is] a very rich [and] successful [businessman]...

The following code only extracts the last bracketed word.

sed 's/\(.*\[\).*\(\].*\)/\1\2/g' inputfile > outputfile

As I asked back then, adding the blanked out bracketed words to a new file would be a bonus.

Any help infinitely appreciated.

complex.invoke · May 19, 2012, 9:00am

sed 's/][^]]*\[/ /g' infile | sed 's/.*\[\([^]]*\).*/\1/'
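
A minimal sketch of what the two stages do, run on a made-up one-line sample (the sample text, the variable names and the final `tr` step are assumptions added for illustration, not part of the command above). The first sed replaces everything from a closing bracket up to the next opening bracket with a space, which leaves one bracket pair holding all the words; the second sed keeps only what is inside that pair.

```shell
# Hypothetical one-line sample; the real input would be the text file.
sample='Many astronauts [have] travelled [in] space, but people [are] travelling too.'

# Stage 1: collapse every "] ... [" stretch into a single space.
stage1=$(printf '%s\n' "$sample" | sed 's/][^]]*\[/ /g')

# Stage 2: keep only the contents of the one remaining bracket pair.
words=$(printf '%s\n' "$stage1" | sed 's/.*\[\([^]]*\).*/\1/')

printf '%s\n' "$stage1"
printf '%s\n' "$words"

# Optional extra step (an addition): print one word per line.
printf '%s\n' "$words" | tr ' ' '\n'
```

Since sed works line by line, the words from each input line come out together on one output line, and a line without any brackets passes through unchanged.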

Scott · May 19, 2012, 9:08am

An awk attempt:

$ awk -v RS='[' -v FS=']' '$2 {print $1}' file
have
in
are
in
is
but
hasn't
he
in
did
became
space
who
does
from
did
a
tourist
comes
the
born
he
been
living
for
is
and
businessman
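
How this works, as a rough sketch: with RS set to `[`, every record begins right after an opening bracket, and with FS set to `]`, the first field of such a record is the bracketed word. The variant below tests `NF > 1` instead of `$2`; that is an assumption about a slightly more robust condition, since `$2` is empty (and so false) when a closing bracket is followed directly by the next `[`. The sample text is made up for the demonstration.

```shell
# Hypothetical sample with two bracketed words that touch each other.
sample='He has [been][living] in California [for] many years.'

# Split records on "[" and fields on "]"; any record with more than one
# field contained a "]", so its first field is a bracketed word.
words=$(printf '%s\n' "$sample" | awk -v RS='[' -v FS=']' 'NF > 1 { print $1 }')

printf '%s\n' "$words"
```

The text before the first `[` forms a record with a single field, so it is skipped.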

eldeingles · May 19, 2012, 1:22pm

Quite near,

yeah, both pieces of code list the bracketed words in the text, each in its own way, but I would also need the text with the brackets emptied, such as this:

... I met an old [ ] friend last week that I hadn't [ ] [ ] twenty [ ] . He [ ] me about what I [ ] doing and I [ ] him I was back [ ] England for [my] nephew's [ ] , but that I [ ] [ ] ...

Thanks guys!!

Scott · May 19, 2012, 1:27pm

Sorry, did I misread your question?

Are you saying you want to blank out the [words] with _, while also storing those words in a new file?

First, do the awk

(the awk) > newfile

Then do the sed:

sed -i "s/\[[^]]*/[ /g" file

cat file
...Many astronauts [ ] travelled [ ] space, but now, ordinary people [ ] travelling [ ] space too. Dennis Tito [ ] over 60 years old, [ ] he [ ] stopped working yet. In fact, [ ] is very active, and [ ] 2001, he [ ] something amazing. He [ ] the world's first [ ] tourist. So ... [ ] is Dennis Tito? Where [ ] he come [ ] ? How [ ] he become [ ] space [ ] ? Tito [ ] from [ ] United States. He was [ ] in New York, but [ ] has [ ] [ ] in California [ ] many years. He [ ] a very rich [ ] successful [ ]
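
The two steps can also be run without editing anything in place, which may be handy for a first try. This is only a sketch under assumptions: the scratch directory, the sample sentence and the output names `words.txt` and `blanked.txt` are all hypothetical, and the sed here replaces the whole bracket pair rather than just its opening part.

```shell
# Work in a scratch directory so no real file is overwritten.
workdir=$(mktemp -d)

# Made-up sample standing in for the real text file.
printf '%s\n' 'He was [born] in New York, but [he] has [been] [living] in California.' > "$workdir/file"

# Step 1: save the bracketed words, one per line.
awk -v RS='[' -v FS=']' 'NF > 1 { print $1 }' "$workdir/file" > "$workdir/words.txt"

# Step 2: write a blanked copy; the original file stays untouched.
sed 's/\[[^]]*]/[ ]/g' "$workdir/file" > "$workdir/blanked.txt"

cat "$workdir/words.txt"
cat "$workdir/blanked.txt"
```

Order does not matter in this form, because the original file is never modified.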

eldeingles · May 19, 2012, 1:29pm

Yes, Scott!

Scott · May 19, 2012, 1:34pm

We're posting across each other!

I think my previous post fits what you describe

eldeingles · May 19, 2012, 1:46pm

can't get the second bit of code to work

(btw the file's name is file)

iMacAA:~ eldeingles$ sed -i "s/\[[^]]*/[ /g" file
sed: 1: "file": invalid command code f

...:wall:

Scott · May 19, 2012, 1:48pm

If you're using a Mac, then:

sed -i .bak "s/\[[^]]*/[ /g" file

(ps: you did the right thing sticking with Snow Leopard ;))
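
For background on the error: the sed shipped with macOS (BSD sed) expects a backup suffix as a separate argument after `-i`, so it took the substitution as the suffix and then tried to run `file` as the script. GNU sed only accepts a suffix attached directly to `-i`. Writing the suffix without a space, as in the sketch below, is accepted by both; the scratch directory and sample sentence are made up for the demonstration.

```shell
# Scratch directory with a made-up sample file.
workdir=$(mktemp -d)
printf '%s\n' 'Tito [comes] from [the] United States.' > "$workdir/file"

# Suffix attached directly to -i: works with GNU sed and BSD/macOS sed.
sed -i.bak 's/\[[^]]*/[ /g' "$workdir/file"

cat "$workdir/file"        # blanked text
cat "$workdir/file.bak"    # backup holding the original text
```

The backup also means the original words are still available if the awk step was skipped by mistake.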

eldeingles · May 19, 2012, 1:51pm

Yes, Oh, man! absolutely!
Thanx for the time you have saved me.