I have a text file downloaded from the web. I want to count the unique words used in the file, and to measure how long each person speaks in a conversation by counting the words between the opening and closing quotation marks, which are not the standard ASCII quote characters. I also found that the file contains some odd blank characters that are invisible on stdout: they are the entries with 118391 and 6380 occurrences in the example below.
Judging by the single/double quotes, I guess the file was processed on a Mac, but I am not sure. Here is the output from my Ubuntu terminal:
tr -d '[:blank:]' < infile.txt | grep -o "." | sort | uniq -c | head
4 �
1089 �
1098 �
12146 �
12147 �
118391
6380
12237 about
31 alot
154 apple
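One possible way to see what the unlabelled entries actually are is to count raw bytes instead of printed characters, since a byte dump does not depend on the terminal or locale. This is only a minimal sketch, assuming a POSIX shell with `od`, `tr`, `awk`, `sort` and `uniq` available; `byte_histogram` is a hypothetical helper name, not an existing command. It lists every byte outside printable ASCII (hex 20 to 7e) with its count, so a multi-byte UTF-8 character shows up as several separate byte values.

```shell
#!/bin/sh
# byte_histogram FILE   (hypothetical helper name, for illustration only)
# Prints "count hex-byte" for each byte outside printable ASCII (0x20-0x7e),
# most frequent first. Newlines (0a) and carriage returns (0d) are included.
byte_histogram() {
    od -An -v -tx1 "$1" |               # dump the file as hex bytes, no offsets
        tr -s ' ' '\n' |                # put one byte value on each line
        grep -v '^$' |                  # drop the empty line left by leading blanks
        awk '$1 < "20" || $1 > "7e"' |  # string comparison of two-digit lowercase hex
        sort | uniq -c | sort -rn       # count occurrences, largest count first
}
```

Calling `byte_histogram infile.txt` and looking for counts close to 118391 and 6380 may hint at the culprits: for example `0d` is a carriage return, and the pair `c2` `a0` is a UTF-8 no-break space. Whether either of these is what the file really contains is an assumption until the dump confirms it.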
1) How do I identify the invisible "blank/empty" characters in the file, so that I can remove them before counting the words?
2) How do I count how many words a person speaks in each turn of a conversation, using the opening/closing double quotation mark pairs? What I tried is:
grep "“.*”" infile.txt
This regex is too greedy: it sometimes combines adjacent dialogues into a single match.
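The greedy behaviour can be avoided by pairing each opening quote with the nearest closing quote after it, rather than the last one on the line. The following is a rough sketch of that idea, assuming the quotes are the curly characters “ and ”, that each dialogue opens and closes on the same line, and that a "word" is any whitespace-separated token. `quote_lengths` is a hypothetical helper name. It uses plain string search in `awk` instead of a regex, so it should not depend on the locale.

```shell
#!/bin/sh
# quote_lengths FILE   (hypothetical helper name, for illustration only)
# Prints one line per quoted passage: word count, a tab, then the quoted text.
quote_lengths() {
    awk -v lq='“' -v rq='”' '
    {
        line = $0
        # Walk the line left to right, pairing each opening quote with the
        # nearest closing quote that follows it.
        while ((s = index(line, lq)) > 0) {
            rest = substr(line, s + length(lq))
            e = index(rest, rq)
            if (e == 0) break             # opening quote never closed on this line
            quote = substr(rest, 1, e - 1)
            n = split(quote, words, " ")  # number of whitespace-separated words
            printf "%d\t%s\n", n, quote
            line = substr(rest, e + length(rq))
        }
    }' "$1"
}
```

For a line such as `“Hello there,” she said. “How are you today?”` this reports two passages, of 2 and 4 words, instead of one combined match. In a UTF-8 locale, `grep -o '“[^”]*”' infile.txt` should extract the same passages, since `[^”]*` cannot run past a closing quote.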
Thanks!