View a file and count all words beginning with a specific letter

simpsa27 · March 30, 2019, 10:41am

I am trying to write a command that counts all the words in a file which begin with the letter S.

I have run this command

[casupport@docvlapph005 ca]$ grep '^' TheAgileApproach.dat | wc -l
0
[casupport@docvlapph005 ca]$ grep '^' TheAgileApproach.dat | wc -l
1

When I remove the wc -l I see the output as below:

[casupport@docvlapph005 ca]$ grep '^' TheAgileApproach.dat
[casupport@docvlapph005 ca]$ grep '^' TheAgileApproach.dat
String Filter increment. The most basic implementation that can be used for all data types is shown on the far left. Each version to the right of that adds more functionality.
[casupport@docvlapph005 ca]$

As you can see, the second command returns the whole line, but that line also contains the word "shown", which I expected the first command to pick up.

Is there something specific I am doing wrong?

vbe · March 30, 2019, 11:49am

'^'

means "at the beginning of the line", so the output you got is correct...

Addendum
Oh, and wc -l counts lines. If you have 2 occurrences on the same line, they will be counted as 1...
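
To make the point above concrete, here is a minimal sketch using a made-up two-line sample file (the contents and the variable names are assumptions for illustration, not the original TheAgileApproach.dat). It assumes a grep that supports -o, as GNU and BSD grep do.

```shell
# Hypothetical sample data; not the poster's real file.
sample=$(mktemp)
printf '%s\n' 'String Filter shows six steps.' 'The second line has some more.' > "$sample"

# '^S' is anchored to the start of a line, so only the first line matches.
lines_anchored=$(grep -c '^S' "$sample")

# Without the anchor every line containing s or S matches, but each line
# is still counted once, however many matches it holds.
lines_any=$(grep -ci 's' "$sample")

# -o prints each match on its own line, so wc -l now counts occurrences.
occurrences=$(grep -oi 's' "$sample" | wc -l | tr -d ' ')

echo "lines starting with S: $lines_anchored"
echo "lines containing s:    $lines_any"
echo "occurrences of s:      $occurrences"
rm -f "$sample"
```

Note that the last figure counts letters, not words; counting words needs a pattern that says where a word starts.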

simpsa27 · March 30, 2019, 11:51am

Hey VBE

I want to run it so that it reads the whole script and prints out the words which start with S or s. What would I edit to do that? I assume I wouldn't use wc -l either?

vbe · March 30, 2019, 12:35pm

If you are trying to learn what can be done with grep (because others would suggest using sed or awk...)
As I only have my Mac laptop at the moment, this is what I would do if I were to use only grep:

grep -i -e"^s" -e" s" -o TheAgileApproach.dat|wc -l

Give me some time to try it... I will come back with the result.

I'm back. Result:

$ cat TheAgileApproach.dat
String Filter increment. The most basic implementation that can be used for all data types is shown on the far left. Each version to the right of that adds more functionality.

$ grep -ie"^s" -e" s" -o TheAgileApproach.dat|wc -l
       2

Addendum
If it works for you, then to understand it, try a little bit of the command line at a time and look at its output...
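
One way to follow that advice is sketched below: run each -e pattern on its own, then look at the combination. The sample line is the one shown earlier in this thread; the temporary file, the variable names and the [[:alpha:]]* extension in step 3 are assumptions added for illustration, not part of the original command. It assumes a grep with -o (GNU or BSD).

```shell
# Sample line copied from the thread; the temp file is only for illustration.
sample=$(mktemp)
printf '%s\n' 'String Filter increment. The most basic implementation that can be used for all data types is shown on the far left.' > "$sample"

# Step 1: only the line-start pattern. -o prints just the matched "S".
start_only=$(grep -i -o -e '^s' "$sample")

# Step 2: only the space-then-s pattern. -o prints " s", never the whole word.
space_only=$(grep -i -o -e ' s' "$sample")

# Step 3: extend both patterns with [[:alpha:]]* so that -o covers the rest
# of the word. Space-led matches keep their leading blank, so tr strips it.
words=$(grep -i -o -e '^s[[:alpha:]]*' -e ' s[[:alpha:]]*' "$sample" | tr -d ' ')
count=$(printf '%s\n' "$words" | wc -l | tr -d ' ')

printf '%s\n' "$words"
echo "count: $count"
rm -f "$sample"
```

Be aware that the ' s' pattern only finds words preceded by a space, so a word that follows a tab, a quote or an opening parenthesis would be missed.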

simpsa27 · March 30, 2019, 12:58pm

Came back with a count of 55 which is correct.

--- Post updated at 04:58 PM ---

What if I want to print a list of the words that begin with S? When I run it without wc -l, it just comes up with a list of s:

[casupport@docvlapph005 ca]$ grep -i -e"^s" -e" s" -o TheAgileApproach.dat
 s
 s
 s
 s
 s
 s
 s
 s
 s
 s
 s
 s
 s
 s
 s
 s
 s
 s
 s
 s
 s
 s
 s
 s
 s
 s
 s
 s
 s
 s
 s
 s
 s
 s
S
 s
 s
 s
 s
 s
 s
 s
 s
 s
 s
 s
 s
 s
 s
 s
 s
 s
 s
 s
 s

drl · March 30, 2019, 1:40pm

Hi.

Here are the important parts of a script that seems to do what you wish (including a sample data file). It runs twice, once considering the underscore as a separator, then as a character:

# Utility functions: print-as-echo, print-line-with-visual-space.
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }

pl " Input data file $FILE:"
head $FILE

pl " Results:"
tr -s '[[:punct:][:space:]]' '\n' < $FILE |
tee t1 |
grep '^[sS]' |
tee t2 |
wc -l

pl  "Content of intermediate files (columnized by local utility):"
for f in t?
do
  pl " File: $f:"
  my-columns $f
done

pl " Results, considering "_" as a character:"
# tr -s '[^\w\s_]' '\n' < $FILE |
grep -o -P '[\w_]+' $FILE |
tee t1 |
grep '^[sS]' |
tee t2 |
wc -l

pl  "Content of intermediate files (columnized by local utility):"
for f in t?
do
  pl " File: $f:"
  my-columns $f
done

producing:

$ ./s1 data3

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-7-amd64, x86_64
Distribution        : Debian 8.11 (jessie) 
bash GNU bash 4.3.30
tr (GNU coreutils) 8.23
grep (GNU grep) 2.20

-----
 Input data file data3:
(2) SHALL we see?
(3) Is Sheriff Nokill allowing us to Shoot on Sight.
(4) We are agin "Shoot on site" but OK with "shoot on Sight".
(1) Nothing here to See, move along.
(0) Go USA! 
(1) un_Sharpened.
(1) un-Sharpened.
(1) Sharp
(13) total

-----
 Results:
13

-----
Content of intermediate files (columnized by local utility):

-----
 File: t1:
      see     Nokill   Shoot We    on   with  1       See   Go  Sharpened 1    
2     3       allowing on    are   site shoot Nothing move  USA 1         Sharp
SHALL Is      us       Sight agin  but  on    here    along 1   un        13   
we    Sheriff to       4     Shoot OK   Sight to      0     un  Sharpened total

-----
 File: t2:
SHALL Sheriff Sight site  Sight Sharpened Sharp
see   Shoot   Shoot shoot See   Sharpened

-----
 Results, considering _ as a character:
12

-----
Content of intermediate files (columnized by local utility):

-----
 File: t1:
2     Is       to    We    site  on      to    Go           un        total
SHALL Sheriff  Shoot are   but   Sight   See   USA          Sharpened
we    Nokill   on    agin  OK    1       move  1            1        
see   allowing Sight Shoot with  Nothing along un_Sharpened Sharp    
3     us       4     on    shoot here    0     1            13       

-----
 File: t2:
SHALL see Sheriff Shoot Sight Shoot site shoot Sight See Sharpened Sharp

Best wishes ... cheers, drl

RudiC · March 30, 2019, 1:45pm

How about this, given that your grep accepts "regular expression extensions" like \b, which are not necessarily available on all systems / versions:

grep -o "\b[sS][^ ]*" <<< "String Filter increment. The most basic implementation that can be used for all data types is shown on the far left. Each version to the right of that adds more functionality."
String
shown

Replace the "here string" with your input file when applying / testing it on your system.
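
As a rough sketch of what that could look like against a file: the two sample lines below are borrowed from drl's data3 file above, while the temporary file, the variable names and the [[:alnum:]]* variant are assumptions for illustration. Like the command above, it assumes a grep that supports -o and \b (GNU grep does).

```shell
# Two lines borrowed from the data3 sample; the temp file is illustrative.
sample=$(mktemp)
printf '%s\n' 'Nothing here to See, move along.' 'Is Sheriff Nokill allowing us to Shoot on Sight.' > "$sample"

# Pattern as posted: runs up to the next space, so punctuation stays attached.
with_punct=$(grep -o '\b[sS][^ ]*' "$sample")

# Variant: stop at the first character that is not a letter or digit.
words_only=$(grep -o '\b[sS][[:alnum:]]*' "$sample")

# Count the words, then print a frequency list of the distinct ones.
count=$(printf '%s\n' "$words_only" | wc -l | tr -d ' ')
echo "count: $count"
printf '%s\n' "$words_only" | sort | uniq -c
rm -f "$sample"
```

Which of the two patterns is right depends on whether trailing punctuation should be treated as part of the word.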

Don_Cragun · March 30, 2019, 2:23pm

It helps to know what operating system and shell you're using since the utilities on various operating systems have options that might help with what you are trying to do that are not available on other operating systems. Whenever you start a new thread here, please always tell us what shell and operating system you're using.

We need a much clearer definition of what you consider to be a word starting with "s" or "S". Is a word just alphabetic characters? Can hyphens be included in a word (e.g., sub-sonic)? Can numeric characters be included in a word (e.g., "straight-6")? Can apostrophes be included in a word (e.g., "she's")? When apostrophes can be included in words, how are we to distinguish between a phrase surrounded by single-quotes and a word containing an apostrophe? (Note that regular expressions in grep can only look at a single line and quoted strings can cross many line boundaries.)

And, if we can't see a sample of the data you're working with and the output you expect to get from it, we have no way to verify that anything we might suggest might work for you. Please give us a representative sample input and the corresponding exact output you hope to produce from that input.
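
To illustrate why the definition matters, here is a small sketch that counts "words starting with s" in the same made-up sentence under two different definitions. The sentence, the variable names and the two definitions are all assumptions chosen for illustration; neither is claimed to be the right one for your data.

```shell
# Hypothetical one-line sample that exercises the edge cases listed above.
sample=$(mktemp)
printf '%s\n' "She's sub-sonic in a straight-6, so 'slow' is out." > "$sample"

# Definition 1: a word is a run of letters only. Everything else is a
# separator, so "She's" splits into "She" and "s", "sub-sonic" into two words.
letters_only=$(tr -cs '[:alpha:]' '\n' < "$sample" | grep -ci '^s')

# Definition 2: letters, digits, apostrophes and hyphens all belong to a word.
# "She's" and "sub-sonic" stay whole, but the quoted 'slow' now starts with an
# apostrophe and is no longer counted.
with_marks=$(tr -cs "[:alnum:]'-" '\n' < "$sample" | grep -ci '^s')

echo "letters only:                  $letters_only"
echo "with apostrophes and hyphens:  $with_marks"
rm -f "$sample"
```

The two counts differ for the same input, which is why a representative sample and the expected output are needed before anyone can say which command is correct.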