Print number of lines for files in directory, also print number of unique lines

I have a directory of files. I can show the number of lines in each file, ordered from lowest to highest, with:

wc -l * | sort

15263 Image.txt
16401 reference.txt
40459 richtexteditor.txt

How can I also print the number of unique lines in each file?

15263 1401 Image.txt
16401 15999 reference.txt
40459 35670 richtexteditor.txt

If this is possible, how could I also sort it by unique vs overall count?

How about this:

#!/bin/ksh
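# sed '$d' drops the trailing "total" line that wc prints for multiple files;
# uniq -u counts lines that occur exactly once, not the number of distinct lines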

wc -l * | sed '$d' | sort | while read lines file junk
do
   echo $lines $(sort < $file | uniq -u | wc -l) $file
done

awk '{u[$0]; l++} ENDFILE {print length(u), l, FILENAME; delete u; l=0}' * | sort -k1,1n

This is gawk-specific AND it does not count the unique lines correctly.
How about another version:

gawk '{l[$0]++} ENDFILE {for (i in l) {if (l[i]==1) u++; t+=l[i]} print t, u, FILENAME; delete l; u=t=0}' *

Test files:
file1:

1
2
3
3
4
1

file2:

1
2
3
3
4
1
5

Please always tell us what shell and operating system you're using when you start a new thread. Don't assume that everyone who wants to help you has read all of your previous threads.

#!/bin/bash
tmpf="/tmp/$$.result"

trap 'rm -f "$tmpf"' EXIT

awk '
function dump() {
	print linecount, distinct, lastfile
	linecount = distinct = 0
	split("", lines)
}

FILENAME != lastfile {
	if(lastfile)
		dump()
	lastfile = FILENAME
}

{	linecount++
	if(lines[$0]++ == 0)
		distinct++
}

END {	dump()
}' * > "$tmpf"

echo 'Sorted by increasing number of lines in files:'
sort -n "$tmpf"

echo 'Sorted by increasing number of distinct lines in files:'
sort -k2,2n "$tmpf"
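
Assuming the directory contains only the two test files shown earlier, the output should look like:

Sorted by increasing number of lines in files:
6 4 file1
7 5 file2
Sorted by increasing number of distinct lines in files:
6 4 file1
7 5 file2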

Note that this should work with any version of awk (but on Solaris systems, you'll need to use nawk or /usr/xpg4/bin/awk).


Suggestion with regular awk:

awk '
FNR==1 {
  filenr++
  Name[filenr]=FILENAME
}

!Seen[filenr,$0]++ {
  Uniq[filenr]++
} 

{
  Total[filenr]++
} 

END {
  for(i in Name)
    print Total[i], Uniq[i], Name[i]
}
' file* | sort -nk1,1 -nk2,2 -k3,3 
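
With the two test files shown earlier, this should print:

6 4 file1
7 5 file2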

The following variant correctly handles filenames with special characters:

for f in *; do printf "%s/%s lines are unique in file %s\n" $(sort "$f" | uniq -u | wc -l) $(wc -l < "$f") "$f"; done

Post #3 has another perception of "unique":

for f in *; do printf "%s/%s unique lines in file %s\n" $(sort -u "$f" | wc -l) $(wc -l < "$f") "$f"; done

Didn't see the "sort" requirement. Left as an exercise.
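
For reference, with the earlier test files, the first loop should report:

2/6 lines are unique in file file1
3/7 lines are unique in file file2

and the second:

4/6 unique lines in file file1
5/7 unique lines in file file2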


FYI - just tried this: it printed the correct line counts, but the unique counts were off. I will check the others and update.

Thanks nezabudka!! This seems to work with gawk -- thanks also vgersh99 for pointing out gawk. I tried your other gawk version, but the counts are still off, as in your original solution -- maybe uniq is not being done in the correct order?

Thanks Don Cragun -- this also works!

Thanks MadeInGermany; the first gives the same unique count as vgersh99's, but the second works for me. Maybe my perception of unique is incorrect :)

I'm getting my unique count by:

sort filename | uniq | wc -l

The contents of my files are URLs if that makes a difference.
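
Note that sort | uniq (or equivalently sort -u) counts distinct lines, whereas uniq -u counts only the lines that occur exactly once. A quick way to see the difference on a toy input:

printf 'a\na\nb\n' | sort | uniq | wc -l
2
printf 'a\na\nb\n' | sort | uniq -u | wc -l
1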


Worked just fine with my test harness files quoted previously!

For the fun of it (tee feeds one copy of the sorted stream through uniq -u | wc -l onto file descriptor 3 and the other through wc -l; the 3>&1 merges both counts back onto standard output):

for FN in *; do { sort "$FN" | tee >(uniq -u | wc -l >&3) | wc -l; echo "$FN"; } 3>&1; done | paste -s -d"\t\t\n" | sort -n

You might note that the suggestion in post #5 in this thread invokes awk (using only standard awk features) once and sort twice, producing both of the requested sorted outputs. Unlike some of the scripts in this thread, it doesn't need multiple invocations of sort or tr per file processed. And, the awk script processes one file at a time, keeping only unique lines from that file (rather than keeping unique lines in memory from all files being processed). When the files being processed contain tens of thousands of input lines and tens of thousands of lines from most of those files are unique, that can chew up a lot of system resources.

And, although most of us corrected the use of sort without the n flag when sorting numeric values, none of us said why we did that. (If you use sort without the n flag, the sort performed is an alphanumeric sort; not a numeric sort. So, for example, the string 9 is alphanumerically greater than the string 100000 because the leading digit 9 in the first string is greater than the leading digit 1 in the second string. When the n flag is given to sort, it performs a numeric sort instead of an alphanumeric sort for the key fields to which the flag is attached.)
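
A quick demonstration of the difference:

printf '9\n100000\n' | sort
100000
9
printf '9\n100000\n' | sort -n
9
100000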


With nezabudka's gawk I get:

gawk '{u[$0]; l++} ENDFILE {print length(u), l, FILENAME; delete u; l=0}' * | sort -k1,1n
4 6 file1
5 7 file2

With yours I get:

gawk '{l[$0]++} ENDFILE {for (i in l) {if (l[i]==1) u++; t+=l[i]} print t, u, FILENAME; delete l; u=t=0}' *
6 2 file1
7 3 file2

I'm on a Mac running GNU Awk 5.0.1, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.1.2).

An adaptation of post #6 that uses less memory, keeping only one file's unique lines at a time (thanks Don):

awk '
  FNR==1 {
    filenr++
    Name[filenr]=FILENAME
    split("", Seen)
  }

  !Seen[$0]++ {
    Uniq[filenr]++
  } 

  {
    Total[filenr]++
  } 

  END {
    for(i in Name)
      print Total[i], Uniq[i], Name[i]
  }
' file* | sort -nk1,1 -nk2,2 -k3,3

If you change:

gawk '{l[$0]++} ENDFILE {for (i in l) {if (l[i]==1) u++; t+=l[i]} print t, u, FILENAME; delete l; u=t=0}' *

to:

gawk '{l[$0]++} ENDFILE {for (i in l) {u++; t+=l[i]} print t, u, FILENAME; delete l; u=t=0}' *

I think you'll get the results you want. (But, I don't have gawk installed on my system to verify that it works.)

Note that each subscript value represents a unique input line. So, there is no test needed to count the number of unique lines in a file. The test currently in that code counts a line as unique only if it appears in the file exactly once.
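
A minimal illustration that the for loop visits each distinct line exactly once:

printf '1\n1\n2\n' | gawk '{l[$0]++} END {for (i in l) u++; print u}'
2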


Similarly, if you change nezabudka's code from:

gawk '{u[$0]; l++} ENDFILE {print length(u), l, FILENAME; delete u; l=0}' * | sort -k1,1n

to:

gawk '{u[$0]; l++} ENDFILE {print l, length(u), FILENAME; delete u; l=0}' * | sort -k1,1n

I think you will also get what you want. (She printed the correct values in the wrong order.)
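
With the test files from earlier, that should print:

6 4 file1
7 5 file2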

----

Hi Scrutinizer,
Glad to have been able to help.

Cheers,
Don


[ Sorry, I did not read the previous page ]