Which cut command is more efficient?

Hi,

I've got a query regarding which of the following is more efficient & why -

cat <filename>|cut -d'*' -f2- > <newfilename>
or 
cut -d'*' -f2- <filename> > <newfilename>

Thanks.

Spawning two processes can never be better than one...so based on this you can figure out which is more efficient.

Spawning two processes can be better if the single process does something stupid, but cat isn't even doing anything useful here. See useless use of cat.
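If the aim was just to keep cut reading from stdin, a redirect does that without the extra process (a minimal sketch; the filenames are placeholders):

cut -d'*' -f2- < filename > newfilename

The shell opens the file itself, so there is no pipe and no second process.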

That would be the exception and assumes the scripter doesn't know what to do with that one spawned process.

For a trivial file size the main overhead is loading the programs and opening the file, so reading the file directly with "cut" makes sense.
For large files the argument is less clear, unless the command is better than "cat" at reading data from disc.

For example, with a 600 MB text file:

timex cat bigfile|cut -f2 -d- >/dev/null

real       17.60
user        0.17
sys         2.62

timex cut -f2 -d- bigfile >/dev/null

real       17.45
user       16.10
sys         1.32

Here reading the file directly in "cut" is fractionally quicker but has a greater impact on the system overall.

Does this system have multiple cores? That'd be the only way I could explain that -- one core runs 'cat' and reads the file while the other runs 'cut' to process it. This may mean using twice as much CPU power for a nearly unmeasurable CPU gain.

Experts,

This is slightly beyond my comprehension but let me try to put it in my words to verify.

- Combining cat & cut spawns an extra process and uses more CPU threads, which is fine for smaller files.
- For bigger files, as in my case, it is better to run cut directly on the file. This results in optimum CPU utilization.

Please correct me.

This turned out to be a little more complex than I thought. Thanks.

Not so much "fine" as "negligible".

Correct. It's a bad habit in general -- test data tends to be small, so the problem isn't apparent; only when you make it do real work will you run into trouble.


@Corona688
Yes, 36 cores (9x4). CPU power not an issue. Regularly running over 30,000 concurrent processes.

The bottleneck when reading large files is invariably the disc system, closely followed by the software. This is where reading moderate-size files with "cat" scores over the read function in some unix utilities. I recognise that "cut" is actually one of the better ones.

For the advanced user with large data files I am not averse to using "dd" or "cpio" (or both) to read from the disc in an optimum manner.
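For illustration only, a sketch of the kind of thing I mean (this assumes GNU dd; "bigfile", "newfile" and the 1M block size are placeholders to tune for your disc subsystem):

dd if=bigfile bs=1M 2>/dev/null | cut -d'*' -f2- > newfile

Here dd does the disc reads in large fixed-size blocks and cut only ever sees a pipe.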

On a single core system running ancient unix it was very important to minimise the number of concurrent processes. This is really not the case nowadays unless you happen to be running unix on a home system.

Back to the O/P.
The conventional answer is that running more processes is less efficient. On a modern large system with multiple processors (i.e. the norm) it can be more efficient to run a pipeline of multiple efficient processes than to run a single inefficient process.
The "Useless use of cat" brigade have clearly never used a modern computer where apparent inefficiencies are in fact covered by proper utilisation of the software and hardware as a team.
By applying lateral thought we can deduce that hardware design evolution is actually targeted towards making inefficient processes efficient. We can take advantage of that by tactical use of the previously-inefficient processes.

Nuff said.

That hardly compares to the 'cat' being run here, untuned and untunable. You're doubling the amount of work done for a <1% improvement in speed -- and with 30,000 concurrent processes, that's time something else probably could've used.

In my early shell-scripting days I wrote scripts that I'm sure would need your 36 processors to function with any efficiency :smiley: Having CPU power to waste hardly makes it a good idea to do so.

Can you suggest one solid link or book that covers these basics?

My stats do not show a waste of CPU power. They show a CPU power saving by using "cat", because "cut" is less efficient at reading files. However, in a single-stream environment, loading multiple processes into a long pipeline would have been a performance disaster.

In my early days of unix I dealt with system crashes caused by, for example: too many concurrent processes; too many forks; disc buffer overload; mysterious kernel crashes; etc. It's hard to even generate these situations on modern systems after an initial large-scale kernel build.

There is a very good O'Reilly book "System Performance Tuning" but do bear in mind that it does not cover very large unix systems adequately.

What makes you conclude that "cut" is less efficient at reading files than cat?

So reading the file, writing it to a pipe, and reading from the pipe, utilizing two separate CPU's simultaneously is more efficient than reading it once and using it once? If your CPU benchmarks show that this uses less CPU, frankly, they're wrong. Less total real time maybe, but nothing in that reduces the amount of CPU cut uses -- adding more commands can only add more CPU utilization.

The only performance benefit I can see is the pipe effectively acts as a read-ahead buffer, albeit a highly expensive one. With the power expended for that 1% performance improvement, how much more actual work could have been done instead by running two instances of cut on different data sets?
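If you want to see where the CPU actually goes, one rough way to check (a sketch; this assumes bash's built-in time, and "bigfile" is just a placeholder) is to time each stage on its own and then the pipeline:

time cat bigfile > /dev/null                     # raw read+write cost of cat alone
time cut -d'*' -f2- bigfile > /dev/null          # cut reading the file itself
time cat bigfile | cut -d'*' -f2- > /dev/null    # the full pipeline

The user+sys figures of the pipeline should roughly equal the sum of the first two; any saving in real time comes only from the overlap between the reader and the consumer.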


cut has to read line by line. cat can just read and write huge blocks.
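One quick way to check that on Linux, if strace is available (a sketch; "bigfile" is a placeholder):

strace -c -e trace=read cat bigfile > /dev/null
strace -c -e trace=read cut -d'*' -f2- bigfile > /dev/null

The syscall summary on stderr shows how many read() calls each one needed for the same file; fewer calls means bigger blocks per read.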

cat's behavior can be altered by command-line flags, so why would it read in big chunks? Wouldn't block I/O on long lines make its output binary?

Because it usually doesn't need to care about line boundaries. Stopping in the middle of a line won't make the output "binary" (though cat without parameters ought to be binary-safe). The block size it does I/O in won't change the content of that I/O; a line too long for one block will simply take a couple of blocks to finish.


You are right... I was thinking about reading the file into a struct whose padding might emit junk characters, but I was wrong. However, here is a link to a recent version of cat, and it does process the command line with and without flags differently.

Hi.

Along this line, here is something similar to an exercise I usually had students do:

#!/usr/bin/env bash

# @(#) s1	Demonstrate cat copying a system executable file.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
pe() { for i;do printf "%s" "$i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for i;do printf "%s" "$i";done; printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && . $C cat

# Remove debris, list current situation.
pl " Current situation:"
rm -f f1
/bin/ls -lgG

# Copy executable with cat, look at type, make it executable.
pl " New file characteristics:"
cat /bin/ls > f1
file f1
chmod +x f1

# Run it.
pl " Results of executing copy of file:"
./f1 -lgG

# Compare the files with cmp.
pl " Results of comparison:"
if cmp --quiet /bin/ls f1
then
  pe " Files are the same according to cmp."
else
  pe " Files differ."
fi

exit 0

producing:

% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0.7 (lenny) 
GNU bash 3.2.39
cat (GNU coreutils) 6.10

-----
 Current situation:
total 4
-rwxr--r-- 1 836 Mar 28 16:44 s1

-----
 New file characteristics:
f1: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.8, stripped

-----
 Results of executing copy of file:
total 108
-rwxr-xr-x 1 101992 Mar 28 16:45 f1
-rwxr--r-- 1    836 Mar 28 16:44 s1

-----
 Results of comparison:
 Files are the same according to cmp.