I have a file with about a million records like the sample input above. I would be grateful if somebody could help me print the text between the first opening and closing parentheses. It is a little urgent, please.
NOTE: The parentheses may also contain other text, such as ANNUAL, so please help me print the text between the first "(" and the first ")".
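Since the sample input isn't shown here, a made-up line can illustrate what "between the first ( and )" means. One minimal sed sketch (assuming a hypothetical record like the one below): delete everything up to and including the first "(", then everything from the first ")" onward.

```shell
# Hypothetical sample record; the real input is not shown in this post.
printf 'FOO_BAR_(QUARTERLY)_2009\n' | sed -e 's/^[^(]*(//' -e 's/).*//'
# prints: QUARTERLY
```

`[^(]*` anchors the match to the *first* "(", which matters if a line could contain more than one pair of parentheses.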
A sed statement; not very good, but it works for this case:
time sed -e 's/.*_(\(.*[YH]\).*/\1/g' file >/dev/null
real 0m17.126s
user 0m16.878s
sys 0m0.151s
shockneck's cut command:
time cut -f2 -d"(" file | cut -f1 -d")" > /dev/null
real 0m8.056s
user 0m4.948s
sys 0m0.645s
my awk statement:
time awk -F '[()]' '{print $2}' file > /dev/null
real 0m5.048s
user 0m4.751s
sys 0m0.252s
ghostdog's awk statement:
time awk -F"(" '{gsub(/\).*/,"",$2);print $2}' file > /dev/null
real 0m5.634s
user 0m5.448s
sys 0m0.158s
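For anyone wondering why `-F '[()]'` works: awk treats a multi-character FS as a regex, so the line is split at every "(" and ")" and the text between the first pair lands in $2. A quick check with a made-up line (not the actual input):

```shell
printf 'FOO_BAR_(QUARTERLY)_2009\n' | awk -F '[()]' '{print $1; print $2; print $3}'
# prints:
#   FOO_BAR_
#   QUARTERLY
#   _2009
```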
rerun the test:
time sed -e 's/.*_(\(.*[YH]\).*/\1/g' file >/dev/null
real 0m15.135s
user 0m14.842s
sys 0m0.158s
time cut -f2 -d"(" file | cut -f1 -d")" > /dev/null
real 0m5.203s
user 0m4.624s
sys 0m0.573s
time awk -F '[()]' '{print $2}' file > /dev/null
real 0m2.497s
user 0m2.324s
sys 0m0.154s
time awk -F"(" '{gsub(/\).*/,"",$2);print $2}' file > /dev/null
real 0m8.524s
user 0m7.952s
sys 0m0.247s
Linux awk seems to be more efficient with the two field separators. I'll try the test on AIX 6.1 on Power6 on Monday.
gogo needs more solutions; it's getting interesting here.
Edit: I'll try to find a system with no load; the differences between the two runs are very big.
gawk should be available in the Linux toolbox for AIX, so perhaps I'll try that too.
Edit: perl is very fast too, same statement as with sed:
time perl -pe 's/.*_\((.*[YH]).*/\1/g' file >/dev/null
real 0m6.198s
user 0m6.014s
sys 0m0.163s
# rerun after a 10 second delay
time perl -pe 's/.*_\((.*[YH]).*/\1/g' file >/dev/null
real 0m5.897s
user 0m5.768s
sys 0m0.125s
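Since more solutions were asked for, here is one more candidate as a sketch: pure shell parameter expansion, with no external command at all. On a million lines the read loop will likely be slower than awk, but it avoids spawning processes entirely (works in ksh and bash; `file` is the input file from the thread):

```shell
#!/usr/bin/ksh
# For each line, strip up to and including the first "(",
# then strip from the first ")" to the end of the line.
while IFS= read -r line; do
    s=${line#*\(}
    printf '%s\n' "${s%%)*}"
done < file
```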
I found your awk time result impressive. The time returned is much shorter than what I got when using awk during my tests. To put this difference into figures, I reproduced your examples on a 64-bit AIX TL6 system. As my list has approx. 1.2 million lines, the runtimes are not directly comparable to your test, but it may be interesting to see how the results relate to each other (from the AIX system):
> time sed -e 's/.*_(\(.*[YH]\).*/\1/g' cutlist > /dev/null
real 0m28.00s
user 0m27.26s
sys 0m0.70s
> time cut -f2 -d"(" cutlist | cut -f1 -d")" > /dev/null
real 0m4.81s
user 0m6.00s
sys 0m0.28s
> time awk -F '[()]' '{print $2}' cutlist > /dev/null
real 0m14.22s
user 0m13.95s
sys 0m0.19s
> time awk -F"(" '{gsub(/\).*/,"",$2);print $2}' cutlist > /dev/null
real 0m13.14s
user 0m12.89s
sys 0m0.19s
The price for using AIX awk and/or sed is definitely higher than for the Linux equivalents. Surprisingly, AIX cut seems to be faster than Linux cut, though.
Unfortunately, gawk is faster on Linux x86 systems but not on PPC AIX. I tried the tests on a p550 with AIX 6.1; the LPAR has 1 virtual CPU, uncapped.
the script:
#!/usr/bin/ksh
set -x
time sed -e 's/.*_(\(.*[YH]\).*/\1/g' file >/dev/null
time cut -f2 -d"(" file | cut -f1 -d")" > /dev/null
time awk -F '[()]' '{print $2}' file > /dev/null
time gawk -F '[()]' '{print $2}' file > /dev/null
time awk -F"(" '{gsub(/\).*/,"",$2);print $2}' file > /dev/null
time gawk -F"(" '{gsub(/\).*/,"",$2);print $2}' file > /dev/null
time perl -pe 's/.*_\((.*[YH]).*/\1/g' file >/dev/null
time sed -e 's/).*//' -e 's/^.*(//' file >/dev/null
output:
-->./runscript
+ sed -e s/.*_(\(.*[YH]\).*/\1/g file
+ 1> /dev/null
real 0m8.07s
user 0m7.33s
sys 0m0.17s
+ cut -f1 -d)
+ cut -f2 -d( file
+ 1> /dev/null
real 0m1.78s
user 0m0.46s
sys 0m0.02s
+ awk -F [()] {print $2} file
+ 1> /dev/null
real 0m4.31s
user 0m5.46s
sys 0m0.09s
+ gawk -F [()] {print $2} file
+ 1> /dev/null
real 0m6.10s
user 0m5.23s
sys 0m0.41s
+ awk -F( {gsub(/\).*/,"",$2);print $2} file
+ 1> /dev/null
real 0m3.84s
user 0m3.47s
sys 0m0.04s
+ gawk -F( {gsub(/\).*/,"",$2);print $2} file
+ 1> /dev/null
real 0m7.33s
user 0m6.13s
sys 0m0.40s
+ perl -pe s/.*_\((.*[YH]).*/\1/g file
+ 1> /dev/null
real 0m10.34s
user 0m9.27s
sys 0m0.06s
+ sed -e s/).*// -e s/^.*(// file
+ 1> /dev/null
real 0m4.66s
user 0m3.98s
sys 0m0.18s
the same test on a high-performance SAP LPAR, idle at the time, 3 virtual CPUs, uncapped, Power6 4.7 GHz, AIX 5.3 ML09:
./runscript
+ sed -e s/.*_(\(.*[YH]\).*/\1/g file
+ 1> /dev/null
real 0m4.77s
user 0m4.61s
sys 0m0.09s
+ cut -f1 -d)
+ cut -f2 -d( file
+ 1> /dev/null
real 0m1.66s
user 0m2.09s
sys 0m0.06s
+ awk -F [()] {print $2} file
+ 1> /dev/null
real 0m2.51s
user 0m2.45s
sys 0m0.03s
+ gawk -F [()] {print $2} file
+ 1> /dev/null
real 0m3.79s
user 0m3.46s
sys 0m0.31s
+ awk -F( {gsub(/\).*/,"",$2);print $2} file
+ 1> /dev/null
real 0m2.38s
user 0m2.33s
sys 0m0.03s
+ gawk -F( {gsub(/\).*/,"",$2);print $2} file
+ 1> /dev/null
real 0m4.41s
user 0m3.97s
sys 0m0.33s
+ perl -pe s/.*_\((.*[YH]).*/\1/g file
+ 1> /dev/null
real 0m4.49s
user 0m4.41s
sys 0m0.04s
+ sed -e s/).*// -e s/^.*(// file
+ 1> /dev/null
real 0m2.50s
user 0m2.39s
sys 0m0.09s
your cut command was very fast on all systems, as we expected
gawk was slow in comparison to awk; the reason, I think, is that gawk was compiled with gcc and not with xlc, if that's possible
I was told a long time ago that a gcc-compiled binary can't use multiple cores; I don't know if that's still the case
When testing for performance, take into account that the first test you run will put (parts of) the input file into the file cache; any subsequently run program will benefit from the data already being in cache.
I don't know whether you have taken this into account, but the huge performance difference between sed and awk, together with sed always being run first, looks suspicious to me.
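One way to control for the cache effect, as a sketch: on Linux you can drop the page cache between runs (requires root, guarded below so the script still runs without it); a portable alternative that also works on AIX is to warm the cache once with a plain `cat` before timing anything, so every candidate reads the file from cache and the comparison is fair. `file` is the input file from the thread.

```shell
#!/bin/sh
# Linux only: drop the page cache for genuinely cold-cache timings.
# Guarded so the script does not fail when run without root.
if [ -w /proc/sys/vm/drop_caches ]; then
    sync
    echo 3 > /proc/sys/vm/drop_caches
fi

# Portable alternative: warm the cache once before timing, so all
# candidates benefit equally from cached data.
cat file > /dev/null
time awk -F '[()]' '{print $2}' file > /dev/null
time sed -e 's/).*//' -e 's/^.*(//' file > /dev/null
```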