Print text between two delimiters

Hi,

Can somebody help me with the situation below?

Input file:

2007_08_07_IA-0100-014_(MONTHLY).PDF
2007_08_07_IA-0100-031_(QUARTERLY)(RERUN).PDF
2008-02-28_KR-1022-003_(MONTH)(RERUN)(REC1).CSV

Required output:

MONTHLY
QUARTERLY
MONTH

I have a file with a million records like the sample input above. I would be grateful if somebody could help me print the text between the first opening and closing parentheses. It is a little urgent, please.

NOTE: There may also be other text, like ANNUAL etc., inside the parentheses, so please help me print the text between the first "(" and ")".

Thanks in advance

 # awk -F"(" '{gsub(/\).*/,"",$2);print $2}' file
MONTHLY
QUARTERLY
MONTH
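
For anyone reading along, here is the same one-liner spelled out with comments (functionally identical to the command above):

awk -F"(" '{
        # the field separator "(" splits the line, so for the sample data $2 looks like "MONTHLY).PDF"
        gsub(/\).*/, "", $2)    # delete everything from the first ")" to the end of $2
        print $2                # what remains is the text between the first "(" and ")"
}' file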


Thank you so much, it is working fine... :b:

awk -F '[()]' '{print $2}' file
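
(The field separator here is the character class [()], i.e. either parenthesis, so $2 is whatever sits between the first "(" and the first ")" on each line.)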

Thank you funksen... The output is very fast..

Thank you all guys...

On a large file, using -F'[()]' may not be that fast. For example, some experimentation on a file with a million lines follows.
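
(If you want to reproduce this, one hypothetical way to build such a test file from the three sample lines is sketched below; the post does not show how file1 was actually created.)

# hypothetical recipe for a roughly 1,000,000 line test file built from the sample lines
printf '%s\n' \
    '2007_08_07_IA-0100-014_(MONTHLY).PDF' \
    '2007_08_07_IA-0100-031_(QUARTERLY)(RERUN).PDF' \
    '2008-02-28_KR-1022-003_(MONTH)(RERUN)(REC1).CSV' > sample
awk '{ lines[NR] = $0 }
     END { for (i = 1; i <= 333334; i++)
               for (j = 1; j <= NR; j++)
                   print lines[j] }' sample > file1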

# wc -l < file1
1000000
# head -10 file1
2007_08_07_IA-0100-014_(MONTHLY).PDF
2007_08_07_IA-0100-031_(QUARTERLY)(RERUN).PDF
2008-02-28_KR-1022-003_(MONTH)(RERUN)(REC1).CSV
2007_08_07_IA-0100-014_(MONTHLY).PDF
2007_08_07_IA-0100-031_(QUARTERLY)(RERUN).PDF
2008-02-28_KR-1022-003_(MONTH)(RERUN)(REC1).CSV
2007_08_07_IA-0100-014_(MONTHLY).PDF
2007_08_07_IA-0100-031_(QUARTERLY)(RERUN).PDF
2008-02-28_KR-1022-003_(MONTH)(RERUN)(REC1).CSV
2007_08_07_IA-0100-014_(MONTHLY).PDF

# time awk -F '[()]' '{print $2}' file1  > /dev/null

real    0m30.326s
user    0m30.194s
sys     0m0.116s

# time awk -F"(" '{gsub(/\).*/,"",$2);print $2}' file1 > /dev/null

real    0m11.002s
user    0m10.929s
sys     0m0.064s

# time awk -F '[()]' '{print $2}' file1  > /dev/null

real    0m30.035s
user    0m29.966s
sys     0m0.044s

# time awk -F"(" '{gsub(/\).*/,"",$2);print $2}' file1 > /dev/null

real    0m11.195s
user    0m11.033s
sys     0m0.136s

You can influence speed by choosing different tools. I used your code snippet to create a similar file that has about 1.2 million lines:

> head cutlist
2007_08_07_IA-0100-014_(MONTHLY).PDF
2007_08_07_IA-0100-031_(QUARTERLY)(RERUN).PDF
2008-02-28_KR-1022-003_(MONTH)(RERUN)(REC1).CS
2007_08_07_IA-0100-014_(MONTHLY).PDF
2007_08_07_IA-0100-031_(QUARTERLY)(RERUN).PDF
2008-02-28_KR-1022-003_(MONTH)(RERUN)(REC1).CS
2007_08_07_IA-0100-014_(MONTHLY).PDF
2007_08_07_IA-0100-031_(QUARTERLY)(RERUN).PDF
2008-02-28_KR-1022-003_(MONTH)(RERUN)(REC1).CS
2007_08_07_IA-0100-014_(MONTHLY).PDF
>
> wc -l cutlist
 1232454 cutlist
>

Reading the file from disk takes a fraction of a second only. The closer we get to this speed the better.

 > time cat cutlist > /dev/null

real    0m0.41s
user    0m0.04s
sys     0m0.37s

First I used sed to truncate the field:

time sed -e 's/^.*(//' -e 's/).*//' cutlist >/dev/null

real    0m13.34s
user    0m12.36s
sys     0m0.91s

Afterwards I used a less demanding program:

time cut -f2 -d"(" cutlist | cut -f1 -d")" > /dev/null

real    0m5.06s
user    0m6.02s
sys     0m0.57s
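
(A likely reason: cut only scans for a single literal delimiter character and never starts up a regex engine, so it stays closest to the raw cat time.)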

Enjoy!

Your sed command's output is different from what's required.

I tried the tests on a 64-bit Fedora 10 running on VMware:

 cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 107
model name      : AMD Athlon(tm) 64 X2 Dual Core Processor 6000+
stepping        : 2
cpu MHz         : 3099.996
cache size      : 512 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow rep_good nopl pni cx16 lahf_lm extapic 3dnowprefetch
bogomips        : 6199.99
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc 100mhzsteps

processor       : 1
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 107
model name      : AMD Athlon(tm) 64 X2 Dual Core Processor 6000+
stepping        : 2
cpu MHz         : 3099.996
cache size      : 512 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow rep_good nopl pni cx16 lahf_lm extapic 3dnowprefetch
bogomips        : 6201.39
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc 100mhzsteps
head -10 file
2007_08_07_IA-0100-014_(MONTHLY).PDF
2007_08_07_IA-0100-031_(QUARTERLY)(RERUN).PDF
2008-02-28_KR-1022-003_(MONTH)(RERUN)(REC1).CSV
2007_08_07_IA-0100-014_(MONTHLY).PDF
2007_08_07_IA-0100-031_(QUARTERLY)(RERUN).PDF
2008-02-28_KR-1022-003_(MONTH)(RERUN)(REC1).CSV
2007_08_07_IA-0100-014_(MONTHLY).PDF
2007_08_07_IA-0100-031_(QUARTERLY)(RERUN).PDF
2008-02-28_KR-1022-003_(MONTH)(RERUN)(REC1).CSV
2007_08_07_IA-0100-014_(MONTHLY).PDF
wc -l file
1000000 file

The sed statement, not very good but working for this case:

time sed -e 's/.*_(\(.*[YH]\).*/\1/g' file >/dev/null


real    0m17.126s
user    0m16.878s
sys     0m0.151s

shockneck's cut command:

 time cut -f2 -d"(" file | cut -f1 -d")" > /dev/null

real    0m8.056s
user    0m4.948s
sys     0m0.645s

my awk statement:

time awk -F '[()]' '{print $2}' file  > /dev/null

real    0m5.048s
user    0m4.751s
sys     0m0.252s

ghostdog's awk statement:

time awk -F"(" '{gsub(/\).*/,"",$2);print $2}' file > /dev/null

real    0m5.634s
user    0m5.448s
sys     0m0.158s

rerun the test:

time sed -e 's/.*_(\(.*[YH]\).*/\1/g' file >/dev/null

real    0m15.135s
user    0m14.842s
sys     0m0.158s


time cut -f2 -d"(" file | cut -f1 -d")" > /dev/null

real    0m5.203s
user    0m4.624s
sys     0m0.573s

time awk -F '[()]' '{print $2}' file  > /dev/null

real    0m2.497s
user    0m2.324s
sys     0m0.154s

time awk -F"(" '{gsub(/\).*/,"",$2);print $2}' file > /dev/null

real    0m8.524s
user    0m7.952s
sys     0m0.247s

Linux awk seems to be more efficient with two field separators. I'll try the test on AIX 6.1 Power6 on Monday.

gogo need more solutions, getting interesting here :wink:

Edit: I'll try to find a system with no load; the differences between the two runs are very big.
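
(For completeness, one more extraction approach that could be added to the timing comparison; just a sketch, it was not benchmarked in this thread:)

# avoid field splitting and regexes altogether: locate the first "(" and ")"
# and print the substring between them
awk '{ s = index($0, "("); e = index($0, ")"); if (s && e > s) print substr($0, s + 1, e - s - 1) }' file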

Linux's 'awk' is probably 'gawk' "under the covers":
awk --version

Yep, it is.

gawk should be available in the Linux toolbox for AIX; perhaps I'll try that too.

Edit: perl is very fast too, same statement as with sed:

time perl -pe 's/.*_\((.*[YH]).*/\1/g' file >/dev/null

real    0m6.198s
user    0m6.014s
sys     0m0.163s

#10 seconds delay

time perl -pe 's/.*_\((.*[YH]).*/\1/g' file >/dev/null

real    0m5.897s
user    0m5.768s
sys     0m0.125s
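
(A more general perl form that does not depend on the value ending in Y or H might also be worth timing; a sketch only, not benchmarked here:)

# print whatever sits between the first "(" and ")" on each line
perl -nle 'print $1 if /\(([^)]*)\)/' file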

True, a cut-and-paste error. Sorry about that. The correct line has the two sed expressions in the opposite order:

sed -e 's/).*//' -e 's/^.*(//' cutlist
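
(With the expressions in this order, the line is truncated at the first ")" before the greedy ".*(" runs, so only one "(" is left for it to match. In the original order the greedy match skipped ahead to the last "(", which is why the second sample line produced RERUN instead of QUARTERLY.)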

I found your awk timing impressive. The time returned is so much shorter than what I got when using awk during my tests. To put this difference into figures, I reproduced your examples on a 64-bit AIX TL6 system. As my list has approx. 1.2 million lines, the runtimes are not easily comparable to your test, but it may be interesting to see how the results relate to each other (from the AIX system):

> time sed -e 's/.*_(\(.*[YH]\).*/\1/g' cutlist > /dev/null

real    0m28.00s
user    0m27.26s
sys     0m0.70s

> time cut -f2 -d"(" cutlist | cut -f1 -d")" > /dev/null

real    0m4.81s
user    0m6.00s
sys     0m0.28s

> time awk -F '[()]' '{print $2}' cutlist > /dev/null

real    0m14.22s
user    0m13.95s
sys     0m0.19s

> time awk -F"(" '{gsub(/\).*/,"",$2);print $2}' cutlist > /dev/null

real    0m13.14s
user    0m12.89s
sys     0m0.19s

The price for using AIX awk and/or sed is definitely higher than for the Linux equivalents. Surprisingly, AIX cut seems to be faster than Linux cut, though.

Unfortunately, gawk is faster on Linux x86 systems, but not on PPC AIX. I tried the tests on a p550 running AIX 6.1; the LPAR has 1 virtual CPU, uncapped:

the script:

#!/usr/bin/ksh
set -x
time sed -e 's/.*_(\(.*[YH]\).*/\1/g' file >/dev/null
time cut -f2 -d"(" file | cut -f1 -d")" > /dev/null
time awk -F '[()]' '{print $2}' file  > /dev/null
time gawk -F '[()]' '{print $2}' file  > /dev/null
time awk -F"(" '{gsub(/\).*/,"",$2);print $2}' file > /dev/null
time gawk -F"(" '{gsub(/\).*/,"",$2);print $2}' file > /dev/null
time perl -pe 's/.*_\((.*[YH]).*/\1/g' file >/dev/null
time sed -e 's/).*//' -e 's/^.*(//' file >/dev/null
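
(set -x makes ksh echo each command before it runs, which is why every timing in the output below is preceded by lines starting with "+".)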

output:

-->./runscript              
+ sed -e s/.*_(\(.*[YH]\).*/\1/g file       
+ 1> /dev/null                              

real    0m8.07s
user    0m7.33s
sys     0m0.17s
+ cut -f1 -d)
+ cut -f2 -d( file
+ 1> /dev/null

real    0m1.78s
user    0m0.46s
sys     0m0.02s
+ awk -F [()] {print $2} file
+ 1> /dev/null

real    0m4.31s
user    0m5.46s
sys     0m0.09s
+ gawk -F [()] {print $2} file
+ 1> /dev/null

real    0m6.10s
user    0m5.23s
sys     0m0.41s
+ awk -F( {gsub(/\).*/,"",$2);print $2} file
+ 1> /dev/null

real    0m3.84s
user    0m3.47s
sys     0m0.04s
+ gawk -F( {gsub(/\).*/,"",$2);print $2} file
+ 1> /dev/null

real    0m7.33s
user    0m6.13s
sys     0m0.40s
+ perl -pe s/.*_\((.*[YH]).*/\1/g file
+ 1> /dev/null

real    0m10.34s
user    0m9.27s
sys     0m0.06s
+ sed -e s/).*// -e s/^.*(// file
+ 1> /dev/null

real    0m4.66s
user    0m3.98s
sys     0m0.18s

The same test on a high-performance SAP LPAR, idle at that time, 3 virtual CPUs, uncapped, Power6 4.7 GHz, AIX 5.3 ML09:

./runscript 
+ sed -e s/.*_(\(.*[YH]\).*/\1/g file
+ 1> /dev/null                       

real    0m4.77s
user    0m4.61s
sys     0m0.09s
+ cut -f1 -d)
+ cut -f2 -d( file
+ 1> /dev/null

real    0m1.66s
user    0m2.09s
sys     0m0.06s
+ awk -F [()] {print $2} file
+ 1> /dev/null

real    0m2.51s
user    0m2.45s
sys     0m0.03s
+ gawk -F [()] {print $2} file
+ 1> /dev/null

real    0m3.79s
user    0m3.46s
sys     0m0.31s
+ awk -F( {gsub(/\).*/,"",$2);print $2} file
+ 1> /dev/null

real    0m2.38s
user    0m2.33s
sys     0m0.03s
+ gawk -F( {gsub(/\).*/,"",$2);print $2} file
+ 1> /dev/null

real    0m4.41s
user    0m3.97s
sys     0m0.33s
+ perl -pe s/.*_\((.*[YH]).*/\1/g file
+ 1> /dev/null

real    0m4.49s
user    0m4.41s
sys     0m0.04s
+ sed -e s/).*// -e s/^.*(// file
+ 1> /dev/null

real    0m2.50s
user    0m2.39s
sys     0m0.09s

Your cut command was very fast on all systems, as we expected.
gawk was slow in comparison to awk; the reason, I think, is that gawk was compiled with gcc and not with xlc, if that's possible.

A gcc-compiled binary can't use multiple cores, I was told a long time ago; I don't know if that's still the case.

Edit:

here is the mpstat output

mpstat 10 1 > log &
awk -F"(" '{gsub(/\).*/,"",$2);print $2}' file > /dev/null

cat log

System configuration: lcpu=6 ent=0.3 mode=Uncapped

cpu  min  maj  mpc  int   cs  ics   rq  mig lpa sysc us sy wa id   pc  %ec  lcs
  0  753    0    0  801  509  223    0    3 100 4668 91  8  0  1 0.27 88.7  564
  1    0    0    0   21    0    0    0    0   -    0  0 12  0 88 0.00  0.3   21
  6  415    0    0   88    3    2    0    4 100  989 44 46  0 10 0.01  2.0   93
  7    0    0    0   20    0    0    0    0   -    0  0 31  0 69 0.00  0.1   20
 10    0    0    0   20    0    0    0    0   -    0  0 36  0 64 0.00  0.1   20
 11    0    0    0   20    0    0    0    0   -    0  0 36  0 64 0.00  0.1   20
  U    -    -    -    -    -    -    -    -   -    -  -  -  0  9 0.03  8.6    -
ALL 1168    0    0  970  512  225    0    7 100 5657 82  8  0 10 0.27 91.4  738

same command with gawk:

cat log

System configuration: lcpu=6 ent=0.3 mode=Uncapped

cpu  min  maj  mpc  int   cs  ics   rq  mig lpa sysc us sy wa id   pc  %ec  lcs
  0 2620    0    0  669  501  221    0   23 100 65960 79 20  0  1 0.18 36.1  469
  1 1035    0    0   76  117   50    0   51 100 9124 69 28  0  3 0.03  5.5  112
  6    0    0    0  244    0    0    0    1 100 136396 92  8  0  0 0.29 58.4  248
  7   24    0    0   19    0    0    0    4 100   44  8 26  0 66 0.00  0.2   21
 10    0    0    0   19    0    0    0    0   -    0  0 36  0 64 0.00  0.1   19
 11    0    0    0   19    0    0    0    0   -    0  0 36  0 64 0.00  0.1   19
ALL 3679    0    0 1046  618  271    0   79 100 211524 86 13  0  1 0.50 166.3  888

gawk produces a lot more system calls on AIX.
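
(If someone wants to verify that, AIX truss can summarise the system calls of a command directly; a sketch, not run as part of these tests:)

# count the system calls each interpreter makes (the -c summary is written to stderr)
truss -c awk  -F '[()]' '{print $2}' file > /dev/null
truss -c gawk -F '[()]' '{print $2}' file > /dev/null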

When testing for performance, take into account that the first test you run will put (parts of) the input file into the file cache. Any subsequently run program will benefit from the data already being in the cache.

I don't know if you have taken this into account, but the huge performance difference between sed and awk, together with sed always being the one that runs first, looks suspicious to me.

bakunin

Hi bakunin,

Yes, I thought about that; that is why I ran the script twice and posted the second result.

The time it takes to open the file is minimal in comparison to the process runtime, but you are right, it should be taken into account.
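
(On the Linux box the cache effect can also be removed between runs rather than just warmed up; a sketch, assuming root and a 2.6.16 or later kernel:)

# drop the page cache so every timed run starts cold (Linux only, needs root)
sync
echo 3 > /proc/sys/vm/drop_caches
time awk -F '[()]' '{print $2}' file > /dev/null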