nawk and regular expression

ejazsyed · October 3, 2011, 6:10pm

I have an input file which looks like the example below and I want to format it with 2 columns from the header based on the word "CUSIP" followed by a 9 digit string with first 3 being numeric and in the same line NNN.NN% pattern for the percentage value. (In RED)

I started of with nawk but with my limited knowledge of regex I am stuck any help would help.

Thanks

input file

[SourceFile]

****************************************************
* *
* THE FOLLOWING IS THE SOLICITOR MAIL FILE *
* *
****************************************************
1
THE DEPOSITORY TRUST COMPANY PAGE: 1
SPECIAL SECURITY POSITION LISTING PROGRAM: PXYD0001
FOR CUSIP / DESCRIPTION: 11617N-CD-1 / CDFS1.1%081811 BE+#
POSITIONS AS OF: 08/17/11
0--------------------------------
| PARTICIPANT | QUANTITY |
--------------------------------
| 101 |BANK OF NY| 111,000 |
| 111 |FRST CLEAR| 11,000 |
| 17 |JONES E D | 11,000 |
| 111 |JPMC CLEAR| 118,000 |
| 11 |MSSB | 11,000 |
| 116 |NFS LLC | 61,000 |
| 111 |PERSHING | 1,116,000 |
| 171 |SOUTHWEST | 111,000 |
|1111 |WFB/SAFEKP| 100,000 |

output file

Desired Output --- 2 new columns from header CUSIP# and %

[TargetFile]

| 101 |BANK OF NY| 111,000 |11617N-CD-1|1.1
| 111 |FRST CLEAR| 11,000 |11617N-CD-1|1.1
| 17 |JONES E D | 11,000 |11617N-CD-1|1.1
| 111 |JPMC CLEAR| 118,000 |11617N-CD-1|1.1
| 11 |MSSB | 11,000 |11617N-CD-1|1.1
| 116 |NFS LLC | 61,000 |11617N-CD-1|1.1
| 111 |PERSHING | 1,116,000 |11617N-CD-1|1.1
| 171 |SOUTHWEST | 111,000 |11617N-CD-1|1.1
|1111 |WFB/SAFEKP| 100,000 |11617N-CD-1|1.1

Corona688 · October 3, 2011, 6:13pm

Which parts of your solicitor mail file are actually in it? Put them in [code] tags so we can see.

ejazsyed · October 3, 2011, 6:16pm

Done in RED

Corona688 · October 3, 2011, 6:19pm

I know which parts you want. I just don't know which of the junk is labels and which is actually in the file -- it's seemingly labelled twice, so at least one of them must be in the file, if not both...

If you use code tags instead of trying to label things ---------------------------like this---------------------- then the bounds of the file will be obvious. And, usefully, color tags like you've already used still work inside [code] tags.

ejazsyed · October 3, 2011, 6:32pm

have added 2 tags SourceFile and Targetfile is that what I am supposed to do. Sorry it's my first day in this forum.

I just need help with this line

FOR CUSIP / DESCRIPTION: 11617N-CD-1 / CDFS1.1%081811 BE+#

how do I catch the pattern "CUSIP" followed by anything with 9 digits 1st 3 numbers

Corona688 · October 3, 2011, 6:44pm

That's... close enough, I guess Watch the video, it tells you what to do

Working on something.

ejazsyed · October 3, 2011, 6:52pm

Will do after I get home ... no video streaming at client ... Thanks

Corona688 · October 3, 2011, 6:59pm

$ cat input.awk
BEGIN { # Split 'lines' on --------------
        RS="--------------------------------"
        # Split 'columns' on newline
        FS="\n"
}

NR==1 { # Third-last line contains the info we want, split apart on space
        split($(NF-2), A, " ");
        V1=A[5];
        # Extract the part we want with a regex
        match($(NF-2), /[0-9]+\.[0-9]+/);
        V2=substr($(NF-2), RSTART, RLENGTH);
        # Don't let the routines below get fed this
        next    }

NR==2 { # Stop splitting on ----, start splitting on lines
        RS="\n"
        # Start splitting fields on | instead of \n
        FS="|"
        # Use | as separator when printing too
        OFS="|"
        # Don't let the routines below be fed this record
        next
}

# Only run this code block when we're on the third 'line' or more
NR>2 {
  # Set last record to string we found earlier
  $(NF)=V1
  # Set record after that to string we found earlier
  $(NF+1)=V2
# Only print when we're on the third 'line' or more and there's more than two columns
      } (NF>2)&&(NR>2)
$ awk -f input.awk < input
| 101 |BANK OF NY| 111,000 |11617N-CD-1|1.1
| 111 |FRST CLEAR| 11,000 |11617N-CD-1|1.1
| 17 |JONES E D | 11,000 |11617N-CD-1|1.1
| 111 |JPMC CLEAR| 118,000 |11617N-CD-1|1.1
| 11 |MSSB | 11,000 |11617N-CD-1|1.1
| 116 |NFS LLC | 61,000 |11617N-CD-1|1.1
| 111 |PERSHING | 1,116,000 |11617N-CD-1|1.1
| 171 |SOUTHWEST | 111,000 |11617N-CD-1|1.1
|1111 |WFB/SAFEKP| 100,000 |11617N-CD-1|1.1
$

ahamed101 · October 3, 2011, 11:38pm

Try this... code purely based on your input file...

awk '/CUSIP/{gsub(/[a-z]|%.*/,"",$7);_1=$5"|"$7}/^--/{f=1;next}{if(f==1)print $0 _1}' input_file

--ahamed

durden_tyler · October 4, 2011, 12:01am

$
$
$ cat input.txt
****************************************************
* *
* THE FOLLOWING IS THE SOLICITOR MAIL FILE *
* *
****************************************************
1
THE DEPOSITORY TRUST COMPANY PAGE: 1
SPECIAL SECURITY POSITION LISTING PROGRAM: PXYD0001
FOR CUSIP / DESCRIPTION: 11617N-CD-1 / CDFS1.1%081811 BE+#
POSITIONS AS OF: 08/17/11
0--------------------------------
| PARTICIPANT | QUANTITY |
--------------------------------
| 101 |BANK OF NY| 111,000 |
| 111 |FRST CLEAR| 11,000 |
| 17  |JONES E D | 11,000 |
| 111 |JPMC CLEAR| 118,000 |
| 11  |MSSB | 11,000 |
| 116 |NFS LLC | 61,000 |
| 111 |PERSHING | 1,116,000 |
| 171 |SOUTHWEST | 111,000 |
|1111 |WFB/SAFEKP| 100,000 |
$
$
$ perl -lne '/^.*CUSIP.*?: (.*?) \/ .*?([\d\.]+)%.*$/ and $x="$1|$2"; /^\|\s*\d+\s*\|.*/ and print $_,$x' input.txt
| 101 |BANK OF NY| 111,000 |11617N-CD-1|1.1
| 111 |FRST CLEAR| 11,000 |11617N-CD-1|1.1
| 17  |JONES E D | 11,000 |11617N-CD-1|1.1
| 111 |JPMC CLEAR| 118,000 |11617N-CD-1|1.1
| 11  |MSSB | 11,000 |11617N-CD-1|1.1
| 116 |NFS LLC | 61,000 |11617N-CD-1|1.1
| 111 |PERSHING | 1,116,000 |11617N-CD-1|1.1
| 171 |SOUTHWEST | 111,000 |11617N-CD-1|1.1
|1111 |WFB/SAFEKP| 100,000 |11617N-CD-1|1.1
$
$

tyler_durden

ejazsyed · October 4, 2011, 1:17pm

WOW !!!

What does the following regex evaluate to ?? I know the first part is small case alphabets

/[a-z]|%.*/

Corona688 · October 5, 2011, 11:42am

A literal %, followed by zero or more characters of any kind.

ejazsyed · October 5, 2011, 12:55pm

What is the "|" for does it means

that anything from a to z OR % followed by anything ??