CSV File with Multiple Search Parameters

chetanojha · March 8, 2017, 4:52pm

Dear Team Members,

I have a unique problem. Below is my dataset. I am writing a script that will read through the file and pull out the invoice number (field 2 of the C1 row).

"C1",990001,"L1","HERO","MOTORCYCLE","ASIA-PACIFIC","BEIJING"
"C2","CLUTCH","HYUNDAI",03032017
"C2","BRAKES","EUROPARTS",03032017
"C2","TYRES","MICHELIN",03032017

"C1",990002,"L2","HERO","CAR","EU1","BUDAPEST"
"C2","CLUTCH","FORD",03032017
"C2","BRAKES","EUROPARTS",03032017
"C2","TYRES","PIRELLI",03032017

For example, my script needs to accept parameters like TYRES MICHELIN.
The script will search the C2 rows of the file, and wherever it finds the above two parameters it will print the invoice number from the corresponding C1 row along with MICHELIN. It should also remove the double quotes from the text.

./findInvoice.sh MICHELIN

Output of the script will give me

990001 MICHELIN

Another example is

./findInvoice.sh  PIRELLI

and output of the script will be

990002 PIRELLI

Any help would be useful.

Thanks.

wisecracker · March 8, 2017, 5:11pm

You are calling your shell script differently.
./findInvoice.sh TYRE MICHELIN
AND
./findInvoice.sh PIRELLI
Which one is correct, as the _positional_ parameter gives '$2' in the first and '$1' in the second.
Please clarify.
I am assuming 'TYRE' should not be there and that '$1' is needed for both.

chetanojha · March 8, 2017, 5:19pm

I have now corrected the calling of script.

./findInvoice.sh MICHELIN
or
./findInvoice.sh PIRELLI

Above is the correct one

drysdalk · March 8, 2017, 6:13pm

Hi,

Is this the sort of thing you're after? Note that 'example.txt' is just populated with the sample input that you supplied in your post.

$ cat script.sh
#!/bin/bash
file=example.txt

if [ "$1" == "" ]
then
        echo You must supply a search term
        exit 1
fi

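while IFS= read -r line
do
        case "$line" in
                \"C1\",*)
                        # Header row: remember its invoice number (field 2), quotes stripped
                        invoice=$(printf '%s\n' "$line" | /usr/bin/awk -F, '{gsub(/"/,""); print $2}')
                        ;;
                \"C2\",*)
                        # Detail row: report the current invoice if the quoted term is one of its fields
                        if printf '%s\n' "$line" | /bin/grep -qF -- ",\"$1\""
                        then
                                echo "$invoice $1"
                                unset invoice
                        fi
                        ;;
        esac
done < "$file"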

$ ./script.sh PIRELLI
990002 PIRELLI
$ ./script.sh MICHELIN
990001 MICHELIN
$ ./script.sh EUROPARTS
990001 EUROPARTS
990002 EUROPARTS
$

Hope this helps. If not, then if you could advise how the behaviour or output differs from your expectations I can have another crack at solving this one.

Don_Cragun · March 9, 2017, 12:17am

If you want to search for multiple strings on one invocation of your script, or if you want to print the 2nd field from C1 lines even if a given search string is also found on a C1 (as well as on a C2) line, you could try this simplified script:

#!/bin/ksh
printf '%s\n\n' "$@" |
awk -F, -v RS= '
NR == FNR {
	keys[$0]
	next
}
{	for(key in keys)
		if(index($0, key))
			print $2, key
}' - dataset

which, if invoked with the command:

./script.sh  BEIJING FORD HYUNDAI TYRE

with your sample input in a file named dataset would produce the output:

990001 HYUNDAI
990001 BEIJING
990001 TYRE
990002 TYRE
990002 FORD

This depends on the C1 lines being the 1st line in a record and there being a blank line in your file between records which seems to fit your sample input file.

This was written and tested with a Korn shell, but will also work with any other shell that uses Bourne shell syntax (such as bash). With minor changes, this could be made to print only records that matched all operands given to your script instead of matching any of the operands. But, to do that, you'd have to specify which of the operands is to be printed when they all match.

If you want to try this on a Solaris/SunOS system, change awk in the script to /usr/xpg4/bin/awk or nawk.
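
As a rough illustration of the "match all operands" variant mentioned above, here is a minimal sketch. It assumes the same blank-line-separated layout, and it prints only the invoice number, which sidesteps the question of which operand to print. The function name find_all and its argument order are made up for this example.

```shell
#!/bin/sh
# find_all FILE TERM...  (hypothetical helper, not part of the script above)
# Print field 2 of every blank-line-separated record that contains
# ALL of the given search strings.
find_all() {
    file=$1
    shift
    printf '%s\n\n' "$@" |
    awk -F, -v RS= '
    NR == FNR {
        # First input (stdin): collect the distinct search strings
        if (!($0 in keys)) {
            keys[$0]
            nkeys++
        }
        next
    }
    {
        # Data records: count how many search strings occur in this record
        hits = 0
        for (key in keys)
            if (index($0, key))
                hits++
        if (hits == nkeys)
            print $2
    }' - "$file"
}
```

With the sample input in a file named dataset, calling find_all dataset TYRES FORD should print only the second invoice number, since FORD appears in that record alone.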

chetanojha · March 9, 2017, 5:58am

Hello Don,

I am not able to run this script. It says it cannot open the file -dataset.

A few other questions:

How do I pass multiple files, so that this script can look into many files (over 100K)?
I am running this script in the bash shell. It will look through over 100 million records to find the selected few.
Will this write the result to a file?

RudiC · March 9, 2017, 7:13am

Please note the space between "-" and "dataset" in Don Cragun's proposal. The "-" designates standard input.

Specify several file names in lieu of just "dataset". Please be aware that the source of the data (the file name) will not be printed with the data.
The record count shouldn't be a problem as long as system limits are not exceeded.
As given, the code prints to stdout. Use redirection to capture the output in a file.
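
To make those three points concrete, here is a minimal sketch for a single search term. It assumes the blank-line-separated layout from the first post and that the data files sit under one directory with names ending in .csv; the function name search_tree and that naming are assumptions for illustration only. Using find with -exec ... {} + hands the files to awk in batches, which avoids "argument list too long" errors with very many files, and awk's built-in FILENAME variable adds the source file to each output line.

```shell
#!/bin/sh
# search_tree DIR TERM  (hypothetical helper)
# Search every *.csv file under DIR and print "file: invoice term"
# for each blank-line-separated record that contains TERM.
search_tree() {
    dir=$1
    term=$2
    find "$dir" -type f -name '*.csv' -exec awk -F, -v RS= -v term="$term" '
        index($0, term) { print FILENAME ": " $2, term }
    ' {} +
}
```

To capture the result in a file, redirect the output, for example: search_tree /path/to/data MICHELIN > results.txt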

chetanojha · March 9, 2017, 8:47am

With Don's script, I always get the first invoice number instead of the respective invoice number.

So for the dataset below

"C1",990001,"L1","HERO","MOTORCYCLE","ASIA-PACIFIC","BEIJING"
"C2","CLUTCH","HYUNDAI",03032017
"C2","BRAKES","EUROPARTS",03032017
"C2","TYRES","MICHELIN",03032017

"C1",990002,"L2","HERO","CAR","EU1","BUDAPEST"
"C2","CLUTCH","FORD",03032017
"C2","BRAKES","EUROPARTS",03032017
"C2","TYRES","PIRELLI",03032017

I am getting the output below.

 ./script.sh FORD PIRELLI MICHELIN
 990001  FORD
 990001 PIRELLI
 990001 MICHELIN

RudiC · March 9, 2017, 9:17am

This is what I get:

990001 MICHELIN
990002 PIRELLI
990002 FORD

I suspect your input file has a non-standard structure, for instance DOS line terminators (<CR>, \r, 0x0D), which causes the record separator to fail so that everything is read into one record.
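
One way to check for this, and to work around it, is sketched below. The function names has_cr and strip_cr are invented for this example. With DOS line endings the "blank" line between records actually holds a carriage return, so awk's paragraph mode (RS set to the empty string) never sees an empty line.

```shell
#!/bin/sh
# has_cr FILE: succeed if any line of FILE contains a carriage return
has_cr() {
    grep -q "$(printf '\r')" "$1"
}

# strip_cr FILE: write FILE to stdout with all carriage returns removed
strip_cr() {
    tr -d '\r' < "$1"
}
```

If has_cr dataset succeeds, something like strip_cr dataset > dataset.unix should give a copy that the paragraph-mode script can split into records.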

chetanojha · March 9, 2017, 9:52am

Hi,
With Don's script, no matter which string I pass, I am getting info about only the first invoice, as shown below.

 ./script.sh PIRELLI HYUNDAI

The output is as below:

 990001 PIRELLI
 990001 HYUNDAI

---------- Post updated at 09:52 AM ---------- Previous update was at 09:29 AM ----------

Aha, yes, I think I see the difference.

When I use the example.txt file below, it works.

"C1",990001,"L1","HERO","MOTORCYCLE","ASIA-PACIFIC","BEIJING"
"C2","CLUTCH","HYUNDAI",03032017
"C2","BRAKES","EUROPARTS",03032017
"C2","TYRES","MICHELIN",03032017

"C1",990002,"L2","HERO","CAR","EU1","BUDAPEST"
"C2","CLUTCH","FORD",03032017
"C2","BRAKES","EUROPARTS",03032017
"C2","TYRES","PIRELLI",03032017

But when I use the one below, it doesn't:

"C1",990001,"L1","HERO","MOTORCYCLE","ASIA-PACIFIC","BEIJING"
"C2","CLUTCH","HYUNDAI",03032017
"C2","BRAKES","EUROPARTS",03032017
"C2","TYRES","MICHELIN",03032017
"C1",990002,"L2","HERO","CAR","EU1","BUDAPEST"
"C2","CLUTCH","FORD",03032017
"C2","BRAKES","EUROPARTS",03032017
"C2","TYRES","PIRELLI",03032017

The one without the extra empty line is the requirement. Is there any change I can make to the script to cater for it?

Also, I need to pass multiple files together to pull the data out. How can I do that?

durden_tyler · March 9, 2017, 9:54am

I agree with RudiC that the End-Of-Line (EOL) characters in your data file are, most likely, non-standard. If you are on Unix/Linux, they should be "\n".
Can you show the EOL characters in your data file?

RudiC · March 9, 2017, 10:32am

As always it pays off to do your posting VERY carefully! You see the difficulties that arise by just dropping an empty line!

Adapting Don Cragun's proposal to your new structures, try

awk -F, -v ITM="$*" '
BEGIN           {for (n=split (ITM, TMP, " "); n>0; n--) keys[TMP[n]]
                }

$1 == "\"C1\""  {INV = $2
                 next
                }

                {for (key in keys) if (index ($0, key)) print INV, key
                }
' file1 file1
990001 MICHELIN
990002 FORD
990002 PIRELLI
990001 MICHELIN
990002 FORD
990002 PIRELLI

Please be aware that the dash (for stdin) is gone, and I added more than one file in the input stream.
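
Note that index() does substring matching, so a term like TYRE also hits TYRES. If whole-field matches with the quotes stripped are preferred, a minimal sketch could look like the following. The function name find_exact and the convention of passing the terms as one space-separated string are assumptions for this example; as in the code above, C1 rows only supply the invoice number and are not searched.

```shell
#!/bin/sh
# find_exact "TERM1 TERM2 ..." FILE...  (hypothetical helper)
# Print "invoice term" for every C2 field that equals one of the terms
# exactly, after removing the double quotes.
find_exact() {
    terms=$1
    shift
    awk -F, -v terms="$terms" '
    BEGIN {
        n = split(terms, t, " ")
        for (i = 1; i <= n; i++)
            keys[t[i]]
    }
    { gsub(/"/, "") }                 # strip quotes; awk re-splits $0 into fields
    $1 == "C1" { inv = $2; next }     # remember the invoice number of this group
    {
        for (i = 2; i <= NF; i++)
            if ($i in keys)
                print inv, $i
    }' "$@"
}
```

For example, find_exact "MICHELIN PIRELLI" file1 file2 > result.txt would search both files and write the matches to result.txt.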

vgersh99 · March 9, 2017, 10:47am

Modifying Don's approach for the new data file format:

#!/usr/bin/bash

printf '%s\n' "$@" |
awk -F, '
NR == FNR {
    keys[$0]
    next
}
$2 ~ "^[0-9][0-9]*$" { inv=$2;next}
{
      for(key in keys)
        if(index($0, key))
            print inv, key
}' - dataset

durden_tyler · March 9, 2017, 11:54am

Yup, my bad. The blank line was the issue.
An alternative solution using Perl invoked from a Bash shell script is as follows:

$ cat dataset
"C1",990001,"L1","HERO","MOTORCYCLE","ASIA-PACIFIC","BEIJING"
"C2","CLUTCH","HYUNDAI",03032017
"C2","BRAKES","EUROPARTS",03032017
"C2","TYRES","MICHELIN",03032017
"C1",990002,"L2","HERO","CAR","EU1","BUDAPEST"
"C2","CLUTCH","FORD",03032017
"C2","BRAKES","EUROPARTS",03032017
"C2","TYRES","PIRELLI",03032017
$
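$ cat invoice.sh
#!/usr/bin/bash
# All search terms are passed to perl as one space-separated string ("$*")
perl -e 'map{$x{$_}++} split(/ /,$ARGV[0]);
         open(FH, "<", $ARGV[1]) or die "Cannot open $ARGV[1]: $!\n";
         while (<FH>) {
             chomp (@token = split(/,/));
             # A C1 line starts a new group; remember its invoice number
             $id = $token[1] if $token[0] =~ /C1/;
             # Strip quotes and print every field that is a search term
             map {s/"//g; print "$id\t$_\n" if defined $x{$_}} @token }
         close(FH);
        ' "$*" dataset
$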
$ ./invoice.sh HERO BUDAPEST HYUNDAI MICHELIN PIRELLI
990001  HERO
990001  HYUNDAI
990001  MICHELIN
990002  HERO
990002  BUDAPEST
990002  PIRELLI
$
$ ./invoice.sh MICHELIN PIRELLI FORD
990001  MICHELIN
990002  FORD
990002  PIRELLI
$
$ ./invoice.sh PIRELLI HYUNDAI
990001  HYUNDAI
990002  PIRELLI
$
$

Or as a self-contained Perl program for the same "dataset" file as above:

$ cat invoice.pl
#!/usr/bin/perl -w
use strict;
my $file = "dataset";
my (%x, @token, $id);
foreach (@ARGV) { $x{$_}++ }
open(FH, "<", $file) or die "Can't open $file: $!";
while (<FH>) {
    chomp (@token = split(/,/));
    $id = $token[1] if $token[0] =~ /C1/;
    foreach (@token) {
        s/"//g;
        print "$id\t$_\n" if defined $x{$_};
    }
}
close(FH) or die "Can't close $file: $!";

$
$ perl invoice.pl HERO BUDAPEST HYUNDAI MICHELIN PIRELLI
990001  HERO
990001  HYUNDAI
990001  MICHELIN
990002  HERO
990002  BUDAPEST
990002  PIRELLI
$
$ perl invoice.pl MICHELIN PIRELLI FORD
990001  MICHELIN
990002  FORD
990002  PIRELLI
$
$ perl invoice.pl PIRELLI HYUNDAI
990001  HYUNDAI
990002  PIRELLI
$
$