Search/grep on row and column wise

calredd · June 20, 2015, 10:33pm

Hello,
I have a comma seperate metadata as follows:

CITY        ,COUNTY,STATE,COUNTRY
NEW_YORK,NYC     ,NY     ,USA
NEWARK    ,ESSEX ,NJ      ,USA
CHICAGO   ,COOK  ,IL       ,USA
SEATTLE   ,MINER ,WA     ,USA

In my process, I get two key values ie
CITY NAME (can be one of the values in first column shown above)
COLUMN NAME (I get one of the values in the 1st row, ie either CITY or COUNTY or STATE or COUNTRY)

For EX:
If my keys are CHICAGO & STATE , then I need my ouput as IL
If my keys are NEWARK & COUNTRY , then I need my ouput as USA

Please note that the column position can change. ie the order in which I get data in my columns is not necessarily in the order shown above. I may get my columns in this order also -> CITY,STATE,COUNTY,STATE

Now to implement this solution, I can grep on the CITY_NAME but then I need to know where the other column is. I am trying different solutions but without much success.

Any simple solutions to find this using sed or awk is appreciated.
Thanks.

Don_Cragun · June 20, 2015, 11:36pm

None of your examples match your sample data.

You don't have a field named CITY ; you have a field named CITY followed by eight spaces. If you try to look for a value in the CITY field, it should fail because there is no CItY field.

With a comma separated data file, you don't have a city name of CHICAGO ; you have a city name of CHICAGO followed by three spaces, and you don't have a city name of NEWARK ; you have a city name of NEWARK followed by four spaces.

And, if someone entered the keys, CHICAGO followed by three spaces and STATE they should get IL followed by 7 spaces; not just IL .

And, if you have columns in the order CITY,STATE,COUNTY,STATE (which you said is possible) and the user gives keys NEW_YORK and STATE are you supposed to return the value in column 2, column 4, or both???

How are the keys supplied to your script?

What operating system and shell are you using?

What have you tried so far?

Is this a homework assignment?

calredd · June 20, 2015, 11:45pm

don cragun:

None of your examples match your sample data.

You don't have a field named CITY ; you have a field named CITY followed by eight spaces. If you try to look for a value in the CITY field, it should fail because there is no CItY field.

With a comma separated data file, you don't have a city name of CHICAGO ; you have a city name of CHICAGO followed by three spaces, and you don't have a city name of NEWARK ; you have a city name of NEWARK followed by four spaces.

And, if someone entered the keys, CHICAGO followed by three spaces and STATE they should get IL followed by 7 spaces; not just IL .

And, if you have columns in the order CITY,STATE,COUNTY,STATE (which you said is possible) and the user gives keys NEW_YORK and STATE are you supposed to return the value in column 2, column 4, or both???

How are the keys supplied to your script?

What operating system and shell are you using?

What have you tried so far?

Is this a homework assignment?

Sorry for the confusion but there are no space in the data. I only put spaces to make it easier to understand. Here's how the data will appear:

CITY,COUNTY,STATE,COUNTRY
NEW_YORK,NYC,NY,USA
NEWARK,ESSEX,NJ,USA
CHICAGO,COOK,IL,USA
SEATTLE,MINER,WA,USA

Q. How are the keys supplied to your script? A value from first column and a column name
Ex: NEWARK & STATE, SEATTLE & COUNTRY, NEWARK & COUNTY

Q. What operating system and shell are you using? LINUX and ksh

So far, I have tried doing a egrep on both CITY and the value from the first row (ex: NEW_YORK or CHICAGO) to get two rows. But after that, I need to narrow down the value to a specific column name which can be either COUNTY or STATE or COUNTRY and here's where I am struck.

No, this is not a homework but rather I need to mine a metadata file to narrow down the value to a specific row and column.

Hope my explantion helps.

Don_Cragun · June 21, 2015, 1:06am

calredd:

Sorry for the confusion but there are no space in the data. I only put spaces to make it easier to understand. Here's how the data will appear:
CITY,COUNTY,STATE,COUNTRY
NEW_YORK,NYC,NY,USA
NEWARK,ESSEX,NJ,USA
CHICAGO,COOK,IL,USA
SEATTLE,MINER,WA,USA
Q. How are the keys supplied to your script? A value from first column and a column name
Ex: NEWARK & STATE, SEATTLE & COUNTRY, NEWARK & COUNTY

Q. What operating system and shell are you using? LINUX and ksh

So far, I have tried doing a egrep on both CITY and the value from the first row (ex: NEW_YORK or CHICAGO) to get two rows. But after that, I need to narrow down the value to a specific column name which can be either COUNTY or STATE or COUNTRY and here's where I am struck.

No, this is not a homework but rather I need to mine a metadata file to narrow down the value to a specific row and column.

Hope my explantion helps.

I asked "How are the keys supplied to your script?" You responded "A value from first column and a column name
Ex: NEWARK & STATE, SEATTLE & COUNTRY, NEWARK & COUNTY"
That response is an answer to the question "What is supplied to your script?", not "How".

Is your script passed two operands?
Is your script passed one operand with a literal <space><ampersand><space> between the two keys?
Are the values read from a file? (If so what is the file format?)
Can more than one pair of keys be supplied to your script to be processed during a single invocation of your script?
Exactly what format should be used for the output?
[list=i]
Are the keys supposed to be included in the output?
If more than one pair of keys are input, is the output supposed to be a single line or one line for each pair of keys?
[/list]
Is the data built into the script or is it in a file?
If the data to be processed is in a file, is the pathname of the file built into the script (If so, what is that pathname?), or is it passed in as an option argument (If so, what is the option letter and what is the default pathname if no option is provided?), or is it passed in as an operand?

Please help us help you. Don't make us guess at what you're trying to do. Show us that you have thought about the design of your script and aren't just asking us to not only guess at how your script is supposed to work but also what the output is supposed to look like.

And, PLEASE, use CODE tags where appropriate in your posts.

calredd · June 21, 2015, 10:10am

don cragun:

I asked "How are the keys supplied to your script?" You responded "A value from first column and a column name
Ex: NEWARK & STATE, SEATTLE & COUNTRY, NEWARK & COUNTY"
That response is an answer to the question "What is supplied to your script?", not "How".

Is your script passed two operands?

Is your script passed one operand with a literal <space><ampersand><space> between the two keys?

Are the values read from a file? (If so what is the file format?)

Can more than one pair of keys be supplied to your script to be processed during a single invocation of your script?

Exactly what format should be used for the output?
[list=i]

Are the keys supposed to be included in the output?

If more than one pair of keys are input, is the output supposed to be a single line or one line for each pair of keys?
[/list]

Is the data built into the script or is it in a file?

If the data to be processed is in a file, is the pathname of the file built into the script (If so, what is that pathname?), or is it passed in as an option argument (If so, what is the option letter and what is the default pathname if no option is provided?), or is it passed in as an operand?

Please help us help you. Don't make us guess at what you're trying to do. Show us that you have thought about the design of your script and aren't just asking us to not only guess at how your script is supposed to work but also what the output is supposed to look like.

And, PLEASE, use CODE tags where appropriate in your posts.

Is your script passed two operands?
Yes, I get two parameters passed.
Is your script passed one operand with a literal <space><ampersand><space> between the two keys?
No. I get two parameters passed.
Are the values read from a file? (If so what is the file format?)
Yes, it is a comma seperated file as shown in my prior post.
Can more than one pair of keys be supplied to your script to be processed during a single invocation of your script?
No, I only get one pair of keys.
Exactly what format should be used for the output?
I just need to fetch the value which can assigned to a variable.
Are the keys supposed to be included in the output?
No
If more than one pair of keys are input, is the output supposed to be a single line or one line for each pair of keys?
Only one pair of keys are input.
Is the data built into the script or is it in a file?
Data is an external file.
If the data to be processed is in a file, is the pathname of the file built into the script (If so, what is that pathname?), or is it passed in as an option argument (If so, what is the option letter and what is the default pathname if no option is provided?), or is it passed in as an operand?
Data is a pathname. For sake of simplicity, pls assume that data is residing at /tmp/metadata.csv

My solution so far has been:

$ cat /tmp/metadata.csv | egrep "NEWARK|COUNTY"
CITY,COUNTY,STATE,COUNTRY
NEWARK,ESSEX,NJ,USA

Where /tmp/metadata.csv is the file path.
NEWARK is the first parameter
COUNTY is the second parameter

Now, I am trying to use awk to get my final output as ESSEX.

Hope this explantion helps.
Thanks again for your questions.

RudiC · June 21, 2015, 1:30pm

Try

awk -F,    '
FNR==1          {for (n=1; n<=NF; n++) {if ($n==COL) KEYCOL=n}
                 next
                }
$1==KEY         {print $KEYCOL}
'  KEY=CHICAGO COL=COUNTY   file
COOK

In case you have one parameter with the keywords separated by "&", try

awk -F,    '
FNR==1          {gsub (" ", "", KEY)
                 split (KEY, Karr, "&")
                 for (n=1; n<=NF; n++) {if ($n==Karr[2]) KEYCOL=n}
                 next
                }
$1==Karr[1]     {print $KEYCOL}
'  KEY="CHICAGO & COUNTY"   file

Don_Cragun · June 21, 2015, 2:44pm

In case the database might be switched to use expected spaces in names (such as NEW YORK instead of underscores (as in NEW_YORK ), you might want to change:

FNR==1          {gsub (" ", "", KEY)
                 split (KEY, Karr, "&")

in the 2nd script to:

FNR==1          {split (KEY, Karr, " *& *")

I don't know how this database is going to account for the fact that the five boroughs of New York City are five different counties (The Bronx is in Bronx county, Brooklyn is in Kings county, Manhattan is in New York county, Queens is in Queens county, and Staten Island is in Richmond county).

calredd · June 21, 2015, 3:11pm

rudic:

Try

awk -F,    '
FNR==1          {for (n=1; n<=NF; n++) {if ($n==COL) KEYCOL=n}
   next
   }
$1==KEY         {print $KEYCOL}
'  KEY=CHICAGO COL=COUNTY   file
COOK

In case you have one parameter with the keywords separated by "&", try

awk -F,    '
FNR==1          {gsub (" ", "", KEY)
   split (KEY, Karr, "&")
   for (n=1; n<=NF; n++) {if ($n==Karr[2]) KEYCOL=n}
   next
   }
$1==Karr[1]     {print $KEYCOL}
'  KEY="CHICAGO & COUNTY"   file

Thank you, appreciate it. Your solution seems to be working. I will further test it for different scenarios.