Search and retreive after matched words

nqp200 · August 27, 2015, 7:24am

I have following source file format

{"driver":{"first_name":"xxxx","last_name":"yyyy"},"confirmation_id":"US285204420","vendor":{"id":"1234","name":"BUDGET"}}

I need to extract the data from above and provide in below desired output is

xxxx,yyyy,US285204420,1234,BUDGET

Can you please advise ?

Skrynesaver · August 27, 2015, 7:38am

perl -ne 'print "$1,$2,$3,$4,$5\n" if /{"driver":{"first_name":"([^"]+)","last_name":"([^"]+)"},"confirmation_id":"([^"]+)","vendor":{"id":"([^"]+)","name":"([^"]+)"}}/' filename

ASSUMPTIONS, the format is one record per line and 1 2345 should have been 12345 in the source data

nqp200 · August 27, 2015, 7:58am

Thank you.

Your assumptions are correct, it is one record per line and its 12345.
Is there a way to route this to a file?

Also There is a date field in another text file.I need to append the date field to this output at the end.

b)Also is there a Unix code equivalent as I am not too familiar with Perl.Seems Perl implementation is very easy.

RudiC · August 27, 2015, 8:00am

Try also

awk -F\" '/^{"driver.*first_name.*last_name.*}}$/ {print $6, $10, $14, $20, $24}' OFS=, file
xxxx,yyyy,US285204420,1 234,BUDGET

nqp200 · August 27, 2015, 9:30am

Could you please dissect the code to what this does?

RavinderSingh13 · August 27, 2015, 9:44am

Hello nqp200,

 awk -F\" '/^{"driver.*first_name.*last_name.*}}$/ {print $6, $10, $14, $20, $24}' OFS=, file

Here -F means it is the field separator which is character " but we used \" to escape it as many characters have their special meaning so we are escaping it. Then Regex /^{"driver.*first_name.*last_name.*}}$/ means a line which starts from word driver and then have words first_name and last_name in them and ends with character }} , if this happens then perform action print with fields 6th, 10th, 14th, 20th and 24th field. OFS means Output field separator which is , .

Hope this helps.

Thanks,
R. Singh

nqp200 · August 27, 2015, 12:55pm

how is the 6th position determined likewise 10,14,24

Don_Cragun · August 27, 2015, 4:06pm

With " as the field separator, your sample input is shown on the first line below and the field numbers are shown on the next line (with the leading digit dropped in field numbers 13, 15, 19, 21, and 23 since I couldn't fit two digits under those one character fields):

{"driver":{"first_name":"xxxx","last_name":"yyyy"},"confirmation_id":"US285204420","vendor":{"id":"1234","name":"BUDGET"}}
1"     2" 3"         4"5"   6"7"        8"9"  10"11"             12"3"         14"5"    16"17"18"9"  20"1"  22"3"    24"25

Does this help?

nqp200 · August 27, 2015, 4:37pm

thanks Don!

Aia · August 27, 2015, 4:38pm

At times, it is best to let the application tell you. With awk you can iterate over the record.

$echo '{"driver":{"first_name":"xxxx","last_name":"yyyy"},"confirmation_id":"US285204420","vendor":{"id":"1234","name":"BUDGET"}}' | 
awk -F\" '{for (i=1; i<=NF; i++){print "$" i " = " $i}}'
$1 = {
$2 = driver
$3 = :{
$4 = first_name
$5 = :
$6 = xxxx
$7 = ,
$8 = last_name
$9 = :
$10 = yyyy
$11 = },
$12 = confirmation_id
$13 = :
$14 = US285204420
$15 = ,
$16 = vendor
$17 = :{
$18 = id
$19 = :
$20 = 1234
$21 = ,
$22 = name
$23 = :
$24 = BUDGET
$25 = }}

nqp200 · August 30, 2015, 3:56pm

Before closing this thread I have one more question I need to create separate types of files for databeing retrieved from this dataset.
Input excel file has
3 unique ids each in a cell and the above dataset in one cell..with above code I am able to get the data extract.However my input filehas
id, memberid,ititnid,unparsingdataset

Now if I need to split this xlsx file to retrieve only unparsed dataset and add one unique id to this dataset ,what should be possible logic.
1.I am using this

restructured the layout using "

awk -v val=$num -F'"' '{print $4 "|" $1 "|" $2 "|" $3  }' Car.xlsx>Car2new.xlsx>Car.xlsx

2.enerate on file for one type of dataset; will this work..adding unique id to each file so that i can join this after adding to database table

awk -F\" '/^{"driver.*first_name.*last_name.*}}$/ {print $6, $10, $14, $260}� OFS=, CarReservationsNov2.xlsx > Customer.xlsx*

awk -F\" '/^{"driver.*first_name.*last_name.*}}$/ {print $20, $24, $34,$14,$260}� OFS=, CarReservationsNov2.xlsx > vendor.xlsx*

where

$20 = 1363
$21 = ,
$22 = name
$23 = :
$24 = BUDGET
$25 = ,
$26 = image
$27 = :
$28 = https
$29 = },
$30 = pick_up
$31 = :{
$32 = date_time
$33 = :
$34 = 2014-11-01T12:00:00+00:00
$260=5451  id 
$261=c870efa6-6184-11e4-8df3-121e6b429d9a memberid
$262=39556itiid

Don_Cragun · August 30, 2015, 7:47pm

First: awk processes text files; not MicroSoft Excel files. I, therefore, assume that you have used Excel to extract the data from that spreadsheet file into a CSV formatted text file and named that file with a .xlsx file extension instead of a more appropriate .txt file extension. Is my assumption correct?

Note also that awk does not produce Excel spreadsheets as output files. Although UNIX and Linux systems don't determine file types by pathname extensions, naming a file with an extension that does not match the format associated with that filename extension will confuse anyone familiar with the traditional contents of a MicroSoft excel spreadsheet file with the .xlsx filename extension.

Second: What is the format of the "3 unique ids each in a cell" and the format of "the above dataset in one cell"?

Now if I need to split this xlsx file to retrieve only unparsed dataset and add one unique id to this dataset ,what should be possible logic.
1.I am using this

restructured the layout using "
awk -v val=$num -F'"' '{print $4 "|" $1 "|" $2 "|" $3  }' Car.xlsx>Car2new.xlsx>Car.xlsx

The above code destroys any data that may have existed in the file Car.xlsx and creates two empty files Car.xlsx and Car2new.xlsx . Even if you didn't redirect output to destroy your input file, this awk script does not create any unique IDs; it just moves the 4th input field to be the 1st output field, and changes field separators from the double-quote character to the vertical-bar character. You pass a number to it as a parameter, but val is never used in your script. Please explain, in English, what you are trying to do and show us sample input and the desired output corresponding to that input (both in CODE tags).

2.enerate on file for one type of dataset; will this work..adding unique id to each file so that i can join this after adding to database table

awk -F\" '/^{"driver.*first_name.*last_name.*}}$/ {print $6, $10, $14, $260}' OFS=, CarReservationsNov2.xlsx > Customer.xlsx*

awk -F\" '/^{"driver.*first_name.*last_name.*}}$/ {print $20, $24, $34,$14,$260}' OFS=, CarReservationsNov2.xlsx > vendor.xlsx*

where

$20 = 1363
$21 = ,
$22 = name
$23 = :
$24 = BUDGET
$25 = ,
$26 = image
$27 = :
$28 = https
$29 = },
$30 = pick_up
$31 = :{
$32 = date_time
$33 = :
$34 = 2014-11-01T12:00:00+00:00
$260=5451  id 
$261=c870efa6-6184-11e4-8df3-121e6b429d9a memberid
$262=39556itiid

What you are trying to do with > vendor.xlsx* is not clear. Using a character in an output filename that is a special character in shell pathname expansion is likely to cause you lots of problems later. If you are attempting to cause the awk script to produce multiple output files, that is not the way to do it. If there are other existing file with names starting with the string vendor.xlsx (or Customer.xlsx ) you will be directing the output to the 1st match file and using the other matched files as additional input file to your script. Please clearly describe what files you are trying to produce and show us the content that you want to appear in each of those files (in CODE tags).