Replacement for the cut command

Experts,

It's been a long time since I last programmed in shell, so I thought this might be the opportunity to ask for your valuable suggestions on one of the challenges I'm going through: parsing a string into a variable with the use of "cut".


#Azure DataLake Path Of the File

DATASET_PATH="adl://xyz123.azuredatalakestore.net/devhdfs/DataWareHouse/sf_inbound/serviceappointment"

#Shell Function

hiveClass () {
    hadoop fs -ls "${DATASET_PATH}"
}

#Variable that Stores the Complete Path of the File

var=`hiveClass | grep -i "parquet" | cut -d' ' -f15`

Example of hiveClass if executed explicitly:


spark@hn0-emrazs:~$ DATASET_PATH="adl://xyz123.azuredatalakestore.net/devhdfs/DataWareHouse/sf_inbound/serviceappointment"

spark@hn0-emrazs:~$ hiveClass () {
> hadoop fs -ls "${DATASET_PATH}"
> }

spark@hn0-xyz1:~$ var=`hiveClass | grep -i "parquet" | cut -d' ' -f15`

spark@hn0-xyz1:~$ echo "$var"

spark@hn0-xyz1:~$ var=`hiveClass | grep -i "parquet" | cut -d' ' -f14`

spark@hn0-xyz1:~$ echo "$var"

spark@hn0-xyz1:~$ var=`hiveClass | grep -i "parquet" | cut -d' ' -f16`

spark@hn0-xyz1:~$ echo "$var"

spark@hn0-xyz1:~$ var=`hiveClass | grep -i "parquet" | cut -d' ' -f13`

spark@hn0-xyz1:~$ echo "$var"
adl://xyz123.azuredatalakestore.net/devhdfs/DataWareHouse/sf_inbound/appointment/part-00000-aa60bb3c-6780-44fa-b93d-4232df81faa1-c000.snappy.parquet

Raw execution of the command gives this result:

sparksshuser@hn0-xyz1:~$ hadoop fs -ls adl://xyz123.azuredatalakestore.net/devhdfs/DataWareHouse/sf_inbound/appointment

Found 2 items
-rw-r-----+  1 sparksshuser sparksshuser          0 2018-10-25 02:08 adl://xyz123.azuredatalakestore.net/devhdfs/DataWareHouse/sf_inbound/appointment/_SUCCESS
-rw-r-----+  1 sparksshuser sparksshuser     594663 2018-10-25 02:07 adl://xyz123.azuredatalakestore.net/devhdfs/DataWareHouse/sf_inbound/appointment/part-00000-aa60bb3c-6780-44fa-b93d-4232df81faa1-c000.snappy.parquet

Now the real challenge is cut with the field number. The field position is not constant, so I cannot reliably schedule my script: it changes every time the file size in bytes changes.


var=`hiveClass | grep -i "parquet" | cut -d' ' -f13`

Now my question: I want to capture the complete file URL in a variable, so that I can use it as a feeder into a Hive table, without using "cut -d' ' -f???".


adl://xyz123.azuredatalakestore.net/devhdfs/DataWareHouse/sf_inbound/appointment/part-00000-aa60bb3c-6780-44fa-b93d-4232df81faa1-c000.snappy.parquet

awk '{print $13}'

$13 is based on whitespace-separated fields, if I'm not wrong?

But the number of spaces may increase or decrease; it is dynamic in nature, because of the file-size field ("594663"). If the file size grows by another 6 digits, to something like "111222594663", I may have to move down from $13 to $10/11/12, which is not practical once jobs are scheduled, and vice versa if the size shrinks ($13 becomes $14/15/16). A sketch of the drift follows the sample line below. Kindly advise what would be the best approach.


-rw-r-----+  1 sparksshuser sparksshuser     594663 2018-10-25 02:07 adl://xyz123.azuredatalakestore.net/devhdfs/DataWareHouse/sf_inbound/appointment/part-00000-aa60bb3c-6780-44fa-b93d-4232df81faa1-c000.snappy.parquet
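
To illustrate the drift, a quick sketch (paths shortened, and the 12-digit size is made up):

#Path lands in field 13 while the 6-digit size leaves 5 padding spaces
line='-rw-r-----+  1 u u     594663 2018-10-25 02:07 adl://host/p.parquet'
echo "$line" | cut -d' ' -f13    #prints adl://host/p.parquet

#A 12-digit size eats the padding, so field 13 is now empty
line='-rw-r-----+  1 u u 111222594663 2018-10-25 02:07 adl://host/p.parquet'
echo "$line" | cut -d' ' -f13    #prints an empty line
echo "$line" | cut -d' ' -f9     #the path has shifted to field 9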

Hi, try:

... | grep -i "parquet" | awk '{print $5}'

instead.

awk - with the default field separator - clusters whitespace together, whereas cut counts each space character individually.
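
You can see the difference with a self-contained pair of one-liners:

echo 'a   b' | cut -d' ' -f2     #prints an empty field: each of the three spaces is its own delimiter
echo 'a   b' | awk '{print $2}'  #prints b: the whole run of spaces is one separator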

Funny, I count field #8:

hadoop fs -ls | awk '/parquet/ {print $8}'

Or take the last field:

hadoop fs -ls | awk '/parquet/ {print $NF}'

If you need case-insensitive matching, you can keep the grep -i:

hadoop fs -ls | grep -iw "parquet" | awk '{print $NF}'
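
Putting it together with your function - a minimal sketch, assuming your DATASET_PATH is unchanged and that the path itself never contains spaces (any whitespace-based split would break on such a path):

hiveClass () {
    hadoop fs -ls "${DATASET_PATH}"
}

#$NF is the last whitespace-separated field on the line, so the field
#number no longer matters when the size column widens or shrinks
var=$(hiveClass | grep -iw "parquet" | awk '{print $NF}')
echo "$var"

Note that if the directory ever holds more than one parquet file, var will contain one path per line.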