AWK regex and Dataset handling

alrai · December 18, 2022, 8:48am

Greetings

I have a task here where I wish to use the power of regex and AWK to manage datasets. I have picked a movie dataset, and in some fields the information is nested and contains multiple ',' characters, which is also the field separator. I understand that this can be done with GUI tools, but I wish to be able to handle it via the AWK CLI. Let me paste 2 records of the dataset to illustrate my problem. I am fairly new to this and I will appreciate any guidance. Thank you.

Following are 2 records (without the header line) from the movie dataset I found on Kaggle.

False,"{'id': 10194, 'name': 'Toy Story Collection', 'poster_path': '/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg', 'backdrop_path': '/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg'}",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]", movieurl,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States of America'}]",1995-10-30,373554033,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415

False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {'id': 10751, 'name': 'Family'}]",,8844,tt0113497,en,Jumanji,"When siblings Judy and Peter discover an enchanted board game that opens the door to a magical world, they unwittingly invite Alan -- an adult who's been trapped inside the game for 26 years -- into their living room. Alan's only hope for freedom is to finish the game, which proves risky as all three find themselves running from giant rhinoceroses, evil monkeys and other terrifying creatures.",17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'name': 'Teitler Film', 'id': 2550}, {'name': 'Interscope Communications', 'id': 10201}]","[{'iso_3166_1': 'US', 'name': 'United States of America'}]",1995-12-15,262797249,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso_639_1': 'fr', 'name': 'Français'}]",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413

What possible approaches can I use, with regex and AWK, to display each record? I can't simply split on ',' because the number of fields may be different in each record, but the nested record is contained in []. How do I treat any data within [] as ONE field and still display the whole record using a single regex, maybe in an AWK script?

Much obliged for any advice from the regex PROs.

drysdalk · December 18, 2022, 9:32am

Hello,

Welcome to the forum ! We hope you enjoy your time here, and find this to be a friendly and helpful place.

This can indeed still be handled by awk, with the biggest issue being the one you have correctly identified: the field separator. Some fields here are quoted fields with an embedded comma as part of the data, and so a simple straightforward awk -F, isn't quite going to cut it.
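
To make the problem concrete, here is a small sketch. The record below is a shortened, made-up line in the style of the dataset (it is an assumption for illustration, not copied from the file), and sample.csv is just a scratch file name. It has three logical fields, but splitting on every comma reports six:

```shell
# Hypothetical 3-field record; the quoted middle field contains commas.
cat > sample.csv <<'EOF'
False,"[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",Toy Story
EOF

# Every comma counts as a separator, including those inside the quotes.
awk -F, '{ print NF }' sample.csv
# prints: 6
```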

However, there is a solution: GNU awk (gawk) supports a built-in variable called FPAT, which can be used to specify a field pattern as a regular expression. Instead of describing the separator between fields, this pattern describes what the contents of a field look like, so a quoted string with embedded commas can be matched as a single field.
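
As a rough sketch of the idea, assuming GNU awk is installed as gawk (FPAT is a gawk extension and other awk implementations will not honour it), the following treats a field as either a double-quoted string or a run of non-comma characters. The record and the file name fpat_sample.csv are made up for illustration:

```shell
# Hypothetical 3-field record; the quoted middle field contains commas.
cat > fpat_sample.csv <<'EOF'
False,"[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",Toy Story
EOF

# Requires GNU awk. Prints the field count, then the second field as-is
# (the surrounding double quotes are still part of the field).
gawk 'BEGIN { FPAT = "([^,]*)|(\"[^\"]+\")" }
      { print NF; print $2 }' fpat_sample.csv
```

With this pattern the record should be counted as three fields rather than six. Note that it does not cope with escaped quotes inside a quoted field.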

You can read all about this at the on-line GNU awk documentation here, along with some sample code:

Splitting By Content (The GNU Awk User's Guide)

Have a read over that, try implementing a solution using FPAT, and let us know how you get on. In the event that you have problems, please do feel free to come back here with the code you've written, the output you get, and a description of the difficulties you're facing, and we can take things from there.

Hope this helps !

alrai · December 18, 2022, 9:51am

Thanks drysdalk

Good lead, I'll get right on it.

Thanks for the welcome indeed, I intend to stay here and have fun learning and then start sharing what I learn.

Two thumbs up...

chatgpt · January 4, 2023, 3:13am

Here is the answer provided by ChatGPT:

It sounds like you want to extract the data within the square brackets [] and treat them as a single field in your output. One way to do this with awk is to use the match() function to extract the data within the brackets, and then use that as the value for a new field in your output.

Here is an example awk script that demonstrates this approach:

# Set the field separator as a comma
BEGIN { FS = "," }

# For each line, extract the data within the brackets and store it in a new field
{
  match($0, /\[.*\]/);
  new_field = substr($0, RSTART, RLENGTH);
  print new_field
}

To use this script, you can save it to a file (e.g. extract_fields.awk) and then run it on your data like this:

awk -f extract_fields.awk data.csv

This will output the data within the brackets for each line in the input file. If you want to include the rest of the fields in the output as well, you can use the print statement to print all of the fields, like this:

# Set the field separator as a comma
BEGIN { FS = "," }

# For each line, extract the data within the brackets and store it in a new field
{
  match($0, /\[.*\]/);
  new_field = substr($0, RSTART, RLENGTH);

  # Print all of the fields, including the new field
  for (i = 1; i <= NF; i++) {
    printf("%s", $i)
    if (i < NF) {
      printf(",")
    }
  }
  printf("\n")
}

I hope this helps! Let me know if you have any questions.

MadeInGermany · January 4, 2023, 9:27am

I think the ChatGPT answer is not good.
It does not address the complexity of nested delimiters in the given input file.
Was the input file given at all?
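
One concrete way to see the problem: the sample records contain several bracketed fields per line (genres, production companies, countries, languages), and the greedy .* in that regex spans from the first [ to the last ] on the line. A minimal sketch with a made-up line that has two bracketed fields:

```shell
# Made-up line with two separate bracketed fields.
printf '%s\n' '1,"[a, b]",x,"[c]",2' |
awk '{
    # .* is greedy: the match runs from the first [ to the last ] on the line
    if (match($0, /\[.*\]/))
        print substr($0, RSTART, RLENGTH)
}'
# prints: [a, b]",x,"[c]
```

So everything between the two bracketed fields gets swallowed into one "field".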

The FPAT approach is more promising, but it is complex and may still not be precise enough.
The best precision (understanding of the delimiter levels) can be achieved with perl.

Here is a partial solution with awk FPAT (this needs GNU awk):

awk 'BEGIN { FPAT="[^,]*|\"[^\"]*\"" } { for (i=1; i<=NF; i++) print "f" i, $i }' inputfile

The fields in "[ ]" can be split further, using an explicit split().
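
A minimal sketch of that second step, assuming GNU awk is installed as gawk and using a made-up three-field record in the style of the dataset (the file name genres_sample.csv and the variable names list and item are only for illustration). It strips the outer "[{ and }]" from the field and then splits the rest into one element per {...} entry:

```shell
# Hypothetical 3-field record; the quoted middle field contains commas.
cat > genres_sample.csv <<'EOF'
False,"[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",Toy Story
EOF

gawk 'BEGIN { FPAT = "[^,]*|\"[^\"]*\"" }
{
    list = $2
    gsub(/^"\[[{]|[}]\]"$/, "", list)    # strip the leading "[{ and trailing }]"
    n = split(list, item, /[}], *[{]/)   # one element per {...} entry
    for (i = 1; i <= n; i++)
        print i, item[i]
}' genres_sample.csv
```

For the sample line this should print two numbered entries, one per genre. Each entry could be split again on ", " to get at the individual key/value pairs.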