awk assistance - Comparing 2 csv files

Hello all,

I have searched high and low for a solution to this; many come really close, but none are quite what I'm after.

I have 2 files. One contains GUIDs, for example:

8121E002-96FE-4C9C-BC5A-6AFF20DACECD
84468F30-F3B7-418B-81F0-0908E80792BF

A second file contains a path to each of those GUIDs, for example:

"test-data","TEST/DATA/84468F30-F3B7-418B-81F0-0908E80792BF.pdf"
"test-data","TEST/DATA/8121E002-96FE-4C9C-BC5A-6AFF20DACECD.pdf"

I need a 3rd csv file created, like this:

84468F30-F3B7-418B-81F0-0908E80792BF, "TEST/DATA/84468F30-F3B7-418B-81F0-0908E80792BF.pdf"
8121E002-96FE-4C9C-BC5A-6AFF20DACECD, "TEST/DATA/8121E002-96FE-4C9C-BC5A-6AFF20DACECD.pdf"

The closest I have come is the following:

awk -F "[,]" 'NR==FNR{q=$1;$1="";A[q]=$0;;next} ($2 in A) {print FILENAME, $2}' test.csv *.csv

(Note: there are several csv files to search, hence *.csv.)

I believe the issue is something to do with the $2 in A: it's not an exact "in" match I need, it's more of a "like" (contains) operator.
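
For instance, a toy illustration of the difference as I understand it:

awk 'BEGIN {
    A["abc"]
    print ("abc.pdf" in A)      # prints 0: "in" tests for the exact array subscript
    print ("abc.pdf" ~ "abc")   # prints 1: "~" matches the pattern anywhere in the string
}'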

Any assistance would be a massive help, thank you!!

Tirm

Hi,
Maybe something like this:

$ cat /tmp/a.csv
8121E002-96FE-4C9C-BC5A-6AFF20DACECD
84468F30-F3B7-418B-81F0-0908E80792BF
$ cat /tmp/b.csv 
"test-data","TEST/DATA/84468F30-F3B7-418B-81F0-0908E80792BF.pdf"
"test-data","TEST/DATA/8121E002-96FE-4C9C-BC5A-6AFF20DACECD.pdf"
$ awk -F "," 'BEGIN{OFS=FS}NR==FNR{q=$1;$1="";A[q]=$0;;next} {q=$2;gsub(/.*\/|\..*/,"",$2)} ($2 in A) {print $2,q}' /tmp/a.csv /tmp/b.csv 
84468F30-F3B7-418B-81F0-0908E80792BF,"TEST/DATA/84468F30-F3B7-418B-81F0-0908E80792BF.pdf"
8121E002-96FE-4C9C-BC5A-6AFF20DACECD,"TEST/DATA/8121E002-96FE-4C9C-BC5A-6AFF20DACECD.pdf"

Regards.

Please get accustomed to taking the utmost care when specifying your request, especially constraints like "exactly (no more and no less than) the GUIDs in file1", and give people a chance to understand what test.csv is, and what *.csv is.
Why not, for example, simply this:

awk -F, '{TMP = $2; gsub (/^.*\/|\..*$/, _, TMP); print TMP, $2}' OFS=, file2
84468F30-F3B7-418B-81F0-0908E80792BF,"TEST/DATA/84468F30-F3B7-418B-81F0-0908E80792BF.pdf"
8121E002-96FE-4C9C-BC5A-6AFF20DACECD,"TEST/DATA/8121E002-96FE-4C9C-BC5A-6AFF20DACECD.pdf"

I guess it is because you want to filter file2 by the values in file1. How about this:

awk -F "[,]" 'NR==FNR{A[$1]; next} { for (a in A) {if ($2 ~ a) print a, $2}}' OFS=, file[12]
84468F30-F3B7-418B-81F0-0908E80792BF,"TEST/DATA/84468F30-F3B7-418B-81F0-0908E80792BF.pdf"
8121E002-96FE-4C9C-BC5A-6AFF20DACECD,"TEST/DATA/8121E002-96FE-4C9C-BC5A-6AFF20DACECD.pdf"
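
Note that $2 ~ a is a regex match, so each GUID is found anywhere inside the path; a GUID is harmless as a regex, since hex digits and hyphens have no special meaning outside bracket expressions. Run against several data files at once it would look like this (data*.csv standing in for whatever glob matches your data files but not the GUID file itself):

awk -F "[,]" 'NR==FNR{A[$1]; next} { for (a in A) {if ($2 ~ a) print a, $2}}' OFS=, test.csv data*.csv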

Thank you for the replies. Let me put some more clarity around the request, as suggested, as I'm not sure the awk suggestions are working.

First, here is the csv that contains the list of GUIDs:

$ cat test.csv
8121E002-96FE-4C9C-BC5A-6AFF20DACECD
84468F30-F3B7-418B-81F0-0908E80792BF
8121E002-96FE-4C9C-BC5A-6AFF20DACECD
1BCE1E40-D1BE-4DC1-8A0C-9EB236F56944

Now, here is an example of the csv that contains the data I'm trying to retrieve:

"test","data/content/FN0/FN0/FN0/170535BB-A28D-42C4-92ED-767BB1469C8D%7BC0348F5A-0000-C624-9710-5ED2E8AA2B14%7D0"
"test","data/content/FN0/FN0/FN0/17ECDCFA-AF30-4C82-A156-99F941739352%7B1373E3F5-D5BE-475B-900F-B73ECB05C6AB%7D0"
"test","data/content/FN0/FN0/FN0/182FECB0-ADBF-4F27-9DD6-CE5508872AA5%7BC88C4971-F16A-4C28-9E14-07D0AD4E3C79%7D0"
"test","data/content/FN0/FN0/FN0/194D4F16-CD21-46EF-A584-8C378FBAD55F%7B439BC63F-C291-479D-BFEC-121BC86E3988%7D0"
"test","data/content/FN0/FN0/FN0/1AD46B75-8357-421D-A072-64872C6C763C%7BF299763B-D507-4303-A819-00BD0C60AAA5%7D0"
"test","data/content/FN0/FN0/FN0/1B810336-4EA7-49E3-8325-69487AC0CE95%7BD2F0EDEA-1486-451C-A09C-9AC39D582BBF%7D0"
"test","data/content/FN0/FN0/FN0/1BA93974-FFCC-4BE2-AFBE-11B92D579D4B%7B805AFF56-0000-CC14-9F56-C335802C5C15%7D0"
"test","data/content/FN0/FN0/FN0/1BBC8C9A-725B-428D-AC5A-9C0129C80F82%7BE515FB66-D51E-4057-97C8-94CDEC52F83A%7D0"
"test","data/content/FN0/FN0/FN0/1BCE1E40-D1BE-4DC1-8A0C-9EB236F56944%7B496DDE3B-6102-4744-80FA-C30D64D91815%7D0"

As you can see, 1BCE1E40-D1BE-4DC1-8A0C-9EB236F56944 is in both the first csv and the second csv. In the last line of the second csv it is embedded in a double GUID. I want the output to be:

1BCE1E40-D1BE-4DC1-8A0C-9EB236F56944, "data/content/FN0/FN0/FN0/1BCE1E40-D1BE-4DC1-8A0C-9EB236F56944%7B496DDE3B-6102-4744-80FA-C30D64D91815%7D0"

I tried the following suggestion:

awk -F "," 'BEGIN{OFS=FS}NR==FNR{q=$1;$1="";A[q]=$0;;next} {q=$2;gsub(/.*\/|\..*/,"",$2)} ($2 in A) {print $2,q}' test.csv file2.csv

But it did not find the above GUID.

RudiC, for your suggestion, where do I pass in the first filename?

Hopefully that helps!

Regards

Tirm

Which is, except for the space (wherever that comes from), exactly what my second suggestion would yield, given the input files are supplied in the correct order. The shell will expand file[12] to file1 file2.

Now, there are two possible ways to arrange two file names: file1 file2 or file2 file1. Which did you test? Only one will yield an (THE) output.
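
That is, with your file names:

awk -F "[,]" 'NR==FNR{A[$1]; next} { for (a in A) {if ($2 ~ a) print a, $2}}' OFS=, test.csv file2.csv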

Hi RudiC,

That works great, thank you. I have run a few tests on smaller files, and it's fine.

I have 2 further queries that extend this. First, the lookup file contains 1.87 million rows, and the CSV file to check it against contains 400,000 rows (and there are around 40 csv files to check).

Running it takes a very long time, so I'm wondering if that might change the command at all, or if it just needs to run overnight.

The second query is: can I use the awk command above, but instead of searching in a csv, search a directory? If we take the example above:

data/content/FN0/FN0/FN0/1BCE1E40-D1BE-4DC1-8A0C-9EB236F56944%7B496DDE3B-6102-4744-80FA-C30D64D91815%7D0

Can I instead add a find command? So read the first line of the csv and get:

1BCE1E40-D1BE-4DC1-8A0C-9EB236F56944

Then search

data/content/*

Recursively and find the filename that contains it?

Thank you so much for your help so far, very kind!

Regards

Tirm

2 million lines is quite something, and you have to read them all into memory. Checking against 40 * 0.4E6 (16 million) lines will take its time. I'm not sure whether the system will already be swapping with data amounts like those. Try cutting the 2 million in half or in quarters.
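
If the per-line loop over all the GUIDs is what hurts, one option might be to pull the GUID out of the path with match() and go back to a hash lookup, i.e. one array probe per data line instead of ~2 million regex tests. A sketch, assuming the GUID you want is always the first one in the path, and an awk that supports {n} interval expressions (GNU awk does); data*.csv again stands in for your data file glob:

awk -F, 'NR==FNR {A[$1]; next}
         match($2, /[0-9A-Fa-f]{8}(-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12}/) {
             g = substr($2, RSTART, RLENGTH)   # first GUID found in the path
             if (g in A) print g, $2           # hash lookup instead of a loop
         }' OFS=, test.csv data*.csv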

I'm not sure I understand your second query; your usage of "csv" is not quite consistent or obvious to me.
awk needs files to operate upon, not directories. If you open, read, and close 40 (or so) files for every single line read from the (lookup / test / csv / GUID) file, you'll thrash your file system. Not clever.

Hi,

Yes, I agree 100% with your evaluation. I'm thinking searching for the file in a directory might be quicker than checking millions of lines of csv. The logic would be:

  1. Read the first line of the GUID csv
  2. Search the data directory for any file that matches the pattern (not exact, a "contains")
  3. If a path is found, write it out to a csv
  4. Move to the next line in the csv

I'm wondering if this is better done in Python or something similar, but I would rather stick to grep/awk if I can.
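
Something along these lines, perhaps (an untested sketch; grep -F treats each GUID as a fixed string, and the single find keeps the tree walk to one pass):

find data/content -type f > filelist.txt        # walk the directory tree once
grep -F -f test.csv filelist.txt > found.csv    # search for all GUIDs in one go

That would only give the matching paths; the GUID column would have to be glued back on afterwards, e.g. with a match()/substr() step in awk.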

Does that make sense?

Thanks!

Tirm

I ended up outputting the directory list to a csv, then using a fast searching algorithm to create the copy routines. I had to do this because the awk command was reloading the data far too many times, so it would have taken months to run!

Thanks everyone for your assistance.