Finding the missing columns in target file

Butterfly · February 9, 2024, 8:27am

Hi Team - i want to compare two csv files based on 2nd and 3rd column - table name and column name if the table is present in target then it should check the column name and if the column name is missing in target then it should print the entire line to result.csv file . Can someone help me resolve it? Appreciating your help on this.
i tried the below command but its fetching tables missing also:

awk -F ',' 'NR==FNR{source[$2,$3]; next} ($2,$3) in source{delete source[$2,$3]; next} 1; END{for (key in source) print source[key]}'  target.csv source.csv > result.csv

Paul_Pedant · February 9, 2024, 9:51am

Examples of actual data would be helpful to visualise the issue and test any solutions.

What does target mean? Is the data in the two CSV files symmetrical? It is not clear whether print the entire line refers to the target file, the other file, or both.

MadeInGermany · February 9, 2024, 10:03am

Welcome @Butterfly !

source[$2,$3]
just defines the key, does not store a value, unlike
source[$2,$3]=$0
And
print source[key]
prints the value.

If you want to print the keys then you just do
print key
Here a difficulty is that the ($2,$3) is in reality ($2 SUBSEP $3) with the unprintable SUBSEP character.
A work-around is to use the single-character FS that was set by the -F','
Replace all $2,$3 by $2 FS $3 and it becomes printable.

Butterfly · February 9, 2024, 10:11am

its giving the table missing value too. Any idea on how to resolve this?

MadeInGermany · February 9, 2024, 10:13am

Ah yes, and the
1
is a true condition and does a { print } action when it's at the second file. Omit it.

Butterfly · February 9, 2024, 10:21am

sure here is the sample data
source.csv

TABSCHEMA TABLENAME COLUMN SCALE NULL
ODS                 colors              red             10        N
ODS                 colors              green         5          Y
ODS                 colors              yellow        4          N
ODS                 name               Harry         3          Y
ODS                 name               Tom           10       N
ODS                name                Jack            3       N   
ODS                things               Hat              5       Y
ODS                things               Pot              2        N
ODS                things               Jug              5      Y

Target.csv

SCHEMA  TABLENAME     COLUMN   SCALE  NULL
ODS                 colors              red             10        N
ODS                 colors              green         5          Y
ODS                 name               Tom           10       N
ODS                name                Jack            3       N

i would want the result to be printed as :

SCHEMA TABLENAME  COLUMN  SCALE  NULL
ODS                 colors              yellow        4          N
ODS                 name               Harry         3          Y

only if the tables exists in both the csv files the columns check should happen and the if the columns are missing it should be written to results.csv file

Butterfly · February 9, 2024, 10:29am

it still prints the tables that are missing in target

MadeInGermany · February 9, 2024, 11:24am

Thanks for the input
I assume the space-separated columns are comma-separated in reality.

This means the columns $2 and $3 are treated differently.

It is more efficient to only store values from one file and print along the other file; you don't need an END section then.
Because your result lines are from the source.csv it makes sense to first process target.csv, so printing goes while processing source.csv, and the correct order is guaranteed.

awk -F',' 'NR == FNR { tablename[$2]; column[$3]; next } ($2 in tablename) && !($3 in column)' target.csv source.csv

While processing the first file, it builds separate $2-indexed and $3-indexed arrays.
While processing the second file, it does the comparisons. I have omitted a { print } action because it is default if the condition is true.
Explicit would be
awk -F',' 'NR == FNR { tablename[$2]; column[$3]; next } ($2 in tablename) && !($3 in column) { print }' target.csv source.csv
Even more explicit would be if within { }
awk -F',' '{ if (NR == FNR) { tablename[$2]; column[$3]; next }; if (($2 in tablename) && !($3 in column)) { print $0 } }' target.csv source.csv

Butterfly · February 9, 2024, 12:56pm

Thank you so much. The solution worked

Butterfly · February 9, 2024, 1:48pm

Also i can see some tables and columns are missing in the results file. Any idea on this?

munkeHoller · February 9, 2024, 2:10pm

@Butterfly , i'd help by showing rather than saying you can see .... let us see as well

tks

Butterfly · February 9, 2024, 2:17pm

Sorry buddy. The data is huge and wouldnt like to spam the chat box .Some data existing in source but not in target is not been pulled to the results. Need help to deal with this issue. Thanks

munkeHoller · February 9, 2024, 2:34pm

@Butterfly , an actual (small) representable sample showing the issue(s) is all that's required.

Butterfly · February 9, 2024, 2:55pm

Source.csv
SCHEMA TABLENAME COLUMN SCALE NULL
ODS colors red 3 y
ODS colors blue 4 n
ODS colors green 5 y
ODS colors yellow 7 y
ODS colors pink 5 n
ODS colors white 9 n
ODS name Tom 5 y
ODS name Jack 6 n
ODS name Harry 7 n
ODS name John 8 n

target.csv
SCHEMA TABLENAME COLUMN SCALE NULL
ODS colors red 3 y
ODS colors blue 4 n
ODS colors pink 5 n
ODS colors white 9 n
ODS name Tom 5 y
ODS name Jack 6 n
ODS things hat 5 y
ODS things pen 6 y
Results.csv
SCHEMA TABLENAME COLUMN SCALE NULL
ODS colors green 5 y
ODS colors yellow 7 y

Missing in Results
SCHEMA TABLENAME COLUMN SCALE NULL
ODS name Harry 7 n
ODS name John 8 n

vgersh99 · February 9, 2024, 3:08pm

@Butterfly ,
please don't post input/out of CSV as HTML tables. Format your input/output as plain CSV files (as you have them).
The markdown code tags (for formating) are defined here.
I'm not going to reformat your data samples as it's a bit tedious - just edit your post to provide a REAL CSV data vs HTML formatted one.

MadeInGermany · February 9, 2024, 3:41pm

Do you mean the missing lines are

ODS things hat 5 y
ODS things pen 6 y

?

Butterfly · February 9, 2024, 3:44pm

That are in target and we can ignore it . i am not able to see "Missing in Results.csv" in "Results.csv"

MadeInGermany · February 9, 2024, 3:58pm

Did you forget the redirection to the results.csv?
Now added:

awk -F',' 'NR == FNR { tablename[$2]; column[$3]; next } ($2 in tablename) && !($3 in column)' target.csv source.csv > results.csv

This prints the 4 differences.

But it cannot print the ones that are only in target.csv, because it goes along with source.csv.
If you want these, then take your initial attempt, augmented with the =$0 value assignment.

awk -F',' 'NR==FNR{source[$2,$3]=$0; next} ($2,$3) in source{delete source[$2,$3]; next} 1; END{for (key in source) print source[key]}' target.csv source.csv > results.csv

This prints the 6 differences.

Matt-Kita · February 9, 2024, 4:01pm

I don't know awk that well, but I believe this bash code produces the expected result

#!/usr/bin/env bash
while IFS=, read -r SCHEMA TABLENAME COLUMN SCALE NULL <&3; do
    grep -q "^.*,$TABLENAME,.*,.*,.*$" target.csv &&\
    if ! grep -q "^.*,$TABLENAME,$COLUMN,.*,.*$" <(grep "^.*,$TABLENAME,.*,.*,.*$" target.csv); then
        echo "$SCHEMA,$TABLENAME,$COLUMN,$SCALE,$NULL" >> result.csv; 
    fi;
done 3< source.csv

Correct it as you find suitable.

Paul_Pedant · February 9, 2024, 4:07pm

You can also fix with awk -v SUBSEP=',' ... or awk 'BEGIN { SUBSEP = ","; ..., before you make any index entries. Then the X[3,4] dual index is really the string "3,4".

Somewhat confusingly, the default value for awk SUBSEP (subscript separator) is the ASCII FS character (an arbitrary choice), and the default value for awk FS is "Any amount of whitespace".

Oct   Dec   Hex   Char
034   28    1C    FS  (file separator)

You may occasionally want to split the array index into an array, like:

for (j in X) { split (j, V, SUBSEP); ....; }    #.. And use V[1] and V[2] separately.