CSV File: Filter duplicate records from column 1 & another column having unique records

as7951 · December 28, 2017, 11:35am

Hi Experts,

I have a CSV file with 30 to 40 columns.
I am pasting just 2 columns to describe the problem.
I need to print an error if the combination below is not present in the file.
Check column 1 (DocumentNumber) and pick out the rows where the value in the DocumentNumber field is the same.
For all such rows, the field LineNumber (column 2) should be unique for each row.
If column 1 contains a duplicate value (2345, 2345) on rows 1-2, then column 2 must contain unique values such as (1, 2) on rows 1-2.
Similarly, for column 1 rows 3-4 with the duplicate value (6789, 6789), column 2 must contain unique values, as below: 5, 6.
If the combination explained above is not present, then a log must be written to another file with an error code and the line number.

Sample file.

DocumentNumber LineNumber
2345	         1
2345	         2
6789	         5
6789	         6
4321             2
4321             3

RudiC · December 28, 2017, 6:22pm

More details, please. What should the output look like? Will there always be exactly two lines per document number? What would be the criterion for field#3 - just non-identical numbers per document No.? Any limits on those numbers?

rbatte1 · December 29, 2017, 3:10am

Is this not https://www.unix.com/shell-programming-and-scripting/276167-filter-duplicate-records-csv-file-condition-one-column.html#post303010264 ? If it is the same discussion, let me know and I will close off this thread so all the comments go to a single place for clarity.

Kind regards,
Robin

as7951 · December 29, 2017, 3:38am

Hi Robin,

This is a separate query and thread and not the same as mentioned in "Filter duplicate records from csv file with condition on one column".

---------- Post updated at 03:38 AM ---------- Previous update was at 03:34 AM ----------

Hi Robin,

I don't want to modify the input file and do not want separate output;
I just want to print the line number with an error code if the above conditions are not met.

as7951 · January 2, 2018, 5:26am

Hi Experts,

Apologies in case I am disturbing you with my posts.
I am not very good at awk scripting, but I do shell scripting and try to learn more from the issues I come across.
But I sincerely need to know a workaround for this query.

I tried the code below, but it is not working as I expect.
It works when column 2 contains a unique value in every row, but if row 2 and row 5 contain the same value, it prints "error".

awk -F"|" '
{ ++CNT[$1] }
{ ++ABC[$2] }

(CNT[$1] && ABC[$2] > 1) { print "error" }
'

Please help me improve it.

Suppose the file contains duplicate values in column 1; then against those duplicate values there should be unique values in column 2, as in the sample file above.
There won't be just 2 lines per document number; there can be any number of duplicate values, more than 5 or even 50.
Yes, there should be non-identical numbers in column 2 (LineNumber) per document number (column 1), and there is no limit on the numbers; they just have to be non-duplicate.
If column 1 contains duplicate values across rows, then column 2 must contain non-duplicate values for those rows.

RudiC · January 2, 2018, 5:46am

No apologies needed, as people in these forums are here to help. Posts don't disturb anybody - what IS disturbing is if people don't learn, be it to comply with forum rules, how to reasonably specify a problem, or to apply / adapt coding hints to actual problems.

Your code sample doesn't work with the sample in post#1, as the field separator in the data is a <TAB> followed by multiple spaces (matched by the default awk FS) and the code has | . Try

awk 'C[$1,$2]++ {print "error line", NR}' file

and report back the results.
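
As a quick check, the one-liner can be tried on a small made-up file first. This is only a sketch: it assumes whitespace-separated data as in post#1, uses plain spaces instead of tabs, and adds one repeated pair (2345 / 2) that is not in the original sample. The temporary file and the variable names sample and result are just for illustration.

```shell
# Hypothetical test data: layout of post#1 plus one repeated pair on line 5.
sample=$(mktemp)
printf '%s\n' \
  'DocumentNumber LineNumber' \
  '2345           1' \
  '2345           2' \
  '6789           5' \
  '2345           2' \
  '4321           2' > "$sample"

# Default FS splits on any run of blanks/tabs, so no -F is needed here.
result=$(awk 'C[$1,$2]++ {print "error line", NR}' "$sample")
echo "$result"    # prints: error line 5
rm -f "$sample"
```

Line 6 (4321 / 2) is not reported although LineNumber 2 occurs elsewhere, because the pair is only compared within the same document number. For a file that really is | separated, -F"|" would have to be added.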

as7951 · January 2, 2018, 5:56am

Hi Rudic,

Thank you
It worked
You saved my life.

Salute you.

Also, could you please let me know how this code performs the required task?
What does C stand for?

RudiC · January 2, 2018, 6:05am

Glad it worked. I presume I can apply at "bay watch" now, being a life saver.

The C is just an array variable - as you used them in your code (ABC, CNT) - with a very short, cryptic name. Its elements are undefined in the beginning, thus evaluating to FALSE when encountered (and created empty) the first time. Then they're post-incremented and will yield TRUE for any further reference with identical indices built from document and line number.
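
Spelled out in long form, the same idea might look like the sketch below. It is not a drop-in replacement, just the one-liner unrolled so each step is visible. The function name check_pairs, the array name seen, the error code E001 and the header skip are all assumptions made for illustration; the thread does not define an error code or say whether a header line is always present.

```shell
# check_pairs INPUT LOGFILE
# Writes one line to LOGFILE for every (DocumentNumber, LineNumber) pair
# that has already appeared earlier in INPUT.
check_pairs() {
  awk '
    NR == 1 { next }          # assumption: the first line is a header
    {
      key = $1 SUBSEP $2      # the same index awk builds for C[$1,$2]
      if (key in seen)        # this pair was stored by an earlier line
        print "E001 line " NR ": DocumentNumber " $1 " repeats LineNumber " $2 " (first seen on line " seen[key] ")"
      else
        seen[key] = NR        # remember where the pair first occurred
    }
  ' "$1" > "$2"
}
```

Called as check_pairs file errors.log (both file names are placeholders), it leaves the input untouched and produces an empty log when every pair is unique. Unlike C[$1,$2]++, which only counts, this version stores the line number of the first occurrence so the log can point at both lines.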