Finding the total distinct count from multiple CSV files through a UNIX script

Hi All,
I have multiple pipe-delimited CSV files in a directory. I need to find the distinct count of a column in those files, and the total distinct count across all files.

We can't merge all the files here, as the file sizes are huge (millions of records each). I have tried the following for each file:

cat Test1.csv|cut -d"|" -f38|uniq|wc -l
cat Test2.csv|cut -d"|" -f38|uniq|wc -l
cat Test3.csv|cut -d"|" -f38|uniq|wc -l

I need to automate the above procedure (e.g., by putting each file's count in a temporary file), as multiple CSV files are present in the directory and I need the total distinct count of the column across all of them. Can anyone please help me with this?

Are all of your .csv files sorted on field 38? If not, your code won't work. (You will get artificially high counts for the number of distinct values in a file because uniq produces a line of output for each case where a field 38 value changes from the value found on the previous line.)

Does each distinct value in field 38 of all of your files appear in only one of your input files? If not, your code won't work. (You don't have any way to determine which distinct values in a single file appear in one or more of the other files.)

Are you always processing 3 files?

Having 3 files of a megabyte each should not cause any problem producing a single merged or sorted combined file. Why are you unable to merge them?

Why not just use a single awk script to read all of your files once and produce the output you want for each input file and for the combined input from all of the input files?

Do you really want the number of distinct field 38 values in each input file? Or, do you really just want the number of distinct field 38 values in the merged input files?

Please become accustomed to providing decent context info about your problem.
It is always helpful to support a request with system info like OS and shell, related environment (variables, options), preferred tools, adequate (representative) sample input and desired output data and the logic connecting the two, and, if they exist, system (error) messages verbatim, to avoid ambiguities and keep people from guessing.

Please be aware that without a prior sort, uniq can't reliably find unique entries.
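
A quick illustration of the difference:

printf 'a\nb\na\n' | uniq | wc -l	# prints 3, because uniq only collapses adjacent duplicates
printf 'a\nb\na\n' | sort -u | wc -l	# prints 2, because sort groups the duplicates first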

Hi,
Please find my responses below:
None of the CSV files is sorted on field 38, but field 38 contains only numerical values.
Each distinct value appears in only one file; no value is present in multiple files.
We are not processing only 3 files; it can be around 100-120 files.

All of the input files (around 120) can contain millions of records. If the combined data size exceeds 400 GB there is a chance the system will blow up; that's why we are not merging all the files.

We need the total distinct count of field 38 across all the files present in the directory.

As I'm relatively new to scripting, can you please help me with the above requirement? Your help will be very much appreciated.

As RudiC suggested (and as you well know from being a member of this forum for over 2.5 years), if you don't provide sample input and corresponding desired output from a couple of sample input files, you don't tell us what operating system and shell you're using, and you don't provide a clear specification of what you are trying to do, there is a good chance that any suggestions provided might not work with your data in your environment. The following is completely untested and makes several assumptions that might be wrong...

awk -F'|' '
!(($38 + 0) in a) {
	a[$38 + 0]
	c++
}
END {	print c
}' *.csv

Hi Don,
Please find below the sample input files in a directory.

[omnidevint@sftp311 full_01jan13_31aug15]$ ls -ltr
total 800068
-rw-rw-r--. 1 omnidevint omnidevint  97184275 Aug 25 15:38 Final_Customer_Data_2013_2015_0.csv
-rw-rw-r--. 1 omnidevint omnidevint  88463636 Aug 25 15:38 Final_Customer_Data_2013_2015_1.csv
-rw-rw-r--. 1 omnidevint omnidevint  89536908 Aug 25 15:38 Final_Customer_Data_2013_2015_2.csv
-rw-rw-r--. 1 omnidevint omnidevint 107937776 Aug 25 15:38 Final_Customer_Data_2013_2015_3.csv
-rw-rw-r--. 1 omnidevint omnidevint 161423700 Aug 25 15:38 Final_Customer_Data_2013_2015_4.csv
-rw-rw-r--. 1 omnidevint omnidevint 129165088 Aug 25 15:38 Final_Customer_Data_2013_2015_5.csv
-rw-rw-r--. 1 omnidevint omnidevint 117823518 Aug 25 15:38 Final_Customer_Data_2013_2015_6.csv
-rw-rw-r--. 1 omnidevint omnidevint  27721536 Aug 25 15:38 Final_Customer_Data_2013_2015_7.csv

The contents of each file look like the following, where field 38 contains the customer_id.

 head -3 Final_Customer_Data_2013_2015_0.csv
first_name|last_name|email_address|brand|store_no|sales_associate_no|create_date|create_source|address_type_code|address_1|address_2|address_3|address_4|address_5|post_code|telephone_no|country_code|gvisa_flag|marital_status|gender|title|mobile_no|phone_opt_in_flag|mail_opt_in_flag|email_type|opt_in_flag|transaction_id|bm_customer_id|address_Active_Flag|mail_opt_in_date|phone_opt_in_date|date_last_modified|membership_date|head_of_household_flag|email_opt_in_date|points_current_balance|rewards_id|customer_id
Paula|Frediani|plfrediani@comcast.net|GYM|8888|9999|07/22/1999|STORE|BILLING|1906 Miller Ave||Belmont|CA||94002 1765|6506545255|US||U|U||6506545255|0|1|Rewards Signup||||1|02/17/2007||11/12/2016|06/18/2010|1||133|144RG2JS7F31735|58
Valerie|Burkart|valzb@sbcglobal.net|GYM|8888|9999|08/05/1999|STORE|BILLING|UNKNOWN||PLEASANTON|CA||94566|9254850902|US||U|U||9254850902|0|1|Rewards Signup|||2253998879592428|1|02/17/2007||03/18/2012|03/17/2012|1||239|1AS4V98V8G31735|255


And we are working on the Linux operating system with the bash shell.

Linux sftp311 3.10.0-327.el7.x86_64 #1 SMP Thu Oct 29 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux

And we need the distinct customer_id count from all the CSV files in the directory. Based on this info, if you provide me any script, that will be helpful. Thanks in advance.

You didn't say anything about there being header lines in your files that need to be ignored, so the code I suggested in post #5 in this thread would give you a count that is 1 more than the number of different customer IDs found in all of the .csv files in the directory in which you run that script. I assume that you can manually subtract 1 from the printed result, or change the line in the code that prints the value of c to print c-1 instead, to get the results you want.
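
If you would rather not adjust the result by hand, here is an equally untested variant that skips the header lines, assuming every file starts with exactly one header line:

awk -F'|' '
FNR == 1 { next }	# FNR restarts at 1 for each input file, so this skips each header
!(($38 + 0) in a) {
	a[$38 + 0]
	c++
}
END {	print c
}' *.csv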

Did you try running the code I suggested? Do you have some reason to think it is not doing what I described above?

If what I provided is not sufficient, please explain what it is doing wrong AND show us exactly the output you want from the sample input you provided.

Hi Don,
Thanks for your help. The code you mentioned is working. One clarification: is this giving us the distinct customer_id count by removing duplicates?

[omnidevint@sftp311 full_01jan13_31aug15]$ awk -F'|' '
> !(($38 + 0) in a) {
> a[$38 + 0]
> c++
> }
> END {print c
> }' *.csv
3051724

As I'm new to awk, could you kindly explain the above block for my understanding? Also, apart from using awk, can we do this in any other way?

The shell runs the awk code in a multi-line 'string':

awk '
awk code
'

In the awk code:
$38 is column 38.
a[ ] is a string-addressed array (hash).
For each line:
If $38 is not in a (i.e., a[$38] does not exist), the following code in braces is run.
It defines a[$38] (without assigning a value) and increments c (variables start out as 0 or empty).
After all lines of all given files have been processed, the END section runs: print c.

c is incremented when a[$38] does not exist, but not when the same $38 is met again, because then a[$38] already exists.
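
As for doing this without awk: a sort-based pipeline is one possible sketch. It assumes a sort implementation that spills to temporary files on disk (GNU sort does), and since only field 38 flows through the pipe, the temporary files stay far smaller than the full data set. The -n makes numerically equal values such as 255 and 000255 compare as one, matching the $38 + 0 trick in the awk code:

cut -d'|' -f38 *.csv | sort -n -u | wc -l	# distinct numeric field 38 values, plus 1 for the headers

As with the awk version, the header lines count as one extra value (they all compare as 0 numerically), so subtract 1 from the result.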

In addition to what MadeInGermany has already said, you might note that I used $38 + 0 instead of just $38. The awk language allows fields to be interpreted as strings or as numbers. By adding zero to a field's value, we force the field to be treated as a number. So, if some of your field 38 values have leading zeros and some don't, adding 0 will force customer_id 000255 and customer_id 255 to be treated as a single value. If we treated field 38 as a string, those two values would be different customer_ids. Note that when I presented the code, you hadn't given us any examples, and we didn't know whether that field would contain integer values or floating-point values. Adding 0 also makes awk treat 123 and 123.000 and 123.0 and 1.23e+2 as a single value even though they are four different strings.
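
A small illustration of that point:

printf '000255\n255\n123.0\n1.23e+2\n' | awk '!(($1 + 0) in a) { a[$1 + 0]; c++ } END { print c }'	# prints 2

With string subscripts (a[$1] instead of a[$1 + 0]), the same input would count 4 distinct values.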