Hello everyone!
I have thousands of CSV files that I have to import into a database table. As usual, the files aren't perfect. For example, they have different numbers of columns and some weird columns.
The second problem is that I have to put 3 parts of the filename into 3 columns in the table. The manufacturer, the filetype and the date/time are only in the filename, not in the columns of the file itself.
The filename looks like:
Manufacturer1_FiletypeA_20180502120102.csv
Manufacturer2_FiletypeA_20180502140537.csv
Manufacturer3_FiletypeB_20180501010424.csv
And the data like:
abcd; XXXX; 1234; XXXX; XXXX; 43563; abcde; XXXX; XXXX;
cdef; XXXX, 4354; XXXX; XXXX; 43244; cdefg; XXXX;
where XXXX stands for useless data I don't need.
What I have at the moment:
A script to select only the needed columns (awkScript):
#!/usr/bin/awk -f
# keep only the needed columns of the semicolon-separated input
BEGIN {FS = ";"; OFS = ";"; ORS = "\n"}
{print $1,$2,$7,$28,$29,$33,$34,$38,$39,$44,$45}
And a command that processes one file at a time and moves it to a new location, from which I can import the file into the database:
hdfs dfs -cat {remoteHdfsDir}/{file} | awk -f {awkScript} | sed 's/\r//' | sed '$d' | hdfs dfs -put - {hdfsWorkDir}/{file}
The two sed commands are needed because the files come from Windows: the first strips the carriage returns, the second drops the last line.
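For clarity, here is the effect of that cleanup reproduced as a small Python sketch (the input string and the junk last line are made up):

```python
# Reproduce what the two sed steps do on a made-up Windows-style file:
# sed 's/\r//' strips the carriage returns, sed '$d' drops the last line.
text = "abcd;1234\r\ncdef;4354\r\njunk last line\r\n"

lines = text.split("\n")[:-1]            # split into lines; drop the '' after the final newline
lines = [l.rstrip("\r") for l in lines]  # sed 's/\r//'
lines = lines[:-1]                       # sed '$d'

print(lines)  # ['abcd;1234', 'cdef;4354']
```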
I use a python script to loop through the files, something like
for file in filelist:
    run script
    split filename
    import file into database and add columns manufacturer, filetype and date/time from filename
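Fleshed out, the loop looks roughly like this (the directory names are placeholders, and the actual HDFS call is commented out because it needs the cluster):

```python
import subprocess

# placeholder paths/names -- adjust to the real environment
remote_hdfs_dir = "/data/incoming"
hdfs_work_dir = "/data/work"
awk_script = "awkScript"

def split_filename(filename):
    """Split 'Manufacturer1_FiletypeA_20180502120102.csv' into its 3 parts."""
    stem = filename.rsplit(".", 1)[0]  # drop the .csv extension
    manufacturer, filetype, datetime = stem.split("_")
    return manufacturer, filetype, datetime

filelist = ["Manufacturer1_FiletypeA_20180502120102.csv"]
for file in filelist:
    manufacturer, filetype, datetime = split_filename(file)
    cmd = (f"hdfs dfs -cat {remote_hdfs_dir}/{file} | awk -f {awk_script} "
           f"| sed 's/\\r//' | sed '$d' | hdfs dfs -put - {hdfs_work_dir}/{file}")
    # subprocess.run(cmd, shell=True, check=True)  # needs a live HDFS cluster
    # ... then import {hdfs_work_dir}/{file} into the database with the 3 extra columns ...
```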
Because there are thousands of files, I'd like to add the 3 parts of the filename as columns inside the CSV files and merge everything into one file.
Does anyone have an idea how I could add the filename parts as 3 extra columns in my script?
And does anyone have an idea how I could merge the files into one (or several) file(s)?
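To show the kind of result I'm after, here is a pure-Python sketch that appends the 3 filename parts to every row and collects everything into one merged list (the filenames and rows are made up, and it skips the awk column selection):

```python
# Sketch: append manufacturer, filetype and date/time from the filename to
# every row, merging all files into one list of output rows.
files = {
    "Manufacturer1_FiletypeA_20180502120102.csv": ["abcd;1234\r\n", "cdef;4354\r\n"],
    "Manufacturer2_FiletypeA_20180502140537.csv": ["wxyz;9999\r\n"],
}

merged = []
for filename, rows in files.items():
    # strip '.csv', then split the filename into its 3 parts
    manufacturer, filetype, datetime = filename[:-4].split("_")
    for row in rows:
        row = row.rstrip("\r\n")  # handle the Windows line endings
        merged.append(f"{row};{manufacturer};{filetype};{datetime}")

print(merged[0])  # abcd;1234;Manufacturer1;FiletypeA;20180502120102
```

The same idea should also work inside the existing awk script by passing the 3 parts in with `awk -v` and printing them as extra fields, then appending each file's output to a single merged file.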