Hello everyone!
I have thousands of CSV files that I have to import into a database table. As usual, the files aren't perfect. For example, they have different numbers of columns and some weird columns.
The second problem is that I have to put 3 parts of the filename into 3 columns in the table. The manufacturer, the filetype and the date/time are only in the filename, not in the columns of the file itself.
The filename looks like:
Manufacturer1_FiletypeA_20180502120102.csv
Manufacturer2_FiletypeA_20180502140537.csv
Manufacturer3_FiletypeB_20180501010424.csv
And the data like:
abcd; XXXX; 1234; XXXX; XXXX; 43563; abcde; XXXX; XXXX;
cdef; XXXX, 4354; XXXX; XXXX; 43244; cdefg; XXXX;
where XXXX stands for useless data I don't need.
What I have at the moment:
A script to select only the needed columns (awkScript):
#!/usr/bin/awk -f
# keep only the needed columns of the semicolon-separated input
BEGIN {FS = ";"; OFS = ";"; ORS = "\n"}
{print $1,$2,$7,$28,$29,$33,$34,$38,$39,$44,$45}
And a command that processes one file at a time and moves it to a new location, from which I can import the file into the database:
hdfs dfs -cat {remoteHdfsDir}/{file} | awk -f {awkScript} | sed 's/\r//' | sed '$d' | hdfs dfs -put - {hdfsWorkDir}/{file}
The two sed commands are needed because the files come from Windows: the first strips the carriage returns, the second drops the last line.
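For clarity, here is the effect of that cleanup reproduced as a small Python sketch (the input string and the junk last line are made up):

```python
# Reproduce what the two sed steps do on a made-up Windows-style file:
# sed 's/\r//' strips the carriage returns, sed '$d' drops the last line.
text = "abcd;1234\r\ncdef;4354\r\njunk last line\r\n"

lines = text.split("\n")[:-1]            # split into lines; drop the '' after the final newline
lines = [l.rstrip("\r") for l in lines]  # sed 's/\r//'
lines = lines[:-1]                       # sed '$d'

print(lines)  # ['abcd;1234', 'cdef;4354']
```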
I use a python script to loop through the files, something like
for file in filelist:
    run script
    split filename
    import file into database and add columns manufacturer, filetype and date/time from filename
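Fleshed out, the loop looks roughly like this (the directory names are placeholders, and the actual HDFS call is commented out because it needs the cluster):

```python
import subprocess

# placeholder paths/names -- adjust to the real environment
remote_hdfs_dir = "/data/incoming"
hdfs_work_dir = "/data/work"
awk_script = "awkScript"

def split_filename(filename):
    """Split 'Manufacturer1_FiletypeA_20180502120102.csv' into its 3 parts."""
    stem = filename.rsplit(".", 1)[0]  # drop the .csv extension
    manufacturer, filetype, datetime = stem.split("_")
    return manufacturer, filetype, datetime

filelist = ["Manufacturer1_FiletypeA_20180502120102.csv"]
for file in filelist:
    manufacturer, filetype, datetime = split_filename(file)
    cmd = (f"hdfs dfs -cat {remote_hdfs_dir}/{file} | awk -f {awk_script} "
           f"| sed 's/\\r//' | sed '$d' | hdfs dfs -put - {hdfs_work_dir}/{file}")
    # subprocess.run(cmd, shell=True, check=True)  # needs a live HDFS cluster
    # ... then import {hdfs_work_dir}/{file} into the database with the 3 extra columns ...
```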
Because there are thousands of files, I'd like to add the 3 parts of the filename as columns inside the CSV files and merge everything into one file.
Does anyone have an idea how I could add the filename parts as 3 extra columns in my script?
And does anyone have an idea how I could merge the files into one (or several) file(s)?
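To show the kind of result I'm after, here is a pure-Python sketch that appends the 3 filename parts to every row and collects everything into one merged list (the filenames and rows are made up, and it skips the awk column selection):

```python
# Sketch: append manufacturer, filetype and date/time from the filename to
# every row, merging all files into one list of output rows.
files = {
    "Manufacturer1_FiletypeA_20180502120102.csv": ["abcd;1234\r\n", "cdef;4354\r\n"],
    "Manufacturer2_FiletypeA_20180502140537.csv": ["wxyz;9999\r\n"],
}

merged = []
for filename, rows in files.items():
    # strip '.csv', then split the filename into its 3 parts
    manufacturer, filetype, datetime = filename[:-4].split("_")
    for row in rows:
        row = row.rstrip("\r\n")  # handle the Windows line endings
        merged.append(f"{row};{manufacturer};{filetype};{datetime}")

print(merged[0])  # abcd;1234;Manufacturer1;FiletypeA;20180502120102
```

The same idea should also work inside the existing awk script by passing the 3 parts in with `awk -v` and printing them as extra fields, then appending each file's output to a single merged file.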