Read multiple text files and copy data to csv

hi

i need to extract lines from multiple files to a csv file.
for example, i have these 3 files

file1.txt

date:29dec1980
caller:91245824255
called:8127766

file2.txt

date:11apr2014
caller:9155584558
called:8115478

file3.txt

date:25jun2015
caller:445225552
called:8117485

i need grep/awk to read these files and create a csv like this

output.csv

file,date,caller,called
file1.txt,29dec1980,91245824255,8127766
file2.txt,11apr2014,9155584558,8115478
file3.txt,25jun2015,445225552,8117485

i got this far

awk '/date/||/caller/||/called/ {print}' ORS=',' test01 > test012.csv

any idea?
thanks

Try:

awk '{print FILENAME,$2,$4,$6}' FS=':|\n' OFS=, RS= file*.txt
1 Like

thanks Scrutinizer

is there a way to use awk recursively ?

You are welcome, yes you could do that, for what purpose?

the final idea is
i have 12 millions call recorded files. each file got a header with details of the call. files are in folders. for example, a file created at 20hr25min36sec on the 2nd of march 2010 will be in d:\2010\03\02\202536.55254.audio

i m hoping to run awk to find details like caller/called/date and create a csv file for my 12 millions files.

once i got the csv, i will need to categories the calls per department then copy files per department to a specific folder

thanks for reading

How are we supposed to determine the department from the data you have shown us?

Will there be any files in your audio directories (e.g., d:\2010\03\02\202536.55254.audio ) other than .txt files in the format you showed us in post #1 in this thread?

Will filenames in the audio directories be unique across all audio directories or could the last component of the pathnames appear in multiple audio directories?

Do you want the full pathname of your files in the CSV output file or just the last component of the pathname?

Do you want one CSV file with data from all 12 million files, or do you want one CSV file for each department?

How many departments are there?

Hi Don,

below my answers

How are we supposed to determine the department from the data you have shown us?
Yes, the ""called" data will tells me what department

Will there be any files in your audio directories (e.g., d:\2010\03\02\202536.55254.audio ) other than .txt files in the format you showed us in post #1 in this thread?
no, every file will be *.audio. they can be read by notepad

Will filenames in the audio directories be unique across all audio directories or could the last component of the pathnames appear in multiple audio directories?
No they are unique. a real example is 20160317T085058.927444.audio

Do you want the full pathname of your files in the CSV output file or just the last component of the pathname?
the full pathname. a real example is E:\unarchived\20160331\20160331T074846.793023.audio

Do you want one CSV file with data from all 12 million files, or do you want one CSV file for each department?
one csv is enough, i can split afterward

How many departments are there?
Only 3.

many thanks

Not clear. Which of your requirements is not fulfilled by Scrutinizer's proposal in post#2 applied to the entire directory tree? Please be aware that you did not specify any means/conditions/logics to categorize by department.

Hi

awk '{print FILENAME,$2,$4,$6}' FS=':|\n' OFS=, RS= file*.txt

would only search files in the root directory. is there a way to search subfolders?
many thanks

awk allows for quite a long input file list, so try

  • adding */file*.txt and further down the tree
  • using a command substitution like "$(find . -type f)"

Should system limits like LINE_MAX be exceeded, you need to run the awk script on several split input streams.

1 Like

To elaborate on what RudiC mentioned, you can try:

find some_dir -type f -name '*.audio' -exec awk '{print FILENAME,$2,$4,$6}' FS=':|\n' OFS=, RS= {} + > outfile
1 Like

Hi,

i had a look on the actual server and found some issues

the files dont always contain the same info.

for example, these 2 files.
one got destination but doesnt have pc name

Filename1

IP address=192.168.10.25
PC name=RecSrv
UTC begin=20150729T075150.991582
UTC end=20150729T075234.659672
destination IP address=192.168.11.1
duration (ms)=43668
local begin=20150729T085150.991582
local end=20150729T085234.659672
source IP address=192.168.11.254

Filename2

destination=D:/recording/20150729/filename2.audio
UTC begin=20150729T075150.991582
UTC end=20150729T075234.659672
destination IP address=192.168.11.1
duration (ms)=43668
local begin=20150729T085150.991582
local end=20150729T085234.659672
source IP address=192.168.11.254
actual extension=12345
original called party=08865412596
original calling party=12345

so i need awk to search for all these lines

IP address=
PC name=
destination=
UTC begin=
UTC end=
destination IP address=
duration (ms)=
local begin=
local end=
source IP address=
actual extension=
original called party=
original calling party=

and put result in a csv like this. If there is data, it would just put a blank like ,,

output.csv

FILENAME,IP address,PC name,destination,UTC begin,UTC end,destination IP address,duration (ms),local begin,local end,source IP address,actual extension,original called party,original calling party
D:/recording/20150729/filename1.audio,192.168.10.25,RecSrv,,20150729T075150.991582,20150729T075234.659672,192.168.11.1,43668,20150729T085150.991582,20150729T085234.659672,192.168.11.254,,,
D:/recording/20150729/filename2.audio,,D:/recording/20150729/filename2.audio,20150729T075150.991582,20150729T075234.659672,192.168.11.1,43668,20150729T085150.991582,20150729T085234.659672,192.168.11.254,12345,08865412596,12345

Do you think it's possible?

thanks

Please use code tags as required by forum rules!

Maybe one of the solutions in this thread (from the bottom margin of this page) can put you in the right direction (or even solve after some adaption)?

Please, you cannot keep changing your input file specification and also your requirements in every post. Do not waste the forum members' valuable time and next time please get your specifications right in post #1.

@ RudiC: sorry for the code tag. will pay more attention in the future
@Scrutinizer: you didnt waste your time. I ve learnt a lot from the thread. i wanted to start simple and built around it. Sorry if you see it as waste of time

---------- Post updated at 10:43 AM ---------- Previous update was at 10:11 AM ----------

i had a look on the other thread, replicated the test on my pc. tried to adjust it to

awk -F, -v HD="IP address,PC name,destination,UTC begin,UTC end,destination IP address,duration (ms),local begin,local end,source IP address,actual extension,original called party,original calling party" '
BEGIN           {for (N=i=split (HD, T); i>0; i--) SRCH[T]
                 for (i=1; i<=N; i++) printf "%s\t", T
                 print ""
                }
FNR==1 && L     {for (i=1; i<=N; i++) printf "%s\t", PR[T]
                 print ""
                 delete PR
                }

$1 in SRCH      {PR[$1]=$NF
                 L=1
                }
END             {for (i=1; i<=N; i++) printf "%s\t", PR[T]
                 print ""
                }
'  *

but got this

IP address      PC name destination     UTC begin       UTC end  destination IP address  duration (ms)   local begin     local end        source IP address       actual extension        original called party    original calling party

thanks

Try setting the field separator by adding FS="=" or "-F=" (remove -F, ); this will get you one step further.

And this is the one step further:

awk -F= -vOFS="," '
BEGIN           {HD = "IP address=PC name=destination=UTC begin=UTC end=destination IP address=duration (ms)=local begin=local end=" \
                        "source IP address=actual extension=original called party=original calling party"
                 for (N=i=split (HD, T); i>0; i--) SRCH[T]
                 for (i=1; i<N; i++) printf "%s%s", T, OFS
                 print T[N]
                }
FNR==1 && L     {for (i=1; i<N; i++) printf "%s%s", PR[T], OFS
                 print PR[T[N]]
                 delete PR
                }

$1 in SRCH      {PR[$1]=$NF
                 L=1
                }
END             {for (i=1; i<N; i++) printf "%s%s", PR[T], OFS
                 print PR[T[N]]
                }
' file[12]
IP address,PC name,destination,UTC begin,UTC end,destination IP address,duration (ms),local begin,local end,source IP address,actual extension,original called party,original calling party
192.168.10.25,RecSrv,,20150729T075150.991582,20150729T075234.659672,192.168.11.1,43668,20150729T085150.991582,20150729T085234.659672,192.168.11.254,,,
,,D:/recording/20150729/filename2.audio,20150729T075150.991582,20150729T075234.659672,192.168.11.1,43668,20150729T085150.991582,20150729T085234.659672,192.168.11.254,12345,08865412596,12345
1 Like

ok i did this

awk "-F=", -v HD="IP address,PC name,destination,UTC begin,UTC end,destination IP address,duration (ms),local begin,local end,source IP address,actual extension,original called party,original calling party" '
BEGIN           {for (N=i=split (HD, T); i>0; i--) SRCH[T]
                 for (i=1; i<=N; i++) printf "%s\t", T
                 print ""
                }
FNR==1 && L     {for (i=1; i<=N; i++) printf "%s\t", PR[T]
                 print ""
                 delete PR
                }

$1 in SRCH      {PR[$1]=$NF
                 L=1
                }
END             {for (i=1; i<=N; i++) printf "%s\t", PR[T]
                 print ""
                }
'  *

and i get this now

IP address,PC name,destination,UTC begin,UTC end,destination IP address,duration (ms),local begin,local end,source IP address,actual extension,original called party,original calling party

as you said... one step closer :slight_smile:

but what i dont understand is why i dont value below
thanks

---------- Post updated 01-04-16 at 05:16 AM ---------- Previous update was 31-03-16 at 11:16 AM ----------

nearly there! :slight_smile:
how can I add a column for FILENAME?
thanks

---------- Post updated at 05:22 AM ---------- Previous update was at 05:16 AM ----------

i tried this

awk -F= -vOFS="," '
BEGIN           {HD = "FILENAME=IP address=PC name=destination=UTC begin=UTC end=destination IP address=duration (ms)=local begin=local end=" \
                        "source IP address=actual extension=original called party=original calling party"
                 for (N=i=split (HD, T); i>0; i--) SRCH[T]
                 for (i=1; i<N; i++) printf "%s%s", T, OFS
                 print T[N]
                }
FNR==1 && L     {for (i=1; i<N; i++) printf "%s%s", PR[T], OFS
                 print PR[T[N]]
                 delete PR
                }

$1 in SRCH      {PR[$1]=$NF
                 L=1
                }
END             {for (i=1; i<N; i++) printf FILENAME "%s%s", PR[T], OFS
                 print PR[T[N]]
                }
' *.ipfx_audio > test.csv

but i guess i need to liunk the filename column with FILENAME

Try

awk -F= -vOFS="," '
NR == 1         {HD = "IP address=PC name=destination=UTC begin=UTC end=destination IP address=duration (ms)=local begin=local end=" \
                        "source IP address=actual extension=original called party=original calling party"
                 for (N=i=split (HD, T); i>0; i--) SRCH[T]
                 for (i=1; i<N; i++) printf "%s%s", T, OFS
                 print T[N]
                 FN = FILENAME
                }
FNR==1 && L     {printf "%s%s", FN, OFS
                 for (i=1; i<N; i++) printf "%s%s", PR[T], OFS
                 print PR[T[N]]
                 FN = FILENAME
                 delete PR
                }

$1 in SRCH      {PR[$1]=$NF
                 L=1
                }
END             {printf "%s%s", FN, OFS
                 for (i=1; i<N; i++) printf "%s%s", PR[T], OFS
                 print PR[T[N]]
                }
' 
1 Like

i tried to a filename header like this

awk -F= -vOFS="," '
> NR == 1         {HD = "FILENAME=IP address=PC name=destination=UTC begin=UTC end=destination IP address=duration (ms)=local begin=local end=" \
>                         "source IP address=actual extension=original called party=original calling party"
>                  for (N=i=split (HD, T); i>0; i--) SRCH[T]
>                  for (i=1; i<N; i++) printf "%s%s", T, OFS
>                  print T[N]
>                  FN = FILENAME
>                 }
> FNR==1 && L     {printf "%s%s", FN, OFS
>                  for (i=1; i<N; i++) printf "%s%s", PR[T], OFS
>                  print PR[T[N]]
>                  FN = FILENAME
>                  delete PR
>                 }
>
> $1 in SRCH      {PR[$1]=$NF
>                  L=1
>                 }
> END             {printf "%s%s", FN, OFS
>                  for (i=1; i<N; i++) printf FILENAME "%s%s", PR[T], OFS
>                  print PR[T[N]]
>                 }
> ' *

and get this

FILENAME,IP address,PC name,destination,UTC begin,UTC end,destination IP address,duration (ms),local begin,local end,source IP address,actual extension,original called party,original calling party
20090130T165309.474091.ipfx_audio,,10.11.40.61,D01309,,20090130T165309.474090,20090130T170003.303469,10.11.41.151,413829,20090130T165309.474091,20090130T170003.303469,10.11.0.241,,,
20090130T165558.886637.ipfx_audio,,10.11.40.67,D01332,,20090130T165558.886637,20090130T165839.247671,10.11.41.136,160361,20090130T165558.886637,20090130T165839.247671,10.11.0.240,,,
20090130T170010.074146.ipfx_audio,,10.11.40.61,D01309,,20090130T170010.074146,20090130T170019.325071,10.11.41.151,9250,20090130T170010.074146,20090130T170019.325071,10.11.41.97,,,

but i shouldn't get a ,, between 20090130T165309.474091.ipfx_audio and 10.11.40.61

where do i need to remove it?
thanks