Conditional awk

Hello All,

I have a file like this:

bash-3.00$ cat 1.txt 
201112091147|0|1359331220|1025
201112091147|0|1359331088|1024
201112091144|0|1359331172|1025
201112091147|0|1359331220|1021
201112091149|0|1359331088|1027
201112091144|0|1359331172|1029

and a list of MSISDNs in another file, which has only one column (the MSISDN):

bash-3.00$ cat 2.txt 
201112091149
201112091140

I need to print the records from the first file, 1.txt, where the first column matches an MSISDN from the second file, 2.txt. The output should look like this (the filename must be appended at the end):

201112091149|0|1359331088|1027,1.txt

I have tried something like this but could not finalise it and make it work. I would appreciate any help in writing a report with the MSISDNs and the file names they appear in.

nawk 'FNR=NR { a[$1] = $1; next }{ printf ("%s\t%s\n", $0, FILENAME)}' 1.txt 2.txt

KR.

nawk -F\| 'NR==FNR{A[$1];next}$1 in A{print $0,FILENAME}' OFS=, 2.txt 1.txt
awk -F \| 'NR==FNR{A[$0]++;next}{if(A[$1]){print $0","FILENAME}}' 2.txt 1.txt
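For reference, here is the same lookup written out with comments (a sketch of what the one-liners above do): the first pass stores each MSISDN from 2.txt as an array key, and the second pass prints every record from 1.txt whose first field is one of those keys, followed by the filename.

awk -F'|' '
    NR == FNR  { seen[$1]; next }            # while reading 2.txt, remember each MSISDN as a key
    $1 in seen { print $0 "," FILENAME }     # while reading 1.txt, print matches plus the filename
' 2.txt 1.txt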

thanks,

If I need to check that the MSISDN column of 1.txt ($1) has the correct length, which must be 12 digits, should I use the length($1) function with an if statement?

Also, please advise how fast this one-liner will be when there are thousands of files like 1.txt and thousands of rows in each file.

Yes, if(length($1) == 12) { ... } ought to work.
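To fold that check into the one-liner above, something like this should do it (a sketch, not tested against your data); records whose first field is not exactly 12 characters long are simply skipped:

awk -F'|' 'NR==FNR{A[$1];next} length($1)==12 && $1 in A {print $0 "," FILENAME}' 2.txt 1.txt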

awk is designed to process big flat files, so it should do well here. Thousands of files with thousands of lines each comes to millions of lines, which is perfectly feasible.

There are always ways to make things faster, of course. Will you be doing lots of things like this?

awk '{...}' file1 file2 > output
awk '{...}' file1 file3 >> output
awk '{...}' file1 file4 >> output 

If so, you could do awk '{...}' file1 file2 file3 file4 ... > output
...which would save a lot of time since file1 wouldn't need to be read thousands of times, just once.

If you need a different output file for each input, that's still possible, but it would need changes to the code.
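For example, the output name could be built from FILENAME inside awk. This is only a sketch (the .report suffix is my own assumption); it starts a new report whenever a new input file begins:

awk -F'|' '
    NR == FNR { A[$1]; next }                                    # load the MSISDN list first
    FNR == 1  { if (out) close(out); out = FILENAME ".report" }  # new output file per input file
    $1 in A   { print $0 "," FILENAME > out }                    # write matches for this file
' 2.txt file2 file3 file4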

Hello Corona,

First of all, thanks a lot.
Let me make the description of the files more specific:

File 1.txt will have, let's say, 2000 rows, and there will be a few thousand files like it. File 2.txt, on the other hand, is just one file with only one column, but with 10,000 lines.

So would it work well? Or should I divide 2.txt into, let's say, 100 files and check things in a for loop?

Splitting into more or fewer files wouldn't change the actual amount of work. If anything, bigger files and fewer loops would be more efficient -- awk is at its most efficient when you give it a big job and let it do its thing, instead of restarting it for tiny jobs all the time.

Only the 10,000 keys from 2.txt get kept in memory, and that's nothing for awk; the big files are streamed line by line, so nothing's getting overloaded.

Your description doesn't tell me much beyond how big your files are; you haven't told me what you will be doing with them.

That's beautiful; I did not know awk's limits, which is impressive.
Among thousands of files saved over a couple of months, we will try to find which MSISDNs have been processed, as the operator we do business with has asked. I believe their BI side does not do what is expected, so they are digging into old CDRs with a file in their hands. They have a file (2.txt) containing only MSISDNs, separated by newlines, and they want to check whether there is a corresponding CDR for each MSISDN in that list. They also have a script which triggers a JAR file with 5 classes, which checks the MSISDNs' format and then their existence in the old records. That process takes too much time to run, and moreover, as I am not experienced in Java, I could not understand what the classes do, so I gave up and chose awk.


Hello Again,

I have tried:

# /usr/xpg4/bin/awk -F\| 'NR==FNR{A[$1];next}$1 in A {print $0,FILENAME >> "compare_result_18072013.txt"}' OFS=, /data/lcm/validation/2.txt *.done
-bash: /usr/xpg4/bin/awk: Arg list too long

*.done are the CDR files that have been processed; there are 114,000 of them.

In 2.txt there are only 3 MSISDNs.

How should I proceed with giving input files to this one-liner now?

114,000 files is quite something for poor old awk, and might also exceed LINE_MAX. Try to split them into reasonable chunks, e.g. by date or by filename elements, or create a "control" file with lines of, say, 100 filenames each and feed it to awk line by line.
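A loop over such a control file could look like this (a sketch; control.txt and the 100-names-per-line layout are assumptions, and $chunk is left unquoted on purpose so the shell splits it back into separate filenames):

while read chunk
do
    /usr/xpg4/bin/awk -F\| 'NR==FNR{A[$1];next}$1 in A{print $0,FILENAME}' OFS=, /data/lcm/validation/2.txt $chunk >> compare_result_18072013.txt
done < control.txt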

That is not an awk limitation, that's a system limit -- too many files for the shell to list at once with *.

find does not have this limitation, and can pipe into xargs.

FYI, echo a b c | xargs cat is equivalent to cat a b c, which makes it a nice way to pass thousands of arguments. If there are too many to run all at once, xargs will split them into several runs of awk (and since 2.txt is passed as a fixed argument, as in the command below, each run reads it first, so the lookup still works).

find . -name '*.done' | xargs /usr/xpg4/bin/awk -F\| 'NR==FNR{A[$1];next}$1 in A {print $0,FILENAME >> "compare_result_18072013.txt"}' OFS=, /data/lcm/validation/2.txt