Direct the invalid lines to a separate files

anandapani · October 10, 2009, 9:39am

Hi,

I have a pipe delimited file with 1 million records. I need to validate each line by counting the number of delimiters, if any line fails to have the specified number of delimiters, taat line has to be sent to a reject file.

Kindly suggest.
if code provided, it is highly appreciated, and helpful.
Please let me know if I can provide any further information

Scott · October 10, 2009, 9:46am

Hi.

awk -F"|" 'NF != 10 { print > "reject_file" }' input_file

In this example, 10 is the number of fields. This is one more than the number of delimeters (so if you have 10 delimeters, the number of fields would be 11)

danmero · October 10, 2009, 9:46am

awk -F":" 'NF != 10{print > "reject.file"}' infile

Where:
: delimiter
10 number of delimiters + 1
reject.file :rolleyes:

thegeek · October 10, 2009, 11:29am

$ sed -n cat t.txt 
aa:bb:cc
ab:bb:cc
ac:bb:cc
ad:bb:cc
aebb:cc
af:bb:cc
ag:bb:cc
ah:bb:cc
ai:bb:cc
ajbb:cc
$ sed -n '/\([^:]*:\)\{2\}/!p' t.txt 
aebb:cc
ajbb:cc

It prints all the lines which dont have 2 delimiters. Hope this helped ?!

anandapani · October 13, 2009, 2:21pm

Thanks Scottn and Danmero

Your answers wonderfully wonderfully on sample files. But I have very huge files (>7GB) with more than 150 fields. For such cases, I learnt from this forum that, I can go for gawk. Unfortunately gawk command is not present in my shell, but I have nawk.

so could you please suggest, if you find any other alternative to the above question except gawk.

Thanks Thegeek,
Could you help me in achieving the below task?
sending records with unexpected number of fields to a reject file

Cheers,
Ananda

steadyonabix · October 13, 2009, 2:55pm

"sent" to a reject file is somewhat ambiguous, do you mean write a copy of the line to the reject file or delete it from the original and write it to the reject file?

You can use the split shell command to split the file into manageable chunks if it is too big for awk. See the man file for split: -

man split

methyl · October 13, 2009, 3:15pm

This is probably a good moment to mention exactly what Operating System and version you have.

A 7 Gigabyte file with a million records implies a record size of about 7 Kilobytes. This file size and record size is way beyond what is normal for processing in unix shell script or basic awk programming. I am assuming that whatever program you used to count the records was able to report the statistics accurately.

Do you have a high level programming language available? Preferably one which can deal with flat files above 2 Gigabytes in size.

thegeek · October 13, 2009, 11:11pm

What is a big deal in that, i shown you how to show in STDOUT, just redirect it as:

sed -n '/\([^:]*:\)\{2\}/!p' t1 > reject

If you want to do using SED, then do like this.. both are same in this situation.

sed -n '/\([^:]*:\)\{2\}/!w reject' t1

anandapani · October 14, 2009, 2:13am

Hi geek,

I have below sample file and need the output with records which doesn't have a single delimiter "|".

In below file I need to get a single line as output
ERRORline is this|lmonp|asdfasdf

I have mentioned the code I am using. This code is not working fine . Please help me by modifying the code.

sed -n '/\([^|]*|\)\{1\}/!p' tdyfle.dat

H|2009-10-13
abcdefghijklmnop|lkjadsdlkjkj
ERRORline is this|lmonp|asdfasdf
abcdefghijklmnop|ASDJASDFASDF
bcdefghijklmonpasdfasdf|asdflkjoiuqwe
T|4

skmdu · October 14, 2009, 2:31am

# cat a
H|2009-10-13
abcdefghijklmnop|lkjadsdlkjkj
ERRORline is this|lmonp|asdfasdf
abcdefghijklmnop|ASDJASDFASDF
bcdefghijklmonpasdfasdf|asdflkjoiuqwe
T|4

[# sed -n '/^\([^|]*|\)\{2,\}/p' a
ERRORline is this|lmonp|asdfasdf

anandapani · October 14, 2009, 2:45am

Hi Skmdu,

Thanks for your code. but Iam afraid, the code is not returning any rows.

Cheers,

skmdu · October 14, 2009, 3:01am

I guess, failing because of extended regular expression.

Try this out, it may help you.

# sed -n '/\([^|]*|\)\([^|]*|\)/p' filename

anandapani · October 14, 2009, 3:50am

Skumdu,

Your guess is absolutely correct,

Given code catered my requrement.
However, I have attached a small sample file, but actual file is a pipe delimited file with approx 150 fileds.
My objetive is to identify the records which are not having 150 delimiters...

So Could you please modify your code for my actual file??

Thanks a lot for your great help

---------- Post updated at 01:20 PM ---------- Previous update was at 12:40 PM ----------

Can anybody help please

skmdu · October 14, 2009, 4:24am

# cat a
H|2009-10-13
abcdefghijklmnop|lkjadsdlkjkj
a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|
ERRORline is this|lmonp|asdfasdf
abcdefghijklmnop|ASDJASDFASDF
bcdefghijklmonpasdfasdf|asdflkjoiuqwe
T|4

# perl -e 'while (<>){ print unless (/^([^\|]*\|){150,}/); }' < a
H|2009-10-13
abcdefghijklmnop|lkjadsdlkjkj
ERRORline is this|lmonp|asdfasdf
abcdefghijklmnop|ASDJASDFASDF
bcdefghijklmonpasdfasdf|asdflkjoiuqwe
T|4

anandapani · October 14, 2009, 4:56am

Thanks for the perl code.

I have a small doubt in the code. It is not retrieving the lines which contains less than the specified number of delimiters. If a line contains more than the number, it is not retrieving the same.
In the below case, a line with 152 delimiters will not be returned?
Any patch for the issue ?

Can we apply the same login using AWK..

Thank you

skmdu:

# cat a
H|2009-10-13
abcdefghijklmnop|lkjadsdlkjkj
a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|
ERRORline is this|lmonp|asdfasdf
abcdefghijklmnop|ASDJASDFASDF
bcdefghijklmonpasdfasdf|asdflkjoiuqwe
T|4

# perl -e 'while (<>){ print unless (/^([^\|]*\|){150,}/); }' < a
H|2009-10-13
abcdefghijklmnop|lkjadsdlkjkj
ERRORline is this|lmonp|asdfasdf
abcdefghijklmnop|ASDJASDFASDF
bcdefghijklmonpasdfasdf|asdflkjoiuqwe
T|4

skmdu · October 14, 2009, 5:06am

As you asked in you previous post, the perl code which I posted is to retrieve a row which are not having the minimum of 150 delimiter.

In my last post, execution of perl code didnt return a line which has 152 '|'. Is this what you expected? can you please clarify?

And you are saying "Its not retrieving the lines which contains less than the specified number of delimiters. If a line contains more than the number, it is not retrieving the same. " what does it mean? It prints nothing?

Showed in my last post it prints the line which has less than 150 | delimiters.

anandapani · October 14, 2009, 5:47am

Skmdu,

Previously I could have requested like this.

I want to have all the records which are not having 150 delimters. I want to see the records with < or > 150 delimiters.

If possible i want to capture all such records in a reject_file.

Am i clear now?
Please let me know if I can provide any kind of clarifications

Cheers,

skmdu · October 14, 2009, 6:13am

So as per your input, you want to retrieve the lines, which has < or > 150 | delimiters, not equal to 150. If am not wrong, this may help you.

# cat a
H|2009-10-13
abcdefghijklmnop|lkjadsdlkjkj
a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|
b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|
ERRORline is this|lmonp|asdfasdf
ERRORline is this|lmonp|asdfasdf|sfsdf|sdfsdf
abcdefghijklmnop|ASDJASDFASDF
bcdefghijklmonpasdfasdf|asdflkjoiuqwe
T|4

# perl -e 'while (<> ){ print if(/^([^\|]*\|){1,149}[^|]*$/ || /^([^\|]*\|){151,}/); }' < a
H|2009-10-13
abcdefghijklmnop|lkjadsdlkjkj
b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|
ERRORline is this|lmonp|asdfasdf
ERRORline is this|lmonp|asdfasdf|sfsdf|sdfsdf
abcdefghijklmnop|ASDJASDFASDF
bcdefghijklmonpasdfasdf|asdflkjoiuqwe
T|4

danmero · October 14, 2009, 9:50am

awk -F'|' 'NF != 152' in_file > reject_file

For 150 Field Separator(delimiter) you will have 152 fields.

anandapani · October 14, 2009, 12:38pm

Awk fails to run if the NF > 99. gawk can be used in place, but my shell is not having that command. So I thought to do it with perl logic.
But in the given code, it retrieves all the rows with more than 150 delimiters. My intentition is to filter all the records those not having 150 delimiters.

Therefore, code has to filter recors not only with 151 limiters but also with 149 delimiters.

If I am not clear, please let me know.