Direct the invalid lines to a separate files

Hi,

I have a pipe delimited file with 1 million records. I need to validate each line by counting the number of delimiters, if any line fails to have the specified number of delimiters, taat line has to be sent to a reject file.

Kindly suggest.
if code provided, it is highly appreciated, and helpful.
Please let me know if I can provide any further information

Hi.

awk -F"|" 'NF != 10 { print > "reject_file" }' input_file

In this example, 10 is the number of fields. This is one more than the number of delimeters (so if you have 10 delimeters, the number of fields would be 11)

awk -F":" 'NF != 10{print > "reject.file"}' infile

Where:
: delimiter
10 number of delimiters + 1
reject.file :rolleyes:

$ sed -n cat t.txt 
aa:bb:cc
ab:bb:cc
ac:bb:cc
ad:bb:cc
aebb:cc
af:bb:cc
ag:bb:cc
ah:bb:cc
ai:bb:cc
ajbb:cc
$ sed -n '/\([^:]*:\)\{2\}/!p' t.txt 
aebb:cc
ajbb:cc

It prints all the lines which dont have 2 delimiters. Hope this helped ?!

Thanks Scottn and Danmero

Your answers wonderfully wonderfully on sample files. But I have very huge files (>7GB) with more than 150 fields. For such cases, I learnt from this forum that, I can go for gawk. Unfortunately gawk command is not present in my shell, but I have nawk.

so could you please suggest, if you find any other alternative to the above question except gawk.

Thanks Thegeek,
Could you help me in achieving the below task?
sending records with unexpected number of fields to a reject file

Cheers,
Ananda

"sent" to a reject file is somewhat ambiguous, do you mean write a copy of the line to the reject file or delete it from the original and write it to the reject file?

You can use the split shell command to split the file into manageable chunks if it is too big for awk. See the man file for split: -

man split

This is probably a good moment to mention exactly what Operating System and version you have.

A 7 Gigabyte file with a million records implies a record size of about 7 Kilobytes. This file size and record size is way beyond what is normal for processing in unix shell script or basic awk programming. I am assuming that whatever program you used to count the records was able to report the statistics accurately.

Do you have a high level programming language available? Preferably one which can deal with flat files above 2 Gigabytes in size.

What is a big deal in that, i shown you how to show in STDOUT, just redirect it as:

sed -n '/\([^:]*:\)\{2\}/!p' t1 > reject

If you want to do using SED, then do like this.. both are same in this situation.

sed -n '/\([^:]*:\)\{2\}/!w reject' t1

Hi geek,

I have below sample file and need the output with records which doesn't have a single delimiter "|".

In below file I need to get a single line as output
ERRORline is this|lmonp|asdfasdf

I have mentioned the code I am using. This code is not working fine . Please help me by modifying the code.

sed -n '/\([^|]*|\)\{1\}/!p' tdyfle.dat
H|2009-10-13
abcdefghijklmnop|lkjadsdlkjkj
ERRORline is this|lmonp|asdfasdf
abcdefghijklmnop|ASDJASDFASDF
bcdefghijklmonpasdfasdf|asdflkjoiuqwe
T|4
# cat a
H|2009-10-13
abcdefghijklmnop|lkjadsdlkjkj
ERRORline is this|lmonp|asdfasdf
abcdefghijklmnop|ASDJASDFASDF
bcdefghijklmonpasdfasdf|asdflkjoiuqwe
T|4

[# sed -n '/^\([^|]*|\)\{2,\}/p' a
ERRORline is this|lmonp|asdfasdf

Hi Skmdu,

Thanks for your code. but Iam afraid, the code is not returning any rows.

Cheers,

I guess, failing because of extended regular expression.

Try this out, it may help you.

# sed -n '/\([^|]*|\)\([^|]*|\)/p' filename

Skumdu,

Your guess is absolutely correct,

Given code catered my requrement.
However, I have attached a small sample file, but actual file is a pipe delimited file with approx 150 fileds.
My objetive is to identify the records which are not having 150 delimiters...

So Could you please modify your code for my actual file??

Thanks a lot for your great help

---------- Post updated at 01:20 PM ---------- Previous update was at 12:40 PM ----------

Can anybody help please

# cat a
H|2009-10-13
abcdefghijklmnop|lkjadsdlkjkj
a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|
ERRORline is this|lmonp|asdfasdf
abcdefghijklmnop|ASDJASDFASDF
bcdefghijklmonpasdfasdf|asdflkjoiuqwe
T|4

# perl -e 'while (<>){ print unless (/^([^\|]*\|){150,}/); }' < a
H|2009-10-13
abcdefghijklmnop|lkjadsdlkjkj
ERRORline is this|lmonp|asdfasdf
abcdefghijklmnop|ASDJASDFASDF
bcdefghijklmonpasdfasdf|asdflkjoiuqwe
T|4

Thanks for the perl code.

I have a small doubt in the code. It is not retrieving the lines which contains less than the specified number of delimiters. If a line contains more than the number, it is not retrieving the same.
In the below case, a line with 152 delimiters will not be returned?
Any patch for the issue ?

Can we apply the same login using AWK..

Thank you

As you asked in you previous post, the perl code which I posted is to retrieve a row which are not having the minimum of 150 delimiter.

In my last post, execution of perl code didnt return a line which has 152 '|'. Is this what you expected? can you please clarify?

And you are saying "Its not retrieving the lines which contains less than the specified number of delimiters. If a line contains more than the number, it is not retrieving the same. " what does it mean? It prints nothing? :confused:

Showed in my last post it prints the line which has less than 150 | delimiters.

Skmdu,

Previously I could have requested like this.

I want to have all the records which are not having 150 delimters. I want to see the records with < or > 150 delimiters.

If possible i want to capture all such records in a reject_file.

Am i clear now?
Please let me know if I can provide any kind of clarifications

Cheers,

So as per your input, you want to retrieve the lines, which has < or > 150 | delimiters, not equal to 150. If am not wrong, this may help you.

# cat a
H|2009-10-13
abcdefghijklmnop|lkjadsdlkjkj
a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|
b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|
ERRORline is this|lmonp|asdfasdf
ERRORline is this|lmonp|asdfasdf|sfsdf|sdfsdf
abcdefghijklmnop|ASDJASDFASDF
bcdefghijklmonpasdfasdf|asdflkjoiuqwe
T|4

In the above file, line which has number of 'a|' has totally 150 |.
The line which has number of 'b|' has totally 160 |.
so it should not retrieve a| line.

# perl -e 'while (<> ){ print if(/^([^\|]*\|){1,149}[^|]*$/ || /^([^\|]*\|){151,}/); }' < a
H|2009-10-13
abcdefghijklmnop|lkjadsdlkjkj
b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|
ERRORline is this|lmonp|asdfasdf
ERRORline is this|lmonp|asdfasdf|sfsdf|sdfsdf
abcdefghijklmnop|ASDJASDFASDF
bcdefghijklmonpasdfasdf|asdflkjoiuqwe
T|4
awk -F'|' 'NF != 152' in_file > reject_file

For 150 Field Separator(delimiter) you will have 152 fields.

Awk fails to run if the NF > 99. gawk can be used in place, but my shell is not having that command. So I thought to do it with perl logic.
But in the given code, it retrieves all the rows with more than 150 delimiters. My intentition is to filter all the records those not having 150 delimiters.

Therefore, code has to filter recors not only with 151 limiters but also with 149 delimiters.

If I am not clear, please let me know.