I have a pipe delimited file with 1 million records. I need to validate each line by counting the number of delimiters, if any line fails to have the specified number of delimiters, taat line has to be sent to a reject file.
Kindly suggest.
if code provided, it is highly appreciated, and helpful.
Please let me know if I can provide any further information
In this example, 10 is the number of fields. This is one more than the number of delimeters (so if you have 10 delimeters, the number of fields would be 11)
Your answers wonderfully wonderfully on sample files. But I have very huge files (>7GB) with more than 150 fields. For such cases, I learnt from this forum that, I can go for gawk. Unfortunately gawk command is not present in my shell, but I have nawk.
so could you please suggest, if you find any other alternative to the above question except gawk.
Thanks Thegeek,
Could you help me in achieving the below task?
sending records with unexpected number of fields to a reject file
"sent" to a reject file is somewhat ambiguous, do you mean write a copy of the line to the reject file or delete it from the original and write it to the reject file?
You can use the split shell command to split the file into manageable chunks if it is too big for awk. See the man file for split: -
This is probably a good moment to mention exactly what Operating System and version you have.
A 7 Gigabyte file with a million records implies a record size of about 7 Kilobytes. This file size and record size is way beyond what is normal for processing in unix shell script or basic awk programming. I am assuming that whatever program you used to count the records was able to report the statistics accurately.
Do you have a high level programming language available? Preferably one which can deal with flat files above 2 Gigabytes in size.
# cat a
H|2009-10-13
abcdefghijklmnop|lkjadsdlkjkj
ERRORline is this|lmonp|asdfasdf
abcdefghijklmnop|ASDJASDFASDF
bcdefghijklmonpasdfasdf|asdflkjoiuqwe
T|4
[# sed -n '/^\([^|]*|\)\{2,\}/p' a
ERRORline is this|lmonp|asdfasdf
Given code catered my requrement.
However, I have attached a small sample file, but actual file is a pipe delimited file with approx 150 fileds.
My objetive is to identify the records which are not having 150 delimiters...
So Could you please modify your code for my actual file??
Thanks a lot for your great help
---------- Post updated at 01:20 PM ---------- Previous update was at 12:40 PM ----------
I have a small doubt in the code. It is not retrieving the lines which contains less than the specified number of delimiters. If a line contains more than the number, it is not retrieving the same.
In the below case, a line with 152 delimiters will not be returned?
Any patch for the issue ?
As you asked in you previous post, the perl code which I posted is to retrieve a row which are not having the minimum of 150 delimiter.
In my last post, execution of perl code didnt return a line which has 152 '|'. Is this what you expected? can you please clarify?
And you are saying "Its not retrieving the lines which contains less than the specified number of delimiters. If a line contains more than the number, it is not retrieving the same. " what does it mean? It prints nothing?
Showed in my last post it prints the line which has less than 150 | delimiters.
So as per your input, you want to retrieve the lines, which has < or > 150 | delimiters, not equal to 150. If am not wrong, this may help you.
# cat a
H|2009-10-13
abcdefghijklmnop|lkjadsdlkjkj
a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|a|
b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|
ERRORline is this|lmonp|asdfasdf
ERRORline is this|lmonp|asdfasdf|sfsdf|sdfsdf
abcdefghijklmnop|ASDJASDFASDF
bcdefghijklmonpasdfasdf|asdflkjoiuqwe
T|4
In the above file, line which has number of 'a|' has totally 150 |.
The line which has number of 'b|' has totally 160 |.
so it should not retrieve a| line.
# perl -e 'while (<> ){ print if(/^([^\|]*\|){1,149}[^|]*$/ || /^([^\|]*\|){151,}/); }' < a
H|2009-10-13
abcdefghijklmnop|lkjadsdlkjkj
b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|b|
ERRORline is this|lmonp|asdfasdf
ERRORline is this|lmonp|asdfasdf|sfsdf|sdfsdf
abcdefghijklmnop|ASDJASDFASDF
bcdefghijklmonpasdfasdf|asdflkjoiuqwe
T|4
Awk fails to run if the NF > 99. gawk can be used in place, but my shell is not having that command. So I thought to do it with perl logic.
But in the given code, it retrieves all the rows with more than 150 delimiters. My intentition is to filter all the records those not having 150 delimiters.
Therefore, code has to filter recors not only with 151 limiters but also with 149 delimiters.