Searching for a number in a very large amount of data

Hi,
I have to search for a number in a very long list of files. The total size of the files I have to search is 10 terabytes.

How can I search for a number in such a huge amount of data effectively? I used fgrep, but it is taking many hours. Is there any other feasible way to search data this large?

Hi,

If you want only the file names, use -l in grep, so that it does not have to scan each entire file.
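For example, something like this (999999999 standing in for the real number; this assumes plain-text files, and for gzipped files zgrep takes the same options):

# -F matches the number as a fixed string (no regex), and -l prints each
# matching file's name and stops reading that file at its first hit
grep -Fl '999999999' *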

What are you searching for? And does that number occur many times in the file, or only once?

You can try Perl or Python to process the big files.
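For instance, a Perl one-liner along these lines (999999999 and bigfile are placeholders for the real number and file):

# print every line containing the number as a fixed string;
# index() avoids regex overhead
perl -ne 'print if index($_, "999999999") >= 0' bigfile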

The file format is like:

00.87.123.45|54999999999|98765|-|[15/NOV/2011:20:02:08 |"www.unix.com/"|GTR|1.1.1|UU|654|0012|0|[TTTTTTTT]|0432|text/html|SDEWRTRERERERERER==|1.4.5.6|fkjgkjfgfg|-|-|-|-|-|-|-|-|-|-|-|2|123456789|0|1||123,45,7654,76123|345|67|45654654645645645|54.67.4.345|323423423|4567|56098|-|
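(The number being searched is the 2nd |-separated field; a quick way to see that, with the sample line shortened:)

echo '00.87.123.45|54999999999|98765|-|...' | awk -F'|' '{print $2}'
# prints: 54999999999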

I used an awk command to search. I already know that the searched pattern occurs in the 2nd field, so I compare the string only against the 2nd field. But when I use the -v option to pass a variable to the awk command, it does not find the variable in the file, while if I hardcode the value it does find it. Is there any mistake in the code below?

script:
-----

variable=$1
for i in `ls *`
do
gunzip -c $i | awk -F"|" -v var="$variable" '$2~/variable/' >> output.txt &
done

Invoked as:

./script 999999999

Unless you've got a blindingly fast RAID setup, "hours" is to be expected for 10 terabytes no matter what you do.

That's a useless use of ls; plain for i in * does the same thing without spawning an extra process.

You don't want 9 processes writing to the same file at the same time. They may interfere with each other, overwriting each other's lines, etc.

It's looking for the literal string 'variable' because you put it inside //; just give awk the variable itself. You didn't even name it 'variable', though; you named it 'var'. Try $2 ~ var
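A one-line illustration of the difference (999999999 as a stand-in):

# correct: var is used as a dynamic regex, no slashes involved
echo 'a|999999999|b' | awk -F'|' -v var='999999999' '$2 ~ var'
# broken: /variable/ matches the literal text "variable", never the number
echo 'a|999999999|b' | awk -F'|' -v var='999999999' '$2 ~ /variable/'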

In summary, I'd do this:

zcat * | awk -F"|" -v var="$1" '$2 ~ var' > output.txt

Depending on whether your disk is faster than your processor or vice versa, there may be ways to speed this up by running multiple gunzips at once. I'm not sure how to do that yet, but I'll think about it.
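One untested idea along those lines, assuming GNU xargs (for -P), files in the current directory, and per-file output files so the parallel writers can't clobber each other:

export num="$1"                      # the number to search for, as in the original script
find . -name '*.gz' -print0 |
  xargs -0 -n1 -P4 sh -c '
    # one gunzip+awk pipeline per file, four files at a time
    gunzip -c "$1" | awk -F"|" -v var="$num" "\$2 ~ var" > "$1.out"
  ' sh
cat ./*.gz.out > output.txt          # merge the per-file results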