Using grep and a parameter file to return unique values

clippertm · March 20, 2014, 1:31am

Hello Everyone!

I have updated the first post so that my intentions are easier to understand, and also attached sample files (post #18).

I have over 500 text files in a directory. Over 1 GB of data. The data in those files is organised in lines:

My intention is to return one line per parameter match across all files.

The first parameter is: '4=[1 to 2000]'

The second parameter is: '3078='

So when grep, awk etc. finds a line that contains both '4=1' and '3078=', it prints the line and starts looking for a line that contains '4=2' and '3078='.

This should happen across all 500 files (-m 1 does not work in this case, as 4=1 and 4=2 might both be contained in one file and not in the 499 others).

Please also note that '4=[1 to 2000]' and '3078=' are not always at the same position in a line.

Can you please please please help me? I am at a loss as to what to do.

Lucas_0418 · March 20, 2014, 3:27am

Hi clippertm,
Shall we use sort -u -t'|' -k2,2 instead of uniq ?

clippertm · March 20, 2014, 3:32am

Hi Lucas!

Yes! This is the spirit!

However I realise that my list file does not work :o

$ grep -h -f ../list.txt *.* | grep '3078=' | sort -u -t'|' -k2,2

only returns one line instead of 4

The values in the file are "line" separated: each value has its own line.

Perhaps I do not understand how the pattern file works.

Does it look for '4=745' and '3078=', then for '4=746' and '3078=', then for '4=747' and '3078=' etc.?

Or for all those 4=745 4=746 4=747 etc. on the same line?

How can I write a file (or use the command) that looks for the values successively? ('4=745' and '3078=', then '4=746' and '3078=', then '4=747' and '3078=' etc.)

I tried to use -F:

$ grep -h -F -f ../list.txt *.* | grep '3078=' | sort -u -t'|' -k2,2

But it seems to defeat the sort function!

Or perhaps a more efficient way would be to use a pattern directly on the command line, instead of a file, something like:

$ grep -h '4=[745-755]++' *.* | grep '3078=' | sort -u -t'|' -k2,2

Would you know how to write this?

Lucas_0418 · March 20, 2014, 4:41am

Hi clippertm,
I am confused by "The values in the file are "line" separated: each value has its own line." Isn't the data file separated by '|'? Or are you talking about the pattern file?
With my limited knowledge of shell, I think it looks for lines that contain any of 4=745, 4=746, 4=747 etc., then sends each matched line to grep '3078='.
Of course you could prepare a pattern file like this:

4=745|[^|]\{1,\}|3078=
4=746|[^|]\{1,\}|3078=
4=747|[^|]\{1,\}|3078=
4=748|[^|]\{1,\}|3078=
4=749|[^|]\{1,\}|3078=
4=750|[^|]\{1,\}|3078=
4=751|[^|]\{1,\}|3078=
4=752|[^|]\{1,\}|3078=
4=753|[^|]\{1,\}|3078=
4=754|[^|]\{1,\}|3078=
4=755|[^|]\{1,\}|3078=

But that still does not solve printing only the first matching line for each pattern.
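
For anyone following along, here is a minimal sketch of what a pattern of this shape accepts, using made-up sample lines (the data and the temporary file are assumptions, not the poster's real files). In grep's basic syntax the | is a literal character, so the pattern asks for 4=745, then exactly one non-empty field, then 3078=. Lines with more fields in between, or with the two keys in the other order, are skipped, whereas chaining two greps accepts both keys anywhere on the line.

```shell
# Made-up sample lines; only the first has exactly one field between the two keys.
tmp=$(mktemp -d)
printf '%s\n' \
  '5021=0|4=745|12=ABC|3078=1' \
  '5021=0|4=745|12=ABC|3214=7|3078=2' \
  '5021=0|3078=3|12=ABC|4=745' > "$tmp/data.txt"

# Single pattern: 4=745, exactly one non-empty field, then 3078=
one_pattern=$(grep -c '4=745|[^|]\{1,\}|3078=' "$tmp/data.txt")

# Two greps: both keys anywhere on the line, in any order
two_greps=$(grep '4=745' "$tmp/data.txt" | grep -c '3078=')

echo "single pattern: $one_pattern, two greps: $two_greps"
rm -rf "$tmp"
```

If the real data has a varying number of fields between the two keys, this could explain a single combined pattern returning fewer lines than two chained greps.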

clippertm · March 20, 2014, 4:44am

Hi Lucas,

The pattern file is line separated.

The data files are "|" separated. In addition, "4=74*" and "3078=" are not always at the same position.

Lucas_0418 · March 20, 2014, 4:57am

What's your environment? I found that your code works in my Cygwin.

grep -h '4=[745-755]++' *.* | grep '3078=' | sort -u -t'|' -k2,2

Maybe you could put all the patterns in a file, then use grep's -m option to get the first matching line.

# -r keeps backslashes in patterns such as \{1,\} intact
while IFS= read -r a
do
  grep -h -m 1 "$a" *.*
done < yourpatternfile
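
A small sketch of how -m 1 behaves when grep is given several files, on made-up data (the file names and contents are assumptions for illustration only). The limit is counted separately for each file, so two files that both contain 4=745 yield two lines, while a single concatenated stream yields one.

```shell
tmp=$(mktemp -d)
printf '%s\n' '5021=0|4=745|3078=1' '5021=0|4=745|3078=2' > "$tmp/a.txt"
printf '%s\n' '5022=0|4=745|3078=3' > "$tmp/b.txt"

# -m 1 is applied per file: one line from a.txt and one from b.txt
per_file=$(grep -h -m 1 '4=745' "$tmp"/*.txt | wc -l)

# Feeding grep a single stream gives one line overall
overall=$(cat "$tmp"/*.txt | grep -m 1 '4=745' | wc -l)

echo "per file: $((per_file)), overall: $((overall))"
rm -rf "$tmp"
```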

clippertm · March 20, 2014, 5:00am

Hi Lucas,

My environment is also cygwin (latest).

grep -h '4=[745-755]++' *.* | grep '3078=' | sort -u -t'|' -k2,2

does not work, it stalls (I used 745-755 to simplify and make things faster, I actually run it from 1 to 2000!). It also returns "grep: invalid range" sometimes.

Lucas_0418 · March 20, 2014, 5:04am

Sorry clippertm,
I made a mistake, delete the double plus.

grep -h '4=[745-755]' *.* | grep '3078=' | sort -u -t'|' -k2,2

?
If [745-755] is not available, use this:

grep -h '4=\(74[5-9]\|75[0-5]\)' a | grep '3078=' | sort -u -t'|' -k2,2
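
To illustrate the difference between the two forms, here is a rough sketch on made-up lines (the sample data is an assumption). A bracket expression matches one character, so [745-755] is read as the characters 7 and 4, the range 5-7, and 5 again, not as the numbers 745 to 755. The alternation below uses grep -E and is anchored between two | separators, which assumes the 4= field is neither the first nor the last field on the line.

```shell
tmp=$(mktemp -d)
printf '%s\n' \
  '5021=0|4=4|3078=1' \
  '5021=0|4=6|3078=1' \
  '5021=0|4=745|3078=1' \
  '5021=0|4=750|3078=1' \
  '5021=0|4=800|3078=1' > "$tmp/data.txt"

# [745-755] is a set of single characters, so 4=4 and 4=6 match as well
loose=$(LC_ALL=C grep -c '4=[745-755]' "$tmp/data.txt")

# Alternation anchored between two | separators only accepts 745..755
strict=$(grep -cE '\|4=(74[5-9]|75[0-5])\|' "$tmp/data.txt")

echo "bracket expression: $loose lines, alternation: $strict lines"
rm -rf "$tmp"
```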

clippertm · March 20, 2014, 5:17am

Hi Lucas,

No problem.

Thanks, the count loop seems to work, but the whole command no longer seems to return unique occurrences; it returns all of them.

apmcd47 · March 20, 2014, 5:19am

Your first problem is that you need to match the patterns in your pattern file with the second field. You cannot just use grep -f list.txt . You really need to use awk or perl. You can then pattern-match on the two fields you are interested in without a stray 254=47587 in, say, the last field, matching your patterns file.
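
A quick sketch of the stray-match problem described above, on two made-up lines (the data is an assumption). A plain substring search for 4=475 also hits 254=47587, while a pattern anchored to the | separators (or the start or end of the line) only accepts a whole 4=475 field.

```shell
tmp=$(mktemp -d)
printf '%s\n' \
  '5021=0|4=475|3078=1' \
  '5021=0|4=1|3078=2|254=47587' > "$tmp/data.txt"

# Substring match: the second line matches through 254=47587
plain=$(grep -c '4=475' "$tmp/data.txt")

# Field-anchored match: 4=475 must be a complete field
anchored=$(grep -cE '(^|\|)4=475(\||$)' "$tmp/data.txt")

echo "plain: $plain, anchored: $anchored"
rm -rf "$tmp"
```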

Something I was not sure about. Are you after the first instance of 4=475 and 3078= AND the first instance of 4=476 and 3078= etc etc rather than the first instance of ANY 4=xxx and 3078= ? In which case you will have to loop through your patterns file anyway.

I am not going to offer any code because I don't really know awk and my perl is rusty. Good luck

Andrew

clippertm · March 20, 2014, 5:39am

Hi Apm,

Well thank you for your reply, but I only know grep :o

I have never used awk or perl :o

---------- Post updated at 04:28 AM ---------- Previous update was at 04:24 AM ----------

I have come up with this:

However it still returns all the occurrences, not only the first one

---------- Post updated at 04:39 AM ---------- Previous update was at 04:28 AM ----------

Actually, it seems that the simple:

does the trick!

Really simple.. am I missing anything?

Lucas_0418 · March 20, 2014, 5:41am

Sorry clippertm,
I think the reason all occurrences are printed is that we used the wildcard *.*,
so maybe we must use sort after grep.
Or use perl or awk as apmcd47 says; either could solve the problem.

With the wildcard *.*, grep treats the first match in every file as a first occurrence. Maybe cat *.* | grep could work, but I cannot test it while I am on a bus.

And sorry for my not very good English. Let me check your desired output again:
a. line has 4=745 and 3078=
b. line only has 4=745, not 3078=
c. line only has 3078=, not 4=745

For your additional question:
[1-9]\|[1-9][0-9]\|[1-9][0-9][0-9]\|1[0-9][0-9][0-9]\|2000
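
As a rough check of a range pattern like this, the sketch below writes the same 1 to 2000 range in a more compact grep -E form and tests it against generated values. The variable name range and the generated test input are assumptions for illustration. The -x option makes the whole line have to match, which stands in for anchoring the field.

```shell
# 1-999 without leading zeros, 1000-1999, or exactly 2000
range='([1-9][0-9]{0,2}|1[0-9]{3}|2000)'

# Generate 4=0 .. 4=2001 and count how many are accepted
matched=$(seq 0 2001 | sed 's/^/4=/' | grep -cxE "4=$range")

# Zero-padded values such as 0001 should be rejected
leading_zero=$(printf '4=0001\n' | grep -cxE "4=$range" || true)

echo "accepted: $matched, zero-padded accepted: $leading_zero"
```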

clippertm · March 20, 2014, 5:54am

Nope it does not work

returns fewer lines than it should...

---------- Post updated at 04:44 AM ---------- Previous update was at 04:43 AM ----------

Hi Lucas,

I do not know how to use awk/perl at all :o

---------- Post updated at 04:54 AM ---------- Previous update was at 04:44 AM ----------

Does anybody know how to check a range from 1 to 2000? Not 0001, 0002, 0003 etc. but 1, 2, 3

MadeInGermany · March 20, 2014, 7:51am

With awk:

awk -F "|" '
# hash the search list
NR==FNR {L[$1]=0; next}
# now proceed with the data files
# print if the following is true
($4~/^3078=/ && ($2 in L) && L[$2]++==0)
' searchlist.txt datafile1.txt datafile2.txt

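To look for 3078= in any field rather than only the fourth:

awk -F "|" '
# hash the search list
NR==FNR {L[$1]=0; next}
# now proceed with the data files
# print if the following is true
# (the | is escaped because a bare | means alternation in an awk regex)
(/\|3078=/ && ($2 in L) && L[$2]++==0)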
' searchlist.txt datafile1.txt datafile2.txt

---------- Post updated at 06:51 AM ---------- Previous update was at 05:35 AM ----------

lucas_0418:

What's your environment? I found that your code works in my Cygwin.
grep -h '4=[745-755]++' *.* | grep '3078=' | sort -u -t'|' -k2,2
Maybe you could put all the patterns in a file, then use grep's -m option to get the first matching line.
# -r keeps backslashes in patterns such as \{1,\} intact
while IFS= read -r a
do
  grep -h -m 1 "$a" *.*
done < yourpatternfile

grep -m1 stops at the first match in each file, so it still prints one line per file.
awk is much more flexible:

awk -F "|" -v low=745 -v high=755 '
# build the Lookup hash
BEGIN {for (i=low; i<=high; i++) L["4="i]}
# main loop
# if in Lookup hash and if a field begins with 3078=
($2 in L) && /\|3078=/ {
  print
  # delete from the Lookup hash
  delete L[$2]
}
' datafile*.txt

clippertm · March 20, 2014, 10:35pm

Hi Lucas,

Thank you for the range!

The output I am looking for is a. line has 4=745 and 3078=

Thanks again for your help!

---------- Post updated at 09:35 PM ---------- Previous update was at 09:30 PM ----------

Hi MadeInGermany,

Thank you for your awk samples, but they do not produce the output I am looking for.

If I change the last one to:

awk -F "|" -v low=1 -v high=2000 '
# build the Lookup hash
BEGIN {for (i=low; i<=high; i++) L["4="i]}
# main loop
# if in Lookup hash and if a field begins with 3078=
($2 in L) && /\|3078=/ {
  print
  # delete from the Lookup hash
  delete L[$2]
}
' *.txt

It only returns 4 results and there should be 100s.

Lucas_0418 · March 21, 2014, 12:17am

Hi clippertm,
Have you tried cat *.* | grep? I think that if your pattern file is something like this (I posted it yesterday):

4=745|[^|]\{1,\}|3078=
4=746|[^|]\{1,\}|3078=
4=747|[^|]\{1,\}|3078=
4=748|[^|]\{1,\}|3078=
4=749|[^|]\{1,\}|3078=
4=750|[^|]\{1,\}|3078=
4=751|[^|]\{1,\}|3078=
4=752|[^|]\{1,\}|3078=
4=753|[^|]\{1,\}|3078=
4=754|[^|]\{1,\}|3078=
4=755|[^|]\{1,\}|3078=

this command may work:

# read the patterns line by line; -r keeps the backslashes intact
while IFS= read -r pattern
do
    cat *.* | grep -h -m 1 "$pattern"
done < your_pattern_file

The only problem with the command above is that I don't know whether it is inefficient :D

clippertm · March 21, 2014, 1:20am

Hi Lucas,

Sorry it does not work

cannot work because sometimes there are different "4=" per file. If it returns only one per file, then it misses all the other "4="

If I write:

grep -h '4=745' *.* | grep '3078='

it returns 11,000 lines.

If I write:

grep -h '4=745|[^|]\{1,\}|3078=' *.*

it returns far fewer lines..

I have spent hours on this issue.. I am at a loss as to what to do

clippertm · March 21, 2014, 5:27am

Sample files!

clippertm · March 23, 2014, 8:45pm

Adding expected output:

Lucas_0418 · March 23, 2014, 11:52pm

Hi clippertm,
Let's use awk instead of grep. Try this; it works fine in my Cygwin, and I hope it works in yours too.

awk '$2~/^4=[0-9]+$/{split($2,a,/=/);if(int(a[2])>=0&&int(a[2])<=2000&&$0~/\|3078=/&&!b[$2]){b[$2]++;print $0}}' FS='|' *.txt

$ cat data1.txt
5021=0|4=748|12=ABC|3078=7484561|4102=748
5021=0|4=749|12=ABC|3214=748|3078=7486512
5021=0|4=748|12=DEF|3078=7481564151|855=748
5021=0|4=750|12=ABC|987=748|3078=7481231
5021=0|4=750|12=DEF|3078=41561|6321=748
5021=0|4=750|12=DEF|3078=7812|8412=748
5021=0|4=750|12=DEF|3078=121888|8855=748
5021=0|4=749|12=ABC|3078=12688|2222=748
5021=0|4=748|12=GHI|3078=812135|8745=748
5021=0|4=748|12=ABC|3078=812121|9647=748
5021=0|4=753|12=GHI|7444=748|3078=121888
$ cat data2.txt
5022=0|4=755|12=ABC|3078=7484561|4102=748
5022=0|4=743|12=ABC|3214=748|3078=7486512
5022=0|4=755|12=DEF|3078=7481564151|855=748
5022=0|4=755|12=ABC|987=748|3078=7481231
5022=0|4=749|12=DEF|3078=41561|6321=748
5022=0|4=748|12=DEF|3078=7812|8412=748
5022=0|4=752|12=DEF|3078=121888|8855=748
5022=0|4=740|12=ABC|3078=12688|2222=748
5022=0|4=740|12=GHI|3078=812135|8745=748
5022=0|4=743|12=ABC|3078=812121|9647=748
5022=0|4=752|12=GHI|7444=748|3078=121888
$ awk '$2~/^4=[0-9]+$/{split($2,a,/=/);if(int(a[2])>=0&&int(a[2])<=2000&&$0~/\|3078=/&&!b[$2]){b[$2]++;print $0}}' FS='|' *.txt
5021=0|4=748|12=ABC|3078=7484561|4102=748
5021=0|4=749|12=ABC|3214=748|3078=7486512
5021=0|4=750|12=ABC|987=748|3078=7481231
5021=0|4=753|12=GHI|7444=748|3078=121888
5022=0|4=755|12=ABC|3078=7484561|4102=748
5022=0|4=743|12=ABC|3214=748|3078=7486512
5022=0|4=752|12=DEF|3078=121888|8855=748
5022=0|4=740|12=ABC|3078=12688|2222=748
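
Since the first post says that 4= and 3078= are not always at the same position in a line, here is a hedged sketch of a variant that scans every field instead of looking only at the second one. The sample data, the temporary files, and the names key, has3078 and seen are all assumptions made for illustration; this is not a tested answer for the real files. It prints the first line seen, across all files, for each 4=N with N between low and high that also carries a 3078= field.

```shell
tmp=$(mktemp -d)
printf '%s\n' \
  '5021=0|4=748|12=ABC|3078=11' \
  '3078=22|5021=0|12=DEF|4=749' \
  '5021=0|4=748|12=GHI|3078=33' \
  '5021=0|4=750|12=ABC|987=748' > "$tmp/data1.txt"
printf '%s\n' \
  '5022=0|12=ABC|4=749|3078=44' \
  '5022=0|254=750|4=751|3078=55' \
  '5022=0|4=2001|3078=66' > "$tmp/data2.txt"

result=$(awk -F'|' -v low=1 -v high=2000 '
{
  key = ""; has3078 = 0
  # look at every field, whatever its position
  for (i = 1; i <= NF; i++) {
    if ($i ~ /^4=[0-9]+$/)  key = $i
    else if ($i ~ /^3078=/) has3078 = 1
  }
  if (key == "" || !has3078) next
  n = substr(key, 3) + 0            # numeric part after "4="
  # print only the first line seen for each in-range key
  if (n >= low && n <= high && !seen[key]++) print
}' "$tmp"/data1.txt "$tmp"/data2.txt)

printf '%s\n' "$result"
rm -rf "$tmp"
```

Because each field is compared as a whole, a field such as 254=750 is not mistaken for 4=750, and the seen array is shared across all input files, so a key already printed from one file is not printed again from another.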