read in variable data from another file - grep

ccox85 · December 16, 2008, 10:47am

Hello! I think this should be an easy solution. I have a large file with many fields of data. The first field has a unique identifier (a subject number) for every record for a chunk of data. Something like this:

There were ten experimental conditions (ec), but the ec is identified by only one field in one record for each subject. I already ran a script to create a list of all subject numbers and their corresponding ec. For simplicity's sake (or so I think) I divided up that list by ec, so I now have numerous lists containing all of the subject numbers for each ec. I hope you can picture this.

Let's call my main data file "all.inq" and my lists "[0-9].lst".

I want to grep all the records in all.inq that begin with the numbers listed in 0.lst and put them in a file 0.inq, and then repeat the process for all other lists. This is probably a one liner, but I just don't know much about grep.

Thanks!
Chris

ccox85 · December 16, 2008, 11:01am

I thought of a way to simplify the question and make it general.

grep syntax is:

Can you have grep read the pattern from a file, and go line by line through that file and just append all standard output to the intended file?
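For later readers: grep can take its patterns from a file with the -f option, one pattern per line. Below is a minimal sketch, assuming each list holds one subject number per line and that the fields in all.inq are tab-separated. The sample rows, the scratch directory, and the .pat file names are made up for illustration.

```shell
#!/bin/sh
# Demo in a scratch directory; the sample rows are invented for illustration.
demo=$(mktemp -d) && cd "$demo" || exit 1
tab=$(printf '\t')

printf '111\t6\t191\tAnger\n222\t6\t192\tDebt\n1112\t6\t193\tHome\n' > all.inq
printf '111\n' > 0.lst
printf '222\n' > 1.lst

for lst in [0-9].lst; do
    ec=${lst%.lst}
    # Turn each subject number into an anchored pattern: ^NUMBER<tab>
    sed "s/.*/^&$tab/" "$lst" > "$ec.pat"
    # -f reads one pattern per line from the file
    grep -f "$ec.pat" all.inq > "$ec.inq"
done
```

Anchoring each pattern with ^ and a trailing tab keeps subject 111 from also pulling in subject 1112.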

nj78 · December 16, 2008, 11:22am

for name in list
do
    egrep "^$name" all.inq > "$name.out"
done

To append to the output file rather than overwrite it, use >> $name.out.

Here, list stands for the numbers.

ccox85 · December 16, 2008, 12:13pm

I don't understand how this would work. It looks to me like list would be the file name of my list file, and name would just read in each record in the file (??? I don't know if that is the right interpretation). Then, the grep command looks for lines in the data file that start with the subject numbers in the list file. However, if I print to >> $name.out, then I am going to have a new file for every subject... which would not be good. I want to group by experimental condition.

If I am interpreting this correctly, I would be able to do this and just define the output file's name explicitly to be the ec number, if I read from my individual lists that are already grouped by condition.

Is there a way to read in column $1 from the list file (which is the ec) and have that govern the name of the .out, and then read the second column of the list file (the subject number) into the grep search pattern?

So the grep line would be something like this, I think:

grep '^$2' all.inq >> $1.out

I just don't know how to read in the master list file so that it would work.
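One possible way to read the master list is a while read loop. This is only a sketch: master.lst is a hypothetical name (the thread never names the master list), the sample rows are invented, and the fields are assumed to be tab-separated. Note that inside a shell script $1 and $2 are the script's arguments, not columns, and that single quotes around '^$2' would stop the shell from expanding the variable at all.

```shell
#!/bin/sh
demo=$(mktemp -d) && cd "$demo" || exit 1
tab=$(printf '\t')

printf '111\t6\t191\tAnger\n222\t6\t192\tDebt\n111\t6\t193\tHome\n' > all.inq
# master.lst is a hypothetical name: column 1 = ec, column 2 = subject number
printf '1\t111\n2\t222\n' > master.lst

# read splits each line on whitespace into the two named variables
while read -r ec sn; do
    # Double quotes let the shell expand $sn; the trailing tab stops
    # 111 from also matching 1112
    grep "^$sn$tab" all.inq >> "$ec.out"
done < master.lst
```

This scans all.inq once per subject, so it may be slow on a very large file.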

vgersh99 · December 16, 2008, 12:18pm

Is 'ec' the first field in all.inq?

Please post a sample all.inq, a sample 1.lst and a desired output of 1.inq.

And indicate a 'common' field between the all.inq and 1.lst files.

ccox85 · December 16, 2008, 12:58pm

Alright, this is a sample from all.inq:

570613667 6 191 compatible unpleasb 30 1 579 30 Anger
570613667 6 192 compatible europeanamericanright 76 1 490 35 g10cm.bmp
570613667 6 193 compatible unpleasb 30 1 805 49 Debt
570613667 6 194 compatible pleasb 76 1 803 3 Dinner
570613667 6 195 compatible pleasb 30 0 617 11 Merry
570613667 6 196 compatible unpleasb 30 1 638 15 Lice
570613667 6 197 compatible pleasb 76 1 595 43 Home
570613667 6 198 compatible pleasb 76 1 569 35 Yacht
570613667 6 199 compatible europeanamericanright 76 1 497 50 g28cm.bmp
570613667 6 200 compatible unpleasb 30 1 627 48 Broken

The file is enormous so this is only a small chunk of block 6 from subject# 570613667. If you look in my first post, that simplified version should be enough to work with. The experiment condition is only in one field of one record at the top of each subject (that line also begins with the subject#, which is a good thing for me). I already handled this part of the process with awk. I found a unique quality of a field in the first line, and printed the ec and sn fields ($6=ec; $1=sn, if you care).

So, my master list looks like this:

There are no headers. $1=ec; $2=sn

In case it would be helpful, I also have made sub lists by grepping all lines that begin with 1, 2, 3, ... for all possible ec.

Ideally, I would not use these sub lists, and I would be able to read right from the master list.

My output would look exactly like the all.inq, except that the subjects with ec=1 would be in a file 1.inq, and so on. Again, the relationship is mandated by the records in the master list.

I hope this is clear enough for you to help me now! Thank you,

Chris

PS you can safely assume all spaces are tabs.
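Reading straight from the master list can be done in a single pass with awk. The following is a sketch under assumptions, not a tested solution for the real data: master.lst is a hypothetical file name, the sample rows are invented, and the columns are taken to be $1=ec, $2=sn in the master list and $1=sn in all.inq, as described above.

```shell
#!/bin/sh
demo=$(mktemp -d) && cd "$demo" || exit 1

printf '111\t6\t191\tAnger\n222\t6\t192\tDebt\n111\t6\t193\tHome\n333\t6\t194\tLice\n' > all.inq
printf '1\t111\n2\t222\n' > master.lst

awk '
    # First file (master.lst): remember the ec for each subject number
    NR == FNR { ec[$2] = $1; next }
    # Second file (all.inq): write the record to <ec>.inq if the subject is listed
    $1 in ec  { out = ec[$1] ".inq"; print > out }
' master.lst all.inq
```

Subjects that do not appear in the master list (333 in the sample) are simply skipped, and all.inq is read only once.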

ccox85 · December 16, 2008, 1:03pm

While we are at it, I need to remove a group of subjects from ec 1. I have their subject numbers listed, but I don't know how to use this list to expedite deleting them from the master list or 1.lst.

vgersh99 · December 16, 2008, 1:42pm

not following the whole thing, but......

assuming all.inq is:

570613667 6 191 compatible unpleasb 30 1 579 30 Anger
570613667 6 192 compatible europeanamericanright 76 1 490 35 g10cm.bmp
570613667 6 193 compatible unpleasb 30 1 805 49 Debt
570613667 6 194 compatible pleasb 76 1 803 3 Dinner
570613667 6 195 compatible pleasb 30 0 617 11 Merry
570613667 6 196 compatible unpleasb 30 1 638 15 Lice
570613667 6 197 compatible pleasb 76 1 595 43 Home
570613667 6 198 compatible pleasb 76 1 569 35 Yacht
570613667 6 199 compatible europeanamericanright 76 1 497 50 g28cm.bmp
570613667 6 200 compatible unpleasb 30 1 627 48 Broken

and the 'ec' values are in the SECOND field

and a list of 'subjects' to delete is stored in a file sub2delete, one per line:

nawk -v suf='.inq' 'FNR==NR { sub2del[$1]; next} !($1 in sub2del) { file=$2 suf; print >> file; close(file)}' sub2delete all.inq

Is that something you're looking for?
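A spelled-out sketch of the same idea, with comments, may be easier to follow. It assumes the subject number is the first field of all.inq, keeps the assumption above that the second field selects the output file, and uses invented sample rows. Plain awk is used here; nawk should behave the same where it is installed.

```shell
#!/bin/sh
demo=$(mktemp -d) && cd "$demo" || exit 1

printf '111\t6\t191\tAnger\n222\t6\t192\tDebt\n333\t7\t193\tHome\n' > all.inq
printf '222\n' > sub2delete

awk -v suf='.inq' '
    # FNR == NR holds only while reading the first file (sub2delete):
    # store each subject number as an array key, then skip to the next line
    FNR == NR { sub2del[$1]; next }
    # For all.inq: keep records whose subject number (field 1) is not listed
    # and append them to a file named after field 2, e.g. 6.inq
    !($1 in sub2del) { file = $2 suf; print >> file; close(file) }
' sub2delete all.inq
```

Because the output is appended with >>, running it twice would duplicate the records, so the output files should be removed before a rerun.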

ccox85 · December 16, 2008, 4:00pm

hmmm... I don't know how to ask the question, I guess. And I did not understand a lot of what is going on in your suggestion, vgersh99. I feel like I am making this more complicated than I need to. I found a workaround, as inelegant as it is:

I have 10 lists. Each list contains all of the subject numbers, one number per line, grouped by experimental condition. Because I don't know how to make grep read patterns from a file, I used vi to insert:

grep '[subject number]' all.inq >> [0-9].inq

around the subject number on every line, saved it as $ec.grep and then ran the file as a shell script:

$>sh [0-9].grep

This solved the first problem.

What I want to do now is very simple. I have a list of subject numbers (call it sn.all). I have a second list (call it sn.rm) of subject numbers that need to be deleted from the first list. I have about 500 subject numbers, and I have 60-something that need to be deleted. I want to somehow compare sn.all to sn.rm, remove all numbers in sn.rm from sn.all, and create a new file called sn.fin.
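This kind of list subtraction can be sketched with grep alone, assuming both files hold exactly one subject number per line with no trailing spaces or carriage returns. The numbers below are invented for the demo.

```shell
#!/bin/sh
demo=$(mktemp -d) && cd "$demo" || exit 1

printf '111\n222\n333\n1112\n' > sn.all
printf '222\n111\n' > sn.rm

# -f sn.rm  read the patterns from sn.rm, one per line
# -F        treat them as fixed strings, not regular expressions
# -x        match whole lines only, so 111 does not remove 1112
# -v        print the lines that do NOT match
grep -v -x -F -f sn.rm sn.all > sn.fin
```

The order of sn.all is preserved in sn.fin, and neither input needs to be sorted.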

I hope this makes sense. The fact that I don't know what I am doing makes this harder.

Chris

ccox85 · December 16, 2008, 10:41pm

Thank you for your help, mods! I just wanted to let you know that you can close this thread if you like, because I came across a solution. I have never used sed before, but it has an option to read from a file like I want to do, and I should be able to mimic the grep functionality with it.

Now I am going to bombard the boards with sed questions. Yippee!
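The thread does not show the sed solution, so the following is only a guess at what it might look like: sed can read its commands from a script file with -f, and such a script can be generated from a list of subject numbers. The file name 0.sed and the sample rows are assumptions made for the demo, and fields are taken to be tab-separated.

```shell
#!/bin/sh
demo=$(mktemp -d) && cd "$demo" || exit 1

printf '111\t6\t191\tAnger\n222\t6\t192\tDebt\n1112\t6\t193\tHome\n' > all.inq
printf '111\n' > 0.lst

# Build one sed command per subject number: /^NUMBER<tab>/p
awk 'NF { printf "/^%s\t/p\n", $1 }' 0.lst > 0.sed

# -n suppresses the default output, so only lines printed by a p command appear;
# -f reads the sed commands from the generated script
sed -n -f 0.sed all.inq > 0.inq
```

As long as the subject numbers in the list are unique, each record matches at most one command and is printed once.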

nj78 · December 17, 2008, 11:21am

ccox85:

I don't understand how this would work. It looks to me like list would be the file name of my list file, and name would just read in each record in the file (??? I don't know if that is the right interpretation). Then, the grep command looks for lines in the data file that start with the subject numbers in the list file. However, if I print to >> $name.out, then I am going to have a new file for every subject... which would not be good. I want to group by experimental condition.

If I am interpreting this correctly, I would be able to do this and just define the output file's name explicitly to be the ec number, if I read from my individual lists that are already grouped by condition.

Is there a way to read in column $1 from the list file (which is the ec) and have that govern the name of the .out, and then read the second column of the list file (the subject number) into the grep search pattern?

So the grep line would be something like this, I think:
grep '^$2' all.inq >> $1.out
I just don't know how to read in the master list file so that it would work.

I don't want to be too specific in case you are just doing homework, but you had spaces in your example, and it sounded like you wanted files such as 1.out, 2.out.

ccox85 · December 17, 2008, 3:11pm

I understand better now, thanks. Unfortunately, I am not doing homework. Just working in a lab where Windows is a swear word and I am being motivated to learn shell scripts.