cmccabe
December 16, 2014, 1:45pm
1
I have downloaded source code for 97 files using:
wget -x -i link.txt
then run a rename loop:
for file in *
do
mv "$file" "$file.txt"
done
to keep the html tags but make the file a text that can be parsed.
In each of the 97 text files the number of genes varies, but each gene is associated with, or should have, a corresponding OMIM #. The files are all in:
C:\Users\cmccabe\Desktop\list\geneticslab.emory.edu.txt\tests_txt
Is there a way to search the source code for these gene names and OMIM #s?
For example, in the attached file there are 26 genes:
Output (tab-delimited)
A B
Gene OMIM
AKT1 164730
ALK 105590
APC 611731
The gene names seem to be after
target = '_blank'>AKT1</a>
and the OMIM # seem to be
style = 'margin-bottom:10px;'><a href =
I think sed can parse HTML, but I am not familiar enough with it to know how to code it for multiple files in a directory.
Thank you to all for the help :).
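The pattern described in this post can be sketched with sed roughly as follows. This is only an illustration under assumptions: the `sample` line is a hypothetical reconstruction from the two fragments quoted above (the real pages may be laid out differently), and `extract_pairs` is a made-up helper name.

```shell
# Hypothetical sample line, reconstructed from the fragments quoted above.
sample="<p style = 'margin-bottom:10px;'><a href = 'http://www.omim.org/entry/164730' target = '_blank'>AKT1</a>, <a href = 'http://www.omim.org/entry/105590' target = '_blank'>ALK</a></p>"

tab=$(printf '\t')

# extract_pairs (illustrative name): read HTML on stdin, print "gene<TAB>OMIM".
extract_pairs() {
  # Split at commas so each link sits on its own line, then capture the
  # digits after omim.org/entry/ (\1) and the link text (\2).
  tr ',' '\n' |
    sed -n "s|.*omim\.org/entry/\([0-9]*\)'[^>]*>\([^<]*\)</a>.*|\2${tab}\1|p"
}

printf '%s\n' "$sample" | extract_pairs
```

Note that this picks up every OMIM link in its input, not only those in the genes section, so it may need narrowing for the real files.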
RudiC
December 16, 2014, 3:34pm
2
You don't need to rename the files to .txt to parse them with sed. Try:
awk '/^<h2 id="genes"/ {getline
for (i=1; i<=NF; i++)
{n1=gsub ("[, ]*<a href = .http://www.omim.org/entry/", "", $i)
n2=gsub (". target = ._blank.", "", $i)
n3=gsub ("</a", "", $(i+1))
if (n1 + n2) print $(i+1) "\t" $i
}
exit
}
' FS=">" /tmp/CM080
AKT1 164730
ALK 105590
APC 611731
BRAF 164757
CDH1 192090
CTNNB1 116806
EGFR 131550
ERBB2 164870
FBXW7 606278
FGFR2 176943
FOXL2 605597
GNAQ 600998
GNAS 139320
KIT 164920
KRAS 190070
MAP2K1 176872
MET 164860
MSH6 600678
NRAS 164790
PDGFRA 173490
PIK3CA 171834
PTEN 601728
SMAD4 600993
SRC 190090
STK11 602216
TP53 191170
cmccabe
December 16, 2014, 4:34pm
3
Can a loop be used to parse each of the 97 files in
C:\Users\cmccabe\Desktop\list\geneticslab.emory.edu.txt\tests
keeping the original file and the newly created parsed.txt? Thank you :).
So, CM080 (original) - CM080parsed.txt
Also, if I run the command and write the output to a file, it takes a very long time:
awk '/^<h2 id="genes"/ {getline
for (i=1; i<=NF; i++)
{n1=gsub ("[, ]*<a href = .http://www.omim.org/entry/", "", $i)
n2=gsub (". target = ._blank.", "", $i)
n3=gsub ("</a", "", $(i+1))
if (n1 + n2) print $(i+1) "\t" $i
}
exit
}
' FS=">" CM080 > output.txt
However, without the output redirection it runs quickly:
awk '/^<h2 id="genes"/ {getline
for (i=1; i<=NF; i++)
{n1=gsub ("[, ]*<a href = .http://www.omim.org/entry/", "", $i)
n2=gsub (". target = ._blank.", "", $i)
n3=gsub ("</a", "", $(i+1))
if (n1 + n2) print $(i+1) "\t" $i
}
exit
}
' FS=">" CM080
Thank you :).
RudiC
December 17, 2014, 5:29am
4
There's no reason it should run slower when printing to a file, and it doesn't when I try it. For your multiple files, try:
awk '/^<h2 id="genes"/ {getline
for (i=1; i<=NF; i++)
{n1=gsub (".*omim.org/entry/", "", $i)
n2=gsub (". target = ._blank.>", "\t", $i)
n3=gsub ("</a>.*", "", $i)
print $i > FILENAME".parsed"
}
}
' FS="," /tmp/CM*
after having adapted the input files' path. It requires that ALL input files have the same structure as CM080 does.
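For readers who want a slightly more defensive variant of the script above, here is a minimal sketch. It assumes the same structure as CM080 (a line starting with `<h2 id="genes"` followed by one line holding all the links); `parse_genes` is an illustrative name. It differs from the script above in three ways: fields without an OMIM link are skipped, each output file is closed before the next one is opened, and the output file name is built in a variable first because an unparenthesized concatenation after `>` is not handled the same way by every awk.

```shell
# parse_genes (illustrative name): for each input file F, write
# "<OMIM><TAB><gene>" lines to F.parsed and leave F itself untouched.
parse_genes() {
  awk -F',' '
    FNR == 1 && out != "" { close(out) }     # previous file is finished
    FNR == 1 { out = FILENAME ".parsed" }    # output name for this file
    /^<h2 id="genes"/ {
      if ((getline) <= 0) next               # no line after the heading
      for (i = 1; i <= NF; i++) {
        if (!sub(/.*omim\.org\/entry\//, "", $i)) continue   # no OMIM link here
        sub(/. target = ._blank.>/, "\t", $i)
        sub(/<\/a>.*/, "", $i)
        print $i > out
      }
    }
  ' "$@"
}
```

It would be called with the input files as arguments, for example `parse_genes /tmp/CM*`. As in the script above, the columns come out as OMIM number first, gene name second.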
cmccabe
December 17, 2014, 9:47am
5
awk '/^<h2 id="genes"/ {getline
> for (i=1; i<=NF; i++)
> {n1=gsub (".*omim.org/entry/", "", $i)
> n2=gsub (". target = ._blank.>", "\t", $i)
> n3=gsub ("</a>.*", "", $i)
> print $i > FILENAME".parsed"
> }
> }
> ' FS="," C:\Users\cmccabe\Desktop\list\geneticslab.emory.edu.txt\parse
awk: fatal: cannot open file `C:UserscmccabeDesktoplistgeneticslab.emory.edu.txtparse' for reading (No such file or directory)
I am using cygwin on a windows machine. Should I try Ubuntu? Thank you :).
RudiC
December 17, 2014, 9:51am
6
Looks like it suppressed all "\" in the input file path. Try to escape or quote.
You did not specify any wildcard, so it would work on file "parse" only. Is this what you want? Or is "parse" a directory?
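The backslash behaviour can be reproduced in any POSIX-style shell, Cygwin's bash included. A small sketch, using a shortened version of the path from the post above:

```shell
# Unquoted: the shell treats every backslash as an escape and removes it.
unquoted=$(printf '%s' C:\Users\cmccabe\Desktop)
# Single-quoted: the backslashes are passed through untouched.
quoted=$(printf '%s' 'C:\Users\cmccabe\Desktop')

printf '%s\n' "$unquoted"   # C:UserscmccabeDesktop
printf '%s\n' "$quoted"     # C:\Users\cmccabe\Desktop
```

On Cygwin specifically, another option is to avoid backslashes altogether by using the POSIX form of the path; `cygpath -u` converts a Windows path to that form (typically something under `/cygdrive/c/`).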
cmccabe
December 17, 2014, 9:56am
7
I put all 97 of the files to be parsed in a directory:
C:\Users\cmccabe\Desktop\list\geneticslab.emory.edu.txt\parse
So maybe that should be
"C:\Users\cmccabe\Desktop\list\geneticslab.emory.edu.txt\parse"*
Not sure how to escape.
Thank you :).
RudiC
December 17, 2014, 9:59am
8
How would you specify a path for other tools?
Try
"C:\Users\cmccabe\Desktop\list\geneticslab.emory.edu.txt\parse\*"
cmccabe
December 17, 2014, 10:27am
9
awk '/^<h2 id="genes"/ {getline
> for (i=1; i<=NF; i++)
> {n1=gsub (".*omim.org/entry/", "", $i)
> n2=gsub (". target = ._blank.>", "\t", $i)
> n3=gsub ("</a>.*", "", $i)
> print $i > FILENAME".parsed"
> }
> }
> ' FS="," "C:\Users\cmccabe\Desktop\list\geneticslab.emory.edu.txt\parse\*"
awk: fatal: cannot open file `C:\Users\cmccabe\Desktop\list\geneticslab.emory.edu.txt\parse\*' for reading (No such file or directory)
For windows, I would quote the path and that seems to have worked to un-suppress the \, but it doesn't like the wildcard. Thank you :).
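The wildcard failure is what a POSIX-style shell does with any quoted `*`: inside quotes the star is passed along literally and is never expanded to file names. A minimal sketch using a throwaway directory (`count_args` is an illustrative helper, not a standard command):

```shell
demo=$(mktemp -d)                 # throwaway directory for the demonstration
touch "$demo/CM080" "$demo/CM081"

count_args() { echo "$#"; }       # illustrative helper: how many arguments arrived?

count_args "$demo/*"    # star inside the quotes: prints 1 (one literal argument)
count_args "$demo"/*    # star outside the quotes: prints 2 (one argument per file)
```

So quoting the directory part and leaving the `*` outside the closing quote keeps both the backslashes and the wildcard expansion.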
RudiC
December 17, 2014, 10:58am
10
What's the result of
dir "C:\Users\cmccabe\Desktop\list\geneticslab.emory.edu.txt\parse\*"
?
cmccabe
December 17, 2014, 11:22am
11
awk '/^<h2 id="genes"/ {getline
> for (i=1; i<=NF; i++)
> {n1=gsub (".*omim.org/entry/", "", $i)
> n2=gsub (". target = ._blank.>", "\t", $i)
> n3=gsub ("</a>.*", "", $i)
> print $i > FILENAME".parsed"
> }
> }
> ' FS="," dir "C:\Users\cmccabe\Desktop\list\geneticslab.emory.edu.txt\parse\*"
awk: fatal: cannot open file `dir' for reading (No such file or directory)
Thank you very much :).
RudiC
December 18, 2014, 5:16am
12
Please issue the above command directly from the command line, not as the input stream parameter to awk.
cmccabe
December 18, 2014, 10:48am
13
I used the following in Ubuntu and it worked perfectly.
awk '/^<h2 id="genes"/ {getline
for (i=1; i<=NF; i++)
{n1=gsub (".*omim.org/entry/", "", $i)
n2=gsub (". target = ._blank.>", "\t", $i)
n3=gsub ("</a>.*", "", $i)
print $i > FILENAME".parsed"
}
}
' FS="," /home/dnascopev/Desktop/list/geneticslab.emory.edu.txt/parse/*
Each of the 97 files in that directory now has the original and a .parsed version (CM080 and CM080.parsed, CM081 and CM081.parsed).
Is there a way to combine the 97 .parsed files into one overall file called all_genes.txt? Thank you for all your help :).
Corona688
14
cat *.parsed > all_genes.txt
RudiC
December 18, 2014, 12:36pm
15
You explicitly asked for single unique .parsed files in post #3. Anyhow, remove the > FILENAME".parsed" from within the script and add a > all_genes.txt to the very end. Or use Corona688's proposal.
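Spelled out, the first suggestion might look like the sketch below: the awk script prints to standard output and the shell redirects everything once. `parse_all` is an illustrative name, the guard that skips fields without an OMIM link is an addition not present in the thread's script, and the input files are assumed to share CM080's structure.

```shell
# parse_all (illustrative name): print "<OMIM><TAB><gene>" for every
# input file to standard output, so the caller decides where it goes.
parse_all() {
  awk -F',' '
    /^<h2 id="genes"/ {
      if ((getline) <= 0) next               # no line after the heading
      for (i = 1; i <= NF; i++) {
        if (!sub(/.*omim\.org\/entry\//, "", $i)) continue   # no OMIM link here
        sub(/. target = ._blank.>/, "\t", $i)
        sub(/<\/a>.*/, "", $i)
        print $i
      }
    }
  ' "$@"
}
```

A call could then be `parse_all /path/to/parse/CM* > all_genes.txt`. Writing all_genes.txt somewhere outside the input directory is probably safer when the input pattern is a bare `*`, since on a second run the pattern would otherwise match the output file as well.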
cmccabe
December 18, 2014, 4:33pm
16
Thank you very much for all your help :).