How to write a shell script to extract the lines we want

rainboisterous · June 10, 2010, 11:15pm

Hi,
I have a very large file. It contains lines in the format below:

seed url, html url
....
...
seed url, html url

I have already sorted it.
[sample lines unreadable: the text was garbled by an encoding error]

Now I want to get 3 html URLs for each seed URL.
Any tips will be appreciated.

---------- Post updated at 11:15 AM ---------- Previous update was at 11:11 AM ----------

Hi,
I have a very large file. It contains lines in the format below:

seedurl1,htmlurl1
seedurl1,htmlurl2
....
seedurl1,htmlurln
.....
seedurlm,htmlurl1
seedurlm,htmlurl2
.....
seedurlm,htmlurln
......

Now I want to get the first 3 htmlurl lines for each seedurl.
Any tips will be appreciated.

rdcwayx · June 10, 2010, 11:20pm

Your first post can't be read.

For your second post, do you want to get the output as below?

seedurl1, htmlurl1, htmlurl2, htmlurl3 (the first 3 URLs for each seedurl)?
...
seedurlm, htmlurl1, htmlurl2, htmlurl3

rainboisterous · June 10, 2010, 11:22pm

For the second post, I want the output to be:
seedurl1,htmlurl1
seedurl1,htmlurl2
seedurl1,htmlurl3
.....
seedurlm,htmlurl1
seedurlm,htmlurl2
seedurlm,htmlurl3
....

Thanks. Any ideas?

rdcwayx · June 10, 2010, 11:28pm

Your first post can't be read; all the lines were automatically converted to HTTP links. I guess you need to wrap CODE tags around the input file.

rainboisterous · June 10, 2010, 11:30pm

You can treat the first and second posts as the same, and ignore the first.
Just give tips on the second.
Thanks.

rdcwayx · June 10, 2010, 11:45pm

There is probably a better solution, but here is one:

$ cat urfile
seedurl1,htmlurl1
seedurl1,htmlurl2
seedurl1,htmlurl3
seedurl1,htmlurl4
seedurl1,htmlurl5
seedurl1,htmlurl6
seedurlm,htmlurl1
seedurlm,htmlurl2
seedurlm,htmlurl3
seedurlm,htmlurl4
seedurlm,htmlurl5

$ awk -F , '{a[$1]=a[$1] FS $2}
            END {for (i in a) {split(a[i],b,","); printf "%s,%s\n%s,%s\n%s,%s\n",i,b[2],i,b[3],i,b[4]}} ' urfile
seedurlm,htmlurl1
seedurlm,htmlurl2
seedurlm,htmlurl3
seedurl1,htmlurl1
seedurl1,htmlurl2
seedurl1,htmlurl3
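
A possibly simpler variant, offered as a sketch rather than a definitive answer: keep a per-seed counter and print a line only while that counter is at most 3. The array name `seen` and the temporary file are illustrative choices, not part of the original post. Unlike the `for (i in a)` loop, whose iteration order awk leaves unspecified, this prints lines in input order and stores only one counter per seed URL.

```shell
# Sample input mirroring urfile above, written to a throwaway temp file.
input=$(mktemp)
cat > "$input" <<'EOF'
seedurl1,htmlurl1
seedurl1,htmlurl2
seedurl1,htmlurl3
seedurl1,htmlurl4
seedurl1,htmlurl5
seedurl1,htmlurl6
seedurlm,htmlurl1
seedurlm,htmlurl2
seedurlm,htmlurl3
seedurlm,htmlurl4
seedurlm,htmlurl5
EOF

# Increment the counter for field 1; the pattern is true (so the line is
# printed) only for the first three lines seen for each seed URL.
out=$(awk -F, '++seen[$1] <= 3' "$input")
printf '%s\n' "$out"
rm -f "$input"
```

A seed URL with fewer than 3 lines simply yields fewer lines, with no empty fields printed.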

With your real data:

$ cat urfile1
http://2010.sina.com.cn,http://2010.sina.com.cn/2010-06-09/01528724.shtml
http://2010.sina.com.cn,http://2010.sina.com.cn/2010-06-09/03238769.shtml
http://2010.sina.com.cn,http://2010.sina.com.cn/2010-06-09/04448785.shtml
http://2010.sina.com.cn,http://2010.sina.com.cn/2010-06-09/05328842.shtml
http://2010.sina.com.cn,http://2010.sina.com.cn/2010-06-09/13359200.shtml
http://zzhz.zjol.com.cn,http://zzhz.zjol.com.cn/05zzhz/system/2010/06/10/016678515.shtml
http://zzhz.zjol.com.cn,http://zzhz.zjol.com.cn/05zzhz/system/2010/06/11/016678967.shtml
http://zzhz.zjol.com.cn,http://zzhz.zjol.com.cn/05zzhz/system/2010/06/11/016679056.shtml
http://zzhz.zjol.com.cn,http://zzhz.zjol.com.cn/05zzhz/system/2010/06/11/016679169.shtml
http://zzhz.zjol.com.cn,http://zzhz.zjol.com.cn/05zzhz/system/2010/06/11/016679553.shtml
http://zzhz.zjol.com.cn,http://zzhz.zjol.com.cn/05zzhz/system/2010/06/11/016679707.shtml

$ awk -F , '{a[$1]=a[$1] FS $2}
            END {for (i in a) {split(a[i],b,","); printf "%s,%s\n%s,%s\n%s,%s\n",i,b[2],i,b[3],i,b[4]}} ' urfile1
http://zzhz.zjol.com.cn,http://zzhz.zjol.com.cn/05zzhz/system/2010/06/10/016678515.shtml
http://zzhz.zjol.com.cn,http://zzhz.zjol.com.cn/05zzhz/system/2010/06/11/016678967.shtml
http://zzhz.zjol.com.cn,http://zzhz.zjol.com.cn/05zzhz/system/2010/06/11/016679056.shtml
http://2010.sina.com.cn,http://2010.sina.com.cn/2010-06-09/01528724.shtml
http://2010.sina.com.cn,http://2010.sina.com.cn/2010-06-09/03238769.shtml
http://2010.sina.com.cn,http://2010.sina.com.cn/2010-06-09/04448785.shtml
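
Since the file is described as very large and already sorted, here is a hedged sketch that relies on that ordering: it resets a counter whenever the seed URL changes, so awk holds only the previous key and one counter in memory. The function name `first_n_per_key`, its arguments, and the temporary file are hypothetical; the sample lines are taken from urfile1 above.

```shell
# Print the first n lines for each run of identical first fields.
# Assumes the input is grouped (e.g. sorted) by the first comma-separated field.
first_n_per_key() {
  # $1 = number of lines to keep per key, $2 = input file
  awk -F, -v n="$1" '
    $1 != prev { prev = $1; count = 0 }  # seed URL changed: reset the counter
    ++count <= n                         # print while within the first n lines
  ' "$2"
}

input=$(mktemp)
cat > "$input" <<'EOF'
http://2010.sina.com.cn,http://2010.sina.com.cn/2010-06-09/01528724.shtml
http://2010.sina.com.cn,http://2010.sina.com.cn/2010-06-09/03238769.shtml
http://2010.sina.com.cn,http://2010.sina.com.cn/2010-06-09/04448785.shtml
http://2010.sina.com.cn,http://2010.sina.com.cn/2010-06-09/05328842.shtml
http://zzhz.zjol.com.cn,http://zzhz.zjol.com.cn/05zzhz/system/2010/06/10/016678515.shtml
http://zzhz.zjol.com.cn,http://zzhz.zjol.com.cn/05zzhz/system/2010/06/11/016678967.shtml
http://zzhz.zjol.com.cn,http://zzhz.zjol.com.cn/05zzhz/system/2010/06/11/016679056.shtml
http://zzhz.zjol.com.cn,http://zzhz.zjol.com.cn/05zzhz/system/2010/06/11/016679169.shtml
EOF

out=$(first_n_per_key 3 "$input")
printf '%s\n' "$out"
rm -f "$input"
```

If the input were not grouped by seed URL, the same seed could be counted more than once, so this variant only fits the sorted case.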

rainboisterous · June 10, 2010, 11:48pm

thanks