Split a line into multiple lines based on delimiters

Hi,

I need help splitting any line that contains ; or , into multiple lines, with the ID from the first column repeated on each new line.

input.txt

Ac020	 Not a good chemical process
AC030	 many has failed, 3 still maintained
AC040	 Putative; epithelial cells
AC050	 Predicted binding activity
AC060	 rodC Putative; upregulated in 48;h biofilm vs planktonic 

The output should be:
Output.txt

Ac020	 Not a good chemical process
AC030	 many has failed 
AC030    3 still maintained
AC040	 Putative
AC040    epithelial cells
AC050	 Predicted binding activity
AC060	 rodC Putative
AC060    upregulated in 48
AC060    h biofilm vs planktonic 

I tried the code below, but it does not put the ID in the first column of the split lines:

sed -e 's/\(.\), /\1\n\t\t /g' input.txt | sed -e 's/\(.\);/\1\n\t\t/g' > Output.txt

The result that I got is:

Ac020	 Not a good chemical process
AC030	 many has failed 
         3 still maintained
AC040	 Putative
         epithelial cells
AC050	 Predicted binding activity
AC060	 rodC Putative
         upregulated in 48
         h biofilm vs planktonic 

I don't know how to make it show the ID. Can anyone advise or help me with this? Thanks.

Try

$ cat file
Ac020	 Not a good chemical process
AC030	 many has failed, 3 still maintained
AC040	 Putative; epithelial cells
AC050	 Predicted binding activity
AC060	 rodC Putative; upregulated in 48;h biofilm vs planktonic
$ awk 'gsub(/[;,]/,RS $1 OFS) + 1' OFS='\t' file

Resulting

Ac020	 Not a good chemical process
AC030	 many has failed
AC030	 3 still maintained
AC040	 Putative
AC040	 epithelial cells
AC050	 Predicted binding activity
AC060	 rodC Putative
AC060	 upregulated in 48
AC060	h biofilm vs planktonic
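For anyone wondering how the one-liner works: gsub returns the number of replacements it made, so adding 1 keeps the pattern true for every line and awk's default action prints it, whether or not a delimiter was found. RS is a newline by default, so each ; or , is replaced by a newline followed by the ID and OFS. Below is a spelled-out sketch of the same idea. The function name split_by_delim is made up for illustration, and unlike the one-liner this version also swallows any blanks after the delimiter so the split lines start right after the tab.

```shell
# split_by_delim is a hypothetical helper name, not something from the thread.
split_by_delim() {
    awk '{
        id = $1                              # remember the ID before $0 is changed
        gsub(/[;,][ \t]*/, "\n" id "\t")     # delimiter plus trailing blanks -> newline, ID, tab
        print                                # print the (possibly multi-line) record
    }' "$@"
}

# Example call on one sample line from the question
printf 'AC040\t Putative; epithelial cells\n' | split_by_delim
```

It reads from the files given as arguments, or from standard input when there are none.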

Hi Akshay Hegde,

Thanks a bunch! It worked great.. :slight_smile:

These will also work, though they are a bit lengthy:

$ awk 'match($0,regex){ n=split(substr($0,length($1)+1),A,regex); for(i=1;i<=n;i++)print $1,A[i]; next }1' regex='[;,]' file
$ awk 'n=split(substr($0,length($1)+1),A,regex){for(i=1;i<=n;i++)print $1,A[i]; next }1' regex='[;,]' file
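Since the split approach is harder to read as a one-liner, here is a commented sketch of it. It assumes each line starts with the ID (no leading whitespace). The name split_fields, the variable names, the tab used as output separator, and the stripping of leading blanks from each piece are my own choices for illustration, not part of the commands above.

```shell
# split_fields is a hypothetical name used for illustration only.
split_fields() {
    awk -v regex='[;,]' -v OFS='\t' '{
        id   = $1                             # first field is the ID
        rest = substr($0, length(id) + 1)     # everything after the ID
        n = split(rest, parts, regex)         # cut the remainder at every ; or ,
        for (i = 1; i <= n; i++) {
            sub(/^[ \t]+/, "", parts[i])      # drop leading blanks from the piece
            print id, parts[i]                # put the ID back in front of each piece
        }
    }' "$@"
}

# Example call on one sample line from the question
printf 'AC060\t rodC Putative; upregulated in 48;h biofilm vs planktonic\n' | split_fields
```

A line without any delimiter simply comes out as a single piece, so no separate match test is needed.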

Yeah.. tried both and it works great too. A little bit complicated to understand compared to the first one. But I am glad that it gives something for me to think of. Thanks. :smiley:

You can also do this with sed:

sed -E ':a ; s/^([^ \t]+[ \t]+)([^,;]+)[,;][ \t]*/\1\2\n\1/; ta' infile
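The same command laid out as a commented script, in case the loop is not obvious. This is only a sketch and assumes GNU sed, which understands \t inside brackets and \n in the replacement; other sed implementations may not. The wrapper name split_with_sed is made up for illustration.

```shell
# split_with_sed is a hypothetical wrapper name; assumes GNU sed.
split_with_sed() {
    sed -E '
        # label "a": start of the loop
        :a
        # \1 = ID plus the blanks after it, \2 = text up to the first , or ;
        # the delimiter and any blanks after it become: newline + \1
        s/^([^ \t]+[ \t]+)([^,;]+)[,;][ \t]*/\1\2\n\1/
        # if a substitution was made, jump back to "a" and look for the next delimiter
        ta
    ' "$@"
}

# Example call on one sample line from the question
printf 'AC040\t Putative; epithelial cells\n' | split_with_sed
```

Each pass handles one delimiter, so a line with several delimiters goes around the loop several times.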

Yeah, it worked great too.. :b: Thanks Chubler_XL..

Hello redse171,

The following may also help.

awk '{a=$1; gsub(/[,;]/,"\n" a OFS); print}' filename

Output will be as follows.

 
Ac020    Not a good chemical process
AC030    many has failed
AC030  3 still maintained
AC040    Putative
AC040  epithelial cells
AC050    Predicted binding activity
AC060    rodC Putative
AC060  upregulated in 48
AC060 h biofilm vs planktonic

Thanks,
R. Singh


Hi R. Singh,

Yeah..it worked perfectly too. Thanks so much :b: