Supplemental information: data and scripts to identify category-specific genes in distinct Xanthomonas spp. genomes.
Bogdanove, A.J., Koebnik, R., Lu, H., Furutani, A., Angiuoli, S.V., Patil, P.B., Van Sluys, M.-A., Ryan, R.P., Meyer, D.F., Han, S., Aparna, G., Rajaram, M., Delcher, A.L., Phillippy, A.M., Puiu, D., Schatz, M.C., Shumway, M., Sommer, D.D., Trapnell, C., Benahmed, F., Dimitrov, G., Madupu, R., Radune, D., Sullivan, S., Jha, G., Ishihara, H., Lee, S.-W., Pandey, A., Sharma, V., Sriariyanun, M., Szurek, B., Vera-Cruz, C.M., Dorman, K.S., Ronald, P.C., Verdier, V., Dow, J.M., Sonti, R.V., Tsuge, S., Brendel, V., Rabinowicz, P.D., Leach, J.E., White, F.F., & Salzberg, S.L. (2011) Two new complete genome sequences offer insight into host and tissue specificity of plant pathogenic Xanthomonas spp. J. Bact., submitted May 16, 2011; revised, July 1, 2011.

Software used:
BLAST; orthoMCL (v2.0, release date 2010/10/12); and MCL.

Datasets:
Xanthomonas strains by tissue type and host ("ab." = abbreviation):

Tissue        Host     ab.  Strain           Sequences
Vascular      Monocot  Xoo  Xoo PXO99A       Genome: NC_010717
                            Xoo MAFF 311018  Genome: NC_007705
                            Xoo KACC10331    Genome: NC_006834
              Dicot    Xcc  Xcc B100         Genome: NC_010688
                            Xcc ATCC 33913   Genome: NC_003902
                            Xcc 8004         Genome: NC_007086
                       Xcv  Xcv 85-10        Genome: NC_007508
                                             Plasmid: NC_007504 (pXCV2); NC_007505 (pXCV19); NC_007506 (pXCV38); NC_007507 (pXCV183)
                            Xcv 64           Genome: NONE
                                             Plasmid: NC_004987 (pXV64)
                       Xag  Xag AG1          Genome: NONE
                                             Plasmid: NC_010872 (pAG1); NC_010876 (pXAG81); NC_010887 (pXAG82)
Non-vascular  Monocot  Xoc  Xoc BLS256       From Prof. Adam Bogdanove
              Dicot    Xaa  Xaa B69          Genome: NONE
                                             Plasmid: NC_005240 (pXcB)
                       Xac  Xac 306          Genome: NC_003919
                                             Plasmid: NC_003921 (pXAC33); NC_003922 (pXAC64)
                       Xcr  Xcr 756C         From Prof. Adam Bogdanove

Tool (to detect strain-, tissue-, or host-specific genes): LINK

Scripts and Pipeline:
  • Sequence downloading and preprocessing
    • FASTA-format protein sequences
      From NCBI
      • Download GenBank-format files using the script DOWNLOAD_DATA.sh
            Example: sh DOWNLOAD_DATA.sh
      • Retrieve protein sequences from the GenBank-format files using 01_retrieveProteinSeqFromGB.pl
            Example: perl 01_retrieveProteinSeqFromGB.pl -g NC_010717.gbk -o XooP.faa
            Here NC_010717.gbk is the input GenBank-format file and XooP.faa is the output file (FASTA format).
            A shell file (RETRIEVE_PROSEQ.sh) is available to process all strains.
      • Concatenate protein sequences.
        Some strains, such as Xac and Xcv, have plasmids. For these strains, concatenate the chromosomal and plasmid proteins into a single file.
            Example: cat Xac.faa Xac_pXAC33.faa Xac_pXAC64.faa > Xac_FULL.faa
            Results: Xac_FULL.faa; Xcv_FULL.faa
      From Prof. Adam Bogdanove (Xoc BLS256 and Xcr 756C; see Datasets)
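The GenBank extraction step above (01_retrieveProteinSeqFromGB.pl) can be illustrated with a minimal Python sketch. This is not the code of the Perl script, whose source is not shown here. It assumes the standard GenBank flat-file layout (feature keys starting in column 6, qualifiers in column 22) and uses the /protein_id qualifier as the FASTA title, which is an assumption. The function name extract_proteins and the sample record are hypothetical.

```python
import re

def extract_proteins(genbank_text):
    """Return (protein_id, sequence) pairs for every CDS with a /translation."""
    cds_features = []   # one list of qualifier lines per CDS feature
    current = None      # qualifier lines of the CDS being read, if any
    in_features = False
    for line in genbank_text.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
            continue
        if not in_features:
            continue
        if not line.startswith(" "):
            # ORIGIN, "//" or the next LOCUS line ends the feature table.
            in_features = False
            current = None
            continue
        if line[:5] == "     " and line[5:6] != " ":
            # A new feature key (column 6); only CDS features are collected.
            current = [] if line[5:21].strip() == "CDS" else None
            if current is not None:
                cds_features.append(current)
        elif current is not None:
            current.append(line.strip())
    proteins = []
    for lines in cds_features:
        # Joining without spaces rebuilds translations wrapped over several lines.
        blob = "".join(lines)
        pid = re.search(r'/protein_id="([^"]+)"', blob)
        seq = re.search(r'/translation="([A-Za-z]+)"', blob)
        if pid and seq:
            proteins.append((pid.group(1), seq.group(1)))
    return proteins

# Hypothetical one-CDS record, built explicitly so the column layout is visible.
pad = " " * 21
sample = (
    "LOCUS       TEST\n"
    "FEATURES             Location/Qualifiers\n"
    "     gene            1..30\n"
    + pad + '/locus_tag="T_0001"\n'
    + "     CDS             1..30\n"
    + pad + '/product="hypothetical protein"\n'
    + pad + '/protein_id="TP_000001.1"\n'
    + pad + '/translation="MKLAV\n'
    + pad + 'TTRS"\n'
    + "ORIGIN\n//\n"
)

for pid, seq in extract_proteins(sample):
    print(f">{pid}\n{seq}")
```

Because the parser resets at every LOCUS line, a file holding a chromosome followed by its plasmids would be read record by record.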

    • FASTA-format genomic sequences
      From NCBI
      • Download FASTA-format files using the script DOWNLOAD_FASTA.sh
            Example: sh DOWNLOAD_FASTA.sh
      • Rename file names and FASTA-format sequence titles
            Example: mv NC_010717.fna XooP.GENOME
      From Prof. Adam Bogdanove (Xoc BLS256 and Xcr 756C; see Datasets)

  • Reformat protein sequences into the format that orthoMCL expects using 02_reformatProteinSeq.pl
        Example: perl 02_reformatProteinSeq.pl Xcc8.faa
        Sample Output:  Xcc8.fasta
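As an illustration of the reformatting step, orthoMCL v2.0 expects FASTA titles of the form >taxoncode|unique_protein_id. The sketch below rewrites titles in that form. It is not the code of 02_reformatProteinSeq.pl; taking the first whitespace-delimited token of the original title as the protein id is an assumption, and the function name reformat_fasta is hypothetical.

```python
def reformat_fasta(lines, taxon):
    """Rewrite FASTA titles as '>taxon|id'; sequence lines pass through unchanged."""
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            fields = line[1:].split()
            # Assumption: the first token of the title is a unique protein id.
            identifier = fields[0] if fields else ""
            out.append(f">{taxon}|{identifier}")
        elif line:
            out.append(line)
    return out

# Hypothetical input record for strain code Xcc8.
example = [">NP_000001.1 hypothetical protein\n", "MKLAVTTRS\n"]
print("\n".join(reformat_fasta(example, "Xcc8")))
```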

  • Concatenate all protein sequences (including proteins encoded on plasmids) and run an all-against-all BLAST (BLASTP)
        Example:
    • cat XooP.fasta XooM.fasta XooK.fasta Xoc.fasta XccB.fasta XccA.fasta Xcc8.fasta Xcv.fasta Xac.fasta Xcr.fasta > Xanthomonas.fasta
    • formatdb -i Xanthomonas.fasta -o F -p T
    • blastall -p blastp -i Xanthomonas.fasta -d Xanthomonas.fasta -o Xanthomonas.fasta.blastp -e 1e-10 -F F -m 8 &
        Output:  Xanthomonas.fasta (all protein sequences from the ten strains, including plasmids);  Xanthomonas.fasta.blastp (BLASTP results)
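The -m 8 option used above writes one tab-separated line per hit with twelve columns: query id, subject id, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, e-value, and bit score. A minimal sketch for reading such a file is shown below; the names Hit and parse_tabular and the sample lines are hypothetical.

```python
from collections import namedtuple

Hit = namedtuple(
    "Hit",
    "query subject identity length mismatches gap_opens "
    "q_start q_end s_start s_end evalue bit_score",
)

def parse_tabular(lines, max_evalue=1e-10):
    """Parse BLAST -m 8 lines and keep hits with e-value <= max_evalue."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        hit = Hit(
            f[0], f[1], float(f[2]),
            int(f[3]), int(f[4]), int(f[5]),
            int(f[6]), int(f[7]), int(f[8]), int(f[9]),
            float(f[10]), float(f[11]),
        )
        if hit.evalue <= max_evalue:
            hits.append(hit)
    return hits

# Two made-up hits; only the first passes the 1e-10 cutoff.
sample_hits = [
    "XooP|a1\tXoc|b7\t91.20\t250\t22\t0\t1\t250\t3\t252\t2e-130\t455\n",
    "XooP|a1\tXcc8|c3\t35.00\t80\t50\t2\t10\t89\t5\t82\t0.002\t38.1\n",
]
for hit in parse_tabular(sample_hits):
    print(hit.query, hit.subject, hit.evalue)
```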

  • Run orthoMCL
        For detailed instructions, see the orthoMCL UserGuide.
        Output:  Xanthomonas_groups.txt (v2.0 format); Xanthomonas_groups.txt.v1.4_format (v1.4 format)
        Note: The v1.4 format output is derived from the v2.0 format using 09_reformat_orthoMCL_output.pl with the command "perl 09_reformat_orthoMCL_output.pl Xanthomonas_groups.txt".
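For orientation, each line of an orthoMCL v2.0 groups file has the form "group_id: taxon|protein taxon|protein ...". The sketch below reads that layout into a dictionary. The function names and the group prefix "Xan" in the sample are hypothetical, and the sketch does not reproduce 09_reformat_orthoMCL_output.pl.

```python
def parse_groups(lines):
    """Map each group id to a list of (taxon, protein_id) pairs."""
    groups = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        group_id, _, members = line.partition(":")
        groups[group_id.strip()] = [
            tuple(member.split("|", 1)) for member in members.split()
        ]
    return groups

def taxa_in_group(members):
    """Return the sorted taxon codes represented in one group."""
    return sorted({taxon for taxon, _ in members})

# Hypothetical groups file content.
sample_groups = [
    "Xan1000: XooP|a1 XooM|a2 Xoc|b7\n",
    "Xan1001: Xcc8|c3 Xcc8|c4\n",
]
for gid, members in parse_groups(sample_groups).items():
    print(gid, taxa_in_group(members))
```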

  • Build a database of transposase-related proteins.
        A protein is considered a transposase if its annotation title contains the keyword "transposase" or "IS".
        Result: Xanthomonas.faa.transposase
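The keyword rule above can be sketched as follows. How the original pipeline matched "IS" is not stated, so this sketch makes an assumption: "transposase" is matched case-insensitively, while "IS" is matched case-sensitively as a whole token or as the start of an insertion-sequence name such as IS1112, so that words like "isomerase" are not caught. The function name is_transposase is hypothetical.

```python
import re

TRANSPOSASE = re.compile(r"transposase", re.IGNORECASE)
# "IS" alone, or "IS" plus up to four letters and a number (e.g. IS1112, ISXo8).
IS_ELEMENT = re.compile(r"\bIS(?:[A-Za-z]{0,4}\d+\w*)?\b")

def is_transposase(title):
    """Apply the keyword rule to one annotation title."""
    return bool(TRANSPOSASE.search(title) or IS_ELEMENT.search(title))

for title in ["IS1112 transposase", "putative ISXo8 protein",
              "glucose-6-phosphate isomerase"]:
    print(title, is_transposase(title))
```

A plain substring test for "IS" would be simpler but would also flag unrelated titles, which is why the sketch restricts the match.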

  • BLAST all Xanthomonas proteins against transposase database (-e 1e-20)
        Example: blastall -p blastp -i Xanthomonas.faa -d Xanthomonas.faa.transposase -F F -e 1e-20 -m 8 -o Xanthomonas.faa_transposase.blastp.e20 -a 4 -v 2 -b 2
        Result: Xanthomonas.faa_transposase.blastp.e20

  • Filter the original orthoMCL output to exclude transposase clusters and short proteins, and add singletons to the output.
        Rules:
          1) Remove any protein shorter than 60 aa.
          2) Remove a cluster if more than half of its genes are transposase genes; otherwise, keep all genes in the cluster.
          3) Add singletons that are neither transposases nor short genes to the orthoMCL output.
        Example: perl 21_UpdateGroupInfo.pl    Xanthomonas.fasta    Xanthomonas.faa_transposase.blastp.e20    Xanthomonas_groups.txt
        Result: Xanthomonas_groups.txt.FINAL_GRP
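The three filtering rules can be sketched in Python as below. This is not the code of 21_UpdateGroupInfo.pl. Two points are assumptions: rule 2 is applied to the members that remain after rule 1, and singletons receive made-up group ids of the form singleton_<protein>. The function name filter_groups and all sample data are hypothetical.

```python
def filter_groups(groups, lengths, transposases, min_length=60):
    """Apply the short-protein, transposase-cluster and singleton rules.

    groups:       dict of group id -> list of protein ids
    lengths:      dict of protein id -> length in amino acids (all proteins)
    transposases: set of protein ids flagged as transposases
    """
    kept = {}
    clustered = set()
    for group_id, members in groups.items():
        clustered.update(members)
        # Rule 1: drop proteins shorter than min_length.
        members = [m for m in members if lengths[m] >= min_length]
        if not members:
            continue
        # Rule 2: drop the cluster if transposases are more than half of it.
        n_transposase = sum(m in transposases for m in members)
        if n_transposase > len(members) / 2:
            continue
        kept[group_id] = members
    # Rule 3: add unclustered proteins that are neither short nor transposases.
    for protein in sorted(lengths):
        if (protein not in clustered and protein not in transposases
                and lengths[protein] >= min_length):
            kept[f"singleton_{protein}"] = [protein]
    return kept

# Hypothetical toy data.
toy_groups = {"g1": ["a", "b", "c"], "g2": ["t1", "t2", "d"], "g3": ["e", "s"]}
toy_lengths = {"a": 100, "b": 100, "c": 100, "d": 100, "e": 80, "s": 40,
               "t1": 300, "t2": 300, "t3": 200, "x": 120, "y": 50}
toy_transposases = {"t1", "t2", "t3"}
print(filter_groups(toy_groups, toy_lengths, toy_transposases))
```

In the toy data, g2 is removed because two of its three members are transposases, g3 loses its 40 aa member, and only x qualifies as a singleton.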

  • Build the specific-essential homolog groups
        Specific-essential: Clusters for which all strains in the category are represented and there are no additional homologs in other strains.
        Category codes: N = non-vascular, V = vascular, D = dicot host, M = monocot host (for example, ND = non-vascular pathogens of dicots); codes joined by "_" list the categories in which a cluster is present.
    Present (List)   Num and Sequences              Num and Sequences
                     (before tBLASTn and BLASTp)    (after tBLASTn and BLASTp)
    ND               3                              -
    NM               253                            -
    VD               14                             -
    VM               52                             -
    ND_NM            0                              -
    ND_VD            333                            172 (List)
    ND_VM            0                              -
    NM_VD            0                              -
    NM_VM            90                             54 (List)
    VD_VM            1                              1 (List)
    ND_NM_VD         102                            -
    ND_NM_VM         1                              -
    ND_VD_VM         56                             -
    NM_VD_VM         2                              -
    ND_NM_VD_VM      2508                           -
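One way to read the specific-essential definition is as a set comparison: a cluster gets a label such as NM_VM when its strains are exactly the full strain sets of those categories. The sketch below follows that reading. The strain-to-category table is an illustrative subset chosen for this example and should be checked against the Datasets table; the function name specific_essential_label is hypothetical.

```python
# Illustrative assignment only (assumption); not the complete dataset.
CATEGORY_STRAINS = {
    "ND": {"Xac", "Xcr"},
    "NM": {"Xoc"},
    "VD": {"XccB", "XccA", "Xcc8"},
    "VM": {"XooP", "XooM", "XooK"},
}

def specific_essential_label(cluster_strains):
    """Return e.g. 'NM_VM' if the cluster holds every strain of the categories
    it touches and no other strain; otherwise return None."""
    present = set(cluster_strains)
    touched = [c for c in sorted(CATEGORY_STRAINS) if CATEGORY_STRAINS[c] & present]
    if not touched:
        return None
    required = set().union(*(CATEGORY_STRAINS[c] for c in touched))
    return "_".join(touched) if present == required else None

print(specific_essential_label({"XooP", "XooM", "XooK", "Xoc"}))  # all monocot strains
print(specific_essential_label({"XooP", "XooM", "Xoc"}))          # XooK missing
```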


  • Build the specific homolog groups
        Specific: A cluster of homologs was considered to be specific to a category if all cluster members belong to this category (although a homolog may be missing in one or more strains of the category) and are not found elsewhere
    Present (List)   Num and Sequences
                     (before tBLASTn and BLASTp)
    ND               661
    NM               253
    VD               1158
    VM               487
    ND_NM            11
    ND_VD            947
    ND_VM            15
    NM_VD            14
    NM_VM            205
    VD_VM            38
    ND_NM_VD         131
    ND_NM_VM         23
    ND_VD_VM         134
    NM_VD_VM         48
    ND_NM_VD_VM      2915
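Under the "specific" definition, a cluster can be labelled by the categories of the strains it contains, whether or not every strain of a category is present, and the table rows are then tallies of those labels. The sketch below shows that reading. The strain-to-category table is an illustrative subset (an assumption), and the names specific_label and toy_clusters are hypothetical.

```python
from collections import Counter

# Illustrative assignment only (assumption); not the complete dataset.
STRAIN_CATEGORY = {
    "Xac": "ND", "Xcr": "ND",
    "Xoc": "NM",
    "XccB": "VD", "XccA": "VD", "Xcc8": "VD",
    "XooP": "VM", "XooM": "VM", "XooK": "VM",
}

def specific_label(cluster_strains):
    """Join the categories that have at least one strain in the cluster."""
    return "_".join(sorted({STRAIN_CATEGORY[s] for s in cluster_strains}))

# Hypothetical clusters, each given as the set of strains it contains.
toy_clusters = [{"XooP", "Xoc"}, {"Xac"}, {"XooM", "XooK"}, {"Xoc", "XooK"}]
counts = Counter(specific_label(c) for c in toy_clusters)
for label, n in sorted(counts.items()):
    print(label, n)
```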


  • Build the essential homolog groups
        Essential: A cluster of homologs was considered to be essential to a category if all strains in the category are represented in the cluster (although there may be additional homologs in other strains)
    Present (List)   Num and Sequences
                     (before tBLASTn and BLASTp)
    ND               67
    NM               458
    VD               65
    VM               103
    ND_NM            10
    ND_VD            355
    ND_VM            3
    NM_VD            23
    NM_VM            251
    VD_VM            3
    ND_NM_VD         267
    ND_NM_VM         42
    ND_VD_VM         56
    NM_VD_VM         41
    ND_NM_VD_VM      2508
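For the "essential" definition, one reading consistent with the counts above is that a cluster is labelled by the categories whose strains are all represented, with any additional strains ignored. The sketch below implements that reading. The strain-to-category table is an illustrative subset (an assumption), and the function name essential_label is hypothetical.

```python
# Illustrative assignment only (assumption); not the complete dataset.
CATEGORY_STRAINS = {
    "ND": {"Xac", "Xcr"},
    "NM": {"Xoc"},
    "VD": {"XccB", "XccA", "Xcc8"},
    "VM": {"XooP", "XooM", "XooK"},
}

def essential_label(cluster_strains):
    """Join the categories whose strains are all present; None if there are none."""
    present = set(cluster_strains)
    complete = [c for c in sorted(CATEGORY_STRAINS) if CATEGORY_STRAINS[c] <= present]
    return "_".join(complete) or None

print(essential_label({"Xoc", "XooP"}))                         # only NM is complete
print(essential_label({"XooP", "XooM", "XooK", "Xoc", "Xac"}))  # NM and VM complete
```

In the second call the extra strain Xac does not change the label, which is the difference from the specific-essential case.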