FAIR principles applied to genome assembly and annotation

Module 3: data management

Patrick DURAND

Ifremer (Plouzané)

Stéphanie ROBIN

INRAE (Rennes)

2024-01-17

Module content

  • Biological databases
  • FAIR and DMP
  • Data life cycle
  • Collecting and sharing data and metadata

Biological databases

  • Overview : definition, types, classification, key properties, applications, main uses, file formats
  • Focus on the EMBL-EBI (European Molecular Biology Laboratory - European Bioinformatics Institute)
  • Focus on the NCBI (National Center for Biotechnology Information), NIH (National Institutes of Health, USA)
  • Other databases

Definition of biological databases

  • Collection of structured biological data that can be accessed, searched and analyzed
  • They play a crucial role in bioinformatics and computational biology : scientists can access a wide variety of biologically relevant data
  • Organized as a collection of biological information : DNA and protein sequences, molecular structures, gene expression data …
  • Designed to facilitate efficient data storage, retrieval and analysis, allowing researchers to explore and interpret biological data on a large scale

=> FAIR (Findable Accessible Interoperable Reusable) context!

image from https://digitalworldbiology.com/sites/default/files/DWB%20Product/teaser-images/exploring_databases.png

Types of biological databases : generalists or specialists

Specialists : databases that specialize in a particular organism or a particular type of data.

Classification of biological databases : Primary and secondary

Primary biological databases : contain data derived experimentally

  • Data : original and curated data

  • Raw experimental data

  • Purposes : data deposition and curation

  • Sources : researchers and laboratories

  • Original formats (fasta, fastq …)

  • Frequent updates and additions

Examples : GenBank (NCBI), ENA (EMBL-EBI), DDBJ (DNA Data Bank of Japan)

Secondary biological databases : contain data obtained by analysing or processing data held in primary databases

  • Data : derived data

  • Processed and annotated data

  • Purposes : data integration and analysis

  • Sources : primary databases

  • Standardized formats (GFF, BED …)

  • Updates based on primary databases

Examples : KEGG (Kyoto Encyclopedia of Genes and Genomes), Pfam (Protein Families), SCOP (Structural Classification Of Proteins)

Key properties of a well-designed biological database

  • Comprehensiveness
  • Accuracy of data (quality control)
  • Accessibility (user-friendly interface)
  • Interoperability (compatible with other databases or bioinformatics tools)
  • Standardization (standards and formats for data representation, such as sequence file formats, ontologies, controlled vocabulary)
  • Documentation (clear documentation and metadata describing the data sources, data collection methods, and data processing procedures)
  • Data security and privacy (for sensitive data)
  • Updates and maintenance (add new data, fix errors)
  • Community engagement (encouraging user feedback, collaboration and contribution)
  • Long-term sustainability (supported by an organization or consortium)

=> FAIR PRINCIPLES!

Applications of biological databases : examples

  • Genome annotation and analysis
  • Protein structure prediction and modeling
  • Genetic variations and disease-causing mutations
  • Comparative genomics and evolutionary studies (Phylogeny, taxonomy)
  • Gene expression analysis and functional genomics
  • Drug discovery and target identification
  • Systems biology and pathway analysis

=> Databases are essential tools for managing and utilizing data effectively in various fields and applications

Main uses of biological databases

  • Search data
  • Download data
  • Data analysis using integrated tools (BLAST, genome browser, variation viewer, taxonomy browser…)
  • Submit data

The different types of file format (sequences)

FASTA

A text-based format for representing either nucleotide sequences or amino acid (protein) sequences

>MU105551.1 Sinapis alba strain S2 GC0560-79 unplaced genomic scaffold Sina_scaffold1, whole genome shotgun sequence
TGCATCAAAGATATTACCCTTGTGAGCTGGTGTAAGCTCACCGAGACGTACCGAGATCACGTTGATCTTGGGAACTCGTA
TATCTTGGTGTTCTTAAATTTCTGCACATCATCAAACACAAATCTAACTCAGCACAAATAAACACAACATCTGCAATCTG
AATCACATCAGAATTGAATACAAGAACATAAAACcaacacaaaaaacaaaagttagagactttttggtttcttcaagttt
ttttCAAAGAGAGAAGGGCTAAATAAAACGTACGTGGCCATCAGGGTTTCCTGATTGCTCCCAGTAGAGTGTGTGAGTAT
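
To make the FASTA layout concrete, here is a minimal sketch of a reader, assuming that each record starts with a ">" header line and that the identifier is the first word of that header. The function name read_fasta and the toy input are illustrative only; for real work an established parser is preferable.

```python
from typing import Dict, Iterable, Optional


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {identifier: sequence} for FASTA-formatted lines."""
    records: Dict[str, str] = {}
    current_id: Optional[str] = None
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: keep the first word after '>' as the identifier
            current_id = line[1:].split(maxsplit=1)[0]
            records[current_id] = ""
        elif current_id is not None:
            # Sequence line: append to the record currently being read
            records[current_id] += line
    return records


# Toy input (hypothetical, not real data)
toy_fasta = """>seq1 toy description
ACGTacgt
NNAC
>seq2
TTGA
"""
fasta_records = read_fasta(toy_fasta.splitlines())
for identifier, sequence in fasta_records.items():
    print(identifier, len(sequence))
```

Sequences are kept exactly as written, so lower-case (often soft-masked) bases are preserved.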

FASTQ

A text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding PHRED quality scores

@A00514:914:HTC5YDRXY:1:1101:4851:1031 1:N:0:GTATTCCACC+TTGTCTACAT
TNGTCGTTGAGCTTGGACTTGACCAGCGGGAAGTCGGACGACGACAGACGGCGGCCGTCTTTGGTGCATGTCCTCAGCACACGTTCCTCCCACGTCTGCGG
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00514:914:HTC5YDRXY:1:1101:4517:1047 1:N:0:GTATTCCACC+TTGTCTACAT
AGTCCGTTGATGCAGTAAATACTTGCCAACAAGCATCCCGGTAGTCCAATGAAATAGGTCACTTTCTTCCACAATGACAGACCTTGGAGGACCTCGGGTGA
+
,F,FFFFFFFFFFFFFFFFFFFF::F:FFFFFFFFFFFFF,F,FFFFFFFFFFF:FFFFFFF:FFFFFFFFFFFFFFFFFFFFF:F,FF:FFFFFF,FFFF
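
The quality line can be decoded character by character. The sketch below assumes the Phred+33 encoding (quality = ASCII code minus 33), which is an assumption about these reads, and uses the standard relation P = 10^(-Q/10) between a Phred score Q and the error probability P. The function names are illustrative.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert a FASTQ quality string to a list of Phred scores."""
    return [ord(char) - offset for char in quality]


def error_probability(score: int) -> float:
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


# The four quality characters that appear in the reads above
quality_string = "F#:,"
for char, score in zip(quality_string, phred_scores(quality_string)):
    print(char, score, round(error_probability(score), 5))
```

Under this assumption, "F" corresponds to Q37 and "#" to Q2, which is why the "N" base in the first read carries the "#" character.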

EMBL format

A text-based format for storing sequences and their associated meta-information, feature coordinates, and annotations

One sequence entry starts with an identifier line (“ID”), followed by further annotation lines. The start of the sequence is marked by a line starting with “SQ” and the end of the sequence is marked by two slashes (“//”).

 Example

ID   ADHBADA2   standard; DNA; VRT; 1145 BP.
XX
AC   J00923; J00924;
XX
DT   13-JUN-1985 (Rel. 06, Created)
DT   22-NOV-1994 (Rel. 41, Last updated, Version 2)
XX
DE   Duck alpha-A-globin gene and 5' flank.
XX
KW   alpha-globin; globin.
XX
OS   Cairina moschata (duck)
OC   Eukaryota; Animalia; Metazoa; Chordata; Vertebrata; Aves;
OC   Neornithes; Neognathae; Anseriformes; Anatidae.
XX
RN   [1]
RP   603-696
RX   MEDLINE; 83028533.
RA   Niessing J., Erbil C., Neubauer V.;
RT   "The isolation and partial characterization of linked alpha-A- and
RT   alpha-D-globin genes from a duck DNA recombinant library";
RL   Gene 18:187-191(1982).
XX
RN   [2]
RP   1-1145
RX   MEDLINE; 83158759.
RA   Erbil C., Niessing J.;
RT   "The complete nucleotide sequence of the duck alpha-A-globin
RT   gene";
RL   Gene 20:211-217(1982).
XX
DR   EPD; 33033; Cm  a'A-globin.
DR   SWISS-PROT; P01987; HBA_CAIMO.
XX
CC   The alpha-A-globin gene is linked to the alpha-D-globin gene. [1]
CC   compared their alpha-A-globin gene sequence with chicken alpha-A-
CC   and alpha-S-globin gene sequences, as well as with other avian and
CC   mammalian alpha-A-globin gene sequences. NCBI gi: 212911
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..1145
FT                   /organism="Cairina moschata"
FT   prim_transcript 331..1145
FT                   /note="alpha-A-globin mRNA"
FT   CDS             join(367..461,612..816,921..1049)
FT                   /note="alpha-A globin; NCBI gi: 212914"
FT                   /codon_start=1
FT   exon            367..461
FT                   /note="alpha-A globin"
FT                   /number=1
FT   intron          462..611
FT                   /note="alpha-A-globin intron A"
FT   exon            612..816
FT                   /number=2
FT   intron          817..920
FT                   /note="alpha-A-globin intron B"
FT   exon            921..>1049
FT                   /note="alpha-A globin"
FT                   /number=3
XX
SQ   Sequence 1145 BP; 193 A; 435 C; 291 G; 226 T; 0 other;
     ctcatgctgg ggttgcctcc ccccctcaaa ccctaacctt aatcccatct cgtgctgggg        60
     tcagaccccc ctaaccctaa cccagttcat gccgggatca gcccccccaa accctaaccc       120
     taaacccatc tcgtgccggg gtcagacccc ccccaaccct aaccccgacc ccagttcatg       180
     ccggggtcgc ccccccccgg tggtgccggt gccgcaggcg gggcagggcg gcggccccgc       240
     ctggccgagg tccagccgcg acggggcggg cggggcgggg cggcgcccgg gccggcacgg       300
     ggatataagg ccggcggcac cagtgggggc acccgtgctg ggggctgcca acgcggagct       360
     gcaaccatgg tgctgtctgc ggctgacaag accaacgtca agggtgtctt ctccaaaatc       420
     ggtggccatg ctgaggagta tggcgccgag accctggaga ggtaggtgtc tgtccccgtc       480
     ctttgtccgt ccctgatcct ctcctctcta accccatgct ctcccccacc ataactgtcc       540
     gtgtcctacc ccaccccatc catcccccct gtccgttgat cccgctggcc ctgactcgct       600
     ctgctccaca ggatgttcat cgcctacccc cagaccaaga cctacttccc ccactttgac       660
     ctgcagcacg gctctgctca gatcaaggcc catggcaaga aggtggcggc tgccctagtt       720
     gaagctgtca accacatcga tgacattgcg ggtgctctct ccaagctcag tgacctccac       780
     gcccaaaagc tccgtgtgga ccctgtcaac ttcaaagtga gtctggtgac tccccccagc       840
     tcctcttcag cacccatcct gggccatccg gccacccctt tacctccccc actcgctcac       900
     cgtctccttt tgcctttcag ttcctgggcc actgcttcct ggtggtggtt gccatccacc       960
     accccgctgc cctgacccca gaggtccacg cttccctgga caagttcatg tgcgccgtgg      1020
     gtgctgtgct gactgccaag taccgttaga cggcaccgtg gctagagctg gacccaccct      1080
     gttgccagcc ttccaactgc aagcagccaa atgatctgaa ataaaatctg ttgcatttgt      1140
     gctcc                                                                  1145
//
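
The line structure described above ("ID" line, "SQ" line, "//" terminator) can be read with a few lines of code. This is a minimal sketch, assuming a single entry per text block and ignoring all annotation lines; parse_embl_entry and the toy entry are hypothetical names and data.

```python
from typing import Optional, Tuple


def parse_embl_entry(text: str) -> Tuple[Optional[str], str]:
    """Return (entry identifier, sequence) from one EMBL-formatted entry."""
    entry_id: Optional[str] = None
    sequence_parts = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("ID "):
            # First word after the line code is the entry name
            entry_id = line[2:].split()[0].rstrip(";")
        elif line.startswith("SQ "):
            in_sequence = True  # sequence lines start after the SQ line
        elif line.startswith("//"):
            break  # end of the entry
        elif in_sequence:
            # Keep the letters only: drops spaces and the running base count
            sequence_parts.append("".join(c for c in line if c.isalpha()))
    return entry_id, "".join(sequence_parts)


# Toy entry (hypothetical, not a real record)
toy_embl = """ID   TEST0001   standard; DNA; VRT; 15 BP.
XX
SQ   Sequence 15 BP; 4 A; 4 C; 4 G; 3 T; 0 other;
     acgtacgtac gtacg                                                         15
//
"""
embl_id, embl_sequence = parse_embl_entry(toy_embl)
print(embl_id, len(embl_sequence))
```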

GenBank format (NCBI)

A text-based format for storing sequences and their associated meta-information, feature coordinates, and annotations

LOCUS       JN835270                1574 bp    mRNA    linear   INV 07-JUL-2015
DEFINITION  Spodoptera litura putative olfactory receptor 3 (OR3) mRNA,
            complete cds.
ACCESSION   JN835270
VERSION     JN835270.2
KEYWORDS    .
SOURCE      Spodoptera litura
  ORGANISM  Spodoptera litura
            Eukaryota; Metazoa; Ecdysozoa; Arthropoda; Hexapoda; Insecta;
            Pterygota; Neoptera; Endopterygota; Lepidoptera; Glossata;
            Ditrysia; Noctuoidea; Noctuidae; Amphipyrinae; Spodoptera.
REFERENCE   1  (bases 1 to 1574)
  AUTHORS   Lin,X., Zhang,Q., Wu,Z. and Du,Y.
  TITLE     Identification and Differential Expression of a Candidate Sex
            Pheromone Receptor in Natural Populations of Spodoptera litura
  JOURNAL   PLoS ONE 10 (6), E0131407 (2015)
   PUBMED   26126192
  REMARK    Publication Status: Online-Only
REFERENCE   2  (bases 1 to 1574)
  AUTHORS   Wu,Z.N., Du,Y.J. and Zhuge,Q.C.
  TITLE     Two olfactory receptors from Spodptera litura
  JOURNAL   Unpublished
REFERENCE   3  (bases 1 to 1574)
  AUTHORS   Wu,Z.N., Du,Y.J. and Zhuge,Q.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (10-OCT-2011) Institute of Health and Environment,
            Wenzhou Medical College, University-Town, Wenzhou, Zhejiang 325035,
            China
REFERENCE   4  (bases 1 to 1574)
  AUTHORS   Wu,Z.N., Du,Y.J. and Zhuge,Q.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (26-MAR-2012) Institute of Health and Environment,
            Wenzhou Medical College, University-Town, Wenzhou, Zhejiang 325035,
            China
  REMARK    Sequence update by submitter
COMMENT     On Mar 26, 2012 this sequence version replaced JN835270.1.
FEATURES             Location/Qualifiers
     source          1..1574
                     /organism="Spodoptera litura"
                     /mol_type="mRNA"
                     /db_xref="taxon:69820"
     gene            1..1574
                     /gene="OR3"
                     /note="SlitOR3"
     CDS             67..1365
                     /gene="OR3"
                     /codon_start=1
                     /product="putative olfactory receptor 3"
                     /protein_id="AEY84943.2"
                     /translation="MGLKKFLFEDESVQGITAPTDYLYVKLLRMTLRIIATWPQTELG
                     EKEPVYLNVFLKYFYLVATIGCQIGSYCYLRTYSDELTMMEAGHNYIMIMMTFIDISR
                     ITSLTFSTEYRKLCKEFFTKMHLFYFKDISEHAMATHKRVHLISHLFTLWLLFQMLFG
                     VPLFNLVPMYYNYAAGRYKPGGTQNSTFEHSMYYQYPFDTLTEIKGYIVANIINWILS
                     SLCVIWFCMCDLILLLMVFNIWGHLRMLTYTLNNFPRPSVETTITIDGGLTVTSAIYS
                     DEENVQIFRKLKECVDYHRRIVEFNSKVSDVFGPMLIIYYLYHQTSGCLLLLECSQMT
                     APVLMRYLPLTIVCTRQLIQLSVIFELIGTESEKLPSAVYSVPWECMSISNKKFVIFF
                     LMNVKEPIAVKPPGIANVGVTTMASILRTSLSYFTFLRSI"
ORIGIN      
        1 atggacaccc ttcagaagat atgcagatag aactttagat gtaaattact gattttcttc
       61 gcaaagatgg gtttaaaaaa gtttcttttt gaagatgaat ctgtgcaggg catcactgca
      121 ccaactgatt acctctacgt taaactcttg cgcatgactt tgcgtatcat cgccacctgg
      181 cctcagacag aactcggaga gaaggaacct gtgtacttga atgtttttct aaagtatttt
      241 tatttggtag ccaccattgg ctgtcaaatc ggttcatatt gttacctcag aacgtatagt
      301 gatgaactta ccatgatgga ggctggtcat aattatatta tgattatgat gacattcatc
      361 gatatatcaa gaattacatc gctcacattt tctacggaat acagaaaact ttgcaaagaa
      421 ttcttcacga aaatgcatct cttctatttc aaagatatat ctgaacatgc gatggcgact
      481 cacaaacggg tgcacttgat atctcatttg tttactttgt ggttattatt tcaaatgctg
      541 tttggggtgc ctctttttaa cctcgtacca atgtactata attatgctgc tggaagatat
      601 aaacctggtg gcacacaaaa ctcaactttt gaacattcta tgtattacca gtatccattt
      661 gatacgctca ctgaaataaa gggatatatc gttgctaata ttattaattg gatattgtct
      721 tctctctgcg taatctggtt ttgcatgtgt gacctcattc tattactgat ggtgttcaat
      781 atttggggac atttgagaat gcttacatac accctgaata attttcctag gccaagtgta
      841 gagactacta tcacaataga tggaggatta acagtaacaa gtgctatata ttctgatgaa
      901 gagaacgtac aaatatttag aaagttaaaa gaatgtgttg attatcatcg acggattgtt
      961 gaatttaata gcaaagtttc tgatgtattt ggtccaatgt tgatcatcta ttatttatac
     1021 catcaaacaa gcggctgttt attgttactg gagtgttcac aaatgactgc accagtccta
     1081 atgcgttatt tacccctgac gatagtgtgc actcggcaac tgattcagtt atcagtaatt
     1141 tttgaactaa ttggaactga gagcgaaaag ttaccaagcg ccgtgtacag tgtaccatgg
     1201 gaatgtatga gcataagtaa taaaaaattt gtgatattct tcttgatgaa tgttaaagag
     1261 ccaatagctg tgaaacctcc tggcattgcc aatgttggtg tcacgacaat ggcttcgatt
     1321 ttgagaacat cgttgtccta tttcaccttc ttacgcagca tttgaattta tgactttcta
     1381 aacaactatc tctcaaatca tctctcacga gttttgtcaa gttttctcac aaattatcac
     1441 actatattta tgctaacagt taaatgtaaa aatgttaata ataatgaata tgtttgctat
     1501 gaaatagttt atcaccaatt gttttatatt gtttcaataa attctgacgt cacaaaaaaa
     1561 aaaaaaaaaa aaaa
//
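
Feature locations such as "67..1365" (GenBank record above) or "join(367..461,612..816,921..1049)" (EMBL record) use 1-based, inclusive coordinates. The sketch below computes the length of such a location. It is a simplification that only handles simple ranges and join(); complement(), single-base positions and remote references are not handled. The function names are illustrative.

```python
import re
from typing import List, Tuple


def location_segments(location: str) -> List[Tuple[int, int]]:
    """Extract (start, end) pairs from a feature location string."""
    # '<' and '>' mark partial features and are ignored here
    pairs = re.findall(r"[<>]?(\d+)\.\.[<>]?(\d+)", location)
    return [(int(start), int(end)) for start, end in pairs]


def location_length(location: str) -> int:
    """Total number of bases covered (coordinates are 1-based, inclusive)."""
    return sum(end - start + 1 for start, end in location_segments(location))


genbank_cds = "67..1365"
embl_cds = "join(367..461,612..816,921..1049)"
print(location_length(genbank_cds), location_length(genbank_cds) % 3)
print(location_length(embl_cds), location_length(embl_cds) % 3)
```

Both CDS lengths are multiples of 3, as expected for complete coding sequences.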

The different types of file format (annotation)

GFF (General Feature Format)

A text-based format for describing genes and other features of DNA, RNA and protein sequences (annotation)

##gff-version 3
HiC_scaffold_1  AUGUSTUS    gene    7086    9916    0.14    +   .   ID=g14228
HiC_scaffold_1  AUGUSTUS    mRNA    7086    9916    0.14    +   .   ID=g14228.t1;Parent=g14228
HiC_scaffold_1  AUGUSTUS    five_prime_UTR  7086    7230    0.21    +   .   ID=g14228.t1.5UTR1;Parent=g14228.t1
HiC_scaffold_1  AUGUSTUS    exon    7086    7405    .   +   .   ID=g14228.t1.exon1;Parent=g14228.t1
HiC_scaffold_1  AUGUSTUS    CDS 7231    7405    0.84    +   0   ID=g14228.t1.CDS1;Parent=g14228.t1
HiC_scaffold_1  AUGUSTUS    CDS 8821    8930    0.99    +   2   ID=g14228.t1.CDS2;Parent=g14228.t1
HiC_scaffold_1  AUGUSTUS    exon    8821    8930    .   +   .   ID=g14228.t1.exon2;Parent=g14228.t1
HiC_scaffold_1  AUGUSTUS    CDS 8999    9190    1   +   0   ID=g14228.t1.CDS3;Parent=g14228.t1
HiC_scaffold_1  AUGUSTUS    exon    8999    9190    .   +   .   ID=g14228.t1.exon3;Parent=g14228.t1
HiC_scaffold_1  AUGUSTUS    CDS 9761    9883    1   +   0   ID=g14228.t1.CDS4;Parent=g14228.t1
HiC_scaffold_1  AUGUSTUS    exon    9761    9916    .   +   .   ID=g14228.t1.exon4;Parent=g14228.t1
HiC_scaffold_1  AUGUSTUS    three_prime_UTR 9884    9916    0.4 +   .   ID=g14228.t1.3UTR1;Parent=g14228.t1
  1. seqname - name of the chromosome or scaffold; chromosome names can be given with or without the ‘chr’ prefix. Important note when the file is used with Ensembl: the seqname must be one used within Ensembl, i.e. a standard chromosome name or an Ensembl identifier such as a scaffold ID, without any additional content such as species or assembly.
  2. source - name of the program that generated this feature, or the data source (database or project name)
  3. feature - feature type name, e.g. gene, mRNA, exon, CDS
  4. start - start position of the feature, with sequence numbering starting at 1.
  5. end - end position of the feature, with sequence numbering starting at 1.
  6. score - a floating point value (‘.’ if there is no score).
  7. strand - defined as + (forward) or - (reverse).
  8. frame - one of ‘0’, ‘1’ or ‘2’. ‘0’ indicates that the first base of the feature is the first base of a codon, ‘1’ that the second base is the first base of a codon, and so on (‘.’ for features that are not coding).
  9. attribute - a semicolon-separated list of tag-value pairs, providing additional information about each feature.
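
The nine columns can be split programmatically. This is a minimal sketch, assuming tab-separated GFF3 lines with "tag=value" attributes as in the AUGUSTUS example; GffFeature and parse_gff3_line are hypothetical names, and URL-escaped characters or multi-valued attributes are not handled.

```python
from typing import Dict, NamedTuple, Optional


class GffFeature(NamedTuple):
    seqname: str
    source: str
    feature_type: str
    start: int          # 1-based, inclusive
    end: int            # 1-based, inclusive
    score: Optional[float]
    strand: str
    frame: Optional[int]
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> GffFeature:
    """Split one tab-separated GFF3 feature line into typed fields."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError(f"expected 9 columns, found {len(columns)}")
    attributes: Dict[str, str] = {}
    for pair in columns[8].split(";"):
        if pair.strip():
            tag, _, value = pair.partition("=")
            attributes[tag.strip()] = value.strip()
    return GffFeature(
        seqname=columns[0],
        source=columns[1],
        feature_type=columns[2],
        start=int(columns[3]),
        end=int(columns[4]),
        score=None if columns[5] == "." else float(columns[5]),
        strand=columns[6],
        frame=None if columns[7] == "." else int(columns[7]),
        attributes=attributes,
    )


gff_line = "\t".join([
    "HiC_scaffold_1", "AUGUSTUS", "CDS", "7231", "7405", "0.84", "+", "0",
    "ID=g14228.t1.CDS1;Parent=g14228.t1",
])
gff_feature = parse_gff3_line(gff_line)
print(gff_feature.feature_type, gff_feature.end - gff_feature.start + 1)
print(gff_feature.attributes["Parent"])
```

Because start and end are both inclusive, the feature length is end - start + 1.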

GTF (General Transfer Format)

Identical to GFF, version 2

HiC_scaffold_1  AUGUSTUS    exon    7086    7230    .   +   .   ID 1 ; gene_biotype protein_coding ; gene_id "g14228.t1"  ; transcript_id "g14228.t1" ;
HiC_scaffold_1  AUGUSTUS    exon    7231    7405    .   +   0   ID 1 ; gene_biotype protein_coding ; gene_id "g14228.t1"  ; transcript_id "g14228.t1" ;
HiC_scaffold_1  AUGUSTUS    exon    8821    8930    .   +   2   ID 1 ; gene_biotype protein_coding ; gene_id "g14228.t1"  ; transcript_id "g14228.t1" ;
HiC_scaffold_1  AUGUSTUS    exon    8999    9190    .   +   0   ID 1 ; gene_biotype protein_coding ; gene_id "g14228.t1"  ; transcript_id "g14228.t1" ;
HiC_scaffold_1  AUGUSTUS    exon    9761    9883    .   +   0   ID 1 ; gene_biotype protein_coding ; gene_id "g14228.t1"  ; transcript_id "g14228.t1" ;
HiC_scaffold_1  AUGUSTUS    exon    9884    9916    .   +   .   ID 1 ; gene_biotype protein_coding ; gene_id "g14228.t1"  ; transcript_id "g14228.t1" ;
  Columns 1 to 8 (seqname, source, feature, start, end, score, strand, frame) have the same meaning as in GFF (see the list above).
  9. attribute - a semicolon-separated list of tag-value pairs, where the tag and its value are separated by a space and text values are quoted, e.g. gene_id "g14228.t1"; transcript_id "g14228.t1";
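
The main practical difference with GFF3 is the syntax of the attribute column. The sketch below parses a GTF attribute field, assuming that values do not themselves contain semicolons; parse_gtf_attributes is a hypothetical helper name.

```python
from typing import Dict


def parse_gtf_attributes(field: str) -> Dict[str, str]:
    """Parse a GTF attribute column: tag "value"; tag "value"; ..."""
    attributes: Dict[str, str] = {}
    for item in field.strip().split(";"):
        item = item.strip()
        if not item:
            continue  # text after the final ';' is empty
        tag, _, value = item.partition(" ")
        # Remove surrounding spaces and the optional double quotes
        attributes[tag] = value.strip().strip('"')
    return attributes


# Attribute column taken from the example lines above
gtf_field = 'ID 1 ; gene_biotype protein_coding ; gene_id "g14228.t1"  ; transcript_id "g14228.t1" ;'
gtf_attributes = parse_gtf_attributes(gtf_field)
print(gtf_attributes["gene_id"], gtf_attributes["transcript_id"])
```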

BED format

A text file used to store genomic regions as coordinates and associated annotations

chr7  127471196  127472363  Pos1  0  +  127471196  127472363  255,0,0
chr7  127472363  127473530  Pos2  0  +  127472363  127473530  255,0,0
chr7  127473530  127474697  Pos3  0  +  127473530  127474697  255,0,0
chr7  127474697  127475864  Pos4  0  +  127474697  127475864  255,0,0
chr7  127475864  127477031  Neg1  0  -  127475864  127477031  0,0,255
chr7  127477031  127478198  Neg2  0  -  127477031  127478198  0,0,255
chr7  127478198  127479365  Neg3  0  -  127478198  127479365  0,0,255
chr7  127479365  127480532  Pos5  0  +  127479365  127480532  255,0,0
chr7  127480532  127481699  Neg4  0  -  127480532  127481699  0,0,255
  1. chrom - name of the chromosome or scaffold. Chromosome names can be given with or without the ‘chr’ prefix.
  2. chromStart - start position of the feature, 0-based (i.e. the first base of the chromosome is numbered 0).
  3. chromEnd - end position of the feature. The base at chromEnd is not part of the feature.
  4. name - label of the feature, displayed in genome browsers.
  5. score - a score between 0 and 1000.
  6. strand - defined as + (forward) or - (reverse).
  7. thickStart - coordinate at which to start drawing the feature as a solid rectangle
  8. thickEnd - coordinate at which to stop drawing the feature as a solid rectangle
  9. itemRgb - an RGB colour value (e.g. 0,0,255). Only used if there is a track line with the value of itemRgb set to “on” (case-insensitive).
  10. blockCount - the number of sub-elements (e.g. exons) within the feature
  11. blockSizes - the size of these sub-elements
  12. blockStarts - the start coordinate of each sub-element
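
BED coordinates are 0-based and the end position is excluded (half-open interval), whereas GFF, GTF and VCF positions are 1-based and inclusive. The sketch below shows the conversion in both directions; the function names are illustrative.

```python
from typing import Tuple


def bed_to_one_based(chrom_start: int, chrom_end: int) -> Tuple[int, int]:
    """Convert a BED interval to 1-based inclusive (GFF-style) coordinates."""
    return chrom_start + 1, chrom_end


def one_based_to_bed(start: int, end: int) -> Tuple[int, int]:
    """Convert 1-based inclusive coordinates to a BED interval."""
    return start - 1, end


def bed_length(chrom_start: int, chrom_end: int) -> int:
    """Length of a BED feature: no '+ 1' because the end is excluded."""
    return chrom_end - chrom_start


# First line of the example above: chr7 127471196 127472363 Pos1
print(bed_to_one_based(127471196, 127472363))
print(bed_length(127471196, 127472363))
```

This also explains why consecutive features in the example share a coordinate (the end of Pos1 equals the start of Pos2) without overlapping.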

VCF (Variant Call Format)

A standard text file format for storing gene sequence variations

##fileformat=VCFv4.2
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3

  1. CHROM - the name of the sequence (typically a chromosome) on which the variation is being called. This sequence is usually known as ‘the reference sequence’, i.e. the sequence against which the given sample varies.
  2. POS - the 1-based position of the variation on the given sequence.
  3. ID - the identifier of the variation, e.g. a dbSNP rs identifier, or “.” if unknown. Multiple identifiers should be separated by semicolons without white-space.
  4. REF - the reference base (or bases in the case of an indel) at the given position on the given reference sequence.
  5. ALT - the list of alternative alleles at this position.
  6. QUAL - a quality score associated with the inference of the given alleles.
  7. FILTER - a flag indicating which of a given set of filters the variation has failed, or PASS if all the filters were passed successfully.
  8. INFO - an extensible list of key-value pairs (fields) describing the variation. Multiple fields are separated by semicolons, with optional values in the format <key>=<data>[,data].
  9. FORMAT - an (optional) extensible list of fields for describing the samples.
  10+. SAMPLEs - for each (optional) sample described in the file, values are given for the fields listed in FORMAT.
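
The column layout above can be illustrated with a small parser. This is a minimal sketch, assuming tab-separated data lines (as in real VCF files) and keeping all values as text except POS; parse_info and parse_vcf_line are hypothetical names, and header lines starting with "#" are not handled.

```python
from typing import Dict, List, Union

VCF_FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_info(info: str) -> Dict[str, Union[str, bool]]:
    """Parse the INFO column; keys without '=' are flags (stored as True)."""
    fields: Dict[str, Union[str, bool]] = {}
    if info == ".":
        return fields
    for item in info.split(";"):
        key, has_value, value = item.partition("=")
        fields[key] = value if has_value else True
    return fields


def parse_vcf_line(line: str, sample_names: List[str]) -> dict:
    """Split one VCF data line into fixed columns and per-sample fields."""
    columns = line.rstrip("\n").split("\t")
    record: dict = dict(zip(VCF_FIXED_COLUMNS, columns[:8]))
    record["POS"] = int(record["POS"])        # 1-based position
    record["ALT"] = record["ALT"].split(",")  # one or more alternative alleles
    record["INFO"] = parse_info(record["INFO"])
    format_keys = columns[8].split(":") if len(columns) > 8 else []
    # Trailing sample fields may be omitted; zip() then stops early
    record["SAMPLES"] = {
        name: dict(zip(format_keys, values.split(":")))
        for name, values in zip(sample_names, columns[9:])
    }
    return record


# First data line of the example above, rebuilt with tab separators
vcf_line = "\t".join([
    "20", "14370", "rs6054257", "G", "A", "29", "PASS",
    "NS=3;DP=14;AF=0.5;DB;H2", "GT:GQ:DP:HQ",
    "0|0:48:1:51,51", "1|0:48:8:51,51", "1/1:43:5:.,.",
])
vcf_record = parse_vcf_line(vcf_line, ["NA00001", "NA00002", "NA00003"])
print(vcf_record["INFO"])
print(vcf_record["SAMPLES"]["NA00002"]["GT"])
```

In the GT field, "|" separates phased alleles and "/" unphased alleles; 0 is the REF allele and 1, 2, … index the ALT alleles.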

The different types of file format (alignment)

SAM (Sequence Alignment Map)

A text-based format originally for storing biological sequences aligned to a reference sequence

D00562:237:CBFBNANXX:5:2307:15276:42153 1187    HiC_scaffold_1  375 3   12M292N112M =   803 745 CACAGAGGCTGTCACAATCTACGTTCCCCAGTAAGTCGGCGGCGGAGCTGTCTCGACTGGAGCCGTGTACGTCAGTAGCGGGCATGGAGCTATGGTCGCAGTGCTACGAGCGCTGGCTCGTGCC    CBCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCGGGGGGGGGGGGFEGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAGGGGGGGGG    
MD:Z:124    PG:Z:MarkDuplicates RG:Z:T1_S15 NH:i:2  HI:i:2  NM:i:0  AS:i:250    XS:A:+
D00562:237:CBFBNANXX:5:1306:14559:55311 1187    HiC_scaffold_1  375 3   12M292N112M =   806 745 CACAGAGGCTGTCACAATCTACGTTCCCCAGTAAGTCGGCCGCGGAGCTGTCTCGACTGCAGCCGTGTACGGCAGTAGCGGGCATGGAGCTATGGTCGCAGTGCTACGAGCGCTGGCGCGTGCC    33AA0;1;/?E;11E1CGGCG:FB111FGG/<>11>1EEG/<E///</1::?EG1>EGG0:E>0=/9>FG0//<900DEGG/DGGGGGG<//EG/;EG.>6.////:C..96CG.C....69..    
MD:Z:40G18G11T45T6  PG:Z:MarkDuplicates RG:Z:T1_S15 NH:i:2  HI:i:2  NM:i:4  AS:i:235    XS:A:+
D00562:237:CBFBNANXX:5:1111:19941:76960 163 HiC_scaffold_1  375 3   12M292N112M =   803 745 CACAGAGGCTGTCACAATCTACGTTCCCCAGTAAGTCGGCGGCGGAGCTGTCTCGACTGGAGCCGTGTACGTCAGTAGCGGGCATGGAGCTATGGTCGCAGTGCTACGAGCGCTGGCTCGTGCC    CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGGGGGGGGGGGGGGGGGGGGGDGGGGGGGGGGG    
MD:Z:124    PG:Z:MarkDuplicates RG:Z:T1_S15 NH:i:2  HI:i:2  NM:i:0  AS:i:250    XS:A:+

1 QNAME Query template/pair NAME
2 FLAG bitwise FLAG
3 RNAME Reference sequence NAME
4 POS 1-based leftmost POSition/coordinate of clipped sequence
5 MAPQ MAPping Quality (Phred-scaled)
6 CIGAR extended CIGAR string
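7 RNEXT Reference sequence name of the mate/next read ('=' if same as RNAME); called MRNM in early versions of the format
8 PNEXT 1-based position of the mate/next read; called MPOS in early versions of the format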
9 TLEN inferred Template LENgth (insert size)
10 SEQ query SEQuence on the same strand as the reference
11 QUAL query QUALity (ASCII-33 gives the Phred base quality)
12+ OPT variable OPTional fields in the format TAG:VTYPE:VALUE
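
Two fields are not human-readable at first sight: FLAG (a sum of bits) and CIGAR (a compact description of the alignment). The sketch below decodes both, using the bit values and operation letters of the SAM specification; the short flag names and the function names are illustrative choices.

```python
import re
from typing import List, Tuple

# Bit values defined by the SAM specification (names are informal labels)
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int) -> List[str]:
    """List the properties whose bit is set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def reference_span(cigar: str) -> int:
    """Number of reference bases covered (M, D, N, = and X consume reference)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


def query_length(cigar: str) -> int:
    """Number of read bases in SEQ (M, I, S, = and X consume the read)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")


# Values taken from the first alignment above
print(decode_flag(1187))
print(parse_cigar("12M292N112M"))
print(reference_span("12M292N112M"), query_length("12M292N112M"))
```

For the first read, the flag 1187 includes the "duplicate" bit (set by MarkDuplicates), and the CIGAR 12M292N112M describes a 124-base read spliced over a 292-base gap (N), typical of an RNA-seq read spanning an intron.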

BAM (Binary Alignment Map)

A compressed binary version of a SAM file that is used to represent aligned sequences up to 128 Mb

=> All these formats are standards : use them as specified, without modifying them!

EMBL-EBI (European Molecular Biology Laboratory - European Bioinformatics Institute)

EBI website contains 165 databases

Example of DNA & RNA database : ENA (European Nucleotide Archive)

ENA (European Nucleotide Archive) : nucleotide sequencing information, covering raw sequencing data, sequence assembly information and functional annotation

Example of DNA & RNA database : Ensembl

Ensembl : genome browser for vertebrate genomes that supports research in comparative genomics, evolution, sequence variation and transcriptional regulation.
Ensembl annotates genes, computes multiple alignments, predicts regulatory function and collects disease data.
Ensembl tools include BLAST, BLAT, BioMart and the Variant Effect Predictor (VEP) for all supported species.
Release 111 (01/2024) : 317 genomes!

Example of DNA & RNA database : Rfam

Rfam : collection of RNA families, each represented by multiple sequence alignments, consensus secondary structures and covariance models

Example of protein database : UniProt

UniProt : high-quality, comprehensive and freely accessible resource of protein sequence and functional information.

Example of protein database : AlphaFold

AlphaFold : database for protein structure predictions for numerous species

Example of protein database : InterProScan

InterProScan : functional analysis of proteins by classifying them into families and predicting domains and important sites

Example of protein database : Pfam

Pfam : A database of conserved protein families and domains. Pfam is a member database of InterPro. The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs)

Example of protein database : HMMER

HMMER : Fast sensitive protein homology searches using profile hidden Markov models (HMMs) for querying against both sequence and HMM target databases.

NCBI (National Center for Biotechnology Information), NIH (National Institutes of Health, USA)

Biomedical and genomic information

NCBI website contains 59 Databases

Example of DNA & RNA database : Nucleotide Database

Nucleotide Database : a collection of nucleotide sequences from several sources. Searching the Nucleotide Database yields the available results from each of its component databases.

Example of DNA & RNA database : GenBank

GenBank : an annotated collection of all publicly available DNA sequences. It is part of the International Nucleotide Sequence Database Collaboration, which comprises :
- the DNA DataBank of Japan (DDBJ),
- the European Molecular Biology Laboratory (EMBL),
- GenBank at NCBI.

GenBank consists of several divisions, most of which can be accessed through the Nucleotide database.

Example of DNA & RNA database : SRA (Sequence Read Archive)

SRA (Sequence Read Archive) : stores sequencing data from the next generation of sequencing platforms including Roche 454 GS System®, Illumina Genome Analyzer®, Life Technologies AB SOLiD System®, Helicos Biosciences Heliscope®, Complete Genomics®, and Pacific Biosciences SMRT®.

Example of DNA & RNA database : Assembly

Assembly : structure of assembled genomes, assembly names and other meta-data, statistical reports, and links to genomic sequence data.

Example of DNA & RNA database : Genome

Genome : Contains sequence and map data from the whole genomes of over 1000 organisms

Example of DNA, RNA and protein database : RefSeq

RefSeq : A comprehensive, integrated, non-redundant, well-annotated set of reference sequences including genomic, transcript, and protein.

Example of protein database : Protein Database

Protein Database : includes protein sequence records from a variety of sources, including GenPept, RefSeq, Swiss-Prot, PIR, PRF, and PDB.

Example of publication database : PubMed

PubMed : a database of citations and abstracts for biomedical literature (MEDLINE, life science journals and online books) that can be searched directly or from linked data in other NCBI databases

Tool for sequence analysis : BLAST

Tool for downloading genome data : NCBI Datasets

Tool for downloading genome data : command line NCBI Genome

Tool for downloading SRA data : SRA Toolkit

Downloading public data

Prefetch is a part of the SRA toolkit. This program downloads Runs (sequence files in the compressed SRA format) and all additional data necessary to convert the Run from the SRA format to a more commonly used format. Prefetch can be used to correct and finish an incomplete Run download.

Use this prefetch command to download the Runs from the previous example in SRA format.

One Run:

prefetch SRR000001

A list of Runs:

prefetch --option-file SraAccList.txt

fasterq-dump and sam-dump are also part of the SRA toolkit and can be used to convert the prefetched Runs from compressed SRA format to fastq or sam format. For example:

fasterq-dump --split-files SRR11180057.sra

You can also avoid the prefetch step and download and convert the Run in one step by entering just the Run accession without the .sra extension in your fasterq-dump or sam-dump command:

fasterq-dump --split-files SRR11180057 
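The two steps above (prefetch, then fasterq-dump) can be prepared for a whole accession list. The following Python sketch only builds and prints the command lines; it does not run them and does not contact the SRA. The helper names and the accession pattern (SRR, ERR or DRR followed by at least six digits) are assumptions made for this illustration.

```python
import re
import shlex

# Assumed pattern for Run accessions: SRR (NCBI), ERR (ENA) or DRR (DDBJ) + digits.
RUN_ACCESSION = re.compile(r"^[SED]RR\d{6,}$")


def parse_accession_list(text):
    """Return the Run accessions listed one per line, ignoring blank lines."""
    runs = []
    for line in text.splitlines():
        accession = line.strip()
        if not accession:
            continue
        if not RUN_ACCESSION.match(accession):
            raise ValueError(f"not a Run accession: {accession!r}")
        runs.append(accession)
    return runs


def build_commands(runs):
    """Build one prefetch and one fasterq-dump command line per Run."""
    commands = []
    for accession in runs:
        commands.append(["prefetch", accession])
        commands.append(["fasterq-dump", "--split-files", accession])
    return commands


# Hypothetical content of SraAccList.txt.
accession_list = "SRR000001\nSRR11180057\n"
for command in build_commands(parse_accession_list(accession_list)):
    print(shlex.join(command))
```

Rejecting malformed accessions before any download starts avoids discovering a typo halfway through a long batch.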

Databases of specific genomes

Non-sequence-centric databases

A catalog of worldwide biological databases

Database Commons is a curated catalog of worldwide biological databases, with the aim to provide a full landscape of biological databases throughout the world and enable easy retrieval and access to a specific collection of databases of interest

Public banks installed on the IFB cluster

Banks can be available in different versions and in different index formats: fasta, gff, blast, bwa, bowtie, bowtie2, star, diamond, picard, etc.

ls /shared/bank/
MMSEQS_DB                data.galaxyproject.org   malus_domestica        oryza_punctata         saccharomyces_cerevisiae
README.txt               dram                     mgnify                 oryza_rufipogon        saccharum_spontaneum
_scripts                 drosophila_melanogaster  mmdb                   oryza_sativa_aus       salmonella_typhimurium
accession2taxid          eggnog-mapper-2.1.12     mus_musculus           oryza_sativa_basmati   schizosaccharomyces_pombe
alphafold2               elaeis_guineensis        musa_acuminata         oryza_sativa_indica    software_specific
amrfinder                emapperdb                ncbi_taxonomy          oryza_sativa_japonica  solanum_lycopersicum
anas_platyrhynchos       equus_caballus           neofusicoccum_parvum   oryza_sativa_nerica    sorghum_bicolor
arabidopsis_lyrata       escherichia_coli         nicotiana_tabacum      pdb                    squeezemeta
arabidopsis_thaliana     felis_catus              nr                     pfam                   sus_scrofa
astyanax_mexicanus       frogs                    nt                     phoenix_dactylifera    trypanosoma_brucei
bos_taurus               galaxy_test-data         oryctolagus_cuniculus  phyloflash             uniprot
candida_glabrata         greengenes               oryza_barthii          plasmidfinder          uniprot_swissprot
canis_lupus_familiaris   gtdbtk                   oryza_brachyantha      populus_trichocarpa    uniref50
capra_hircus             hhsuite                  oryza_glaberrima       prunus_persica         uniref90
checkm                   homo_sapiens             oryza_glumipatula      rattus_norvegicus      vitis_vinifera
clostridium_perfringens  kraken                   oryza_longistaminata   refseq                 xenopus_laevis
cocos_nucifera           krona                    oryza_meridionalis     rosa_chinensis         zymoseptoria_tritici
danio_rerio              lachancea_kluyveri       oryza_nivara           rsat_organism

ls /shared/bank/uniprot_swissprot/
current  uniprot_swissprot_2018-10-10  uniprot_swissprot_2020_03

ls /shared/bank/uniprot_swissprot/current/
blast  diamond  fasta  flat  listingv1.blast  mapping  mmseqs

The banks are organized first per species or bank names, then per index:

homo_sapiens/  
├── GRCh38  
│   ├── bwa  
│   │   ├── Homo_sapiens.GRCh38.dna.primary_assembly.fa.amb  
│   │   ├── Homo_sapiens.GRCh38.dna.primary_assembly.fa.bwt  
│   │   └── ...  
│   ├── fasta  
│   │   ├── Homo_sapiens.GRCh38.dna.toplevel.fa  
│   │   ├── Homo_sapiens.GRCh38.dna.toplevel.fa.fai  
│   │   └── ...  
│   ├── gff  
│   │   ├── Homo_sapiens.GRCh38.94.gff3  
│   │   └── Homo_sapiens.GRCh38.94.gff3.readme  
│   └── star  
│       ├── Genome  
│       ├── SAindex  
│       └── ...  
└── hg19  
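Because the layout is regular (bank or species, then version, then index), the location of an index can be computed rather than searched for. A minimal Python sketch, assuming the `/shared/bank` root shown in the listings; the function name `bank_path` is hypothetical and nothing is read from disk:

```python
from pathlib import PurePosixPath

# Root of the public banks, as shown in the listings of this section.
BANK_ROOT = PurePosixPath("/shared/bank")


def bank_path(bank, version, index, filename=None):
    """Return the path of an index directory, or of a file inside it."""
    path = BANK_ROOT / bank / version / index
    return path / filename if filename else path


# Directory holding the BWA index of the human GRCh38 assembly.
print(bank_path("homo_sapiens", "GRCh38", "bwa"))

# The FASTA file listed in the tree of this section.
print(bank_path("homo_sapiens", "GRCh38", "fasta",
                "Homo_sapiens.GRCh38.dna.toplevel.fa"))

# The same scheme applies to non-species banks.
print(bank_path("uniprot_swissprot", "current", "blast"))
```

Building paths this way keeps scripts independent of any local copy of the banks and makes the chosen version explicit.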

FAIR and DMP

  • FAIR: general principles
  • FAIR: why do it?
  • FAIR: an obligation!
  • DMP: Data Management Plan
  • DMP: its content

FAIR: general principles

FAIR: guidelines proposed by the scientific community.

Objectives: promote best-practice tools to help produce and publish reproducible analyses throughout the data lifecycle.

FAIR: why do it?

Over 70% of researchers have tried and failed to reproduce another scientist’s experiments, and over half have failed to reproduce their own experiments!

FAIR: an obligation!

  • The FAIR principles are a key element of a more general concept: the data management plan (DMP).
  • In France, there is an obligation (e.g. for ANR and European projects) to make public the data and software of scientific projects funded with at least 50% public money.

DMP: Data Management Plan

  • The Data Management Plan (DMP), or Plan de gestion de données (PGD), is a concise, evolving document that helps to organize and anticipate every stage in the data life cycle. For each data set, it explains how project data will be managed, from creation or collection to sharing and archiving.
  • The data management plan is based on the data life cycle, which describes the various stages of data processing during a research project.

Figure from: https://www.universite-paris-saclay.fr/recherche/science-ouverte/les-donnees-de-la-recherche/introduction-aux-plans-de-gestion-de-donnees

DMP: its content

  • The Data Management Plan (DMP) is a concise, editable text document that aims to organize and anticipate all stages of the data life cycle.
  • How to write one?
    • Use templates provided by ANR or institutes;
    • Use online tools such as DMP OPIDoR or Data Stewardship Wizard.

Data life cycle

  • A long, not very quiet river
  • Data is stored in a structured and secure manner
  • Data is stored with metadata
  • What is metadata?
  • Data licensing
  • Making data accessible to the community

A long, not very quiet river

Very useful tools exist to collect, manage and share data and metadata over time.

Data is stored in a structured and secure manner

A possible organization of project data on a storage system
  • Use a dedicated directory to store all data for all projects
  • Assign admin permissions to data managers only
    • Limit the number of data managers
  • Assign read-only permissions to users
  • Set up a separate data archive system
    • “separate” = a different storage system in a different place

Data is stored in a structured and secure manner

‘data’, ‘report’ and ‘operation’ are EDAM terms
  • Set up the same organization for all subfolders
    • “users” DO NOT control naming
    • only “data managers” DO that
  • Use a controlled vocabulary for folder naming
    • e.g. the EDAM bioinformatics ontology
  • Always store raw data AND metadata in such a safe place

Note: the ‘operation’ folder may contain particular final results, e.g. a sequence assembly.
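The organization described above can be scripted so that every project gets the same subfolders and raw data is write-protected once stored. A minimal Python sketch: it works in a temporary directory, the project and file names are hypothetical, and a real setup would rely on the group permissions and archive system of your storage rather than on this script alone.

```python
import stat
import tempfile
from pathlib import Path

# Same subfolders for every project, named with EDAM terms.
SUBFOLDERS = ("data", "report", "operation")


def create_project(root, project):
    """Create the standard subfolders for one project and return its path."""
    project_dir = Path(root) / project
    for name in SUBFOLDERS:
        (project_dir / name).mkdir(parents=True, exist_ok=True)
    return project_dir


def make_read_only(path):
    """Remove all write permission bits from a file."""
    mode = path.stat().st_mode
    path.chmod(mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))


with tempfile.TemporaryDirectory() as root:
    project_dir = create_project(root, "project_demo")  # hypothetical name
    raw_file = project_dir / "data" / "reads.fastq"
    raw_file.write_text("@read1\nACGT\n+\nIIII\n")
    make_read_only(raw_file)
    print(sorted(p.name for p in project_dir.iterdir()))
    print("writable:", bool(raw_file.stat().st_mode & stat.S_IWUSR))
```

Keeping the folder names in one constant is what stops naming from drifting between projects.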

Data is stored with metadata

What is this?

No metadata

Better with a label, isn’t it?

‘Label’ = metadata

What is metadata?

Metadata describes the main characteristics of the data, such as author, creation date, content, format, associated project, associated sample, etc.

Depending on the type of data collected, describe it using metadata standards, e.g. controlled vocabularies, ontologies, etc. Do not invent your own terms!
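To make "do not invent your own terms" concrete, a metadata record can be checked automatically against a controlled vocabulary. A minimal Python sketch: the field names follow the characteristics listed above, the record is invented for the example, and the two EDAM format identifiers are quoted from memory, so check them in the EDAM browser before relying on them.

```python
# Fields every record must provide (hypothetical field names).
REQUIRED_FIELDS = ("author", "creation_date", "format", "project", "sample")

# Small excerpt of a controlled vocabulary (EDAM format terms, to be verified).
ALLOWED_FORMATS = {"format_1929": "FASTA", "format_1930": "FASTQ"}


def check_metadata(record):
    """Return a list of problems found in one metadata record."""
    problems = [f"missing field: {name}"
                for name in REQUIRED_FIELDS if not record.get(name)]
    data_format = record.get("format")
    if data_format and data_format not in ALLOWED_FORMATS:
        problems.append(f"format not in controlled vocabulary: {data_format}")
    return problems


# Invented example record: free-text format and no sample.
record = {
    "author": "Jane Doe",
    "creation_date": "2024-01-17",
    "format": "fastq file",
    "project": "project_demo",
}
for problem in check_metadata(record):
    print(problem)
```

An empty list means the record is complete and uses only terms from the vocabulary.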

Metadata example

Data licensing

Data has to be associated with a license that describes how anyone may use it.

As a French public research institute, we should use the ETALAB public license from the Direction Interministérielle du Numérique (DINUM).

  • There are several types of Creative Commons (CC) licenses: CC0, CC-BY, CC-BY-NC, etc.

However, in practice, we should use the CC-BY license to stay within an international legal framework. Not using the CC-BY license prevents ENA submission.

Making data accessible to the community

When it is time to publish your work, you will have to make your data accessible to everyone.

There are three major ways of doing that:

  • using the services of a certified public research institute
    • e.g. Ifremer’s Oceanographic Data Catalog (Sextant)
  • using the services of country-based initiatives
  • using one of the “Big Three”: NCBI, EBI, DDBJ
    • e.g. European Bioinformatics Institute (EBI)

Making data accessible to the community

Keep in mind that:

  • French research institutes may or may not require data to be hosted in-house
  • scientific publishers may or may not require your data to be available from “The Big Three” data providers

As a consequence, it is very important to plan the public data release at an early stage of the project, because it is a very time-consuming step.

The DMP should be used to address this question and provide appropriate solution(s).

Using the services of a certified public research institute

Ifremer Sextant Catalog of Oceanographic Data - CoreTrustSeal Certified

Using the services of country-based initiatives

Recherche Data Gouv has been designed to support research teams in structuring their data to make it findable, accessible, interoperable and reusable, in other words in line with the “FAIR” principles. These services are operated by INRAE.

Catalog entries for ‘Biodiversity and Ecology’ projects published in 2022

Using one of the “Big Three”: NCBI, EBI, DDBJ

Publish sequencing data to international databases for international referencing.

  • The International Nucleotide Sequence Database Collaboration (INSDC) collects and disseminates DNA and RNA sequences through its member databases
  • All data is available for free and unrestricted access, for any purpose, with no restrictions on analysis, redistribution, or re-publication of the data (CC-BY license)

Collecting and sharing data and metadata

There is a large number of services to publish and share data with the scientific community.

We cannot cover them all, so in this course we take the example of submitting sequencing data.

For other research domains, you will likely benefit from the (meta)data standards defined for those domains.

The following content covers:

  • Data submission at ENA
  • ENA metadata model description
  • Programmatic submission: how to?

Data submission at ENA

Practical use of EBI-ENA data submission service

Suppose we want to publish our sequencing data in the European Nucleotide Archive (ENA), the databank managed by the European Bioinformatics Institute (EBI), Hinxton, UK.

You have to:

  • Understand the metadata model to figure out what mandatory information you have to provide
  • Collect all that information and set up document(s) ready for submission
  • Submit metadata along with data using EBI tools

ENA metadata model description

Sequencing data will be described using 5 objects

https://ena-docs.readthedocs.io/en/latest/submit/study.html

About these 5 objects

Metadata are described using ENA checklists

These are controlled vocabularies organized as a collection of scientific themes.

  • There is a minimum amount of information required during ENA sample registration and all samples must conform to a defined checklist of expected metadata values.

  • These sample checklists have been developed to meet the needs of different research communities. Different communities have different requirements on the minimum metadata expected to describe biological samples.

About ontologies

Ontologies are sets of terms (each with a unique ID and a description). Each ontology covers a particular scientific field and is validated by the corresponding scientific community.

Ontologies are interoperable, logically well-defined, machine-readable controlled vocabularies.

A few examples:

  • The Gene Ontology (GO) aims to maintain and develop a controlled vocabulary of gene and gene product attributes
  • EDAM is an ontology of well-established, familiar concepts that are prevalent within bioinformatics, including types of data and data identifiers, data formats, operations and topics
  • ENVO is an ontology which represents knowledge about environments, environmental processes, ecosystems, habitats, and related entities
  • and so on: many more ontologies exist

ENA checklists

As of November 2023, there are about 40 ENA sample checklists:

Do not invent your own terms to describe your metadata. Use one of the existing checklists instead!

Collecting metadata for ENA submission

In practice, how do you collect all that information?

One possible solution is to set up a spreadsheet containing the many pieces of information (i.e. metadata) needed to describe your sequencing data.

Pros:

  • a single file to store all metadata
  • a self-documented file that helps users fill in the fields
  • capable of modeling ENA metadata using several tabs
  • capable of implementing controlled vocabularies
  • sharable and editable by several users through a cloud-based storage system
  • can be password-protected against any further modification

Once such a document is completely filled in, it must be locked against any further modification and stored next to the raw data.
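Before locking the spreadsheet, a quick automated check can catch empty mandatory fields and badly formatted dates. A minimal Python sketch, assuming one tab has been exported as tab-separated text; the column names and the sample rows are hypothetical and do not reproduce an actual ENA checklist.

```python
import csv
import io
from datetime import date

# Hypothetical mandatory columns; take the real ones from your ENA checklist.
MANDATORY = ("sample_alias", "tax_id", "scientific_name", "collection_date")


def check_samples(tsv_text):
    """Return a list of problems found in a tab-separated sample sheet."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    missing = [c for c in MANDATORY if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing column: {c}" for c in missing]
    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for column in MANDATORY:
            if not (row[column] or "").strip():
                problems.append(f"line {line_no}: empty {column}")
        collected = (row["collection_date"] or "").strip()
        if collected:
            try:
                date.fromisoformat(collected)  # expects YYYY-MM-DD
            except ValueError:
                problems.append(f"line {line_no}: bad date {collected!r}")
    return problems


sheet = (
    "sample_alias\ttax_id\tscientific_name\tcollection_date\n"
    "S1\t562\tEscherichia coli\t2023-05-12\n"
    "S2\t\tEscherichia coli\t12/05/2023\n"
)
for problem in check_samples(sheet):
    print(problem)
```

Running such a check on every export gives users immediate feedback while they are still filling in the document.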

Collecting metadata for ENA submission

Here is an example of a spreadsheet-based document designed to collect metadata following ENA requirements:

ENA submission

Time to publish data and metadata?

  • Time-consuming
  • Difficult to handle
  • Several tools available

ENA submission tools

ENA allows submissions via three routes, each of which is appropriate for a different set of submission types. You may be required to use more than one in the process of submitting your data:

  • Interactive Submissions are completed by filling out web forms directly in your browser and downloading template spreadsheets that can be completed off-line and uploaded to ENA. This is often the most accessible submission route.
  • Command Line Submissions use ENA’s bespoke Webin-CLI program. This validates your submissions entirely before you complete them, allowing you maximum control of the process.
  • Programmatic Submissions are completed by preparing your submissions as XML documents and either sending them to ENA using a program such as cURL or using the Webin Portal.

Pros of programmatic submission: standard code, automated submission (fewer manual-handling errors), possibility to run a test (dry-run) submission.
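To give an idea of what "preparing your submissions as XML documents" means, the Python sketch below builds a submission document and a study (project) document. The element names follow the ENA programmatic submission documentation as recalled here and must be checked against the current ENA documentation; the alias, title and description are invented. Nothing is sent to ENA.

```python
import xml.etree.ElementTree as ET


def submission_xml():
    """Build a submission document asking ENA to ADD the attached objects."""
    root = ET.Element("SUBMISSION")
    actions = ET.SubElement(root, "ACTIONS")
    action = ET.SubElement(actions, "ACTION")
    ET.SubElement(action, "ADD")
    return ET.tostring(root, encoding="unicode")


def project_xml(alias, title, description):
    """Build a document describing one study (project)."""
    root = ET.Element("PROJECT_SET")
    project = ET.SubElement(root, "PROJECT", alias=alias)
    ET.SubElement(project, "TITLE").text = title
    ET.SubElement(project, "DESCRIPTION").text = description
    submission_project = ET.SubElement(project, "SUBMISSION_PROJECT")
    ET.SubElement(submission_project, "SEQUENCING_PROJECT")
    return ET.tostring(root, encoding="unicode")


# Invented values, for illustration only.
print(submission_xml())
print(project_xml("demo_study_2024", "Demo study",
                  "Example study used to illustrate programmatic submission."))
```

Generating the XML from the locked metadata spreadsheet, rather than typing it by hand, is what limits manual-handling errors; the resulting files can then be sent with a program such as cURL, first to the test service.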

ENA submission done

Example of a catalog entry programmatically submitted to ENA:

Metadata related to ENA Study object

Are you FAIR?

Tool to check FAIRness of your materials: FAIR-Checker

Example for the Nextflow-based pipeline SAMBA published on Elixir Workflowhub:

FAIRness evaluation

References

References (1/2)

  • Introduction to Biological Databases In Bioinformatics: Types and Applications Biological databases
  • Additional links related to DMP:
    • https://www.universite-paris-saclay.fr/recherche/science-ouverte/les-donnees-de-la-recherche/introduction-aux-plans-de-gestion-de-donnees
  • Additional links related to Open Science:
    • https://www.enseignementsup-recherche.gouv.fr/fr/le-plan-national-pour-la-science-ouverte-2021-2024-vers-une-generalisation-de-la-science-ouverte-en-48525
    • https://www.inrae.fr/actualites/plan-donnees-pour-la-science
  • EDAM ontology:
    • https://github.com/edamontology/edamontology
    • https://edamontology.github.io/edam-browser/#operation_0525
  • Metadata links:
    • https://fairsharing.org/
    • https://ardc.edu.au/resource/metadata/

References (2/2)