FAIR principles applied to genome assembly and annotation

Module 3: data management

Cyril NOËL

Ifremer (Plouzané)

Stéphanie ROBIN

INRAE (Rennes)

2026-01-06

Module content

Biological databases
FAIR and DMP
Data life cycle
Collecting and sharing data and metadata

Biological databases

Generalities : Definition, types, classification, key properties, applications, main uses, file formats
Focus on the EMBL-EBI (European Molecular Biology Laboratory - European Bioinformatics Institute)
Focus on the NCBI (National Center for Biotechnology Information), NIH (National Institutes of Health, USA)
Other databases

Generalities

Definition of biological databases

Collection of structured biological data that can be accessed, searched and analyzed
They play a crucial role in bioinformatics and computational biology : they give scientists access to a wide variety of biologically relevant data
Organized as collections of biological information : DNA and protein sequences, molecular structures, gene expression data…
Designed to facilitate efficient data storage, retrieval and analysis, allowing researchers to explore and interpret biological data on a large scale

=> FAIR (Findable Accessible Interoperable Reusable) context!

image from https://digitalworldbiology.com/sites/default/files/DWB%20Product/teaser-images/exploring_databases.png

Types of biological databases : generalists or specialists

Nucleotide sequence : DNA and RNA data
Examples : GenBank (NCBI), ENA (EMBL-EBI), DDBJ (DNA Data Bank of Japan)
Protein sequence : amino acid sequence information
Examples : UniProt (EMBL-EBI), RCSB-PDB (Protein Data Bank)
Gene Expression : gene expression data
Examples : GEO (Gene Expression Omnibus, NCBI), ArrayExpress (EMBL-EBI)
Protein structure : 3D protein structures
Examples : RCSB-PDB (Protein Data Bank)
Genetic variation : genetic variations and mutations
Examples : ClinVar (NCBI)

Types of biological databases : generalists or specialists

Specialists : databases that specialize in a particular organism or a particular type of data.

Genomic : genomes and associated information
Examples : Ensembl, UCSC Genome Browser, FlyBase
Literature : scientific literature and associated data
Examples : PubMed, PubMed Central
Taxonomy : classification and nomenclature of organisms
Examples : NCBI taxonomy, Silva
Metabolic pathways and networks : biological pathways and molecular interaction network
Examples : KEGG (Kyoto Encyclopedia of Genes and Genomes), Reactome, STRING, MetaCyc

Classification of biological databases : Primary and secondary

Primary biological databases : data obtained directly from experiments

Data : original and curated data
Raw experimental data
Purposes : data deposition and curation
Sources : researchers and laboratories
Original formats (fasta, fastq …)
Frequent updates and additions

Examples : GenBank (NCBI), ENA (EMBL-EBI), DDBJ (DNA Data Bank of Japan)

Classification of biological databases : Primary and secondary

Secondary biological databases : data obtained by analysing or processing the data held in primary databases

Data : derived data
Processed and annotated data
Purposes : data integration and analysis
Sources : primary databases
Standardized formats (GFF, BED …)
Updates based on primary databases

Examples : KEGG (Kyoto Encyclopedia of Genes and Genomes), Pfam (Protein Families), SCOP (Structural Classification Of Proteins)

Key properties of a well-designed biological database

Comprehensive
Accuracy of data (quality control)
Accessibility (user friendly interface)
Interoperability (compatible with other databases or bioinformatics tools)
Standardization (standards and formats for data representation, such as sequence file formats, ontologies, controlled vocabulary)
Documentation (clear documentation and metadata describing the data sources, data collection methods, and data processing procedures)
Data security and privacy (sensitive nature)
Updates and maintenance (new data, fix errors)
Community engagement (encouraging user feedback, collaboration and contribution)
Long-term sustainability (supported by an organization or consortium)

=> FAIR PRINCIPLES!

Applications of biological databases : examples

Genome annotation and analysis
Protein structure prediction and modeling
Genetic variations and disease-causing mutations
Comparative genomics and evolutionary studies (Phylogeny, taxonomy)
Gene expression analysis and functional genomics
Drug discovery and target identification
Systems biology and pathway analysis

=> Databases are essential tools for managing and utilizing data effectively in various fields and applications

Main uses of biological databases

Search data (example : EBI)
Download data (example : NCBI)
Data analysis using integrated tools such as BLAST, genome browser, variation viewer, taxonomy browser… (example : NCBI Genome Data Viewer)
Submit data (example : EMBL-EBI submission)

The different types of file format (sequences)

FASTA

A text-based format for representing either nucleotide sequences or amino acid (protein) sequences

>MU105551.1 Sinapis alba strain S2 GC0560-79 unplaced genomic scaffold Sina_scaffold1, whole genome shotgun sequence
TGCATCAAAGATATTACCCTTGTGAGCTGGTGTAAGCTCACCGAGACGTACCGAGATCACGTTGATCTTGGGAACTCGTA
TATCTTGGTGTTCTTAAATTTCTGCACATCATCAAACACAAATCTAACTCAGCACAAATAAACACAACATCTGCAATCTG
AATCACATCAGAATTGAATACAAGAACATAAAACcaacacaaaaaacaaaagttagagactttttggtttcttcaagttt
ttttCAAAGAGAGAAGGGCTAAATAAAACGTACGTGGCCATCAGGGTTTCCTGATTGCTCCCAGTAGAGTGTGTGAGTAT
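
A FASTA record is a header line starting with ">" followed by one or more sequence lines. As a minimal sketch, assuming plain uncompressed text and non-empty headers (the function name read_fasta is hypothetical and not taken from any library), such records could be read in Python like this:

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {sequence identifier: sequence} from FASTA-formatted lines."""
    records: Dict[str, List[str]] = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The identifier is the first word after '>'; the rest is a description.
            current = line[1:].split()[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    # Sequence lines of one record are concatenated without separators.
    return {name: "".join(chunks) for name, chunks in records.items()}


# Shortened excerpt of the record shown above.
example = [
    ">MU105551.1 Sinapis alba strain S2 unplaced genomic scaffold",
    "TGCATCAAAGATATTACC",
    "CTTGTGAGCTGGTGTAAG",
]
sequences = read_fasta(example)
print(sequences["MU105551.1"])
```

For real analyses, an established parser (for instance the one in Biopython) is usually a safer choice than a hand-written one.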

FASTQ

A text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding PHRED quality scores

@A00514:914:HTC5YDRXY:1:1101:4851:1031 1:N:0:GTATTCCACC+TTGTCTACAT
TNGTCGTTGAGCTTGGACTTGACCAGCGGGAAGTCGGACGACGACAGACGGCGGCCGTCTTTGGTGCATGTCCTCAGCACACGTTCCTCCCACGTCTGCGG
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00514:914:HTC5YDRXY:1:1101:4517:1047 1:N:0:GTATTCCACC+TTGTCTACAT
AGTCCGTTGATGCAGTAAATACTTGCCAACAAGCATCCCGGTAGTCCAATGAAATAGGTCACTTTCTTCCACAATGACAGACCTTGGAGGACCTCGGGTGA
+
,F,FFFFFFFFFFFFFFFFFFFF::F:FFFFFFFFFFFFF,F,FFFFFFFFFFF:FFFFFFF:FFFFFFFFFFFFFFFFFFFFF:F,FF:FFFFFF,FFFF
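
Each FASTQ record spans four lines: header, sequence, a "+" separator and a quality string with one character per base. The sketch below assumes the common Phred+33 encoding and strictly four-line records; read_fastq is a hypothetical helper name, not a library function.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read name, sequence, Phred scores) from 4-line FASTQ records."""
    block: List[str] = []
    for line in lines:
        block.append(line.rstrip("\n"))
        if len(block) == 4:
            header, sequence, separator, quality = block
            if not header.startswith("@") or not separator.startswith("+"):
                raise ValueError("malformed FASTQ record")
            # Phred+33 encoding: score = ASCII code of the character minus 33.
            scores = [ord(char) - 33 for char in quality]
            yield header[1:].split()[0], sequence, scores
            block = []


# Toy record: the read name and the 5-base sequence are made up for illustration.
example = [
    "@read1 1:N:0:GTATTCCACC+TTGTCTACAT",
    "TNGTC",
    "+",
    "F#FF:",
]
for name, sequence, scores in read_fastq(example):
    print(name, sequence, scores)
```

With this encoding, the characters "F", ":", "," and "#" seen in the example above correspond to Phred scores 37, 25, 11 and 2.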

The different types of file format (sequences)

EMBL format

A text-based format for storing sequences and their associated meta-information, feature coordinates, and annotations

One sequence entry starts with an identifier line (“ID”), followed by further annotation lines. The start of the sequence is marked by a line starting with “SQ” and the end of the sequence is marked by two slashes (“//”).

Example from wiki.biouml.org

ID   ADHBADA2   standard; DNA; VRT; 1145 BP.
XX
AC   J00923; J00924;
XX
DT   13-JUN-1985 (Rel. 06, Created)
DT   22-NOV-1994 (Rel. 41, Last updated, Version 2)
XX
DE   Duck alpha-A-globin gene and 5' flank.
XX
KW   alpha-globin; globin.
XX
OS   Cairina moschata (duck)
OC   Eukaryota; Animalia; Metazoa; Chordata; Vertebrata; Aves;
OC   Neornithes; Neognathae; Anseriformes; Anatidae.
XX
RN   [1]
RP   603-696
RX   MEDLINE; 83028533.
RA   Niessing J., Erbil C., Neubauer V.;
RT   "The isolation and partial characterization of linked alpha-A- and
RT   alpha-D-globin genes from a duck DNA recombinant library";
RL   Gene 18:187-191(1982).
XX
RN   [2]
RP   1-1145
RX   MEDLINE; 83158759.
RA   Erbil C., Niessing J.;
RT   "The complete nucleotide sequence of the duck alpha-A-globin
RT   gene";
RL   Gene 20:211-217(1982).
XX
DR   EPD; 33033; Cm  a'A-globin.
DR   SWISS-PROT; P01987; HBA_CAIMO.
XX
CC   The alpha-A-globin gene is linked to the alpha-D-globin gene. [1]
CC   compared their alpha-A-globin gene sequence with chicken alpha-A-
CC   and alpha-S-globin gene sequences, as well as with other avian and
CC   mammalian alpha-A-globin gene sequences. NCBI gi: 212911
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..1145
FT                   /organism="Cairina moschata"
FT   prim_transcript 331..1145
FT                   /note="alpha-A-globin mRNA"
FT   CDS             join(367..461,612..816,921..1049)
FT                   /note="alpha-A globin; NCBI gi: 212914"
FT                   /codon_start=1
FT   exon            367..461
FT                   /note="alpha-A globin"
FT                   /number=1
FT   intron          462..611
FT                   /note="alpha-A-globin intron A"
FT   exon            612..816
FT                   /number=2
FT   intron          817..920
FT                   /note="alpha-A-globin intron B"
FT   exon            921..>1049
FT                   /note="alpha-A globin"
FT                   /number=3
XX
SQ   Sequence 1145 BP; 193 A; 435 C; 291 G; 226 T; 0 other;
     ctcatgctgg ggttgcctcc ccccctcaaa ccctaacctt aatcccatct cgtgctgggg        60
     tcagaccccc ctaaccctaa cccagttcat gccgggatca gcccccccaa accctaaccc       120
     taaacccatc tcgtgccggg gtcagacccc ccccaaccct aaccccgacc ccagttcatg       180
     ccggggtcgc ccccccccgg tggtgccggt gccgcaggcg gggcagggcg gcggccccgc       240
     ctggccgagg tccagccgcg acggggcggg cggggcgggg cggcgcccgg gccggcacgg       300
     ggatataagg ccggcggcac cagtgggggc acccgtgctg ggggctgcca acgcggagct       360
     gcaaccatgg tgctgtctgc ggctgacaag accaacgtca agggtgtctt ctccaaaatc       420
     ggtggccatg ctgaggagta tggcgccgag accctggaga ggtaggtgtc tgtccccgtc       480
     ctttgtccgt ccctgatcct ctcctctcta accccatgct ctcccccacc ataactgtcc       540
     gtgtcctacc ccaccccatc catcccccct gtccgttgat cccgctggcc ctgactcgct       600
     ctgctccaca ggatgttcat cgcctacccc cagaccaaga cctacttccc ccactttgac       660
     ctgcagcacg gctctgctca gatcaaggcc catggcaaga aggtggcggc tgccctagtt       720
     gaagctgtca accacatcga tgacattgcg ggtgctctct ccaagctcag tgacctccac       780
     gcccaaaagc tccgtgtgga ccctgtcaac ttcaaagtga gtctggtgac tccccccagc       840
     tcctcttcag cacccatcct gggccatccg gccacccctt tacctccccc actcgctcac       900
     cgtctccttt tgcctttcag ttcctgggcc actgcttcct ggtggtggtt gccatccacc       960
     accccgctgc cctgacccca gaggtccacg cttccctgga caagttcatg tgcgccgtgg      1020
     gtgctgtgct gactgccaag taccgttaga cggcaccgtg gctagagctg gacccaccct      1080
     gttgccagcc ttccaactgc aagcagccaa atgatctgaa ataaaatctg ttgcatttgt      1140
     gctcc                                                                  1145
//
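
The structure described above (an "ID" line, annotation lines, an "SQ" line, then the sequence up to "//") can be walked line by line. This is a minimal sketch that only keeps the entry name, the description and the sequence; parse_embl_entry is a hypothetical name, and the example is a truncated excerpt of the entry shown above.

```python
from typing import Dict, Iterable, List


def parse_embl_entry(lines: Iterable[str]) -> Dict[str, str]:
    """Extract the entry name, description and sequence of one EMBL entry."""
    entry_id = ""
    description: List[str] = []
    sequence: List[str] = []
    in_sequence = False
    for line in lines:
        if line.startswith("//"):
            break  # end of the entry
        if in_sequence:
            # Sequence lines hold blocks of bases and a running base count:
            # keep the letters only.
            sequence.append("".join(c for c in line if c.isalpha()))
        elif line.startswith("ID"):
            entry_id = line[2:].split()[0].rstrip(";")
        elif line.startswith("DE"):
            description.append(line[2:].strip())
        elif line.startswith("SQ"):
            in_sequence = True
    return {
        "id": entry_id,
        "description": " ".join(description),
        "sequence": "".join(sequence),
    }


# Truncated excerpt: only the first 20 bases of the sequence are kept.
example = [
    "ID   ADHBADA2   standard; DNA; VRT; 1145 BP.",
    "XX",
    "DE   Duck alpha-A-globin gene and 5' flank.",
    "XX",
    "SQ   Sequence 1145 BP; 193 A; 435 C; 291 G; 226 T; 0 other;",
    "     ctcatgctgg ggttgcctcc        20",
    "//",
]
entry = parse_embl_entry(example)
print(entry["id"], len(entry["sequence"]))
```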

The different types of file format (sequences)

GenBank format (NCBI)

A text-based format for storing sequences and their associated meta-information, feature coordinates, and annotations

LOCUS       JN835270                1574 bp    mRNA    linear   INV 07-JUL-2015
DEFINITION  Spodoptera litura putative olfactory receptor 3 (OR3) mRNA,
            complete cds.
ACCESSION   JN835270
VERSION     JN835270.2
KEYWORDS    .
SOURCE      Spodoptera litura
  ORGANISM  Spodoptera litura
            Eukaryota; Metazoa; Ecdysozoa; Arthropoda; Hexapoda; Insecta;
            Pterygota; Neoptera; Endopterygota; Lepidoptera; Glossata;
            Ditrysia; Noctuoidea; Noctuidae; Amphipyrinae; Spodoptera.
REFERENCE   1  (bases 1 to 1574)
  AUTHORS   Lin,X., Zhang,Q., Wu,Z. and Du,Y.
  TITLE     Identification and Differential Expression of a Candidate Sex
            Pheromone Receptor in Natural Populations of Spodoptera litura
  JOURNAL   PLoS ONE 10 (6), E0131407 (2015)
   PUBMED   26126192
  REMARK    Publication Status: Online-Only
REFERENCE   2  (bases 1 to 1574)
  AUTHORS   Wu,Z.N., Du,Y.J. and Zhuge,Q.C.
  TITLE     Two olfactory receptors from Spodptera litura
  JOURNAL   Unpublished
REFERENCE   3  (bases 1 to 1574)
  AUTHORS   Wu,Z.N., Du,Y.J. and Zhuge,Q.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (10-OCT-2011) Institute of Health and Environment,
            Wenzhou Medical College, University-Town, Wenzhou, Zhejiang 325035,
            China
REFERENCE   4  (bases 1 to 1574)
  AUTHORS   Wu,Z.N., Du,Y.J. and Zhuge,Q.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (26-MAR-2012) Institute of Health and Environment,
            Wenzhou Medical College, University-Town, Wenzhou, Zhejiang 325035,
            China
  REMARK    Sequence update by submitter
COMMENT     On Mar 26, 2012 this sequence version replaced JN835270.1.
FEATURES             Location/Qualifiers
     source          1..1574
                     /organism="Spodoptera litura"
                     /mol_type="mRNA"
                     /db_xref="taxon:69820"
     gene            1..1574
                     /gene="OR3"
                     /note="SlitOR3"
     CDS             67..1365
                     /gene="OR3"
                     /codon_start=1
                     /product="putative olfactory receptor 3"
                     /protein_id="AEY84943.2"
                     /translation="MGLKKFLFEDESVQGITAPTDYLYVKLLRMTLRIIATWPQTELG
                     EKEPVYLNVFLKYFYLVATIGCQIGSYCYLRTYSDELTMMEAGHNYIMIMMTFIDISR
                     ITSLTFSTEYRKLCKEFFTKMHLFYFKDISEHAMATHKRVHLISHLFTLWLLFQMLFG
                     VPLFNLVPMYYNYAAGRYKPGGTQNSTFEHSMYYQYPFDTLTEIKGYIVANIINWILS
                     SLCVIWFCMCDLILLLMVFNIWGHLRMLTYTLNNFPRPSVETTITIDGGLTVTSAIYS
                     DEENVQIFRKLKECVDYHRRIVEFNSKVSDVFGPMLIIYYLYHQTSGCLLLLECSQMT
                     APVLMRYLPLTIVCTRQLIQLSVIFELIGTESEKLPSAVYSVPWECMSISNKKFVIFF
                     LMNVKEPIAVKPPGIANVGVTTMASILRTSLSYFTFLRSI"
ORIGIN      
        1 atggacaccc ttcagaagat atgcagatag aactttagat gtaaattact gattttcttc
       61 gcaaagatgg gtttaaaaaa gtttcttttt gaagatgaat ctgtgcaggg catcactgca
      121 ccaactgatt acctctacgt taaactcttg cgcatgactt tgcgtatcat cgccacctgg
      181 cctcagacag aactcggaga gaaggaacct gtgtacttga atgtttttct aaagtatttt
      241 tatttggtag ccaccattgg ctgtcaaatc ggttcatatt gttacctcag aacgtatagt
      301 gatgaactta ccatgatgga ggctggtcat aattatatta tgattatgat gacattcatc
      361 gatatatcaa gaattacatc gctcacattt tctacggaat acagaaaact ttgcaaagaa
      421 ttcttcacga aaatgcatct cttctatttc aaagatatat ctgaacatgc gatggcgact
      481 cacaaacggg tgcacttgat atctcatttg tttactttgt ggttattatt tcaaatgctg
      541 tttggggtgc ctctttttaa cctcgtacca atgtactata attatgctgc tggaagatat
      601 aaacctggtg gcacacaaaa ctcaactttt gaacattcta tgtattacca gtatccattt
      661 gatacgctca ctgaaataaa gggatatatc gttgctaata ttattaattg gatattgtct
      721 tctctctgcg taatctggtt ttgcatgtgt gacctcattc tattactgat ggtgttcaat
      781 atttggggac atttgagaat gcttacatac accctgaata attttcctag gccaagtgta
      841 gagactacta tcacaataga tggaggatta acagtaacaa gtgctatata ttctgatgaa
      901 gagaacgtac aaatatttag aaagttaaaa gaatgtgttg attatcatcg acggattgtt
      961 gaatttaata gcaaagtttc tgatgtattt ggtccaatgt tgatcatcta ttatttatac
     1021 catcaaacaa gcggctgttt attgttactg gagtgttcac aaatgactgc accagtccta
     1081 atgcgttatt tacccctgac gatagtgtgc actcggcaac tgattcagtt atcagtaatt
     1141 tttgaactaa ttggaactga gagcgaaaag ttaccaagcg ccgtgtacag tgtaccatgg
     1201 gaatgtatga gcataagtaa taaaaaattt gtgatattct tcttgatgaa tgttaaagag
     1261 ccaatagctg tgaaacctcc tggcattgcc aatgttggtg tcacgacaat ggcttcgatt
     1321 ttgagaacat cgttgtccta tttcaccttc ttacgcagca tttgaattta tgactttcta
     1381 aacaactatc tctcaaatca tctctcacga gttttgtcaa gttttctcac aaattatcac
     1441 actatattta tgctaacagt taaatgtaaa aatgttaata ataatgaata tgtttgctat
     1501 gaaatagttt atcaccaatt gttttatatt gtttcaataa attctgacgt cacaaaaaaa
     1561 aaaaaaaaaa aaaa
//
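
In a GenBank record, feature locations such as "67..1365" are 1-based and inclusive, and the bases are listed after the ORIGIN line. The sketch below reads the first two sequence lines of the record above and checks that the CDS starts with a start codon; genbank_sequence and simple_location are hypothetical helpers, and only simple "start..end" locations are handled (no join or complement).

```python
import re
from typing import Iterable, List, Tuple


def genbank_sequence(lines: Iterable[str]) -> str:
    """Concatenate the bases listed between the ORIGIN line and '//'."""
    bases: List[str] = []
    in_origin = False
    for line in lines:
        if line.startswith("//"):
            break
        if in_origin:
            # Drop the leading position number and the spaces between blocks.
            bases.append("".join(c for c in line if c.isalpha()))
        elif line.startswith("ORIGIN"):
            in_origin = True
    return "".join(bases)


def simple_location(text: str) -> Tuple[int, int]:
    """Parse a simple 'start..end' location (1-based, inclusive)."""
    match = re.fullmatch(r"(\d+)\.\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unsupported location: {text!r}")
    return int(match.group(1)), int(match.group(2))


example = [
    "ORIGIN",
    "        1 atggacaccc ttcagaagat atgcagatag aactttagat gtaaattact gattttcttc",
    "       61 gcaaagatgg gtttaaaaaa gtttcttttt gaagatgaat ctgtgcaggg catcactgca",
    "//",
]
sequence = genbank_sequence(example)
start, end = simple_location("67..1365")
# GenBank position 67 is Python index 66 (0-based).
first_codon = sequence[start - 1:start + 2]
print(first_codon)
```

The first codon found this way matches the "M" that opens the /translation qualifier of the CDS.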

The different types of file format (annotation)

GFF (General Feature Format)

A text-based format for describing genes and other features of DNA, RNA and protein sequences (annotation)

##gff-version 3
HiC_scaffold_1  AUGUSTUS    gene    7086    9916    0.14    +   .   ID=g14228
HiC_scaffold_1  AUGUSTUS    mRNA    7086    9916    0.14    +   .   ID=g14228.t1;Parent=g14228
HiC_scaffold_1  AUGUSTUS    five_prime_UTR  7086    7230    0.21    +   .   ID=g14228.t1.5UTR1;Parent=g14228.t1
HiC_scaffold_1  AUGUSTUS    exon    7086    7405    .   +   .   ID=g14228.t1.exon1;Parent=g14228.t1
HiC_scaffold_1  AUGUSTUS    CDS 7231    7405    0.84    +   0   ID=g14228.t1.CDS1;Parent=g14228.t1
HiC_scaffold_1  AUGUSTUS    CDS 8821    8930    0.99    +   2   ID=g14228.t1.CDS2;Parent=g14228.t1
HiC_scaffold_1  AUGUSTUS    exon    8821    8930    .   +   .   ID=g14228.t1.exon2;Parent=g14228.t1
HiC_scaffold_1  AUGUSTUS    CDS 8999    9190    1   +   0   ID=g14228.t1.CDS3;Parent=g14228.t1
HiC_scaffold_1  AUGUSTUS    exon    8999    9190    .   +   .   ID=g14228.t1.exon3;Parent=g14228.t1
HiC_scaffold_1  AUGUSTUS    CDS 9761    9883    1   +   0   ID=g14228.t1.CDS4;Parent=g14228.t1
HiC_scaffold_1  AUGUSTUS    exon    9761    9916    .   +   .   ID=g14228.t1.exon4;Parent=g14228.t1
HiC_scaffold_1  AUGUSTUS    three_prime_UTR 9884    9916    0.4 +   .   ID=g14228.t1.3UTR1;Parent=g14228.t1

seqname - name of the chromosome or scaffold; chromosome names can be given with or without the ‘chr’ prefix. The seqname must match the sequence identifier used in the corresponding assembly (in Ensembl: a standard chromosome name or an Ensembl identifier such as a scaffold ID, without any additional content such as species or assembly), as in the example above.
source - name of the program that generated this feature, or the data source (database or project name)
feature - feature type name, e.g. gene, mRNA, exon, CDS
start - start position of the feature, with sequence numbering starting at 1
end - end position of the feature (inclusive), with sequence numbering starting at 1
score - a floating point value, or ‘.’ when there is no score
strand - defined as + (forward) or - (reverse)
frame - one of ‘0’, ‘1’ or ‘2’ (or ‘.’ for non-coding features). ‘0’ indicates that the first base of the feature is the first base of a codon, ‘1’ that the second base is the first base of a codon, and so on.
attribute - a semicolon-separated list of tag-value pairs, providing additional information about each feature
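
The nine columns described above map naturally onto a small record type. The following sketch assumes a GFF3 line whose columns are separated by tabs, as the format requires, and whose attributes are written tag=value; GffFeature and parse_gff3_line are hypothetical names.

```python
from typing import Dict, NamedTuple


class GffFeature(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    score: str
    strand: str
    frame: str
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> GffFeature:
    """Split one tab-separated GFF3 line into its nine columns."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError("a GFF line must have exactly 9 tab-separated columns")
    # Attributes look like 'ID=...;Parent=...'.
    attributes = dict(
        pair.split("=", 1) for pair in columns[8].split(";") if pair
    )
    return GffFeature(
        columns[0], columns[1], columns[2],
        int(columns[3]), int(columns[4]),
        columns[5], columns[6], columns[7], attributes,
    )


line = (
    "HiC_scaffold_1\tAUGUSTUS\tCDS\t7231\t7405\t0.84\t+\t0\t"
    "ID=g14228.t1.CDS1;Parent=g14228.t1"
)
cds = parse_gff3_line(line)
# Both ends are included, so the length is end - start + 1.
print(cds.feature, cds.end - cds.start + 1, cds.attributes["Parent"])
```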

The different types of file format (annotation)

GTF (General Transfer Format)

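Identical to GFF version 2 : same nine columns, with attributes written as tag "value" pairs rather than the tag=value pairs of GFF3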

HiC_scaffold_1  AUGUSTUS    exon    7086    7230    .   +   .   ID 1 ; gene_biotype protein_coding ; gene_id "g14228.t1"  ; transcript_id "g14228.t1" ;
HiC_scaffold_1  AUGUSTUS    exon    7231    7405    .   +   0   ID 1 ; gene_biotype protein_coding ; gene_id "g14228.t1"  ; transcript_id "g14228.t1" ;
HiC_scaffold_1  AUGUSTUS    exon    8821    8930    .   +   2   ID 1 ; gene_biotype protein_coding ; gene_id "g14228.t1"  ; transcript_id "g14228.t1" ;
HiC_scaffold_1  AUGUSTUS    exon    8999    9190    .   +   0   ID 1 ; gene_biotype protein_coding ; gene_id "g14228.t1"  ; transcript_id "g14228.t1" ;
HiC_scaffold_1  AUGUSTUS    exon    9761    9883    .   +   0   ID 1 ; gene_biotype protein_coding ; gene_id "g14228.t1"  ; transcript_id "g14228.t1" ;
HiC_scaffold_1  AUGUSTUS    exon    9884    9916    .   +   .   ID 1 ; gene_biotype protein_coding ; gene_id "g14228.t1"  ; transcript_id "g14228.t1" ;

seqname - name of the chromosome or scaffold; chromosome names can be given with or without the ‘chr’ prefix. The seqname must match the sequence identifier used in the corresponding assembly (in Ensembl: a standard chromosome name or an Ensembl identifier such as a scaffold ID, without any additional content such as species or assembly), as in the example above.
source - name of the program that generated this feature, or the data source (database or project name)
feature - feature type name, e.g. gene, transcript, exon, CDS
start - start position of the feature, with sequence numbering starting at 1
end - end position of the feature (inclusive), with sequence numbering starting at 1
score - a floating point value, or ‘.’ when there is no score
strand - defined as + (forward) or - (reverse)
frame - one of ‘0’, ‘1’ or ‘2’ (or ‘.’ for non-coding features). ‘0’ indicates that the first base of the feature is the first base of a codon, ‘1’ that the second base is the first base of a codon, and so on.
attribute - a semicolon-separated list of tag-value pairs, providing additional information about each feature; gene_id and transcript_id are mandatory
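
The attribute column is where GTF differs most visibly from GFF3: pairs are written as a tag followed by a value, usually quoted. A minimal sketch of a parser for that column follows; parse_gtf_attributes is a hypothetical name, and the code assumes that no value contains a semicolon.

```python
from typing import Dict


def parse_gtf_attributes(column: str) -> Dict[str, str]:
    """Turn 'tag "value"; tag value;' pairs into a dictionary."""
    attributes: Dict[str, str] = {}
    for item in column.split(";"):
        item = item.strip()
        if not item:
            continue  # empty piece after the trailing semicolon
        # The tag is the first word; everything after it is the value.
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')
    return attributes


column = (
    'gene_biotype protein_coding ; gene_id "g14228.t1"  ; '
    'transcript_id "g14228.t1" ;'
)
attrs = parse_gtf_attributes(column)
print(attrs["gene_id"], attrs["transcript_id"])
```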

The different types of file format (annotation)

BED format

A text file used to store genomic regions as coordinates and associated annotations

chr7  127471196  127472363  Pos1  0  +  127471196  127472363  255,0,0
chr7  127472363  127473530  Pos2  0  +  127472363  127473530  255,0,0
chr7  127473530  127474697  Pos3  0  +  127473530  127474697  255,0,0
chr7  127474697  127475864  Pos4  0  +  127474697  127475864  255,0,0
chr7  127475864  127477031  Neg1  0  -  127475864  127477031  0,0,255
chr7  127477031  127478198  Neg2  0  -  127477031  127478198  0,0,255
chr7  127478198  127479365  Neg3  0  -  127478198  127479365  0,0,255
chr7  127479365  127480532  Pos5  0  +  127479365  127480532  255,0,0
chr7  127480532  127481699  Neg4  0  -  127480532  127481699  0,0,255

chrom - name of the chromosome or scaffold; chromosome names can be given with or without the ‘chr’ prefix
chromStart - start position of the feature in 0-based coordinates (i.e. the first base of the chromosome is numbered 0)
chromEnd - end position of the feature; the base at chromEnd is not part of the feature, so its length is chromEnd - chromStart
name - label of the feature, displayed next to it in genome browsers
score - a score between 0 and 1000
strand - defined as + (forward) or - (reverse)
thickStart - coordinate at which to start drawing the feature as a solid rectangle
thickEnd - coordinate at which to stop drawing the feature as a solid rectangle
itemRgb - an RGB colour value (e.g. 0,0,255), only used if the track line sets itemRgb to “on” (case-insensitive)
blockCount - the number of sub-elements (e.g. exons) within the feature
blockSizes - the size of each sub-element
blockStarts - the start coordinate of each sub-element, relative to chromStart
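
BED coordinates are 0-based with the end excluded, whereas GFF and GTF are 1-based with the end included, which is a frequent source of off-by-one errors when converting between them. The conversion can be sketched as follows (bed_to_one_based and bed_length are hypothetical helper names):

```python
from typing import Tuple


def bed_to_one_based(chrom_start: int, chrom_end: int) -> Tuple[int, int]:
    """Convert a BED interval (0-based, end excluded) to 1-based inclusive."""
    return chrom_start + 1, chrom_end


def bed_length(chrom_start: int, chrom_end: int) -> int:
    """Length of a BED interval: no '+ 1' because the end is excluded."""
    return chrom_end - chrom_start


# First six columns of the first line of the example above.
fields = "chr7  127471196  127472363  Pos1  0  +".split()
start, end = int(fields[1]), int(fields[2])
gff_start, gff_end = bed_to_one_based(start, end)
print(gff_start, gff_end, bed_length(start, end))
```

Only the start changes in the conversion; the end keeps the same value because "excluded, 0-based" and "included, 1-based" designate the same last base.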

The different types of file format (annotation)

VCF (Variant Call Format)

A standard text file format for storing sequence variants (SNPs, insertions, deletions…) called against a reference sequence

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3

CHROM The name of the sequence (typically a chromosome) on which the variation is being called. This sequence is usually known as ‘the reference sequence’, i.e. the sequence against which the given sample varies.
POS The 1-based position of the variation on the given sequence.
ID The identifier of the variation, e.g. a dbSNP rs identifier, or “.” if unknown. Multiple identifiers should be separated by semicolons without white-space.
REF The reference base (or bases in the case of an indel) at the given position on the given reference sequence.
ALT The list of alternative alleles at this position, separated by commas.
QUAL A quality score associated with the inference of the given alleles.
FILTER A flag indicating which of a given set of filters the variation has failed, or PASS if all the filters were passed successfully.
INFO An extensible list of key-value pairs (fields) describing the variation. Multiple fields are separated by semicolons, with optional values in the format key=data[,data].
FORMAT An (optional) extensible list of fields for describing the samples, separated by colons (e.g. GT:GQ:DP).
SAMPLEs For each (optional) sample described in the file, values are given for the fields listed in FORMAT.
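
The INFO and FORMAT columns are small formats nested inside the line: INFO holds key=value pairs or bare flags, and FORMAT gives the keys for each sample column. A minimal sketch of how they could be decoded follows; parse_info and parse_samples are hypothetical names, and real VCF files separate columns with tabs rather than the spaces shown on the slide.

```python
from typing import Dict, List, Union


def parse_info(info: str) -> Dict[str, Union[str, bool]]:
    """Parse the INFO column: 'key=value' pairs and bare flags such as DB."""
    result: Dict[str, Union[str, bool]] = {}
    if info == ".":
        return result  # '.' means no INFO field
    for field in info.split(";"):
        key, sep, value = field.partition("=")
        result[key] = value if sep else True  # a key without '=' is a flag
    return result


def parse_samples(format_column: str, samples: List[str]) -> List[Dict[str, str]]:
    """Pair each sample's colon-separated values with the FORMAT keys."""
    keys = format_column.split(":")
    return [dict(zip(keys, sample.split(":"))) for sample in samples]


line = (
    "20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 "
    "GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,."
)
columns = line.split()  # whitespace split, only valid for this slide example
info = parse_info(columns[7])
genotypes = [sample["GT"] for sample in parse_samples(columns[8], columns[9:])]
print(info["DP"], info["DB"], genotypes)
```

In the genotype strings, "|" marks phased alleles and "/" unphased ones; 0 is the REF allele and 1, 2… index the ALT alleles.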

The different types of file format (alignment)

SAM (Sequence Alignment Map)

A text-based format originally for storing biological sequences aligned to a reference sequence

D00562:237:CBFBNANXX:5:2307:15276:42153 1187    HiC_scaffold_1  375 3   12M292N112M =   803 745 CACAGAGGCTGTCACAATCTACGTTCCCCAGTAAGTCGGCGGCGGAGCTGTCTCGACTGGAGCCGTGTACGTCAGTAGCGGGCATGGAGCTATGGTCGCAGTGCTACGAGCGCTGGCTCGTGCC    CBCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCGGGGGGGGGGGGFEGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAGGGGGGGGG    
MD:Z:124    PG:Z:MarkDuplicates RG:Z:T1_S15 NH:i:2  HI:i:2  NM:i:0  AS:i:250    XS:A:+
D00562:237:CBFBNANXX:5:1306:14559:55311 1187    HiC_scaffold_1  375 3   12M292N112M =   806 745 CACAGAGGCTGTCACAATCTACGTTCCCCAGTAAGTCGGCCGCGGAGCTGTCTCGACTGCAGCCGTGTACGGCAGTAGCGGGCATGGAGCTATGGTCGCAGTGCTACGAGCGCTGGCGCGTGCC    33AA0;1;/?E;11E1CGGCG:FB111FGG/<>11>1EEG/<E///</1::?EG1>EGG0:E>0=/9>FG0//<900DEGG/DGGGGGG<//EG/;EG.>6.////:C..96CG.C....69..    
MD:Z:40G18G11T45T6  PG:Z:MarkDuplicates RG:Z:T1_S15 NH:i:2  HI:i:2  NM:i:4  AS:i:235    XS:A:+
D00562:237:CBFBNANXX:5:1111:19941:76960 163 HiC_scaffold_1  375 3   12M292N112M =   803 745 CACAGAGGCTGTCACAATCTACGTTCCCCAGTAAGTCGGCGGCGGAGCTGTCTCGACTGGAGCCGTGTACGTCAGTAGCGGGCATGGAGCTATGGTCGCAGTGCTACGAGCGCTGGCTCGTGCC    CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGGGGGGGGGGGGGGGGGGGGGDGGGGGGGGGGG    
MD:Z:124    PG:Z:MarkDuplicates RG:Z:T1_S15 NH:i:2  HI:i:2  NM:i:0  AS:i:250    XS:A:+

QNAME Query template/pair NAME
FLAG bitwise FLAG
RNAME Reference sequence NAME
POS 1-based leftmost POSition/coordinate of the first aligned base (soft-clipped bases excluded)
MAPQ MAPping Quality (Phred-scaled)
CIGAR extended CIGAR string
RNEXT (formerly MRNM) Reference sequence name of the mate/next read (‘=’ if same as RNAME)
PNEXT (formerly MPOS) 1-based position of the mate/next read
TLEN inferred Template LENgth (insert size)
SEQ query SEQuence on the same strand as the reference
QUAL query QUALity (ASCII code minus 33 gives the Phred base quality)
12+. OPT variable OPTional fields in the format TAG:VTYPE:VALUE
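
Two SAM columns are encoded rather than written in clear: FLAG is a sum of bits, and CIGAR is a run-length description of the alignment. The sketch below decodes both for the records shown above; the bit meanings follow the SAM specification, while the short labels and the names decode_flag and reference_span are hypothetical choices made for this example.

```python
import re
from typing import List

# Meaning of each FLAG bit (labels are shortened for this example).
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int) -> List[str]:
    """List the labels of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def reference_span(cigar: str) -> int:
    """Number of reference bases covered by an alignment."""
    span = 0
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        if op in "MDN=X":  # operations that consume the reference
            span += int(length)
    return span


print(decode_flag(1187))
print(reference_span("12M292N112M"))
```

For the first record above, FLAG 1187 includes the duplicate bit, consistent with its PG:Z:MarkDuplicates tag, and the CIGAR 12M292N112M describes a 124-base read split around a 292-base skipped region (N), as expected for a spliced RNA-seq alignment.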

BAM (Binary Alignment Map)

A compressed binary version of a SAM file that is used to represent aligned sequences up to 128 Mb

=> All these formats are standards : use them as they are, without modifying them!

Focus on the EMBL-EBI

EMBL-EBI (European Molecular Biology Laboratory - European Bioinformatics Institute)

EBI website contains 150 databases

EMBL-EBI

Example of DNA & RNA database : ENA (European Nucleotide Archive)

ENA (European Nucleotide Archive) : nucleotide sequencing information, covering raw sequencing data, sequence assembly information and functional annotation

EMBL-EBI

Example of DNA & RNA database : Ensembl

Ensembl : genome browser for animal genomes that supports research in comparative genomics, evolution, sequence variation and transcriptional regulation.
Ensembl annotates genes, computes multiple alignments, predicts regulatory function and collects disease data.
Ensembl tools include BLAST, BLAT, BioMart and the Variant Effect Predictor (VEP) for all supported species.
Release 115 (09/2025) : 348 species !

EMBL-EBI

Example of DNA & RNA database : Rfam

Rfam : collection of RNA families, each represented by multiple sequence alignments, consensus secondary structures and covariance models

EMBL-EBI

Example of protein database : UniProt

UniProt : high-quality, comprehensive and freely accessible resource of protein sequence and functional information.

EMBL-EBI

Example of protein database : AlphaFold

AlphaFold : database for protein structure predictions for numerous species

EMBL-EBI

Example of protein database : InterProScan

InterProScan : functional analysis of proteins by classifying them into families and predicting domains and important sites

EMBL-EBI

Example of protein database : Pfam

Pfam : A database of conserved protein families and domains. Pfam is a member database of InterPro. The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs)

EMBL-EBI

Example of protein database : HMMER

HMMER : Fast sensitive protein homology searches using profile hidden Markov models (HMMs) for querying against both sequence and HMM target databases.

Focus on the NCBI

NCBI : National Center for Biotechnology Information (part of the NIH, National Institutes of Health, USA)

Biomedical and genomic information

NCBI website contains 58 Databases

NCBI

Example of DNA & RNA database : Nucleotide Database

Nucleotide Database : A collection of nucleotide sequences from several sources, including :

GenBank : an annotated collection of all publicly available DNA sequences
RefSeq : a comprehensive, integrated, non-redundant, well-annotated set of reference sequences including genomic, transcript, and protein.
The Third Party Annotation (TPA) database : a database designed to capture experimental or inferential results that support submitter-provided annotation for, or assembly of, sequence data that the submitter did not directly determine but derived from GenBank primary data.
Protein Database : a collection of sequences from several sources, including translations from annotated coding regions in GenBank, RefSeq and TPA, as well as records from SwissProt, PIR, PRF, and PDB.

Searching the Nucleotide Database will yield available results from each of its component databases.

NCBI

Example of DNA & RNA database : GenBank

GenBank : an annotated collection of all publicly available DNA sequences. It is part of the International Nucleotide Sequence Database Collaboration (INSDC), which comprises :
- the DNA Data Bank of Japan (DDBJ),
- the European Nucleotide Archive (ENA) at EMBL-EBI,
- GenBank at NCBI.

GenBank consists of several divisions, most of which can be accessed through the Nucleotide database.

NCBI

Example of DNA & RNA database : SRA (Sequence Read Archive)

SRA (Sequence Read Archive) : stores sequencing data from the next generation of sequencing platforms including Roche 454 GS System®, Illumina Genome Analyzer®, Life Technologies AB SOLiD System®, Helicos Biosciences Heliscope®, Complete Genomics®, and Pacific Biosciences SMRT®.

NCBI

Example of DNA & RNA database : Genome

Genome : Contains sequence and map data from the whole genomes of over 1000 organisms.

NCBI

Example of DNA & RNA AND protein database : RefSeq

RefSeq : A comprehensive, integrated, non-redundant, well-annotated set of reference sequences including genomic, transcript, and protein.

NCBI

Example of protein database : Protein Database

Protein Database : includes protein sequence records from a variety of sources, including GenPept, RefSeq, Swiss-Prot, PIR, PRF, and PDB.

NCBI

Example of publication database : PubMed

PubMed : database of citations and abstracts of the biomedical literature (MEDLINE, life science journals and online books), which can be searched directly or reached from linked data in other NCBI databases

NCBI

Tool for sequence analysis : BLAST

NCBI

Tool for downloading genome data : NCBI Datasets

NCBI

Tool for downloading genome data : command line NCBI Genome

NCBI

Tool for downloading SRA data : SRA Toolkit

Other databases

Databases of specific genomes

Ensembl : vertebrate genomes
University of California Santa Cruz (UCSC) : human, mouse, Drosophila, zebrafish and other species
FlyBase : Drosophila genes and genomes
Joint Genome Institute (JGI) : Plants, algae, fungi and metagenomes
WormBase : Nematode Information Resource
The Arabidopsis Information Resource (TAIR)
Mouse Genome Informatics (MGI)
BIPAA (BioInformatics Platform for Agroecosystem Arthropods) : Arthropod genomes

Non-sequence-centric databases

Online Mendelian Inheritance in Man (OMIM): An Online Catalog of Human Genes and Genetic Disorders
RCSB Protein Data Bank (RCSB PDB)
MetaCyc : a curated database of experimentally elucidated metabolic pathways from all domains of life
Kyoto Encyclopedia of Genes and Genomes (KEGG)

A catalog of worldwide biological databases

Database Commons is a curated catalog of worldwide biological databases. It aims to provide a full landscape of biological databases throughout the world and to enable easy retrieval of, and access to, a specific collection of databases of interest.

Public banks installed on the IFB cluster

Banks can be available in different versions and in different index formats: fasta, gff, blast, bwa, bowtie, bowtie2, star, diamond, picard, etc.

ls /shared/bank/
COG                      danio_rerio              krona                  oryza_nivara            saccharomyces_cerevisiae
MMSEQS_DB                data.galaxyproject.org   lachancea_kluyveri     oryza_punctata          saccharum_spontaneum
PFAM                     daucus_carota            malus_domestica        oryza_rufipogon         salmonella_typhimurium
README.txt               dbcan                    mgnify                 oryza_sativa_aus        schizosaccharomyces_pombe
_scripts                 dram                     mmdb                   oryza_sativa_basmati    software_specific
accession2taxid          drosophila_melanogaster  motus                  oryza_sativa_indica     solanum_lycopersicum
alphafold2               eggnog-mapper            mus_musculus           oryza_sativa_japonica   sorghum_bicolor
amrfinder                elaeis_guineensis        musa_acuminata         oryza_sativa_nerica     squeezemeta
anas_platyrhynchos       emapperdb                ncbi_taxonomy          pdb                     sus_scrofa
arabidopsis_lyrata       equus_caballus           neofusicoccum_parvum   pfam                    trypanosoma_brucei
arabidopsis_thaliana     escherichia_coli         nicotiana_tabacum      phalaenopsis_aphrodite  uniprot
astyanax_mexicanus       felis_catus              nr                     phalaenopsis_equestris  uniprot_swissprot
bakta                    frogs                    nt                     phoenix_dactylifera     uniref50
bos_taurus               galaxy_test-data         ophrys_sphegodes       phyloflash              uniref90
bowtie-index             genbank                  oryctolagus_cuniculus  plasmidfinder           vanilla_planifolia
candida_glabrata         greengenes               oryza_barthii          populus_trichocarpa     vitis_vinifera
canis_lupus_familiaris   gtdbtk                   oryza_brachyantha      prunus_persica          xenopus_laevis
capra_hircus             hhsuite                  oryza_glaberrima       rattus_norvegicus       zea_mays
checkm                   homo_sapiens             oryza_glumipatula      refseq                  zymoseptoria_tritici
clostridium_perfringens  kraken                   oryza_longistaminata   rosa_chinensis
cocos_nucifera           krakenuniq               oryza_meridionalis     rsat_organism

ls /shared/bank/uniprot_swissprot/        
current  uniprot_swissprot_2018-10-10  uniprot_swissprot_2020_03

ls /shared/bank/uniprot_swissprot/current/
blast  diamond  fasta  flat  listingv1.blast  mapping  mmseqs

Public banks installed on the IFB cluster

Banks can be available in different versions and in different index formats: fasta, gff, blast, bwa, bowtie, bowtie2, star, diamond, picard, etc.

The banks are organized first by species or bank name, then by index:

homo_sapiens/  
├── GRCh38  
│   ├── bwa  
│   │   ├── Homo_sapiens.GRCh38.dna.primary_assembly.fa -> ../fasta/Homo_sapiens.GRCh38.dna.primary_assembly.fa
        ├── Homo_sapiens.GRCh38.dna.primary_assembly.fa.amb
        ├── Homo_sapiens.GRCh38.dna.primary_assembly.fa.ann
        ├── Homo_sapiens.GRCh38.dna.primary_assembly.fa.bwt
        ├── Homo_sapiens.GRCh38.dna.primary_assembly.fa.pac
        └── Homo_sapiens.GRCh38.dna.primary_assembly.fa.sa
│   │   └── ...  
│   ├── fasta  
│   │   ├── Homo_sapiens.GRCh38.dna.primary_assembly.fa
        ├── Homo_sapiens.GRCh38.dna.primary_assembly.fa.fai
        ├── Homo_sapiens.GRCh38.dna.primary_assembly.fa.readme
        ├── Homo_sapiens.GRCh38.dna.toplevel.fa
        ├── Homo_sapiens.GRCh38.dna.toplevel.fa.fai
        └── Homo_sapiens.GRCh38.dna.toplevel.fa.readme
│   │   └── ...  
│   ├── gff3  
│   │   ├── Homo_sapiens.GRCh38.94.gff3  
│   │   └── Homo_sapiens.GRCh38.94.gff3.readme  
│   └── star  
│       ├── Genome  
│       ├── Homo_sapiens.GRCh38.dna.primary_assembly.fa -> ../fasta/Homo_sapiens.GRCh38.dna.primary_assembly.fa  
│       └── ...  
└── hg19
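
Because every bank follows the same species / build / index nesting, the path to an index can be composed rather than searched for. Here is a minimal Python sketch, assuming the /shared/bank root and the layout shown in the tree; the helper name bank_index_path is hypothetical and is not part of any IFB tooling:

```python
from pathlib import Path

# Root of the shared banks, as listed by `ls /shared/bank/`.
BANK_ROOT = Path("/shared/bank")


def bank_index_path(species: str, build: str, index: str,
                    root: Path = BANK_ROOT) -> Path:
    """Return the expected directory <root>/<species>/<build>/<index>.

    Hypothetical helper: it only composes the path and does not check
    that the directory actually exists on the cluster.
    """
    for part in (species, build, index):
        # Reject empty names and anything that could escape the bank root.
        if not part or "/" in part or part in (".", ".."):
            raise ValueError(f"invalid path component: {part!r}")
    return root / species / build / index


# Example: where the BWA index of the human GRCh38 build is expected.
print(bank_index_path("homo_sapiens", "GRCh38", "bwa"))
```

Composing the path keeps scripts independent of the current working directory; checking that the directory exists is left to the caller, since not every species is available in every index format.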

FAIR and DMP

FAIR: general principles
FAIR: why do it?
FAIR: an obligation!
DMP: Data Management Plan
DMP: its content

FAIR: general principles

FAIR: guidelines proposed by the scientific community ¹.

Objectives: promote best-practice tools to help produce and publish reproducible analyses throughout the data lifecycle.

FAIR: why do it?

More than 70% of researchers have tried and failed to reproduce another scientist’s experiments, and more than half have failed to reproduce their own!

FAIR: an obligation!

The FAIR principles are a key element of a more general concept: the data management plan (DMP).
In France, there is an obligation (e.g. for ANR and European projects) to make public the data and software of scientific projects funded with at least 50% public money.

DMP: Data Management Plan

The Data Management Plan (DMP), or Plan de gestion de données (PGD), is a concise, evolving document that helps to organize and anticipate every stage in the data life cycle. For each data set, it explains how project data will be managed, from creation or collection to sharing and archiving.
The data management plan is based on the data life cycle, which describes the various stages of data processing during a research project.

Figure from: https://www.universite-paris-saclay.fr/recherche/science-ouverte/les-donnees-de-la-recherche/introduction-aux-plans-de-gestion-de-donnees

DMP: its content

The Data Management Plan (DMP) is a concise, editable text document that organizes and anticipates all stages of the data life cycle.
How to write one?
- Use templates provided by ANR or institutes;
- Use online tools such as DMP OPIDoR ¹ or Data Stewardship Wizard ².

Data life cycle

A long, not very quiet river
Data is stored in a structured and secure manner
Data is stored with metadata
What is metadata?
Data licensing
Making data accessible to the community

A long, not very quiet river

Very useful tools exist to collect, manage and share data and metadata over time.

Data is stored in a structured and secure manner

A possible organization of data projects on a storage system

Use a dedicated directory to store all data for all projects
Assign admin permissions to data managers only
- Limit the number of data managers
Assign read-only permissions to users
Set up a separate data archive system
- “separate” = a different storage system in a different place

Data is stored in a structured and secure manner

‘data’, ‘report’ and ‘operation’ are EDAM terms

Set up the same organization for all subfolders
- “users” DO NOT control naming
- only “data managers” DO that
Use a controlled vocabulary for folder naming
- e.g. the EDAM bioinformatics ¹ ontology
Always store raw data AND metadata in this safe place

Note: the ‘operation’ folder may contain particular final results, e.g. a sequence assembly.
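
The organization above can be scripted so that every project gets an identical skeleton. This is only a sketch: the three folder names come from the slide, while the function names, the demo project name and the choice of permission bits are assumptions made for illustration.

```python
import stat
import tempfile
from pathlib import Path

# Sub-folder names taken from the slide; they are EDAM terms.
EDAM_FOLDERS = ("data", "report", "operation")


def create_project(root: Path, project: str) -> Path:
    """Create <root>/<project>/{data,report,operation} (hypothetical helper)."""
    project_dir = root / project
    for name in EDAM_FOLDERS:
        (project_dir / name).mkdir(parents=True, exist_ok=True)
    return project_dir


def make_read_only(path: Path) -> None:
    """Remove write permission on a file for owner, group and others."""
    mode = path.stat().st_mode
    path.chmod(mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))


# Demonstration in a throw-away directory instead of a real storage system.
with tempfile.TemporaryDirectory() as tmp:
    project_dir = create_project(Path(tmp), "demo_project")
    raw = project_dir / "data" / "reads.fastq"
    raw.write_text("@read1\nACGT\n+\nIIII\n")  # placeholder content
    make_read_only(raw)
    print(sorted(p.name for p in project_dir.iterdir()))
    print(stat.filemode(raw.stat().st_mode))
```

On a shared cluster, file modes alone are rarely sufficient: group ownership or ACLs managed by the data managers would normally complement them.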

Data is stored with metadata

What is metadata?

Metadata ¹ describes the data (main characteristics), such as author, creation date, content, format, associated project, associated sample, etc.

Whatever the type of data collected, it should be described using metadata standards, e.g. controlled vocabularies, ontologies, etc. Do not invent your own terms!
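
As an illustration, a metadata record can be checked before it is stored next to the data. The sketch below is not a standard: the field names mirror the characteristics listed above, and the list of allowed formats is a hypothetical stand-in for a real controlled vocabulary.

```python
import json

# Hypothetical controlled vocabulary for one field; in practice, take the
# allowed terms from a community standard rather than from a local list.
ALLOWED_FORMATS = {"FASTQ", "FASTA", "BAM", "GFF3"}

# Characteristics named on the slide, written as field names.
REQUIRED_FIELDS = ("author", "creation_date", "content",
                   "format", "project", "sample")


def validate_metadata(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS
                if not record.get(name)]
    fmt = record.get("format")
    if fmt and fmt not in ALLOWED_FORMATS:
        problems.append(f"format not in controlled vocabulary: {fmt}")
    return problems


record = {
    "author": "Jane Doe",  # placeholder values, not real data
    "creation_date": "2024-01-15",
    "content": "raw sequencing reads",
    "format": "FASTQ",
    "project": "demo_project",
    "sample": "sample_01",
}
print(validate_metadata(record))
print(json.dumps(record, indent=2))
```

Rejecting a free-text value such as "fastq.gz" at this stage is much cheaper than correcting it at submission time.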

Data licensing

Data has to be associated with a license that describes how data can be used by anyone.

As a French public research institute, we should use the ETALAB ¹ public license from the Direction Interministérielle du Numérique (DINUM).

There are several types of Creative Commons (CC) licenses ²: CC0, CC-BY, CC-BY-NC, etc.

In practice, however, we should use the CC-BY license to stay within an international legal framework. Not using the CC-BY license prevents ENA submission.

Making data accessible to the community

When it is time to publish your work, you will have to make your data accessible to everyone.

There are three major ways of doing that:

using the services of a certified public research institute
- e.g. Ifremer’s Oceanographic Data Catalog (Sextant)
using the services of country-based initiatives
- e.g. French National Data Catalog (recherche.data.gouv.fr)
using one of the “Big Three”: NCBI, EBI, DDBJ
- e.g. European Bioinformatics Institute (EBI)

Making data accessible to the community

Keep in mind that:

French research institutes may or may not require data to be in-house hosted
scientific publishers may or may not require your data to be available from “The Big Three” data providers

As a consequence, it is very important to think about public data release at an early stage of the project, because it is a very time-consuming step.

The DMP should be used to answer this question and provide appropriate solution(s).

Making data accessible to the community

Using the services of a certified public research institute

Ifremer Sextant Catalog of Oceanographic Data - CoreTrustSeal Certified

Making data accessible to the community

Using the services of country-based initiatives

Recherche Data Gouv ¹ has been designed to support research teams in their work of structuring data to make it easy to find, accessible, interoperable and reusable - in other words, in line with the “FAIR” principles. These services are operated by INRAE.

Catalog entries for ‘Biodiversity and Ecology’ projects published in 2022

Making data accessible to the community

Using one of the “Big Three”: NCBI, EBI, DDBJ

Publish sequencing data to international databases for international referencing.

The International Nucleotide Sequence Database Collaboration (INSDC) collects and disseminates DNA and RNA sequences through its member databases.
All data is available for free and unrestricted access, for any purpose, with no restrictions on analysis, redistribution, or re-publication of the data (CC-BY license).

Data submission at ENA

Practical use of EBI-ENA data submission service

Suppose we want to publish our sequencing data in the European Nucleotide Archive (ENA), the databank managed by the European Bioinformatics Institute (EBI), Hinxton, UK.

You have to:

Understand the metadata model to figure out what mandatory information you have to provide
Collect all that information and set up document(s) ready for submission
Submit metadata along with data using EBI tools

ENA metadata model description

Sequencing data will be described using 5 objects

https://ena-docs.readthedocs.io/en/latest/submit/study.html

ENA metadata model description

About these 5 objects

ENA metadata model description

Metadata are described using ENA checklists

These are controlled vocabularies organized as a collection of scientific themes.

There is a minimum amount of information required during ENA sample registration and all samples must conform to a defined checklist of expected metadata values.

These sample checklists have been developed to meet the needs of different research communities. Different communities have different requirements on the minimum metadata expected to describe biological samples.

About ontologies

Ontologies are sets of terms (each with a unique ID and a description). Each ontology covers a particular scientific field and is validated by the corresponding scientific community.

Ontologies are interoperable, logically well-defined, machine readable controlled vocabularies.

A few examples:

The Gene Ontology (GO) ¹ aims to maintain and develop a controlled vocabulary of gene and gene product attributes
EDAM ² is an ontology of well established, familiar concepts that are prevalent within bioinformatics, including types of data and data identifiers, data formats, operations and topics.
ENVO ³ is an ontology which represents knowledge about environments, environmental processes, ecosystems, habitats, and related entities
and so on: more ontologies ⁴

ENA checklists

As of November 2023, there are about 40 ENA sample checklists:

Do not invent your own terms to describe your metadata. Use one of these checklists instead!

Collecting metadata for ENA submission

In practice, how to collect all that information?

A possible solution can be to set up a spreadsheet document containing the many pieces of information (i.e. metadata) useful to describe your sequencing data.

Pros:

a unique file to store all metadata
a self-documented file that can help users to fill in fields
capable of modeling ENA metadata using several tabs
capable of implementing controlled vocabularies
sharable and editable by several users through a cloud-based storage system
can be password-protected against any further modification

Once such a document is completely filled in, it must be locked against any further modification and stored next to the raw data.
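
Before locking the document, its content can be checked automatically. The following Python sketch assumes the spreadsheet has been exported as tab-separated text; the column names and allowed values are hypothetical examples, and the real ones must come from the ENA checklist chosen for the samples.

```python
import csv
import io

# Hypothetical mandatory columns and allowed values; the real ones come
# from the ENA checklist chosen for the samples.
MANDATORY = ("sample_alias", "taxon_id", "collection_date")
ALLOWED = {"platform": {"ILLUMINA", "OXFORD_NANOPORE", "PACBIO_SMRT"}}


def check_sheet(tsv_text: str) -> list[str]:
    """Report empty mandatory cells and values outside the vocabulary."""
    errors = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for col in MANDATORY:
            if not (row.get(col) or "").strip():
                errors.append(f"line {line_no}: '{col}' is empty")
        for col, allowed in ALLOWED.items():
            value = (row.get(col) or "").strip()
            if value and value not in allowed:
                errors.append(
                    f"line {line_no}: '{value}' not allowed in '{col}'")
    return errors


# Placeholder sheet: the second sample has two deliberate mistakes.
sheet = (
    "sample_alias\ttaxon_id\tcollection_date\tplatform\n"
    "s1\t9606\t2024-01-15\tILLUMINA\n"
    "s2\t\t2024-01-16\tNanopore\n"
)
for message in check_sheet(sheet):
    print(message)
```

Running such a check each time the sheet is edited catches empty mandatory cells and free-text values long before the submission step.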

Collecting metadata for ENA submission

Here is an example of a spreadsheet-based document designed to collect metadata following ENA requirements ¹:

ENA submission

Time to publish data and metadata?

Time-consuming
Difficult to handle
Several tools available

ENA submission tools

ENA allows submissions via three routes, each of which is appropriate for a different set of submission types. You may be required to use more than one in the process of submitting your data:

Interactive Submissions are completed by filling out web forms directly in your browser and by downloading template spreadsheets that can be completed off-line and uploaded to ENA. This is often the most accessible submission route.
Command Line Submissions use ENA’s bespoke Webin-CLI program. It validates your submissions entirely before you complete them, giving you maximum control of the process.
Programmatic Submissions are completed by preparing your submissions as XML documents and sending them to ENA either with a program such as cURL or through the Webin Portal.

Pros of programmatic submission: standard code, automatic submission (limit errors of manual handling), possibility to run a dry submission (test).
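
To give an idea of what "preparing XML documents" means, here is a minimal Python sketch that generates a submission document and a study (project) document. The element names follow the layout of the ENA programmatic submission documentation as far as we understand it and should be checked against the current ENA schemas; the alias, title and description are placeholders, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_submission_xml(action: str = "ADD") -> str:
    """Build a minimal submission document holding a single action."""
    submission = ET.Element("SUBMISSION")
    actions = ET.SubElement(submission, "ACTIONS")
    ET.SubElement(ET.SubElement(actions, "ACTION"), action)
    return ET.tostring(submission, encoding="unicode")


def build_project_xml(alias: str, title: str, description: str) -> str:
    """Build a minimal study (project) document; layout is an assumption."""
    project_set = ET.Element("PROJECT_SET")
    project = ET.SubElement(project_set, "PROJECT", alias=alias)
    ET.SubElement(project, "TITLE").text = title
    ET.SubElement(project, "DESCRIPTION").text = description
    kind = ET.SubElement(project, "SUBMISSION_PROJECT")
    ET.SubElement(kind, "SEQUENCING_PROJECT")
    return ET.tostring(project_set, encoding="unicode")


# Placeholder values only; nothing is sent to ENA by this sketch.
print(build_submission_xml())
print(build_project_xml("demo_alias", "Demo study", "Placeholder description"))
```

Generating the XML from code rather than by hand is what makes the submission repeatable: the same script can first be run as a test (dry) submission and then reused unchanged for the real one.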

ENA submission done

Example of a catalog entry programmatically submitted to ENA:

Metadata related to ENA Study object

Are you FAIR?

Tool to check FAIRness of your materials: FAIR-Checker

Example for the Nextflow-based pipeline SAMBA published on ELIXIR WorkflowHub:

FAIRness evaluation

References

References (1/2)

Introduction to Biological Databases In Bioinformatics: Types and Applications Biological databases
Additional links related to DMP:
- https://www.universite-paris-saclay.fr/recherche/science-ouverte/les-donnees-de-la-recherche/introduction-aux-plans-de-gestion-de-donnees
Additional links related to Open Science:
- https://www.enseignementsup-recherche.gouv.fr/fr/le-plan-national-pour-la-science-ouverte-2021-2024-vers-une-generalisation-de-la-science-ouverte-en-48525
- https://www.inrae.fr/actualites/plan-donnees-pour-la-science
EDAM ontology:
- https://github.com/edamontology/edamontology
- https://edamontology.github.io/edam-browser/#operation_0525
Metadata links:
- https://fairsharing.org/
- https://ardc.edu.au/resource/metadata/

References (2/2)

Submission tools:
- athENA
- MadBot