Preparation guide

This guide describes how to prepare your data for upload to BigPicture.

  1. Check folder structure
  2. Verify XML file structure

Check folder structure

Before you submit your data, please make sure that the folder structure of your dataset complies with the layout below.

  1. The root folder of the submission must be the dataset folder, which contains several subfolders. See the example folder structure below; a shell sketch for checking the layout follows this list:
    DATASET_{IDENTIFIER}*
    |--- METADATA
    |    |--- dataset.xml (contains: Dataset)
    |    |--- policy.xml (contains: Policy)
    |    |--- image.xml (contains: Images)
    |    |--- annotation.xml (contains: Annotations) 
    |    |--- observation.xml (contains: Observations)
    |    |--- observer.xml (contains: Observers)
    |    |--- sample.xml (contains: Biological Beings, Cases (if present), Specimens, Blocks and Slides)
    |    |--- staining.xml (contains: Stainings)
    |--- IMAGES
    |    |--- IMAGE_{IDENTIFIER}**
    |    |    |--- *.dcm files of an Image
    |    |--- IMAGE_{IDENTIFIER}**
    |    |    |--- *.dcm files of an Image
    |--- ANNOTATIONS+
    |    |--- *.geojson
    |--- LANDING_PAGE***
    |    |--- landing_page.xml (contains: Landing Page)
    |    |--- THUMBNAILS
    |    |    |--- *.jpg
    |--- PRIVATE**** - not shared with users
    |    |--- rems.xml - not shared with users
    |    |--- organisation.xml - not shared with users
    |    |--- datacite.xml (contains: DataCite, optional) - not shared with users

    *    The root folder must be written as "DATASET_{IDENTIFIER}", with
         IDENTIFIER being either the accession ID of the Dataset generated by the
         repository (when data is downloaded), or the ALIAS defined by the
         submitter at dataset creation and submission.
    **   Folders containing WSI files (i.e. *.dcm) must be named
         "IMAGE_{IDENTIFIER}", with IDENTIFIER being either the accession ID
         (generated by the repository) of the Image the files belong to (when
         data is downloaded), or the ALIAS defined by the submitter at dataset
         creation and submission.
    ***  IMPORTANT: Anything in this folder should be expected to be visible to
         the entire world.
    +    If the dataset does not contain Annotations, the respective .xml file
         and ANNOTATIONS directory can be omitted.
    **** This folder contains metadata that is not shared with users who have
         been granted access to the dataset.
  2. All files must be encrypted with crypt4gh and must have the .c4gh extension, e.g. image.xml.c4gh, image1.dcm.c4gh, etc.; see the encryption sketch after this list.
  3. The metadata should be stored in two separate subfolders: METADATA and PRIVATE.
  4. The only files that may exist in the METADATA folder are the following: dataset.xml, image.xml, observation.xml, observer.xml (optional), policy.xml, sample.xml, annotation.xml (optional) and staining.xml.
  5. The only files that may exist in the PRIVATE folder are the following: rems.xml, organisation.xml and datacite.xml (optional).
  6. The file image.xml should include the full path of each DICOM image as well as the checksums of both the encrypted and unencrypted files (a checksum sketch follows this list), e.g.:
    <FILES>
        <FILE filename="IMAGES/IMAGE_{IDENTIFIER}/*.dcm" checksum_method="SHA256" checksum="<encrypted_checksum>" unencrypted_checksum="<unencrypted_checksum>" filetype="dcm"/>
    </FILES>
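
As a quick sanity check of the layout described in point 1, a minimal shell sketch could look like the following (DATASET_MYALIAS is a hypothetical folder name; only the required files from the structure above are checked):

    dataset=DATASET_MYALIAS  # placeholder: your dataset root folder

    # Required top-level folders
    for dir in METADATA IMAGES; do
        [ -d "$dataset/$dir" ] || echo "missing folder: $dir"
    done

    # Required metadata files (observer.xml and annotation.xml may be omitted)
    for f in dataset.xml policy.xml image.xml observation.xml sample.xml staining.xml; do
        [ -f "$dataset/METADATA/$f" ] || echo "missing file: METADATA/$f"
    done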
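The encryption step in point 2 might be scripted roughly as below, assuming the crypt4gh command-line tool is installed; recipient.pub is a placeholder for the repository's actual public key:

    # Encrypt every file that is not yet encrypted, appending the .c4gh extension.
    # recipient.pub is a placeholder for the repository's crypt4gh public key.
    find DATASET_MYALIAS -type f ! -name '*.c4gh' | while IFS= read -r f; do
        crypt4gh encrypt --recipient_pk recipient.pub < "$f" > "$f.c4gh"
    done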
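For the checksums in point 6, sha256sum can produce both attribute values; the file paths below are placeholders:

    # checksum= takes the encrypted file's digest,
    # unencrypted_checksum= the original file's digest.
    unencrypted=$(sha256sum IMAGES/IMAGE_001/slide.dcm | cut -d' ' -f1)
    encrypted=$(sha256sum IMAGES/IMAGE_001/slide.dcm.c4gh | cut -d' ' -f1)
    printf 'checksum="%s" unencrypted_checksum="%s"\n' "$encrypted" "$unencrypted"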

Verify XML file structure

You can check whether your metadata files will pass our validation by using the xmllint tool and the BigPicture metadata schemas. This applies to the XML files in the folders METADATA, LANDING_PAGE and PRIVATE.

  • Download all relevant XSDs.

    • The files can be downloaded directly from the GitHub repo (v1) or (v2),
    • or all together using the script below, which places the files in the folder xsd-files/:
    version=v2  # replace with v1 for metadata version 1
    mkdir -p xsd-files

    # List the XSDs for the chosen release via the GitHub API, then download each one
    curl -s -H "Accept: application/vnd.github.v3+json" \
        "https://api.github.com/repos/imi-bigpicture/bigpicture-metaflex/contents/src?ref=${version}.0.0" |
    jq -r '.[] | .download_url' |
    while IFS= read -r url; do
        xsd_name=$(basename "$url")
        curl -s -L -o "xsd-files/${xsd_name%\?*}" "$url"
    done
  • For each XML file, run xmllint --noout --schema xsd-files/BP.<filename>.xsd <filename>.xml. For example, to check dataset.xml, run the command below; a loop that validates all files at once follows the example:

    xmllint --noout --schema xsd-files/BP.dataset.xsd METADATA/dataset.xml
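
To validate all files in one pass, a small loop can apply the same pattern to the three folders (this assumes every <filename>.xml pairs with a BP.<filename>.xsd, as described above):

    # Validate each XML file against its matching schema.
    for xml in METADATA/*.xml LANDING_PAGE/*.xml PRIVATE/*.xml; do
        [ -e "$xml" ] || continue  # skip folders that are absent
        name=$(basename "$xml" .xml)
        xmllint --noout --schema "xsd-files/BP.${name}.xsd" "$xml"
    done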