Preparation guide
This guide explains how to prepare your data for upload to BigPicture.
Check folder structure
Before you submit your data, please make sure that the folder structure of your dataset complies with the structure below.
- The root folder of the submission must be the dataset folder, which contains several subfolders. See the example folder structure below:
DATASET_{IDENTIFIER}*
|--- METADATA
| |--- dataset.xml (contains: Dataset)
| |--- policy.xml (contains: Policy)
| |--- image.xml (contains: Images)
| |--- annotation.xml (contains: Annotations)
| |--- observation.xml (contains: Observations)
| |--- observer.xml (contains: Observers)
| |--- sample.xml (contains: Biological Beings, Cases (if present), Specimens, Blocks and Slides)
| |--- staining.xml (contains: Stainings)
|--- IMAGES
| |--- IMAGE_{IDENTIFIER}**
| | |--- *.dcm files of an Image
| |--- IMAGE_{IDENTIFIER}**
| | |--- *.dcm files of an Image
|--- ANNOTATIONS+
| |--- *.geojson
|--- LANDING_PAGE***
| |--- landing_page.xml (contains: Landing Page)
| |--- THUMBNAILS
| | |--- *.jpg
|--- PRIVATE**** - not shared with users
| |--- rems.xml - not shared with users
| |--- organisation.xml - not shared with users
| |--- datacite.xml (contains: DataCite, optional) - not shared with users
* The root folder must be named "DATASET_{IDENTIFIER}", with
IDENTIFIER being either the accession ID of the Dataset generated by the
repository (when data is downloaded), or the ALIAS defined by the
submitter at dataset creation and submission.
** Folders containing WSI files (i.e. *.dcm) must be named
"IMAGE_{IDENTIFIER}", with IDENTIFIER being either the accession ID of the
Image the files belong to, generated by the repository (when data is
downloaded), or the ALIAS defined by the submitter at dataset creation or
submission.
*** IMPORTANT: Anything in this folder should be expected to be visible to
the entire world.
+ If the dataset does not contain Annotations the respective .xml files
or directory can be omitted.
**** This folder contains metadata that will not be shared with users who
have been granted access to the dataset.
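The naming rules in the notes above can be checked locally before submission. The following is a minimal sketch, not an official Bigpicture tool: check_layout is a hypothetical helper name, and treating METADATA, IMAGES, LANDING_PAGE and PRIVATE as required folders is an assumption based on the example tree (only ANNOTATIONS is described as omittable).

```shell
# Sketch: report deviations from the expected dataset folder layout.
# check_layout is a hypothetical helper, not part of any Bigpicture tool.
check_layout() {
  local root="${1%/}" status=0 dir
  # The root folder must be named DATASET_{IDENTIFIER}.
  case "$(basename "$root")" in
    DATASET_?*) ;;
    *) echo "root folder must be named DATASET_{IDENTIFIER}: $root"; status=1 ;;
  esac
  # Folders treated as required in this sketch (assumption); ANNOTATIONS may be omitted.
  for dir in METADATA IMAGES LANDING_PAGE PRIVATE; do
    [ -d "$root/$dir" ] || { echo "missing folder: $dir"; status=1; }
  done
  # Every entry directly under IMAGES must be a folder named IMAGE_{IDENTIFIER}.
  for dir in "$root"/IMAGES/*; do
    [ -e "$dir" ] || continue
    case "$(basename "$dir")" in
      IMAGE_?*) [ -d "$dir" ] || { echo "not a folder: $dir"; status=1; } ;;
      *) echo "unexpected entry in IMAGES: $(basename "$dir")"; status=1 ;;
    esac
  done
  return "$status"
}
```

Calling check_layout path/to/DATASET_myalias prints one line per problem and returns a non-zero status if any problem was found.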
- All the files should be encrypted with crypt4gh and the extension must be c4gh, e.g. image.xml.c4gh, image1.dcm.c4gh, etc.
- The metadata should be stored in two different subfolders: METADATA and PRIVATE.
- The only files that may exist in the METADATA folder are the following: dataset.xml, image.xml, observation.xml, observer.xml (optional), policy.xml, sample.xml, annotation.xml (optional) and staining.xml.
- The only files that may exist in the PRIVATE folder are the following: dac.xml and submission.xml.
- The file image.xml should include the full path of each DICOM image and also the checksums of both encrypted and unencrypted files, e.g.:
<FILES>
<FILE filename="IMAGES/IMAGE_{IDENTIFIER}/*.dcm" checksum_method="SHA256" checksum="<encrypted_checksum>" unencrypted_checksum="<unencrypted_checksum>" filetype="dcm"/>
</FILES>
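The two checksums in each FILE element can be computed with standard tools once the image has been encrypted. The sketch below assumes GNU coreutils' sha256sum is available. file_entry is a hypothetical helper name. The path written into the filename attribute is whatever you pass as the first argument, since this guide only shows it as a pattern.

```shell
# Sketch: print a <FILE> element for one image file.
# file_entry is a hypothetical helper; arguments are:
#   $1  path to write into the filename attribute
#   $2  the unencrypted file on disk
#   $3  the crypt4gh-encrypted file on disk
file_entry() {
  local listed_name="$1" plain="$2" encrypted="$3" enc_sum plain_sum
  # sha256sum prints "<hash>  <file>"; keep only the hash.
  enc_sum=$(sha256sum "$encrypted" | cut -d' ' -f1)
  plain_sum=$(sha256sum "$plain" | cut -d' ' -f1)
  printf '<FILE filename="%s" checksum_method="SHA256" checksum="%s" unencrypted_checksum="%s" filetype="dcm"/>\n' \
    "$listed_name" "$enc_sum" "$plain_sum"
}
```

For example, file_entry IMAGES/IMAGE_1/image1.dcm image1.dcm image1.dcm.c4gh prints one line that can be pasted between the FILES tags. The folder and file names in that call are illustrative only.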
Verify XML file structure
You can check whether your metadata files will pass our validation by using the tool xmllint and the Bigpicture metadata schemas. This applies to the XML files in the folders METADATA, LANDING_PAGE and PRIVATE.
Download all relevant XSDs.
- The files can be downloaded directly from the GitHub repo (v1) or (v2),
- or all together using this script, which will place the files in the folder xsd-files/:

version=v2 # replace with v1 for metadata version 1
mkdir -p xsd-files
curl -H "Accept: application/vnd.github.v3+json" \
  "https://api.github.com/repos/imi-bigpicture/bigpicture-metaflex/contents/src?ref=$version.0.0" | jq -r '.[] | .download_url' |
while IFS= read -r url; do
  xsd_name=$(basename "$url")
  curl -o "xsd-files/${xsd_name%\?*}" -J -L "$url" >/dev/null 2>&1
done
For each XML file, run xmllint --noout --schema xsd-files/BP.<filename>.xsd <filename>.xml. For example, to check dataset.xml, run:
xmllint --noout --schema xsd-files/BP.dataset.xsd METADATA/dataset.xml
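To validate every metadata file in one pass, the per-file command can be wrapped in a loop. This is a minimal sketch, assuming that each XML file has a schema named after the BP.<filename>.xsd pattern above and that the XSDs are in xsd-files/ relative to the current directory. schema_for and validate_all are hypothetical helper names.

```shell
# Sketch: validate all XML files of a dataset with xmllint.
# schema_for and validate_all are hypothetical helpers.

# Map METADATA/dataset.xml to xsd-files/BP.dataset.xsd (naming pattern assumed
# to hold for every metadata file).
schema_for() {
  local name
  name=$(basename "$1" .xml)
  printf 'xsd-files/BP.%s.xsd\n' "$name"
}

# Run xmllint on each .xml file in the three metadata folders of the dataset
# root given as $1; return non-zero if any file fails validation.
validate_all() {
  local root="${1%/}" status=0 file
  for file in "$root"/METADATA/*.xml "$root"/LANDING_PAGE/*.xml "$root"/PRIVATE/*.xml; do
    [ -f "$file" ] || continue   # skip unmatched globs and missing folders
    xmllint --noout --schema "$(schema_for "$file")" "$file" || status=1
  done
  return "$status"
}
```

Run validate_all path/to/DATASET_myalias on the unencrypted files, before encrypting them with crypt4gh.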