Importing sequencing data

Download

The import_seq script encapsulates all operations required to manage the high-throughput sequencing data. It will:

  1. Download the data from a remote location such as a sequencing center,
  2. Format the data to comply with a common format. This includes by default renaming files and compressing the FASTQ files with a more efficient algorithm than Gzip.
  3. Input the sequencing run(s) into the database.

The data (FASTQ files) are downloaded into the folder /data/seq/raw/LABXDB_EXP in the following example. By default, import_seq will download data in folder with current date.

Imported data were published in Vejnar et al. For this tutorial, only a few reads were kept per runs to reduce the file sizes.

import_seq --bulk 'LABXDB_EXP' \
           --url_remote 'https://labxdb.vejnar.org/doc/databases/seq/import_seq/example/' \
           --ref_prefix 'TMP_' \
           --path_seq_raw '/data/seq/raw' \
           --squashfs_download \
           --processor '4' \
           --make_download \
           --make_format \
           --make_staging

The sequencing runs will receive temporary IDs prefixed with TMP_. To import multiple projects, different prefixes such TMP_User1 or TMP_User2 can be used.

Tip

If you didn’t install Squashfs, you can:

  • Remove --squashfs_download option and no archive.sqfs will be created and download directory will remain or,
  • Replace --squashfs_download with --delete_download option to delete the non-FASTQ files.

After import

After completion, sequencing data can be found in /data/seq/raw:

/data/seq/raw
└── [ 172]  LABXDB_EXP
    ├── [4.0K]  archive.sqfs
    ├── [ 254]  resa-2h-1_R1.fastq.zst
    ├── [ 271]  resa-6h-1_R1.fastq.zst
    ├── [ 254]  resa-6h-2a_R1.fastq.zst
    └── [ 348]  resa-6h-2b_R1.fastq.zst
  • FASTQ files were compressed with Zstandard.
  • Remaining files (all but FASTQ files) were archived with Squashfs.

In the Run view of LabxDB seq, the 4 runs were imported:

Runs

You can then proceed to Annotating sequencing data.

Last modification: May 26, 2020