Melbourne Bioinformatics Storage

This page provides information on storage options at Melbourne Bioinformatics (formerly VLSCI) and how to make the best use of them. The first sections cover the storage systems, and the later sections provide more details and examples.

Good data management is a key part of good research. Melbourne Bioinformatics offers a variety of storage options for managing your data. The storage systems are common to all of the Melbourne Bioinformatics clusters, and are accessible when you log in to any of the clusters. The storage options are:

  • fast, no-quota, scratch space (/scratch),
  • general, modest quota, project space (/vlsci), and
  • slow, large quota, longer term, HSM space (/hsm).

Options

The storage types differ in how they trade off speed, size and reliability.

Project

Each project has its own project storage space (under /vlsci) that all project members can use. The project space has a quota on the total storage available to all members of the project. If the quota is exceeded then no new data can be saved until space is freed by removing old files and directories. If there is not enough storage left for running jobs to save their output, those jobs may crash, and subsequent jobs will also fail until space is cleaned up. Project space is backed up daily. Projects are assigned space in /vlsci by default, and many projects do not have any other storage type.

Scratch

Projects with large temporary storage requirements, for files that need to persist between successive jobs, can request a dedicated temporary storage area on the /scratch file system. Project leaders can Request Scratch Space for use by all project members through the Melbourne Bioinformatics user portal.

All project members need to be aware that the policy for /scratch allocations is:

  • Scratch space is intended for the temporary storage of files spanning jobs that may rely on the output of other jobs.
  • Scratch space will have no disk quota limits applied (subject to availability).
  • Files are not backed up.
  • Files that have not been accessed or modified for 60 days or more will be automatically deleted (without notification).
  • If space is needed, files will be deleted (oldest and largest files first) to ensure usable free space is available.

It is the responsibility of all project members to ensure that necessary data is moved to a safer location.
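As a rough way to spot files that are approaching the 60-day limit, the sketch below lists files under one directory that have been neither accessed nor modified for a given number of days. The function name scratch_at_risk and the example path are illustrative and are not Melbourne Bioinformatics tools. It is best pointed at a small sub-directory, since walking a large tree can be slow.

```shell
# Hypothetical helper (not a Melbourne Bioinformatics tool).
# List regular files under DIR that were neither accessed nor modified
# in the last DAYS days (default 60, matching the /scratch policy).
# DAYS must be 1 or more.
scratch_at_risk() {
    dir=$1
    days=${2:-60}
    # find's "+N" means "more than N whole days", so N = DAYS - 1
    # selects files that are DAYS days old or more.
    n=$((days - 1))
    find "$dir" -type f -atime +"$n" -mtime +"$n" -print
}

# Example: scratch_at_risk /scratch/PROJECTID/member1/test1 50
```

Using a smaller number of days, as in the commented example, gives some warning before files become eligible for deletion.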

Hierarchical Storage Management (HSM)

For projects that have large input, reference, or output files that may be needed again over the life of the project, use of the Hierarchical Storage Management system (HSM) may suit. Project managers can request HSM for the project via Melbourne Bioinformatics help.

The /hsm file system uses a combination of disk and tape storage. HSM keeps recently accessed files on disk, but may migrate large (>10MB) inactive files to tape. Should inactive files be subsequently accessed, they will automatically be migrated back to disk after a short delay of a few minutes. There is limited disk space on HSM, so actively used files, and small files (<10MB) essentially block space needed for recalling large files from tape. For this reason, HSM should not be used for active jobs. The HSM file system is backed up daily.

Summary of Melbourne Bioinformatics Storage Systems

| Location | Size | Backed up | Speed | Clean up | Best Use |
|----------|------|-----------|-------|----------|----------|
| /vlsci | limited | daily | normal | manual | day to day file management |
| /scratch | no-quota | never | fast | automatic | input and intermediary files for running and queued jobs |
| /hsm | large tape, limited disk | daily | slow | manual | store large files for later use |

Workflow

Every project member has a home directory, and there is a common shared directory that can be used by all members of the project. Typically a home directory is relatively small, and only contains personal configuration files. The shared storage area is best for research input and output data. Depending on the project’s data needs, this storage space will be included in the project space, or can be on HSM.

It is recommended that data needed for a job or set of jobs should be staged by copying (or moving) the data to a location specifically for running jobs. Every running job has a unique scratch space accessed by the environment variable $TMPDIR. This location is removed when the job ends.

If your project has scratch space, data should be staged in and out of the scratch space before and after your job or jobs have run. It is possible to script this process, but it is up to your project to determine the best process for your workflow. If data is being copied from HSM to scratch, please allow time for any recall of files that might have migrated to tape.
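The staging steps described above could be scripted along the following lines. This is a minimal sketch and not a Melbourne Bioinformatics tool: the function name run_staged and the input/ and output/ sub-directory layout are assumptions made for illustration, and your project may prefer a different process.

```shell
# Hypothetical staging helper: copy input to a private working area,
# run a command there, then copy the results back.
# Usage: run_staged INPUT_DIR OUTPUT_DIR COMMAND [ARGS...]
run_staged() {
    input_dir=$1
    output_dir=$2
    shift 2

    # Work in a fresh directory under $TMPDIR (set for each running job);
    # /tmp is used only so the sketch can also be tried outside a job.
    work_dir=$(mktemp -d "${TMPDIR:-/tmp}/stage.XXXXXX") || return 1
    mkdir -p "$work_dir/input" "$work_dir/output" "$output_dir" || return 1

    # Stage in: copy the input data next to the computation.
    cp -R "$input_dir"/. "$work_dir/input/" || return 1

    # Run the command inside the working directory.
    ( cd "$work_dir" && "$@" )
    status=$?

    # Stage out: save the results before the working area is removed.
    cp -R "$work_dir/output"/. "$output_dir/" || status=1
    rm -rf "$work_dir"
    return "$status"
}
```

The command is expected to read from input/ and write to output/ relative to its working directory, for example: run_staged /vlsci/PROJECTID/shared/run1 /vlsci/PROJECTID/shared/results1 sh -c 'sort input/data.txt > output/sorted.txt' (paths and file names here are placeholders).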

Managing Storage

The following sections give examples of tools that are available to help you manage your storage.

Project Storage

By default a project only gets storage in the project storage space, /vlsci. Scratch and HSM storage can be requested by project managers.

To see what storage types your project has, you can look at each of /vlsci, /scratch, /hsm. For example, if your project is VR1234, use the command line:

$ ls -d -1 /*/VR1234
/hsm/VR1234
/scratch/VR1234
/vlsci/VR1234

The output shows each filesystem that your project has space on. If a filesystem is not listed, your project does not have space on that system.
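In a script it can be useful to test for each storage area explicitly rather than read the ls output. The following is a small sketch; the function name project_storage and its optional second argument are illustrative only.

```shell
# Hypothetical helper: report which storage areas exist for a project.
# The optional second argument is a prefix placed before /vlsci, /scratch
# and /hsm. It is empty by default (the real filesystems) and exists only
# so the sketch can be tried against a test directory.
project_storage() {
    project=$1
    root=${2:-}
    for fs in vlsci scratch hsm; do
        if [ -d "$root/$fs/$project" ]; then
            echo "$fs: yes"
        else
            echo "$fs: no"
        fi
    done
}

# Example: project_storage VR1234
```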

Every project has the same basic file structure. When you first log on to one of Melbourne Bioinformatics' systems, a private home directory will be created for you in your default project. (If you are in more than one project, you can change your default project from the Melbourne Bioinformatics Account Management website.) For example, the following command line shows, for project PROJECTID:

  • the members' home directories,
  • a shared directory, and
  • a usage.report file.

The / at the end of the names indicates a directory.

$ cd /vlsci/PROJECTID
$ ls -1 -F
member1/
member2/
member3/
shared/
usage.report

If you see shared@, then the shared directory is a link to the project's shared directory on the HSM filesystem.

Usage

The way to determine how much storage you and your project are using will depend on the storage system being used. Melbourne Bioinformatics storage systems are extremely large compared to your average household computer. Just listing every file could take hours. For this reason we have set up reporting tools that run each night. It is best to make use of these reports, rather than searching for the files yourself.

Project Usage

To get a quick view of how much quota your project has, and how much it has used, in the project space (/vlsci), you can run the following command line:

df -h /vlsci/PROJECTID

For example,

$ df -h /vlsci/PROJECTID
Filesystem      Size  Used Avail Use% Mounted on
/dev/projects   750G   47G  704G   7% /vlsci

where PROJECTID is your project id. In this example, the quota is 750GB and 47GB has been used. (You can see the filesystem is called /dev/projects and mounted as /vlsci.)
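If you want to check the quota from a script, for example before submitting a job that writes a lot of output, the Use% column can be extracted from df. The sketch below is one way to do this. The function names usage_percent and check_quota and the default 90% threshold are assumptions for illustration.

```shell
# Hypothetical helper: print the Use% figure for a path as a bare number.
# df -P forces the POSIX output format, which keeps each filesystem on a
# single line even when the device name is long.
usage_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: warn when usage reaches a threshold (default 90%).
check_quota() {
    limit=${2:-90}
    used=$(usage_percent "$1")
    [ -n "$used" ] || return 1
    if [ "$used" -ge "$limit" ]; then
        echo "WARNING: $1 is ${used}% full"
    else
        echo "OK: $1 is ${used}% full"
    fi
}

# Example: check_quota /vlsci/PROJECTID 80
```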

You can also view a more detailed report on the usage by individual user within a project using the command:

$ cat /vlsci/PROJECTID/usage.report
User            |Storage Used GB|              File Counts              | Quota
                |   Disk|  Total| Ordinary Directory  Symlinks       All|Used %
member1             2489    2448   1793587     59820     21974   1875381     24
member2             1038     965   3787543   2360848      3219   6151610     10
member3              316     782     18879      1494       141     20514      3
member4                0       0         0         1         0         1      0
member5             1522    1507    867844    155317     16339   1039500     14

TOTALS              5882    6217   6560686   2588492     48455   9197633     57

This shows, for each owner of the data, the physical disk space used and the total system space used (these can differ depending on how the system stores the data). Also shown are the number of files, directories and links to files. The percentage is of the project's total project space quota. This report is only updated once every 24 hours.

Note that having many small directories may also impact on your usage, since there is an overhead in storing all the information about all the directories.
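To see at a glance who is using most of the quota, the report can be filtered and sorted with standard tools. The following sketch assumes the column layout shown in the example report (user, disk GB, total GB, four file counts, quota %); the function name top_users is illustrative.

```shell
# Hypothetical helper: list members by share of the project quota,
# largest first. Header lines, blank lines and the TOTALS row are skipped
# by requiring eight fields with a numeric second field.
top_users() {
    awk 'NF == 8 && $1 != "TOTALS" && $2 ~ /^[0-9]+$/ {
             printf "%s%% %s %s GB\n", $8, $1, $2
         }' "$1" | sort -nr
}

# Example: top_users /vlsci/PROJECTID/usage.report
```

Each output line gives the quota percentage, the user name and the disk space used in GB.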

HSM Usage

For the HSM filesystem, /hsm, it is best to use the command line:

$ cat /hsm/PROJECTID/usage.report
User            |Storage Used GB|              File Counts              | Quota
                |   Disk|  Total| Resident    PreMig  Migrated       All|Used %
member1             7297    8928   5880567    108422      8065   6094140     87
member2                0       0      1342         0         0      2647      0

TOTALS              7297    8928   5881910    108422      8065   6096790     87

where PROJECTID is your project id. This shows, for each owner of the data, the space used by data on disk and the total system space used (including space used on tape). Also shown are the number of files that are resident on disk, pre-migrated on disk and tape, and migrated to tape only. The percentage is of the project's total HSM quota. This report is only updated once every 24 hours.

Note that having many small directories may also impact on your usage, since there is an overhead in storing all the information about all the directories. This has greater impact on the HSM system where it is possible to run out of meta-data space before storage space.

On HSM, files smaller than 10MB are never migrated to tape. A side effect is that many small files take up disk space that is needed when recalling files from tape, and a recall can be slower if other files must first be moved to tape to make room for your file. In general, it is good practice to bundle files and directories into a single (compressed) file. See below for examples of moving data.
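To judge whether a directory is worth bundling, you can count how many of its files fall under the 10MB threshold. This is a minimal sketch: the function name count_small_files is illustrative, and 10MB is taken here to mean 10 x 1024 x 1024 bytes, which is an assumption. Run it on a single sub-directory rather than a whole project, since walking a large tree is slow.

```shell
# Hypothetical helper: count regular files under DIR smaller than 10 MB,
# i.e. files that HSM will not migrate to tape.
# The size is given in bytes ("c") so that find does not round it up to
# whole megabytes before comparing.
count_small_files() {
    find "$1" -type f -size -10485760c | wc -l | tr -d ' '
}

# Example: count_small_files /hsm/PROJECTID/shared/run1
```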

Freeing Storage

Determining which files need to be moved off Melbourne Bioinformatics systems, or moved between Melbourne Bioinformatics filesystems, can be onerous and time-consuming. It is important to review your data practices and needs for good experimental workflow, and to avoid the accumulation of large amounts of data that need moving or deleting long after they were needed. On the /vlsci file system, the space consumed by deleted files immediately becomes available for use. On the /hsm file system, if the deleted files were resident only on tape, it may take up to 24 hours before the space they consumed is available again.

To find files that should be moved or removed, you can use the following command line on Melbourne Bioinformatics systems:

mystaledata FILESYSTEM

where FILESYSTEM is one of vlsci, scratch, or hsm. For example:

$ mystaledata scratch

    Stale data usage report for the user: member1


    Report for scratch filesystem...
    Username             Size(GB)       %Usage
    member1               27.35MB          0.00


    List of "TOP" files under "scratch" that are older than 60 days that can be targeted for cleanup...!

    SIZE       PROJECT    FILENAME
    5.16GB     PROJECTID  /scratch/PROJECTID/member1/test1/biggest.file
    5.16GB     PROJECTID  /scratch/PROJECTID/member1/test2/big1.file
    5.16GB     PROJECTID  /scratch/PROJECTID/member1/test3/big2.file
    1.23GB     PROJECTID  /scratch/PROJECTID/member1/test2/big3.file
    1.23GB     PROJECTID  /scratch/PROJECTID/member1/test3/big4.file


    Total number of files holding stale data under scratch is: 513459

    For list of all files under scratch run:
    /usr/bin/zless /vlsci/data/.staledata/latest/scratch_UID.allfiles_less_than_60days_WEEKYEAR.gz

As noted at the bottom of the output, there is a file containing a list of all old files that can be viewed with the zless command.

Note that for HSM, the top files less than 10MB are shown. These are candidates for bundling into a larger file (>10MB) that can make better use of the larger tape storage.

Moving Data

It is important to ensure that you have copies of all your needed data. The two common methods for managing and moving data are the rsync and tar commands.

For example, to copy files from Melbourne Bioinformatics to your local machine and keep them synchronised, you can use the following command line:

rsync -avz USERNAME@SYSTEM.melbournebioinformatics.org.au:REMOTE-LOCATION LOCAL-LOCATION

Note that you run this command from your local (Unix or Linux based) machine, not from a Melbourne Bioinformatics machine. The command connects as the user USERNAME to the machine SYSTEM.melbournebioinformatics.org.au and synchronises the Melbourne Bioinformatics location REMOTE-LOCATION to the location LOCAL-LOCATION on your local machine. For Windows machines you can use WinSCP.

If you need to keep the files on Melbourne Bioinformatics storage, it is best to bundle them and move the bundle to your project’s shared directory.

For example, to create a compressed file from the contents of a directory, you could use the following command line:

tar -cvzf myhugedirectoryArchive-name.tar.gz MyHugeDirectory

This will create (c) a compressed (z) bundle (tar) file (f) called myhugedirectoryArchive-name.tar.gz for everything in MyHugeDirectory, with verbose (v) output.

To test the difference between a tar file and the source directory, you could use:

tar -dvf myhugedirectoryArchive-name.tar.gz MyHugeDirectory

This will show the difference (d) between the file (f) myhugedirectoryArchive-name.tar.gz and the original MyHugeDirectory with verbose output. There should be no difference if you have just created the tar file.

You can then move the tar file to your shared storage space:

mv myhugedirectoryArchive-name.tar.gz /vlsci/PROJECTID/shared/

This will move (mv) the bundle myhugedirectoryArchive-name.tar.gz to your destination location, e.g. /vlsci/PROJECTID/shared/.

Once you have moved the tar file, don't forget to remove the old directory and all of its files and sub-directories:

rm -rf MyHugeDirectory

Caution: this will recursively (r) force (f) the removal (rm) of all files and directories of MyHugeDirectory.
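Because the final rm cannot be undone, it can be safer to chain the steps so that the directory is only removed after the archive has been verified and moved. The following is a sketch under stated assumptions: the function name archive_and_remove is illustrative, the archive is named after the directory, and GNU tar is assumed for the compare (d) option.

```shell
# Hypothetical helper: archive DIR, verify the archive, move it to DEST,
# and only then remove DIR.
# Usage: archive_and_remove DIR DEST
archive_and_remove() {
    dir=${1%/}
    dest=$2
    parent=$(dirname "$dir")
    base=$(basename "$dir")
    archive="$base.tar.gz"

    (
        # Work from the parent directory so that member names in the
        # archive are relative (MyHugeDirectory/...), not absolute.
        cd "$parent" || exit 1
        tar -czf "$archive" "$base" || exit 1
        # d compares the archive with the files on disk (GNU tar) and
        # fails if they differ.
        tar -dzf "$archive" "$base" >/dev/null || exit 1
    ) || {
        echo "Archive failed or does not match $dir; nothing removed." >&2
        return 1
    }

    mv "$parent/$archive" "$dest"/ || return 1
    rm -rf "$dir"
}

# Example: archive_and_remove MyHugeDirectory /vlsci/PROJECTID/shared
```

If any step fails, the function returns early and the original directory is left in place.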

Recovery

Melbourne Bioinformatics backs up your project space and HSM space, and can often recover accidentally deleted files. However, there is a time limit on backups, so it is important that you contact Help as soon as you realise that you need a file recovered. Files are backed up once a day, so a file that was deleted before it had been backed up cannot be recovered.