Melbourne Bioinformatics Storage
This page provides information on storage options at Melbourne Bioinformatics (formerly VLSCI) and how to make the best use of them. The first sections cover the storage systems, and the later sections provide more details and examples.
Good data management is a key part of good research. Melbourne Bioinformatics offers a variety of storage options for managing your data. The storage systems are common to all of the Melbourne Bioinformatics clusters, and are accessible when you log in to any of the clusters. The storage options are:
- fast, no-quota, scratch space (/scratch)
- general, modest quota, project space (/vlsci)
- slow, large quota, longer term, HSM space (/hsm)
The storage types differ in their trade-off between speed, size and reliability.
Each project has its own project storage space (under /vlsci) that all project members can use. The project space has a quota on the total storage available to all members of the project. If the quota is exceeded then no new data can be saved until space is freed by removing old files and directories. If there is not enough storage left for running jobs to save their output, then the job may crash, and subsequent jobs will also fail until space is cleaned up. Project space is backed up daily. Projects are assigned space in /vlsci by default, and many projects do not have any other storage type.
Projects with large temporary storage requirements, for files that need to persist between successive jobs, can request a dedicated temporary storage area on the /scratch file system. Project leaders can request scratch space for use by all project members through the Melbourne Bioinformatics user portal.
All project members need to be aware that the policy for /scratch allocations is:
- Scratch space is intended for the temporary storage of files spanning jobs that may rely on the output of other jobs.
- Scratch space will have no disk quota limits applied (subject to availability).
- Files are not backed up.
- Files that have not been accessed or modified for 60 days or more will be automatically deleted (without notification).
- If space is needed, files will be deleted (oldest and largest files first) to ensure usable free space is available.
It is the responsibility of all project members to ensure that necessary data is moved to a safer location.
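To see which of your files are approaching the 60-day limit in the policy above, a `find` query on access and modification times is one option. This is a minimal sketch, not a Melbourne Bioinformatics tool: the function name `list_purge_candidates` is hypothetical, and it assumes the clean-up is driven by the same access and modification timestamps that `find` reports. Point it at your own subdirectory rather than the whole of /scratch, since scanning very large directory trees is slow.

```shell
# list_purge_candidates DIR [DAYS]
# Hypothetical helper: print regular files under DIR that have been neither
# accessed nor modified for at least DAYS days (default 60).
list_purge_candidates() {
    local dir="$1"
    local days="${2:-60}"
    # find's "+N" means "more than N whole days", so use DAYS-1 to include
    # files that are exactly DAYS days old.
    local n=$((days - 1))
    find "$dir" -type f -atime +"$n" -mtime +"$n"
}
```

For example, `list_purge_candidates /scratch/PROJECTID/member1` would list the candidate files in one member's scratch directory, where PROJECTID and member1 are placeholders.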
Hierarchical Storage Management (HSM)
For projects that have large input, reference, or output files that may be needed again over the life of the project, use of the Hierarchical Storage Management system (HSM) may suit. Project managers can request HSM for the project via Melbourne Bioinformatics help.
The /hsm file system uses a combination of disk and tape storage. HSM keeps recently accessed files on disk, but may migrate large (>10MB) inactive files to tape. Should inactive files be subsequently accessed, they will automatically be migrated back to disk after a short delay of a few minutes. There is limited disk space on HSM, so actively used files and small files (<10MB) essentially block space needed for recalling large files from tape. For this reason, HSM should not be used for active jobs. The HSM file system is backed up daily.
Summary of Melbourne Bioinformatics Storage Systems
|Location||Size||Backed up||Speed||Clean up||Best Use|
|/vlsci||limited||daily||normal||manual||day to day file management|
|/scratch||no-quota||never||fast||automatic||input and intermediary files for running and queued jobs|
|/hsm||large quota||daily||slow||manual||store large files for later use|
Every project member has a home directory, and there is a common shared directory that can be used by all members of the project. Typically a home directory is relatively small, and only contains personal configuration files. The shared storage area is best for research input and output data. Depending on the project’s data needs, this storage space will be included in the project space, or can be on HSM.
It is recommended that data needed for a job or set of jobs should be staged by copying (or moving) the data to a location specifically for running jobs. Every running job has a unique scratch space accessed by the environment variable $TMPDIR. This location is removed when the job ends.
If your project has scratch space, data should be staged in and out of the scratch space before and after your job or jobs have run. It is possible to script this process, but it is up to your project to determine the best process for your workflow. If data is being copied from HSM to scratch, please allow time for any recall of files that might have migrated to tape.
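The staging pattern described above can be sketched as a small shell function. This is only an illustration under stated assumptions: `stage_and_run` is a hypothetical name, the `input` and `output` subdirectory layout is an arbitrary choice for the example, and your own workflow may need a different process.

```shell
# stage_and_run INPUT_DIR WORK_DIR OUTPUT_DIR COMMAND [ARGS...]
# Hypothetical helper: copy inputs into a work area, run COMMAND from inside
# that work area, then copy whatever it wrote to WORK_DIR/output back out.
stage_and_run() {
    local input_dir="$1"
    local work_dir="$2"
    local output_dir="$3"
    shift 3
    mkdir -p "$work_dir/input" "$work_dir/output" "$output_dir" || return 1
    # Stage in: copy the contents of INPUT_DIR, preserving attributes
    cp -a "$input_dir/." "$work_dir/input/" || return 1
    # Run the job command with the work area as its current directory
    ( cd "$work_dir" && "$@" ) || return 1
    # Stage out: copy results to their longer-term location
    cp -a "$work_dir/output/." "$output_dir/" || return 1
}
```

Inside a job, WORK_DIR could be `$TMPDIR` or a directory in your project's scratch space, and the command would read from `input/` and write to `output/`.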
The following sections give examples of tools that are available to help you manage your storage.
By default a project only gets storage in the project storage space, /vlsci. Scratch and HSM storage can be requested by project managers.
To see what storage types your project has, you can look at each of /vlsci, /scratch and /hsm. For example, if your project is VR1234, use the command line:
$ ls -d -1 /*/VR1234
/hsm/VR1234
/scratch/VR1234
/vlsci/VR1234
The output shows each filesystem that your project has space on. If a filesystem is not listed, your project does not have space on that system.
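If you want the check to name the filesystems your project does not have, a short loop is one way to do it. This is a sketch only: `project_storage` is a hypothetical function, and its optional second argument (a prefix prepended to each filesystem path) exists purely so the function can be tried against a test directory.

```shell
# project_storage PROJECT [ROOT]
# Hypothetical helper: report whether PROJECT has a directory on each of the
# three storage systems. ROOT defaults to empty, giving /vlsci, /scratch, /hsm.
project_storage() {
    local project="$1"
    local root="${2:-}"
    local fs
    for fs in vlsci scratch hsm; do
        if [ -d "$root/$fs/$project" ]; then
            echo "$fs: yes"
        else
            echo "$fs: no"
        fi
    done
}
```

For example, `project_storage VR1234` prints one line per filesystem.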
Every project has the same basic file structure. When you first log on to one of the Melbourne Bioinformatics systems, a private home directory will be created for you in your default project. (If you are in more than one project, you can change your default project from the Melbourne Bioinformatics Account Management website.)
For example, the following command line shows, for project PROJECTID, the members' home directories, the shared directory and the usage report. A / at the end of a name indicates a directory.
$ cd /vlsci/PROJECTID
$ ls -1 -F
member1/
member2/
member3/
shared/
usage.report
If you see shared@, then the shared directory is a link to the project's shared directory on the HSM filesystem.
The way to determine how much storage you and your project are using will depend on the storage system being used. Melbourne Bioinformatics storage systems are extremely large compared to your average household computer. Just listing every file could take hours. For this reason we have set up reporting tools that run each night. It is best to make use of these reports, rather than manually find the files yourself (it may take a very long time).
To get a quick view of how much quota your project has, and has used, in the project space (/vlsci), you can run the following command line:
df -h /vlsci/PROJECTID
where PROJECTID is your project id. For example:
$ df -h /vlsci/PROJECTID
Filesystem      Size  Used Avail Use% Mounted on
/dev/projects   750G   47G  704G   7% /vlsci
In this example, the quota is 750GB and 47GB has been used. (You can see the filesystem is called /dev/projects and mounted as /vlsci.)
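Because a full project space makes jobs fail, it can be worth checking the usage percentage at the start of a job script. The following is a minimal sketch: `percent_used` and `check_quota` are hypothetical helpers, the 90% default limit is an arbitrary choice, and the parsing assumes the POSIX output format of `df -P` with no spaces in the filesystem name.

```shell
# percent_used: read `df -P` output on stdin and print the usage percentage
# of the first filesystem as a bare number (for example 7).
percent_used() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# check_quota PATH [LIMIT]
# Print a warning and return 1 if PATH's filesystem is at or above LIMIT
# percent full (default 90); return 2 if the usage could not be read.
check_quota() {
    local path="$1"
    local limit="${2:-90}"
    local used
    used=$(df -P "$path" | percent_used)
    [ -n "$used" ] || return 2
    if [ "$used" -ge "$limit" ]; then
        echo "WARNING: $path is ${used}% full (limit ${limit}%)"
        return 1
    fi
    echo "OK: $path is ${used}% full"
}
```

A job script could then begin with `check_quota /vlsci/PROJECTID || exit 1` so that it stops early instead of failing part-way through writing output.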
You can also view a more detailed report on the usage by individual user within a project using the command:
$ cat /vlsci/PROJECTID/usage.report
User     |Storage Used GB|               File Counts               | Quota
         |   Disk|  Total|  Ordinary  Directory  Symlinks       All|Used %
member1      2489    2448    1793587      59820     21974   1875381     24
member2      1038     965    3787543    2360848      3219   6151610     10
member3       316     782      18879       1494       141     20514      3
member4         0       0          0          1         0         1      0
member5      1522    1507     867844     155317     16339   1039500     14
TOTALS       5882    6217    6560686    2588492     48455   9197633     57
This shows, for each owner of the data, the physical disk space used and the total system space used (these can differ depending on how the system stores the data). Also shown are the number of files, directories and links to files. The percentage is of the project's total project space quota. This report is only updated once every 24 hours.
Note that having many small directories may also impact on your usage, since there is an overhead in storing all the information about all the directories.
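To pick out the heaviest users from a usage report, the file can be filtered and sorted with standard tools. This sketch assumes the report layout matches the example shown above (two header lines containing "|", then rows of eight whitespace-separated fields ending with a TOTALS row); `users_by_usage` is a hypothetical helper, not part of the reporting tools.

```shell
# users_by_usage REPORT
# Hypothetical helper: print "user disk_GB" for each member in a usage.report
# file, largest disk usage first.
users_by_usage() {
    local report="$1"
    # Skip header lines (they contain "|") and the TOTALS row; in the assumed
    # layout, field 1 is the user name and field 2 is the disk usage in GB.
    awk 'index($0, "|") == 0 && NF == 8 && $1 != "TOTALS" { print $1, $2 }' "$report" |
        sort -k2,2nr
}
```

For example, `users_by_usage /vlsci/PROJECTID/usage.report` lists members in descending order of disk usage.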
For the HSM filesystem, /hsm, it is best to use the command line:
$ cat /hsm/PROJECTID/usage.report
User     |Storage Used GB|               File Counts               | Quota
         |   Disk|  Total|  Resident     PreMig  Migrated       All|Used %
member1      7297    8928    5880567     108422      8065   6094140     87
member2         0       0       1342          0         0      2647      0
TOTALS       7297    8928    5881910     108422      8065   6096790     87
where PROJECTID is your project id.
This shows, for each owner of the data, the space used by data on disk and the total system space used (including space used on tape). Also shown are the number of files that are resident on disk, pre-migrated on disk and tape, and migrated to tape only. The percentage is of the project's total HSM quota. This report is only updated once every 24 hours.
Note that having many small directories may also impact on your usage, since there is an overhead in storing all the information about all the directories. This has greater impact on the HSM system where it is possible to run out of meta-data space before storage space.
On HSM, files less than 10MB will never be migrated. A side effect is that many small files take up disk space that is needed for recalling files from tape, so a recall can be slower if other files first need to be moved to tape to make room for your file. In general, it is good practice to bundle files and directories into a single (compressed) file. See below for examples of moving data.
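Before moving a directory to HSM, you can check how many files in it fall under the size threshold and would therefore benefit from bundling. This is a sketch with assumptions: `count_small_files` is a hypothetical helper, and the default threshold treats 10MB as 10485760 bytes, which may not match the exact HSM cut-off. Run it on data in project or scratch space rather than on /hsm itself, since files already migrated to tape appear as small stubs to standard utilities.

```shell
# count_small_files DIR [BYTES]
# Hypothetical helper: count regular files under DIR that are strictly smaller
# than BYTES (default 10485760, an assumed value for the 10MB threshold).
count_small_files() {
    local dir="$1"
    local limit="${2:-10485760}"
    # "-size -Nc" matches files smaller than N bytes; tr strips the padding
    # that some versions of wc add to their output.
    find "$dir" -type f -size -"${limit}c" | wc -l | tr -d ' '
}
```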
How much data do I have in HSM?
To help manage files and directories that are stored in the HSM filesystem we provide the 'duh' utility, which we recommend using in preference to 'du', 'ls' and other standard utilities.
Because HSM migrates inactive data onto tapes from time to time leaving just a stub (~0 bytes) on disk, the standard utilities often fail to report the correct size of these migrated files. If files are recalled in order to complete the task, this may take a long time.
Please note: the data queried by 'duh' is updated every few hours. Your most recent changes may not show up until the data is updated.
$ duh -h
[...]
This command expects "-p|--project" as mandatory argument.
Other optional arguments are mentioned below:
"-p|--project" - Project name for Eg: -p UOM9999 or --project UOM9999
"-s|--size" - size in b|k|m|g|t (lower or upper case is ok) for Eg: -s 1g and -s 1G will work.
"--path" - default path is "/hsm/<Project>/"
"--full" - recursive "du -s"
"-h|--help"
Usage: $ duh -p|--project <Project> [-s|--size <size> --path <path> --age <older_than_#days> -h|--help]
Eg: $ duh -p|--project UOM9999 -s|--size 1g --path /hsm/UOM9999/shared --full
Eg: $ duh -p|--project UOM9999 -s|--size 1g --age 180 --path /hsm/UOM9999/shared/My_Research
Determining which files need to be moved off Melbourne Bioinformatics systems, or moved between Melbourne Bioinformatics filesystems, can be onerous and time consuming. It is important to review your data practices and needs for good experimental workflow, and to avoid the accumulation of large amounts of data that need moving or deleting long after they were needed.
On the /vlsci file system, the space consumed by deleted files will immediately become available for use. On the /hsm file system it may take up to 24 hours before the space consumed by the deleted files is again available for your use if the deleted files were only resident on tape.
To find files that should be moved or removed, you can use the following command line at Melbourne Bioinformatics:
mystaledata FILESYSTEM
where FILESYSTEM is one of the storage systems, such as scratch or hsm. For example:
$ mystaledata scratch
Stale data usage report for the user: member1
Report for scratch filesystem...
Username Size(GB) %Usage
member1 27.35MB 0.00
List of "TOP" files under "scratch" that are older than 60 days that can be targeted for cleanup...!
SIZE PROJECT FILENAME
5.16GB PROJECTID /scratch/PROJECTID/member1/test1/biggest.file
5.16GB PROJECTID /scratch/PROJECTID/member1/test2/big1.file
5.16GB PROJECTID /scratch/PROJECTID/member1/test3/big2.file
1.23GB PROJECTID /scratch/PROJECTID/member1/test2/big3.file
1.23GB PROJECTID /scratch/PROJECTID/member1/test3/big4.file
Total number of files holding stale data under scratch is: 513459
For list of all files under scratch run:
/usr/bin/zless /vlsci/data/.staledata/latest/scratch_UID.allfiles_less_than_60days_WEEKYEAR.gz
As noted at the bottom of the output, there is a file containing a list of all old files that can be viewed with the zless command.
Note that for HSM, the top files less than 10MB are shown. These are candidates for bundling into a larger file (>10MB) that can make better use of the larger tape storage.
It is important to ensure that you have copies of all your needed data. Two common tools for managing and moving data, used in the examples here, are rsync and tar.
For example, to copy files from Melbourne Bioinformatics to your local machine and keep them synchronised, you can use the following command line:
rsync -avz USERNAME@SYSTEM.melbournebioinformatics.org.au:REMOTE-LOCATION LOCAL-LOCATION
Note that you run this command from your local (Unix or Linux based) machine, not from Melbourne Bioinformatics machines. The command synchronises, for the user USERNAME, the Melbourne Bioinformatics location REMOTE-LOCATION on the machine SYSTEM.melbournebioinformatics.org.au to the location LOCAL-LOCATION on your local machine. For Windows machines you can use WinSCP.
If you need to keep the files on Melbourne Bioinformatics storage, it is best to bundle them into a single file in your project’s shared directory.
For example, to create a compressed file from the contents of a directory, you could use the following command line:
tar -cvzf myhugedirectoryArchive-name.tar.gz MyHugeDirectory
This will create (c) a compressed (z) bundle (tar) file (f) called myhugedirectoryArchive-name.tar.gz for everything in MyHugeDirectory, with verbose (v) output.
To test the difference between a tar file and the source directory, you could use:
tar -dvf myhugedirectoryArchive-name.tar.gz MyHugeDirectory
This will show the difference (d) between the file (f) myhugedirectoryArchive-name.tar.gz and the original MyHugeDirectory with verbose output. There should be no difference if you have just created the tar file.
You can then move the tar file to your shared storage space:
mv myhugedirectoryArchive-name.tar.gz /vlsci/PROJECTID/shared/
This will move (mv) the bundle myhugedirectoryArchive-name.tar.gz to your destination location, e.g. /vlsci/PROJECTID/shared/.
Once you have moved the tar file, don't forget to remove the old directory and all its files and sub-directories:
rm -rf MyHugeDirectory
Caution: this will recursively (r) force (f) the removal (rm) of all files and directories in MyHugeDirectory.
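The four steps above (create, compare, move, remove) can be combined so that the source directory is only deleted if every earlier step succeeded. This is a minimal sketch rather than a recommended tool: `bundle_and_move` is a hypothetical function, it names the archive after the directory, and it assumes GNU tar, whose `-d` option exits with a non-zero status when the archive and the directory differ.

```shell
# bundle_and_move SRC_DIR DEST_DIR
# Hypothetical helper: archive SRC_DIR as SRC_DIR.tar.gz, verify the archive
# against the directory, move it to DEST_DIR, then remove SRC_DIR.
bundle_and_move() {
    local src="$1"
    local dest parent name archive
    # Resolve DEST_DIR to an absolute path before changing directory
    dest=$(cd "$2" && pwd) || return 1
    parent=$(dirname "$src")
    name=$(basename "$src")
    archive="$name.tar.gz"
    (
        cd "$parent" || exit 1
        tar -czf "$archive" "$name" || exit 1   # create the compressed bundle
        tar -dzf "$archive" "$name" || exit 1   # compare bundle with the source
        mv "$archive" "$dest/" || exit 1        # move bundle to its destination
        rm -rf "$name"                          # only reached if all steps passed
    )
}
```

For example, `bundle_and_move MyHugeDirectory /vlsci/PROJECTID/shared` would leave MyHugeDirectory.tar.gz in the shared directory. If any step fails, the function returns a non-zero status and the source directory is left in place.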
Melbourne Bioinformatics backs up your project space and HSM space, and can often recover accidentally deleted files. However, there is a time limit on backups, so it is important that you contact Help as soon as you become aware of needing help in this area. Files are backed up once a day, so files that were created and deleted between backups cannot be recovered.