These guidelines address the management of large datasets and computationally intensive activities conducted by the UCLA Statistics Department and its Centers. The goals of these guidelines are to:
The system has only limited online disk space for users to store their datasets. In addition, statistical computations frequently move large datasets between the server's directories and its memory. To ensure that these limited resources are shared fairly and effectively, the Department of Statistics has adopted the following policy:
Currently there are several gigabytes of dedicated disk space available for datasets and code on the computation servers (currently compute.stat.ucla.edu). Each computation server has a local /data directory. Use this directory to store all files used in your work: your programs (or source code), the files that your programs write, and the files that your programs read. It is critically important to follow this rule, both to avoid degrading the network and to let your computations run to completion as quickly as possible.
All members of the department have write privileges inside the /data directory. A good practice is to create your own subdirectory there in which to keep your files. This avoids cluttering the /data directory and protects your files.
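As an illustration of the practice above, here is a minimal Python sketch that creates a personal subdirectory under /data. The function name make_personal_dir, the choice of your login name as the directory name, and the owner-only permission mode 0o700 are assumptions made for this example, not departmental requirements; pick a less restrictive mode if collaborators need access.

```python
import getpass
from pathlib import Path


def make_personal_dir(base="/data", name=None, mode=0o700):
    """Create base/name if it does not exist yet and return its path.

    Assumptions for this sketch: the directory is named after the
    current login name, and 0o700 (owner-only access) is the mode.
    """
    name = name or getpass.getuser()
    path = Path(base) / name
    try:
        path.mkdir(mode=mode)
        # mkdir's mode is filtered by the umask, so set it explicitly.
        path.chmod(mode)
    except FileExistsError:
        # Leave an existing directory and its permissions untouched.
        pass
    return path
```

Calling make_personal_dir() on a computation server would create /data/<your username> on first use and simply return the same path afterwards.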
The computation servers are accessible only via SSH. The information needed is: Server: compute.stat.ucla.edu, Directory: /data, Username: your department username, Password: your department password. SSH clients for Windows include PuTTY and WinSCP. SSH is built in to Mac OS X.
To keep space available, files and directories that have not been modified or accessed for over six months are eligible to be archived to tape and then removed from the /data directories.
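To check which of your own files might fall under this rule, something like the following Python sketch could be used. It is not the department's archiving tool. It assumes that "six months" means roughly 183 days and that a file counts as stale only when both its modification time and its access time are older than that; the name find_stale_files is hypothetical.

```python
import os
import time
from pathlib import Path

# Assumption: "six months" is approximated as 183 days.
SIX_MONTHS_SECONDS = 183 * 24 * 60 * 60


def find_stale_files(root, max_age_seconds=SIX_MONTHS_SECONDS, now=None):
    """Return a sorted list of files under root that were neither
    modified nor accessed within the last max_age_seconds."""
    now = time.time() if now is None else now
    cutoff = now - max_age_seconds
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                st = path.lstat()
            except OSError:
                continue  # file vanished or is unreadable; skip it
            # Stale only if the most recent of the two timestamps is old.
            if max(st.st_mtime, st.st_atime) < cutoff:
                stale.append(path)
    return sorted(stale)
```

For example, find_stale_files("/data/yourdir") lists candidate files under a directory of your own, where yourdir is a placeholder for your directory name.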
The /data directories are intended for center and departmental research work. Approval for usage is required from one of the following:
It is possible to work entirely with compressed datasets. The available UNIX compression utilities are:
The preferred utilities are gzip and gunzip. They are the most versatile and efficient. We suggest keeping your datasets compressed and decompressing them in your programs. Here are some methods to do this in various programming environments.
* To output a compressed SAS data set;
options compress=yes;

* To read compressed input;
filename d1 pipe "/usr/bin/gunzip -c '/data/cdc/IntermediateDataset.gz'";
# To read in a compressed file
open(GZIPIN, "/usr/bin/gunzip -c /data/cdc/IntermediateDataset.gz|");

# To output a compressed file
open(GZIPOUT, "|/usr/bin/gzip -f > /data/cdc/AnotherDataset");
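The same idea can be sketched in Python. Note that this example uses Python's standard gzip module rather than piping through /usr/bin/gunzip as the snippets above do. The function names are hypothetical, and text files encoded as UTF-8 are assumed.

```python
import gzip


def write_gzip_lines(path, lines):
    """Write an iterable of text lines to a gzip-compressed file."""
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for line in lines:
            out.write(line + "\n")


def read_gzip_lines(path):
    """Yield text lines from a gzip-compressed file, decompressing
    on the fly so no uncompressed copy is written to disk."""
    with gzip.open(path, "rt", encoding="utf-8") as src:
        for line in src:
            yield line.rstrip("\n")
```

A program could then loop over read_gzip_lines("/data/cdc/IntermediateDataset.gz") and process one record at a time while the dataset stays compressed on disk.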
Date Created: 2001-08-16 22:24:33 Date Last Modified: 2007-02-05 08:19:09