Dataset Management and Computation Guidelines



Overview

These guidelines address the management of large datasets and computationally intensive activities conducted by the UCLA Statistics Department and its Centers. The goals of these guidelines are to:

Working with Data Sets

Only limited online disk space is available for users to store their datasets. In addition, statistical computations frequently move large datasets between the server's disk and its memory. To ensure that these limited resources are shared fairly and effectively, the Department of Statistics has adopted the following policy:

Currently there are several gigabytes of dedicated disk space available for datasets and code on the computation servers (currently compute.stat.ucla.edu). Every computation server has a local /data directory. Use this directory to store all files used in your work: your programs (or source code), the files that your programs write, and the files that your programs read. Following this rule is critically important: it avoids degrading the network and lets your computation run to completion as quickly as possible.

All members of the department have write privileges inside the /data directory. A good practice is to create your own subdirectory there in which to keep your files. This avoids cluttering the /data directory and protects your files.
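
The practice described above can be sketched as a small shell function. This is only an illustration: the function name make_workdir is hypothetical, and the 700 permission mode (owner-only access) is an assumption; loosen it if collaborators need to read your files.

```shell
# Create a personal working directory under a base directory and print its path.
# $1 = base directory (defaults to /data), $2 = directory name (defaults to your login name).
# The 700 mode is an assumption, not a department rule.
make_workdir() {
    base="${1:-/data}"
    name="${2:-$(id -un)}"
    mkdir -p "$base/$name" && chmod 700 "$base/$name" && printf '%s\n' "$base/$name"
}
```

Called with no arguments on a computation server, make_workdir would create /data/ followed by your username.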

The computation servers are accessible only via SSH. The information needed is: Server: compute.stat.ucla.edu; Directory: /data; Username: your department username; Password: your department password. SSH clients for Windows include PuTTY and WinSCP. SSH is built into Mac OS X.
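
From a command-line SSH client, a session might look like the sketch below. The server name and the /data directory come from the text above; the username jdoe, the file name mydata.csv.gz, the helper remote_target, and the existence of a personal directory /data/jdoe are all placeholders or assumptions.

```shell
# Server name as given above.
COMPUTE_HOST="compute.stat.ucla.edu"

# Build the scp destination for a user's personal directory under /data.
# Assumes that directory already exists on the server.
remote_target() {
    printf '%s@%s:/data/%s/\n' "$1" "$COMPUTE_HOST" "$1"
}

# Example (not run here): log in, then copy a local file to the server.
#   ssh jdoe@compute.stat.ucla.edu
#   scp mydata.csv.gz "$(remote_target jdoe)"
```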

Maintenance Policy

To keep space available, files and directories that have not been modified or accessed for more than six months are eligible to be archived to tape and then removed from the /data directories.
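
To see which of your own files might fall under this policy, something like the following sketch may help. The function name stale_files is hypothetical, and the 180-day default is only an approximation of "six months"; the exact cutoff the administrators use is not stated here.

```shell
# List regular files under a directory that have been neither modified
# nor accessed for more than N days.
# $1 = directory (defaults to the current one), $2 = N (defaults to 180).
stale_files() {
    dir="${1:-.}"
    days="${2:-180}"
    find "$dir" -type f -mtime +"$days" -atime +"$days" -print
}
```

Running stale_files on your own directory under /data lists candidates you may want to copy elsewhere or delete yourself.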

Who Qualifies to Use the /data Directories

The /data directories are intended for center and departmental research work. Approval for usage is required from one of the following:

  1. A center PI
  2. A faculty member
  3. The department chair
  4. The network administrator


Working with Compressed Datasets

It is possible to work entirely with compressed datasets. The available UNIX compression utilities are:

The preferred utilities are gzip and gunzip. They are the most versatile and efficient. We suggest keeping your datasets compressed and decompressing them in your programs. Here are some methods to do this in various programming environments.

SAS

* To output a compressed SAS data set;

options compress=yes;

* To read compressed input;

filename d1 pipe "/usr/bin/gunzip -c '/data/cdc/IntermediateDataset.gz'";

Perl

# To read in a compressed file
open(GZIPIN, "/usr/bin/gunzip -c /data/cdc/IntermediateDataset.gz |")
    or die "Cannot start gunzip: $!";

# To output a compressed file
open(GZIPOUT, "| /usr/bin/gzip -f > /data/cdc/AnotherDataset.gz")
    or die "Cannot start gzip: $!";
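
Shell

The same idea works directly from the UNIX shell: pipe the decompressed data into the next command so that no uncompressed copy is written to disk. The following is a minimal sketch; the function names count_lines_gz and filter_gz are hypothetical.

```shell
# Count the lines in a gzip-compressed file without writing an uncompressed copy.
# $1 = compressed input file
count_lines_gz() {
    gunzip -c "$1" | wc -l | tr -d ' '
}

# Keep only the lines matching a pattern and write the result compressed again.
# $1 = grep pattern, $2 = compressed input file, $3 = compressed output file
filter_gz() {
    gunzip -c "$2" | grep -- "$1" | gzip > "$3"
}
```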

Useful UNIX Data Management Commands

du -s *
Display only the grand total for the specified files, in this example all files (*).
df
Display free disk space
tar zcvf archive.tgz directory
Archive and compress files (archive.tgz and directory are placeholder names)
gzip
Compress files
gunzip
Uncompress files
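
As an illustration of how these commands combine, the sketch below ranks the entries of a directory by size and packs a directory into a compressed archive. The function names largest_entries and archive_dir are hypothetical, and the -k option (report sizes in kilobytes) is an addition to the du command listed above.

```shell
# Show the size in kilobytes of each entry in a directory, largest first.
# $1 = directory (defaults to the current one); it must not be empty.
largest_entries() {
    ( cd "${1:-.}" && du -sk -- * | sort -rn )
}

# Archive and compress a directory.
# $1 = archive file to create (for example proj.tgz), $2 = directory to archive
archive_dir() {
    tar zcf "$1" -C "$(dirname "$2")" "$(basename "$2")"
}
```

The contents of an archive can be listed afterwards with tar ztf followed by the archive name.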

Working with Computations

 


If you have any questions regarding this topic, please e-mail support@stat.ucla.edu.

Date Created: 2001-08-16 22:24:33 Date Last Modified: 2007-02-05 08:19:09


UCLA Department of Statistics
Last updated: 13-Jul-2009
Maintained by: Web Staff [webstaff@stat.ucla.edu]