Data Handling

This section will briefly cover compressing/decompressing files/directories, transferring files, and logging in.

Compressing and decompressing

Compressing files is done with utilities like gzip, bzip2, or zip.

Compressing a file with gzip

gzip FILE

This results in FILE.gz

Decompressing a file with gzip

gunzip FILE.gz

You now have FILE again, and FILE.gz is gone.
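The round trip can be sketched as follows. This is only an illustration: it runs in a temporary directory, and the file name notes.txt is a made-up placeholder. The last step uses the -c option, which writes the compressed data to standard output, as one way of keeping the original file next to the compressed copy.

```shell
# Work in a throwaway directory (notes.txt is a made-up example file)
workdir=$(mktemp -d)
cd "$workdir"
printf 'some text\nsome more text\n' > notes.txt

# Compress: notes.txt is replaced by notes.txt.gz
gzip notes.txt
ls

# Decompress: notes.txt.gz is replaced by notes.txt
gunzip notes.txt.gz
ls

# Keep the original too by sending the compressed data to stdout
gzip -c notes.txt > notes.txt.gz
ls
```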

Archiving

Archiving is generally done with tar.

Tarball is a commonly used name for an archive file in the tar (Tape Archive) format.

A tarball can be compressed with something like gzip or bzip2.

tar [-options] <name of the tar archive> [files or directories to add to the archive]

Basic options:

    -c, --create: create a new archive;
    -a, --auto-compress: additionally compress the archive with a compressor that is determined automatically from the file name extension of the archive. If the archive's name ends with .tar.gz then gzip is used, if .tar.xz then xz, .tar.zst for Zstandard, etc.;
    -r, --append: append files to the end of an archive;
    -x, --extract, --get: extract files from an archive;
    -f, --file: specify the archive's name;
    -t, --list: show a list of files and folders in the archive;
    -v, --verbose: show a list of processed files.
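To make the options concrete, here is a small sketch that combines -c, -r, -t and -f on an uncompressed archive. The directory and file names (project, extra.txt) are invented for the illustration. Note that appending with -r only works on archives that are not compressed.

```shell
# Throwaway directory with a made-up directory and file
workdir=$(mktemp -d)
cd "$workdir"
mkdir project
echo 'first' > project/a.txt
echo 'second' > extra.txt

# -c creates the archive, -f gives its name
tar -cf project.tar project

# -r appends another file to the end of the existing archive
tar -rf project.tar extra.txt

# -t lists the contents without extracting anything
tar -tf project.tar
```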

Hint

Code-along! You can download the tarball temp.tar.gz to play with (right-click and save).

Here follow some examples:

Generate a tarball

tar -cvf DIRECTORY.tar DIRECTORY

Extracting the files from a tarball

tar -xvf DIRECTORY.tar

Generate a tarball and compress it with gzip

tar -zcvf DIRECTORY.tar.gz DIRECTORY

Uncompressing and extracting files from a tarball

tar -zxvf DIRECTORY.tar.gz
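If you do not have a tarball at hand, the commands above can be tried end to end with the sketch below. It builds a small DIRECTORY with made-up files in a temporary location. The -C option, which is not in the option list above, tells tar where to extract; it is used here only so that the unpacked copy does not overwrite the original.

```shell
# Throwaway directory with made-up content
workdir=$(mktemp -d)
cd "$workdir"
mkdir DIRECTORY
echo 'hello' > DIRECTORY/file1.txt
echo 'world' > DIRECTORY/file2.txt

# Generate a tarball and compress it with gzip
tar -zcvf DIRECTORY.tar.gz DIRECTORY

# Look inside the tarball before unpacking it
tar -ztvf DIRECTORY.tar.gz

# Unpack into a separate directory (-C sets where to extract)
mkdir unpacked
tar -zxvf DIRECTORY.tar.gz -C unpacked

# The extracted copy should be identical to the original
diff -r DIRECTORY unpacked/DIRECTORY && echo 'identical'
```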

More information can be found in the Archiving and compressing section of HPC2N’s documentation.

File transfer and syncing

There are several possible ways to transfer files and data to and from Linux systems: scp, sftp, rsync…

Warning

FTP is generally not permitted due to security problems!

SCP

SCP (Secure CoPy) is a simple way of transferring files between two machines that use the SSH (Secure SHell) protocol.

From local system to a remote system

$ scp sourcefilename user@hostname:somedir/destfilename

From a remote system to a local system

$ scp user@hostname:somedir/sourcefilename destfilename

SFTP

SFTP (SSH File Transfer Protocol, sometimes called Secure File Transfer Protocol) is a network protocol that provides file transfer over a reliable data stream.

From a local system to a remote system

This example was made with the remote system “Kebnekaise” belonging to HPC2N.

enterprise-d [~]$ sftp user@kebnekaise.hpc2n.umu.se
Connecting to kebnekaise.hpc2n.umu.se...
user@kebnekaise.hpc2n.umu.se's password:
sftp> put file.c C/file.c
Uploading file.c to /home/u/user/C/file.c
file.c                          100%    1    0.0KB/s   00:00
sftp> put -P irf.png pic/
Uploading irf.png to /home/u/user/pic/irf.png
irf.png                         100% 2100    2.1KB/s   00:00
sftp>

From a remote system to a local system

sftp> get file2.c C/file2.c
Fetching /home/u/user/file2.c to C/file2.c
/home/u/user/file2.c  100%  1  0.1KB/s 00:00
sftp> get -P file3.c C/
Fetching /home/u/user/file3.c to C/file3.c
/home/u/user/file3.c  100%  1  0.4KB/s 00:00
sftp> exit
enterprise-d [~]$ 

rsync

rsync is a utility for efficiently transferring and synchronizing files between a computer and a storage drive and across networked computers by comparing the modification times and sizes of files.

Recursively sync files from a remote directory to a local directory, preserving symbolic links, permissions, and time stamps. If the transfer is interrupted, rerunning the same command skips the files that have already been transferred completely.

rsync -rlpt username@remote_host:sourcedir/ /path/to/localdir

Recursively sync a local directory to a remote destination directory, preserving owners, permissions, modification times, and symbolic links

rsync -a /path/to/localdir/ username@remote_host:destination_directory
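Both commands above end the source directory with a slash, and that slash matters. The sketch below shows the difference using two local directories, so no remote host is needed. It assumes rsync is installed, and the names src, dest1 and dest2 are invented for the illustration.

```shell
# Throwaway directory with a made-up source directory
workdir=$(mktemp -d)
cd "$workdir"
mkdir src
echo 'data' > src/file.txt

# With a trailing slash: the contents of src end up directly in dest1
rsync -a src/ dest1
ls dest1

# Without a trailing slash: the directory src itself is copied into dest2
rsync -a src dest2
ls dest2

# -n (dry run) together with -v shows what would be transferred
# without changing anything
rsync -anv src/ dest1
```

A dry run like the last command is a cheap way to check a transfer before running it against a remote system.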

Much more information and examples can be found in the HPC2N documentation’s File transfer section.

Connecting with ssh

The ssh command is used for connecting to a remote computer.

Some useful examples:

Connecting to a compute cluster called Kebnekaise

ssh username@kebnekaise.hpc2n.umu.se

Connecting to Kebnekaise and enabling graphical display

ssh -Y username@kebnekaise.hpc2n.umu.se

Note that you need an X11 server to open a graphical display, for example Xming or Cygwin on Windows, or XQuartz on macOS. On Linux an X11 server is normally already included.

Tip

If you are using a graphical display, we strongly recommend ThinLinc.

More advanced topics

This section will look at finding patterns (grep, awk, wild cards, regular expressions) and scripting.

Finding patterns

Here you will find descriptions of how to search for specific patterns in files and file names.

Hint

Try out some of these examples. You can use the contents of the tarball patterns.tar.gz to play with. Right-click and save to download, or right-click and copy the url, then do wget THE-URL-YOU-COPIED in a terminal window to download it there. Then do tar -zxvf patterns.tar.gz to unpack.

grep

This command searches for patterns in text files. FILE is the name of whatever file you want to look at. The file fil.txt is a good option here if you want to test.

Find the pattern ‘word’ in FILE

grep 'word' FILE

Find the pattern ‘word’ recursively under the directory path/to/dir, ignoring case (-i) and printing the line number of each match (-n)

grep -rine 'word' path/to/dir
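To see what the options do, you can try them on a small file of your own. The sketch below creates a made-up file sample.txt in a temporary directory. The -c option, which only counts matching lines, is not mentioned above and is included as an extra.

```shell
# Throwaway directory with a made-up four-line file
workdir=$(mktemp -d)
cd "$workdir"
printf 'A word here\nnothing\nWORD in capitals\nlast word\n' > sample.txt

# Lines containing 'word' (case-sensitive): lines 1 and 4
grep 'word' sample.txt

# -n adds line numbers, -i ignores case: lines 1, 3 and 4
grep -n -i 'word' sample.txt

# -c only counts the matching lines
grep -c 'word' sample.txt
```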

Try finding the pattern string in file.txt

Download file.txt here (not the same as fil.txt above).

awk

This command finds patterns in a file and can perform arithmetic/string operations.

Search for the pattern ‘snow’ in the file FILE and print out the first column

awk '/snow/ {print $1}' FILE
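As a small illustration of both the pattern matching and the arithmetic, the sketch below uses a made-up file weather.txt with three whitespace-separated columns (place, weather, and a number). The file and its contents are invented for the example.

```shell
# Throwaway directory with a made-up three-column file
workdir=$(mktemp -d)
cd "$workdir"
printf 'kiruna snow 12\numea rain 5\nabisko snow 30\n' > weather.txt

# Print the first column of every line that contains 'snow'
awk '/snow/ {print $1}' weather.txt

# Arithmetic: add up the third column of the 'snow' lines (12 + 30)
awk '/snow/ {sum += $3} END {print sum}' weather.txt
```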

Wild cards

Wild cards are useful ‘stand-ins’ for one or more characters, which you can use for instance when finding patterns or when removing/listing all files of a certain type.

Wild cards are also called globbing patterns.

  • ? represents a single character
  • * represents a string of characters (0 or more)
  • [ ] matches any one of the characters listed between the brackets; a range such as a-z or 0-9 can be used
  • { } the terms are separated by commas and each term must be a wildcard or exact name
  • [!] matches any character that is NOT listed between the [ and ]. This is a logical NOT.
  • \ is an “escape” character: the special character that follows it is taken literally.

Warning

You may need quotation marks as well around some wildcards.

Some examples of use of wildcards

myfile?.txt

This matches myfile0.txt, myfile1.txt, myfilea.txt, … that is, any name with exactly one character (a letter, a digit, or anything else) in the position of the ?. Try with ls myfile?.txt.

r*d

This matches red, rad, ronald, … anything starting with r and ending with d, including rd.

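r[aio]ck

This matches rack, rick, rock. No commas are needed inside the brackets; a comma there would itself be treated as one of the characters to match.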

a[d-j]a

This matches ada, afa, aja, … and any three letter word that starts with an a and ends with an a and has any character d to j in between. Try with ls a[d-j]a.

[0-9]

This matches any single digit from 0 to 9.

cp {*.dat,*.c,*.pdf} ~

This copies any files ending in .dat, .c, or .pdf to the user’s home directory. No spaces are allowed around the commas or inside the braces. You could test it by creating a matching file in the patterns directory with touch file.c and running the above command to see that it only copies that one file from the patterns directory.

rm thisfile[!8]*

This will remove all files named thisfile*, except those that have an 8 at that position in their name. Try running it in the patterns directory! Do ls before and after to see the change. Remember, you can always recreate the directory patterns by untar’ing it again.
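A harmless way to check what a wildcard matches before using it with rm or cp is to give it to echo or ls first. The sketch below does this in a temporary directory; the file names are made up for the illustration.

```shell
# Throwaway directory with four made-up empty files
workdir=$(mktemp -d)
cd "$workdir"
touch myfile1.txt myfile2.txt myfilea.txt myfile10.txt

# ? stands for exactly one character, so myfile10.txt is not matched
echo myfile?.txt

# [0-9] stands for exactly one digit
echo myfile[0-9].txt

# [!1] stands for one character that is not 1
echo myfile[!1].txt

# Quotation marks turn the wildcard off: this prints the pattern itself
echo 'myfile?.txt'
```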

Regular Expressions

Regular expressions are patterns that are used when you are working with text. They resemble globbing patterns, but are more expressive and follow somewhat different rules.

Regular Expressions can be used with programs like grep, find and many others.

Note

If your regular expression does not do what you expect, you may need to put single quotation marks around the expression, and you may also have to put a backslash before every special character.

Some common examples of regular expressions:

  • . matches any single character. Same as ? in standard wildcard expressions.
  • \ is used as an “escape” character for a subsequent special character.
  • .* is used to match any string, equivalent to * in standard wildcards.
  • * the preceding item is matched zero or more times, i.e. n* will match n, nn, nnnn, nnnnnnn but not na or any other character.
  • ^ means “the beginning of the line”. So “^a” means find a line starting with an “a”.
  • $ means “the end of the line”. So “a$” means find a line ending with an “a”.
  • [ ] specifies a range. Same as for normal wildcards. This is an ‘or’ relationship (you only need one to match).
  • | This makes a logical OR relationship between patterns. You can thus search for something or something else. You may need to add a ‘\’ before this character to avoid the shell thinking you want a pipe.
  • [^] This is the equivalent of [!] in standard wildcards, i.e. it is a logical “not” and will match anything not listed within the square brackets.

Example

$ grep '^s.*n$' myfile

This command searches the file myfile for lines starting with an “s” and ending with an “n”, and prints them to the standard output.
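A few of the expressions from the list can be tried on a small word list. The sketch below is only an illustration: the file words.txt and its contents are made up, and the -E option (extended regular expressions, where | can be written without a backslash) is an addition that is not covered above.

```shell
# Throwaway directory with a made-up word list, one word per line
workdir=$(mktemp -d)
cd "$workdir"
printf 'sun\nsnowman\nmoon\nstone\nswan\nSaturn\n' > words.txt

# Lines that start with s and end with n: sun, snowman, swan
grep '^s.*n$' words.txt

# [^n] after the s: the second character may be anything except n
# (sun, stone, swan)
grep '^s[^n]' words.txt

# Logical OR with -E: lines starting with m or with st (moon, stone)
grep -E '^(m|st)' words.txt
```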

Scripting

Scripting is used to perform complex or repetitive tasks without user intervention. All Linux commands, as well as wild cards, can be used in a script.

The most common reason for making a script is probably to avoid writing the same command again and again.

Note

If it is just a one-line command you want to do again and again, then ‘alias’ is more suited for this.

Hint

Type along!

Simple example of a script ‘analysis.sh’

You can download analysis.sh here and program.sh here. You can get file.dat here. Hint: right-click, copy url, and then use wget to get it where you want it directly.

#!/bin/bash
grep 'ABCD' file.dat > file_filtered.dat
./program.sh < file_filtered.dat > output.dat

This script can be executed with ./analysis.sh. Remember to check that the user has execute permission on analysis.sh, and make sure that program.sh is executable by the user as well.

To change the permissions to execute a script (here named analysis.sh), for just the user, you could do:

$ chmod u+x analysis.sh

The above script can then be executed with

$ ./analysis.sh
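If you want to try the whole workflow without downloading anything, the sketch below sets it up in a temporary directory. Note the assumptions: the contents of file.dat and program.sh here are invented stand-ins (this program.sh simply counts the lines it receives), so the downloadable files may well behave differently. Only analysis.sh is the script shown above.

```shell
# Throwaway directory for the demo
workdir=$(mktemp -d)
cd "$workdir"

# Made-up input data: two of the three lines contain ABCD
printf 'ABCD 1\nEFGH 2\nABCD 3\n' > file.dat

# Hypothetical stand-in for program.sh: counts the lines read from stdin
cat > program.sh <<'EOF'
#!/bin/bash
wc -l
EOF

# The script from the text
cat > analysis.sh <<'EOF'
#!/bin/bash
grep 'ABCD' file.dat > file_filtered.dat
./program.sh < file_filtered.dat > output.dat
EOF

# Give the user execute permission on both scripts, then run the analysis
chmod u+x analysis.sh program.sh
./analysis.sh
cat output.dat
```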

For more examples of (more useful) scripts, see for instance this list of 25 Easy Bash Script Examples.