Linux Shell: Bash

Outline:

  • 1. Why learn Linux?
  • 2. Bash as shell program to interact with Linux OS
    • 2a. Bash: navigate through your file system
    • 2b. Bash: make new files and directories
    • 2c. Bash: manipulate text files
    • 2d. Bash: write shell scripts
    • 2e. Bash: work with multiple files
  • 3. Resources

Before we begin:

  • Before class (instructor’s setup):
    • set font size to 36 for terminal
    • zoom in to use notepad!!!
    • export PS1='$ '
    • history -c to clear history
    • export PROMPT_COMMAND="history 1> ~/Dropbox/Public/SCW_April2016_Prep/ShellHistory.txt"
    • If you need to create an alias: alias p="pwd" makes p a substitute for pwd. You might need this if some aliases are already set: I had alias ls="ls -F", so I had to change it back to demonstrate the ls -F flag.
    • connect to remote Linux and mac machines to be able to show commands across platforms
  • Before class (students’ setup):

1. Why learn Linux?

  • Most widely used OS in research and technology
  • Open-source: free to get, free to use, free to modify.
  • Availability: Linux VM, Linux servers/clusters (UTA, TACC)
  • CLI (command line interface) and GUI (graphical user interface) are available

2. Bash as shell program to interact with Linux OS

  • We will focus on the CLI, which you can access through a terminal window (open Git Bash or Terminal)
  • A shell is a command line interpreter for the system.
  • Bash is one of several shell programs that can be used to interact with the Linux OS. It is a program that runs other programs, and it is also a programming language in its own right.

  • Let’s get familiar with the terminal first: prompt, commands, output

  • Here is the list of the commands we will use in the first half of the lesson. There are not many of them, but used effectively they simplify and speed up your work enormously!

# commands used in the first half of the lesson
$ echo       # print
$ whoami     # prints username
$ pwd        # print working directory path
$ cd         # change directory
$ mkdir      # make directory
$ touch      # create empty file
$ cat        # view file/concatenate files
$ mv         # move/rename file
$ cp         # copy file
$ rm         # delete file
  • Read-Evaluate-Print loop of CLI. When the user types a command and then presses the enter (or return) key, the computer reads it, executes it, and prints its output.
# '#' here and everywhere should be interpreted as comment

# echo command is executed: Try
$ echo "Welcome to our workshop!"

# only commands known to bash will be executed: Try
$ stop this session 

# what do you think this one does? Try
$ du -h

# To stop command execution, press `Ctrl+C`

# List of built-in bash commands:Try
$ help

# To get help for the command: Try
$ wc --help  
$ grep --help
$ man wc

# We will look at help pages in more details later
# And if you cannot find the command you want, google!

# TAB auto-completion and other shortcuts
# Up arrow to get previous command 

Here is a link to cheat_sheet of Linux commands. Download and take a look. This should give an idea what you can do from the terminal.

You can use the compgen -c command to list all the commands that are available on your system. Remember, bash is a program that can run other programs, like python or R scripts…

To learn more about this command:

Try this (might not work with Git Bash but will from Linux and mac machines): You will be able to understand this command by the end of this lesson.

echo "$USER user can run $(compgen -c | wc -l) commands on $HOSTNAME."

See more resources at the end of these lesson notes …

As you can see from the list of Linux commands, bash can be used to deal with administrative tasks and text file manipulation, but it is also a programming language with variables, data structures, flow control statements and user-defined functions.

We will start by introducing simple administration-like commands that will allow you to find your way around the existing directory structure and then create new files and directories. After that we will see how to manipulate files and write shell scripts.

2a. Bash: navigate through your file system

Know who you are and where you are! Let’s talk about the file system (folders and files) and how to find your way around …

# who am I? Try
$ whoami

And what if I asked you where you are? Minimize your terminal for a second. You are back in your familiar GUI environment. Where are you? What folders are immediately available to you? Desktop folders? Open the terminal again. Can you navigate to the desktop using the CLI?

# where am I? Try
$ pwd  #print working directory

Image of file system: root, working, current, parent directories…



pwd tells me where I am, but what is inside that folder?

# list files
$ ls

# use flags to show different details: Try
$ ls -F

# use man pages or --help to see what flags do what
# spend some time here - example of general command documentation
$ ls -t

# Do you have Desktop directory? You can list files present on Desktop: Try
$ ls -F Desktop  #if you do not have `Desktop` directory, it will give you an error...

CHALLENGE 1

Remember, we made a `SCW_April2016` folder in the beginning of class and moved our `data_linux` folder there. Can you now list files in SCW_April2016 directory from your current directory?

Solution: ls -F Desktop/SCW_April2016

In the exercise above, you viewed files in SCW_April2016 directory while in home directory. But you can change the directory and move directly into SCW_April2016 directory.

$ cd Desktop
$ pwd
$ cd SCW_April2016
$ pwd
# you could have done it in one step: cd Desktop/SCW_April2016

$ ls -F

So, cd lets you move down into subdirectories of the current directory. What if you need to go back?

$ cd Desktop     # does not work: cd looks for `Desktop` inside the current directory, and it is not there

# But try this:
$ cd ..          #works

# Why? List all files, including hidden ones, in the current folder
$ ls -F -a

.. is a special directory name meaning “the directory containing this one”, or the parent of the current directory. And . means “the current working directory”.
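The two special names can be tried out safely in a throwaway directory. This is a small sketch; the `parent` and `child` folder names are made up for illustration, and `mktemp -d` is used only so that nothing in your real folders is touched.

```shell
# Work inside a scratch directory
cd "$(mktemp -d)"
work=$PWD
mkdir -p parent/child

cd parent/child
cd .                  # '.' is the current directory: nothing changes
same_place=$PWD

cd ..                 # '..' is the parent: one level up
one_up=$PWD

cd child
cd ../..              # '..' can be chained: two levels up
two_up=$PWD

echo "stayed in:     $same_place"
echo "one level up:  $one_up"
echo "two levels up: $two_up"
```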

# but `cd` without any arguments takes you ... where?
$ cd
$ pwd

Relative vs Absolute Paths

Navigation with cd .. is an example of a relative path: you indicate where you want to go relative to your current location. When you give an absolute path to a folder or file, you will end up there no matter where in the file system you currently are.

$ pwd  #specifies absolute path
$ cd $(pwd)/Desktop/SCW_April2016
$ pwd  #in my case `/c/Users/Anna/Desktop/SCW_April2016` is an absolute path to `SCW_April2016` folder

Other shortcuts:

$ cd ~ #takes you home
$ cd - #takes you BACK to the directory you were in before, NOT UP!!!
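To see the difference between relative paths, absolute paths and `cd -` in one place, here is a sketch that rebuilds a miniature version of our folders inside a scratch directory. The `Documents` folder is an assumption added only to have somewhere else to jump from.

```shell
cd "$(mktemp -d)"                 # scratch directory
top=$PWD
mkdir -p Desktop/SCW_April2016 Documents

# Relative path: interpreted from wherever you are now
cd Desktop/SCW_April2016
via_relative=$PWD

# Absolute path: starts with '/', so it works from anywhere
cd "$top/Documents"
cd "$top/Desktop/SCW_April2016"
via_absolute=$PWD

# 'cd -' returns to the previous directory (Documents), not to the parent (Desktop)
cd - > /dev/null
previous=$PWD
echo "cd - went back to: $previous"
```

Both routes end in the same folder, and `cd -` lands in `Documents` because that is where we were immediately before.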

CHALLENGE 2

Make a diagram of our directory structure (~/Desktop/SCW_April2016/data_linux/) and practice navigation commands.
a) Find out where you currently are
b) Go to `data_linux` folder, what is there? 
c) Go back where you came from (with cd command)
d) Go one directory up
e) Try relative and absolute paths
f) get comfortable navigating across file system

2b. Bash: make new files and directories

Now that you know how to navigate your file system and list files that are already there, let’s see how we can create new folders and files.

You can create, delete, move, copy, and rename files and directories using Linux commands.


#Navigate to `SCW_April2016` folder
#check you are in the correct place
$ pwd

#create new directories for every lesson in this workshop.
$ mkdir Linux
$ mkdir Python
$ mkdir SQL Git

#check
$ ls

#go to Linux directory
$ cd Linux

#create file
$ touch MyFirstFile.txt   # creates empty file
$ npp MyFirstFile.txt     # edit file (use `edit MyFirstFile.txt` if you are on a mac)
$ cat MyFirstFile.txt     # view file

# move/rename file
# general syntax: mv $source $destination
$ mv MyFirstFile.txt MyFirstScript.sh

# Now try to move (but do not rename) MyFirstScript.sh to Python folder
# you are currently in `Linux`
$ mv MyFirstScript.sh ../Python/
# check if MyFirstScript.sh is present in `Linux` and `Python` without leaving `Linux`

# Copy file from `Python` to `Linux`
# General cp syntax: cp $source $destination
$ cp ../Python/MyFirstScript.sh .

# Delete file !!! CANNOT BE UNDONE
$ rm MyFirstScript.sh  
$ ls
# clean up `Python`
$ rm ../Python/MyFirstScript.sh

#Go ahead and create a new directory `draft`.
#Then try to delete the directory. 
$ rm draft    # does not work: plain `rm` refuses to delete a directory
$ rm -r draft # works: `-r` (recursive) deletes the directory and everything inside it
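Because `rm -r` cannot be undone, it can help to know a more cautious alternative. The sketch below uses `rmdir`, a standard command not covered elsewhere in this lesson, which only deletes directories that are already empty; the file name `notes.txt` is made up.

```shell
cd "$(mktemp -d)"                 # scratch directory, safe to experiment in

mkdir draft
touch draft/notes.txt

# rmdir only removes EMPTY directories, so it refuses here
if rmdir draft 2>/dev/null; then
    echo "draft was empty and is now gone"
else
    echo "rmdir refused: draft is not empty"
fi

# rm -r removes the directory together with everything inside it
rm -r draft

mkdir empty_draft
rmdir empty_draft                 # succeeds because the directory is empty
```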

CHALLENGE 3

Next we will work with files from the `data_linux` folder that you moved to `SCW_April2016` folder in the beginning of this lesson. Please move `data_linux` folder to Linux folder. We will keep all the files for Linux lesson in the `Linux` folder from now on. Make sure you understand the directory structure of SCW_April2016 before we continue.

BREAK

2c. Bash: manipulate text files

Here is the list of the commands we will use in the second half of the lesson. There are not many of them, but used effectively they simplify and speed up your work enormously!

# commands used in this second half of the lesson
$ wc         #word count
$ head/tail  #display start/end of file
$ cut        #extract fields(columns) of interest from file
$ sort       #sort file
$ uniq       #keep unique lines only (collapses adjacent duplicates, so sort first)
$ grep       #select rows based on content

Let’s look into data_linux folder. Open OECD_Countries_Full.txt in Excel first. Let’s understand this dataset.

You want to be able to get a feel for datasets like this using the command line interface. Some questions you might want to ask:

  • How big is this file?
  • What countries does this dataset include?
  • What is the country with the highest infant mortality rate?
  • What country has lowest income inequality?
  • How do these countries rank with respect to taxes?
  • What year has the most data?

Let’s see how we can do some of this.

#how big is your file? `wc` outputs lines, words, bytes
$ wc OECD_Countries_Full.txt  # try also with absolute path!

#look at the first 10 lines
$ head -n 10 OECD_Countries_Full.txt   # also try `tail`

##how many countries in this file? We will need more than one step here!
# 1. extract the first column from the dataset:
# `cut` -  to get the column; `-f1` - flag to indicate we want the first column
$ cut -f1 OECD_Countries_Full.txt

# Let's redirect the output to file `CountryList.txt`; Use `>` operator to write to file.
$ cut -f1 OECD_Countries_Full.txt > CountryList.txt

#How can you view CountryList.txt? Notice that country names repeat many times!
$ cat CountryList.txt

# 2.Sort and select uniq names only
$ sort CountryList.txt > CountryList_Sorted.txt
$ uniq CountryList_Sorted.txt > CountryList_uniq.txt

#the above 2 lines of code could be substituted with one:
$ sort -u CountryList.txt > CountryList_uniq.txt   # same as sort and then select uniq lines (sort|uniq)

# Is it correct? Header got in the mix - be careful!

# 3. And finally, how many countries? 
$ wc -l CountryList_uniq.txt > CountryCountInOECDdata.txt

Notice that we created four files (three of them intermediate) just to find out how many countries are included in our dataset… Run ls to see for yourself. Do we really need them? One way to avoid generating intermediate files you do not need is to string different commands together, known as ‘piping’.

Pipes

Now let’s see how we can combine commands

# use `|` symbol to pass the output of one command as an input to the next command
$ cut -f1 OECD_Countries_Full.txt | sort -u | wc -l > CountryCountInOECDdata_2.txt

# a quick fix to avoid counting header as a country  # introducing `grep`
$ cut -f1 OECD_Countries_Full.txt | grep -v "Country" - | sort -u | wc -l > CountryCountInOECDdata_3.txt
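The same pipe can be tried on a tiny file where the answer is easy to check by eye. This is a sketch on made-up data (`sample.txt` and its values are not part of the lesson files). It also shows why sorting matters before removing duplicates, and uses `tail -n +2` as one possible alternative for skipping the header line.

```shell
cd "$(mktemp -d)"                 # scratch directory

# A tab-separated toy file: one header line, then five rows for two countries
printf 'Country\tValue\n' > sample.txt
printf 'Chile\t1\nAustria\t2\nChile\t3\nAustria\t4\nChile\t5\n' >> sample.txt

# uniq only collapses ADJACENT duplicates, so unsorted input keeps all 6 lines
adjacent_only=$(cut -f1 sample.txt | uniq | wc -l)

# sorting first brings duplicates together: Austria, Chile, Country = 3 lines
with_header=$(cut -f1 sample.txt | sort -u | wc -l)

# tail -n +2 starts printing at line 2, which drops the header: 2 countries
countries=$(tail -n +2 sample.txt | cut -f1 | sort -u | wc -l)

echo "uniq alone: $adjacent_only, sort -u: $with_header, without header: $countries"
```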

Now it is your turn…

CHALLENGE 4

What country had the highest "Infant_mortality" in 2012? 
Use `OECD_Countries_Full.txt` as an input file and generate `CountryWithHighestMortality.txt` as your ONLY output file.

#Hint: you can accomplish this by using `grep`, `cut`, `sort` and `tail` but you might want to look up help pages for some of these commands... 

Solution: first step by step and then as a pipe

# 1. select all rows with `Infant_mortality` in it
grep Infant_mortality OECD_Countries_Full.txt > InfantMortality_all.txt

# 2. get data for 2012 only
grep 2012 InfantMortality_all.txt > InfantMortality_2012.txt

# The 2 steps above can be combined into one step using regular expressions
# $ grep -E "Infant_mortality.*2012" OECD_Countries_Full.txt # `grep -E` is the same as `egrep` 

# 3. select only 1st and 6th columns
cut -f1,6 InfantMortality_2012.txt > InfantMortality_2012_short.txt

# 4. sort by 2nd column from min to max
sort -n -k2 InfantMortality_2012_short.txt > InfantMortality_2012_short_Sorted.txt

# 5. select country with highest mortality rate
tail -n 1 InfantMortality_2012_short_Sorted.txt >CountryWithHighestMortality.txt

#Or as a pipe:
grep Infant_mortality OECD_Countries_Full.txt| grep 2012 | cut -f1,6 | sort -n -k2 | tail -n 1 > CountryWithHighestMortality.txt 
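One thing to watch for in the pipe above: by default `sort` splits fields on whitespace, not only on tabs, so a country name that contains a space would shift the column that `-k2` looks at. The sketch below assumes a made-up file `toy.txt` with invented country names and values, and a column layout that is only a guess apart from column 1 (country) and column 6 (value). It tells `sort` to use the tab as the only separator with `-t`.

```shell
cd "$(mktemp -d)"                 # scratch directory
tab=$(printf '\t')                # a literal tab character

# Invented data: six tab-separated columns, one name containing a space
{
    printf 'Country\tCode\tMeasure\tUnit\tYear\tValue\n'
    printf 'North Land\tNOL\tInfant_mortality\tper_1000\t2012\t13\n'
    printf 'Southia\tSOU\tInfant_mortality\tper_1000\t2012\t4\n'
    printf 'Eastia\tEAS\tInfant_mortality\tper_1000\t2012\t2\n'
    printf 'Southia\tSOU\tInfant_mortality\tper_1000\t2011\t40\n'
    printf 'Eastia\tEAS\tTotal_fertility_rates\tratio\t2012\t50\n'
} > toy.txt

# -t "$tab" : fields are separated by tabs only
# -k2,2n    : sort numerically on the second field and nothing else
grep Infant_mortality toy.txt | grep 2012 | cut -f1,6 \
    | sort -t "$tab" -k2,2n | tail -n 1 > top.txt

cat top.txt
```

Here the last line after sorting is the `North Land` row with the value 13. If your real data has no spaces inside the first column, the shorter `sort -n -k2` behaves the same way.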

2d. Bash: write shell scripts

But what if you want to reuse these commands? You want to do something similar on another file? We need to save the commands to a file so we can easily reuse/modify them later. Such collection of commands in the order you want them to be executed is a simple shell script.

#copy and paste or just redirect the command that works (from the terminal) to `MyFirstScript.sh` file

$ echo "grep Infant_mortality OECD_Countries_Full.txt| grep 2012| cut -f1,6 | sort -n -k2 | tail -n 1  > CountryWithHighestMortality.txt" > MyFirstScript.sh

We have just created MyFirstScript.sh. Let’s view and modify it slightly: npp MyFirstScript.sh or edit MyFirstScript.sh (if you are on mac)

  • Add the path to the shell (bash in our case) that will execute your code; this should be your first line: #!/bin/bash
  • Add a description of what the script does
  • Add a usage statement that helps a user to run this script: usage: script.sh

#more formal usage statement
if [[ $# -ne 1 ]]; then
    echo "usage: script.sh arg1"
    exit 1    # a non-zero exit status signals that the script did not do its job
fi
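The usage check relies on `$#`, the number of arguments received. To experiment with it without writing a script file, the same test can be wrapped in a function, since inside a function `$#` counts the function's own arguments. A minimal sketch; `check_args` is a hypothetical name, and `return` stands in for the `exit` a real script would use.

```shell
# Hypothetical helper that mimics the argument check of a script
check_args() {
    if [ "$#" -ne 1 ]; then
        echo "usage: script.sh arg1" >&2   # usage messages usually go to stderr
        return 1                           # a script would use 'exit 1' here
    fi
    echo "got one argument: $1"
}

# Record the exit status of each call: 0 means success, anything else failure
ok_status=0;   check_args OECD_Countries_Full.txt || ok_status=$?
none_status=0; check_args 2>/dev/null             || none_status=$?
two_status=0;  check_args a b 2>/dev/null         || two_status=$?

echo "statuses: $ok_status $none_status $two_status"
```

Only the call with exactly one argument succeeds; the other two print the usage message and report status 1.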

Let’s run it. We are in data_linux directory. But first look at it carefully, do you think it will run?

#./ indicates that you are running script from working (`data_linux`) directory
#if you get "Permission denied", make the script executable first: chmod +x MyFirstScript.sh
./MyFirstScript.sh

Well, it runs and generates the expected output. But is it a good script? Could you reuse it with a different file as input file? How can we make it more flexible?

To make it flexible we need to introduce a variable for the part of the code that we want to change frequently. For example, if we want to run this code with a different file, we want a variable instead of a hard-wired filename; a variable can take any user-defined value.

Variables in bash

Variable name: myName; value assigned to it: Anya

#try this
$ myName=Anya  #variable assignment
$ echo Anya  
$ echo myName
$ echo $myName  # need `$` to get the value of the variable
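A few details about variables tend to trip people up once they appear inside scripts: spaces around `=`, quoting, and where a variable name ends. The sketch below illustrates them; the `year` variable and the file name it builds are made up for this example.

```shell
myName=Anya                 # no spaces around '=' in an assignment
year=2012

# Double quotes keep the text together and still expand variables
greeting="Welcome, $myName"

# Braces mark where the variable name ends.
# Without them, "$myName_2012" would look for a variable called myName_2012.
file_name="${myName}_${year}.txt"

# Single quotes switch expansion off: the text stays exactly as typed
literal='$myName'

echo "$greeting"
echo "$file_name"
echo "$literal"
```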

Let’s modify MyFirstScript.sh to include a variable and save it as MyFirstScript_2.sh. Also change the name of the output file to CountryWithHighestMortality_2.txt so that we can be sure that the output corresponds to this new modified script.

Here is MyFirstScript_2.sh

#!/bin/bash

# record a country with highest Infant_mortality among countries in OECD_Countries_Full.txt
#usage: script.sh

input=OECD_Countries_Full.txt

grep Infant_mortality $input| grep 2012  | cut -f1,6 | sort -n -k2 | tail -n 1  > CountryWithHighestMortality_2.txt

Run it: ./MyFirstScript_2.sh

Is it better? A little bit… why? What would be even better? We want to provide filename at the command line and not have to change the script itself.

Here is MyFirstScript_3.sh

#!/bin/bash

# record a country with highest Infant_mortality among countries in OECD_Countries_Full.txt
#usage: script.sh $inputFile   #notice how we need to run this now

input=$1  #special variable that stores the first argument from the command line

grep Infant_mortality $input| grep 2012  | cut -f1,6 | sort -n -k2 | tail -n 1  > CountryWithHighestMortality_3.txt

Run it: ./MyFirstScript_3.sh OECD_Countries_Full.txt

Is it better? Why? Can we make it even better? How?
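To get a feel for how `$1`, `$2` and `$#` are filled in, you can imitate a command line inside the terminal with `set --`, which replaces the positional parameters of the current shell. This is only a sketch for experimenting; the argument values are examples.

```shell
# Imitates running:  ./script.sh OECD_Countries_Full.txt 2012
set -- OECD_Countries_Full.txt 2012

input=$1
year=$2
arg_count=$#
echo "input=$input year=$year (received $arg_count arguments)"

# Quotes on the command line keep a name with a space as ONE argument
set -- "my data.txt"
quoted_count=$#

# Without quotes the same text arrives as TWO arguments
set -- my data.txt
unquoted_count=$#

echo "quoted: $quoted_count argument, unquoted: $unquoted_count arguments"
```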

CHALLENGE 5

Work in groups to write a script (name it `MyFirstScript_Good.sh`) that would allow the user to compare any index between the countries (not just Infant_mortality), use data for any year (not just 2012) and write the output to a user-defined output file.

Solution:

Here is MyFirstScript_Good.sh

#!/bin/bash

#record the country with the highest value of a user-specified index in a given year
#usage: script.sh $inputFile $index $year $outFile   #notice how we need to run this now

input=$1              #special variable that stores the first argument from the command line
measure=$2            # $2, $3, $4 store the 2nd, 3rd and 4th command line arguments
year=$3
out=$4

# quoting the variables keeps file names with spaces in one piece
grep "$measure" "$input" | grep "$year" | cut -f1,6 | sort -n -k2 | tail -n 1 > "$out"

Run it: ./MyFirstScript_Good.sh OECD_Countries_Full.txt Income_inequality_interdecile_ratio_P90_P10,_level,_late_2000s 2013 CountryWithHighestIncomeInequality.txt

This is much better!

2e. Bash: work with multiple files

Go to data_linux/ByCountry. This directory contains the data from OECD_Countries_Full.txt, split by country using this command: $ grep -v Country OECD_Countries_Full.txt |awk '{print >$1"_data.txt"}'

We can use a wild card to get information about multiple files at the same time

$ wc -l *

#Try:
$ wc -l A*        # * matches multiple characters
$ wc -l I*e*      
$ wc -l I?e*      # ? matches single character

# These wild cards * and ? are examples of file globbing.
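It is the shell, not `wc`, that expands a wild card into a list of matching file names before the command runs. A quick way to preview a pattern is therefore to hand it to `echo`. The sketch below uses empty files with made-up names that follow the `<Country>_data.txt` pattern.

```shell
cd "$(mktemp -d)"                 # scratch directory
touch Austria_data.txt Iceland_data.txt Ireland_data.txt Israel_data.txt Italy_data.txt

# 'I*e*': starts with I and has an 'e' somewhere later
echo I*e*        # Iceland_data.txt Ireland_data.txt Israel_data.txt

# 'I?e*': starts with I, then exactly one character, then 'e'
echo I?e*        # Iceland_data.txt Ireland_data.txt

star_matches=$(ls I*e* | wc -l)
question_matches=$(ls I?e* | wc -l)
a_matches=$(ls A* | wc -l)
echo "I*e*: $star_matches files, I?e*: $question_matches files, A*: $a_matches file"
```

`Israel_data.txt` matches the first pattern but not the second, because its third character is `r`, not `e`.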
 

You can also run commands on multiple files using loops. Here is the essence of the loop:

for file in *; do echo $file; done 
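Written over several lines, the same loop is easier to read and can do more than one thing per file. A minimal sketch, assuming two made-up data files; it prints the number of lines in each file and keeps a running total.

```shell
cd "$(mktemp -d)"                 # scratch directory
printf 'a\nb\n'    > Chile_data.txt
printf 'a\nb\nc\n' > Japan_data.txt

total=0
for file in *_data.txt; do            # the pattern limits the loop to data files
    lines=$(wc -l < "$file")          # '<' feeds the file in, so wc prints only the number
    echo "$file has $lines lines"
    total=$((total + lines))          # $(( )) does integer arithmetic
done
echo "total: $total"
```

Everything between `do` and `done` runs once per matching file, with `$file` holding the current name.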

Let’s use this loop idea to write a script that will extract data on particular measure from every country file found in ByCountry folder and record it to a single output file.

This is CollectDataFromByCountry.sh. Save it to ByCountry folder for now.

#!/bin/bash
# this script collects data of user-specified index measured in a given year from all country-specific files
# usage: script.sh $inputFile $measure $year

input=$1
index=$2
year=$3

grep "$index" "$input" | grep "$year" | cut -f1,6 >> "${index}_${year}.txt"

Run it from ByCountry folder:

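# `./` is needed because the script sits in the current directory;
# `*_data.txt` keeps the loop away from the script itself and from its output file
for file in *_data.txt; do ./CollectDataFromByCountry.sh "$file" Total_fertility_rates 2012; done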
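The collecting script writes with `>>` rather than `>`, and the difference matters inside a loop: `>` would leave only the last country's rows, while `>>` keeps adding. The flip side is that running the loop twice appends everything twice, so delete the output file before a rerun. A small sketch with made-up file names:

```shell
cd "$(mktemp -d)"                 # scratch directory

echo "first run"  >  overwrite.txt
echo "second run" >  overwrite.txt    # '>' replaces the previous contents

echo "first run"  >> append.txt
echo "second run" >> append.txt       # '>>' adds to the end of the file

overwrite_lines=$(wc -l < overwrite.txt)
append_lines=$(wc -l < append.txt)
echo "overwrite.txt: $overwrite_lines line, append.txt: $append_lines lines"
```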

For your final challenge, go to the ByMeasure folder. In this case the files contain data from OECD_Countries_Full.txt, split by measure/index using this command: $ grep -v Country OECD_Countries_Full.txt |awk '{print >$3"_data.txt"}'

CHALLENGE 6

Go to `data_linux/ByMeasure` folder. In this case the files contain data from `OECD_Countries_Full.txt`, split by measure/index. Write a script (`GetCountryInfo.sh`) that will collect information from every measure-specific file for a user-defined country.

Solution

This is GetCountryInfo.sh. Save it to ByMeasure folder for now.

#!/bin/bash
# this script collects data for user-specified country from all index-specific files
# usage: script.sh $inputFile $country

input=$1
country=$2

grep "$country" "$input" >> "$country.txt"

Run it from ByMeasure folder:

# `*_data.txt` keeps the loop away from the script itself and from the output file
for file in *_data.txt; do ./GetCountryInfo.sh "$file" Sweden; done