export PS1='$ '
history -c
(to clear history)
export PROMPT_COMMAND="history 1> ~/Dropbox/Public/SCW_April2016_Prep/ShellHistory.txt"
alias p="pwd"
will make `p` a substitute for `pwd` (you might need it in case some aliases are already set; I had `alias ls="ls -F"`, so I had to change it back to demonstrate the `ls -F` flag)
`data_linux` folder: move it to the `SCW_April2016` folder
Bash is one of the shell programs that can be used to interact with the Linux OS. It is a program that runs other programs, and it is also a programming language itself.
Let’s get familiar with the terminal first: prompt, commands, output
Here is the list of the commands we will use in the first half of the lesson. There are not many, but used effectively they simplify and speed up your work enormously!
# commands used in the first half of the lesson
$ echo # print
$ whoami # prints username
$ pwd # print working directory path
$ cd # change directory
$ mkdir # make directory
$ touch # create empty file
$ cat # view file/concatenate files
$ mv # move/rename file
$ cp # copy file
$ rm # delete file
# '#' here and everywhere should be interpreted as comment
# echo command is executed: Try
$ echo "Welcome to our workshop!"
# only commands known to bash will be executed: Try
$ stop this session
# what do you think this one does? Try
$ du -h
# To stop command execution, press `Ctrl+C`
# List of built-in bash commands: Try
$ help
# To get help for the command: Try
$ wc --help
$ grep --help
$ man wc
# We will look at help pages in more detail later
# And if you cannot find the command you want, google!
# TAB auto-completion and other shortcuts
# Up arrow to get previous command
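To check whether a name is a command bash knows about before running it, `type` reports what the name refers to. A minimal sketch; `stop_this_session` is just a made-up name standing in for something that is not a command:

```shell
# `type` says what a name refers to (a builtin, a program on disk, an alias...)
type cd

# for an unknown name, `type` fails with a non-zero exit status
if ! type stop_this_session > /dev/null 2>&1; then
    echo "stop_this_session is not a command bash knows"
fi
```

The same check works for any program you expect to be installed, for example `type python`.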
Here is a link to cheat_sheet of Linux commands. Download and take a look. This should give an idea what you can do from the terminal.
You can use the `compgen -c` command to list all the commands that are available on your system. Remember, bash is a program that can run other programs, like Python or R scripts…
To learn more about this command:
Try this (it might not work in Git Bash but will on Linux and Mac machines). You will be able to understand this command by the end of this lesson.
echo "$USER user can run $(compgen -c | wc -l) commands on $HOSTNAME."
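The line above combines two ideas: a pipe (`|`) that feeds one command's output into another, and command substitution (`$( ... )`) that pastes a command's output into a string. A smaller sketch of the same pattern, using a made-up three-line list instead of `compgen -c`:

```shell
# a pipe: printf writes three lines, wc -l counts them
printf 'echo\npwd\ncd\n' | wc -l

# $( ... ) runs the pipeline and substitutes its output into the string
count=$(printf 'echo\npwd\ncd\n' | wc -l)
echo "The list contains $count commands."
```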
See more resources at the end of these lesson notes …
As you can see from the list of Linux commands, bash can be used to deal with administrative tasks and text file manipulation, but it is also a programming language with variables, data structures, flow control statements and user-defined functions.
We will start by introducing simple administration-like commands that will allow you to find your way around the existing directory structure and then create new files and directories. After that we will see how to manipulate files and write shell scripts.
Now that you know how to navigate your file system and list files that are already there, let’s see how we can create new folders and files.
You can create, delete, move, copy, and rename files and directories using Linux commands.
#Navigate to `SCW_April2016` folder
#check you are in the correct place
$ pwd
#create new directories for every lesson in this workshop.
$ mkdir Linux
$ mkdir Python
$ mkdir SQL Git
#check
$ ls
#go to Linux directory
$ cd Linux
#create file
$ touch MyFirstFile.txt # creates empty file
$ npp MyFirstFile.txt # edit file; if you are on a Mac, use: $ edit MyFirstFile.txt
$ cat MyFirstFile.txt # view file
# move/rename file
# general syntax: mv $source $destination
$ mv MyFirstFile.txt MyFirstScript.sh
# Now try to move (but do not rename) MyFirstScript.sh to Python folder
# you are currently in `Linux`
$ mv MyFirstScript.sh ../Python/
# check if MyFirstScript.sh is present in `Linux` and `Python` without leaving `Linux`
# Copy file from `Python` to `Linux`
# General cp syntax: cp $source $destination
$ cp ../Python/MyFirstScript.sh .
# Delete file !!! CANNOT BE UNDONE
$ rm MyFirstScript.sh
$ ls
# clean up `Python`
$ rm ../Python/MyFirstScript.sh
#Go ahead and create a new directory `draft`.
#Then try to delete the directory.
$ rm draft # does not work: rm alone will not remove a directory
$ rm -r draft # works: -r removes the directory and everything inside it
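The difference between the ways of deleting a directory is easy to try safely in a throwaway location. A minimal sketch, assuming `mktemp` is available (it is on Linux and Mac); the file name `notes.txt` is made up:

```shell
workdir=$(mktemp -d)        # scratch directory so nothing real is touched
cd "$workdir"
mkdir draft
touch draft/notes.txt

# rmdir only removes EMPTY directories, so it refuses here
rmdir draft 2> /dev/null || echo "rmdir refused: draft is not empty"

# rm -r removes the directory together with its contents (cannot be undone!)
rm -r draft
[ -d draft ] || echo "draft is gone"
```

`rmdir` is the more cautious choice when you expect the directory to be empty already.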
Next we will work with files from the `data_linux` folder that you moved to the `SCW_April2016` folder at the beginning of this lesson. Please move the `data_linux` folder into the `Linux` folder. We will keep all the files for the Linux lesson in the `Linux` folder from now on. Make sure you understand the directory structure of `SCW_April2016` before we continue.
export PS1='$ '
history -c
(to clear history)
export PROMPT_COMMAND="history 1> ~/Dropbox/Public/SCW_April2016/ShellHistory.txt"
`OECD_Countries_Full.txt`: open in Excel. Set font size to 24.
Here is the list of the commands we will use in the second half of the lesson. There are not many, but used effectively they simplify and speed up your work enormously!
# commands used in this second half of the lesson
$ wc #word count
$ head/tail #display start/end of file
$ cut #extract fields(columns) of interest from file
$ sort #sort file
$ uniq #select unique lines only
$ grep #select rows based on content
Let’s look into the `data_linux` folder. Open `OECD_Countries_Full.txt` in Excel first. Let’s understand this dataset.
You want to be able to get a feel for datasets like this using the command-line interface. Some questions you might want to ask:
Let’s see how we can do some of this.
# how big is your file? `wc` outputs lines, words, bytes
$ wc OECD_Countries_Full.txt # try also with absolute path!
#look at the first 10 lines
$ head -n 10 OECD_Countries_Full.txt # also try `tail`
##how many countries in this file? We will need more than one step here!
# 1. extract the first column from the dataset:
# `cut` - to get the column; `-f1` - flag to indicate we want the first column
$ cut -f1 OECD_Countries_Full.txt
# Let's redirect the output to file `CountryList.txt`; Use `>` operator to write to file.
$ cut -f1 OECD_Countries_Full.txt > CountryList.txt
#How can you view CountryList.txt? Notice that country names repeat many times!
$ cat CountryList.txt
# 2.Sort and select uniq names only
$ sort CountryList.txt > CountryList_Sorted.txt
$ uniq CountryList_Sorted.txt > CountryList_uniq.txt
#the above 2 lines of code could be substituted with one:
$ sort -u CountryList.txt > CountryList_uniq.txt # same as sort and then select uniq lines (sort|uniq)
# Is it correct? Header got in the mix - be careful!
# 3. And finally, how many countries?
$ wc -l CountryList_uniq.txt > CountryCountInOECDdata.txt
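Why sort before `uniq`? Because `uniq` only collapses duplicate lines that are next to each other. A small sketch with a made-up three-name list shows the difference:

```shell
workdir=$(mktemp -d)
# made-up list in which the repeated name is NOT on adjacent lines
printf 'Spain\nItaly\nSpain\n' > "$workdir/names.txt"

unsorted=$(uniq "$workdir/names.txt" | wc -l)          # duplicates survive
sorted=$(sort "$workdir/names.txt" | uniq | wc -l)     # duplicates collapse
echo "uniq alone: $unsorted lines; sort then uniq: $sorted lines"
```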
Notice that we created three intermediate files on the way to the final count of countries in our dataset… Run `ls` to see for yourself. Do we really need them? One way to avoid generating intermediate files you do not need is to string commands together, known as ‘piping’.
Now let’s see how we can combine commands
# use `|` symbol to pass the output of one command as an input to the next command
$ cut -f1 OECD_Countries_Full.txt | sort -u | wc -l > CountryCountInOECDdata_2.txt
# a quick fix to avoid counting header as a country: introducing `grep`
$ cut -f1 OECD_Countries_Full.txt | grep -v "Country" | sort -u | wc -l > CountryCountInOECDdata_3.txt
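`grep -v "Country"` drops every line containing that word, which is fine as long as only the header does. If the header is known to be the first line, `tail -n +2` (start output at line 2) is a possible alternative. A sketch on a made-up two-column sample, not the real OECD file:

```shell
workdir=$(mktemp -d)
# hypothetical tab-separated sample with a header line
printf 'Country\tValue\nSpain\t1\nItaly\t2\nSpain\t3\n' > "$workdir/sample.txt"

# drop the header by content (grep -v) or by position (tail -n +2)
n_grep=$(cut -f1 "$workdir/sample.txt" | grep -v "Country" | sort -u | wc -l)
n_tail=$(tail -n +2 "$workdir/sample.txt" | cut -f1 | sort -u | wc -l)
echo "grep -v: $n_grep countries; tail -n +2: $n_tail countries"
```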
Now it is your turn…
What country had the highest "Infant_mortality" in 2012?
Use `OECD_Countries_Full.txt` as an input file and generate `CountryWithHighestMortality.txt` as your ONLY output file.
#Hint: you can accomplish this by using `grep`, `cut`, `sort` and `tail` but you might want to look up help pages for some of these commands...
# 1. select all rows with `Infant_mortality` in it
grep Infant_mortality OECD_Countries_Full.txt > InfantMortality_all.txt
# 2. get data for 2012 only
grep 2012 InfantMortality_all.txt > InfantMortality_2012.txt
# The 2 steps above can be combined into one step using regular expressions
# $ grep -E "Infant_mortality.*2012" OECD_Countries_Full.txt # `grep -E` is the same as `egrep`
# 3. select only 1st and 6th columns
cut -f1,6 InfantMortality_2012.txt > InfantMortality_2012_short.txt
# 4. sort by 2nd column from min to max
sort -n -k2 InfantMortality_2012_short.txt > InfantMortality_2012_short_Sorted.txt
# 5. select country with highest mortality rate
tail -n 1 InfantMortality_2012_short_Sorted.txt >CountryWithHighestMortality.txt
#Or as a pipe:
grep Infant_mortality OECD_Countries_Full.txt| grep 2012 | cut -f1,6 | sort -n -k2 | tail -n 1 > CountryWithHighestMortality.txt
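One thing to watch with `sort -n -k2`: by default `sort` splits fields on blanks, so a country name containing a space can shift the columns. Telling `sort` that the separator is a tab avoids this. A sketch with made-up names and values (not taken from the OECD file):

```shell
workdir=$(mktemp -d)
# hypothetical country/value pairs, tab-separated; one name contains a space
printf 'United States\t6.1\nJapan\t2.2\nChile\t5.0\n' > "$workdir/rates.txt"

tab=$(printf '\t')
# -t sets the field separator; -k2,2n sorts numerically on column 2 only
top=$(sort -t "$tab" -k2,2n "$workdir/rates.txt" | tail -n 1 | cut -f1)
echo "highest: $top"
```

If none of the names in your file contain spaces, the shorter `sort -n -k2` behaves the same way.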
But what if you want to reuse these commands, or do something similar on another file? We need to save the commands to a file so we can easily reuse or modify them later. Such a collection of commands, in the order you want them executed, is a simple shell script.
#copy and paste or just redirect the command that works (from the terminal) to `MyFirstScript.sh` file
$ echo "grep Infant_mortality OECD_Countries_Full.txt| grep 2012| cut -f1,6 | sort -n -k2 | tail -n 1 > CountryWithHighestMortality.txt" > MyFirstScript.sh
We have just created `MyFirstScript.sh`. Let’s view and modify it slightly: `npp MyFirstScript.sh` or `edit MyFirstScript.sh` (if you are on a Mac).
#!/bin/bash
# usage: script.sh
# a more formal usage statement, for a script that expects one argument:
# if [[ $# -ne 1 ]]; then
#     echo "usage: script.sh arg1"
#     exit 1
# fi
Let’s run it. We are in the `data_linux` directory. But first look at it carefully: do you think it will run?
#./ indicates that you are running script from working (`data_linux`) directory
./MyFirstScript.sh
Well, it runs and generates the expected output. But is it a good script? Could you reuse it with a different file as input file? How can we make it more flexible?
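If `./MyFirstScript.sh` answers with "Permission denied" on your machine (common on Linux and Mac), the file simply lacks execute permission. A minimal sketch using a throwaway script named `hello.sh` (a made-up name):

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf '#!/bin/bash\necho "hello from the script"\n' > hello.sh

chmod +x hello.sh            # give the file execute permission
./hello.sh                   # now it can be run by its path

via_bash=$(bash hello.sh)    # passing the file to bash needs no execute permission
echo "$via_bash"
```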
To make it flexible we need to introduce a variable for the part of the code that we want to change frequently. For example, if we want to run this code with a different file, we want a variable instead of a hard-coded file name; a variable can take any user-defined value.
Variable name: myName; value assigned to it: Anya
#try this
$ myName=Anya #variable assignment
$ echo Anya
$ echo myName
$ echo $myName # need `$` to get the value of the variable
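A few details about variables are worth trying right away: there are no spaces around `=`, double quotes expand variables while single quotes do not, and braces mark where a variable name ends. A short sketch (the file name `Anya_notes.txt` is only an illustration):

```shell
myName=Anya                      # no spaces around `=`
greeting="Hello, $myName"        # double quotes: the variable is expanded
literal='Hello, $myName'         # single quotes: the text is kept as typed
filename="${myName}_notes.txt"   # braces mark where the variable name ends

echo "$greeting"
echo "$literal"
echo "$filename"
```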
Let’s modify `MyFirstScript.sh` to include a variable and save it as `MyFirstScript_2.sh`. Also change the name of the output file to `CountryWithHighestMortality_2.txt` so that we can be sure that the output corresponds to this new modified script.
Here is MyFirstScript_2.sh
#!/bin/bash
# record a country with highest Infant_mortality among countries in OECD_Countries_Full.txt
#usage: script.sh
input=OECD_Countries_Full.txt
grep Infant_mortality $input| grep 2012 | cut -f1,6 | sort -n -k2 | tail -n 1 > CountryWithHighestMortality_2.txt
Run it: ./MyFirstScript_2.sh
Is it better? A little bit… why? What would be even better? We want to provide filename at the command line and not have to change the script itself.
Here is MyFirstScript_3.sh
#!/bin/bash
# record a country with highest Infant_mortality among countries in OECD_Countries_Full.txt
#usage: script.sh $inputFile #notice how we need to run this now
input=$1 #special variable that stores the first argument from the command line
grep Infant_mortality $input| grep 2012 | cut -f1,6 | sort -n -k2 | tail -n 1 > CountryWithHighestMortality_3.txt
Run it: ./MyFirstScript_3.sh OECD_Countries_Full.txt
Is it better? Why? Can we make it even better? How?
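One possible improvement: once a script depends on an argument, it can check `$#` (the number of arguments) and print a usage message instead of failing in a confusing way. A sketch using a throwaway script called `check_args.sh` (a made-up name, written to a temporary directory):

```shell
workdir=$(mktemp -d)
cat > "$workdir/check_args.sh" <<'EOF'
#!/bin/bash
# stop early with a usage message unless exactly one argument is given
if [[ $# -ne 1 ]]; then
    echo "usage: $0 inputFile" >&2
    exit 1
fi
input=$1
echo "input file: $input"
EOF

with_arg=$(bash "$workdir/check_args.sh" OECD_Countries_Full.txt)
echo "$with_arg"
bash "$workdir/check_args.sh" || echo "no argument given: exit status $?"
```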
Work in groups to write a script (name it `MyFirstScript_Good.sh`) that allows the user to compare any index between the countries (not just Infant_mortality), use data for any year (not just 2012), and write the output to a user-defined output file.
Here is MyFirstScript_Good.sh
#!/bin/bash
#record a country with the highest value of a user-specified index in a user-specified year
#usage: script.sh $inputFile $index $year $outFile #notice how we need to run this now
input=$1 #special variable that stores the first argument from the command line
measure=$2 # $2, $3, $4 store the values of command line arguments 2-4
year=$3
out=$4
grep $measure $input| grep $year| cut -f1,6 | sort -n -k2 | tail -n 1 > $out
Run it: ./MyFirstScript_Good.sh OECD_Countries_Full.txt Income_inequality_interdecile_ratio_P90_P10,_level,_late_2000s 2013 CountryWithHighestIncomeInequality.txt
This is much better!
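A caveat with chaining `grep $measure | grep $year`: `grep` matches the text anywhere in the line, so a year could also match inside a value. Matching on specific columns with `awk` is one way around that. The sketch below ASSUMES a six-column tab-separated layout (country, code, measure, unit, year, value); only the measure in column 3 and the value in column 6 are suggested by the commands in this lesson, the rest is a guess, and the rows are made up:

```shell
workdir=$(mktemp -d)
cd "$workdir"
# hypothetical sample: Country, Code, Measure, Unit, Year, Value
printf 'Country\tCode\tMeasure\tUnit\tYear\tValue\n' > sample.txt
printf 'Japan\tJPN\tInfant_mortality\trate\t2012\t2.2\n' >> sample.txt
printf 'Chile\tCHL\tInfant_mortality\trate\t2012\t7.4\n' >> sample.txt
printf 'Chile\tCHL\tInfant_mortality\trate\t2011\t7.7\n' >> sample.txt
printf 'Japan\tJPN\tTotal_fertility_rates\trate\t2012\t1.4\n' >> sample.txt

tab=$(printf '\t')
# keep rows whose 3rd column equals the measure and 5th column equals the year
top=$(awk -F'\t' -v m="Infant_mortality" -v y="2012" \
          '$3 == m && $5 == y {print $1 "\t" $6}' sample.txt |
      sort -t "$tab" -k2,2n | tail -n 1)
echo "$top"
```

Adjust the column numbers to the real file before relying on this.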
Go to `data_linux/ByCountry`. This directory contains data from `OECD_Countries_Full.txt`, split by country using this command: `$ grep -v Country OECD_Countries_Full.txt | awk '{print >$1"_data.txt"}'`
We can use a wild card to get information about multiple files at the same time
$ wc -l *
#Try:
$ wc -l A* # * matches multiple characters
$ wc -l I*e*
$ wc -l I?e* # ? matches single character
# These wild cards * and ? are examples of file globbing.
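It is the shell, not `wc`, that expands wild cards: the pattern is replaced by the list of matching file names before the command runs. `echo` makes this visible. A sketch in a scratch directory with four made-up file names:

```shell
workdir=$(mktemp -d)
cd "$workdir"
touch Iceland_data.txt Ireland_data.txt Israel_data.txt Italy_data.txt

echo I*e*    # names starting with I that contain an e somewhere
echo I?e*    # names starting with I whose third character is e

third_is_e=$(echo I?e*)
```

Here `I*e*` matches the Iceland, Ireland and Israel files, while `I?e*` matches only Iceland and Ireland.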
You can also run commands on multiple files using loops. Here is the essence of the loop:
for file in *; do echo $file; done
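Written over several lines, the same loop can do more than one thing per file. A sketch that counts lines in each file and keeps a running total; the file names and contents are made up, and quoting `"$file"` keeps names with spaces intact:

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'a\nb\n' > Italy_data.txt
printf 'a\n' > Chile_data.txt

total=0
for file in *_data.txt; do
    lines=$(wc -l < "$file")          # reading via < prints the count only
    echo "$file: $((lines)) lines"
    total=$((total + lines))          # $(( ... )) does integer arithmetic
done
echo "total: $total lines"
```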
Let’s use this loop idea to write a script that will extract data on a particular measure from every country file found in the `ByCountry` folder and record it in a single output file.
This is `CollectDataFromByCountry.sh`. Save it to the `ByCountry` folder for now.
#!/bin/bash
# this script collects data of user-specified index measured in a given year from all country-specific files
# usage: script.sh $inputFile $measure $year
input=$1
index=$2
year=$3
grep $index $input| grep $year | cut -f1,6 >> $index"_"$year.txt
Run it from the `ByCountry` folder:
for file in *; do ./CollectDataFromByCountry.sh $file Total_fertility_rates 2012; done
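Two things to keep in mind with this loop: `*` also matches the script itself and the output file once it exists, and `>>` appends, so running the loop twice doubles the output. A sketch of one way to guard against both, with the script's pipeline inlined and made-up sample rows (the six-column layout is an assumption):

```shell
workdir=$(mktemp -d)
cd "$workdir"
# hypothetical per-country files in the assumed six-column layout
printf 'Japan\tJPN\tTotal_fertility_rates\trate\t2012\t1.4\n' > Japan_data.txt
printf 'Chile\tCHL\tTotal_fertility_rates\trate\t2012\t1.8\n' > Chile_data.txt

index=Total_fertility_rates
year=2012
out="${index}_${year}.txt"

rm -f "$out"                       # start fresh, because >> appends
for file in *_data.txt; do         # skips the script and the output file
    grep "$index" "$file" | grep "$year" | cut -f1,6 >> "$out"
done
cat "$out"
```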
For your final challenge, go to the `data_linux/ByMeasure` folder. In this case the files contain data from `OECD_Countries_Full.txt`, split by measure/index using this command: `$ grep -v Country OECD_Countries_Full.txt | awk '{print >$3"_data.txt"}'`
Write a script (`GetCountryInfo.sh`) that will collect information from every measure-specific file for a user-defined country.
This is `GetCountryInfo.sh`. Save it to the `ByMeasure` folder for now.
#!/bin/bash
# this script collects data for user-specified country from all index-specific files
# usage: script.sh $inputFile $country
input=$1
country=$2
grep $country $input >> $country.txt
Run it from the `ByMeasure` folder:
for file in *; do ./GetCountryInfo.sh $file Sweden; done
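For a task this simple, `grep` can also search many files in one call, which is an alternative to the loop rather than a replacement for the exercise. The `-h` flag suppresses the file-name prefix that `grep` adds when given several files, and `^` anchors the match to the start of the line. A sketch with made-up rows:

```shell
workdir=$(mktemp -d)
cd "$workdir"
# hypothetical per-measure files; values are invented for illustration
printf 'Sweden\tSWE\tInfant_mortality\trate\t2012\t2.6\n' > Infant_mortality_data.txt
printf 'Japan\tJPN\tInfant_mortality\trate\t2012\t2.2\n' >> Infant_mortality_data.txt
printf 'Sweden\tSWE\tTotal_fertility_rates\trate\t2012\t1.9\n' > Total_fertility_rates_data.txt

country=Sweden
# > overwrites, so re-running does not duplicate lines
grep -h "^$country" *_data.txt > "$country.txt"
cat "$country.txt"
```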