Welcome to Software Carpentry Etherpad!


This pad is synchronized as you type, so that everyone viewing this page sees the same text. This allows you to collaborate seamlessly on documents.
################################################################################################
################################################################################################
################################################################################################

Welcome to Software Carpentry Workshop!

Please have workshop website open:

https://annawilliford.github.io/2018-02-24-UTA/

Follow the link to etherpad provided under Schedule section ( to this page) - everything we type here is visible to all participants of the workshop

Please make sure that the following steps are  completed: If you have problems, put up a red sticky note, one of our helpers will come help you.

1. If you are off-campus guest: Register your vehicle:
To register your vehicle, please go to: https://uta.nupark.com/events/Events/Register/2a220cb4-fdbc-463c-b735-cfbfe49c0fa5 

2. Make sure you have connection to internet:tut
    For our off-campus guests:
    
    NetID / Username: evt-biol
    Password - Spring2018

    
    You may access the UTA wireless network by connecting
to "UTA Auto Login using the above credentials.
If this does not work, please try connecting to “UTA Web Login” 
with the network key - UTAsecret. Users will then have to open
a web browser to authenticate through the welcome page with the
event account NetID username and password and click Submit.   

3. Make sure Git and Anaconda are installed
Follow instructions under Setup section of the workshop website

4. Fill out the survey: https://www.surveymonkey.com/r/swc_pre_workshop_v1?workshop_id=2018-02-24-UTA

###########################################################################################################
###########################################################################################################
###########################################################################################################


### This space if for taking any notes you want to for this workshop. It is best to work together and do this as a class.


##Issues/Solutions:
    
    
If you are receiving errors when entering 'npp' to git-bash, please try the following:
    
1. Open git-bash
2. Type in: notepad .bash_profile [ENTER]
3. Paste: if [ -f ~/.bashrc ]; then . ~/.bashrc; fi
4. Save file and re-open git-bash


#############################################
Day 1, Morning: Python Basics
#############################################


My lessons: https://github.com/rameshbalan/SWC_spring2018/blob/master/Python_Basics.ipynb
If you want to follow along with the commands as they are issued, keep the above link open in your web-browser. You will have to occassionally refresh to view changes.

The data we will be working with for this workshop
Dataset:
        https://raw.githubusercontent.com/AnnaWilliford/2017-11-11-UTA/gh-pages/workshop/SWC_fall2017/Data.zip

1. Download the data and unzip it into a directory on your Desktop called 'SWC_spring2018'
2. Also within SWC_spring2018, create a directory called 'Python_basics'
3. Copy the 'gapminder.txt' file from 'Data' into 'Python_basics'

Your directory should look something like this (where indenting indicates hierarchical file structure):
Desktop
	SWC_spring2018
		Data
			gapminder.txt
			ByCountry/
		Python_basics
			gapminder.txt

import math
 # we are importing a module, a module is a collection of functions. Here we are importing the functions 
#that live in the "math" module, these are functions that allow us to use the mathematical operations such as
# factorials, square roots, integrals, derivatives and such

help(math) # the "help()" funciton allows us to see all of the functions that are in the "math" module.

# The way that we "call" the function for use,is by typing "math", a period, and then the deisred function
# for example math.sqrt()
# we type what we want to take the square root of inside of the parenthesis
math.sqrt(64) this returns 8


Challenge 2.1 What will be the value of each variable after each statement in the following code?
mass = 47.5
height = 24.5
age = 122
mass = mass * 2.3
age = age - 20
height = height + 20    

height= 44.5

Mr. Robot: 
		* mass 109.24999999999999, age 102, height 44.5

mass 109.24999999999999 height 44.5 age 102mass 109.24999999999999

mass = 109.24999999999999
age = 102
''' 
this is a multiline comment
'''

"""
this is also a multiline comment
"""

When naming a new variable, do not use spaces and do not put numbers at the beginning of the name.
The command "who" gives all of the variables declared so far, and their current values
Use del() to delete a particular variable
Use "reset" to delete all variables.

functions: print(), help(), type(), str(), int(), float(), pandas.DataFrame(), pandas.read_table()
commands: reset, who

data types: float (e.g. 23.0), integer (e.g. 23), string (e.g. "Hello!")
another data type:
list = [value1, value2, value3]

python index begins from 0, not 1  <<--- THIS IS SUPER IMPORTANT! This varies between languages. R, for example, is 1 indexed.
# list is an ordered sequence of objects.
# indexing starts from zero
# list are mutable

# tuples are immutable, meaning the elements of a tuple cannot be changed without creating a new tuple
# tuples are the same as lists except tuples are immutable

# Dictionary is an unordered sequence of objects
# Dicionary contains key:value pair.

Also, do not name your variables with words already used in python, eg naming a dictionary "dict" is  BAD IDEA, same with naming a tuple "tuple" or a list "list"

.iloc() = integer based position indexing    # is it *.iloc() or *.iloc[] ? ## nvm, it is *.iloc[]

does that ^ have to be called using "pandas.iloc()" or do you use "mydataframe.iloc()"?
OK, it says on the projector that you use "mydataframe.iloc()" where mydataframe is the dataframe object that you created.

### For tab separated file
my_file = pd.read_table("gapminder.txt")
Similar to tab-separated file, you can use comma separated file as well as excel file.
### For comma separated file
my_file = pd.read_csv("gapminder.csv")
### For excel file
my_file = pd.read_excel("gapminder.xlsx")
Subsetting Dataframes
Now lets look at the top few lines and last few lines in the df.
### Read in the first 10 lines of data frame
file_dataframe = my_file.head(10)
### Read in the last 10 lines of data frame
file_dataframe = my_file.tail(10)
# For a specific row, column - Here in this case, the subset is a seriesprint(dataframe.iloc[row,column])
# For a set of rows and for a set of columns - Here its a dataframeprint(dataframe.iloc[[<row selection>],[<column selection>]])
Now, let us subset our gapminder dataset using iloc to understand its functionality better.
If you want to the first row and third column, we would subset using iloc as follows.
print(my_file.iloc[0,2])  #first row, third column

There are three main options to achieve the selection and indexing activities in Pandas. They are:
	1. Selecting data by indexing (.iloc)
	1. Selecting data by label(.loc)
	1. Selecting data by a conditional statment/logical statements


Challenge 3.1 Play with gapminder dataset:
    
TASK: Answer the following questions about myData object
1. Can you extract 3rd and 5th column of the dataset?
2. Can you extract the list of countries in this dataset?
### Hint: use unique(). ###
3. Can you get a part of this dataset that includes information about Sweden?
4. Can you extract all countries for which life expectancy is below 70?
5. Can you make a new column that contains population in units of millions of people?
#LinuxFTW
#This is LifeExpPlot.py  
# Importing packages 
import pandas as pd 
import matplotlib.pyplot as plt  
#Reading file my_file = pd.read_table("/Users/balan/Desktop/SWC_spring2018/Data/gapminder.txt")  
#Subsetting  Canada = my_file.loc[my_file['country']=="Canada", :] 
# Plotting 
Canada.plot.line(x="year",y="lifeExp", label = "Life Expectancy", figsize=(8,6)) plt.suptitle("Life Expectancy in Canada over the years", fontsize = 20)
plt.xlabel("Year", fontsize = 16) 
plt.ylabel("Life Expectancy", fontsize = 16) 
plt.savefig("PlotLifeExp.png")

Challenge 3.2 Write your own python script Write a script to calculate mean gdpPerCap for African and European countries. Try to make a barplot to display your results.
You might need to read help for ‘mean’ and ‘plot’ functions .mean() plt.bar() 

unique() is part of pandas module


#############################################
Day 1, Afternoon: Introduction to Bash Shell
#############################################
Adnan:
Remember, we made a `SWC_spring2018` folder in the beginning of workshop and moved our `Data` folder there. Can you now list files in SWC_spring2018 directory from your current directory?

Next we will work with files from the `Data` folder that you moved to `SWC_spring2018` folder in the beginning of this workshop. Please move `Data` folder to unix_shell folder. We will keep all the files for Linux lesson in the `unix_shell` folder from now on. Make sure you understand the directory structure of SWC_spring2018 before we continue.

Link to automatically-updating history of commands issued by instructor. Refresh web-page regularly to view changes.
Dropbox Link: https://www.dropbox.com/s/11damf5gwzaq1x2/linux.txt?dl=0


https://www.dropbox.com/s/axb4n7b2zwix1ob/SWC_spring2018_linux2.txt?dl=0

LINK TO BASH HISTORY:
    https://pastebin.com/fxu5acgy

Remove color from ls:
    alias ls='ls --color=none'
    
Change $ user@dir color:
       export PS1='\[\033[1;37m\]\u@\h:\[\033[0m\]\[\033[1;37m\]\w\[\033[0m\] \[\033[1;37m\]'

Paste the above command in ~/.bashrc

Farah:

https://www.dropbox.com/s/axb4n7b2zwi018_linux2.txt?dl=0

CHALLANGE 4
What country had the highest “LifeExp” in 2002?
Use gapminder.txt as an input file and generate Country_HighestLifeExp.txt as your ONLY output file.
Hint: you can accomplish this by using grep, cut, sort and tail but you might want to look up help pages for some of these commands.

Clue: If you need to cut more than one columns, you can use cut -f1,2,3,4,5,6, etc. Use number of columns seperated by comma after -f1

Solution to challenge 4: 
    
    Solution: first step by step and then as a pipe

# 1. select all columns of interest - "Country", "Year", and "lifeExp.
cut -f1,3,4 gapminder.txt > LifeExp_All.txt
#2. get data for 2012 onlygrep 2002 LifeExp_All.txt > LifeExp_2002.txt
#3. sort by 3rd column from min to max sort -n -k3 LifeExp_2002.txt > LifeExp_2002_Sorted.txt
#4. select country with highest mortality rate
tail -n 1 LifeExp_2002_Sorted.txt > CountryHighestLifeExp.txt

#Or as a pipe:cut -f1,3,4 gapminder.txt | grep 2002 | sort -n -k3 | tail -n 1 > CountryHighestLifeExp.txt
# is there any way to do this command ^ such that the column headers are kept in place?
cut -f1,3,4 gapminder.txt | grep -v "lifeExp",2002 | sort -rk3 | head -n 2 > CountryHighestLifeExp.txt


###Loops# for loop syntax:
    for variable in list; do action $variable; done
    #command-line `for` loops can be run on a single line 
    $ for variable in list; do action $variable; done
    #A simple for loop command
    for file in *.txt; do echo "mv $file 02-24-18_$file"; done
    
    # What does it show you?
    
    Let’s try to extract the third column of every .txt file in ByCountry folder and record the output to corresponding output files that begin with ‘Col3_’. For example, the output for `Canada_data.txt` should be stored in `Col3_Canada_data.txt`. 
    Let’s move to the ByCountry folder.
$ for file in *.txt; do cut -f3 $file > Col3_$file; done

Bash scripts:
If you want to reuse commands, to do something similar on another file, you can save the commands to a file and run that file so that you can easily reuse/modify them later. Such collection of commands in the order you want them to be executed is a simple shell script.

Let's paste this command in  text editor  and save it as MyFirstScript.sh

cut -f1,3,4 Data/gapminder.txt | grep 2002 | sort -n -k3 | tail -n 1 > CountryHighestLifeExp.txt

1. You have just created MyFirstScript.sh. Let’s view and modify it slightly: npp MyFirstScript.sh (for windows) or edit MyFirstScript.sh (for mac)

Add path to bash shell that will execute your code; so the following should be your first line of the script: 
#!/bin/bash
	* Add description of what script does.
#this script records a country with highest life expectancy or population or gdp in any year among countries in gapminder.txt to a user-defined output
	* Add usage statement: 
#usage: scriptname.sh

./ indicates that you are running script from working directory. So this script should be in the directory where Data directory is. ./MyFirstScript.sh

So this is how MyFirstScript.sh it should look:
#!/bin/bash

#this script finds What country had the highest "LifeExp" in 2002 in gapminder.txt
#usage: ./MyFirstScript.sh

cut -f1,3,4 Data/gapminder.txt | grep 2002 | sort -n -k3 | tail -n 1 > CountryHighestLifeExp.txt

MyFirstScript_2.sh:
#!/bin/bash

#this script records a country with highest life expectancy in 2002 among countries in gapminder.txt
#usage: script.sh
#usage: ./MyFirstScript_2.sh (using this command you run it from the terminal)

input=Data/gapminder.txt

cut -f1,3,4 $input | grep 2002 | sort -n -k3 | tail -n 1 > CountryHighestLifeExp_2.txt    
 
 When you are defining the input file, it will only work on that file.
 
 What if you want to get the highest life expectancy from some other file?
 You can use variables. 
 Variables in bash:
Varible name: myName; the value assigned to it: the Name
#try this
$ myName=James  #variable assignment
$ echo James  
$ echo myName
$ echo $myName  # command needs a `$` to return the value of the variable
Now let's use a variable for the input file name.
 
 MyFirstScript_3.sh:

#!/bin/bash

#this script records a country with highest life expectancy in 2002 among countries in gapminder.txt

#usage: ./MyFirstScript_3.sh or sh MyfirstScript_3.sh input file name

input=$1  #special variable that stores the the first argument from the command line

cut -f1,3,4 $input | grep 2002 | sort -n -k3 | tail -n 1 > CountryHighestLifeExp_3.txt

How can we make it more flexible?
To make it flexible we need to inroduce a variable for a part of the code that we want to change frequently. For example, if we want to run this code with a different file, we want a variable instead of hard-wired filename; a variable can take any user-defined value.

So far we have been asking the script to find out the country with the highest life expectancy in 2002 only from gapminder.txt. What if you want to know more? What if you want to know any highest or lowest value from any column for any country for any year from any file? We can use variables to make it do what we want.

MyFirstScript_Good.sh
#!/bin/bash
 #usage: sh MyFirstScript_Good.sh $inputFile $column $year $outputfile  (provide the filename column number year and the outputfile name in the terminal after typing sh MyFirstScript_Good.sh)
 
inputFile=$1                #special variable that stores the the first argument from the command line 
column=$2        # $2, $3, $4 store values from 2-4 command line arguments 
year=$3 
outputfile=$4  
cut -f1,3,$column $inputFile | grep $year | sort -n -k3 | tail -n 1 > $outputfile

You can sort in reverse order using sort -r

#############################################
Day 2, Morning Session 1: Python Programming
#############################################


Please have workshop website open:

https://annawilliford.github.io/2018-02-24-UTA/

Follow the link to etherpad provided under Schedule section ( to this page) - everything we type here is visible to all participants of the workshop

Good Morning, 

Before we begin, let us try and install plotnine.

Type the following command in Terminal / git bash:

conda install -c conda-forge plotnine


Green - 


#This is LifeExpPlot.py  # Importing packages 

import pandas as pd 
import matplotlib.pyplot as plt  
# Reading file
my_file = pd.read_table("gapminder.txt")  
#Subsetting  
Canada = my_file.loc[my_file['country']=="Canada", :] 
# Plotting 
Canada.plot.line(x="year",y="lifeExp", label = "Life Expectancy", figsize=(8,6)) 
plt.suptitle("Life Expectancy in Canada over the years", fontsize = 20) 
plt.xlabel("Year", fontsize = 16) 
plt.ylabel("Life Expectancy", fontsize = 16) 
plt.savefig("PlotLifeExp.png")


====================

To get python and jupyter notebook to work from gitbash:
From whereever you are, type:
cd
npp .bashrc

Add this line at the end of the file:
export PATH=$PATH:"C:\Users\<windows user name>\Anaconda3"
export PATH=$PATH:"C:\Users\<windows user name>\Anaconda3\Scripts"
Save and close.

Note: if this does not work, check the path to Anaconda3 folder
and replace "C:\Users\<windows user name>\Anaconda3" with a correct path.
This provides path to python.exe that MUST BE inside Anaconda3 folder (installed there by default during Anaconda installation)
"C:\Users\<windows user name>\Anaconda3\Scripts" should provide path to jupyter-notebook.exe

Start a new gitbash window for changes to take effect

====================


#defining functions in Jupyter

#defining the function
def degree_faren():   
     # f = (c*9/5)+32    
     f=(22*9/5)+32    
     print(f)  

#calling the function
degree_faren()

#defining an argument for a function (allows for more flexibility in your code)
def degree_faren(c):    
    f = (c*9/5)+32   
     #f=(22*9/5)+32    
     print(f)

#call function with user defined argument
degree_faren(5505)

#using return instead of print()
def degree_faren(c):   
     f = (c*9/5)+32   
      #f=(22*9/5)+32    
      return f

#assign output of function to variable
faren = degree_faren(30)

#multiply that variable by 10
faren = faren*10

#view output two ways
print(faren)
faren


CHALLENGE :

Write a function to  add two number and return the answer
sum(x, y)
def add_two_numbers(x,y):    
    return x + y
answer=add_two_numbers(1,2)
print(answer)

def plot_country(country):    
    country.plot.line(x="year",y="lifeExp", label = "Life Expectancy", figsize=(8,6))     
    plt.suptitle("Life Expectancy in Canada over the years", fontsize = 20)     
	plt.xlabel("Year", fontsize = 16)     
	plt.ylabel("Life Expectancy", fontsize = 16)     
	plt.savefig("PlotLifeExp.png")    
	plt.show()
	
def get_country(country_name):    
    country = my_file.loc[my_file['country']==country_name, :]     
    return country

country_table = get_country('India')
plot_country(country_table)

#############################################
Day 2, Morning Session 2: Data Visualization with Python plotnine
#############################################

-------
import pandas as pd 
import plotnine 
from plotnine import *

(ggplot(my_file) + aes(x='gdpPercap', y='lifeExp') + geom_point())
http://ggplot.yhathq.com/how-it-works.html

-----

countries = my_file[my_file.country.str.startswith('A') |                    my_file.country.str.startswith('Z')]


(ggplot(countries) + aes(x='year', y='lifeExp', color='continent') +   geom_line() + facet_wrap('country') )

(ggplot(countries) + aes(x='gdpPercap', y='lifeExp', size='pop',                          color='continent') +   geom_point() )


#############################################
Day 2, Afternoon Session 1: Writing Reports w/ Jupyter Notebook
Anna Williford
#############################################

http://assemble.io/docs/Cheatsheet-Markdown.html
http://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html

https://www.dropbox.com/s/81pc6zdr6qftvkp/GDP_report_Africa_Americas.ipynb?dl=0
###################################################################################


HTML code for a hyperlink
<a href='$URL'>words you click on </a>  


#############################################
Day 2, Afternoon Session 2: Version control with Git/Github
#############################################

Peace's Git Novice URL: https://pow123.github.io/git-novice/


CHALLENGE #1 (See 3A at above link)
Along with tracking information for your Thesis (the project we have already created), say one would also like to track > information about each chapter. Despite a collaborator’s concerns, you create a Ch1 project inside your Thesis project with the following sequence of commands:

$ cd             # return to home directory
$ cd Thesis    # go into Thesis directory, which is already a Git repository
$ ls -a          # ensure the .git sub-directory is still present in the Thesis directory
$ mkdir Ch1    # make a sub-directory Thesis/Ch1
$ cd Ch1       # go into Ch1 sub-directory
$ git init       # make the Ch1 sub-directory a Git repository
$ ls -a          # ensure the .git sub-directory is present indicating we have created a new Git repository

Is the git init command, run inside the Ch1 sub-directory, required for tracking files stored in the Ch1 sub-directory?

GitHub Education Pack - LOTS of FREE or reduced price developer's tools: https://education.github.com/pack   provides UNLIMITED GitHub private repositories while you're a student

All workshop lessons are available for you after the workshop. If you created a GitHub account, you can copy(fork) this 
repository to your account by following this link and clicking on `fork` button at the top right corner
Repository to fork: https://github.com/AnnaWilliford/SWC_Spring2018_lessons