---
title: ""
---
# Introduction
The purpose of this course is to teach you the basics of the Python language and give you the confidence to tackle larger projects using the language. *Importantly, we want to get you thinking like a programmer*. This doesn't mean that by the end of the course you will know Python fully, but you will know *enough* so you can go online and look for the help you need to complete most tasks.
## Practice makes perfect
Programming is like any skill: the more you practice, the better you get. ***It's really important that you keep using what you have learned after the course is completed***, otherwise there is a good chance you will forget everything and be back to square one.
## Why use Python?
Python is a high-level **general-purpose** programming language (unlike R, which focuses on mathematics and statistics), so it can be used for a wide variety of applications. When it comes to machine learning, Python has the edge and is preferred by the data science community for its AI/ML capabilities. In bioinformatics we see Python making real inroads as more packages become available, two of the most popular being [Scanpy](https://scanpy.readthedocs.io/en/stable/) and the [scverse](https://scverse.org) ecosystem for single-cell analysis. Python is free and available for all operating systems.
## Which other languages do bioinformaticians use?
[R](https://cran.r-project.org) of course, which has a long, rich history in bioinformatics; most will have heard of [Bioconductor](https://www.bioconductor.org), a collection of packages for analysing biological and genomics data. For very computationally intensive tasks (e.g. sequence alignment), languages such as C/C++/Rust are more commonly used, but these are far more difficult to learn.
## A small warning
If you have knowledge of R, this course may be a bit frustrating, as some things in Python take more effort to do. R is really geared up as a language for maths and stats, so many useful functions are built into the base language. Python is a **general-purpose** programming language, so functions that are native to R need to be located in Python packages.
## How will this course work?
We're going to take a different approach in this course. You will be taught the basics of the Python language while doing small exercises along the way. We will finish with a project that will push you quite hard. The aim is that tackling a more difficult problem will consolidate what you have learnt, and you will learn more by having to look up solutions to the problems you will likely face.
## Getting Python
Python is slightly different in that we make "environments" on our computers and work within them, which keeps each project self-contained. You need to install Miniconda, so go [here](https://www.anaconda.com/docs/getting-started/miniconda/main) and follow the instructions to install it on your system.
When you have installed Miniconda, we need to make an environment and install Python along with some other packages to get going. Open a terminal and run:
```
conda create -n workshop.env python=3.10
```
And when this is done, activate the environment by doing:
```
conda activate workshop.env
```
You are now inside the `workshop.env` environment; anything you install or do here stays here.
Let's install some packages that we'll need:
```
conda install -c conda-forge jupyterlab notebook nbclient ipykernel pandas matplotlib numpy
```
We also need to register this environment as a Jupyter kernel we will use to run commands:
```
python -m ipykernel install --user --name quarto-env --display-name "Python (quarto-env)"
```
## Choosing an editor
Ask a hundred different programmers which editor is best and you will get a hundred different answers. These are the two that people tend to recommend the most:
1. [VScode](https://code.visualstudio.com). This is a general-purpose editor which is really popular. It will handle pretty much all languages, and plugins make it easier to write and run code. This is the one I use, and the one we'll use for this course.
2. [PyCharm](https://www.jetbrains.com/pycharm/). This is a Python-specific editor with a ridiculous number of features. You can play with this in your own time!
# The basics
Let's start gently and go through a few simple things to warm up.
## Assigning a variable
Into your script, copy/type the following line:
```{python}
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```
This will make a **list** (Python's version of a vector) of values from 1 to 10, and put it into a variable called `x`.
Execute the code by hitting the "Run" button at the top-right of the script window. You will see this line appear in the Python console below.
To view the contents of the object you have just created, just type `x` in the **console** and hit return:
```{python}
x
```
The contents of `x` are now printed out.
**Now is a good time to learn about commenting and documenting code.** Comments are free text you put into your scripts that tells the reader what's going on. They also remind your future self of what you did and your thought process. Comments are introduced with `#`, for example:
```{python}
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] # This is a comment
```
Anything after a `#` will be ignored by Python. You can run the code again to check.
Rather than typing in the values 1 to 10, there is a much simpler way to create the same list using `range()`:
```{python}
x = list(range(1,11))
print(x)
```
Much better! It's also bidirectional, so to go backwards it's:
```{python}
y = list(range(5, -6, -1))
y
```
Issue the command `help(range)`. In Python you can get a manual for any function using the `help()` command. Look at the help page and generate a vector of numbers from 0 to 100 in steps of 10:
```{python}
a = list(range(0, 101,10))
a
```
Let's try something a bit more difficult. Generate a vector called `b` ranging from 3 to 987 where the vector is 53 entries long. Done? Check the length of the vector you have just made.
In Python, there is a greater reliance on packages for working with numbers, and the package which we will use the most is `numpy`. To load a package we do:
```{python}
import numpy
```
The way we use functions from a package is by writing `numpy.X`, where `X` is the name of the function in the `numpy` package you need. You can already see that if you need `numpy` functions a lot, having to type `numpy` every time is going to be a drag, so we can shorten it by doing:
```{python}
import numpy as np
```
So now, if you want to use a `numpy` function we just write `np.X`:
```{python}
b = np.linspace(3, 987, 53)  # 53 evenly spaced values from 3 to 987 inclusive
length = len(b)
print(length)
```
We can also make a new vector `d` using a vector `c`:
```{python}
c = np.arange(1, 51) # 1 to 50 inclusive
d = 1 / c
print(d)
```
And do maths on them, for example calculate the mean of `d`:
```{python}
mean_d = np.mean(d)
print(mean_d)
```
And the standard deviation:
```{python}
std_d = np.std(d)
print(std_d)
```
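One thing to watch: by default `np.std` computes the *population* standard deviation (`ddof=0`), whereas R's `sd()` computes the *sample* standard deviation. Pass `ddof=1` to get the sample version. A minimal sketch with made-up values:

```{python}
import numpy as np

vals = np.array([1.0, 2.0, 3.0, 4.0])
pop_sd = np.std(vals)              # divides by n
sample_sd = np.std(vals, ddof=1)   # divides by n - 1, like R's sd()
print(pop_sd, sample_sd)
```

The difference is small for long vectors but worth knowing when comparing results with R.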
## Conditionals
This is important stuff. If we want to ask whether something is equal to something else, we need to use the `==` operator, NOT `=`. Try this:
```{python}
x = np.arange(1,11)
x == 5
```
We can also do some other simple but important things:
```{python}
print(np.where(x < 5))
print(np.where(x <= 5))
print(np.where(x >= 5))
print(np.where(x > 5))
print(np.where(x != 5))
```
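`np.where` gives you the *indices* that satisfy a condition; often you want the *values* themselves, which you get by indexing with the boolean mask directly. A small sketch:

```{python}
import numpy as np

x = np.arange(1, 11)
mask = x < 5      # a boolean array, True where the condition holds
print(x[mask])    # the values themselves: [1 2 3 4]
print(x[x > 7])   # same idea in one step: [8 9 10]
```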
Let's make another vector `y` and find which values it shares with `x` using `np.intersect1d`:
```{python}
y = np.arange(7,16)
common = np.intersect1d(x, y)
print(common)
```
What do you think this does?
```{python}
result = x[~np.isin(x, y)]
print(result)
```
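The expression above keeps the elements of `x` that are *not* in `y` (a set difference). `numpy` also has a dedicated helper, `np.setdiff1d`, which returns the same values sorted and de-duplicated:

```{python}
import numpy as np

x = np.arange(1, 11)
y = np.arange(7, 16)
print(np.setdiff1d(x, y))  # elements of x not found in y: [1 2 3 4 5 6]
```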
## Basic plotting
In order to plot, we need another module called `matplotlib`; specifically, a submodule called `pyplot`, which can be thought of as base plotting in R (if you use R). Let's load it:
```{python}
import matplotlib.pyplot as plt
```
To plot we do:
```{python}
plt.plot(d)
```
Do some googling and see how you can add a title, label the x/y axis, make the line points instead, and colour them red.
Here is how I would do it:
```{python}
plt.plot(c,d, marker='o',color="red")
plt.title("c")
plt.xlabel("Index")
plt.ylabel("d")
plt.show()
```
We can make this plot a little fancier. Those who work in R will be familiar with the `ggplot2` package, and there is a direct port of it in Python called `plotnine`. We use this in combination with `pandas`, the module that creates and handles DataFrames (more on this later):
```{python}
from plotnine import ggplot, aes, geom_line, geom_point
import pandas as pd
df = pd.DataFrame({'col1': c, 'col2': d})
# Plot
p = (
ggplot(df, aes(x='col1', y='col2')) +
geom_line(color='red') +
geom_point(color='green')
)
p.draw()
```
If you want to do it the "pure" Python way (and we probably should!) we can do the following using `seaborn` plus `matplotlib`:
```{python}
import seaborn as sns
df = pd.DataFrame({'col1': c, 'col2': d})
# Plot
plt.figure(figsize=(8, 5))
sns.lineplot(data=df, x='col1', y='col2', color='red') # red line
sns.scatterplot(data=df, x='col1', y='col2', color='green') # green points
plt.title("1/c vs c")
plt.xlabel("col1")
plt.ylabel("col2")
plt.show()
```
Let's have a look at the manual for pandas. Go [here](https://pandas.pydata.org/docs/index.html). This module has extensive documentation, and at the top you can click on "API", which lists all the methods that can be used on a `pandas` DataFrame.
# Matrices
Matrices are the most common data format bioinformaticians work with. Microarray/RNAseq/single-cell data are all kept in matrices, with genes in the rows and samples in the columns. Let's make a simple matrix of zeros using `numpy`:
```{python}
m = np.zeros((10, 5)) # 10 rows, 5 columns
print(m)
```
This creates a matrix filled with zeros. To transpose (flip) the matrix we use the `.T` attribute (this will be important later!):
```{python}
tposed_m = m.T
print(tposed_m)
```
We usually need to name the rows and columns (genes/samples), so for that we switch to a pandas DataFrame, because numpy matrices do not carry labels:
```{python}
df = pd.DataFrame(
m,
index=["A","B","C","D","E","F","G","H","I","J"],
columns=["cat", "dog", "pig", "cow", "chicken"]
)
print(df)
```
## Subsetting
Let's make a matrix (and a vector) containing integer values so we can take a look at how subsetting works in Python:
```{python}
v = np.arange(1, 11)
m = np.arange(1, 51).reshape((10, 5))
df = pd.DataFrame(
m,
index=["A","B","C","D","E","F","G","H","I","J"],
columns=["cat", "dog", "pig", "cow", "chicken"]
)
print(df)
```
We can access individual elements using square brackets `[]`. So to get the 7th, 1st and 5th elements of `v` we need:
```{python}
print(v[[6,0,4]])
```
Why? Because **Python counts from 0.** If you have been using R until now, you will know it is one-based: to get the first element of a vector in R you would do `v[1]`, but this isn't so in Python. In fact, pretty much all languages are 0-based. With this in mind, let's look at the first row of the matrix `m` (via the DataFrame `df`):
```{python}
print(df.iloc[0,:])
```
and the 3rd column:
```{python}
print(df.iloc[:,2])
```
```{python}
print(df.iloc[:,[1,4]]) # the 2nd and 5th column
```
```{python}
print(df.iloc[2:7, 3]) # the 3rd to 7th row of the 4th column. Remember 7 is exclusive
```
```{python}
print(df.loc["B",:]) # gets the row labelled B
```
```{python}
print(df.loc["B","cow"])
```
```{python}
print(df.loc[["F", "J"],["chicken", "cat", "pig"]])
```
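One more indexing trick worth knowing: Python also supports *negative* indices, which count from the end of a vector (or axis). A small sketch:

```{python}
import numpy as np

v = np.arange(1, 11)
print(v[-1])    # last element: 10
print(v[-3:])   # last three elements: [8 9 10]
```

The same works with `.iloc`, so `df.iloc[-1, :]` would give the last row of a DataFrame.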
We often need to collect vectors and assemble them into a matrix. In R this is done with the `rbind` (row) and `cbind` (column) functions; the `numpy` equivalent for rows is `np.vstack`:
```{python}
v1 = np.arange(1, 11) # [1, 2, ..., 10]
v2 = np.arange(101, 111) # [101, 102, ..., 110]
rbound_mat = np.vstack([v1, v2])
print(rbound_mat)
```
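To bind the same vectors as columns instead, the `numpy` analogue of R's `cbind` is `np.column_stack`. A minimal sketch:

```{python}
import numpy as np

v1 = np.arange(1, 11)
v2 = np.arange(101, 111)
cbound_mat = np.column_stack([v1, v2])  # 10 rows, 2 columns
print(cbound_mat.shape)
```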
# Dictionaries
So far we have talked about vectors and matrices separately, but we often want to collect these things into one object under a single variable (for example expression data and gene annotation). To do this we use something called a **dictionary**.
```{python}
alpha = ["A", "B", "C", "D", "E", "F", "G", "H"]
mat = np.random.randn(8, 5) # 8 rows, 5 columns of random normal values
listex1 = {
"char": alpha,
"nums": mat
}
```
You can see that each item is given a name (`char`, `nums`) before it is put into the dictionary. Each element can now be accessed via its key in `[]`:
```{python}
listex1["char"]
```
So to access the 3rd element of the vector:
```{python}
listex1["char"][2]
```
And the first row of the matrix:
```{python}
listex1["nums"][0,:]
```
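Dictionaries have a few more tricks worth knowing: you can list the keys, add new entries on the fly, and loop over key/value pairs. A quick sketch using a small made-up dictionary:

```{python}
listex = {"char": ["A", "B", "C"], "nums": [1, 2, 3]}
print(list(listex.keys()))   # the names of the entries
listex["extra"] = "hello"    # add a brand-new entry
for name, value in listex.items():
    print(name, value)       # loop over each key/value pair
```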
Let's now bring in a real example dataset: the classic `iris` data, which `scikit-learn` can load directly as a pandas DataFrame:
```{python}
from sklearn.datasets import load_iris
iris_data = load_iris(as_frame=True)
iris = iris_data.frame  # pandas DataFrame
# Look at the first 5 rows
iris.head(5)
```
```{python}
plt.scatter(iris["sepal length (cm)"], iris["sepal width (cm)"])
plt.xlabel("Sepal Length")
plt.ylabel("Sepal Width")
plt.title("Sepal Length vs Sepal Width")
plt.show()
```
```{python}
iris["species"] = iris_data.target_names[iris_data.target]
# Create the plot
sns.scatterplot(data=iris, x="sepal length (cm)", y="sepal width (cm)", hue="species")
# Display the plot
plt.title("Sepal Length vs Sepal Width")
plt.show()
```
We can also break a data frame up on an attribute and store the pieces in a dictionary (which we have already discussed). For example, by species:
```{python}
iris_sp = {species: data for species, data in iris.groupby("species")}
# Show the keys (species names) of the dictionary
list(iris_sp.keys())
```
```{python}
iris_sp['setosa'].iloc[0:3, :3]
```
```{python}
iris_sp['versicolor'].iloc[0:3, :3]
```
Let's do some basic filtering and manipulation of this data. First, filter for just those rows which are setosa:
```{python}
setosa_data = iris[iris["species"] == "setosa"]
setosa_data.iloc[:5]
```
Select 3 specific columns of the data:
```{python}
iris_subset = iris[["sepal length (cm)", "sepal width (cm)", "species"]]
iris_subset.iloc[:5]
```
Sort on a field:
```{python}
iris_sorted = iris.sort_values("sepal length (cm)")
iris_sorted.head(10)
```
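`sort_values` has a couple of options worth knowing: `ascending=False` reverses the order, and passing a list of columns sorts on the first column and breaks ties with the rest. A small sketch on a toy frame (the column names here are made up):

```{python}
import pandas as pd

toy = pd.DataFrame({"len": [5.1, 4.9, 5.1], "wid": [3.5, 3.0, 3.3]})
# Sort descending on "len", then ascending on "wid" to break ties
print(toy.sort_values(["len", "wid"], ascending=[False, True]))
```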
# Reading and writing files
You have to get the data into Python before you can analyse it (this helps a lot). The `pandas` module has many useful functions to do this, so now we can take our first look at some expression data. Download this [file](https://github.com/sccbioinformatics/Python_programming_1/blob/main/Mouse_HSPC_reduced.txt) and save it to your current working directory.
***Exercise***: Open the file in Excel or something to see how it looks. As we want a **named** matrix we know that we want a `pandas` dataframe. So call `help(pd.read_csv)` in the terminal and try to work out how to import this file.
This is how I would do it:
```{python}
#| code-fold: true
hspc_data = pd.read_csv("Mouse_HSPC_reduced.txt", sep="\t", header=0, index_col=0)
```
Let's look at the first few lines:
```{python}
hspc_data.head(5)
```
Let's have a look at the dimensions of the data. A `pandas` DataFrame has a `shape` attribute for this:
```{python}
hspc_data.shape
```
We can access the individual components of this using:
```{python}
hspc_data.shape[0]
```
```{python}
hspc_data.shape[1]
```
Let's take a look at the first few column names and see what they look like:
```{python}
list(hspc_data.columns)[:10]
```
It's quite common that we need to break a table up. In this case, we want to separate the different populations into dataframes called `lthsc_data`, `mep_data`, and `gmp_data`.
To do this we need to get at the string attributes of the column headers, which we do with `.str` and the functions it contains. An example:
```{python}
mep_data = hspc_data.loc[:, hspc_data.columns.str.contains("MEP")]
```
Let's break this down from the inside out. `hspc_data.columns.str.contains("MEP")` accesses the string functions via `.str`, then uses the `.contains` function to ask which of the `.columns` contain "MEP". We then pass the result to `hspc_data.loc` to pull out the corresponding columns.
Let's run the inner part by itself:
```{python}
hspc_data.columns.str.contains("MEP")
```
We can also get the same results by using `.startswith` instead:
```{python}
hspc_data.columns.str.startswith("MEP")
```
***Exercise***: Isolate the data for the LTHSC and GMP data into objects called `lthsc_data` and `gmp_data`.
```{python}
#| code-fold: true
lthsc_data = hspc_data.loc[:, hspc_data.columns.str.contains("LTHSC")]
gmp_data = hspc_data.loc[:, hspc_data.columns.str.contains("GMP")]
```
Let's have a look at another important task: writing out a file.
***Exercise***: Call `pd.DataFrame.to_csv` and have a look at the help page. Try and figure out how to write the `lthsc_data` to a file called `LTHSC_data.txt`.
```{python}
#| code-fold: true
lthsc_data.to_csv("LTHSC_data.txt", sep="\t", index=True, header=True)
```
***Exercise***: Write out the other two files!
```{python}
#| code-fold: true
mep_data.to_csv("MEP_data.txt", sep="\t", index=True, header=True)
gmp_data.to_csv("GMP_data.txt", sep="\t", index=True, header=True)
```
# Flow control basics
This is where things get more interesting and we start to feel like "proper" programmers. Now that we have these datasets loaded in Python, we can use them to learn about flow control and some basic mathematical functions. We are going to do a few things the “long way” so you get the idea of how flow control works, and then we’ll look at some shortcuts.
Flow control is how multi-step processes (such as algorithms) are carried out. In the example below we print out the numbers 1 to 10:
```{python}
for i in range(10):
    print(i)
```
To translate this code: for every integer from 0 to 9, print the value to the screen (remember, Python counts from 0, and the end of a `range` is exclusive).
***Exercises***:
1. Using the example above, print the first 10 lines of `lthsc_data` in a for loop.
```{python}
#| code-fold: true
#| eval: false
for i in range(10):
    print(lthsc_data.iloc[i,:])
```
2. Print every 2nd line of `mep_data` from lines 0 to 49.
```{python}
#| code-fold: true
#| eval: false
for i in range(0, 50, 2):
    print(mep_data.iloc[i,:])
```
An important point regarding for loops is that results computed within the loop are lost (overwritten) on each iteration unless we keep them. If data generated within a loop has to be retained, we need to create a container to "fill up" while the loop is being carried out. For example:
```{python}
vec = []
for i in range(10):
    vec.append(i * 10)
print(vec)
```
The empty container `vec` is initialised outside the loop and then populated by appending to it with `append` on every iteration of the loop.
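A more idiomatic (and usually faster) way to write this fill-a-container pattern is a *list comprehension*, which builds the list in a single expression:

```{python}
# Same result as the loop above, in one line
vec = [i * 10 for i in range(10)]
print(vec)  # [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
```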
***Exercise***: Initialise an empty container, and for `gmp_data`, calculate the mean of each row (gene), and store the results in the container you made.
```{python}
#| eval: false
gmp_row_means = []
for i in range(gmp_data.shape[0]):
    gmp_row_means.append(gmp_data.iloc[i,:].mean())
print(len(gmp_row_means))
```
Another method of flow control is `while`, which keeps looping for as long as its condition holds. For example:
```{python}
w = 0
while w <= 5:
    print(w)
    w = w + 1
```
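You can also stop a loop early with `break`, which is handy when the stopping condition is easier to test inside the loop. A small sketch that collects values until a limit is hit:

```{python}
seen = []
w = 0
while True:
    if w > 5:
        break        # leave the loop as soon as w exceeds 5
    seen.append(w)
    w += 1
print(seen)  # [0, 1, 2, 3, 4, 5]
```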
We will cover `if` and `else` a bit later.
# Functions
Functions are chunks of code that execute several lines at once to perform a task. Once you have a few lines of useful code that you want to apply repeatedly, a function is a nice way to wrap them up so they can be reused quickly when needed. Let's turn the code you wrote in the previous exercise into a function, where we also calculate the variance of each gene.
```{python}
def calc_mean_and_var(mat):
    means = []
    variances = []
    for i in range(mat.shape[0]):
        row = mat.iloc[i, :]
        means.append(np.mean(row))
        variances.append(np.var(row))
    return {"mns": means, "vars": variances}  # This line gives us back what we need.
```
Here, you can see a loop is started, and the output from each iteration is appended to the containers `means` and `variances`. These are both put into a dictionary which is returned at the end. By putting this code into a function we can now calculate the means and variances of any matrix in one line of code. For example, on the `gmp` data:
```{python}
#| eval: false
gmp_mn_var = calc_mean_and_var(gmp_data)
print(gmp_mn_var["mns"])
print(gmp_mn_var["vars"])
```
We can also use functions to demonstrate `if` and `else` statements for flow control. Let's have a look at a silly example:
```{python}
def animal_maths(value1, value2, animal="pig"):
    if animal == "pig":
        print(value1 / value2)
    elif animal == "cow":
        print(value1 * value2)
```
... and run it:
```{python}
animal_maths(10, 5, "pig")
```
```{python}
animal_maths(10, 5, "cow")
```
What happens if we do this?
```{python}
animal_maths(10, 5, "dog")
```
What we need to do here is a little exception handling. You can't rely on the user doing the correct thing, so we need to account for any odd inputs. In this case, the user has passed an animal which isn't handled by the function. Let's modify the function to let the user know they screwed up:
```{python}
def animal_maths(value1, value2, animal="pig"):
    if animal == "pig":
        print(value1 / value2)
    elif animal == "cow":
        print(value1 * value2)
    else:
        print("I don't recognise this animal!")
```
```{python}
animal_maths(10, 5, "dog")
```
We're using Python, so we should do this the more "Pythonic" way, and there is a really cool approach using a dictionary to look up what needs to be done:
```{python}
def animal_maths(value1, value2, animal="pig"):
    operations = {
        "pig": lambda x, y: x / y,
        "cow": lambda x, y: x * y
    }
    if animal in operations:
        print(operations[animal](value1, value2))
    else:
        print("Unknown animal")

animal_maths(10, 5, "pig")
```
What we do here is provide the options in a dictionary where each one is an unnamed function (lambda function), so all we need is to provide one `if` and one `else`, as opposed to `if` > `elif` > `else`.
```{python}
animal_maths(10, 5, "cow")
```
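A slightly more compact variant uses the dictionary's `.get` method, which returns `None` (or a default) when the key is missing, so no `if`/`else` on membership is needed at all. A sketch (the name `animal_maths2` is just for illustration):

```{python}
def animal_maths2(value1, value2, animal="pig"):
    operations = {
        "pig": lambda x, y: x / y,
        "cow": lambda x, y: x * y
    }
    op = operations.get(animal)  # None if the animal is unknown
    return op(value1, value2) if op else "Unknown animal"

print(animal_maths2(10, 5, "cow"))  # 50
print(animal_maths2(10, 5, "dog"))  # Unknown animal
```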
# Using built-in functions
Libraries such as `pandas` have a host of functions that can be applied to DataFrames. Let's have another look at this code, which calculates the row means and variances of a matrix:
```{python}
def calc_mean_and_var(mat):
    means = []
    variances = []
    for i in range(mat.shape[0]):  # <<< This loop is slow!
        row = mat.iloc[i, :]
        means.append(np.mean(row))
        variances.append(np.var(row))
    return {"mns": means, "vars": variances}
```
This function loops through each row, which is computationally *really* slow, as each iteration has to be interpreted in turn. Let's try this instead:
```{python}
gmp_row_means = gmp_data.mean(axis=1)
```
This is so much faster.
***Exercise***: Write a function called `calc_mean_and_var_fast` which uses this method to calculate row means and variances for a matrix.
```{python}
def calc_mean_and_var_fast(mat):
    return {
        "mns": mat.mean(axis=1),
        "vars": mat.var(axis=1)
    }
```
We can use the `time` library to measure how long a function takes to run. Slow one first:
```{python}
#| eval: true
import time
start_time = time.time()
result = calc_mean_and_var(gmp_data) # Call your function
end_time = time.time()
print(f"Time taken: {end_time - start_time:.6f} seconds")
```
Now the fast version:
```{python}
import time
start_time = time.time()
result = calc_mean_and_var_fast(gmp_data) # Call your function
end_time = time.time()
print(f"Time taken: {end_time - start_time:.6f} seconds")
```
# Apply
`apply` is a commonly used `pandas` method for running a custom function over every row (or column) of a DataFrame. For example, let's take a nonsense operation:
```{python}
def example_func(v):
    val = (np.mean(v) * np.std(v)) / np.sum(v)
    return val

ex_apply = gmp_data.apply(example_func, axis=1)
```
We'll likely use this a bit later!
# Getting highly variable genes
Let's start doing some work with this data. First we'll define a function that isolates the *N* most variable genes in `hspc_data`.
```{python}
def get_top_variable_genes(mat, top_n=500):
    variances = mat.var(axis=1, ddof=1)
    top_genes = variances.nlargest(top_n).index
    return mat.loc[top_genes]
```
Let's run it:
```{python}
hspc_var = get_top_variable_genes(hspc_data)
hspc_var.head(5)
```
# Standardising data
Let's stick with our expression data and cluster it. Doing so, we'll learn more of the language and some of the fundamental maths that runs underneath. First, let's take a look at the range of the data, i.e. the lowest and highest value in the matrix of variable genes we just made.
```{python}
hspc_var.values.min(), hspc_var.values.max()
```
For some operations (such as making heatmaps) the data needs to be z-score normalised (scaled) first. When we scale data, each row (gene) is standardised so that its mean = 0 and sd = 1. Specifically, for a gene $g$, the $i$-th expression value becomes:
$$Z_i=\frac{g_i-\bar{g}}{\sigma_g}$$
That is, from each expression value ($g_i$) we subtract the mean expression of the gene ($\bar{g}$) and divide by the standard deviation of the gene's expression ($\sigma_g$).
***Exercise***: Write a function called `zscore` which takes a single vector of values `v` and scales them. When you have done this, apply it to the `hspc_var` matrix to scale all rows, and call the result `hspc_zs`.
```{python}
def zscore(v):
    zv = (v - np.mean(v)) / np.std(v, ddof=1)
    return zv

hspc_zs = hspc_var.apply(zscore, axis=1)
print(hspc_zs.shape)
hspc_zs.head(5)
```
Let's make a boxplot of the data before and after scaling.
Before:
```{python}
hspc_var.boxplot(rot=90) # rot=90 rotates x-axis labels vertically (like las=2)
plt.show()
```
After:
```{python}
hspc_zs.boxplot(rot=90) # rot=90 rotates x-axis labels
plt.show()
```
<!-- ***Exercise***: Do some Googling and see if there is already a `zscore` function somewhere in Python that will calculate z-scores for you. Use it to z-score your data and save the output in `hspc_zs_v2`. There are maybe a couple of ways that you can do it. -->
# Clustering
All the steps before have been leading to this, clustering the data and making heatmaps - a staple method in bioinformatics.
Clustering is one of the most common visualisation techniques for gene expression data. Here we will learn how to make some basic dendrograms and heatmaps. To do this we need to bring in the `scipy` [library](https://docs.scipy.org/doc/scipy/index.html), a library for engineering, maths, and stats. In this case we are going to pull in the `cluster` package and, within it, the `hierarchy` module.
```{python}
import scipy.cluster.hierarchy as sch
linkage_matrix = sch.linkage(hspc_zs, method='ward',metric='euclidean')
# Create the dendrogram
plt.figure(figsize=(10, 6))
sch.dendrogram(linkage_matrix, labels=hspc_var.index.tolist(), leaf_rotation=90)
plt.title("Dendrogram of Rows (hspc_var)")
plt.xlabel("Sample")
plt.ylabel("Distance")
plt.tight_layout()
plt.show()
```
```{python}
import scipy.spatial.distance as ssd
dist_matrix = ssd.pdist(hspc_zs, metric='euclidean')
# Step 2: Hierarchical clustering (like R's hclust)
linkage_matrix = sch.linkage(dist_matrix, method='ward')  # other options: "average", "complete", "single"
# Step 3: Plot dendrogram (like R's plot)
plt.figure(figsize=(10, 6))
sch.dendrogram(linkage_matrix, labels=hspc_zs.index.tolist(), leaf_rotation=90)
plt.title("Hierarchical Clustering of Genes")
plt.xlabel("Genes")
plt.ylabel("Distance")
plt.tight_layout()
plt.show()
```
```{python}
import scipy.spatial.distance as ssd
dist_matrix = ssd.pdist(hspc_var, metric='correlation')
# Step 2: Hierarchical clustering (like R's hclust)
linkage_matrix = sch.linkage(dist_matrix, method='ward')  # other options: "average", "complete", "single"
# Step 3: Plot dendrogram (like R's plot)
plt.figure(figsize=(10, 6))
sch.dendrogram(linkage_matrix, labels=hspc_zs.index.tolist(), leaf_rotation=90)
plt.title("Hierarchical Clustering of Genes")
plt.xlabel("Genes")
plt.ylabel("Distance")
plt.tight_layout()
plt.show()
```
Let's say we want to cluster the samples (columns) instead. What should we do?
```{python}
#| code-fold: true
import scipy.cluster.hierarchy as sch
sample_linkage_matrix = sch.linkage(hspc_zs.T, method='ward',metric='euclidean')
# Create the dendrogram
plt.figure(figsize=(10, 6))
sch.dendrogram(sample_linkage_matrix, labels=hspc_var.T.index.tolist(), leaf_rotation=90)
plt.title("Dendrogram of Samples (hspc_var)")
plt.xlabel("Sample")
plt.ylabel("Distance")
plt.tight_layout()
plt.show()
```
## Heatmaps
Heatmaps have been the go-to method for visualising gene expression data, and we can make one from first principles using the work we have done above. Let's take a look at what is returned by `sch.dendrogram`.
We'll run these lines again, but this time saving the output from each call to cluster the data:
```{python}
gene_linkage_matrix = sch.linkage(hspc_zs, method='ward', metric='euclidean')
genes_hc = sch.dendrogram(gene_linkage_matrix, labels=hspc_var.index.tolist(), no_plot=True)
sample_linkage_matrix = sch.linkage(hspc_zs.T, method='ward', metric='euclidean')
samples_hc = sch.dendrogram(sample_linkage_matrix, labels=hspc_var.T.index.tolist(), no_plot=True)
```
***Exercise***: What kind of object is `genes_hc`? Check its type, then get a list of what `genes_hc` contains.
```{python}
#| code-fold: true
type(genes_hc)
genes_hc.keys()
```
At the moment the rows and columns are in their original (unclustered) order, so plotting the matrix as it stands looks fairly random:
```{python}
plt.imshow(hspc_zs, aspect='auto', cmap='viridis')
plt.show()
```
We can now order the rows according to `ivl`, which holds the leaf labels in dendrogram order:
```{python}
hspc_ordered = hspc_zs.loc[genes_hc["ivl"]]
```
Replot the heatmap:
```{python}
plt.imshow(hspc_ordered, aspect='auto', cmap='viridis')
plt.show()
```
Looking better, right? One more step and we will have the columns clustered too:
```{python}
hspc_ordered = hspc_zs.loc[genes_hc["ivl"],samples_hc["ivl"]]
```
And plot again:
```{python}
plt.imshow(hspc_ordered, aspect='auto', cmap='viridis')
plt.show()
```
To make a heatmap using `seaborn`:
```{python}
import seaborn as sns
sns.heatmap(hspc_ordered, cmap='magma') # Optional: cbar=True, xticklabels=False, yticklabels=False
plt.title('Heatmap of Data')