I will focus on the clinical data files (data_clinical_patient.txt, data_clinical_sample.txt, and data_mutations.txt), primarily the clinical patient file. These are tab delimited plain text files that can be accessed after the tar is decompressed.
There is a mix of categorical (factors), continuous and numerical features (some to be inferred by R code).
This data represents the targeted sequencing (MSK-IMPACT) of 696 melanoma tumour / normal pairs.
There is a corresponding published medical journal entitled “Therapeutic Implications of Detecting MAPK-Activating Alterations in Cutaneous and Unknown Primary Melanomas”.
Reference:
Shoushtari AN, Chatila WK, Arora A, Sanchez-Vega F, Kantheti HS, Rojas Zamalloa JA, et al. . Therapeutic implications of detecting MAPK-activating alterations in cutaneous and unknown primary melanomas. Clin Cancer Res. (2021) 27:2226–35. doi: 10.1158/1078-0432.CCR-20-4189, PMID: - DOI - PMC - PubMed: https://pubmed.ncbi.nlm.nih.gov/33509808/
Skin related & unclassified primary melanomas often conceal changes that can activate the MAPK pathway. The disease pathway describes the mechanisms by which it evolves & either persists or is resolved. The MAPK pathway activation is a prominent step in melanoma pathogenesis.
The targeted capture of multi-gene sequencing can detect oncogenic RTK-RAS-MAPK pathway alterations (can cause development of a tumour or tumours) in almost all cutaneous and unknown primary melanomas.
Warning: package 'ggplot2' was built under R version 4.3.2
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
Attaching package: 'rio'
The following object is masked from 'package:plotly':
export
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ stringr 1.5.1
✔ forcats 1.0.0 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.0
✔ purrr 1.0.2
[conflicted] Will prefer dplyr::filter over any other package.
[conflicted] Will prefer dplyr::lag over any other package.
Code
source('final_proj_utils.R')# Import and load source code from local R script
[conflicted] Removing existing preference.
[conflicted] Will prefer dplyr::filter over any other package.
[conflicted] Removing existing preference.
[conflicted] Will prefer dplyr::lag over any other package.
Prepare working directory and data for import
Set working directory to path_wd
The working directory should contain .qmd, *.R and tar/gzip files before executing final-project.qmd file.
Merge data sets (clinical patient, clinical sample & data gene panel matrix as they have the same number of observations at 696 each)
Code
# Merge by patient ID df_clin <-merge(x = data_files$data_clinical_patient, y = data_files$data_clinical_sample, by ="PATIENT_ID", all =TRUE)# Merge / Join by Sample IDdf_clin <-merge(x = df_clin, y = data_files$data_gene_panel_matrix, by ="SAMPLE_ID", all =TRUE)
Data manipulation (add columns, add proportion percentages, omit NAs etc.)
Code
# Add column to represent proportion as a percentage for the male:female samplesdf_clin <- df_clin %>%group_by(SEX) %>%mutate(gender_prop =100*round(sum(n() /length(df_clin$SEX)), 2)) %>% ungroup# Add column to represent proportion as a percentage for the primary cancer site # samplesdf_clin <- df_clin %>%group_by(DMT_PRIMARY_SITE) %>%mutate(site_prop =100*round(sum(n() /length(df_clin$DMT_PRIMARY_SITE)), 2)) %>% ungroup# Add column to represent proportion as a percentage the primary cancer site # classified (generic site) samplesdf_clin <- df_clin %>%group_by(DMT_PRIMARY_SITE_CLASSIFIED) %>%mutate(site_classified_prop =100*round(sum(n() /length(df_clin$DMT_PRIMARY_SITE_CLASSIFIED)), 2)) %>% ungroup# extract OS years from OS months columndf_clin$OS_YEARS <-round(df_clin$OS_MONTHS /12)# estimate patient current age by adding age at diagnosis + OS yearsdf_clin$EST_CURR_AGE <- df_clin$OS_YEARS + df_clin$AGE_AT_INITIAL_DIAGNOSIS# Extract 2 columns from OS_STATUS where OS_STATUS_INT is 1 or 0 # and OS_STATUS_GRP is Living or Deceaseddf_clin <- df_clin %>%separate_wider_delim(OS_STATUS, ":", names =c("OS_STATUS_INT", "OS_STATUS_GRP"))# Add column to represent age group - this column is an ordered factor based on estimated current age of patientdf_clin$age_group =cut(df_clin$EST_CURR_AGE, c(0, 5, 15, 25, 35, 45, 55, 65, 75, 85, 95, Inf), c("0-5", "5-15", "15-25", "25-35", "35-45", "45-55", "55-65", "65-75", "75-85", "85-95", ">95"), include.lowest=TRUE, ordered = T)# There NA values but they do not hamper any calculation that I have replicated so commenting out this line# na.omit(df_clin)
Journal Statistics (Part 1): Tables / Numeric Summaries of Data
Create summary table with count by group aggregation that adds count variable n and count proportion as a percentage (variable Freq).
This is done by calling a genric function from final_proj_utils.R
The grouping field here is SIMPLIFIED_METASTATIC_SITE
The most common sites for cancers to metastasize include the lungs, liver, bones and brain.
From this sample set, Regional Local Recurrences (LR) / In-Transit / Lymph Node (LN) is by far the most common at 39% followed by Soft Tissue / Distant LN and Lung at 14% and 13% respectively.
NA here corresponds to Primary site in Table 1
Code
count_agg(df_clin, "SIMPLIFIED_METASTATIC_SITE")
SIMPLIFIED_METASTATIC_SITE
n
Freq
LR / In-Transit / Regional LN
271
39
NA
104
15
Soft Tissue / Distant LN
96
14
Lung
88
13
Brain
52
7
Liver
36
5
Other Visceral
33
5
Bone
16
2
Create summary table with count by group aggregation that adds count variable n and count proportion as a percentage (variable Freq).
This is done by calling a genric function from final_proj_utils.R
The grouping field here is DMT_PRIMARY_SITE_CLASSIFIED
The author applies a filter to this data where CANCER_TYPE_DETAILED == "Cutaneous Melanoma" before creating summary statistics. This makes sense because we want to rule out the patients with unknown primary cancer site/area as this will skew the result.
There are 140 patients (20%) with unknown DMT primary cancer site which we want to rule out before reporting the proportion of patients per generic primary cancer site.
The count_agg function has an optional parameter for digits (number of decimals) which has default value of 0 but here we explicitly set digits = 2 for a more accurate Freq (group size proportion as a percentage based on count variable n) to match the counts reported.
There is a relatively equaly split here where each site has a share between 18% and 32%.
Trunk and Head are the top 2 results from this summary.
Further Exploratory Analysis (NOT based on Table 1)
Create summary table with count by group aggregation that adds count variable n and count proportion as a percentage (variable Freq).
This is done by calling a genric function from final_proj_utils.R
The grouping field here is DMT_PRIMARY_SITE
The count_agg has an optional parameter for digits (number of decimals) which has default value of 0 but here we explicitly set digits = 2 for a more accurate Freq (group size proportion as a percentage based on count variable n)
This summary table goes into more detail on where the site is, giving further detail on the lower and upper extremities in particular.
Unfortunately, Unknown Primary observations account for 20% of the data set.
However there are more than 20 records for neck, face, scalp, arm/shoulder, leg/hip and trunk. Trunk again is the top result at 26% which may be expected to a certain extent as it is a large component of the human body.
Code
count_agg(df_clin, "DMT_PRIMARY_SITE", digits =2)
DMT_PRIMARY_SITE
n
Freq
Skin, trunk
180
25.86
Unknown Primary
140
20.11
Skin, leg/hip
104
14.94
Skin, arm/shoulder
99
14.22
Skin, scalp
60
8.62
Skin, face
52
7.47
Skin, neck
28
4.02
Skin, external ear
15
2.16
Skin, eyelid
6
0.86
Skin, foot
6
0.86
Skin, hand
3
0.43
Skin, Site Unspecified
1
0.14
Skin, lip
1
0.14
NA
1
0.14
Create summary table with count by group aggregation that adds count variable n and count proportion as a percentage (variable Freq).
This is done by calling a genric function from final_proj_utils.R
This is the first summary of mutations data
The grouping field here is Hugo_Symbol
The count_agg has an optional parameter for n_results (number of results returned after sorting results set by n in descending order) which by default will return first 20 results (includes each group). There are many Hugo_Symbol values and I am only interested in the top 15 occurences so we set n_results = 15
The count_agg has an optional parameter for digits (number of decimals) which has default value of 0 but here we explicitly set digits = 2 for a more accurate Freq (group size proportion as a percentage based on count variable n)
One of the most common changes in melanoma cells is a mutation in the BRAF oncogene, which is found in approx. half of diagnosed melanomas.
Other genes that can be affected in melanoma include NRAS, CDKN2A & NF1 (Usually only 1 of these genes is affected).
We can see BRAF (# 5), NRAS (# 15), CDKN2A (# 4) and NF1 (# 7)
The top 3 in this data set consist of: TERT, PTPRT and GRIN2A
It appears the BRAF mutation is associated with aggressive tumor features & can increase the risk of persistent and recurrent disease
I think this is a promising subset of data to examine further for discerning data features or even prinicpal component analysis in the mutations data frame which has a column of values per patient.
We can see there are 19,479 observations or records
We can see there is now 122 variables or columns
The top 5 symbols in order of count were: TERT, PTPRT, GRIN2A, CDKN2A, BRAF
Code
dim(data_files$data_mutations)
[1] 19479 122
Graphical Summaries / Data Visualisation
I would like to create some data visualisations to further explore the summary information from Table 1.
I am interested in the split for age_group (Ordinal factor) and SEX (Male or Female - categporical).
The majority of samples are Male and lie between age groups 45-55 -> 75-85
Code
ggplotly(ggplot(df_clin, aes(x=age_group, fill = SEX)) +geom_bar())
I am interested in the split for OS_STATUS (Living or Deceased) and age_group (Ordinal factor)
The largest group of deceased patient samples are in the age bracket of 65-75.
Code
ggplotly(ggplot(df_clin, aes(x=age_group, fill = OS_STATUS_GRP)) +geom_bar())
I am interested in the split for OS_STATUS_GRP (Living or Deceased) and DMT_PRIMARY_SITE_CLASSIFIED (categorical feature)
This shows a lot of samples for living patients that were diagnosed with skin cancer in the head area, especially 65-85.
The area with a high proportion of deceased patients were diagnosed with skin cancer in trunk or torso area across all age groups followed by the Head area. In the 15-25 age group, the deceased appear to all have Head diagnosis as primary site of melanoma.
Code
ggplotly(ggplot(df_clin |>filter(CANCER_TYPE_DETAILED =="Cutaneous Melanoma"), aes(x=OS_STATUS_GRP, fill = DMT_PRIMARY_SITE_CLASSIFIED)) +geom_bar() +facet_wrap(~age_group))
I am interested in the split for OS_STATUS_GRP (Living or Deceased) and EST_CURR_AGE (continuous variable that represents the estimated current age of patient - age at diagnosis + OS_MONTHS)
There are a lot more patients living with skin cancer diagnosis than there are deceased in this data set.
However, there is a relatively high count of deceased with estimated current age between 40 and 80 years of age (count 174).
Code
ggplotly(ggplot(df_clin, aes(x=EST_CURR_AGE, fill = OS_STATUS_GRP)) +geom_histogram(binwidth =20))
I am interested in the split for OS_STATUS_GRP (Living or Deceased) and OS_YEARS (continuous variable that represents the number of years after initial diagnosis)
This chart is replicated for each primary cancer site (DMT_PRIMARY_SITE) as a sub plot or graph.
All patients diagnosed with skin cancer in hand/lip and site unspecified as primary site of melanoma are deceased.
Patients diagnosed with skin cancer where foot was primary site has a 50:50 split for Living:Deceased (3:3).
There are patients that had diagnosis in the leg/hip and trunk and primary site living long after the initial diagnosis from this data set (up to and just over 30 years).
There is an outlier for one patient that is living after initial diagnosis of neck skin cancer 43 years after initial diagnosis and counting with estimated current age of 68yrs old.
Code
ggplotly(ggplot(df_clin, aes(x=OS_YEARS, y = EST_CURR_AGE, color = OS_STATUS_GRP )) +geom_point() +facet_wrap(~DMT_PRIMARY_SITE))
I would like to see if there are any outliers based on primary cancer site and gender so I will use a box plot this time.
I am interested in the split between SEX (Male or Female) and AGE_AT_INITIAL_DIAGNOSIS (continuous variable that represents the patients age at initial diagnosis)
This chart is replicated for each generic primary cancer site (DMT_PRIMARY_SITE_CLASSIFIED) as a sub plot or graph.
Head and Upper Extremity graphs show outliers for males that were diagnosed earlier than normal for Head/Upper Extremity skin cancer.
This is odd because males on average are diagnosed later with skin cancer except for Lower Extremity. This may be a sign that the data is not great. In addition, There are lot more male samples than female. Conversely, it is possible that the samples for female patients may be lacking.
Code
ggplotly(ggplot(df_clin, aes(x=SEX, y = AGE_AT_INITIAL_DIAGNOSIS, color = SEX)) +geom_boxplot() +facet_wrap(~DMT_PRIMARY_SITE_CLASSIFIED))
Part 2: R Packages
I have selected the testthat and covr packages for this section.
testthat: Unit Testing for R
The testthat package is for writing unit tests against R code (packages / functions).
‘testthat’ is a testing framework that is easy to learn and use, and integrates with RStudio.
covr: Test Coverage for Packages
Track and report code coverage for your package and there is an option to upload the results to a coverage service. The covr package is used to generate reports that highlight the level of unit test coverage by percenatge and number of lines, logic branches (if else etc.) covered. It helps to identify missed lines of code or code that does not have any unit test coverage. It provides information on the code and its structure.
Clean code has unit tests and unit tests also encourages code improvements and refactoring with piece of mind that the code is still functioning to a good level especially if unit test coverage is at a high percentage such as 85%+.
Please see test_final_proj_utils.R (uni tests / test suite) and final_proj_utils.R (source code) files attached.
See test_final_proj_utils.R for a test suite containing 5 tests that roll out to 12 test cases. The structure of the tests is the de-facto “Given, When, Then” formula for test driven development (TDD). - Given a set of test variables/constants/parameters for a given test scenario - When we call a method or unit of code with these configurations / arguments - Then we assert or check the result is what we expect. This is the most important step as we need to test for something tangible. Running a test and not verifying the result would be pointless.
I have demonstrated some of the methods available from the testthat package for asserting the result of the unit test is as expected with functions: expect_null, expect_false, expect_equal, and expect_failure. There are other notable functions where you can expect s3 or s4 objects amongst others. This is very easy to use and from previous experience in unit testing with other languages, testthat package in R matches up quite well with JUnit / PyTest etc.
I have used the covr::file_coverage method to run the unit tests with coverage to allow for code coverage report generation but you also run the tests as a stand alone directly from the test script in RStudio as long as testthat library is loaded.See screenshots for further information on this.
Code
# Run tests with coveragecovr <-file_coverage("final_proj_utils.R", "test_final_proj_utils.R")
[conflicted] Removing existing preference.
[conflicted] Will prefer dplyr::filter over any other package.
[conflicted] Removing existing preference.
[conflicted] Will prefer dplyr::lag over any other package.
[conflicted] Removing existing preference.
[conflicted] Will prefer dplyr::filter over any other package.
[conflicted] Removing existing preference.
[conflicted] Will prefer dplyr::lag over any other package.
[conflicted] Removing existing preference.
[conflicted] Will prefer dplyr::filter over any other package.
[conflicted] Removing existing preference.
[conflicted] Will prefer dplyr::lag over any other package.
Test passed 🎉
Test passed 😸
Test passed 🥳
[1] "data_clinical_patient.txt - importing clinical data"
[1] "data_clinical_sample.txt - importing clinical data"
[1] "data_cna_hg19.seg is not needed for import..."
[1] "data_cna.txt - importing other data"
[1] "data_gene_panel_matrix.txt - importing other data"
[1] "data_mutations.txt - importing other data"
[1] "data_sv.txt - importing other data"
Test passed 🎊
[1] "cases_all.txt - importing cases data"
[1] "cases_cna.txt - importing cases data"
[1] "cases_cnaseq.txt - importing cases data"
[1] "cases_sequenced.txt - importing cases data"
[1] "cases_sv.txt - importing cases data"
Test passed 🎊
Code
# Generate HTML coverage reportreport(covr)
The report function runs the coverage for the source and test code files passed as arguments. This can also be done for a whole package via package_coverage. There are methods to convert the report to cobertura code coverage or sonarqube reports in XML format which often plug-in to build servers like Jenkins for code coverage and static code scanning analysis. More information on the functions available at https://cran.r-project.org/web/packages/covr/covr.pdf.
The report generated is in HTML format and can be viewed directly in RStudio or via internet browser such as chrome.
The test script with the testthat framework installed and loaded allows ‘Run Tests’ option on top RHS of the code editor. The bottom left pain shows the results of running test suite (pass/fail/issues).
. .
The bottom right ‘viewer’ displays report generated by ‘report’ function.
.
.
Part 3: Functions / Programming
S3 Class / Objects / Methods
Create AggregationArgs class from object instance aggregation_drv_class
The code creates a list called aggregation_drv_class with 4 elements: df, col, n_results and digits.
The values of these elements are df_clin, “DRIVER_CLASS”, 3, and 1 respectively.
Then, the class() function is used to assign a class to the aggregation_drv_class list.
The class is set to “AggregationArgs”.
Code
aggregation_drv_class <-list(df = df_clin, col ="DRIVER_CLASS", n_results=3, digits =1)class(aggregation_drv_class) <-"AggregationArgs"
Define countAgg function / method.
The function takes one argument, “object”.
The function then uses the “UseMethod” function to dispatch the method that matches the class of the “object” argument.
This is an example of OOP in R, where different methods can be defined for different classes of objects.
This code defines a function called countAgg that takes an object as its argument.
Within the function, it uses the count_agg function that I previously defined in final_proj_itls.R to print a summary table with count and Frequency info for the data frame and group_by column passed as arguments, it also takes 2 parameters (n_results and digits) that are normally optional (more strict version).
This is an example of how useful OOP programming is for re-usable code whether using S3/S4. S4 is newer and cleaner so I think I prefer S4 (less code after setup).