Recently Published
Retrieving the mRNA from cDNA of MS patient ID_REF and forming amino acid sequences
I was not able to use Bioconductor to map the barcode ID_REF values to gene names, which would have allowed exploring known genes among the top 41 genes found to predict with 100% accuracy whether a sample has MS or is healthy. Instead, this reviews the process of transcription and translation, in which mRNA is read in triplet codons by transfer RNA at the ribosome to build the amino acid chain of a protein. Gaps of 1-2 nucleotides were left unpaired to a codon, so gsub wasn't the best choice. But these barcodes are fragments, and maybe those are deletions, insertions, or translocations in genes that are risk associated genes for MS, or better yet, are found in those with MS.
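The codon reading described above can be sketched as follows. This is a minimal Python illustration, not the R code from the published document. It assumes the 20 base pair fragment is already in the coding sense (so only T is swapped for U, with no reverse complement), that it contains only A, C, G, and T, and that reading starts at the first base. The function names and the example fragment are hypothetical. Leftover bases are returned rather than dropped, since a 20 base fragment always leaves 2 bases unpaired to a codon.

```python
from itertools import product

# Standard genetic code, with codons ordered by base U, C, A, G at each position.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def cdna_to_mrna(cdna):
    """Swap thymine for uracil (assumes the fragment is the coding-sense strand)."""
    return cdna.upper().replace("T", "U")


def translate_fragment(cdna):
    """Return (amino acid string, leftover bases that do not fill a codon)."""
    mrna = cdna_to_mrna(cdna)
    usable = len(mrna) - len(mrna) % 3  # length covered by whole codons
    peptide = "".join(
        CODON_TABLE[mrna[i:i + 3]] for i in range(0, usable, 3)
    )
    return peptide, mrna[usable:]


# Hypothetical 20-base fragment: six codons plus two unpaired bases.
peptide, leftover = translate_fragment("ATGGCCAAGTTTGGCTAAGC")
print(peptide, leftover)
```

Here `*` marks a stop codon. Keeping the leftover bases separate makes it easy to check whether a different reading frame would pair them.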
Potential Top 41 genes in Multiple Sclerosis Risk Loci of Allele Variants of cDNA
This extracts the top genes from the data frame built in the previous project from the multiple sclerosis cDNA samples of NCBI's GSE293036; details are in the last project. The machine learning model isn't built yet but should be relatively straightforward based on other projects. This is part 2, after the exploratory data analysis. It goes directly to finding the 20 base pair cDNA fragments that had the most fold change, as an increase or a decrease, compared to healthy control samples. The top 50 genes in enhancer fold change and the bottom 50 genes in silencer fold change were selected from the sample repeats of the 1st patient, then the 2nd patient, and then a comparative commercial patient, using mean values of samples vs. control mean values. This found 41 top genes in common to all samples. Next will be the machine learning after data transformation, to see how well these gene variants can predict the class of a sample as healthy or MS.
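The selection step described above can be sketched roughly as below. This is a Python illustration only, since the original work was done in R. It assumes fold change means the mean of a group's sample repeats divided by the mean of the control repeats, which may not match the exact calculation in the document. The function names and the toy counts are hypothetical.

```python
from statistics import mean


def fold_changes(sample_counts, control_counts):
    """Fold change per fragment: mean of sample repeats / mean of control repeats."""
    result = {}
    for fragment, values in sample_counts.items():
        control_mean = mean(control_counts[fragment])
        if control_mean > 0:  # skip fragments with no control signal
            result[fragment] = mean(values) / control_mean
    return result


def top_and_bottom(fold_change, n):
    """Fragments with the n largest and the n smallest fold changes."""
    ranked = sorted(fold_change, key=fold_change.get, reverse=True)
    return set(ranked[:n]) | set(ranked[-n:])


def common_extremes(groups, control_counts, n=50):
    """Fragments selected as top or bottom n in every group."""
    selections = [
        top_and_bottom(fold_changes(group, control_counts), n)
        for group in groups
    ]
    return set.intersection(*selections)


# Toy data with hypothetical fragment names, using n=1 instead of 50.
control = {"A": [2, 2], "B": [4, 4], "C": [5, 5], "D": [5, 5]}
patient1 = {"A": [10, 12], "B": [1, 1], "C": [5, 5], "D": [5, 6]}
patient2 = {"A": [8, 8], "B": [2, 2], "C": [1, 1], "D": [5, 5]}
commercial = {"A": [6, 6], "B": [1, 3], "C": [5, 5], "D": [4, 6]}

shared = common_extremes([patient1, patient2, commercial], control, n=1)
print(shared)
```

With 50 on each side, each group contributes up to 100 candidate fragments, and only those present in all three selections survive, which is how a list such as the 41 common genes would arise.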
Multiple Sclerosis 10-50 million nucleotides as rows Data Extraction GSE293036 3 parts
This is a very large database built from the GSM samples of the NCBI GSE293036 study on 20 base pair long fragments of nucleotide sequences that are common allele variants used in finding risk loci of multiple sclerosis patients. It had to be done in separate batches for the controls, patient samples, and commercial samples to compare. The samples each had varying nucleotide strand fragments, so this keeps only the strand fragments common to every sample in the control, 2nd patient, 1st patient, and commercial patient. The commercial line used EBV to keep the cells alive and replicating fast, similar to the HeLa cells used for making inoculations, due to the fast cell division rate associated with viruses like HPV and EBV. But the EBV and HPV viral strands don't interfere in the host gene expression. I am, however, looking for the relation between MS and EBV, which is said to be associated with MS as well as with mononucleosis, Burkitt lymphoma, Hodgkin lymphoma, and head and neck cancer.
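The filtering down to fragments shared by every sample can be sketched as a set intersection. This is a small Python illustration under assumptions, not the R code from the document: each sample is assumed to be a mapping from fragment sequence to a count, and the sample names, fragments, and counts below are made up.

```python
def common_fragments(samples):
    """Keep only fragments present in every sample.

    samples: dict of sample name -> dict of fragment -> count.
    Returns: dict of fragment -> dict of sample name -> count.
    """
    shared = set.intersection(*(set(counts) for counts in samples.values()))
    return {
        fragment: {name: counts[fragment] for name, counts in samples.items()}
        for fragment in sorted(shared)
    }


# Hypothetical samples with shortened fragment sequences.
samples = {
    "control_1": {"ACGT": 12, "TTGA": 3, "GGCC": 7},
    "patient1_1": {"ACGT": 40, "GGCC": 2, "CATG": 9},
    "commercial_1": {"GGCC": 5, "ACGT": 18},
}

merged = common_fragments(samples)
print(merged)
```

Intersecting the fragment sets first keeps memory use down on very large inputs, because the per-sample counts are only gathered for fragments that survive the filter.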
Data Extraction of Multiple Sclerosis complementary DNA 20 base pair barcodes to get top genes Part1
This extracts very large data of 10-50 million observations, or rows, which is time consuming just to pull from the internet and then to read into RStudio and transform before running machine learning on the top genes. In this case the observations are copy variants in the allele information of the complementary DNA (with thymine) made by reverse transcription of messenger RNA, or mRNA, which the study in the document used to find multiple sclerosis risk loci variants that enhanced or silenced (upregulated or downregulated) gene activity. This should be interesting and is part of the work to see if there are some common associations with EBV infection at various states. No libraries are used; this just builds the data set of common strands of DNA in 20 base pair fragments. The study used 2 MS patients, 1 control, and 1 commercial line of MS to compare, with repeat RefSeq analysis in 3 repeats on the control and 5 each on the 2 MS patients and the commercial line.
part3 forecasting on nonduplicated data for actionable insights in R with prophet to forecast
Part 3 works on the data after removing 7 duplicates, giving the real results to compare to the bloated results. The results are similar; only the numeric values change.
Cleaning data to forecast after running summary stats and analysis to build more client income for mobile massage biz from side gig to biz
This uses anonymized mobile massage data, combining income data with consent forms and the optional surveys attached to each client's consent form. The idea is to make the data show the best massage services to offer, the ideal region, age group, pressure, and other information, and to predict next year's income using the prophet library for R, along with dplyr and ggplot2 for graphical plots. Date variables are no joke if you enter them wrong. Many hours were spent getting correct AI generated code to turn a month/day/year date with a 4 digit year into one with a 2 digit year, but that was cut out of this document so you can avoid the upset. The result is useful information to help guide this mobile massage provider toward more income by targeting the preferred ideal client, to get those who return more often and pay more per household.
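For reference, the date conversion mentioned above takes only a few lines once the input format is pinned down. This is a Python sketch rather than the R code that was cut from the document, and it assumes the dates arrive as month/day/year text with a 4 digit year. The function name is hypothetical.

```python
from datetime import datetime


def to_two_digit_year(date_text):
    """Convert 'month/day/4-digit-year' text to 'mm/dd/yy'.

    Raises ValueError if the text is not a valid date in that format.
    """
    parsed = datetime.strptime(date_text, "%m/%d/%Y")
    return parsed.strftime("%m/%d/%y")


print(to_two_digit_year("3/7/2024"))
```

Parsing into a real date object first, instead of cutting characters out of the text, means an impossible date such as 13/45/2024 fails loudly rather than slipping into the forecast.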
Part 1 in EBV infection using gene data to get top genes study found deficit IL27RA this proves it
This first part is exploratory data analysis, after gathering data from an NCBI gene expression study. The analysis worked to prove the study's finding that a defective allele copy variant of IL27RA prevents T-cell immune response in EBV infection. The main aim, though, is to extract top genes from the most and least expressed genes by fold change in this 2 patient and 2 control RNA gene expression data of lymphoblastic cell lines in peripheral blood mononuclear cells. These will be compared to the top reactive genes in other data sets related to the EBV associated lymphoproliferative pathologies of MS, Burkitt lymphoma, and Hodgkin lymphoma, and possibly to the top reactive genes in Lyme disease and in myofascial pain studies similar to fibromyalgia pain.
Keras Deep Neural Networks on small Lyme disease data with class balancing of revisited Lyme disease 86X80 dataframe
This project follows along, with many modifications, a 4 year old tutorial (details in document) about Deep Neural Networks and its short demonstration on the tutorial author's data, which is not used here. We revisit the data set made in the earlier PCA in Random Forest project (see that project for the link to the data) and see how Keras can handle the class imbalances of the data to make predictions on our 4 class target.
There were some packaging and dependency issues and changes, dealt with in a document that is not published. Do know that you should have Rtools installed, along with the latest keras and tensorflow, to install those packages. They are built on Python modules and bridged to R by a package called reticulate, which has some dependency issues that I found but didn't publish. I also had to use nested for loops for the 4 classes.
In the end, a DNN does better on very large data, not 86X80 but more like 860,000X80, as it was built for facial recognition, fingerprint matching, etc. The results are better than PCA using the error or noise to predict classes, but not better than random forest in the caret or randomForest packages of R on this type of data.
PCA analysis part2 with all components to predict 4 classes in randomForest on 19k wide data of 86 samples
An extension to the last published Rpub document on analyzing a 19k+ gene expression dataset for Lyme disease with PCA, but using all components. Please see all of my previous Lyme disease Rpub documents to get the data by running the code from the original data set.
PCA analysis in R of the Lyme disease data of gene expression with a 28k feature space
This is an extension to the randomForest analysis of a 28k feature dataset of gene expression data to find top genes. The prcomp function in R is part of the base set of functions and does principal component analysis, or PCA, to find the components that explain the error, noise, or variance in the data that prevents a line in multidimensional space from being fit to all 28k features. Because it is based on the error space, it cannot predict the exact gene or top genes, but if tuned well it can predict the class when all that is wanted is a predictive model to identify samples as an infectious stage or healthy. It won't be beneficial to personalized gene therapeutics other than for identifying the samples being worked with.
Fraud detection in Simulated financial data of 10k rows with 500 fraud cases and 9,500 legit cases from Kaggle
This data science project explores financial data and uses ingenuity within the randomForest package to see how well this algorithm can handle large data compared to the very small and imbalanced Lyme disease data in the last publication. Details and the link to the Kaggle data are in the publication.
Using unbalanced Lyme Disease data to test randomForest package and 2 class solver from 4 class
This is an exploratory analysis of the randomForest package, as I used the caret package in earlier rmarkdown publications to model data. The Lyme disease data came from the online source in the 1st rpub document in this profile, but we used it again to go from 61% accuracy with 4 classes to 85% accuracy with 2 classes. Tuning was used to find the best parameters on an 80 feature wide dataset of genes to predict the class of acute infection versus chronic infection.
Fibromyalgia 2nd part Median fold changes with cross validation added to means and bootstrap outcomes
This is the 2nd part of the machine learning on the fibromyalgia data. It builds on the 1st part, which used the means of the samples' fold change values for healthy vs. myofascial pain, and now compares them with the medians of those fold change values. It uses different parameter tuning with 50, 100, and 1000 bootstrapped samples drawn from only 13 samples to predict the class correctly, first on the means and then on the medians. It then tries cross validation on the means and medians separately, with the number of folds set to the best bootstrap aggregating value. A few models returned 100% accuracy on only 3 test samples.
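The bootstrapping idea above, drawing many resamples from only 13 samples, can be sketched like this. It is a Python illustration of plain bootstrap resampling with replacement, not the R model tuning used in the document. The function name and the seed are hypothetical.

```python
import random


def bootstrap_indices(n_samples, n_bootstraps, seed=0):
    """Draw bootstrap resamples of row indices, with replacement.

    Returns a list of (in_bag, out_of_bag) pairs. Out-of-bag rows are the
    ones never drawn in that resample and can act as a small test set.
    """
    rng = random.Random(seed)  # fixed seed so the draws are repeatable
    draws = []
    for _ in range(n_bootstraps):
        in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
        out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
        draws.append((in_bag, out_of_bag))
    return draws


# 50 resamples of 13 samples; 100 or 1000 work the same way.
draws = bootstrap_indices(n_samples=13, n_bootstraps=50)
print(len(draws), len(draws[0][0]))
```

Each resample repeats some samples and leaves others out, so with so few samples the held-out set for any one draw is very small. That is one reason a 100% accuracy on 3 test samples should be read with caution.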
Fibromyalgia RNA-Seq Gene Expression Analysis 12 samples Bootstrap Random Forest Model
In this project, we begin the analysis of gene expression data on trigger point myofascial pain, which is similar to fibromyalgia in its clinical signs and symptoms of chronic pain. The genome data was used in a study of 5 healthy and 7 myofascial pain patients that helped the researchers understand how a drug that starts with 'dex' helps with chronic pain. It included fragments per kilobase million (FPKM) and counts, both normalized, from the high throughput gene fastp data collected. This study is a stepping stone to connecting major illnesses associated with Epstein-Barr virus (EBV), such as multiple sclerosis, mononucleosis, Hodgkin's disease, and fibromyalgia. We can eventually understand how changes in the body of many people make their body undergo DNA transcription to make more of some genes and less of others, when dealing with the 58,000 genes in this study. Details in the document.
Lyme Disease Top Features in Predicting State of illness
This uses the R packages tidyr, dplyr, caret, and kernlab to manipulate data from NCBI gene studies. Six models are fit with 10 folds of cross validation and measured by Accuracy: KNN, rpart, random forest, linear discriminant analysis, support vector machines with a radial kernel, and support vector machines with a linear kernel. Summary results are then shown. The plots had errors displaying properly in knitr and LaTeX, so they were block commented out. It looks like the top genes are involved in upregulation of lipid regulators, DNA repair, and bile production to digest more fats and cholesterols, but with downregulated mitotic activity in cell replication. This is from acute infection to chronic infection up to six months. There are only 86 samples, and the data is not balanced for chronic infection. Tuning can be improved by selecting better model parameters to get better accuracy. For four classes the best model was rpart, but see the notes in the doc for why.