Recently Published
Part 1 and Part 2 together on NKTCL GSE318371 using Seurat to get top genes before machine learning
Parts 1 and 2 cover reading a large RAW file with Seurat and creating the very large layered object of matrices, lists, strings, and so on that Seurat needs to run its analysis. More to come on the unsupervised machine learning algorithms of PCA, t-SNE, UMAP, and clustering with K-Nearest Neighbors to get top genes that predict the aggressive pathology of Natural Killer T-cell Lymphoma (NKTCL). These genes go into our database to build our machine to predict EBV and EBV-associated pathologies such as this one, mononucleosis, and multiple sclerosis, as well as fibromyalgia and Lyme disease.
Unsupervised Learning Data Prep: Beginning Errors with the Seurat Library on NKTCL GSE318371
This RPubs document goes through beginner errors in using Seurat to handle unsupervised RAW gene expression data, where the Seurat objects created have many layers. It walks through extracting just the data frames of counts and fragments, with the cell barcodes, from the array to make one large table. That table lacks the attached hidden layers of gene names and other important information that Seurat needs to run the PCA, K-Nearest Neighbors clustering, UMAP, and t-SNE algorithms to get top clusters. This is part 1. A part 2 was published before this; this one is edited and cleaned up. Removing the previous version.
Mononucleosis Analysis with Machine Learning for Gene Targets, part 4 extension with final additions
This is the last part of the mononucleosis project: data science, machine learning, and analysis of essential amino acids in patients infected with mononucleosis during the first 7 months, compared to healthy controls. The data has been added to the database of target genes that we search and use to build our model for predicting pathologies related to EBV infection at some point in time, and Lyme disease, once we add a few more pathologies: Burkitt and Hodgkin lymphoma as well as head, neck, and throat sarcoma.
Extension part 3 to infectious mononucleosis GSE study in document
In this portion, part 3, we extend the original document and apply the analysis and machine learning to all samples, instead of only the time points with all participants or only the participants who completed the research. The 2-class model predicted with 100% accuracy whether a sample was mono or not, but the 5-class model of infection time point (initial diagnosis, 1 month, 2 months, or 7 months of infection, or healthy) failed. It does seem to be great at distinguishing healthy from mono. We have our top 16 genes and need to look up the microRNA gene IDs to add to our machine model database, which predicts a sample's pathology among Lyme disease, EBV stimulated with IL27, multiple sclerosis, and fibromyalgia. We also need to look up other EBV-associated pathologies: head and neck sarcoma or throat cancer, Burkitt's lymphoma, and Hodgkin's lymphoma.
Part 2 to infectious mononucleosis with search of miRNA top gene targets and machine learning to verify top target genes
This is an extension to the earlier infectious mononucleosis project. It looks at all patients who participated in the study up to 1 month and tests a 2-class model (mono or healthy) as well as a 3-class model (initial diagnosis, 1 month of mono, or healthy). The 2-class model scored 100% accuracy, while the 3-class model was perfect only in predicting the healthy class, not the stage of infection.
Random Forest in microRNA gene expression data on Infectious Mononucleosis
The study cited in the document has all the information on the methods used to derive the NCBI gene expression data. We explore it, analyze the fold change values, and get the top genes, those common to all time points of the mononucleosis gene expression results from the Affymetrix GeneChip array. The microRNA names are not recognized as genes on genecards.org, and Ensembl isn't loading in my browser currently to check. A single internet search says they are microRNAs; they may be indexed in a separate database. MicroRNAs are short non-coding RNAs that regulate gene expression after transcription, mainly by binding messenger RNA and reducing its translation at the ribosome. They are not proteins, but they are still transcribed from genes in the DNA. randomForest predicted the class with 100% accuracy in a 2-class model, but a 5-class model was not at all accurate.
Top genes database recap and store, to add to and change as the mission continues into associated pathologies of EBV and Lyme disease
In building a machine learning tool to predict a class of pathology, we have analyzed: EBV when stimulated with IL27; multiple sclerosis (MS), using a commercial line alongside participating patients; fibromyalgia, where it is unknown at what time point the gene expression data was obtained (after causing pain, or after alleviating it) and unclear why the sample descriptions say rat studies in a human study; and Lyme disease in acute or new infection, after 1 month of antibiotics, after 6 months of antibiotics, and in healthy controls. The media types are PBMCs with LCLs, RBCs, or B-cells, or skeletal muscle, and the processing type is high-throughput RNA-Seq or array processing. This could be very useful and lead to discoveries of associated gene expression changes in populations. Better data in the same media type would be best, of course.
Extracting the mRNA amino acids to find most abundant amino acids with Biostrings on MS data GSE293036
This is an extension to the last two weeks of machine learning, data extraction, exploratory data analysis, and inference on multiple sclerosis data made of 20 base pair long cDNA strings, with the genes found via BLAST (described in the documents but not shown visually). Here we see if it's possible to characterize the top 41 genes that changed the most in multiple sclerosis, both those silenced and those enhanced, by amino acid abundance: whether they consume more or fewer of the amino acids seen in the silenced gene fragments versus the enhanced ones. Also included: an explanation of essential and non-essential amino acids, an exploration of the presence of the neurotransmitter glutamate in the amino acid sequences of the genes, and more.
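As a minimal sketch of the amino acid abundance idea, the base R snippet below counts residues in a one-letter amino acid string and reports the fraction that is glutamate (one-letter code E). The sequence is made up for illustration and is not taken from GSE293036, and the published document may do this differently with Biostrings.

```r
# Hypothetical one-letter amino acid sequence, not from GSE293036.
protein <- "MEEKLAEEGQ"

# Split the string into single residues and count each one.
residues <- strsplit(protein, "")[[1]]
abundance <- sort(table(residues), decreasing = TRUE)

# Fraction of residues that are glutamate (one-letter code E).
glutamate_fraction <- mean(residues == "E")

print(abundance)
print(glutamate_fraction)
```

Running the same count on the silenced and the enhanced fragments separately would give two abundance tables to compare.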
Listing the genes manually found by rank on BLAST to our 20 bp long cDNA strings of MS
Tried to use Bioconductor and example code. Not sure what went wrong, but retrieving information from NCBI failed with the earlier code and the demonstration by AI, and with a dplyr version problem (outdated, or not yet available for R) I couldn't get it up and running. So the genes were input manually by rank of each string in the strands. Many of these multiple sclerosis gene fragments come from chromosome 2 non-coding region 2.12, but many don't. The inverse relationship of the commercial line vs control was corrected: the fold change had been input incorrectly, but simple math fixed it, and the field was kept for an inverse comparison. All fold change values now point in the same direction. These are the genes for the 41 top expressed genes found earlier, which predicted with 100% accuracy, on samples alone and not fold change values, whether a sample was healthy or had multiple sclerosis. We will use these genes as targets when building the machine to predict a pathology among those analyzed thus far (Lyme disease, MS, mono, and fibromyalgia), and will also search for the EBV-associated lymphomas and neck and throat sarcomas.
Retrieving the mRNA from cDNA of MS patient ID_REF and forming amino acid sequences
Not able to use Bioconductor to get the gene name for each barcode ID_REF, for exploring known genes among the top 41 genes found to predict with 100% accuracy whether a sample has MS or is healthy. Instead, this reviews the process of transcription and of translating a protein from mRNA, whose triplet codons are read by transfer RNA at the ribosome. Gaps of 1-2 RNA bases were left unpaired to a codon, so gsub wasn't the best choice. But these barcodes are fragments, and maybe those gaps are deletions, insertions, or translocations in genes that are risk-associated for MS or, better yet, are found in those with MS.
Potential Top 41 genes in Multiple Sclerosis Risk Loci of Allele Variants of cDNA
This extracts the top genes from the data frame built in the previous project from the samples of multiple sclerosis cDNA data in NCBI's GSE293036 (details in the last project). The machine learning model isn't built yet but should be relatively straightforward given other projects. This is part 2, after exploratory data analysis. It goes directly to finding the 20 base pair cDNA fragments with the largest fold change, increase or decrease, compared to healthy control samples. The top 50 genes in enhancer fold change and the bottom 50 genes in silencer fold change were selected from the sample repeats of the 1st patient, then the 2nd patient, and then a comparative commercial patient, using mean values of samples vs control mean values. This found 41 top genes common to all samples. Next will be the machine learning, after data transformation, to see how well these gene variants can predict the class of a sample as healthy or MS.
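The selection step can be sketched in base R as follows. This is an illustration under assumptions, not the code from the document: the fold change values are made up, the helper `top_and_bottom` is a hypothetical name, and `n = 2` stands in for the 50 used in the project.

```r
# Hypothetical helper: IDs of the n largest and n smallest fold changes.
top_and_bottom <- function(fold_change, n) {
  ranked <- names(sort(fold_change, decreasing = TRUE))
  unique(c(head(ranked, n), tail(ranked, n)))
}

# Made-up fold change of sample mean over control mean, per fragment ID.
patient1   <- c(f1 = 8.0, f2 = 0.1, f3 = 1.1, f4 = 5.0, f5 = 0.3, f6 = 1.0)
patient2   <- c(f1 = 6.5, f2 = 0.2, f3 = 4.0, f4 = 1.2, f5 = 0.4, f6 = 0.9)
commercial <- c(f1 = 7.2, f2 = 0.3, f3 = 1.0, f4 = 4.1, f5 = 0.2, f6 = 1.3)

# n = 2 here for brevity; the project description uses 50.
selected <- lapply(list(patient1, patient2, commercial), top_and_bottom, n = 2)

# Keep only the fragments selected in every group.
common <- Reduce(intersect, selected)
print(common)
```

Intersecting the per-group selections is what shrinks the candidate list, which is consistent with 100 candidates per group reducing to 41 in common.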
Multiple Sclerosis 10-50 million nucleotides as rows Data Extraction GSE293036 3 parts
This is a very large database built from the GSM samples of NCBI study GSE293036, on 20 base pair long fragments of nucleotide sequences that are common allele variants, used to find risk loci in multiple sclerosis patients. It had to be done in separate batches for the controls, patient samples, and commercial samples to compare. Each sample had varying fragment nucleotide strands; this filters down to only the strand fragments common to every sample in the control, 2nd patient, 1st patient, and commercial patient. The commercial line used EBV to keep the line alive and replicating fast, similar to the HeLa cells used for making inoculations, because of the fast cell division rate associated with viruses like HPV and EBV. But the EBV and HPV viral strands don't interfere in the host gene expression. I am, though, looking into the relation said to exist between MS and EBV, as well as with mononucleosis, Burkitt lymphoma, Hodgkin lymphoma, and head and neck cancer.
Data Extraction of Multiple Sclerosis complementary DNA 20 base pair barcodes to get top genes Part 1
Extracting very large data of 10-50 million observations (rows), which is time consuming just to pull from the internet, and then reading it into RStudio and transforming it before running machine learning on the top genes. In this case the observations are copy variants in the allele information of the complementary DNA (with thymine), made by reverse transcription of messenger RNA (mRNA). The study cited in the document used this to find multiple sclerosis risk loci variants that enhanced or silenced (upregulated or downregulated) gene activity. This should be interesting and is part of the work to see if there are common associations with EBV infection at various states. No libraries are used, just building the data set of common strands of nucleic DNA in 20 base pair fragments. The study used 2 MS patients, 1 control, and 1 commercial line of MS to compare, with repeat RefSeq analysis: 3 repeats on the control and 5 each on the 2 MS patients and the commercial line.
Part 3 forecasting on deduplicated data for actionable insights in R with prophet
Part 3 on the data, with real results to compare to the bloated results, after removing 7 duplicates. Results are similar; only the numbers change.
Cleaning data to forecast after running summary stats and analysis to build more client income for mobile massage biz from side gig to biz
This uses anonymized mobile massage data that combines income records and consent forms, with the optional surveys attached to each client's consent form. The idea is to make the data show the best massage services to offer, the ideal region, age group, pressure, and other information, and to predict next year's income using the prophet library for R, along with dplyr and ggplot2 for plots. Date variables are no joke if you enter them wrong. Many hours were spent getting correct AI-generated code to turn a month/day/year date with a 4-digit year into one with a 2-digit year, but that was cut from this document so you can avoid the upset. Useful information to help guide this mobile massage provider toward more income by targeting the preferred ideal client: those who return more often and pay more per household.
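For the date conversion, a minimal base R sketch is shown here. The date strings are made up; the key assumption is that the input is consistently month/day/4-digit year.

```r
# Hypothetical date strings in month/day/4-digit-year form.
raw_dates <- c("03/15/2024", "12/01/2023")

# Parse with an explicit format so R does not guess the field order.
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")

# %Y is a 4-digit year, %y is a 2-digit year.
short_dates <- format(parsed, "%m/%d/%y")
print(short_dates)
```

Keeping the parsed `Date` vector around is usually the safer choice, since the 2-digit form is only a display string; prophet itself expects a date column named `ds` alongside the value column `y`.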
Part 1 in EBV infection using gene data to get top genes, study found deficit in IL27RA, this proves it
First part: exploratory data analysis after gathering data from an NCBI gene expression study. The analysis worked to confirm the study's finding that a defective allele copy variant of IL27RA prevents the T-cell immune response in EBV infection, but its main purpose is to extract top genes from the most and least expressed genes by fold change in this 2-patient, 2-control RNA gene expression data from lymphoblastic cell lines in peripheral blood mononuclear cells. These will be compared to the top reactive genes in other data sets related to EBV-associated lymphoproliferative pathologies (MS, Burkitt's lymphoma, Hodgkin's lymphoma), and possibly to the top reactive genes in Lyme disease and in studies of myofascial pain similar to fibromyalgia pain.
Keras Deep Neural Networks on small Lyme disease data with class balancing of revisited Lyme disease 86X80 dataframe
This project follows, with many modifications, a 4-year-old tutorial (details in document) on Deep Neural Networks; the tutorial's short demonstration data is not used here. We revisit the data set made in the earlier PCA and random forest project (see that project for the link to the data) and see how Keras can handle the class imbalances in the data to make predictions on our 4-class target.
There were some packaging and dependency issues, dealt with in an unpublished document. You should have Rtools installed, plus the latest keras and tensorflow packages; these are built on Python modules and bridged through an R package called reticulate, which has some dependency issues of its own. I also had to use nested for loops for the 4 classes.
In the end, a DNN does better on very large data, more like 860,000X80 than 86X80, as it was built for tasks like facial recognition and fingerprint matching. The results are better than using PCA's error or noise to predict classes, but not better than random forest in the caret or randomForest packages of R on this type of data.
PCA analysis part 2 with all components to predict 4 classes in randomForest on 19k wide data of 86 samples
An extension to the last published RPubs document analyzing a 19k+ gene expression dataset for Lyme disease with PCA, this time using all components. Please see my previous Lyme disease RPubs documents to get the data by running the code from the original data set.
PCA analysis in R of the Lyme disease data of gene expression with a 28k feature space
This is an extension to the randomForest analysis of a 28k-feature gene expression dataset to find top genes. The function prcomp, part of R's built-in stats package, performs principal component analysis (PCA) to find the components that explain the variance (the spread, or noise) in the data across all 28k features. Because each component is a combination of all features, PCA cannot point to an exact gene or set of top genes. If tuned well, it can predict the class, which is enough if you only want a model that identifies the infectious stage or healthy samples. It won't benefit personalized gene therapeutics beyond identifying the samples you are working with.
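A minimal sketch of `prcomp` on a small random matrix is given here to show what it returns. The data is simulated, not the Lyme disease set, and the 20 by 50 shape is an assumption chosen to keep the example quick.

```r
set.seed(1)

# Made-up expression matrix: 20 samples (rows) by 50 genes (columns).
expr <- matrix(rnorm(20 * 50), nrow = 20,
               dimnames = list(paste0("sample", 1:20), paste0("gene", 1:50)))

# Center and scale each gene, then compute the principal components.
pca <- prcomp(expr, center = TRUE, scale. = TRUE)

# Proportion of total variance explained by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Sample scores on the first two components, usable as model features.
scores <- pca$x[, 1:2]

# Loadings of the first component: one weight for every gene.
loadings_pc1 <- pca$rotation[, 1]

print(round(var_explained[1:3], 3))
```

The loadings vector has one entry per gene, which illustrates why a component does not single out one gene: every gene contributes some weight to it.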
Fraud detection in Simulated financial data of 10k rows with 500 fraud cases and 9,500 legit cases from Kaggle
This data science project explores financial data and uses the randomForest package to see how well the algorithm handles large data, compared to the very small and imbalanced Lyme disease data in the last publication. Details and the link to the Kaggle data are in the publication.
Using unbalanced Lyme Disease data to test randomForest package and 2 class solver from 4 class
This is an exploratory analysis of the randomForest package, since I used the caret package in earlier R Markdown publications to model data. The Lyme disease data came from the online source in the 1st RPubs document in this profile; we use it again to go from 61% accuracy with 4 classes to 85% accuracy with 2 classes. Tuning was used to find the best parameters on an 80-feature-wide dataset of genes to predict the class of acute versus chronic infection.
Fibromyalgia 2nd part Median fold changes with cross validation added to means and bootstrap outcomes
This is the 2nd part of the machine learning on fibromyalgia data. It builds on the 1st part's work on the means of the samples' fold change values for healthy vs myofascial pain, now compared with the medians of those fold change values, with different parameter tuning for 50, 100, and 1000 bootstrapped samples drawn from only 13 to predict the class correctly, first on the means and then on the medians. It also tries cross validation on the means and medians separately, with the number of folds set to the best bootstrap aggregating value. A few models returned 100% accuracy, on only 3 test samples.
Fibromyalgia RNA-Seq Gene Expression Analysis 12 samples Bootstrap Random Forest Model
In this project, we begin the analysis of gene expression data on trigger point myofascial pain, which is similar to fibromyalgia in its clinical signs and symptoms of chronic pain. The genome data comes from a study of 5 healthy and 7 myofascial pain patients that helped the researchers understand how a drug that starts with 'dex' helps with chronic pain. It includes fragments per kilobase million (FPKM) and counts, both normalized, from the high-throughput fastp gene data collected. This study is a stepping stone to connecting major illnesses associated with Epstein-Barr virus (EBV), such as multiple sclerosis, mononucleosis, Hodgkin's disease, and fibromyalgia. Eventually we can understand how changes in the body lead DNA transcription to make more of some genes' products and less of others, across the 58,000 genes in this study. Details in the document.
Lyme Disease Top Features in Predicting State of illness
Using the R packages tidyr, dplyr, caret, and kernlab to manipulate data from NCBI gene studies, 6 models are fit with 10-fold cross validation and Accuracy as the metric: KNN, rpart, random forest, linear discriminant analysis, and support vector machines with radial and linear kernels. Summary results are then shown. Plots did not display properly in knitr and LaTeX, so they were block commented out. It looks like the top genes are involved in upregulation of lipid regulators, DNA repair, and bile production to digest more fats and cholesterols, but downregulated mitotic activity in cell replication. This covers acute infection to chronic infection up to six months. There are only 86 samples, and the data is not balanced for chronic infection. Tuning and the selection of model parameters can be improved to get better accuracy. For four classes the best model was rpart, but see the notes in the doc for why.
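To make the 10-fold cross validation concrete, here is a base R sketch of how 86 samples can be split into 10 folds. It only illustrates the fold assignment; in the project caret handles this internally (for example through `trainControl(method = "cv", number = 10)`), and the seed and variable names here are assumptions.

```r
set.seed(42)

n_samples <- 86  # sample count mentioned in the description
k <- 10          # number of folds

# Assign each sample to one of k folds, in shuffled order.
fold_id <- sample(rep(seq_len(k), length.out = n_samples))

# Fold sizes: 86 does not divide by 10, so folds hold 8 or 9 samples.
fold_sizes <- table(fold_id)
print(fold_sizes)

# For fold 1: held-out test rows and the remaining training rows.
test_idx  <- which(fold_id == 1)
train_idx <- which(fold_id != 1)
```

With only 8 or 9 samples held out per fold, each fold's accuracy moves in large steps, which is one reason results on 86 unbalanced samples should be read with caution.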