RPubs

by RStudio

JanisCorona

Janis Harris

Recently Published

Relational Genes from Dashboard on nonEBV pathologies compared to mono and EBV

This document compares the top genes from our dashboard, built from the mono and EBV top genes, against genes in fibromyalgia, chronic fatigue syndrome (CFS), Lyme disease, and uterine leiomyoma, as well as mono and the active form of EBV. We found some interesting results: CFS samples are classified almost exclusively as healthy or CFS, with no misclassifications as Lyme disease, fibromyalgia, uterine leiomyoma, mono, or active EBV. Other findings from a random forest classifier are included. Links in document.

about 7 hours ago

Updated Normalized Negative Values in Lyme Disease Data Found 7 top genes

Found 7 top genes after normalizing the negative values in the Lyme disease data as the absolute value of (min - x) divided by the range (max - min), which keeps large values high and low values low. The top genes are the duplicates shared among the classes' top genes, and they predicted 17 of 18 samples correctly in a 4-class random forest classifier model.
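The normalization described above can be sketched in a few lines. This is only an illustration in Python (the project itself was done in R), and the function name and sample values are made up for the example, not taken from the Lyme disease data.

```python
def normalize_min_max(values):
    """Rescale values to [0, 1] as |min - x| / (max - min)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # All values identical: nothing to rank, so return zeros.
        return [0.0 for _ in values]
    return [abs(lo - x) / span for x in values]


# Hypothetical expression values that include negatives.
expression = [-3.0, -1.0, 0.0, 2.0, 5.0]
normalized = normalize_min_max(expression)
print(normalized)
```

The smallest value maps to 0 and the largest to 1, so the ordering of the original values is preserved while the negatives disappear.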

24 days ago

Changed the Lyme disease fold change genes, more to come, just adding them to the pathologies database

The Lyme disease genes were tested for how well they predict the class of healthy, acute, infected at 1 month, or infected at 6 months, that is, in a 4-class random forest model. The model did well on some sample predictions but terribly on others. We added these new top fold change genes to the database because these 52 unique genes performed better than the 8 genes in common, which were removed as duplicates, and better than the 33 genes we had originally found, which we also removed.

28 days ago

Revisit the CHL and DLBCL lymphomas for top genes using healthy samples as controls from AIM and CAEBV study

This project uses healthy total RNA PBMC samples from a different gene study, GSE85599, as controls to get top genes in the Hodgkin and diffuse large B-cell lymphomas. Trained on a set of top genes by fold change value, the 4-class random forest classifier predicted with much better accuracy than before, now that this set of healthy samples serves as a baseline. The duplicated genes common to all lymphomas scored 100% accuracy in every class except pDLBCL, which scored only 50%. The unique top fold change genes of the 3 lymphomas scored worse: 70-80% accuracy on CHL and mDLBCL, 100% on healthy, and 25% on pDLBCL. More to add.

about 1 month ago

Chronic Fatigue Syndrome Analysis Project to put in Tableau dashboard for gene expression comparisons to EBV and others

In this project, we analyzed GSE293840, a chronic fatigue syndrome (CFS) study with 93 CFS cases and 75 healthy controls. We explored the data, filtering and ordering it to remove infinite and NaN values and keep only fold change values above 0, to find the top genes in CFS. We will then add this data to our Tableau dashboard of non-EBV associated pathologies to compare it with fibromyalgia, multiple sclerosis, Lyme disease, and others.
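A minimal sketch of that filtering step follows, written in Python for illustration rather than the R used in the project. It assumes fold change is the case mean divided by the control mean; the gene names and values are hypothetical.

```python
import math


def fold_changes(case_means, control_means):
    """Return {gene: case / control}, keeping only finite values above 0."""
    result = {}
    for gene, case in case_means.items():
        control = control_means.get(gene)
        if control is None:
            continue
        try:
            fc = case / control
        except ZeroDivisionError:
            # R would produce Inf or NaN here; either way the gene is dropped.
            continue
        if math.isfinite(fc) and fc > 0:
            result[gene] = fc
    return result


def top_genes(fc, n):
    """Gene names ordered from highest to lowest fold change."""
    return sorted(fc, key=fc.get, reverse=True)[:n]


# Hypothetical group means, not values from GSE293840.
case = {"GENE_A": 8.0, "GENE_B": 2.0, "GENE_C": 0.0,
        "GENE_D": 3.0, "GENE_E": float("nan")}
control = {"GENE_A": 2.0, "GENE_B": 4.0, "GENE_C": 1.0,
           "GENE_D": 0.0, "GENE_E": 1.0}

fc = fold_changes(case, control)
print(top_genes(fc, 2))
```

GENE_C is dropped for a fold change of 0, GENE_D for a zero control mean, and GENE_E for NaN, leaving only genes with usable ratios to rank.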

about 1 month ago

Minor changes made to Fibromyalgia project for data import into Tableau

Just a few modifications to the dataset to add to the dashboards on Tableau.

about 1 month ago

Project on Acute Infectious Mono and chronic active EBV GSE85599

In this project we extract CEL files on acute infectious mononucleosis (AIM) and chronic active Epstein-Barr virus (CAEBV), get top genes by fold change, and use regex to extract the gene symbol from the platform annotation using the Bioconductor package. We then tested the top 10 stimulated genes by fold change of AIM/healthy and CAEBV/healthy in random forest classification. The CAEBV genes scored 100% accuracy in training, while the AIM genes gave mixed training results: 2 classes scored 100% but the CAEBV class scored 67%. Both gene sets scored 75% accuracy in this 3-class model (healthy, CAEBV, or AIM) on the hold-out validation data, using the usual 80/20 training/testing split. We will add these to the pathology database and Tableau to compare later. Our other mononucleosis project used microRNA, which we could not compare to other EBV associated pathologies on Tableau, so this data will be nice to use.
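The regex step might look something like the sketch below. It is written in Python for illustration (the project used R), and the annotation layout it assumes, "accession // SYMBOL // description // ...", is an assumption about the platform file, as are the sample strings and names.

```python
import re

# Assumed layout: "<accession> // <SYMBOL> // <description> // ..."
SYMBOL_PATTERN = re.compile(r"^\s*\S+\s*//\s*([A-Za-z0-9_.-]+)\s*//")


def extract_symbol(annotation):
    """Return the gene symbol from an annotation string, or None if absent."""
    match = SYMBOL_PATTERN.match(annotation)
    return match.group(1) if match else None


# Hypothetical annotation strings, not copied from the GSE85599 platform.
annotated = "NM_000001 // GENE1 // hypothetical gene one // 1p00 // 1234"
unannotated = "---"

print(extract_symbol(annotated))
print(extract_symbol(unannotated))
```

Probes without a symbol return None, so they can be filtered out before ranking genes by fold change.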

about 1 month ago

Added autism data to pathology database, nothing special, just documenting it

A change to the autism data: we added the top genes with gene IDs to the pathology database, with the gene summary pasted to the study summary, before adding them to the Tableau dashboard of non-EBV associated pathologies to compare across pathologies.

about 1 month ago

Part 2 to Autism project revisited

None of the top genes of the Autism project were among the few (540) genes whose GenBank accession IDs we could match to gene symbols.

about 1 month ago

Revisiting Autism data to merge the GenBank accession IDs with symbols to add to Tableau

We revisited our Autism data of GenBank accession IDs for 41k genes and merged it in Bioconductor to get the gene symbols. We end up with just under 600 gene symbols, but it's better than 40. We will add this to Tableau to compare with genes associated and not associated with EBV.

about 1 month ago

Revisited the CHL & DLBCL EBV gene expression project for different top gene extraction

We revisited the project on elderly EBV lymphomas (CHL or DLBCL) done a few weeks ago and tried a different measure of spread. Since there is no baseline comparison for fold change top genes, we tested whether the spread from max-median (for stimulated genes) to median-minimum in each class of lymphoma could give better predictors in a 3-class random forest classifier. The results were just as poor: the best class in the training model was mDLBCL at 75% accuracy, and the best score for the CHL class was about 72% using a subset of the top genes. Duplicate top genes, which came from extracting the Gene ID from the SPOT.1 feature, threw off the classifier when all features were used, so only subsets or batches of the top genes could be used to predict the class as CHL, pDLBCL, or mDLBCL. The pDLBCL class was at best 50% accurate with a subset of genes. The best overall accuracy was 50%, using a subset of these top genes.
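As a rough illustration of the spread measure, here is a Python sketch (the project used R). It assumes the spread is computed per gene within one class of samples; the gene names and values are hypothetical.

```python
from statistics import median


def spread_scores(values):
    """Upper and lower spread of one gene within one class of samples."""
    med = median(values)
    return {"up": max(values) - med, "down": med - min(values)}


# Hypothetical expression values for one lymphoma class.
class_expression = {
    "GENE_A": [1.0, 2.0, 9.0],
    "GENE_B": [4.0, 5.0, 6.0],
}

spreads = {gene: spread_scores(v) for gene, v in class_expression.items()}

# Rank genes by upper spread as candidate "stimulated" genes.
most_stimulated = max(spreads, key=lambda g: spreads[g]["up"])
print(spreads, most_stimulated)
```

A gene with a large upper spread and a small lower spread has a few samples far above its typical level, which is what this measure uses in place of a fold change against a baseline.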

about 2 months ago

Analysis of Intrahepatic Cholangiocarcinoma ICC gene expression study GSE316921 for EBV association and Machine Learning by class

In this short project, we do a quick exploratory data analysis without plotting, just reading the series text information, and then use fold change values from the supplied data set on intrahepatic cholangiocarcinoma (ICC). The study used two groups to compare hypoxia in wild-type ICC and in ICC with knockout inhibition of the SLC2A1 gene by a short hairpin version, shSLC2A1. After getting fold change values for the top stimulated and inhibited genes, excluding infinites, NaNs, and 0s, we separated the groups and used a quick random forest classifier to see whether the genes would predict the group class well. They scored 100% accuracy in both groups for training and testing, though the sample set is small. We then looked for the EBV genes from our recent projects: LMP1, LMP2, PDL1 (CD274), and PDL2 (PDCD1LG2). Only the PDL1 and PDL2 genes were included, and both were inhibited in both groups.

about 2 months ago

Updated colorectal cancer GSE302491 Tableau edits

Final version of the colorectal cancer project, looking for any EBV association; few genes were found. We extracted and filtered the data, got the fold change values for the AURKA gene knockout in the commercial colon cancer line SW480, and found that, of the EBV genes, PDL1, PDL2, and MCC showed up but not the others. All of these genes predicted the class with 100% accuracy. We then added these genes to the pathology database, added a link to a YouTube playlist on this project, and put together a partial dashboard of EBV associated pathologies with this data.

about 2 months ago

Part1 Analysis and Machine Learning on colorectal gene expression data GSE302491

Starting the data extraction and exploratory data analysis of colorectal cancer gene study GSE302491, we combined the series and text data into a tabular data set to start analyzing the already normalized gene expression data of FPKM, TPM, and read counts. We got 16,122 genes from this by gene symbol. We will be analyzing the data, finding the top genes for colorectal cancer, and seeing if any of the EBV genes from our previous projects show up in this one.

2 months ago

Part 6 GSE305165 Three EBV and Large B-cell Lymphomas CHL & pDLBCL & mDLBCL

This is an extension of part 5b of the analysis and machine learning on elderly Epstein-Barr virus (EBV) patients of various genders with classical Hodgkin lymphoma (CHL), polymorphic diffuse large B-cell lymphoma (pDLBCL), or monomorphic diffuse large B-cell lymphoma (mDLBCL), comparing it to similar gene expression data in GSE318371 on natural killer T-cell lymphomas (NKTCL). We compare the NKTCL top genes, ranked by fold change of NKTCL with EBV over controls, with the variation within samples grouped by lymphoma type (CHL, pDLBCL, or mDLBCL), sex (male or female), and age (50-72 or 73-94 years). Some observations are made. The aim is to see whether T-cell anergy in aging affects the lymphomas, in a side-by-side comparison of the NKTCL study's PBMC genes versus the LCL type genes in the CHL and DLBCLs.

2 months ago

Part 5 GSE305165 EBV large B-cell Lymphomas machine learning test set genes

We did some machine learning with a standard random forest classifier on the study genes and then on genes specific to relationships we saw in the data analysis of groups and subgroups of genes. Not all subsets were done, but some were terrible and some just not good at all. The best performer was the set of genes from the study. The avg/median of groups and subgroups didn't seem to work very well. The best score was 70% with the genes of the study and 60% with some genes that seemed to be more associated with males. The 4-class model on the proposed 4 classes scored 30% with the genes in the study and 40% with the male gene predictor in the male inclusive samples. The transition state seems likely, but at times the classifier misidentified an mDLBCL class as it for IFNG-L when it should be CHL or pDLBCL. It's apparent that the large B-cell lymphomas are EBV associated and so is CHL, but they remain difficult to classify even with a classifier tool.

2 months ago

Analyzing genes in Autism Savant Language & Mild GSE15402 pub2009 MEV files

This is another project, analyzing the savant autism patients and using MEV files. I had never worked with MEV files, but they seem similar to CEL files and can be read in with the normal functions read.delim or read.table after skipping the metadata rows and reviewing the format in Notepad first. There is no need for Bioconductor with this set. The files show the TIGR array data and the actual spot readings, intensities, channels, and channel saturation levels with a quality score; I assume the quality control score is the value that is input as the gene reading. This is RNA data from lymphoblastoid cell lines (LCLs). I didn't read the published article first but will. The links to the gene expression data and published article are in the document.

3 months ago

Part4 GSE305165 EBV and 3 Lymphomas CHL & pDLBCL & mDLBCL analysis and machine learning

In this part 4 of the project analyzing research study GSE305165, we finish the top genes using the genes found in the study, present some background information from the study, compare findings, and analyze the findings by age and gender within each of the 3 types of EBV large B-cell lymphomas. We then run a random forest classifier using all the research genes available and the featured study genes to see how well the target genes predict, first within each of the 3 lymphoma types and then within the 4 proposed classes of lymphomas. The result is that the transition state did score points in the 4-class model, where CHL and pDLBCL failed to predict with a standard random forest classifier. All genes included some lactate dehydrogenase genes, said to be used in labeling latent type 3 EBV along with other factors of mDLBCL. More to analyze in Part 5 with the subset groups of age and gender within each class of lymphoma, to see how well those genes predict in a standard random forest classifier, first in a 3-class and then in a 4-class model.

3 months ago

Part3 GSE305165 analysis EBV 3 lymphomas CHL pDLBCL mDLBCL

In this part we finish getting the top genes whose probe IDs matched a gene, from 15 groups of 20 top genes (top 10 overexpressed and top 10 underexpressed by group avg/group median), plus the available genes from the study's published article. We will use these to build the model to predict the 3 classes of lymphoma within the 3 major group genes, study genes, and all genes by their subcategory within the larger class, and by the 4 new classifications this study is proposing. The machine learning comes next, in Part 4.

3 months ago

Part2 to 3 types Large B cell Lymphoma GSE305165 Analysis Data Extraction

We are finding ways to develop insights from variables that don't have fold change values: there is no baseline because all samples are different states of lymphoma. So we try a change feature of avg/median per group and subgroup to get the top skewed changes in over- and under-expression. More to do on this with the subgroups and the machine learning. The study's genes that were mentioned and available in the data were matched by Affy ID to the listed IDs in a grouped column we made. A table is being built for the top genes, with a side-by-side comparison of the study genes; the document ends with links to the data.
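The avg/median change feature can be illustrated with a short Python sketch (the project used R). Treating a ratio above 1 as skewed toward over-expression and below 1 as skewed toward under-expression is an interpretation for this example, and the gene names and values are hypothetical.

```python
from statistics import mean, median


def skew_ratio(values):
    """Mean divided by median for one gene within one group of samples."""
    med = median(values)
    if med == 0:
        return None  # ratio undefined; skip this gene
    return mean(values) / med


# Hypothetical expression values for one group of lymphoma samples.
group_expression = {
    "GENE_A": [1.0, 2.0, 9.0],
    "GENE_B": [4.0, 5.0, 6.0],
    "GENE_C": [1.0, 8.0, 9.0],
}

ratios = {gene: skew_ratio(v) for gene, v in group_expression.items()}
ranked = sorted((g for g in ratios if ratios[g] is not None),
                key=lambda g: ratios[g], reverse=True)
print(ratios, ranked)
```

Genes at the top of the ranking have a mean pulled above the median by a few high samples, and genes at the bottom have a mean pulled below it by a few low samples.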

3 months ago

GSE305165 EBV in Hodgkins and Diffuse Large B-Cell Lymphoma Analysis Part 1

Figuring out how to read in the CEL files that came with this interesting study, GSE305165, which has a published article on its findings about Epstein-Barr virus (EBV) in Hodgkin lymphoma and two types of diffuse large B-cell lymphoma, polymorphic and monomorphic. The study found that these lymphomas should be categorized as one disease in different states, as aging lowers immunity from immune senescence to immune escape in elderly populations; the median age was 74 years, with a minimum of 50 and a maximum of 94. Some patients dropped out, as there are 47 provided samples but the study reported 57. This is the first time I encountered CEL files, though not the first time approaching something new, so this is part 1. Links to the research article and to the Gene Expression Omnibus entry in the NCBI database are in the document.

3 months ago

Part 2 to Seurat on Gastric Carcinoma Data GSE308231 randomForest top genes Added to Pathologies database

In this part 2 project, we add part 2 to part 1, separated by equal signs and 3 stars. Part 1 covers the QC, filtering, normalizing, getting high variability genes, and clustering with KNN, UMAP, and t-SNE. In part 2 we get fold change values for the top 20 genes, plus the top 10 from Seurat's algorithm with their fold change values as well, and test their significance in predicting the class type of GC or PM (Gastric Carcinoma or Peritoneal Metastasis). Both sets of genes scored 100% accuracy, and the top 20 fold change genes scored 100% on both the training set and the hold-out validation set, after omitting 0.000000 values and removing NAs and infinites. We then added them to our pathologies database. The document also links to the Tableau dashboard of fold changes for each pathology we have analyzed so far, both those related to EBV and those not related but close: fibromyalgia, Lyme disease, EBV infection, mononucleosis (only in miRNA, with no genes shared with other sets), multiple sclerosis, Hodgkin's lymphoma, natural killer T-cell lymphoma, gastric carcinoma, HIV infected Hodgkin's with EBV, and uterine fibroids. After gathering more data, we will see how well a model can be tuned with these top fold change genes to predict pathologies or show their similarities across pathologies by the gene effects of disease.

3 months ago

Gastric Carcinoma Gene expression data Seurat using GSE308231

Here is a project that analyzes gene expression data from GSE308231, which compared stage 1 gastric (stomach) cancer to stage 4 stomach cancer in the peritoneal tissue of the stomach region. There are 6 samples, 3 of each class. The high variability genes (top 10 and top 2,000) were extracted into a table, with links in the document along with links to the study and the study's published article. No machine learning is done in this part; that will come in the next one. Processing time was shorter for this project, with only 6 samples, than for the many samples of the NKTC lymphoma project.

3 months ago

Part 2 to Uterine Fibroids and EBV connection

We extend Part 1 with this Part 2, where we take the genes that are duplicated and shared between classes in GSE244187 and add them to our pathologies database.
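One way to picture finding the genes shared between classes is the Python sketch below (the project used R). The class labels, gene names, and the rule "appears in the top list of at least two classes" are assumptions made for the example.

```python
from collections import Counter


def shared_genes(top_by_class, min_classes=2):
    """Genes that appear in the top-gene list of at least min_classes classes."""
    counts = Counter(
        gene for genes in top_by_class.values() for gene in set(genes)
    )
    return sorted(g for g, n in counts.items() if n >= min_classes)


# Hypothetical top-gene lists per class, not results from GSE244187.
top_by_class = {
    "normal": ["GENE_A", "GENE_B"],
    "at_risk": ["GENE_B", "GENE_C"],
    "fibroid": ["GENE_B", "GENE_C", "GENE_D"],
}

duplicates = shared_genes(top_by_class)
print(duplicates)
```

Raising min_classes to the number of classes would keep only the genes common to every class.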

3 months ago

Finding a connection to EBV with Uterine Fibroids Analysis of GSE244187

This is an analysis of the NCBI study GSE244187 (links in document), a recent high throughput RNA sequencing study of uterine fibroids in the myometrium, adjacent myometrium tissue at risk for fibroid development, and normal myometrium tissue not next to the fibroids of the uterus. There are 3 samples each of normal White female, normal Black female, at risk White female, at risk Black female, fibroid White female, and fibroid Black female. No machine learning is done; we just check whether a handful of genes from the last few studies on EBV associated diseases show fold change differences. There is a difference and a similarity, so there could be an EBV connection. We will add these fibroid genes to our dataset and, when we get to it, see if our large pathology database can predict fibromyalgia, Lyme disease, Hodgkins, Burkitts Lymphoma, gastric carcinoma, nasopharyngeal carcinoma, HIV, normal, intraductal cholangiocarcinoma, and now uterine fibroid. There are still more top genes and their respective pathology studies to analyze before adding them to the database.

3 months ago

Part 2 to the GSE289903 study on Hodgkins and Hodgkins with EBV and HIV using machine learning to distinguish genes with random forest

In this study we categorize the genes, add in genes from other recent projects of ours, get the subtypes, and find out which gene sets can predict the sample class: the top genes specific to the study, the top fold change genes for EBV and HIV in Hodgkin lymphoma, or the study's genes related to tumor mutation burden (TMB), immune checkpoint inhibitors, HIV specificity, and other combinations. We predict first in a 3-class model for diagnosis type, then in a 5-class model of subtype cellularity. The genes specifically selected from this set scored 100% accuracy in predicting the subtype as Nodular Sclerosis or Mixed Cellularity, and 2 different gene feature sets predicted that the NA sample was Nodular Sclerosis. Part 1 is added to the end of this part 2. For class type, the best performing features were not the top fold change genes but all 35 genes, or the top 5 genes of the study, which predicted Hodgkin or Hodgkin with EBV but not Hodgkin with EBV and HIV, as the top 5 genes ranked by tumor mutational burden did, since more of their mutations were found within the HIV samples.

3 months ago

Pulling Gene information from GSE289903 on Hodgkins alone or with EBV or EBV and HIV

In this study, we pull useful information and genes from a published article and its available gene expression data, which we analyze further to compare results between the target genes in the study and our top genes from fold change alone. The data is RNA sequencing of tumor biopsies from 19 samples, all with Hodgkin lymphoma but some also with EBV or with EBV and HIV. Some associations were made, and genes were found that are up or down regulated only with EBV or only with EBV and HIV. The baseline comparison is the cHL only samples. We will explore the subtypes in another part of this project while searching for more top genes to add to our database, to build our machine model to predict pathologies of EBV, EBV associated diseases, Lyme disease, fibromyalgia, mononucleosis, multiple sclerosis, nasopharyngeal carcinoma, Hodgkin's disease, HIV, natural killer T-cell lymphoma, and others yet to be explored.

3 months ago

Part 2 to the analysis of top genes in NPC by EBV with KDM5B in GSE299775

In this part 2 of the previous analysis of top genes in EBV+NPC from GSE299775, we don't do any more data analysis or machine learning but use our part 1 results to add those 30 genes to our pathology database. The database will gather as many genes and media as we research and find through machine learning on various cell types, to predict the class of EBV, EBV associated pathology, multiple sclerosis, fibromyalgia, nasopharyngeal carcinoma, natural killer T-cell lymphoma with EBV, Lyme disease, and mononucleosis. We have yet to add the gastric and biliary duct carcinoma related to EBV, or the Burkitt lymphoma, Hodgkin's lymphoma, and large B-cell lymphoma, all associated with EBV.

3 months ago

More EBV genes 30 of them GSE299775 on Primary EBV profiles and KDM5B and NPC proliferation with chromatin access by latent EBV

In this project, we add details from the PubMed article (link in document) and use the gene expression data from a portion of the research article on primary nasopharyngeal carcinoma (NPC) and the gene KDM5B. Its signature, EBVIR-enhancer-KDM5B, allows interacting regions of the chromatin to become hijacked so that latent type II Epstein-Barr virus (EBV) takes over, making tumor creation, NPC cell proliferation, and rapid decline more likely. The article discusses 2 created substances, JQ1 and GSK-467, that can knock out the KDM5B gene and suppress tumor proliferation by EBV. This analysis looks at 10 of the genes mentioned in the study and confirms their upregulation with upregulation of KDM5B, using fold change values in the MET vs nonMET samples they provided, 28 of the 177 samples. But in a random forest model predicting 3 classes (MET, nonMET, or Normal), these 10 genes predicted with 28% accuracy overall, the top 20 genes by highest fold change predicted with 58% accuracy, and all 30 genes predicted with 71% accuracy. With all 30 genes, the Normal and nonMET classes were predicted with 100% accuracy.
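Because overall accuracy and per-class accuracy can differ a lot, a small sketch of how the two are computed may help. It is written in Python for illustration (the project used R's random forest output), and the labels below are made up to show the calculation; they are not the study's predictions.

```python
from collections import defaultdict


def accuracy_by_class(actual, predicted):
    """Return (overall accuracy, {class: accuracy within that class})."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for a, p in zip(actual, predicted):
        total[a] += 1
        if a == p:
            correct[a] += 1
    per_class = {label: correct[label] / total[label] for label in total}
    overall = sum(correct.values()) / len(actual)
    return overall, per_class


# Made-up labels for a 3-class model.
actual = ["MET", "MET", "MET", "nonMET", "nonMET", "Normal", "Normal"]
predicted = ["nonMET", "MET", "nonMET", "nonMET", "nonMET", "Normal", "Normal"]

overall, per_class = accuracy_by_class(actual, predicted)
print(round(overall, 2), per_class)
```

In this toy case two classes are predicted perfectly while the overall score is pulled down by the third, the same pattern as a model that scores 100% on some classes but well under 100% overall.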

3 months ago

Part 3 final addition and extension to analysis and predictive analytics on top genes EBVaNPC on GSE271486

In this extension to the project, we analyze top genes via differential expression and fold change values using GSE271486 gene expression data from an NCBI study, and review the top genes in the free-to-download PubMed article based on this data, a study that was itself an extension of the authors' own study on EBV. We confirmed that the 13 genes from the study were 100% accurate in predicting class in a 2-class sample of EBVaNPC or NPC, based on the study details in this document, which links to the article and gene expression data. We then add in those 13 genes plus the 20 genes we found (top 10 upregulated and top 10 downregulated by fold change value, filtering out sample means with a 0 value for the gene). We are now moving on to other EBV pathologies, the Burkitt and Hodgkin lymphomas that are said to be associated with EBV infection, along with some other leads on pathologies with EBV associations. We know the association of EBV with multiple sclerosis and Hodgkin's disease is not strong, but we might show that it is once we have our database of pathologies and see what class the machine predicts for these pathologies.

4 months ago

Part 2 to Nasopharyngeal with EBV mutated H101R of LMP1 GSE271486 analyzing 13 genes from study foldchange comparisons to content

This is part 2 of the GSE271486 analysis of top genes by fold change value, looking only at the 13 top genes of influence and interest from the study after reviewing the recently published free PubMed article; all details are in the document. Some misspelled abbreviations of nasopharyngeal carcinoma as NCP were corrected to NPC with search and replace in RStudio, so some errors might remain in the documentation; if you see NCP or NPC, both mean nasopharyngeal carcinoma. Thanks!

4 months ago

Nasopharyngeal Carcinoma and EBV infection led to identifying genes that target NPC from EBV infection

This is a quick project on finding the top genes in a recent study of the immune response genes targeted by Epstein-Barr virus (EBV) in a commercial line of nasopharyngeal carcinoma (NPC) samples. The genes found were the top stimulated or inhibited genes by fold change value, after eliminating sample means of 0. The random forest classification model, a 2-class predictor, scored 100% on the training set of only 2 samples per class and 100% on the testing set of only 1 sample of each class. The gene study used is GSE271486, a recent study with links to the gene expression data as well as the published PubMed report produced by the same researchers.

4 months ago

Adding our top genes of NKTC Lymphoma to our pathologies database

The last project found the top genes of GSE318371, a natural killer T-cell lymphoma study with EBV, using the top variability genes found in Seurat during preprocessing for PCA (an unsupervised learning model) and the highest fold change values from supervised learning modeling, for a total of 19 genes. Details in project.

4 months ago

Part 6 GSE318371 NKTC Lymphoma EBV study using Supervised learning on top genes from PCA of Seurat and FoldChange comparisons get 100% accuracy either set of genes

In this extension to the previous analysis of the gene study on natural killer T-cell lymphoma with EBV, GSE318371, we take the genes found in parts 1-5 with Seurat using high variability and PCA, and compare them with the genes from fold changes, computed from the mean values of the healthy and NKTCL sample sets, taking the top enhanced or silenced genes after excluding NaN values and keeping only the complete cases. Both sets, with the same tuning parameters in the randomForest package (mtry=2 and ntree=10000), scored 100% accuracy on prediction, but the genes from the fold change complete cases that were also in Seurat's top 2,000 high variability genes scored better in training than the top ranked genes of Seurat.

4 months ago

Part 5b Seurat in R to analyze GSE318371 NKTC Lymphoma & EBV data with Unsupervised ML

An extension to the earlier parts of this project on GSE318371, natural killer T-cell lymphoma with Epstein-Barr viral infection, with 12 healthy and 20 pathological cases in single cell RNA sequencing data of 30,960 genes and 407,006 cells in the array. We added to our other work by building data frames for the 30,960 genes with the number of counts in all cells, the number of features (genes) that showed up in all cells, and the percent mitochondrial DNA present in each gene, with the actual gene name; the document links to a few of those data frames. We then ran t-SNE and saw a completely different clustering effect. That concludes the unsupervised learning visualizations on this data with PCA, K-nearest neighbors, UMAP, and t-SNE. Next we will analyze these samples from a supervised perspective, processing the data we have gathered as in our other work, to see how well the top genes predict the healthy or pathology case in a 2-class model. Later we will get the genes associated with EBV, lymphomas, and blood cancers from the KEGG gene expression database of systemic, metabolic, and pathological gene associations and compare them with the gene changes in our samples, using differential expression as fold change values from the mean of healthy to the mean of pathology. This numeric data has all been log normalized, scaled by 10,000, and then scaled by subtracting the mean of each gene across all cells (not samples) and dividing by the standard deviation of that gene across all 407,006 cells of the array.
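The normalization and scaling described in the last sentence can be sketched on a toy matrix. This Python version is only an illustration of the arithmetic, not Seurat itself: it assumes each cell's counts are divided by that cell's total, multiplied by 10,000, and passed through log(1 + x), and that scaling uses the sample standard deviation. The counts are made up.

```python
import math
from statistics import mean, stdev


def log_normalize(counts, scale_factor=10_000):
    """counts: list of cells, each a list of per-gene counts."""
    normalized = []
    for cell in counts:
        total = sum(cell)
        normalized.append(
            [math.log1p(c * scale_factor / total) if total else 0.0
             for c in cell]
        )
    return normalized


def scale_genes(matrix):
    """Center and scale each gene (column) across all cells (rows)."""
    scaled = [row[:] for row in matrix]
    for g in range(len(matrix[0])):
        column = [row[g] for row in matrix]
        mu, sd = mean(column), stdev(column)
        for i in range(len(matrix)):
            scaled[i][g] = (column[i] - mu) / sd if sd else 0.0
    return scaled


# Toy matrix: 3 cells by 2 genes, hypothetical counts.
counts = [[1, 9], [5, 5], [9, 1]]
normalized = log_normalize(counts)
scaled = scale_genes(normalized)
print(scaled)
```

After scaling, every gene has mean 0 and standard deviation 1 across cells, which is why the scaled values are taken across cells rather than across samples.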

4 months ago

Part 5 of Analysis to the workflow with Seurat library for GSE318371 NKTC Lymphoma and EBV samples

This part is an extension to the workflow analysis with unsupervised machine learning to find target genes in the samples from GSE318371 that have Epstein-Barr viral (EBV) infection and aggressive natural killer T-cell lymphoma (NKTCL). We do the PCA, the K-nearest neighbors, and UMAP. Later we will do t-SNE, since it is available in Seurat, and also add this data with gene names as a data frame to link to the clusters of known EBV targeted genes, to see how they change in NKTCL vs healthy. We will then add these genes, minus the EBV target genes already known, to the database of our known pathology gene targets for predicting pathology of EBV, EBV associated diseases like mononucleosis, multiple sclerosis, Hodgkin disease, Burkitt's lymphoma, and nasopharyngeal carcinoma, as well as the non-EBV associated pathologies of fibromyalgia and Lyme disease. More to come.

4 months ago

Part 4 to Seurat Differential Gene Expression Analysis of GSE318371 got it to Scale

Some issues were dealt with and finally resolved in RStudio, though not within Knitr; some subtle changes between versions of Seurat in normalizing and scaling the data needed to be made. The clustering portion took too long to process, but I am working with it some more. This project is not done, but we have the top 10 genes and a database of the expression values and ranking among all single cell RNA sequencing cells. This is a great addition to our model building towards predicting a pathology of EBV, an EBV associated pathology such as mono, MS, Hodgkins, Burkitt lymphoma, or nasopharyngeal carcinoma, Lyme disease, and fibromyalgia. I want to get the other unsupervised algorithms to work, as there is a neat little way to find the clusters in added columns, though it might need a whole other computer, and then use those genes in supervised learning models with the samples given. It would also be great to find a way to get Knitr to print out the results in RStudio. This pathology is related to EBV and the samples have EBV. There is an extension to this learning with single cell RNA sequencing that uses known genes as a group of genes for EBV, which could also be pulled from this data and added to its own database or the same database of top 10 genes, to compare findings or just to compare across our other gene databases, as we have each of those.

4 months ago

Part1 and Part2 together in NKTCL GSE318371 using Seurat to get top genes before Machine Learning

Part 1 and Part 2 cover reading in a large RAW file with Seurat and creating the massive layered array object of matrices, lists, strings, etc. that Seurat needs to run its analysis. More to come on the unsupervised machine learning algorithms of PCA, t-SNE, UMAP, and clustering with K-nearest neighbors to get top genes to predict aggressive natural killer T-cell lymphoma from our database, as we build our machine to predict EBV, EBV associated pathologies such as this one, mononucleosis, and multiple sclerosis, as well as fibromyalgia and Lyme disease.

4 months ago

Unsupervised Learning Data Prep Beginning Errors Seurat Library on NKTCL GSE318371

This RPubs document goes through the beginning errors of using Seurat to handle RAW gene expression data for unsupervised learning, with the many layers of the Seurat objects created. It walks through extracting just the data frames of counts and fragments with the barcodes of cells in the array to make a large table, but that table lacks the attached hidden layers of gene name and other important information that Seurat needs to run the PCA, K-nearest neighbors clustering, UMAP, and t-SNE algorithms to get top clusters. This is part 1. There was a part 2 published before this one; this version is edited and cleaned up, and the previous version is being removed.

4 months ago

Mononucleosis Analysis with Machine Learning for Gene Targets part4 extension final additions added

This is the last part of the mononucleosis analysis: data science, machine learning, and analysis of essential amino acids in patients infected with mononucleosis over the first 7 months compared to healthy controls. The data has been added to the database of target genes used to search and build our model for predicting pathologies related to EBV infection at some point in time, and Lyme disease, once we add in a few more pathologies: Burkitt and Hodgkin lymphoma as well as head, neck, and throat sarcoma.

5 months ago

extension part3 to infectious mononucleosis GSEstudy in document

In this portion, part 3, we extend the original document and apply the analysis and machine learning to all samples, instead of only those time points and participants that completed the research. The 2-class model predicted with 100% accuracy whether or not a sample was mono, but the 5-class model failed on the time point of infection (initial diagnosis, 1 month, 2 months, or 7 months of infection, or healthy). So it seems to be great at distinguishing healthy from mono. We have our top 16 genes and need to look up the microRNA gene IDs to add to our machine model database, which predicts the pathology of a sample among Lyme disease, EBV stimulated with IL27, multiple sclerosis, and fibromyalgia. We also need to look up other EBV-associated pathologies: head and neck sarcoma or throat cancer, Burkitt lymphoma, and Hodgkin lymphoma.

5 months ago

part 2 to infectious mononucleosis with search of miRNA top gene targets and machine learning to verify top target gene

This is an extension to the earlier infectious mononucleosis project. It looks at all patients who participated in the study up to 1 month and tests a 2-class model (mono or healthy) as well as a 3-class model (initial diagnosis, 1 month of mono, or healthy). The 2-class model scored 100% accuracy, while the 3-class model scored perfectly only on the healthy class and not on the stage of infection.

5 months ago

Random Forest in microRNA gene expression data on Infectious Mononucleosis

The study cited in the document has all the information on the methods used to derive the NCBI gene expression data. We explore it, analyze the fold change values, and get the top genes, those common to all time points of the mononucleosis gene expression results from an Affymetrix GeneChip array. The microRNA names are not recognized as genes on genecards.org, and Ensembl isn't loading in my browser currently to check. An internet search says they are microRNA, though I only tried one search; they may or may not be genes as identified by the array machine, or they could be in a separate database. MicroRNAs are short non-coding RNAs that regulate gene expression after transcription by binding messenger RNA and blocking its translation or marking it for degradation. They are not proteins, but they are still transcribed from their own genes in the DNA. randomForest predicted the class with 100% accuracy in a 2-class model, but a 5-class model was not at all accurate.

5 months ago

Top genes database recap and store to add to and make changes as continuing mission to associated pathologies of EBV and Lyme Disease

In generating a machine model tool to predict a class of pathology, we have analyzed: EBV when stimulated with IL27; multiple sclerosis (MS), using a commercial line alongside participating patients; fibromyalgia, where it is unknown at what time point the gene expression data was obtained (after causing pain, or after alleviating it), and why the sample descriptions mention rat studies when these are human studies; and Lyme disease in acute or new infection, after 1 month of antibiotics, after 6 months of antibiotics, and in healthy controls. The media types are PBMCs with LCLs, RBCs, B-cells, or skeletal muscle, and the processing type is high-throughput RNA-Seq or array processing. This could be very useful and lead to discoveries of associated gene expression changes in populations. But better data in the same media type would be best, of course.

5 months ago

Extracting the mRNA amino acids to find most abundant amino acids with biostrings on MS data GSE293036

This is an extension to the last two weeks of machine learning, data extraction, exploratory data analysis, and inference on multiple sclerosis data from 20 base pair long cDNA strings, where the genes were found via BLAST (described but not shown visually in the documents). Here we see if it's possible to characterize the top 41 genes that changed the most in multiple sclerosis, both silenced and enhanced, by whether they consume more or fewer amino acids, based on amino acid abundance in the silenced and enhanced gene fragments. Also included: an explanation of essential and non-essential amino acids, an exploration of the presence of the neurotransmitter glutamate in the amino acid sequences of the genes, and more.

5 months ago

Listing the genes manually found by rank on BLAST to our 20 bp long cDNA strings of MS

I tried to use Bioconductor and example code, but retrieving information from NCBI failed. I'm not sure what went wrong: the earlier code and the AI's demonstration relied on a dplyr version that isn't available for R, and I couldn't get it up and running. So the genes were input manually by rank of the string in the strands. Many of these multiple sclerosis gene fragments come from chromosome 2 non-coding region 2.12, but many don't. I corrected for the inverse relationship of the commercial line vs. control, as the fold change was input incorrectly; simple math corrected it, and I kept the field for an inverse comparison. All fold change values are now in the same direction. These are the genes for the 41 top expressed genes found earlier, which predicted with 100% accuracy, on sample values alone and not fold change values, whether a sample was healthy or had multiple sclerosis. We will use these genes as targets when building the machine to predict a pathology among those analyzed thus far (Lyme disease, MS, mono, and fibromyalgia), and will also search for the EBV-associated lymphomas and neck and throat sarcomas.

5 months ago

Retrieving the mRNA from cDNA of MS patient ID_REF and forming amino acid sequences

I was not able to use Bioconductor to get the gene names for the barcode ID_REFs, for use in exploring known genes among the top 41 genes found to predict with 100% accuracy whether a sample has MS or is healthy. But I reviewed the process of transcription and translation: how a protein is built at the ribosome as transfer RNAs read the mRNA in triplet codons. Gaps of 1-2 nucleotides were left unpaired to a codon, so gsub wasn't the best choice, but these barcodes are fragments, and maybe those are deletions, insertions, or translations in genes that are risk-associated genes for MS, or better yet are found in those with MS.
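
The codon reading described above can be sketched as follows. This is a minimal illustration in Python, not the R code from the document: it assumes each fragment is the sense strand read from its first base, and the 20-base sequence is made up for the example rather than taken from GSE293036. A 20-base fragment gives 6 full codons with 2 bases left over, which is the kind of gap mentioned above.

```python
# Standard genetic code, codons ordered with bases U, C, A, G ('*' = stop).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_fragment(cdna):
    """Return (peptide, leftover_bases) for a cDNA fragment read from base 1."""
    # Assumption: the fragment is the sense strand, so T simply becomes U.
    mrna = cdna.upper().replace("T", "U")
    usable = len(mrna) - len(mrna) % 3  # drop the 1-2 bases that fill no codon
    peptide = "".join(CODON_TABLE[mrna[i:i + 3]] for i in range(0, usable, 3))
    return peptide, mrna[usable:]


# Hypothetical 20-base fragment, for illustration only.
peptide, leftover = translate_fragment("ATGGCCTTTAAGGGTGCATC")
print(peptide, leftover)
```

Returning the leftover bases separately keeps them visible instead of silently dropping them, which a plain string substitution would not do.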

5 months ago

Potential Top 41 genes in Multiple Sclerosis Risk Loci of Allele Variants of cDNA

This extracts the top genes from the data frame built in the previous project from the samples of multiple sclerosis cDNA data in NCBI's GSE293036; details are in the last project. The machine learning model isn't built yet but should be relatively straightforward based on other projects. This is part 2, after the exploratory data analysis. It goes directly to finding the 20 base pair cDNA fragments that had the largest fold change, increase or decrease, compared to healthy control samples. The top 50 genes by enhancer fold change and bottom 50 genes by silencer fold change were selected from the sample repeats of the 1st patient, then the 2nd patient, and then a comparative commercial patient, using mean values of samples vs. control mean values. This found 41 top genes in common to all samples. Next will be the machine learning after data transformation, to see how well these gene variants can predict the class of a sample as healthy or MS.
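
The selection described above can be sketched roughly as below. This is a Python illustration under stated assumptions, not the document's R code: fold change is taken to be the sample mean divided by the control mean, the function names are hypothetical, and the tiny input dictionaries are invented for the example.

```python
def fold_changes(sample_means, control_means):
    """Fold change per fragment: sample mean divided by control mean.

    Fragments with a missing or zero control mean are skipped.
    """
    return {
        frag: sample_means[frag] / control_means[frag]
        for frag in sample_means
        if control_means.get(frag)
    }


def extreme_fragments(fc, n):
    """The n most silenced plus the n most enhanced fragments by fold change."""
    ranked = sorted(fc, key=fc.get)  # ascending: most silenced first
    return set(ranked[:n]) | set(ranked[-n:])


def common_top_fragments(groups, control_means, n=50):
    """Fragments that land in the top/bottom n for every group."""
    picks = [
        extreme_fragments(fold_changes(means, control_means), n)
        for means in groups.values()
    ]
    return set.intersection(*picks)


# Invented toy values, with n=1 so the result is easy to follow.
control = {"fragA": 1.0, "fragB": 1.0, "fragC": 1.0, "fragD": 1.0}
groups = {
    "patient1": {"fragA": 8.0, "fragB": 1.0, "fragC": 0.9, "fragD": 0.1},
    "patient2": {"fragA": 5.0, "fragB": 0.2, "fragC": 1.0, "fragD": 0.5},
    "commercial": {"fragA": 3.0, "fragB": 1.0, "fragC": 2.0, "fragD": 0.3},
}
print(common_top_fragments(groups, control, n=1))
```

With the real data, `n=50` would mirror the top 50 and bottom 50 described above, and the size of the returned set would be the count of fragments common to all groups.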

6 months ago

Multiple Sclerosis 10-50 million nucleotides as rows Data Extraction GSE293036 3 parts

This is a very large database built from the GSM samples of NCBI study GSE293036, on 20 base pair long fragments of nucleotide sequences that are common allele variants, used for finding risk loci in multiple sclerosis patients. This had to be done in separate batches for the controls, samples, and commercial samples to compare. The samples each had varying fragment nucleotide strands; this filters down to only the strand fragments common to every sample in the control, 2nd patient, 1st patient, and commercial patient. The commercial line used EBV to keep the line alive and replicating fast, similar to the HeLa cells used for making inoculations, due to the fast cell division rate from viruses like HPV and EBV. But the EBV and HPV viral strands don't interfere in the host gene expression. I am nonetheless looking for the relation between MS and EBV, which is said to be associated, as well as with mononucleosis, Burkitt lymphoma, Hodgkin lymphoma, and head and neck cancer.

6 months ago

Data Extraction of Multiple Sclerosis complementary DNA 20 base pair barcodes to get top genes Part1

Extracting very large data, 10-50 million observations or rows, which is time consuming just to pull from the internet, and then reading it into RStudio and transforming it before running machine learning on the top genes. In this case the observations are copy variants in the allele information of the complementary DNA (with thymine), made by reverse transcription of messenger RNA (mRNA), which the study cited in the document used to find multiple sclerosis risk loci variants that enhanced or silenced (upregulated or downregulated) gene activity. This should be interesting and is part of the work to see if there are common associations with EBV infection at various states. No libraries were used, just building the data set of common strands of DNA in 20 base pair fragments. The study used 2 MS patients, 1 control, and 1 commercial line of MS to compare, with repeat RefSeq analysis: 3 repeats on the control and 5 each on the 2 MS patients and the commercial line.

6 months ago

part3 forecasting on nonduplicated data for actionable insights in R with prophet to forecast

Part 3 covers the data and real results after removing 7 duplicates, compared to the bloated earlier results. The results are similar; only the numbers change.

6 months ago

Cleaning data to forecast after running summary stats and analysis to build more client income for mobile massage biz from side gig to biz

This uses anonymized mobile massage data, combining income data with consent forms and the optional surveys attached to each client's consent form. The idea is to make the data provide information on the best massage services to offer, ideal region, age group, pressure, and other information, and to predict the next year's income using the prophet library for R, as well as dplyr and ggplot2 for graphical plots. Date variables are no joke if you enter them wrong. Many hours were spent getting correct AI-generated code to turn a month/day/year date with a 4-digit year into one with a 2-digit year. But that was cut out of this document so you can avoid the upset. Useful information to help guide this mobile massage provider toward more income by targeting the preferred ideal client, to get those who return more often and pay more per household.
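
For readers curious about the date step mentioned above, here is one possible sketch in Python (the document itself uses R, and this is not the code that was cut from it). It assumes dates arrive as month/day/4-digit-year text; the function name is hypothetical.

```python
from datetime import datetime


def to_two_digit_year(date_text):
    """Convert 'MM/DD/YYYY' text to 'MM/DD/YY'.

    Parsing into a real date first means a malformed entry raises
    ValueError instead of silently producing a wrong date.
    """
    parsed = datetime.strptime(date_text, "%m/%d/%Y")
    return parsed.strftime("%m/%d/%y")


print(to_two_digit_year("03/07/2024"))
```

Parsing and re-formatting, rather than slicing characters out of the string, is what catches wrongly entered dates early.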

6 months ago

Part 1 in EBV infection using gene data to get top genes study found deficit IL27RA this proves it

First part, exploratory data analysis. After gathering data from an NCBI gene expression study, the analysis worked to prove the study's finding that a defective allele copy variant of IL27RA prevents T-cell immune response in EBV infection. But the main aim is to extract top genes from the most and least expressed genes by fold change in this 2-patient, 2-control RNA gene expression data of lymphoblastoid cell lines in peripheral blood mononuclear cells, to compare to top reactive genes in other data sets related to EBV-associated lymphoproliferative pathologies of MS, Burkitt lymphoma, and Hodgkin lymphoma, and possibly to see how it compares to top genes reactive in Lyme disease and in myofascial pain studies similar to fibromyalgia pain.

6 months ago

Keras Deep Neural Networks on small Lyme disease data with class balancing of revisited Lyme disease 86X80 dataframe

This project follows along, with many modifications, a 4-year-old tutorial (details in document) about deep neural networks, which includes a short demonstration on the tutorial author's data, not used here. We revisit the data set made in the earlier PCA in Random Forest project (see that project for the link to the data) and see how Keras can handle the class imbalances of the data to make predictions on our 4-class target. There were some packaging and dependency issues, dealt with in a document that was not published. Do know you should have Rtools installed, plus the latest keras and tensorflow; those packages are built on Python modules and bridged by an R package called reticulate, which had some dependency issues of its own. I also had to use nested for loops for the 4 classes. In the end, a DNN does better on very large data, not 86x80 but more like 860,000x80, as it was built for facial recognition, fingerprint matching, etc. The results are better than PCA using the error or noise to predict classes, but not better than random forest in the caret or randomForest package of R on this type of data.

6 months ago

DocumentPCA analysis part2 with all components to predict 4 classes in randomForest on 19k wide data of 86 samples

An extension to the last published RPubs document on analyzing a 19k+ gene expression dataset for Lyme disease with PCA, but using all components. Please see all my previous Lyme disease RPubs documents to get the data by running the code from the original data set.

6 months ago

PCA analysis in R of the Lyme disease data of gene expression with a 28k feature space

This is an extension to the randomForest analysis of a 28k-feature gene expression dataset to find top genes. The function prcomp, which ships with base R in the stats package, does principal component analysis (PCA) to find the components that explain the variance in the data across all the 28k features. Because each component is a combination of all the genes, it cannot point to the exact gene or top genes, but it can predict the class if tuned well, if all you want is a predictive model to identify the infectious stage or healthy class of samples. It won't be beneficial to personalized gene therapeutics other than for identifying the samples being worked with.

6 months ago

Fraud detection in Simulated financial data of 10k rows with 500 fraud cases and 9,500 legit cases from Kaggle

This data science project explores financial data and uses some ingenuity within the randomForest package to see how well this algorithm can handle large data, compared to the very small and imbalanced Lyme disease data in the last publication. Details and a link to the Kaggle data are in the publication.

6 months ago

Using unbalanced Lyme Disease data to test randomForest package and 2 class solver from 4 class

This is an exploratory analysis of the randomForest package, as I used the caret package in earlier R Markdown publications to model data. The Lyme disease data came from the online source in the 1st RPubs document in this profile, and we used it again to go from 61% accuracy with 4 classes to 85% accuracy with 2 classes. Tuning was used to find the best parameters on an 80-feature-wide dataset of genes to predict the class of acute infection versus chronic infection.

6 months ago

Fibromyalgia 2nd part Median fold changes with cross validation added to means and bootstrap outcomes

This is the 2nd part of the machine learning on fibromyalgia data. It uses the same 1st-part work on means of the samples' fold change values of healthy vs. myofascial pain, but now compares them with the medians of those fold change values, with different parameter tuning for 50, 100, and 1000 bootstrapped samples drawn from only 13, to predict the class correctly first with the means and then with the medians. It then tries cross validation, with the number of folds set to the best bootstrap aggregating value, on the means and medians separately. A few models returned 100% accuracy, but on only 3 test samples.
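
As a rough picture of what bootstrapping 13 samples means, the Python sketch below draws resamples with replacement and records which samples were left out of each draw. It is an illustration only, not the caret or randomForest setup used in the document; the function name and the seed are assumptions.

```python
import random


def bootstrap_resamples(n_samples, n_resamples, seed=0):
    """Draw bootstrap resamples of sample indices, with replacement.

    Returns a list of (drawn, out_of_bag) pairs. Out-of-bag indices are
    the samples never drawn in that resample, usable as a small test set.
    """
    rng = random.Random(seed)  # fixed seed so the draws are repeatable
    resamples = []
    for _ in range(n_resamples):
        drawn = [rng.randrange(n_samples) for _ in range(n_samples)]
        out_of_bag = sorted(set(range(n_samples)) - set(drawn))
        resamples.append((drawn, out_of_bag))
    return resamples


# 50 resamples from 13 samples, the smallest setting mentioned above.
resamples = bootstrap_resamples(n_samples=13, n_resamples=50)
print(len(resamples), len(resamples[0][0]))
```

With only 13 originals, every resample repeats some samples and omits others, so 50, 100, or 1000 resamples re-weight the same few observations rather than adding new information.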

6 months ago

Fibromyalgia RNA-Seq Gene Expression Analysis 12 samples Bootstrap Random Forest Model

In this project, we begin the analysis of gene expression data on trigger point myofascial pain, which is similar to fibromyalgia in clinical signs and symptoms of chronic pain. The data comes from a study of 5 healthy and 7 myofascial pain patients that helped the researchers understand how a drug that starts with 'dex' helps with chronic pain. It includes fragments per kilobase million (FPKM) and counts, both normalized, from the high-throughput fastp gene data collected. This study is a stepping stone to connecting major illnesses associated with Epstein-Barr virus (EBV) such as multiple sclerosis, mononucleosis, Hodgkin's disease, and fibromyalgia. Eventually we can understand how changes in the body make many people's bodies transcribe more of some genes and less of others, across the 58,000 genes in this study. Details in the document.

6 months ago

Lyme Disease Top Features in Predicting State of illness

Using the R packages tidyr, dplyr, caret, and kernlab to manipulate data from NCBI gene studies, 6 models are fit with 10-fold cross validation and accuracy as the metric: KNN, rpart, random forest, linear discriminant analysis, and support vector machines with radial and linear kernels. Summary results are then shown. The plots did not display properly in knitr and LaTeX, so they were block commented out. It looks like the top genes are involved in upregulation of lipid regulators, DNA repair, and bile production to digest more fats and cholesterols, and in downregulation of mitotic activity in cell replication. This covers acute infection to chronic infection up to six months. There are only 86 samples, and the data is not balanced for chronic infection. Accuracy can be improved with better tuning and model parameter selection. For four classes the best model was rpart, but see the notes in the doc for why.
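
To show what 10 folds look like on 86 samples, here is a small Python sketch of a plain k-fold split. It is only an illustration of the idea, not the caret resampling used in the document; the function name and seed are assumptions, and this version does not stratify by class, which matters for unbalanced data like this.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Split sample indices into k folds; return (train, test) index pairs.

    Each sample appears in exactly one test fold. No stratification by
    class is done here.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed for repeatable folds
    folds = [indices[i::k] for i in range(k)]  # deal indices out like cards
    return [
        (sorted(set(indices) - set(fold)), sorted(fold))
        for fold in folds
    ]


splits = k_fold_indices(n_samples=86, k=10)
print([len(test) for _, test in splits])
```

With 86 samples each test fold holds only 8 or 9 samples, so a rare class can be missing from a fold entirely, which is one reason per-fold accuracy can swing widely on data this small.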

6 months ago
