Janis Harris

Recently Published

Web Scraped Reviews 2, Text Mining, NLP, ML, visNetworks, predict with Selected Keys
Version 2, with some modifications to the original 12 stopwords used as keywords, run without and with 12 additional keywords selected by meeting certain thresholds. R packages used: dplyr, tidyr, tm, wordcloud, ggplot2, visNetwork, igraph, caret, randomForest, MASS, gbm, etc.
Web Scraped Reviews with Manual NLP with regex and keyword extraction to predict Rating by Review
This Rmarkdown file demonstrates how to clean data with regular expressions (regex) for feature extraction from the reviews, and how to use natural language processing (NLP) to predict the rating without applying algorithms to the document term matrix (DTM). It manually breaks down the process of building and re-running a script that predicts the most likely rating based on 12 keywords and the ratio of each term's frequency to the total term frequency.
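The term-to-total ratio described above can be illustrated with a minimal sketch. The original work is an R script; this is a Python illustration only, and the function name `keyword_ratios`, the tokenizing regex, and the sample review are all assumptions rather than anything taken from the published file.

```python
import re
from collections import Counter

def keyword_ratios(review, keywords):
    """Return each keyword's count divided by the total number of word tokens.

    Hypothetical helper: the tokenization rule (lowercase runs of letters and
    apostrophes) is an assumption, not the author's cleaning procedure.
    """
    tokens = re.findall(r"[a-z']+", review.lower())
    total = len(tokens)
    counts = Counter(tokens)
    # Guard against empty reviews so the ratio is always defined.
    return {k: (counts[k] / total if total else 0.0) for k in keywords}

# Made-up sample review with 10 word tokens.
ratios = keyword_ratios(
    "The food was great, and the service was great too.",
    ["the", "great", "bad"],
)
print(ratios)
```

Ratios like these, computed per review for each selected keyword, could then serve as numeric features for predicting the rating.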
Gene Targets by Threshold CNV and Fold Change Values Link Analysis with VisNetwork
Six diseases with sequence information were extracted from GEO databases and analyzed by targeting genes that had high copy number variations and were in either the lowest 5th or highest 95th percentile. Those genes were then analyzed for multiple gene associations and plotted in a link analysis plot, with the genes as the nodes and the diseases as the centroid nodes associated with those genes. The file includes interactive data tables of findings, fully documented steps to reproduce, and an interactive visNetwork link analysis plot at the end.
Link Analysis Setting up a Link Network VisNetwork Htmlwidgets
Based on the LinkedIn Learning course on htmlwidgets in R, covering visNetwork. This uses the data table and code provided, with some modifications, and shows how the data tables can presumably be modified for use on other datasets: for example, gene expression values or fold changes for genes in common with other diseases, or stocks whose businesses are mostly in certain regions and how they perform when some of those businesses default or become insolvent. There are many uses for this handy interactive tool. The plot at the end is interactive, with zoom, visible labels, highlighting of nodes, etc. You are encouraged to try it out, get familiar with it, and use it to analyze other data you have.
Regex extraction from notes of seconds of time for actions in UFC Mazvidal
UFC fighter script in Rmarkdown that shows the use of regular expressions (regex) to extract the sequential actions of fighters in the UFC. This one uses frame-by-frame notes on the actions and reactions of Mazvidal and Till in Europe. Designed for use in machine learning (ML) to predict hits landed based on hits received, missed, and landed in the event.
Text Mining PubMed Articles on Earaches
This Rmarkdown file shows how stemming and lemmatization differ when applied to the document term matrix of text-mined words from ten articles extracted from PubMed, with word clouds showing the most common words in the documents after cleaning.
UFC fighter analytics, ML on hits landed from regex text extraction of descriptive actions
This takes a few samples of fights from aliases VufenSarah (A.Nunez) and Wolfey (Mazvidal) and predicts, for each observation (one second of a particular fighting round, holding a sequence of up to three actions), whether a hit was landed, missed, or received, and what type of action it was. Ground actions are not included as features (actions/reactions) in the predictions. The algorithms used vary and reach around 85% accuracy in predicting whether that second will be a hit landed, based on features chosen to exclude multicollinear ones.
Coronavirus liver tumor and blood capillary samples analyzed for CNVs and such
This is not the current COVID-19 strain, but the more recent coronavirus from 2017-2019. This Rmarkdown file combines gene expression values obtained from an NCBI database to look at genes with more copy number variations (CNVs) and at the fold change of strains over time, covering one day in one study and four days in the other. These are beadchip gene expression values.
grabbing the stocks available from yahoo to Analyze and calculating counts decreasing and increasing
This program grabs the stocks available by ticker symbol on Yahoo Finance and uses all the information from a set date up to today's date. It adds the value from lag days earlier to get the counts of increasing and decreasing days by lag, and saves the result to a file with the date retrieved, the current date, and the lag value.
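One way to read the lag-based counting is sketched below. The original program is in R and pulls live data; this Python illustration uses a made-up list of closing prices, and the function name `count_lagged_moves` and the choice to ignore unchanged days are assumptions.

```python
def count_lagged_moves(prices, lag):
    """Count days whose price is above or below the price `lag` days earlier.

    Hypothetical helper: days where the price equals the lagged price are
    counted as neither increasing nor decreasing (an assumption).
    """
    increasing = 0
    decreasing = 0
    for i in range(lag, len(prices)):
        if prices[i] > prices[i - lag]:
            increasing += 1
        elif prices[i] < prices[i - lag]:
            decreasing += 1
    return increasing, decreasing

# Made-up closing prices for illustration only.
closes = [10.0, 10.5, 10.2, 10.2, 11.0, 10.8]
up_1, down_1 = count_lagged_moves(closes, lag=1)
up_2, down_2 = count_lagged_moves(closes, lag=2)
print(up_1, down_1, up_2, down_2)
```

Running the count for several lag values gives one pair of counts per lag, which matches the description of saving results by lag value.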
Uterine Leiomyoma Beadchip Gene Expressions MySQL and DT package
Use of the DT and RMariaDB packages to display the table of copy number variants (CNVs) and fold change values, where fold change is the ratio of the mean of Uterine Leiomyoma (UL) samples to the mean of nonUL samples. The top CNV and fold change genes were displayed by calling the MySQL database with SQL queries.
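The fold change ratio described here can be written out in a few lines. The published work uses R with a MySQL database; this is a Python illustration only, with a hypothetical function name `fold_change` and made-up expression values for a single gene.

```python
from statistics import mean

def fold_change(ul_values, non_ul_values):
    """Fold change as mean expression in UL samples over mean in nonUL samples.

    Hypothetical helper; assumes the nonUL mean is non-zero.
    """
    return mean(ul_values) / mean(non_ul_values)

# Made-up expression values for one gene, for illustration only.
fc = fold_change([8.0, 12.0], [4.0, 6.0])
print(fc)
```

A value above 1 indicates higher mean expression in the UL samples, and a value below 1 indicates lower.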
Uterine Leiomyoma Beadchip CNVs and FCs with ML to Predict Gene Targets
A machine learning script on uterine leiomyoma (UL) beadchip data from NCBI GEO that shows the copy number variants (CNVs) of each gene and the fold change (FC) in gene expression, using subsets of the genes with the most CNVs and FCs as gene predictors of having UL or not.
Subset of Stock with ML for Days Increasing or Decreasing as Target
Time series of selected stocks, with a single stock subset from a data set of 65 stocks. Counts were added for the days the stock increased or decreased, and for how many times in the given time period the stock increased or decreased for exactly that many days. It also predicts a target of the next day being an increasing or decreasing day, based on the price a set lag days earlier, as used when calculating the counts.