Recently Published
Clustering Universities: A K-Means Clustering Approach in Python
This project demonstrates the application of K-Means Clustering, an unsupervised learning algorithm, to classify universities into two groups: Private and Public. Although K-Means typically operates without labels, we leverage the true labels in the dataset for educational purposes, providing a unique opportunity to evaluate clustering performance using a classification report and confusion matrix. The analysis begins with exploratory data visualization and summary statistics to understand the dataset’s structure and relationships. We then implement the K-Means algorithm with two clusters and assess its results against the actual classifications. The findings highlight the limitations of K-Means, including its sensitivity to feature scaling and assumptions of cluster shape and size. While the clustering algorithm does not perform well in this specific context, the project underscores the importance of pre-processing, feature selection, and domain expertise in unsupervised learning. This exercise provides valuable insights into the practical application and challenges of clustering techniques in real-world datasets.
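The workflow described above can be sketched in a few lines of scikit-learn. This is a minimal illustration on a synthetic two-group dataset standing in for the universities data (an assumption; the project uses the actual Private/Public labels), including the label-alignment step needed because K-Means cluster IDs are arbitrary.

```python
# Minimal K-Means sketch on synthetic data; evaluated against known labels
# for illustration, mirroring the educational setup described above.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: two clusters playing the role of Private vs Public.
X, y_true = make_blobs(n_samples=300, centers=2, random_state=42)

# K-Means is distance-based, so feature scaling matters (a limitation
# highlighted in the project).
X_scaled = StandardScaler().fit_transform(X)

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_scaled)

# Cluster IDs are arbitrary, so evaluation must tolerate a label swap:
# take the better of the two possible alignments of the confusion matrix.
cm = confusion_matrix(y_true, km.labels_)
accuracy = max(np.trace(cm), cm[0, 1] + cm[1, 0]) / len(y_true)
print(f"best-alignment accuracy: {accuracy:.2f}")
```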
Predicting Iris Species Using Support Vector Machines (SVM): A Detailed Classification Approach
In this project, we analyze the well-known Iris dataset using machine learning techniques to classify different species of Iris flowers. This dataset is a classic in the field of data science and machine learning, often used as an introductory example for classification algorithms. The dataset was first introduced by Sir Ronald Fisher in 1936 and remains a benchmark for evaluating classification models. We employ Support Vector Machines (SVM) to classify the Iris species based on flower measurements. Additionally, we use Grid Search to fine-tune the model’s hyper-parameters for better performance. By comparing the performance of a baseline model with a tuned model, we demonstrate the effectiveness of hyper-parameter optimization.
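The baseline-versus-tuned comparison can be sketched as follows, using scikit-learn's bundled copy of Fisher's Iris data; the grid values here are illustrative choices, not necessarily the ones used in the project.

```python
# Baseline SVM vs Grid Search-tuned SVM on the Iris dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Baseline model with default hyper-parameters.
baseline = SVC().fit(X_tr, y_tr)

# Grid Search over C and gamma with 5-fold cross-validation.
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10, 100],
                            "gamma": [1, 0.1, 0.01, 0.001]}, cv=5)
grid.fit(X_tr, y_tr)

print("baseline accuracy:", baseline.score(X_te, y_te))
print("tuned accuracy:   ", grid.score(X_te, y_te))
print("best params:      ", grid.best_params_)
```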
Predicting Loan Repayment: Leveraging Decision Trees and Random Forest Models on LendingClub Data
This project explores the use of machine learning models to predict loan repayment behavior using historical data from LendingClub.com. Focusing on the years 2007 to 2010, the analysis aims to help investors assess borrower risk and make more informed lending decisions. The dataset includes borrower profiles, loan characteristics, and repayment outcomes, allowing us to identify key factors that influence loan repayment. We employ **Decision Trees** and **Random Forests** to classify whether a borrower is likely to repay their loan in full. To address the class imbalance inherent in financial data, we apply the **Synthetic Minority Over-sampling Technique (SMOTE)**, which improves model performance by balancing the dataset. Our findings reveal that the Random Forest model outperforms the Decision Tree model, achieving higher accuracy and recall rates. The results demonstrate the practical application of predictive analytics in enhancing credit risk assessment, with implications for better investment strategies in peer-to-peer lending.
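A compressed sketch of the modelling step is below, on synthetic imbalanced data rather than the LendingClub records. Plain random over-sampling stands in for SMOTE (which would normally come from the imbalanced-learn package), so this shows the shape of the pipeline, not the exact technique.

```python
# Decision Tree vs Random Forest on imbalanced data, with the training set
# rebalanced by random over-sampling (a simple stand-in for SMOTE).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1, stratify=y)

# Balance the training set by re-drawing minority rows with replacement.
rng = np.random.default_rng(1)
minority = np.where(y_tr == 1)[0]
extra = rng.choice(minority, size=(y_tr == 0).sum() - len(minority))
X_bal = np.vstack([X_tr, X_tr[extra]])
y_bal = np.concatenate([y_tr, y_tr[extra]])

recalls = {}
for model in (DecisionTreeClassifier(random_state=1),
              RandomForestClassifier(random_state=1)):
    model.fit(X_bal, y_bal)
    recalls[type(model).__name__] = recall_score(y_te, model.predict(X_te))
print(recalls)
```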
PART 2: Enhancing Classification Accuracy Using K-Nearest Neighbors (KNN): A Data-Driven Approach Using Python
This project explores the application of the K-Nearest Neighbors (KNN) algorithm to classify data using a synthetic dataset. KNN, a widely used machine learning technique, assigns class labels based on the majority vote of the nearest neighbors. The analysis begins with exploratory data analysis (EDA) to understand the dataset’s characteristics, followed by feature scaling to ensure the accuracy of distance-based computations. We implemented the KNN classifier using Python and evaluated its performance through metrics like precision, recall, and F1-score. Initial results achieved an accuracy of 94%, but through hyperparameter tuning, we optimized the value of K to further improve the model’s performance. The project demonstrates the effectiveness of KNN for classification tasks while highlighting the impact of feature scaling and hyperparameter selection. Future work includes exploring more advanced algorithms and techniques for enhanced predictive accuracy.
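The scale-then-tune pipeline described above can be condensed into a short sketch; the dataset here is generated with scikit-learn as a stand-in for the project's synthetic data.

```python
# KNN with feature scaling and an elbow-style search over K.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=10, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=7)

# Scale features first: KNN's distance computations are unit-sensitive.
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

# Score the model for a range of odd K values and keep the best.
scores = {k: KNeighborsClassifier(n_neighbors=k)
                 .fit(X_tr_s, y_tr)
                 .score(X_te_s, y_te)
          for k in range(1, 30, 2)}
best_k = max(scores, key=scores.get)
print("best K:", best_k, "accuracy:", round(scores[best_k], 3))
```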
Predicting Ad Clicks Using Logistic Regression in Python
This project explores the application of logistic regression to predict whether users will click on online advertisements based on demographic and behavioral data. Using a dataset containing information such as age, daily internet usage, income, and engagement metrics, we conducted extensive exploratory data analysis (EDA) to uncover key patterns and relationships. After cleaning and transforming the data, including feature engineering to extract temporal components, we built a logistic regression model to predict ad clicks. The model achieved a strong balance between precision and recall, indicating its effectiveness in identifying factors influencing user behavior. Key findings suggest that user age, daily internet usage, and time spent on site significantly impact the likelihood of clicking on ads. This analysis demonstrates the power of predictive modeling in digital marketing and highlights potential areas for future model enhancement using more advanced machine learning techniques.
Machine Learning from Disaster: Titanic Survival Analysis with Logistic Regression in Python
This project explores the use of logistic regression to predict passenger survival on the Titanic using a dataset of 891 passengers. The analysis begins with an exploratory data analysis (EDA) to identify key factors influencing survival rates, such as passenger class, gender, and age. Significant missing data in columns like Age and Cabin were handled through imputation and column removal, respectively. Categorical variables were transformed into numerical features to prepare the data for model training. A logistic regression model was developed to predict the likelihood of survival based on selected features. The model achieved an accuracy of 83%, with high precision and recall rates for predicting survival. The analysis revealed that female passengers, younger individuals, and those in first class had higher survival rates. While the model provided valuable insights, there is potential for further enhancement by incorporating additional features and exploring more sophisticated machine learning algorithms. The project demonstrates practical applications of data analysis, statistical modeling, and machine learning in deriving actionable insights from historical data.
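The preparation and modelling steps can be sketched as below. The column names mirror the familiar Titanic schema, but the handful of rows here are invented for illustration, and median imputation is a simplification of the class-informed imputation described above.

```python
# Imputation, categorical encoding, and logistic regression in miniature.
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "Pclass":   [1, 3, 3, 1, 2, 3, 1, 2],
    "Sex":      ["female", "male", "male", "female",
                 "male", "female", "male", "female"],
    "Age":      [29, 35, None, 4, 54, None, 40, 27],
    "Survived": [1, 0, 0, 1, 0, 1, 0, 1],
})

# Impute missing Age values (here with the overall median).
df["Age"] = df["Age"].fillna(df["Age"].median())

# Encode the categorical Sex column as a numeric feature.
df["Sex"] = (df["Sex"] == "female").astype(int)

features = df[["Pclass", "Sex", "Age"]]
model = LogisticRegression().fit(features, df["Survived"])
print("training accuracy:", model.score(features, df["Survived"]))
```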
Channel Success: Leveraging Machine Learning in Python to Predict the Impact of Mobile App vs. Website on Ecommerce Sales
This study explores the relationship between customer interaction channels and purchasing behavior for a New York City-based Ecommerce company specializing in clothing sales and personal styling services. Using linear regression analysis on customer data, we sought to determine whether the company’s mobile app or website more effectively drives customer purchases. Our model achieved a Root Mean Square Error (RMSE) of 1.8, reflecting a reliable predictive capacity, although with some room for further refinement. Findings suggest that customer engagement data can effectively guide strategic channel prioritization. These insights provide the company with a data-driven foundation to enhance customer experience and maximize revenue through targeted digital optimization efforts.
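The regression-with-RMSE evaluation follows the usual pattern sketched below; the "engagement" numbers are simulated stand-ins, not the company's customer data.

```python
# Linear regression evaluated with Root Mean Square Error on held-out data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(10, 40, size=(500, 2))          # e.g. app time, website time
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 2, 500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

# RMSE is in the units of the target, which makes it easy to interpret.
rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
print(f"RMSE: {rmse:.2f}")
```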
Stocks Uncharted: Exploring Trends Through Data Visualization in Python
This project delves into stock price data to explore historical trends, volatility, and seasonal patterns using Python’s data visualization and analysis tools. By conducting a comprehensive exploratory data analysis (EDA), we aim to uncover key insights into stock price movements over time, with a particular focus on periods of heightened market activity. Notably, our analysis reveals that the highest levels of volatility occurred during the 2008 global financial crisis, highlighting the profound impact of economic downturns on market behavior. This project emphasizes skill-building in data visualization and the application of Pandas, and it is intended purely as an educational exercise rather than a basis for financial advice or decision-making.
Going Behind the Call: Uncovering Patterns in 911 Emergencies Using Python and Pandas
In this analysis, we leveraged Python and Pandas to uncover key insights into 911 emergency call patterns. Our findings revealed that medical emergencies (EMS) are the leading cause of calls, with peak call times around 7 AM and 7 PM, likely linked to periods of high activity. Interestingly, January saw the highest volume of emergency calls, while December experienced a drop, perhaps due to holiday festivities. These insights underscore the value of data analysis in understanding and anticipating community needs, helping to better allocate resources and improve emergency response strategies.
Analysis of Sales and Profits in a Retail Store
This article has three sections: analysis of store sales, time series analysis, and the development of a dashboard in PowerBI. In the store sales analysis, I test hypotheses and create a regression model. In the time series section, I forecast store sales from historical data. Finally, I create a dashboard for the retail store to inform managers of developments in the business.
The Law of Large Numbers and Central Limit Theorem with Simulation in Python
This article provides a brief exploration of two fundamental statistical theorems: the Law of Large Numbers (LLN) and the Central Limit Theorem (CLT). It explains the meaning, applications, and significance of both theorems in various fields, such as insurance, polling, and hypothesis testing. Practical simulations are demonstrated using Python, including rolling a die to illustrate the LLN and sampling from a uniform distribution to showcase the CLT. Visualizations are generated using Matplotlib to highlight how sample means converge to expected values and approximate normal distributions. These concepts are essential for understanding statistical inference and the behavior of sample data.
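Both simulations are compact in NumPy: repeated die rolls for the LLN and means of uniform samples for the CLT (the Matplotlib plotting step is omitted here).

```python
# LLN and CLT simulations in NumPy.
import numpy as np

rng = np.random.default_rng(123)

# LLN: the running mean of die rolls converges to the expected value 3.5.
rolls = rng.integers(1, 7, size=100_000)
running_mean = rolls.cumsum() / np.arange(1, len(rolls) + 1)
print("mean after 100,000 rolls:", running_mean[-1])

# CLT: means of uniform(0, 1) samples of size n are approximately normal
# around 0.5 with standard deviation sqrt(1 / (12 * n)).
n = 30
sample_means = rng.uniform(0, 1, size=(10_000, n)).mean(axis=1)
print("mean of sample means:", sample_means.mean())
print("sd of sample means:  ", sample_means.std())
```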
The Law of Large Numbers and Central Limit Theorem with Simulation in R
This article provides a comprehensive exploration of two fundamental statistical theorems: the Law of Large Numbers (LLN) and the Central Limit Theorem (CLT). It explains the meaning, applications, and significance of both theorems in various fields, such as insurance, polling, and hypothesis testing. Practical simulations are demonstrated using R, including rolling a die to illustrate the LLN and sampling from a uniform distribution to showcase the CLT. Visualizations are generated using **ggplot2** to highlight how sample means converge to expected values and approximate normal distributions. These concepts are essential for understanding statistical inference and the behavior of sample data.
House Prices and Access to Parks in London
In this analysis, we examine the connection between access to local parks and housing prices in London. The results reveal a noteworthy association between house prices and access to local parks and metropolitan parks. However, district parks and open spaces show a negative relationship with house prices, although it is not statistically significant. Notably, Westminster consistently stands out as the priciest area. The regression models employed in the analysis show that accessibility to parks plays a limited role in influencing housing prices; instead, the borough is of primary importance in pricing. Nonetheless, access to open spaces, local parks, and district parks has a significant relationship with house prices, while metropolitan parks and regional parks do not.
Enhancing Anti-Churn Strategies: Leveraging Advanced Machine Learning for Targeted Intervention in Telecommunications
This study aims to improve a telecommunications company's anti-churn efforts using advanced machine learning. Six models (Decision Tree, Random Forest, Bagged Decision Tree, Extreme Gradient Boosting, Ridge, and Lasso) were used to identify 2,000 customers for urgent anti-churn action. A logistic regression model successfully pinpointed 2,740 clients, offering targeted insights to enhance the anti-churn campaign's effectiveness. The findings highlight the importance of sophisticated machine learning for precise customer churn prediction in the telecommunications sector.
Predicting Customer Churn with Precision: Unleashing the Power of Machine Learning and Logistic Regression in Telecommunication Analytics
This study focuses on optimizing the anti-churn campaign of a telecommunications company through advanced machine learning techniques. The analysis implements logistic regression to pinpoint 2,000 consumers for urgent contact in the anti-churn initiative. Through the application of a logistic regression model, a total of 2,262 clients were identified, providing targeted insights to significantly enhance the efficacy of the anti-churn campaign. The findings underscore the value of sophisticated machine learning methodologies for precise customer churn prediction and strategic intervention in the telecommunications sector.
Analysis of Health Data in R
Analysis of health data: the health status of citizens
Department Store Sales Time-Series Analysis
I analyze time series data from a department store and attempt to make forecasts. I also discuss the terms associated with time series: trend, seasonality, and noise. I then compare the performance of the models in forecasting. The results show that a polynomial model with seasonal effects does better than a pure linear model.
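The model comparison can be sketched with ordinary least squares on a simulated monthly series (the store data itself is not reproduced here): a linear trend against a quadratic trend with month dummies for the seasonal effects.

```python
# Linear trend vs polynomial trend with seasonal dummies, compared by RMSE.
import numpy as np

rng = np.random.default_rng(5)
t = np.arange(72)                               # six years of monthly data
season = np.tile(np.arange(12), 6)
y = 0.01 * t**2 + 2 * np.sin(2 * np.pi * season / 12) + rng.normal(0, 0.5, 72)

# Linear model: intercept + trend.
X_lin = np.column_stack([np.ones_like(t), t])
# Polynomial model with seasonal effects: trend, trend^2, month dummies
# (first month dropped to avoid collinearity with the intercept).
X_poly = np.column_stack([np.ones_like(t), t, t**2,
                          np.eye(12)[season][:, 1:]])

def rmse(X):
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sqrt(np.mean((y - X @ coef) ** 2))

print("linear RMSE:     ", rmse(X_lin))
print("poly+season RMSE:", rmse(X_poly))
```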
Analyzing Sales Data in Python and Pandas: Unveiling Regional Leaders, Top Sales Representatives, Best-selling Items, and Peak Sales Month
This analysis, conducted using Python and Pandas, delves into a dataset comprising 43 observations and 8 variables related to sales transactions. The primary objectives include identifying the region with the highest sales, pinpointing top-performing sales representatives, determining the best-selling item, and uncovering monthly variations in sales. The results highlight the exceptional dominance of the Central region, contributing over 60% of total sales. Sales representatives Kivell and Parent emerge as frontrunners, while binders stand out as the most popular item. Furthermore, the analysis unveils seasonal variations, with December and July recording peak sales, and March experiencing a notable dip. These findings offer actionable insights for strategic decision-making, emphasizing the significance of regional, individual, and temporal considerations in optimizing sales efforts and business performance. The use of Python and Pandas demonstrates the efficacy of data-driven approaches in extracting meaningful patterns and trends from complex datasets, paving the way for informed decision-making in dynamic business environments.
Financial Development and Educational Attainment: A Cross Country Comparison
In this analysis, I examine the relationship between financial development and educational attainment. My premise is that people who attain higher education are better placed to enter the formal labor market and hence demand financial services such as bank accounts. Education also raises awareness, even among people in the informal and semi-formal sectors, of how to better manage and access finance from formal financial intermediaries. The findings of the analysis confirm these hypotheses.
Assessing the Influence of 2000s Trade with China on Manufacturing Employment Across US Census Zones
We examine the variation in changes in manufacturing jobs across census regions in the United States. We find significant regional variation, with the South Atlantic (satl) region experiencing the smallest reduction and the West North Central (wncen) region the largest. Education, the share of routine jobs, and the share of employment in manufacturing at the start of the decade also have a statistically significant relationship with changes in manufacturing employment.
Hypothesis Testing in R
In examining a dataset featuring exam scores and corresponding hours of study, this analysis endeavors to unravel the intricate relationship between academic performance and study efforts.
Analyzing Deaths of Climbers in Mt. Everest
In this analysis, I explore data from Wikipedia on the recorded number of deaths among climbers of Mt. Everest.
What Is The State of Food Security and Nutrition in the US?
The United Nations Food and Agriculture Organization’s report, “The State of Food Security and Nutrition in the World 2022,” often makes people think that food insecurity is a problem only in other parts of the world, not in the United States.
Admission Delays for Emergency Patients
This project uses machine learning to predict potential admission delays for emergency patients.
Natural Language Processing (NLP): Analyzing Social Media Data Using the ‘Bag of Words’ Technique in R
The proliferation of social media platforms has revolutionized the way people communicate, share information, and express opinions in the digital age. These platforms have become an invaluable source of data for researchers, businesses, and policymakers seeking to gain insights into public sentiment, behavior, and trends. Analyzing social media data, however, presents unique challenges due to the unstructured and often noisy nature of text-based content.
Happiness and Television Consumption
In this analysis, I use data from SOEP to explore the link between TV consumption and happiness. I explore descriptive statistics and plot a heat map. The analysis shows that people who consume more TV are, on average, happier than those who watch less TV. However, the observed link does not imply causality due to possible confounding factors.
Earnings Differentials Between US Born and Migrant Workers
In this analysis, I examine the drivers of the differential in earnings between US-born and migrant workers in the United States. The analysis shows that migrants have the lowest earnings in their first year of arrival. Earnings increase steeply in the first five years, followed by a gradual decline. The drivers of income differentials include gender, age, race, education, certification, hours worked, job location (rural vs urban), and occupation. The data has a severe missing-data problem, making analysis challenging. After including more control variables, it appears that, all else remaining the same, foreign-born workers are likely to earn more than US-born workers. There is potential for omitted variable bias.
Analysis of Africa’s 250 Largest Companies Using Libreoffice Calc and Python
Recently, I published a data analysis project titled Navigating Africa's Business Landscape Using Python on my RPubs site. The analysis provoked quite a bit of interest, which motivated me to go a little deeper into the corporate landscape in Africa. In this more comprehensive analysis project, I use updated Africa Business data to explore the 250 largest formal corporations in Africa. The data captures the revenue, net income, and market valuation of the 250 largest companies in Africa for the years 2022 and 2023 (Kaggle, 2023).
Navigating Africa's Business Landscape Using Python
In this analysis, we use data from Kaggle to illustrate the use of Python and Pandas in data analysis. The data captures the top 2000 companies in the world and is available for free upon registration on the Kaggle website. We filter the data to only include companies from Africa.
Unveiling Corporate Titans: Navigating Global Business Landscapes through Python Data Analysis
In this analysis, we use data from Kaggle to illustrate the use of Python and Pandas in data analysis. The data captures the top 2000 companies in the world and is available for free upon registration on the Kaggle website.
Filtering and Sorting with Pokemon Data in Python
In this project, I analyze Pokemon data to illustrate data filtering using pandas.
Analyzing Store Food Sales Using Python Programming Language
In this analysis, I analyze sales data from different cities and regions in the United States. The objective of this analysis is to illustrate basic data analysis using the Python programming language. Python and R are the two leading programming languages for data analysis.
Analyzing Common English Names Using Python
In this project, I analyze data about popular English language names. The data is available on this [site](https://github.com/dolph/dictionary/blob/master/popular.txt). The project was part of a course created by [FreeCodeCamp](https://www.freecodecamp.org/). The course is available on YouTube on this [link](https://www.youtube.com/watch?v=r-uOLxNrNk8&t=224s). The course is project based. The purpose is to illustrate the basics of data analysis using the Python language.
Scraping Multiple Pages of Text Using R and rvest
In many cases, data is not available in a ready-to-load, ready-to-analyse format; often, it is not available at all. In such cases, we may have to collect the data ourselves. One way to do this is to get data from websites through a process referred to as web scraping. In this section, we examine scraping data from the web using R.
Mining Text and Exploring Sentiments in Leo Tolstoy's 'How Much Land Does a Man Need?'
In conducting sentiment analysis on Leo Tolstoy's short story titled "How much land does a man need?" [@tolstoy1905much], the primary objective is to illustrate automated text mining in R. The secondary objective is to examine the underlying sentiments conveyed within the text by applying a quantitative approach. By analyzing the story through this lens, we aim to gain a deeper understanding of the characters, themes, and overall message conveyed by Tolstoy.
Q&A Analysis of the Drivers of Loan Defaults
Analysis of loan default data
Scraping & Analyzing World Population Data Using Python
In this analysis, I scrape and analyze population and country data from three sites.
Using Machine Learning to Predict Flight Delays : Decision Trees and Random Forests
Flight delays are a significant concern in the airline industry. Apart from the inconvenience caused to travelers, delays also affect the reputation of airlines, negatively impacting market share. In this analysis, I utilize data for flights between New York and Washington DC. The central questions in the analysis are:
- Which factors have a significant relationship to flight delays?
- Can machine learning be useful in predicting flight delays?
Which Beauty Product Combinations do Customers Often Buy Together?
In this mini-project, I explore association rules using data from a beauty product shop. Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness. In any given transaction with a variety of items, association rules are meant to discover the rules that determine how or why certain items are connected (Kotsiantis and Kanellopoulos 2006).
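The interestingness measures behind association rules can be computed by hand on a toy example. The transactions below are invented for illustration; the project itself works on real shop data with a rule-mining package.

```python
# Support, confidence, and lift for the rule shampoo -> conditioner,
# computed directly from a handful of invented transactions.
toy_transactions = [
    {"shampoo", "conditioner"},
    {"shampoo", "conditioner", "lotion"},
    {"lotion"},
    {"shampoo", "conditioner"},
    {"shampoo"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in toy_transactions) / len(toy_transactions)

sup = support({"shampoo", "conditioner"})
confidence = sup / support({"shampoo"})          # P(conditioner | shampoo)
lift = confidence / support({"conditioner"})     # > 1 means positive link
print(f"support={sup:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```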
Estimating Systematic Risk Using the Capital Asset Pricing Model, CAPM
In this analysis, I use stock prices for Microsoft and General Motors (GM) to estimate the systematic risk of the stocks using the Capital Asset Pricing Model (CAPM). CAPM, developed by William Sharpe, Jack Treynor, John Lintner and Jan Mossin (Perold, 2004) quantifies the systematic risk and the expected return on an asset, particularly stocks.
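The estimation itself is a single OLS slope: regress the stock's excess returns on the market's excess returns and read off beta. The returns below are simulated stand-ins, not actual Microsoft or GM prices.

```python
# CAPM beta as the OLS slope of stock excess returns on market excess
# returns: beta = Cov(stock, market) / Var(market).
import numpy as np

rng = np.random.default_rng(42)
market_excess = rng.normal(0.005, 0.04, size=250)   # daily excess returns
true_beta = 1.2
stock_excess = true_beta * market_excess + rng.normal(0, 0.01, size=250)

beta = (np.cov(stock_excess, market_excess)[0, 1]
        / np.var(market_excess, ddof=1))
print(f"estimated beta: {beta:.2f}")
```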
Becoming a Geographer: The Art of Creating Maps in R
In this mini-project, I demonstrate how to make presentable maps using R. Often, researchers need to visualise their data using maps. For instance, finance researchers and professionals may want to visualise the extent of financial inclusion in different countries.
Plasma Ferritin Concentration Study
Which factors affect plasma ferritin concentration (Ferr) among Australian athletes? In this article, we assess the effect of a collection of explanatory variables on the plasma ferritin concentration (Ferr) in 202 Australian athletes. The file Sports Data CW 2021.csv contains the data on plasma ferritin concentration as well as a selection of demographic variables for the 202 male and female athletes.
Visualizing World Happiness in 4 Charts
In this flex dashboard, I use 4 charts to visualize world happiness in 2021.
Weight and Sleeping Patterns in the UK
With sedentary lifestyles, obesity has become a significant health issue globally. Overweight individuals have a higher risk of developing heart disease, stroke, cancer, and kidney disease, among others (Shah et al. 2021). These health issues place additional strain on health facilities and state financial resources. Consequently, much research goes into tracking obesity, mapping out possible health complications associated with obesity, and establishing the factors contributing to obesity. Critically, many resources go to the design of measures to minimise the prevalence of obesity (Fruh 2017; Malik, Willett, and Hu 2013; Lopez 2007).
In this project, we explore the link between age, sleep, and body mass index in a sample of individuals from the United Kingdom.
Determinants of Body Fat in Individuals
The task is to build a regression model that can be useful in explaining and predicting the body fat of individuals.
Sentiment Analysis of Kenya’s Star Newspaper on Friday July 15, 2022
In this analysis, I scrape data from the Star Newspaper for Friday, July 15, 2022 and evaluate the sentiment and topics that dominate the news.
Functional Programming in R Using Purrr
In this article, I highlight the use of map functions from the purrr package in R. Additional information is available in the R help pages and the resources cited in the references section.
Access to Finance: A Global Comparison
In this project, I use financial access data from the IMF to map the current state and the trends in access to finance globally.
Extracting Data from 200 Nested Excel Files Using R
In this project, I demonstrate how to extract data from multiple Excel files nested in different folders and sub-folders. Ordinarily, collecting this data would require a person to open each folder and sub-folder, open each Ms Excel file inside, and then copy-paste the content of each file into a master Excel file.
Even for a very efficient Ms Excel user, cutting and pasting data from 200 Ms Excel files to create one data set is a tall order. Fortunately, the R programming language makes such tasks easy and fast.
Regression analysis of YouTube dataset
Analysis of Global Homicide/Murder Rates
In this project, I use data from the World Bank to develop a dashboard that captures the world's homicide/murder hotspots. Overall, homicides are concentrated in Latin America and the Caribbean, followed by Africa. No country from Asia or Europe is in the top 20 in homicide/murder rates.
Beer debate: Part 1
In this article, I used data from BeerAdvocate.com and Wikipedia to examine drivers of average ratings of beer. In addition, I examined the countries with the highest beer consumption both in absolute terms and in per capita terms. The significant takeaways are as follows.
Who’s the Fastest of All? Analyzing the 100 Metres Men’s Sprint Data
I use data from World Athletics on the best times posted by male athletes in the 100 metres sprint from 1958 to the present.
Reducing the Number of High Fatality Accidents in the UK
In this project, I use data from the Department of Transport in the United Kingdom (UK) to derive insights to reduce fatalities from major accidents. Specifically, the project aims to identify factors associated with road accidents fatalities. The key findings are:
- Accident casualties peak on weekends, rising from Friday and falling on Sunday.
- Accidents casualties vary by time of day.
- Major accidents and accident casualties mainly happen when the weather is fine.
- Major accidents and accident casualties mainly occur in road stretches with speed limits of 30 mph and 60 mph.
John Karuitha: Scraping Data from Websites Using R
I illustrate how to scrape data from websites using R.
KARUITHA: Getting Started With tidymodels
In this exercise, I introduce tidymodels using health insurance data to fit a linear regression model.
Report for the Cars and Pressure Datasets
A short report based on the inbuilt R datasets; cars and pressure