Current Projects
Language: Python
Alzheimer's Disease (AD) is one of the most prevalent and devastating diseases in the world, affecting an estimated 55 million people. Two proteins have become notoriously associated with AD: the Tau protein and the amyloid-beta peptide, which is derived from the amyloid precursor protein (APP). Using drugs pulled from the DrugBank database, this project applies Graph Neural Networks to nominate existing drugs as potential treatments for Alzheimer's Disease that may reduce levels of Tau and APP in the brain (a minimal sketch of the graph-classification approach follows the list below).
- Training data were obtained from PubChem to classify compounds by whether they inhibit levels of Tau or APP.
- Molecular properties and ADMET predictions were evaluated to prioritize compounds more likely to cross the blood-brain barrier.
- Eight drugs were selected and presented as the final nominations to prioritize for further study.
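Below is a minimal sketch of the kind of molecular graph classifier used here, assuming PyTorch Geometric and molecules already featurized as graphs; the layer sizes, two-layer GCN architecture, and training loop are illustrative choices, not the exact setup from the project.

```python
# Minimal sketch of a molecular graph classifier, assuming PyTorch Geometric.
# The hidden size and two-layer GCN are illustrative, not the project's exact architecture.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class MoleculeClassifier(torch.nn.Module):
    def __init__(self, num_node_features: int, hidden_dim: int = 64):
        super().__init__()
        self.conv1 = GCNConv(num_node_features, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)
        self.out = torch.nn.Linear(hidden_dim, 2)  # inhibitor vs. non-inhibitor

    def forward(self, x, edge_index, batch):
        # Two rounds of message passing over the molecular graph
        x = F.relu(self.conv1(x, edge_index))
        x = F.relu(self.conv2(x, edge_index))
        # Pool atom embeddings into a single molecule-level embedding
        x = global_mean_pool(x, batch)
        return self.out(x)

# Usage sketch (assuming `loader` is a torch_geometric DataLoader of labeled molecules):
# model = MoleculeClassifier(num_node_features=loader.dataset.num_node_features)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# for data in loader:
#     optimizer.zero_grad()
#     loss = F.cross_entropy(model(data.x, data.edge_index, data.batch), data.y)
#     loss.backward()
#     optimizer.step()
```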
Investigating Bias Between Popular News Sources with Unsupervised Topic Modeling and Sentiment Analysis.
Language: Python
Nearly 1.8 million news articles published in 2020 are available at the Harvard Dataverse, collected from more than 500 different news sources. Because news sources often have clear political leanings, I used several topic modeling and sentiment analysis algorithms to investigate whether noticeable differences existed between news sources when covering similar topics.
- The analysis included articles published by CNN, CBS, Fox News, the New York Times, USA Today, and the Washington Post. 3,000 articles were randomly selected from each source.
- LDA and LSI topic modeling algorithms were fit on bag-of-words and term frequency–inverse document frequency (TF-IDF) representations of the text data.
- VADER and TextBlob sentiment analysis tools were applied to evaluate differences in sentiment, as well as how objective the coverage was, across news sources (a condensed sketch of this pipeline appears below).
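A condensed sketch of the pipeline, assuming gensim for LDA/LSI, NLTK's VADER, and TextBlob; the placeholder documents, token preprocessing, and topic count are stand-ins rather than the settings used in the analysis.

```python
# Condensed sketch of the topic-modeling and sentiment pipeline, assuming gensim,
# NLTK's VADER (requires the vader_lexicon resource), and TextBlob.
from gensim import corpora, models
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from textblob import TextBlob

articles = ["Full text of article one ...", "Full text of article two ..."]  # placeholder docs
tokens = [doc.lower().split() for doc in articles]  # real preprocessing was more involved

# Bag-of-words and TF-IDF representations
dictionary = corpora.Dictionary(tokens)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokens]
tfidf_corpus = models.TfidfModel(bow_corpus)[bow_corpus]

# Topic models fit on the two representations (topic count is a placeholder)
lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=10)
lsi = models.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=10)

# Sentiment (VADER compound score) and subjectivity (TextBlob) per article
vader = SentimentIntensityAnalyzer()
for doc in articles:
    compound = vader.polarity_scores(doc)["compound"]
    subjectivity = TextBlob(doc).sentiment.subjectivity
    print(f"sentiment={compound:+.2f}  subjectivity={subjectivity:.2f}")
```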
NBA Betting: Using Team Statistics to Classify the Spread. Can I Make a Profit?
Language: Python
After scraping NBA team data from the Basketball Reference website, I attempted to create a profitable classification model for betting on NBA games. The features used in the model were based on the season statistics of the two teams playing each game.
Betting sites determine the favorite and underdog for each matchup along with a predicted margin of victory. In simple terms, bets are placed on whether or not the favorite will exceed that predicted margin. My aim was to classify each game by whether the predicted margin of victory would be exceeded, and to bet accordingly.
- The models implemented included Random Forest, Support Vector Classifier (SVC), and Logistic Regression.
- The models were tested using different probability thresholds that dictated how conservatively or aggressively betting labels were assigned to the data (sketched below).
- Of the 20 model-and-threshold combinations tested, 10 obtained a marginal profit. See my write-up in the project details link below!
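A minimal sketch of the model-and-threshold comparison, assuming scikit-learn, a feature matrix of season statistics, and a binary "predicted margin exceeded" label; the placeholder data and the 0.55/0.60/0.65 cutoffs are illustrative, not the exact thresholds tested.

```python
# Minimal sketch of the model-and-threshold comparison, assuming scikit-learn.
# X is a placeholder for season-statistic features; y marks whether the predicted
# margin of victory was exceeded. The cutoffs shown are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = np.random.rand(500, 20), np.random.randint(0, 2, 500)  # placeholder data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "random_forest": RandomForestClassifier(random_state=0),
    "svc": SVC(probability=True, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    for threshold in (0.55, 0.60, 0.65):
        # Only "bet" on games where the model is confident enough; a higher
        # threshold means fewer, more conservative bets.
        bets = proba >= threshold
        hit_rate = (y_test[bets] == 1).mean() if bets.any() else float("nan")
        print(f"{name} @ {threshold:.2f}: {bets.sum()} bets, hit rate {hit_rate:.2f}")
```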
Investigating the Impact of Environmental Variables on Violent Behavior in New York City.
Language: R
The New York Police Department stores data for all shooting incidents reported to the police between 2006 and 2019. This dataset includes demographic information about each incident, including the location, perpetrator/victim sex, age, and race, in addition to whether the shooting resulted in a death.
- Given the increasing number of studies suggesting that environmental conditions influence negative human behaviors, I performed a regression analysis to investigate whether acute levels of various pollutants, barometric pressure, and daily temperature showed any relationship with the number of shooting incidents in Bronx County, New York (an illustrative sketch follows below).
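The analysis itself was done in R; the sketch below uses Python's statsmodels only to illustrate the kind of count regression involved. The simulated data, column names (pm25, ozone, pressure, temp), and the Poisson specification are assumptions standing in for the actual variables and model.

```python
# Illustrative Python sketch of the regression idea; the original analysis was done in R.
# The simulated predictors and Poisson specification are assumptions, and the outcome
# here is a placeholder (the real counts came from the NYPD shooting-incident dataset).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 365  # one hypothetical year of daily observations
daily = pd.DataFrame({
    "pm25": rng.normal(9, 3, n),        # fine particulate matter
    "ozone": rng.normal(0.035, 0.008, n),
    "pressure": rng.normal(1013, 5, n),  # barometric pressure (hPa)
    "temp": rng.normal(62, 15, n),       # daily temperature (F)
})
daily["shootings"] = rng.poisson(1.5, n)  # placeholder daily shooting counts

# Regress daily counts on the environmental conditions
model = smf.poisson("shootings ~ pm25 + ozone + pressure + temp", data=daily).fit()
print(model.summary())
```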
Visualizing Changes in Community Health Data. A partnership with OCM Advisory.
Tool: Tableau
OCM Advisory provides consulting services to small and minority-run businesses, helping them grow and thrive. This project used Tableau to create an interactive application for the company in which specific health indicators could be visualized over time and across counties, in an effort to show how minority populations in New Jersey may be disproportionately affected.
Database Design. Developing a Normalized Relational SQL Database.
Languages: SQL and R
Given data in the form of XML documents, I created a relational database to store data regarding peer-reviewed publications from PubMed. The initial steps involved using Lucidchart to visualize the schema and develop a plan for the creation of the database. After the initial design, I did the following:
- Instantiated a normalized OLTP SQLite database in RStudio.
- Parsed and extracted data from the XML documents using the XML package and other R packages, then loaded the data into the database.
- Manipulated the database to create fact tables for efficient querying.
- Mined data from the database using SQL queries to gather information on publication timelines, and visualized those patterns with the ggplot2 R package (an analogous sketch of the parse-and-load step follows below).
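The pipeline itself was built in R against a SQLite database; the Python sketch below is only an analogous illustration of the parse-and-load step, with a hypothetical XML layout, file name, and table schema rather than the actual PubMed structure or final design.

```python
# Analogous Python sketch of the parse-and-load step; the project itself used R.
# The XML layout, file name, and schema here are hypothetical placeholders.
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect("publications.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS article (
        article_id INTEGER PRIMARY KEY,
        title      TEXT NOT NULL,
        pub_year   INTEGER
    )
""")

# Hypothetical XML with one <Article> element per publication.
tree = ET.parse("pubmed_sample.xml")
rows = [
    (article.findtext("Title"), int(article.findtext("Year")))
    for article in tree.getroot().iter("Article")
]
conn.executemany("INSERT INTO article (title, pub_year) VALUES (?, ?)", rows)
conn.commit()

# Example query: publications per year, ready to plot.
for year, n in conn.execute(
    "SELECT pub_year, COUNT(*) FROM article GROUP BY pub_year ORDER BY pub_year"
):
    print(year, n)
conn.close()
```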