My Data Journey


Learning Data Science

Building Recommendation Systems for Boardgames



Hit or Flop?

For today’s post I wanted to go over another of my recent projects: classifying songs as hits or flops based on descriptive features. The data for this project came from Kaggle. Using features such as a song’s duration, energy, and danceability, the goal was to build a classifier that predicts whether a song will be popular. For a song to be labeled a flop, it had to meet all of the following conditions:

- the song did not appear in the ‘hit’ list,
- the track’s artist did not appear in the ‘hit’ list,
- the track belonged to a genre that could be considered non-mainstream and/or avant-garde,
- the track’s genre did not have a song in the ‘hit’ list, and
- the track had ‘US’ as one of its markets.

My dataset consisted of 15 predictors and a target variable. Because of the extremely useful methods it provides, I worked almost exclusively with scikit-learn’s classifiers. As always when starting a project, the first step is to load in your data and clean it up.
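That load-and-clean step can be sketched roughly as follows. This is a minimal illustration, not the project’s actual code: the column names stand in for the Kaggle dataset’s schema, and a small inline DataFrame stands in for `pd.read_csv` on the downloaded file.

```python
import pandas as pd

# Stand-in for pd.read_csv("<kaggle file>"); column names are illustrative,
# not the dataset's exact schema.
df = pd.DataFrame({
    "duration_ms": [210000, 185000, 185000, None],
    "energy": [0.82, 0.55, 0.55, 0.40],
    "danceability": [0.71, 0.63, 0.63, 0.30],
    "target": [1, 0, 0, 1],  # 1 = hit, 0 = flop
})

# Basic cleanup: drop exact duplicate rows, then rows with missing values.
df = df.drop_duplicates().dropna()

# Split the predictors from the target before handing off to scikit-learn.
X = df.drop(columns=["target"])
y = df["target"]
print(len(df))  # rows remaining after cleaning
```

With the data in this shape, `X` and `y` can be passed directly to any scikit-learn classifier’s `fit` method.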


Real Estate Investing in Philadelphia

For this blog post, I want to go over the process of one of my recent projects. The goal of the project was to identify the five best zip codes for short term (1–3 years) real estate investment in Philadelphia. To accomplish this goal I used a dataset obtained through Zillow.com. Each row of the dataset represented a zip code, and there was a row for every zip code in the country. The columns consisted of some identifying features (city, state, etc.) and a column for each month from April 1996 until April 2018, where each month’s column contained the median home value for that zip code during that month.
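A wide table like that, with one column per month, is awkward for time-series work, so a natural first step is to melt it into long format and then screen zip codes by return over the holding period. The sketch below uses a tiny made-up table in the Zillow layout (two Philadelphia zip codes, two months); the values and the one-year window are illustrative, not the project’s actual numbers.

```python
import pandas as pd

# Toy stand-in for the Zillow table: one row per zip code, one column per month.
zillow = pd.DataFrame({
    "RegionName": [19103, 19147],
    "City": ["Philadelphia", "Philadelphia"],
    "2017-04": [350000, 280000],
    "2018-04": [385000, 322000],
})

# Melt the wide month columns into long (zip, month, value) format.
long = zillow.melt(id_vars=["RegionName", "City"],
                   var_name="month", value_name="median_value")
long["month"] = pd.to_datetime(long["month"])

# Return over the window per zip code: a simple screen for short-term investment.
wide = long.pivot(index="RegionName", columns="month", values="median_value")
one_year_return = wide.iloc[:, -1] / wide.iloc[:, 0] - 1
print(one_year_return.round(3))
```

Ranking zip codes by this kind of return (over a 1–3 year window, ideally with risk taken into account) is one straightforward way to shortlist candidates.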


Choosing the Right Hypothesis Test

What is Hypothesis Testing?


Which Features Should I Use?

While I was working on a project recently, I realized how important it is to select the right features for a linear model. Interestingly, you often get a much better model by dropping some information from your data. If you use too many features, the model becomes overly complex and hard to interpret. It can also become overfitted, performing well on the training data but poorly on a test dataset. Furthermore, training time can increase dramatically as more features are added. However, you need to make sure that you are dropping the right variables: if you leave out your best features, your model will be inaccurate and will fail to capture the true relationships. Selecting the right subset of features is vitally important to the strength of a model.
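One common way to automate this selection is recursive feature elimination, which scikit-learn provides as `RFE`. The sketch below is illustrative rather than any particular project’s code: it builds synthetic regression data in which only a few features actually matter, then lets RFE whittle the set down.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: 10 features, only 3 of which actually drive the target.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)

# Recursive feature elimination: fit the model, drop the weakest feature,
# and repeat until only the requested number of features remains.
selector = RFE(LinearRegression(), n_features_to_select=3).fit(X, y)
print(selector.support_)   # boolean mask of the kept features
print(selector.ranking_)   # 1 = kept; higher numbers were eliminated earlier
```

The retained mask (`selector.support_`) can then be used to train the final model on just the selected columns, which is exactly the kind of deliberate dropping of information described above.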