Analyzing Used Car Data to Improve Sales and Predict Pricing

Team Members

General Summary

Getting the price right till today remains one of the greatest challenges that plague businesses across industries and functions.

Matters only get more complex when we enter the used goods market. An layers of the consumer's perceived value of the object gets added to the price equation and interacts with every predictor variable.

Through this analysis, we aim to understand the levers of the used car market, both from a selling as well as buying perspective.

Some of the factors we look the analyze are the condition, color, ododmeter reading, state of sale, manufacturer, cylinder configuration and type of drive. We visualize their variation with the car price and build insight into the workings of this industry.

Project Objective

The analysis and prediction of used car prices and its impact factors such as color, odometer reading, state of sale, manufacturer, cylinder configuration and type of drive. We visualize their variation with the car price and build insight into the workings of this industry.

Through this project, we hope to help consumers buy high-quality cars at a reasonable price and help Craigslist.com platform to properly allocate enterprise inventory resources.

Real world Impact

The global used car market was valued at USD 1,332.2 B in 2019 and is expected to grow at a compound annual growth rate (CAGR) of 5.5% (2020 to 2027). The used car shipment was recorded at 102.8 million units in 2019.

This market has shown growth potential over the past decade. Factors - consumers' growing interest in used cars vs. new cars, price competitiveness, growing dealership networks - that are contributing to this growth are also adding to the complexity in the industry.

In this climate, both consumers and sellers can gain significantly by making data driven decisions such as the right price to pay for a certain kind of vehicle for the consumer and the right inventory to stock to maximize profits for sellers.

Our analysis targets this problem and looks to analysis used car prices both from a consumer as well as a seller's lens.

Contents

Data preparation

Exploratory data analysis

Business analysis - based on correlation of variables and word cloud results.

About Dataset

Data Preparation

Select important variables

Missing values processing

Through data.head(), we can see a number of missing values in the dataset. We want to know the proportion of missing value of each variable, and we use heatmap to visualize the mising value of each variable.

According to the above figure, 'condition', 'cylinders',' VIN ', 'Drive', 'size', 'type' and 'paint_color' have a large number of missing values. As the unique ID of the car, 'VIN' will not be used in the subsequent analysis. As there are too many missing values in 'size', we also choose to delete this column.

After deleting two columns, there are still a large number of missing values for the remaining columns. However, considering that our data volume is large and the column is meaningful, we choose to retain it. Otherwise we will lose some key factors affecting the price. After deleting all missing values, we still have 115,435 rows data, which is sufficient for our subsequent analysis.

Removing outliers

Identify outliers

By plotting a boxplot, we can see that Buick, Ford, Gmc, Nissan and Toyota is having outliers in their data.

We choose an understandable priced for used car as a range from 10.000 dollars to 100.000 dollars.

Now we plot the dataset again to see the boxplot has become much more normal, with comprehensible price distribution.

Create custom variables

Adding manufacturer country column

Rename the categorical variables as custom number

The clean table

Finally, we have a clean table and we will use this clean table for further data analytics.

vehicle_ml.csv will be used in correlation processing and machine learning.

vehicle_cleaned.csv will be used in data visualization.

Exploratory data analysis

In the following two parts, we will analyze the data in different dimensions - Statistics & Business Problems.

We will first understand the statistical properties of the data so that we can decide which properties to analyze further.

Initial Analysis - Statistics

Statistical description

In the statistical analysis section, we expected numerical data, so we chose to use the vehicle_ml.csv file instead.

From the above table, we can get some interesting results:

Correlation Between Variables

Correlation analysis is very meaningful to us and can help us select appropriate variables for exploration in business problem.

In above figure, we can see that 'year', 'fuel', 'odometer', 'transmission', 'cylinders',' drive ', 'type' and 'Made' are variables that have influence on price fluctuation.

Wordcloud based on customer comment

Wordcloud is a very interesting graph where the data is drawn based on user comments. We will combine the correlation results with wordCloud results to determine the following analysis.

In above figure, the key words most concerned by users are: credit, seat, wheel, power, front, rear, milesmiles,drive, price, auto, fanancing, color, etc.

Based on these keywords, we conclude that the most concerned users are price, drive, cylinder, color and odometer.

Business Problem

Through this analysis, we look to provide two points of view. The first one is that of the companies that deal in used cars and the second is that of the consumer who looks to purchase a used car

We have selected variables of value for each category and look to understand their individual correlation with the price and popularity of a used car

What are we solving for?

Companies point of view

Used-car companies CEOs will be most interested in seeing what types of cars will sell with the highest price, or being the most popular sale. In this portion, we speculate that car price will vary according to manufacturers, drive and color.

Price Exploration

In this section, we analyze the variation in price and sales popularity based on some predictors with the overall goal of accessing which predictors are most impactful.

The variables we consider in this section are the color of the car, its manufacturer, time of sale and cylinders in the cars. We have selected these two since companies can then identify and invest in cars that are most likely to sell and bring them a great price

Average price by manufacturer and its fuel type

Most of the top 10 most expensive manufacturers by average tend to have either gas or electric engines.

Average price by manufacturer

This plot shows the top 10 manufacturers in terms of average price. From this, we can see which manufacturers have used cars that sell at the highest value.

Average price over the state

Next, we want to know whether the prices of different regions are different. This graph suprised to us which we didn't know that before. We can find that the average price of used cars in the west coast is generally lower than the average price of used cars in the east coast. This finding is of great significance to used-car companies, because they can know the average price of cars in the east coast and west coast, so as to formulate appropriate pricing strategies.

Next, we suspect that the factor affecting prices is inventory. By mapping the number of vehicles in the database, it can be seen that vehicle inventory is slightly higher in the west coast than in the east coast, so inventory is a factor influencing prices.

In addition, the income and prices of the west coast and the east coast are searched on Google. The prices and incomes of the east coast are generally higher than those of the west coast.

Fuel usage shift over time

While Gasoline remains the most favoured fuel, the usage of diesel has significantly reduced in usage and electric has increased

Average price over the years

Prices were initially higher in the beginning of the dataset. However, they decreased slightly and remained consistent until the last decade, where they began to increase.

We looked at the GDP of the 1990s and the GDP of today, and we found that this price increase is closely related to changes in GDP.

The variation of price of used cars

A majority of used cars have prices ranging from 10K to 40K and a price between 10k and 20k is most prevalent

The limit of prices has been set to remove outliers as 75 percentile of the data lies within the interval picked

The variation in price with miles run by the car

At lower odometer readings, the drive type and price do not seem to be correlated. However, Rwd is the only type of drive prevalent at high odometer readings.

Drive Exploration

Our analysis shows that cars with FWD have the lowest median price, whereas cars with 4-wheel drive have the highest. Some interesting results can be seen from box plot. For example, outliner values of FWD are generally small and scattered, while outliner values of RWD and 4WD are generally concentrated and high.

This means that used car companies have fewer luxury vehicles in their inventory on FWD and more luxury vehicles on RWD and 4WD. If customers want to buy luxury vehicles, they can recommend 4WD and RWD first. If customers are more price conscious, they can recommend FWD vehicles first.

To add to the analysis from the plot above, this graph shows that there is the most variation in price for cars with FWD. The second chart confirms our view by showing that FWD has a large inventory in the low price range, while RWD and 4WD have relatively high inventory prices.

Color Exploration with price

Our analysis showed that their doesn't seem to be a correlation between car price and its color - meaning that the color of the car does not suggest a higher or lower price than others.

However, we did find that the most popular color for cars sold are white, black, silver, and red.

Color correlation with manufacturer

While white is the most popular colour, Ford and Chevrolet sell the most whites in the market.

The graph below is interactive. Please feel free to filter colors or manufacturers

Cylinder Exploration

Our data shows us that the higher the number of cylinders the car the greater price the car sees. There is a positive correlation between price and number of cylinders.

In the first graph, the price was computed using the average by cylinder - with the 12-cylinder car having the highest average price. This is most likely due to the fact that cars with more power would be sold at a higher price.

The second graph shows the number of cylinders vs. price, with each dot representing an observation. The difference in results from the two graphs is most lilely caused by outliers in the data.

Customer point of view

Customers typically want to buy a car with less distance travelled on it and the lowest price possible. We calculate the percentage between price and distance travelled.

Radar chart with price over odometer

In the radar chart, we can see that Italian Made cars might be the best choice if you are looking for a car at the lowest price, with the least number of miles travelled.

Manufacturers with the lowest prices

Diving deeper into Italian made cars, we see that if we want to buy the Italian made car with the lowest average price, we should buy a fiat.

Price in terms of number of cylinders

In this radar chart, we see that the cars with 5 cylinders will be the best choice as it, on average, has the lowest price with the least miles travelled.

Machine Learning

Set up

Business summary

For this part we are focusing on the the sellerr's point of view, we would want to make suggestions to the seller to get the right inventory to stock to maximize profits for seller. To be exact: if we import a car with this color, this brand, this cylinder, having this drive, type, and car color, how much price should the company sell.

Price prediction models will include: Linear regression, Decision tree, Random forest and Random forest with cross validation.

Technical summary

The model would make more accurate prediction if we try to include all the variables or do some feature engineering first, but we are first building a prototype following the business sense of the seller. Chosen features will be "manufacturer", "cylinders", "drive", "type", "paint_color".

We preprocessed categorical data using one hot encoding and labelling. Even though we mainly used the labelling dataset, we still reserve one hot encoding dataset for future machine learning models.

All models are built using sklearn library

In regression prediction we chose RMSE and MAE for our comparison.

Preprocessing data for regression prediction

One hot encoding
Label encoding

Linear Regression

Explanation:

For the linear regression model with RMSE = 93, which means that on average the price prediction will be off by 93$.

R squared of 12% means that the varibility of price can be explained by the model. R-squared is relatively low, which means the association is not terribly strong. Predictions will better than just guessing the mean, but not entirely reliable

Decision Tree

Random Forest

Since RMSE of random forest is bigger than RMSE of decision tree, we will try tuning random forest model to see if it helps.

The RMSE for random forest improved for a little bit after grid search. If we have time we will come back to the model to tune.

Comparing between models

Looking at the RMSE and MAE score, we can see that Random Forest performed the best at this regression prediction problem.

We can infer from the RMSE that on average, for every guessed price it will be off around $10.000.

Making real prediction

Now let's suppose that the company is having a used "Ford" truck, with 8 cylinders, 4wd, and black color, we will now try to apply Random Forest model to guess the selling price.

After running through the model, we can guess the selling price of this car as $33808.

Conclusion

image-1.png

We also focused on pushing our deliverables on an app using Streamlit and it will go live on Heroku. This app will have a business summary page, a description page (for our EDA) and a prediction page (for our machine learning models).