Stop putting prices on things by guessing. You’re leaving money on the table. A great use-case for Machine Learning is building pricing engines. No coding or data science expertise required.
Whether you’re buying or selling a product, it’s helpful to use Machine Learning to get a sense of whether a given price is too high or too low. This is true for any product, particularly commodity-like ones: computers, cars, steel, corn, fertilizer, airline tickets, concert tickets, you name it.
All you need to get started is some data to train an algorithm and a no-code Machine Learning platform like Monument. (Monument offers free trials, downloadable directly from the front page.)
In this tutorial, I’m going to show you how I helped a friend decide which kind of MacBook Pro to buy. He wanted to pick up an extra computer from Craigslist, but with so many variables — screen size, model year, RAM, hard drive size, Apple Care coverage, etc. — it was hard to get a sense of which products were over- or under-priced.
We solved this with Monument.
Get The Data
First, I went on Craigslist and summarized 150 MacBook listings. I’ve zipped these pages up for reference.
Next, I needed to get the data into a structured spreadsheet format so that a Machine Learning algorithm could understand the information. If there had been more listings to handle, I might have gone to the trouble of hiring an unlucky intern or writing a Python scraper to grab the relevant information. However, there weren’t too many to deal with, so I manually went through each listing and put the relevant information into a spreadsheet.
Price is largely driven by a handful of significant features. There are many I could have selected, but in the interest of time, I chose a few obvious ones to start.
- Id — a unique number assigned to each listing so I can reference the listing later
- Used — a TRUE/FALSE value indicating whether or not the computer was used
- AppleCare — a TRUE/FALSE value indicating whether or not the computer was still under an Apple Care policy
- Year — the model year
- YearsFromCurrentYear — the current year (2020 at the time of writing) minus the model year, in order to get the age in years of the computer
- Price — the price advertised on Craigslist
- Size — the screen size in inches
- Type — processor type (i7 or i9)
- Speed (GHz) — processor speed
- RAM (GB) — RAM capacity
- Drive (TB) — hard drive size
- GPU (GB) — GPU memory in gigabytes, if a discrete GPU is present
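Monument is a no-code tool, so none of this requires programming. But for readers who want to see the data-prep step concretely, here is a minimal Python sketch of how one listing row might be structured, with the YearsFromCurrentYear column derived from the model year. The field values are invented for illustration.

```python
import csv
import io

# One hypothetical listing, with fields matching the spreadsheet columns above.
listing = {
    "Id": 1,
    "Used": True,
    "AppleCare": False,
    "Year": 2017,
    "Price": 1500,
    "Size": 15,
    "Type": "i7",
    "Speed (GHz)": 2.9,
    "RAM (GB)": 16,
    "Drive (TB)": 0.5,
    "GPU (GB)": 4,
}

# Derive the age-in-years column from the model year.
CURRENT_YEAR = 2020  # the current year at the time of writing
listing["YearsFromCurrentYear"] = CURRENT_YEAR - listing["Year"]

# Write the row out in the structured CSV format an algorithm can consume.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=listing.keys())
writer.writeheader()
writer.writerow(listing)
print(buf.getvalue())
```

In practice this loop would run once per listing, appending a row per MacBook before saving the file.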
In this project, as in any Machine Learning project, there are endless elements that could be considered. It’s up to us, the users who understand the business context, to use judgment about what might be significant or insignificant.
I ended up with a CSV, which you can download here.
Build A Pricing Model Without Code
I then opened Monument — again, a free trial is available — and imported the CSV.
After heading to the model tab, I dragged a few data pills that I thought were especially significant: YearsFromCurrentYear, Size, and Price. I then changed the chart style to “grid” using the chart style widget in the top right of the charting area.
After plotting these data pills along the ROWS (Y) axis, Monument detected that Regression algorithms were appropriate and made two such algorithms available: LightGBM and LinReg (short for Linear Regression).
I applied both to Price and got this result:
The validation error rates both came in at around 13. Let’s see if we can improve that quickly.
By default, algorithms in Monument will use only the plotted data pills as independent variables for an algorithm. In this case, that means that when applying LightGBM and LinReg to Price, only YearsFromCurrentYear and Size were used as independent variables.
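For intuition about what LinReg is doing under the hood, here is a hedged sketch of ordinary least squares with a single independent variable, YearsFromCurrentYear. The listings below are invented, and Monument’s actual implementation certainly handles more than one feature; this is just the core idea.

```python
# Toy version of what a linear regression on Price does: fit a line
# relating a MacBook's age to its price, then predict from that line.
# These numbers are made up for illustration, not real Craigslist data.
ages = [0, 1, 1, 2, 3, 4, 5]
prices = [3200, 2800, 2900, 2400, 2000, 1700, 1400]

# Closed-form ordinary least squares for one independent variable.
n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(prices) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, prices)) \
    / sum((x - mean_x) ** 2 for x in ages)
intercept = mean_y - slope * mean_x

def predict(age):
    """Predicted price for a MacBook of the given age in years."""
    return intercept + slope * age

print(round(slope, 2), round(predict(2), 2))
```

The fitted slope is negative, which matches the obvious intuition that older machines sell for less; adding more independent variables refines the fit the same way Monument’s INDEPENDENTS menu does.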
By opening the drop-down menu on the algorithm pills, I can go to the INDEPENDENTS menu.
Inside the INDEPENDENTS menu, I see that I can select more independent variables. Including relevant independent variables will likely improve my prediction.
After selecting Used, AppleCare, Speed, Drive, and RAM, the LightGBM validation error rate drops from 13.65 to 12.95. A decent improvement for 5 seconds of work!
You’ll also notice that certain rows in the LightGBM prediction changed from actual values to null values. This is because some of the input data values in the independent variables we added to LightGBM were null. When this happens, an algorithm will give a null result. In a later tutorial, I will explore how to generate “synthetic data” to get around this.
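The null behavior can be illustrated with a small sketch: if any feature a model depends on is missing for a row, no prediction can be made for that row. The coefficients below are made up and merely stand in for a trained model.

```python
# Rows of input data; the second row is missing its RAM value,
# mirroring the nulls that appeared after adding independent variables.
rows = [
    {"YearsFromCurrentYear": 1, "Size": 15, "RAM (GB)": 16},
    {"YearsFromCurrentYear": 3, "Size": 13, "RAM (GB)": None},
]
features = ["YearsFromCurrentYear", "Size", "RAM (GB)"]

def predict(row):
    # If any required feature is null, return a null prediction,
    # just as the algorithms in Monument do.
    if any(row[f] is None for f in features):
        return None
    # Stand-in for a trained model; these coefficients are invented.
    return 3300 - 400 * row["YearsFromCurrentYear"] + 20 * row["RAM (GB)"]

print([predict(r) for r in rows])  # second prediction is None
```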
With all of these independent variables used, which ones are driving most of the prediction results? We can discover that by clicking the Validation Importance Table icon in the INFO box in the bottom left corner of the screen. Not all algorithms produce Validation Importance Tables, but when we do have access to them, they are useful.
The Validation Importance Table rank-orders the independent variables by importance. We can see that the size of the Drive and the amount of RAM are each responsible for about 25% of the prediction. This means that these characteristics of a MacBook drive the price the most, even more than whether the computer is used or not, which came in at 12%.
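Conceptually, an importance table reports each feature’s share of the total importance. Here is a toy sketch of that normalization, with raw importance numbers invented to mirror the ballpark figures above (Drive and RAM around 25%, Used around 12%); the way LightGBM actually computes raw importances is more involved.

```python
# Invented raw importance scores for each independent variable.
raw_importance = {
    "Drive (TB)": 250,
    "RAM (GB)": 250,
    "YearsFromCurrentYear": 180,
    "Size": 120,
    "Used": 120,
    "Speed (GHz)": 80,
}

# Normalize to shares of the total, then print in rank order.
total = sum(raw_importance.values())
shares = {f: v / total for f, v in raw_importance.items()}
for feature, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{feature}: {share:.0%}")
```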
LinReg has a validation error rate of 13.70. Let’s apply the same independent variables.
When we add the same independent variables as we used with LightGBM, the validation error rate drops from 13.70 to 13.11. Again, a pretty good result for 5 seconds of work.
You’ll also notice that some values became null, for the same reason as with LightGBM.
As an aside, you’ll notice that if you apply Dimensions data pills like AppleCare to the chart, the algorithms will change to Classification. This makes LightGBM, LogReg (Logistic Regression), and SVM (Support Vector Machine) available. This will be explored in a later tutorial.
Going back to the regression algorithms we’ve already applied, if I wanted to spend more time improving the prediction, I could open their PARAMETERS menu and adjust the parameter settings.
However, for now, I’ll skip this step because my goal here is to get a quick result that is “close enough.”
Moving over to the OUTPUT tab, I can see my original data that I imported in the INPUT tab with the addition of the LightGBM and LinReg results.
I click the SAVE NOW button to produce a CSV of these results.
I posted the output CSV as a Google Sheet that you can view here.
In order to make a decision with the data, I added three columns in the Google Sheet:
- The listed price minus the LightGBM’s forecasted price
- The listed price minus the Linear Regression’s forecasted price
- The sum of the two differences above, combining the LightGBM and Linear Regression forecasts into a single score
The more negative the value in the first or second added column, the more the algorithm’s forecast exceeded the listed price, suggesting that the laptop was under-priced. For example, if the listed price was $3,000 but the LightGBM’s forecasted price was $3,500, the first added column would have a value of -$500.
In the third column, I combined the two differences as a simple way to take both forecasts into account. I then sorted the spreadsheet by this last column from lowest to highest, which floats the most under-priced computers to the top of the spreadsheet.
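The spreadsheet steps above can be sketched in a few lines of Python. The listings and forecasted prices here are invented for illustration:

```python
# Each listing carries its Craigslist price plus the two model forecasts.
listings = [
    {"Id": 1, "Price": 3000, "LightGBM": 3500, "LinReg": 3400},
    {"Id": 2, "Price": 2000, "LightGBM": 1900, "LinReg": 1950},
    {"Id": 3, "Price": 1500, "LightGBM": 1800, "LinReg": 1700},
]

for l in listings:
    # Negative difference => the model thinks the listing is under-priced.
    l["DiffLGBM"] = l["Price"] - l["LightGBM"]
    l["DiffLinReg"] = l["Price"] - l["LinReg"]
    # Combine both forecasts into a single score.
    l["DiffCombined"] = l["DiffLGBM"] + l["DiffLinReg"]

# Sorting lowest to highest floats the most under-priced machines to the top.
ranked = sorted(listings, key=lambda l: l["DiffCombined"])
print([l["Id"] for l in ranked])  # [1, 3, 2]
```

Listing 1 tops the ranking because both models forecast well above its asking price, exactly the kind of listing worth contacting first (and scrutinizing for fraud).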
Now, to bring in some domain knowledge: given that this information is from Craigslist, some of these listings might be fraudulent. If I were purchasing a computer, I would start my outreach from the most under-priced listing, with the understanding that I should pay extra attention to signs of fraud.
If I wanted to improve my model, I could add additional input data points. There are many other variables I could have chosen, including the city in which the computer is listed, whether or not there are typos in the listing text, the number of months left in the Apple Care policy, whether there are pictures included, whether a phone number is listed, count of battery cycles, and so on.
I could also go back to the PARAMETERS menu and make adjustments there to improve predictive accuracy.
I could also use this same approach to build a pricing model for selling a computer. The approach would be the same: I would load existing data and train algorithms on it. Then I would apply these same algorithms on the data for the computer I was selling, which would give me predictions about what the market might expect my computer to be listed at. I could reasonably expect that if I listed it in that range, it would sell.
I hope you found this tutorial a useful introduction to using Machine Learning algorithms to build a pricing engine! You don’t need to know how to code in order to get started with Machine Learning.
Interested in learning more about Monument? Book a free introductory Zoom call here.