Improving Business Performance with Machine Learning

Because we are using an unsupervised learning algorithm, there is no widely available accuracy measure. However, we can use domain knowledge to validate our groups.

Visually inspecting the groups, we can see that some benchmarking groups mix Economy and Luxury hotels, which doesn't make business sense: demand for those hotel classes is fundamentally different.

We can scroll through the data and note some of those differences, but can we come up with our own accuracy measure?

We want to create a function to measure the consistency of the recommended benchmark sets across each feature. One way of doing this is by calculating the variance of each feature within each set. For each cluster, we can average those feature variances, and we can then average the cluster-level scores to get a total model score.

From our domain knowledge, we know that to set up a comparable benchmark set, we need to prioritize hotels in the same Brand, possibly the same Market, and the same Country; and if we use different markets or countries, then the Market Tier should at least be the same.

With that in mind, we want our measure to apply a higher penalty to variance in those features. To do so, we will use a weighted average to calculate each benchmark set's variance. We will also print the variance of the key features and secondary features separately.

To sum up, to create our accuracy measure, we need to:

1. Calculate the variance of each feature within each benchmark set.
2. Take a weighted average of those variances, giving a higher weight to the key features (Brand, Market, Country, and Market Tier).
3. Average the resulting scores across all benchmark sets to get a total model score.

A sketch of such a function is shown below.
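Here is a minimal sketch of that scoring function. It assumes the hotels live in a DataFrame `df` whose features are already numerically encoded (as they must be for the nearest-neighbor model anyway) and that a `benchmark_set` column identifies each hotel's set; the feature lists and weights are illustrative placeholders, not the article's exact values.

```python
import numpy as np
import pandas as pd

# Illustrative feature lists and weights -- adjust to your data.
PRIMARY_FEATURES = ["Brand", "Market", "Country", "Market Tier"]
SECONDARY_FEATURES = ["Room Count", "ADR"]  # hypothetical secondary features
PRIMARY_WEIGHT, SECONDARY_WEIGHT = 3.0, 1.0

def benchmark_set_score(df: pd.DataFrame) -> dict:
    """Weighted average of per-set feature variances (lower is better)."""
    primary_vars, secondary_vars = [], []
    for _, s in df.groupby("benchmark_set"):
        # Average variance across the features in each group
        primary_vars.append(s[PRIMARY_FEATURES].var().mean())
        secondary_vars.append(s[SECONDARY_FEATURES].var().mean())
    primary = float(np.mean(primary_vars))
    secondary = float(np.mean(secondary_vars))
    # The weighted average penalizes variance in key features more heavily
    total = (PRIMARY_WEIGHT * primary + SECONDARY_WEIGHT * secondary) / (
        PRIMARY_WEIGHT + SECONDARY_WEIGHT
    )
    print(f"Primary feature variance:   {primary:.4f}")
    print(f"Secondary feature variance: {secondary:.4f}")
    print(f"Total model score:          {total:.4f}")
    return {"primary": primary, "secondary": secondary, "total": total}
```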

To keep our code clean and track our experiments, let's also define a function to store the results.
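For example, a tiny helper along these lines (the dictionary-based log is an assumption, not the article's exact code):

```python
# Simple experiment log keyed by a descriptive experiment name.
experiment_results = {}

def store_results(name: str, scores: dict) -> None:
    """Record an experiment's scores so runs can be compared later."""
    experiment_results[name] = scores
    print(f"Stored '{name}' with total score {scores['total']:.4f}")
```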

Now that we have a baseline, let's see if we can improve our model.

Up until now, we did not need to know what was going on under the hood when we ran this code:
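The original snippet is not reproduced here, but a baseline along these lines is what we have been running, assuming `X` is the scaled, encoded hotel feature matrix:

```python
from sklearn.neighbors import NearestNeighbors

# All parameters left at their defaults: n_neighbors=5, metric="minkowski", p=2
nns = NearestNeighbors()
nns.fit(X)

# For each hotel, the distances to and indices of its nearest neighbors
distances, indices = nns.kneighbors(X)
```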

To improve our model, we will need to understand the model parameters and how we can interact with them to get better benchmark sets.

Let's start by looking at the scikit-learn documentation and source code:

There are quite a few things going on here.

The NearestNeighbors class inherits from NeighborsBase, which is the base class for nearest-neighbor estimators. This class handles the common functionality required for nearest-neighbor searches, such as validating parameters, fitting the input data, and selecting the search algorithm (brute force, KD-tree, or ball tree).

The NearestNeighbors class also inherits from the KNeighborsMixin and RadiusNeighborsMixin classes. These mixins add specific neighbor-search functionality to NearestNeighbors: KNeighborsMixin provides the search for the k nearest points, while RadiusNeighborsMixin provides the search for all points within a given radius.

Based on our scenario, KNeighborsMixin provides the functionality we need.

Before we can improve our model, we need to understand one key parameter: the distance metric.

The documentation mentions that NearestNeighbors uses the Minkowski distance by default and refers us to the SciPy API.

In scipy.spatial.distance, we can see two mathematical representations of "Minkowski" distance:

$$\lVert u - v \rVert_p = \left( \sum_i |u_i - v_i|^p \right)^{1/p}$$

This formula calculates the p-th root of the sum of powered differences across all elements.

The second mathematical representation of Minkowski distance is:

$$\lVert u - v \rVert_p = \left( \sum_i w_i \, |u_i - v_i|^p \right)^{1/p}$$

This is very similar to the first one, but it introduces weights $w_i$ on the differences, emphasizing or de-emphasizing specific dimensions. This is useful when certain features are more relevant than others. By default, the weight parameter w is None, which gives all features the same weight of 1.0.
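To make the two formulas concrete, here is a quick check with SciPy (the vectors and weights are illustrative):

```python
from scipy.spatial.distance import minkowski

u, v = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]

# Unweighted, p=2 (Euclidean): sqrt(1 + 4 + 9) ~= 3.742
print(minkowski(u, v, p=2))

# Weighted: the first dimension counts double, the last counts half
# sqrt(2*1 + 1*4 + 0.5*9) ~= 3.240
print(minkowski(u, v, p=2, w=[2.0, 1.0, 0.5]))
```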

This is a great option for improving our model as it allows us to pass domain knowledge to our model and emphasize similarities that are most relevant to users.

If we look at the formulas, we see the parameter p. This parameter affects the "path" the algorithm takes to calculate the distance. By default, p=2, which represents the Euclidean distance.

You can think of the Euclidean distance as the length of a straight line drawn between two points. This is usually the shortest distance; however, it is not always the most desirable way of calculating distance, especially in higher-dimensional spaces. For more information on why this is the case, see this great paper: https://bib.dbvis.de/uploadedFiles/155.pdf

Another common value for p is 1, which represents the Manhattan distance. You can think of it as the distance between two points measured along a grid-like path.

On the other hand, if we increase p towards infinity, we end up with the Chebyshev distance, defined as the maximum absolute difference between any corresponding elements of the vectors. It essentially measures the worst-case difference, making it useful in scenarios where you want to ensure that no single feature varies too much.
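A small example makes the effect of p easy to see; for the points (0, 0) and (3, 4):

```python
from scipy.spatial.distance import chebyshev, cityblock, euclidean

u, v = [0.0, 0.0], [3.0, 4.0]

print(euclidean(u, v))  # p=2: straight line, 5.0
print(cityblock(u, v))  # p=1 (Manhattan): grid path, 3 + 4 = 7.0
print(chebyshev(u, v))  # p -> infinity: largest single difference, 4.0
```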

By reading and getting familiar with the documentation, we have uncovered a few possible options to improve our model.

By default, n_neighbors is 5. However, for our benchmark sets, we want to compare each hotel to the 3 most similar hotels, so we need to set n_neighbors = 4 (the subject hotel plus 3 peers).

The documentation also shows that we can pass weights to the distance calculation to emphasize the relationship across certain features. From our domain knowledge, we have already identified the features we want to emphasize: Brand, Market, Country, and Market Tier.
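A sketch of the updated model, assuming the first four columns of `X` correspond to Brand, Market, Country, and Market Tier (the weight values are illustrative, and passing w through metric_params requires a reasonably recent scikit-learn):

```python
from sklearn.neighbors import NearestNeighbors

# Heavier weights on the primary features, 1.0 on everything else
feature_weights = [3.0, 2.0, 2.0, 2.0] + [1.0] * (X.shape[1] - 4)

nns = NearestNeighbors(
    n_neighbors=4,                         # subject hotel + 3 peers
    metric="minkowski",
    p=2,                                   # still Euclidean for now
    metric_params={"w": feature_weights},  # weights from domain knowledge
)
nns.fit(X)
distances, indices = nns.kneighbors(X)
```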

Passing domain knowledge to the model via weights improved the score significantly. Next, let's test the impact of the distance measure.

So far, we have been using the Euclidean distance. Let's see what happens if we use the Manhattan distance instead.
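Switching metrics is a one-parameter change to the sketch above:

```python
# Same model as before, but p=1 selects the Manhattan distance
nns = NearestNeighbors(
    n_neighbors=4,
    metric="minkowski",
    p=1,
    metric_params={"w": feature_weights},
)
nns.fit(X)
```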

Decreasing p to 1 resulted in some good improvements. Let's see what happens as p approaches infinity.

To use the Chebyshev distance, we will change the metric parameter to chebyshev. The default sklearn chebyshev metric doesn't accept a weight parameter. To get around this, we will define a custom weighted_chebyshev metric.
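A custom metric can be any callable that takes two vectors and returns a distance; something along these lines should work (using a brute-force search to keep the custom callable simple):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def weighted_chebyshev(u, v, w):
    """Largest weighted absolute difference between two vectors."""
    return np.max(w * np.abs(u - v))

nns = NearestNeighbors(
    n_neighbors=4,
    metric=weighted_chebyshev,                       # custom callable metric
    metric_params={"w": np.array(feature_weights)},  # forwarded to the callable
    algorithm="brute",
)
nns.fit(X)
distances, indices = nns.kneighbors(X)
```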

We managed to decrease the primary feature variance scores through experimentation.

Let's visualize the results.
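A simple grouped bar chart of the stored scores works well here; a sketch, assuming the experiment_results log defined earlier:

```python
import matplotlib.pyplot as plt

names = list(experiment_results)
primary = [experiment_results[n]["primary"] for n in names]
secondary = [experiment_results[n]["secondary"] for n in names]

x = range(len(names))
plt.bar([i - 0.2 for i in x], primary, width=0.4, label="Primary features")
plt.bar([i + 0.2 for i in x], secondary, width=0.4, label="Secondary features")
plt.xticks(list(x), names, rotation=45, ha="right")
plt.ylabel("Average variance (lower is better)")
plt.legend()
plt.tight_layout()
plt.show()
```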

Using the Manhattan distance with weights seems to give the most accurate benchmark sets for our needs.

The last step before implementing the benchmark sets is to examine the sets with the highest primary feature variance scores and decide what to do with them.
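One way to surface those sets, assuming a `set_scores` DataFrame with one row per benchmark set and a `primary_variance` column from the scoring step (both names are hypothetical):

```python
# Flag the sets whose primary-feature variance is unusually high
threshold = set_scores["primary_variance"].quantile(0.95)  # illustrative cutoff
to_review = set_scores[set_scores["primary_variance"] > threshold]
print(f"{len(to_review)} benchmark sets flagged for manual review")
```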

These 18 cases will need to be reviewed to ensure the benchmark sets are relevant.

As you can see, with a few lines of code and some understanding of nearest-neighbor search, we managed to build internal benchmark sets. We can now distribute the sets and start measuring hotels' KPIs against their benchmarks.

You don't always have to focus on the most cutting-edge machine learning methods to deliver value. Very often, simple machine learning delivers great results.

What are some low-hanging fruits in your business that you could easily tackle with machine learning?

World Bank. World Development Indicators. Retrieved June 11, 2024, from https://datacatalog.worldbank.org/search/dataset/0038117

Aggarwal, C. C., Hinneburg, A., & Keim, D. A. (n.d.). On the Surprising Behavior of Distance Metrics in High Dimensional Space. IBM T. J. Watson Research Center and Institute of Computer Science, University of Halle. Retrieved from https://bib.dbvis.de/uploadedFiles/155.pdf

SciPy v1.10.1 Manual. scipy.spatial.distance.minkowski. Retrieved June 11, 2024, from https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.minkowski.html

GeeksforGeeks. Haversine formula to find distance between two points on a sphere. Retrieved June 11, 2024, from https://www.geeksforgeeks.org/haversine-formula-to-find-distance-between-two-points-on-a-sphere/

scikit-learn. Neighbors Module. Retrieved June 11, 2024, from https://scikit-learn.org/stable/modules/classes.html#module-sklearn.neighbors
