Advanced modeling of housing locations in the city of Tehran using machine learning and data mining techniques … – Nature.com

To conduct the research and determine its strategy, the theoretical and applied implications of grounded theory, choice theory, evaluation, random utility, and content analysis methods were adopted. Each of these perspectives and approaches directly shaped the description and analysis in this research. Data derived from five data science-related Python libraries (Online, 2022a, 2022b, 2022c, 2022d, 2022e, 2022f, 2022g, 2022h, 2022i, 2022j, 2022k, 2022l, 2022m, 2022n, 2022o) were utilized as a reference for data discovery and optimization. Although this method seems simple, it helped to measure, model, and present the findings more effectively.

The following five models were also employed for the measurement and modeling:

Lasso regression: A type of linear regression that draws on shrinkage. Shrinkage describes how estimated values are shrunk towards a central point (e.g., the mean); in lasso, the regression coefficients are shrunk towards zero. This model best suits data that exhibit multicollinearity (Online, 2022a, 2022b, 2022c, 2022d, 2022e, 2022f, 2022g, 2022h, 2022i, 2022j, 2022k, 2022l, 2022m, 2022n, 2022o).

Kernel regression: The basis of this statistical model is a non-parametric method for estimating the conditional expectation of a random variable, and its mission is to identify a nonlinear relationship between the two variables x and y (Online, 2022a, 2022b, 2022c, 2022d, 2022e, 2022f, 2022g, 2022h, 2022i, 2022j, 2022k, 2022l, 2022m, 2022n, 2022o).

Elastic net: A regularized regression model that linearly combines the L1 and L2 penalties of the lasso and ridge methods (Online, 2022a, 2022b, 2022c, 2022d, 2022e, 2022f, 2022g, 2022h, 2022i, 2022j, 2022k, 2022l, 2022m, 2022n, 2022o).

Gradient boosting regressor: A machine learning method that draws on the results of weaker models (e.g., decision trees) to improve learning outcomes (Online, 2022a, 2022b, 2022c, 2022d, 2022e, 2022f, 2022g, 2022h, 2022i, 2022j, 2022k, 2022l, 2022m, 2022n, 2022o).

XGB Regressor: An optimized and regularized implementation of the gradient boosting regressor from the XGBoost library, often more powerful in practice (Online, 2022a, 2022b, 2022c, 2022d, 2022e, 2022f, 2022g, 2022h, 2022i, 2022j, 2022k, 2022l, 2022m, 2022n, 2022o).
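
Two of the definitions above can be made concrete in a few lines. The following is a minimal, dependency-free sketch, not the authors' implementation. `nadaraya_watson` is one common kernel regression estimator (the Gaussian kernel and the bandwidth value are assumptions), and `elastic_net_penalty` uses one common parametrization of the combined L1/L2 penalty (the `alpha` and `l1_ratio` names are illustrative).

```python
import math


def nadaraya_watson(x_train, y_train, x_query, bandwidth=1.0):
    """Estimate E[y | x = x_query] as a kernel-weighted average of y_train."""
    # Gaussian kernel weights: training points near x_query count more.
    weights = [math.exp(-0.5 * ((x_query - x) / bandwidth) ** 2) for x in x_train]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all kernel weights underflowed; increase the bandwidth")
    return sum(w * y for w, y in zip(weights, y_train)) / total


def elastic_net_penalty(coefs, alpha=1.0, l1_ratio=0.5):
    """Linear combination of the lasso (L1) and ridge (L2) penalties."""
    l1 = sum(abs(c) for c in coefs)
    l2 = sum(c * c for c in coefs)
    return alpha * (l1_ratio * l1 + 0.5 * (1.0 - l1_ratio) * l2)


# Toy data with a nonlinear relationship (y = x^2); values are illustrative.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [x * x for x in xs]
estimate = nadaraya_watson(xs, ys, x_query=2.0, bandwidth=0.5)
penalty = elastic_net_penalty([3.0, -4.0], alpha=1.0, l1_ratio=0.5)
```

Setting `l1_ratio=1.0` reduces the penalty to the pure lasso term, and `l1_ratio=0.0` to the pure ridge term.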

This is an exploratory study that adopts a descriptive-analytical perspective. The research sampling is theoretical, that is, a purposive sampling method in which the researcher tries to perform data mining and explore the phenomenon by drawing on the knowledge and opinions of the subjects (Kopai, 2015). Purposive sampling was also used to collect data, mainly extracted from official sources and statistics (Online, 2022a, 2022b, 2022c, 2022d, 2022e, 2022f, 2022g, 2022h, 2022i, 2022j, 2022k, 2022l, 2022m, 2022n, 2022o). The research data were also derived from a systematic review of documents and techniques over 2 years. Data analysis was conducted based on grounded theory and coding to discover priority variables in housing locations. To convert nominal data (the neighborhood column) to numerical data, the One Hot Encoding method was applied using the Python programming language, as part of the content analysis and data mining. Converting nominal data to numerical data is a requirement for the learning models. The rationale for using data mining is the growing size of existing and future data. Although data mining, like other techniques, cannot be conducted without human intervention, it enables analysts, who may not be experts in statistics or programming, to manage the knowledge extraction process effectively (Wickramasinghe, 2005). The study population consisted of 18,000 selected samples of villas and apartments. After extracting the data and deleting duplicates, the data distribution on the map of Tehran was determined and data analysis was carried out in three steps. First, after validation, 8,000 records from 22 districts and 317 neighborhoods of Tehran were selected and evaluated in terms of nine variables affecting housing prices: warehouse, elevator, parking lot, surface area, neighborhood, rent, mortgage, year, and total secure deposit.
Then, the extent of positive or negative correlation of the selected indicators was measured using the Dython library in the Python programming language. Finally, the learning models were estimated on the existing data using the cross-validation method.
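
The one-hot encoding step described above can be illustrated with a small, dependency-free sketch. It is not the authors' code: the `one_hot_encode` helper, the column names, and the sample records are hypothetical and only show how a nominal neighborhood column becomes a set of 0/1 columns.

```python
def one_hot_encode(rows, column):
    """Replace a nominal column with one 0/1 indicator column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every field except the nominal column being encoded.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical records; names and numbers are illustrative only.
listings = [
    {"area": 80, "neighborhood": "Punak"},
    {"area": 120, "neighborhood": "Tajrish"},
    {"area": 95, "neighborhood": "Punak"},
]
encoded = one_hot_encode(listings, "neighborhood")
```

With 317 neighborhoods, the same procedure would yield 317 indicator columns, which is why the encoded matrix is much wider than the original table.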

Finally, five regression-based models were implemented on the research data to achieve 85% accuracy and thereby enhance research validity. The accuracy of these models was measured using cross-validation (Online, 2022a) in two stages: before deleting the outliers and the warehouse column, and after deleting them (Table 1).
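
As a rough illustration of how cross-validated accuracy can be computed, the sketch below scores a single-feature least-squares line with k-fold cross-validation. It rests on assumptions: the score is taken to be the coefficient of determination (R²), the fold assignment and the `fit_line` model are simplifications, and the data are synthetic. It is not the procedure or the models used in the study.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 is perfect, negative is worse than the mean."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x with one feature."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def cross_val_r2(xs, ys, k=5):
    """Return one R^2 score per fold, each computed on held-out samples."""
    scores = []
    for fold in range(k):
        test_idx = set(range(fold, len(xs), k))  # every k-th sample is held out
        pairs = list(zip(xs, ys))
        train = [p for i, p in enumerate(pairs) if i not in test_idx]
        test = [p for i, p in enumerate(pairs) if i in test_idx]
        intercept, slope = fit_line(*zip(*train))
        preds = [intercept + slope * x for x, _ in test]
        scores.append(r2_score([y for _, y in test], preds))
    return scores


# Synthetic data: price grows linearly with area (values are illustrative).
areas = [float(a) for a in range(50, 150, 5)]
prices = [2.0 * a + 10.0 for a in areas]
scores = cross_val_r2(areas, prices, k=5)
```

Under this metric, a score is negative whenever the predictions are worse than always predicting the mean of the held-out targets; for example, `r2_score([1, 2, 3], [3, 2, 1])` evaluates to -3.0.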

Negative values in Table 1 indicate very low model accuracy (Online, 2022e); the closer a model's accuracy is to 1 (assuming a maximum accuracy of 100%), the better the results, and vice versa. The significant improvement in the accuracy of the models arises because deleting the outliers optimized the skewness and kurtosis of the value distribution of each data column, which was essential for the modeling. Optimizing skewness and kurtosis does not necessarily improve the accuracy of every model (Online, 2022). Still, the models adopted in this paper benefited from this optimization in the best possible manner. Since data with a surface area of more than 200 m² had an asymmetrical distribution, only settlements with a maximum area of 200 m² were evaluated and measured. In this research, each record includes the house price set by the seller according to the determinants of residential housing prices. After selection, the research data were organized into a database whose columns formed a matrix for valuation and encoding. The columns hold the nine variables (warehouse, elevator, parking lot, area, neighborhood, rent, mortgage, year, and total deposit), and each row holds the values of one record. Each of these variables plays a significant role in housing pricing and location. During the research process, the neighborhood name column was converted into columns with numerical variables using the One Hot Encoding (Online-retrieved, 2022) method. For clarity, Table 2 displays the matrix of variables and the data values for selling or buying housing in some Tehran neighborhoods and urban areas.

According to the data analysis, some values of the total value column were zero because the properties had been put on sale for a negotiated price; the equivalent rent and deposit were therefore also zero. Rows containing this value in this column were deleted because prices outside the natural range interrupt the learning process of the models and yield false predictions.
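
A filtering step of this kind can be sketched as follows. The `clean_listings` helper, the column names, and the records are hypothetical. Only the two rules come from the text: rows with a zero total value are dropped, and the area is limited to 200 m².

```python
def clean_listings(rows, max_area=200):
    """Keep rows with a positive total value and an area within the limit."""
    return [
        row for row in rows
        if row["total_value"] > 0 and row["area"] <= max_area
    ]


# Hypothetical records; column names and numbers are illustrative only.
raw = [
    {"area": 85, "total_value": 3_200},
    {"area": 120, "total_value": 0},      # negotiated price: no usable target
    {"area": 450, "total_value": 9_900},  # area above the 200 m2 limit
]
cleaned = clean_listings(raw)
```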

For example, Figs. 1 and 2 show outliers for the columns related to Area and Total value after the data preprocessing step.

This figure shows the density plot of the area column data after the removal of outliers. The x-axis represents the area in square meters, while the y-axis represents the density. The plot indicates the frequency distribution of area sizes within the specified range, highlighting the peak and spread of the data.

This figure illustrates the density plot of the total value column data after removing outliers. The x-axis represents the house total value in Iranian Toman, while the y-axis represents the density. The plot highlights the frequency distribution of house values, showing the range, peak, and overall distribution pattern of the data.

Most of the data have relative values close to zero compared with a few other records, indicating that some values are very large but occur with low frequency. By limiting the range of values in the research data, an attempt was made to bring the distributions of these columns closer to the normal distribution. The probability plots of the Area and Total value columns were also drawn; Figures 3 and 4 show the results after omitting the outliers caused by values that are too large or have low frequency.

This figure presents a Q-Q (quantile-quantile) plot comparing the probability distribution of the total value column data, post-outlier removal, against a normal distribution. The x-axis represents the theoretical quantiles, while the y-axis represents the sample quantiles. The blue points indicate the observed values, and the red line represents the reference line for a normal distribution. The plot demonstrates how well the data conforms to a normal distribution, with deviations indicating departures from normality.

This figure shows a Q-Q (quantile-quantile) plot comparing the probability distribution of the area column data, after the removal of outliers, to a normal distribution. The x-axis represents the theoretical quantiles, while the y-axis represents the sample quantiles. The blue points indicate the observed values, and the red line represents the reference line for a normal distribution. The plot assesses how closely the area data follows a normal distribution, with deviations highlighting discrepancies from normality.
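
The points of a Q-Q plot like those described above can be computed with the Python standard library alone. This is a minimal sketch under assumptions: the plotting positions `(i + 0.5) / n` and the mean/standard-deviation reference line are one common convention, and the sample values are invented for illustration, not taken from the study.

```python
from statistics import NormalDist, mean, stdev


def qq_points(sample):
    """Return (theoretical, observed) quantile pairs against a standard normal."""
    ordered = sorted(sample)
    n = len(ordered)
    standard_normal = NormalDist()
    # Plotting positions (i + 0.5) / n avoid the infinite quantiles at 0 and 1.
    theoretical = [standard_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))


def reference_line(sample):
    """Intercept and slope of the line a normal sample with this mean and spread follows."""
    return mean(sample), stdev(sample)


# Illustrative values only, not the study's data.
areas = [62, 75, 80, 88, 95, 101, 110, 124, 150]
points = qq_points(areas)
intercept, slope = reference_line(areas)
```

Plotting `points` as a scatter and drawing `intercept + slope * x` over the same x-range gives the usual picture: the further the points bend away from the line, the less normal the column.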

The statistical studies suggested that the column dedicated to the year of construction also contained abnormal data; hence, houses built earlier than 1995 were removed as outliers. In addition, the skewness and kurtosis of the distribution curve related to the area, year of construction, and Total value, before and after the omission of outliers, are presented in Table 3 (Table 3).

The skewness and kurtosis of the distribution curve of each column directly affect the learning of the prediction models, diminishing or improving their accuracy. The closer the skewness and kurtosis are to their optimal values, the more accurate the models' predictions will be. The skewness and kurtosis of the other columns were not investigated because they do not contain continuous data. Skewness in the range of -0.5 to 0.5 means that the data are relatively symmetric, and kurtosis between -2 and 2 is acceptable (George, 2010). Therefore, the skewness values obtained after removing outliers help the models learn better.
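
The effect of outlier removal on these two statistics can be checked with a short sketch. It makes several assumptions: population (biased) moment formulas are used, kurtosis is taken as excess kurtosis (normal distribution = 0), the rule of thumb is read as symmetric bounds (|skewness| ≤ 0.5, |kurtosis| ≤ 2), and the area values are invented for illustration.

```python
def skewness(values):
    """Population skewness: third standardized moment (0 for symmetric data)."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def excess_kurtosis(values):
    """Population excess kurtosis: fourth standardized moment minus 3."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m4 = sum((v - mu) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0


def within_rule_of_thumb(values, skew_limit=0.5, kurt_limit=2.0):
    """Check both statistics against the assumed symmetric bounds."""
    return (abs(skewness(values)) <= skew_limit
            and abs(excess_kurtosis(values)) <= kurt_limit)


# Illustrative areas with one extreme value, then trimmed at 200.
areas = [60, 70, 75, 80, 85, 90, 95, 100, 110, 900]
trimmed = [a for a in areas if a <= 200]
before_ok = within_rule_of_thumb(areas)
after_ok = within_rule_of_thumb(trimmed)
```

In this toy sample a single extreme value is enough to push the column outside the bounds, and trimming it brings both statistics back inside.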
