PROPERTY VALUE ASSESSMENT USING ARTIFICIAL NEURAL NETWORKS, HEDONIC REGRESSION AND NEAREST NEIGHBORS REGRESSION METHODS

In this paper, hedonic regression, nearest neighbors regression and artificial neural network methods are applied to a real, up-to-date real estate data set from the Adana province of Turkey. Traditionally, hedonic regression methods have been used to predict house prices. Because the relationships between the factors affecting house prices are generally nonlinear, alternative methods have been needed. Nearest neighbors regression (k-nn) and artificial neural networks (ANN) both provide flexible, nonlinear fits. The classical hedonic approach and its nonlinear alternatives are applied to a mixed-type data set and compared using performance measures including root mean squared error, the coefficient of determination (R squared) and mean absolute error. Cross validation is used to determine the appropriate model parameters for nearest neighbors regression and ANN. According to the results, ANN performs better than the other methods on all measures. The k-nn regression method provides reasonable results, although it performs worse than hedonic regression. The results indicate that ANN is a powerful tool for predicting house prices.


INTRODUCTION
Traditionally, owning a house has been one of the main goals of human beings and is central to a person's entire life. One reason is that a house meets the need for shelter, which is our most fundamental and vital need. In addition, it offers a profitable and wise investment opportunity. Because a house is both a property and an investment asset, the real estate market has its own unique characteristics. These characteristics are high cost of supply, heterogeneity, durability, locational fixity, the possibility of raising loans against housing collateral, and the existence of a well-developed secondary market (Iacoviello, 2000:10; Quigley, 1992 and Miles, 1992).
The housing market is a composite market including many participants such as potential owners, building contractors, investors, appraisers, banks, insurers, consultants, market researchers, lenders, developers and so forth (Frew and Jud, 2003; Selim, 2009). As a consequence, a precise and fair estimate of the sale price of a house is of great concern. Mistakes in determining the price cause undesirable results such as property taxes that are too high or too low, excessive profit in favor of some groups, or adverse effects on potential homeowners.
Accurately predicting house prices is not an easy or clear-cut problem because numerous factors are involved. Structural, locational and environmental properties of a house affect the price. No study has definitively established which attributes should be chosen. A great deal of previous research has focused on determining the most important factors affecting house prices. Several estimation models have been proposed as tools to predict the market price of a house precisely. These models provide useful insights into the individual effect of each attribute on the price. Pagourtzi et al. (2003) classified estimation models into two groups, traditional and advanced. In that study, models built by assuming an underlying functional form between the attributes and the prices are described as traditional. Regression models are an example of this kind. Advanced models, on the other hand, are based on mimicking human reasoning. Artificial neural networks, fuzzy logic and ARIMA models are among the advanced methods.
In the housing market, hedonic regression methods have often been used. These methods are based on multiple regression, apart from a conceptual difference. The term hedonic refers to "weighting of the relative importance of various components among others in constructing an index of usefulness and desirability" (Goodman, 1998). It originates from microeconomic theory (Lancaster, 1966). While assuming an underlying form between the attributes of a house and the price makes inference easier, there is no clear way of determining the optimal form. The existing literature on this topic is limited because economic theory offers insufficient guidance about the proper functional form (Bin, 2004). Most studies have focused on estimating the sale price of a house using flexible and nonlinear forms, some of which are based on Box-Cox transformation approaches (Box and Cox, 1964). This approach has attracted the interest of researchers and has been widely used to gain better insights. Although applying a transformation to price sounds attractive, it is not straightforward and causes many problems in fitting. A notable one is the choice of the proper function. Feature selection is another problem. Further problems include outliers, nonlinear relationships between attributes and prices, some kinds of dependencies, multicollinearity and so on. As a result, different approaches have been tried for predicting house prices. Artificial neural networks and nearest neighbors regression have been preferred because they do not assume any underlying functional form between the attributes and the price of a house.
An artificial neural network is a computational, nonlinear statistical modelling method inspired by the biological neurons of human beings. It can be used to investigate the relationships between the attributes and the price of a house. A neural network basically learns by observing the data itself: useful patterns are sought and used to update the network's parameters. Some artificial neural networks, such as multilayer perceptrons (MLP), are considered a kind of multiple regression method because of structural similarities. Artificial neural networks are non-parametric methods because they do not assume any underlying functional form. Non-parametric methods provide more flexible options. Another well-known member of this family is nearest neighbors regression, generally known as k-nearest neighbors regression (k-nn regression). For a chosen value of k, it finds the k units closest to a given unit and uses them to estimate it (James et al., 2013). The value of k strongly affects the solution and the flexibility of the model.
The main goal of this paper is to examine the effectiveness of k-nn regression with the Gower measure (a distance measure for mixed-type data), hedonic regression and artificial neural networks in property appraisal. Additionally, it is investigated whether selecting the parameters of k-nn regression and artificial neural networks by cross validation affects performance. To this end, the factors driving house prices in the Adana province of Turkey are investigated. The data set was retrieved from a popular real estate website. Hedonic regression, artificial neural networks and k-nearest neighbors regression are applied and compared with one another. The number of hidden layers and nodes for the artificial neural networks and the appropriate k value for k-nn regression are determined by cross validation. In k-nn regression, Gower distance is used to calculate the distances because the data set contains mixed variable types.
This paper is organized as follows. Section 2 briefly reviews the literature on the aforementioned methods. The details and definitions of hedonic regression, nearest neighbors regression and artificial neural networks are given in Section 3. Section 4 describes the data set and compares the results of the three methods. Finally, conclusions are reported and discussed in Section 5.

RELATED LITERATURE REVIEW
In the literature, most studies compare several methods that may be useful for predicting house prices. The number and types of attributes vary. Multiple linear regression, hedonic regression, fuzzy logic, artificial neural networks, memory-based reasoning and semi-parametric regression methods have been commonly used in these studies. The preferred performance measures are mean squared error, root mean squared error, R squared, mean absolute error and Theil's U statistic.
One of the first uses of artificial neural networks (ANN) was on data sets of family residences in England by Borst (1991). Tay and Ho (1992) compared the performance of the back propagation neural network (BP) model and the multiple regression analysis (MRA) model in estimating house sale prices. Similarly, Worzala et al. (1995) applied ANN to real estate appraisal and compared the results with MRA; that study found that neural networks were not superior to classical approaches. Rossini (1996) reviewed the literature and used ANN to compare its results with MRA and the actual sale prices. The results of Rossini's study vary and no single method is best; the study aimed to clarify the use of ANN for estimating sale prices. Cechin et al. (2000) applied a multilayer perceptron neural network and ordinary regression analysis to appraise the monetary value of apartments in the city of Porto Alegre in southern Brazil. The neural network was preferred in this study mainly because of the nonlinearity between attributes. Nguyen and Al Cripps (2001) compared ANN and MRA for predicting prices of single-family houses under various data models, including different sample sizes, functional forms and temporal prediction. The study summarizes the performance results of these comparisons and offers explanations about the use of ANN and MRA. Limsombunchai and Samarasinghe (2004) compared the predictive power of the hedonic regression model and an artificial neural network model for house price prediction using a web database in Christchurch, New Zealand. The results suggested using ANN, while noting its black-box nature and that results vary under different conditions. Bin (2004) used semi-parametric regression and compared it with traditional parametric models in terms of prediction performance on housing prices. Semi-parametric regression outperformed its parametric counterparts in this study. Zurada et al. (2006) used fuzzy logic and memory-based reasoning to evaluate residential property values for a real data set and compared them with neural networks and multiple regression. In this study, principal component analysis and variable selection were employed to improve the quality of the results. The results showed that there was no single superior method for the data set. Khalafallah (2008) used neural network based models for predicting housing market performance in testing and validation processes. The prediction error in this study was in the range between -2% and +2%. Selim (2009) applied hedonic regression analysis and ANN to determine the factors affecting house prices in Turkey. ANN was found to be a powerful alternative that predicts better than classical hedonic regression. Mousa and Saadeh (2010) built an ANN model for the automatic appraisal of Jordanian real estate to avoid the drawbacks of manual appraisal, using a genetic algorithm to determine the best network structure. Statistical tests were carried out to validate the effectiveness of the proposed method. Kontrimas and Verikas (2011) compared ordinary least squares (OLS), support vector machine (SVM) regression, multilayer perceptron (MLP) and a committee of predictors. The proposed committee of models outperformed all other predictors. Sampathkumar et al. (2015) applied multiple regression and neural networks to predict land prices in the state of Tamilnadu, India. The results showed that both models fitted well but neural networks provided better accuracy. Abidoye and Chan (2017) applied ANN to model property values in Nigeria and found that ANN could be used as a tool for reliable and accurate property valuation.

Hedonic Regression Method
As mentioned before, the term hedonic is defined as "the weighting of the relative importance of various components among others in constructing an index of usefulness and desirability" (Goodman, 1998). The hedonic price model holds that each characteristic of a non-homogeneous good provides a different benefit or degree of utility. The model is generally used to determine the fair and accurate price of a good in its market. The approach originates from the consumer theory developed by Lancaster (1966) and was extended to the real estate market by Rosen (1974). Hedonic estimation has traditionally been used in housing studies to make inferences about the non-observable values of attributes such as air quality, airport noise, commuter access (railway, subway or highway) and neighborhood amenities (Janssen et al., 2001; Selim, 2009). It has also been widely used for the valuation of agricultural goods, real estate pricing and environmental studies (Limsombunchai and Samarasinghe, 2004).
The multiple regression model has the same purpose and usage as the hedonic approach, so the hedonic concept can be transferred into regression analysis. The properties of a house are the independent variables and the price of the house is the dependent variable. In fact, regression analysis is known as the hedonic price model in real estate and similar markets involving the valuation of a good (Selim, 2009). Let $Y$ be the dependent variable, $X_1, \dots, X_p$ the independent variables and $\beta_0, \beta_1, \dots, \beta_p$ the coefficients; the hedonic price model is then defined as follows:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon$$
where $\varepsilon$ is the error term. This model is exactly the same as the multiple linear regression model. However, many different functional forms can be used in this model. The existing literature on this topic is limited because economic theory offers insufficient guidance about the proper functional form (Bin, 2004). There is therefore no obvious and effective way to determine the appropriate form, but linear, logarithmic and squared forms have been used very often. The most preferred functional form is the semi-logarithmic form because it allows each coefficient to be interpreted as a proportional change in the price of the good (Halvorsen and Palmquist, 1980). In this form, the natural logarithm of the house price is treated as the new dependent variable and ordinary least squares is applied to it. It should be noted that issues such as outliers, nonlinear relationships between attributes and prices, some kinds of dependencies and multicollinearity have to be examined carefully in models that include transformations.
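The semi-logarithmic form described above can be illustrated with a minimal sketch. The data, the variable names and the helper functions `fit_semilog_hedonic` and `percent_effect` are hypothetical and are not taken from the study; the conversion 100(exp(β) − 1) is the usual reading of a dummy-variable coefficient in a semi-log model, and it is only an assumption that the reported percent effects follow the same convention.

```python
import numpy as np

def fit_semilog_hedonic(X, price):
    """OLS fit of ln(price) on the attributes in X (an intercept column is added)."""
    X = np.asarray(X, dtype=float)
    design = np.column_stack([np.ones(len(X)), X])
    log_price = np.log(np.asarray(price, dtype=float))
    coef, *_ = np.linalg.lstsq(design, log_price, rcond=None)
    return coef  # coef[0] is the intercept

def percent_effect(beta):
    """Percentage change in price implied by a semi-log coefficient."""
    return 100.0 * (np.exp(beta) - 1.0)

# Hypothetical toy data: size in square meters and a dummy for one heating type.
rng = np.random.default_rng(0)
size = rng.uniform(60, 200, 200)
combi = rng.integers(0, 2, 200)
# Noise-free prices, for illustration only.
price = np.exp(11.0 + 0.005 * size + 0.10 * combi)

coef = fit_semilog_hedonic(np.column_stack([size, combi]), price)
print(np.round(coef, 3))
print(round(float(percent_effect(coef[2])), 2))
```

Because the toy prices are generated without noise, the fit recovers the generating coefficients; with real listings the estimates would carry sampling error and the diagnostics discussed in the text would still be needed.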

Nearest Neighbors Regression Method
Nearest neighbors regression is a non-parametric, flexible method that does not assume an underlying functional form for the model (James et al., 2013). Conceptually, the method is simple and straightforward compared with its competitors. It is widely known as k-nearest neighbors regression (k-nn regression). The k-nn approach basically uses the k closest samples to predict a new unit.
The k-nn regression method cannot be regarded as a traditional model because it depends on the individual samples in the data set. In this method, a new sample is predicted by the mean of the responses of its k closest neighbors (Kuhn and Johnson, 2013). Given a new sample, say $x_0$, and a value of k, k-nn regression finds the k samples closest to it. When $N_0$ denotes the set of these k samples, the prediction for the new sample is calculated as follows (James et al., 2013):
$$\hat{f}(x_0) = \frac{1}{k} \sum_{x_i \in N_0} y_i$$
Euclidean distance is the most commonly used measure of closeness in the literature. It is defined as follows:
$$d(x_a, x_b) = \sqrt{\sum_{j=1}^{P} (x_{aj} - x_{bj})^2}$$
Here $x_a$ and $x_b$ are any two samples in the data set and $P$ is the number of variables. The Minkowski distance is a generalization of the Euclidean distance and is defined as (Liu, 2007; Kuhn and Johnson, 2013):
$$d(x_a, x_b) = \left( \sum_{j=1}^{P} |x_{aj} - x_{bj}|^{t} \right)^{1/t}$$
It can clearly be seen that when t = 2 this metric is equivalent to the Euclidean distance, and when t = 1 it corresponds to the city block distance. City block distance is generally used to find the distance between binary variables. Other popular alternatives such as the Cosine, Hamming, Jaccard, Tanimoto and Simple Matching measures have been used for different purposes in different contexts and areas. Not only the choice of distance measure but also the scale of the variables is critical for prediction performance, because variables measured on larger scales dominate the distance. To avoid this potential bias and to let each independent variable contribute equally to the distance, centering and scaling the independent variables is suggested before applying k-nn (Kuhn and Johnson, 2013). When the data set includes mixed variable types, Gower distance can be used to calculate the distances between samples.
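As a rough illustration of how k-nn regression can work with mixed variable types, the sketch below uses the common form of the Gower distance: range-scaled absolute differences for numeric attributes and simple mismatch for categorical ones, averaged over attributes. The records, prices and function names are hypothetical, and the code is not the implementation used in the study.

```python
def gower_distance(a, b, ranges):
    """Gower distance between two records.

    a, b   : sequences of attribute values (numbers or category labels)
    ranges : per-attribute range (max - min) for numeric attributes,
             None for categorical attributes
    """
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if r is None:                 # categorical: simple mismatch
            total += 0.0 if x == y else 1.0
        elif r > 0:                   # numeric: range-scaled absolute difference
            total += abs(x - y) / r
        # r == 0: a constant attribute contributes nothing
    return total / len(ranges)

def knn_predict(x0, X, y, k, ranges):
    """Predict x0 as the mean response of its k nearest neighbors."""
    order = sorted(range(len(X)), key=lambda i: gower_distance(x0, X[i], ranges))
    nearest = order[:k]
    return sum(y[i] for i in nearest) / k

# Hypothetical records: (size in m2, number of rooms, district)
X = [(90, 3, "Seyhan"), (120, 4, "Çukurova"), (150, 5, "Çukurova"), (80, 2, "Yüreğir")]
y = [200000, 320000, 410000, 150000]
ranges = [150 - 80, 5 - 2, None]

print(knn_predict((125, 4, "Çukurova"), X, y, k=2, ranges=ranges))
```

Dividing each numeric difference by the attribute's range plays the role that centering and scaling play for Euclidean distance: every attribute contributes a value between 0 and 1.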
Another consideration is that the data set must not contain missing values; otherwise the distance between units cannot be computed. As a tuning parameter, the number of neighbors, k, plays a key role in the results. As k increases, the fit becomes less variable, which means high bias but low variance. To get a more flexible fit, a smaller k should be chosen; in contrast to a large k, this yields a model with high variance but low bias. In view of this bias-variance tradeoff, k should be determined carefully using a resampling technique such as cross validation (James et al., 2013; Kuhn and Johnson, 2013).
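The selection of k by cross validation can be sketched as follows. This is a simplified, assumption-laden example: it uses a single numeric attribute with absolute distance, synthetic data, a hypothetical candidate grid, and an RMSE pooled over all held-out predictions rather than averaged per fold.

```python
import math
import random

def knn_mean(x0, xs, ys, k):
    """Mean response of the k training points closest to x0 (1-D, absolute distance)."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k

def cv_rmse(xs, ys, k, n_folds=10, seed=0):
    """Cross-validated RMSE of k-nn regression, pooled over all held-out points."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    sq_err = 0.0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        tx, ty = [xs[i] for i in train], [ys[i] for i in train]
        for i in held_out:
            sq_err += (ys[i] - knn_mean(xs[i], tx, ty, k)) ** 2
    return math.sqrt(sq_err / len(xs))

# Hypothetical noisy data: a smooth function of one attribute plus noise.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [math.sin(x) + rng.gauss(0, 0.3) for x in xs]

scores = {k: cv_rmse(xs, ys, k) for k in (1, 3, 5, 7, 9, 15, 25)}
best_k = min(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

Very small k tends to chase the noise and very large k tends to oversmooth, so the cross-validated error is typically lowest at an intermediate value.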

Artificial Neural Networks
An artificial neural network (ANN) is a computational, nonlinear statistical modelling method inspired by the biological neurons of human beings (Bishop, 1995; Kuhn and Johnson, 2013). ANNs have been widely used in many areas such as aerospace, automotive, banking, defense, electronics, entertainment, financial, insurance, manufacturing, medical, oil and gas, robotics, speech, securities, telecommunications and transportation (Demuth et al., 2014). ANNs can also provide accurate predictions in a regression context and can be seen as nonlinear regression methods (Selim, 2009).
A neural network consists of several components such as weights, nodes, layers and activation functions. There are three main kinds of layer: the input layer (which holds the values of the independent variables), the hidden layer (which contains a certain number of processing units, or nodes) and the output layer (which gives the estimated values of the dependent variable). Initial weights are generally drawn at random from a distribution over a fixed range such as [-1,1]. The remaining component, the activation function, produces a value from the total net input, defined as the weighted sum of the inputs plus the bias. All of these components are interconnected. An ANN takes the same independent variables as inputs and the same dependent variable as output as classical regression models do. It basically learns by observing the data set itself, updating the weights to reduce the error between the actual and estimated values of the dependent variable. The relationship between the input layer and a hidden node can be expressed as a linear combination of the inputs passed through the activation function:
$$h_t(x) = g\left(\beta_{0t} + \sum_{j=1}^{P} x_j \beta_{jt}\right)$$
where $g(\cdot)$ is the activation function, $\beta_{jt}$ is the weight between variable $j$ and hidden node $t$, and $\beta_{0t}$ is the bias value of hidden node $t$. The value $h_t(x)$ is simply the output of hidden node $t$. After the number of hidden nodes $H$ has been chosen, the outcome can similarly be defined as a linear combination of these nodes (Kuhn and Johnson, 2013):
$$f(x) = \gamma_0 + \sum_{t=1}^{H} \gamma_t h_t(x)$$
Here $f(x)$ is the estimated outcome value. In an ANN model, the parameters are updated to minimize the sum of the squared residuals. This updating is carried out by learning algorithms such as the widely used backpropagation algorithm proposed by Rumelhart et al. (1986). It should be noted that there is no guarantee of reaching the global optimum (Kuhn and Johnson, 2013).
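The forward computation of a network with one hidden layer, sigmoid hidden nodes and a linear output can be sketched as below. The weights and inputs are arbitrary illustrative numbers and the function names are hypothetical; training (for example by backpropagation) is deliberately left out.

```python
import math

def sigmoid(u):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-u))

def forward(x, hidden_bias, hidden_weights, out_bias, out_weights):
    """Output of a single-hidden-layer network for one input vector x.

    hidden_weights[t][j] is the weight from input j to hidden node t.
    """
    # Each hidden node applies the activation to its bias plus weighted inputs.
    hidden = [
        sigmoid(b + sum(w * xj for w, xj in zip(ws, x)))
        for b, ws in zip(hidden_bias, hidden_weights)
    ]
    # The output is a linear combination of the hidden node outputs.
    return out_bias + sum(g * h for g, h in zip(out_weights, hidden))

# Hypothetical weights for 2 inputs and 2 hidden nodes.
x = [0.5, -1.0]
hidden_bias = [0.0, 0.1]
hidden_weights = [[1.0, 0.5], [-0.5, 0.2]]
print(forward(x, hidden_bias, hidden_weights, out_bias=0.2, out_weights=[1.0, -1.0]))
```

Learning then amounts to adjusting the biases and weights so that the squared difference between this output and the observed price shrinks over the training data.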

Data Set, Source and Preprocessing
The main data were retrieved from a well-known real estate website in January and February 2018. The data set contains 3114 units and 11 variables, which are given with descriptive statistics in Table 2. The data set covers four central districts of Adana: Seyhan, Çukurova, Yüreğir and Sarıçam. The location of the house, the age of the building, credit availability, size (square meters), the number of rooms, the number of bathrooms, the floor of the house, the number of floors in the building, the distance to the city center and the heating system of the house are used as variables.
Outliers may dramatically affect the results. Therefore, an outlier analysis is carried out using criteria such as studentized residuals, leverage values and Cook's distance. Based on this analysis, 12 units are omitted from the data set. The results of the outlier analysis are given in Appendix B.
Besides outliers, multicollinearity is another important issue; it refers to near-linear dependencies between independent variables. The presence of multicollinearity can produce unstable regression coefficients with large variances and covariances and large absolute values (Montgomery et al., 2012). The variance inflation factor (VIF) and the condition index (CI) are used to determine whether multicollinearity is present. The cutoff values are taken as 10 for VIF and 1000 for CI. According to these criteria, there is no multicollinearity between the independent variables. The results are given in Appendix C.
The data set is split into training and test data in proportions of 70% and 30%, respectively. The models are fitted on the training data and evaluated on the test data. RMSE, R squared and MAE are calculated as performance measures and reported side by side. The hedonic regression results were obtained using IBM SPSS and STATA 14.0. R was used for k-nn regression and artificial neural networks.
The descriptive statistics of the data set are given in Table 1 and Table 2. According to these statistics, the majority of the houses are located in the Çukurova district. Most of them are 0-5 years old. Combi boilers are the most common heating system. Bank credit is available for most houses. Almost all houses carry some amount of dues. The lowest house price is 65000 TL and the highest is 1350000 TL. The mean price is around 282000 TL.
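The three diagnostics named above can be computed from an ordinary least squares fit as in the following sketch. The data are synthetic with one planted outlier, the function name is hypothetical, and the studentized residuals are the internally studentized version; the cutoffs actually applied in the study are not stated here, so none are assumed.

```python
import numpy as np

def influence_measures(X, y):
    """Leverage, internally studentized residuals and Cook's distance for an OLS fit."""
    X = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    H = X @ np.linalg.pinv(X.T @ X) @ X.T        # hat matrix
    leverage = np.diag(H)
    resid = y - H @ y
    mse = resid @ resid / (n - p)
    student = resid / np.sqrt(mse * (1.0 - leverage))
    cooks = student**2 * leverage / (p * (1.0 - leverage))
    return leverage, student, cooks

# Hypothetical data: a linear trend with one planted outlier at index 10.
rng = np.random.default_rng(0)
x = np.arange(20, dtype=float)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, size=20)
y[10] += 30.0

leverage, student, cooks = influence_measures(x.reshape(-1, 1), y)
print(int(np.argmax(cooks)), int(np.argmax(np.abs(student))))
```

Leverage flags unusual attribute combinations, studentized residuals flag unusual prices given the attributes, and Cook's distance combines the two into a single measure of influence on the fit.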
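A variance inflation factor can be obtained by regressing each independent variable on the others, as in this minimal sketch. The three synthetic attributes and the function name are hypothetical; the cutoff of 10 mentioned in the comment is the one quoted in the text.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the other columns."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        fitted = others @ np.linalg.lstsq(others, target, rcond=None)[0]
        ss_res = np.sum((target - fitted) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        vifs.append(ss_tot / ss_res)   # algebraically equal to 1 / (1 - R^2)
    return np.array(vifs)

# Hypothetical attributes: rooms is built to track size closely; distance is unrelated.
rng = np.random.default_rng(0)
size = rng.uniform(60, 200, 300)
rooms = size / 40 + rng.normal(0, 0.3, 300)
distance = rng.uniform(1, 20, 300)

vifs = variance_inflation_factors(np.column_stack([size, rooms, distance]))
print(np.round(vifs, 2))   # values above 10 would signal multicollinearity under the stated cutoff
```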
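The split and the three performance measures can be written down compactly. The sketch below is illustrative only: the sample size, seed and toy values are hypothetical, and R squared is computed as 1 − SSE/SST, which is an assumption because some software reports the squared correlation between observed and predicted values instead.

```python
import math
import random

def train_test_split(n, train_share=0.7, seed=0):
    """Random index split into training and test parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(round(train_share * n))
    return idx[:cut], idx[cut:]

def rmse(y, yhat):
    """Root mean squared error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))

def mae(y, yhat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

def r_squared(y, yhat):
    """Coefficient of determination, computed as 1 - SSE/SST."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

train_idx, test_idx = train_test_split(1000)
print(len(train_idx), len(test_idx))

# Hypothetical observed and predicted values.
y_obs = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
print(rmse(y_obs, y_pred), mae(y_obs, y_pred), r_squared(y_obs, y_pred))
```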

Hedonic Regression Results
The hedonic regression results are given in Table 3. As mentioned in the preprocessing step, there is no multicollinearity between the independent variables. The Breusch-Pagan test, however, indicates that heteroscedasticity is present, so the constant-variance assumption is violated. Robust standard errors for the coefficients are used to address this violation, and the t statistics and p values are calculated from these standard errors. The majority of the coefficients are significant. The last column of the hedonic regression results presents the percent effect (i.e. exp(coef)) of each variable. The signs of the coefficients and effects are consistent with the literature and expectations. According to these values, house prices in the Çukurova district are 44.8% higher than in Sarıçam (the base category), while prices in Yüreğir are 1.2% lower than in Sarıçam. The results also indicate that size, number of rooms, number of bathrooms and number of floors in the building have significant, positive effects on prices.
Age is also an important variable. Houses aged 6-10 years are priced lower than those aged 0-5 years (the base category for age). As age increases, this difference grows up to the 31-35 year category. Beyond this age, prices increase significantly and distinctly. This may depend on where these houses are located: environmental and physical conditions may have improved around buildings that have stood for a long time. Houses with a combi boiler as the heating system are priced approximately 11% higher than those with a stove (the base category). Credit availability, dues, air conditioning, being 26-30 years old and being in the Yüreğir district do not have a significant effect on prices. Observed and predicted house prices for hedonic regression are given in Figure 2.

Nearest Neighbors Regression Results
Choosing an optimal k value plays a key role in k-nn regression results. As mentioned before, this value should be determined carefully using a resampling technique such as cross validation. In this section, the RMSE, R squared and MAE performance measures are calculated for various k values using 10-fold cross validation and given in Appendix A. All results are obtained using Gower distance because the data set contains mixed variable types. The results are shown in Figure 3.

Both Figure 3 and Appendix A show that as k increases, RMSE and MAE decrease and R squared increases up to a certain value, after which the trend reverses. The optimal k value is therefore taken as 7, which gives the minimum RMSE, the highest R squared and a reasonable MAE. Using this value, the model is fitted on the whole training data set. The results for the training and test data are given in Table 4. Predicted and observed house prices after fitting k-nn regression are given in Figure 4.

Artificial Neural Networks Results
In this study, the number of hidden layers and the number of nodes in each layer are the parameters that must be tuned. Cross validation is carried out to determine them. One or two hidden layers are considered, and {5,10,15,20} are tried as the number of nodes in each hidden layer. Many activation functions have been proposed in the literature; the sigmoid activation function is the most commonly used and is also preferred in this study. RMSE, R squared and MAE are calculated on the training and test data sets for every combination of one or two hidden layers and the four candidate numbers of hidden nodes. According to these results, the best combination uses the sigmoid activation function with five hidden nodes in hidden layer 1 and fifteen in hidden layer 2; this combination gives the minimum test RMSE. The results for each combination are given in Table 5. Using these settings, the model is fitted on the whole training data set, and test results are obtained by applying this model to the test data set. Predicted and observed house prices after fitting the multilayer perceptron with two hidden layers and sigmoid activation function are given in Figure 5.

Comparison of Three Methods
An overall comparison of the results is given in Table 7. According to these results, ANN is the best method for predicting house prices: it has both the lowest RMSE and MAE values and the highest R squared value. This suggests that ANN is a better tool than hedonic regression. Hedonic regression, in turn, outperforms k-nn regression on all performance measures. Additionally, a comparison of prices for a sample of cases is given in Table 8, and a visual representation of these prices is given in Figure 6.

CONCLUSIONS
In this study, the factors affecting the prices of houses located in Adana have been investigated using hedonic regression, nearest neighbors regression and artificial neural network methods. According to the hedonic regression results, district, age, size, number of rooms, number of bathrooms, type of heating system, floor, number of floors in the building and distance to the city center are significant determinants of house prices. As flexible, nonlinear alternatives, k-nearest neighbors and artificial neural network approaches have been used and compared with hedonic regression. It is shown that selecting parameters by cross validation in k-nn regression and ANN affects the performance of these methods. Consequently, ANN has been found to be the best method in terms of test performance in predicting house prices. It can be used as a powerful alternative to the ordinary hedonic regression method.

Figure 1. A sample structure of a multilayer neural network (towardsdatascience.com)

Figure 3. CV RMSE values for a range of k values

Figure 4. Observed and predicted house prices by k-nn regression

Figure 5. Observed and predicted house prices by multilayer perceptron

Figure 6. Actual and predicted prices by hedonic regression, k-nn regression and ANN

Table 3. Hedonic regression model estimates

Table 4. Nearest neighbors regression results

Table 5. Artificial neural networks results

The omitted samples and corresponding examination measures