The classification of “Program Sembako” recipients in Payobasung West Sumatra Based on the K-nearest neighbors classifier

The "Sembako Program" is a program carried out by the Indonesian government to improve the welfare of low-income communities. The purposes of this study are: (a) to determine the classification of households that deserve to receive basic-food assistance in Koto Panjang Payobasung, West Sumatra, using the KNN classifier and (b) to determine the optimal number of nearest neighbors used in the classification process. The measure of proximity between objects used is the Gower dissimilarity coefficient. This research used primary data consisting of 175 households collected purposively in a survey conducted on all households in Payobasung. The optimal K value is determined by implementing a 5-fold cross-validation procedure. The results showed that the best classification is obtained when K = 3 nearest neighbors are used, since this produces the highest accuracy coefficient and Matthews correlation coefficient (MCC). Therefore, for further work, in deciding the eligibility of a household to receive the Sembako Program in Payobasung, KNN can be used by considering its 3 nearest neighbors.


INTRODUCTION
One of the efforts made by the Indonesian government to improve the welfare of low-income people is the provision of a Non-cash Food Assistance Program, which has been transformed into assistance for the basic food program known as the Sembako Program [1]. Payobasung is one of the villages located in Payakumbuh City, West Sumatra, with a population of 2,704 people. According to the information held by the local government of Payobasung, a considerable number of households below the National Poverty Line are proposed to receive the benefit of this assistance program. However, only a small portion of households meet the government's criteria and are declared eligible to receive the Sembako Program.
The COVID-19 pandemic that has hit Indonesia since 2020 has had an impact on the economic sector, including in Payobasung. This increased the number of poor people who are predicted to be eligible to receive the Sembako Program. To meet the government's goal to fulfill the need of poor people for nutritious food, it is necessary to know the eligibility of a household to be categorized as a recipient of the Sembako Program in Payobasung. The first step is to determine the criteria that make a household eligible to be a recipient of the Sembako Program. Based on these criteria, households in Payobasung are classified into the group of households that are eligible to receive food assistance and the group of households that are not eligible. Statistically, it can be done through various classification methods.
Classification is the process of finding a model or rule that describes and distinguishes data classes or concepts. It is used to predict categorical class labels, assigning new or unlabelled data to the class with the most similar characteristics. There are several classification techniques: the ID3 algorithm, the C4.5 algorithm, support vector machine, Naïve Bayes, K-nearest neighbors, and many more [2], [3]. Generally, the accuracy of each algorithm depends on the specific dataset and problem being addressed [4], [5], [6]. According to [7], the K-nearest neighbors classifier (KNN classifier) is easier to use than other methods. ID3 and C4.5 are classification methods based on a decision tree algorithm. ID3 can only work with categorical data, so it is not suitable for handling numerical data [8], [9]. On the other hand, the KNN classifier works well with all types of data, unlike Naïve Bayes, support vector machine, and logistic regression, which require certain assumptions about the underlying data distribution [10]. KNN is a non-parametric method that does not assume a data distribution, which makes it more flexible and adaptable to different types of data.
*Corresponding Author: hazmirayozza@sci.unand.ac.id
The KNN classifier is a distance-based classification method: the proximity between observations determines which training data are the nearest neighbors. It classifies an object based on the training data that are closest to that object [11]. The algorithm uses the variable values of the training data set. To classify a new observation, its proximity to all training data is calculated, and its class is then determined by the majority class of its K nearest neighbors [12].
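The majority-vote rule described here can be illustrated with a minimal sketch. It assumes the distances from the new observation to each training observation have already been computed; the function name `knn_predict` and the distance values are hypothetical and are not taken from the study's data.

```python
from collections import Counter

def knn_predict(distances, labels, k):
    """Predict a class from precomputed distances to the training data."""
    # Order the training indices from nearest to farthest
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    # Collect the classes of the k nearest training observations
    nearest = [labels[i] for i in order[:k]]
    # The predicted class is the majority class among those neighbors
    return Counter(nearest).most_common(1)[0][0]

# Hypothetical distances from one new observation to five training observations
distances = [0.10, 0.40, 0.15, 0.05, 0.60]
labels = ["eligible", "eligible", "non-eligible", "eligible", "non-eligible"]
print(knn_predict(distances, labels, k=3))
```

With two classes, an odd K avoids tied votes, which is one practical reason to restrict K to odd values.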
The performance of the KNN classifier depends on the dissimilarity measure used to describe the distance between objects. For that reason, choosing the right similarity/dissimilarity measure is important, and the choice depends on the data type. When all variables are numeric, Euclidean distance is appropriate [13]. When all variables are categorical, the commonly used similarity measure is the simple matching coefficient [14]. For mixed-type data (a mixture of numeric and categorical data), the dissimilarity measure that can be used is the Gower dissimilarity coefficient [15], [16].
In the KNN classifier, the K parameter defines the number of nearest neighbors. It plays an important role in determining the class prediction of the new data and subsequently affects the accuracy of the whole prediction [12], [17]. A large dataset does not always require a large K value; conversely, a small dataset does not always require a small K. In cases with very unbalanced classes, it is recommended to use a small K value. Some studies recommend determining the optimal K value by repeatedly running the KNN algorithm with different K values and choosing the one that provides the highest accuracy [17]. Other studies use different numbers of nearest neighbors for different test samples [18].
Several coefficients can be used to measure the accuracy of a classification method. The commonly used measure is the accuracy coefficient, calculated from a confusion matrix [19]. The confusion matrix is a type of contingency table to evaluate or visualize the behavior of models in supervised classification contexts [20]. A confusion matrix of size n x n associated with a classifier shows the predicted and actual classification, where n is the number of different classes [21]. The rows of the confusion matrix represent the actual class of the instances and the columns represent their predicted class [20]. The accuracy coefficient measures the proportion of observations that are correctly classified by a classification method. The higher the accuracy coefficient, the better the classifier results. The other coefficient recommended if positive and negative cases are of equal importance is the Matthews correlation coefficient (MCC). The higher the MCC, the better the classifier results [22], [23].
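As a minimal sketch of the two measures for a two-class problem, the functions below compute the accuracy coefficient and the MCC from the four cells of a 2 x 2 confusion matrix, using the standard definitions of both coefficients. The names `tp`, `fp`, `fn`, `tn` (true/false positives and negatives) and the counts are hypothetical; they are not the study's results.

```python
import math

def accuracy(tp, fp, fn, tn):
    """Proportion of observations that are correctly classified."""
    return (tp + tn) / (tp + fp + fn + tn)

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient for a 2 x 2 confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention (an assumption here): return 0 when a row or
    # column of the confusion matrix is empty and the ratio is undefined
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical counts, for illustration only
tp, fp, fn, tn = 40, 5, 10, 45
print(accuracy(tp, fp, fn, tn))
print(mcc(tp, fp, fn, tn))
```

Unlike accuracy, the MCC uses all four cells of the matrix, so it only approaches 1 when both classes are predicted well.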
This study aims to: (a) determine the eligibility classification of households in Payobasung, West Sumatra as recipients of the Sembako Program using the KNN classifier, and (b) determine the optimal K value for the KNN classifier. A similar study was conducted by [7], but the present study uses a different distance measure that is more appropriate for mixed-type variables, i.e., the Gower coefficient.

Population and Sample
The population of this study is 838 households in Payobasung at Payakumbuh City, West Sumatra.
The sample consists of 175 households selected purposively from the population.

Data Collection
This study used primary data collected in a survey conducted on all households in the sample.
The survey was carried out by interviewing the head of each household.

Variables
The variables involved in this study are:
1. Household dependency (X1), that is, the number of household dependencies. This is a numeric variable.
2. Household income (X2); household income is ordinal and classified into four categories, i.e.:
• Less than 1.000.000 rupiahs
• 1.000.000 - 3.000.000 rupiahs
• 3.000.000 - 5.000.000 rupiahs
• More than 5.000.000 rupiahs
3. The occupation of the head of the household (X3); the household head occupation is nominal and classified into 5 categories, namely:
• Labor, ...

i. Calculate the Gower dissimilarity between the new observation and all training data [15], [16]. Denote $x_c$ as the $c$-th variable and $d_{ijc}$ as the dissimilarity measure between the $i$-th and $j$-th objects by the $c$-th variable ($c = 1, \ldots, m$).
• If $x_c$ is nominal, the dissimilarity between two objects is expressed as
$$d_{ijc} = \begin{cases} 0, & \text{if } x_{ic} = x_{jc} \\ 1, & \text{if } x_{ic} \neq x_{jc} \end{cases} \qquad (1)$$
• If $x_c$ is ratio/numeric, the dissimilarity between two objects is
$$d_{ijc} = \frac{|x_{ic} - x_{jc}|}{\max(x_c) - \min(x_c)} \qquad (2)$$
where $\max(x_c)$ and $\min(x_c)$ are the maximum and minimum values of $x_c$, respectively.
• If $x_c$ is an ordinal variable, then all the categories are transformed using the formula
$$z_{ic} = \frac{r_{ic} - 1}{R_c - 1} \qquad (3)$$
where $r_{ic}$ is the rank number of the ordinal category of the $i$-th object ($r_{ic} = 1, \ldots, R_c$) and $R_c$ is the maximal rank number of $x_c$. After this transformation, the dissimilarity is calculated using the formula for numeric variables (equation 2).
Finally, the Gower coefficient between two objects is calculated by
$$d_{ij} = \frac{\sum_{c=1}^{m} w_{ijc}\, d_{ijc}}{\sum_{c=1}^{m} w_{ijc}} \qquad (4)$$
where $w_{ijc}$ takes the value zero if the value of either the $i$-th or the $j$-th object on the $c$-th variable is missing; otherwise, it takes the value one [16].
ii. Determine its K nearest neighbors.
iii. Identify the class of each nearest neighbor.
iv. Assign the new observation to the majority class of its nearest neighbors.
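The Gower calculation for mixed-type data can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: ordinal ranks are mapped to [0, 1] with (r - 1)/(R - 1) and then treated as numeric with range 1, missing values are marked with `None`, and the function names, category labels, and example records are all hypothetical.

```python
def ordinal_to_unit(rank, max_rank):
    """Map an ordinal rank 1..max_rank onto the interval [0, 1]."""
    return (rank - 1) / (max_rank - 1)

def gower_distance(a, b, kinds, ranges):
    """Gower dissimilarity between two records a and b.

    kinds[c]  : "numeric" or "nominal" (transformed ordinals count as numeric)
    ranges[c] : max - min of variable c (ignored for nominal variables)
    """
    total, weight = 0.0, 0.0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if x is None or y is None:
            continue  # weight w_ijc = 0 when either value is missing
        if kind == "nominal":
            d = 0.0 if x == y else 1.0
        else:
            d = abs(x - y) / rng
        total += d
        weight += 1.0  # weight w_ijc = 1 for a complete pair
    if weight == 0:
        raise ValueError("no variable is observed in both records")
    return total / weight

# Hypothetical records: (dependency, income rank of 4, occupation, ownership)
kinds = ["numeric", "numeric", "nominal", "nominal"]
ranges = [7 - 1, 1.0, None, None]
a = [3, ordinal_to_unit(1, 4), "labor", "own"]
b = [5, ordinal_to_unit(2, 4), "labor", "rent"]
print(gower_distance(a, b, kinds, ranges))
```

Because each per-variable dissimilarity lies in [0, 1], the resulting coefficient also lies in [0, 1], which keeps numeric and categorical variables on a comparable scale.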

Stage 2: Determining the Accuracy of the Classification Result
The accuracy of the classification result was measured as follows:
a. Construct the confusion matrix. The confusion matrix for this research is shown in Table 1.

Stage 3: Selecting the Optimal K value
The number of nearest neighbors (K value) plays an important role in the KNN classifier. To select the optimal value of K, a 5-fold cross-validation method is implemented through the following steps.
a. Randomly split the data into 5 sub-datasets.
b. Iteratively repeat stage 1, using one sub-dataset as the validation dataset and the remaining data as the training dataset.
c. Construct a confusion matrix for all data and calculate the accuracy coefficient and the MCC.
d. Repeat steps a-c for K = 1, 3, 5, …, 29.
e. The optimal K value is the K value that provides the highest accuracy coefficient and the highest Matthews correlation coefficient (MCC).
The experiment stages are shown in Figure 1.
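The K-selection procedure can be sketched in a few lines. This is an illustrative outline only: it assumes a precomputed dissimilarity matrix `dist` (for example, of Gower coefficients), scores each K by accuracy alone (the MCC could be added in the same loop), and uses hypothetical function names and a toy one-dimensional dataset rather than the survey data.

```python
import random
from collections import Counter

def cross_validated_accuracy(dist, labels, k, n_folds=5, seed=0):
    """Accuracy of KNN over all observations under n-fold cross-validation."""
    idx = list(range(len(labels)))
    random.Random(seed).shuffle(idx)                 # random split
    folds = [idx[f::n_folds] for f in range(n_folds)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        for i in fold:
            # k nearest training observations to validation observation i
            nearest = sorted(train, key=lambda j: dist[i][j])[:k]
            votes = Counter(labels[j] for j in nearest)
            correct += votes.most_common(1)[0][0] == labels[i]
    return correct / len(labels)

def select_k(dist, labels, candidates):
    """Return the candidate K with the highest cross-validated accuracy."""
    scores = {k: cross_validated_accuracy(dist, labels, k) for k in candidates}
    # On ties, max() keeps the first (here: smallest) candidate
    return max(scores, key=scores.get), scores

# Toy data: two well-separated groups on a line (hypothetical)
points = [i * 0.1 for i in range(10)] + [10 + i * 0.1 for i in range(10)]
labels = ["eligible"] * 10 + ["non-eligible"] * 10
dist = [[abs(p - q) for q in points] for p in points]
best_k, scores = select_k(dist, labels, candidates=[1, 3, 5])
print(best_k, scores)
```

For the candidate set described in the steps above, `candidates=range(1, 30, 2)` would cover K = 1, 3, 5, …, 29.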

RESULTS AND DISCUSSION
In this research, the KNN classifier is used to classify households in Payobasung, West Sumatra into two classes according to their eligibility as recipients of the Sembako Program provided by the Indonesian government. The sample consists of 175 households. The data are partitioned into training and testing data consisting of 140 observations (80%) and 35 observations (20%), respectively. The variables used are household dependency, household income, occupation of household head, and home ownership status. First, we illustrate the process of classifying households in Payobasung using the KNN algorithm with K = 3 nearest neighbors. The same algorithm is then applied with different K values, and based on the resulting accuracies we determine the optimal K value, which is the K value that provides the highest accuracy.

The KNN classifier for K=3
The first stage to classify new data is to calculate the Gower distances between each new data and all training data. The following illustrates the calculation of the Gower coefficient between new data, denoted by Test-1, and five data in the training datasets, denoted by Train1-Train5. Table 2 shows data for each observation.
The Gower distance between two objects is determined by calculating the dissimilarity between the two objects, separately based on each variable.

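• Household dependency
Household dependency is a ratio variable with maximum and minimum values of 7 and 1, respectively. The dissimilarity between Test1 and Train1 based on this variable is:
$$d_{(\mathrm{Test1},\mathrm{Train1})1} = \frac{|x_{\mathrm{Test1},1} - x_{\mathrm{Train1},1}|}{7 - 1}$$
• Household income
Household income is an ordinal variable, so its categories are first transformed; the transformed values are given in Table 3. Based on this rule, the transformed values of household income for Test1 and Train1 are 0 and 0, respectively. The dissimilarity between Test1 and Train1 is:
$$d_{(\mathrm{Test1},\mathrm{Train1})2} = \frac{|0 - 0|}{1 - 0} = 0$$
• Household head occupation
This is a nominal variable. Since Test1 and Train1 have the same household head occupation, the dissimilarity between these objects is $d_{(\mathrm{Test1},\mathrm{Train1})3} = 0$.
• Home ownership status
The home ownership status of Test1 and Train1 is the same, so the dissimilarity between these objects is $d_{(\mathrm{Test1},\mathrm{Train1})4} = 0$.
Finally, the Gower distance between Test1 and Train1 is:
$$d_{(\mathrm{Test1},\mathrm{Train1})} = \frac{\sum_{c=1}^{4} w_{(\mathrm{Test1},\mathrm{Train1})c}\, d_{(\mathrm{Test1},\mathrm{Train1})c}}{\sum_{c=1}^{4} w_{(\mathrm{Test1},\mathrm{Train1})c}} = \frac{d_{(\mathrm{Test1},\mathrm{Train1})1} + 0 + 0 + 0}{4}$$
The value $w_{ijc} = 1$ for all variables because the values of $x_{ic}$ and $x_{jc}$ are non-missing. The Gower coefficients between Test1 and Train1-Train5 are shown in Table 4.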
The Gower coefficients between Test1 and the rest of the training data (Train6-Train140) were then calculated. It was found that the three closest neighbors of Test1 are Train1, Train3, and Train4. From Table 2, it is already known that Train1 and Train4 belong to the eligible class and Train3 belongs to the non-eligible class. Therefore, Test1 is assigned to the eligible class, since the eligible class is the majority class of its nearest neighbors. The same procedure is then followed to assign all data in the testing data to the eligible or non-eligible class.
The next stage is to determine the accuracy of the classification. In the next step, the accuracy value will be used to determine the optimal K value. For this purpose, a 5-fold cross-validation procedure was employed and a confusion matrix was constructed according to the actual and predicted class of all data. The confusion matrix is presented in Table 5.
The classification of "Program Sembako" recipients in Payobasung West Sumatra … (Hazmira Yozza, Nindi Maula Azizah, Lyra Yulianti, Izzati Rahmi HG)

The Optimal K
The optimal K value is obtained by performing the above procedure for K = 1, 3, 5, 7, …, 29 and calculating the accuracy for each K. The optimal value of K is the value of K that gives the highest accuracy. The accuracy coefficients and the Matthews correlation coefficients (MCC) for KNN with all K values are shown in Table 6 and graphically presented in Figure 2.
The classification performed with K=1 nearest neighbor provides an accuracy coefficient of 96.57%, indicating that 96.57% of observations are correctly classified by the KNN classifier with K=1. Using K=3 produces superior results, with a higher accuracy coefficient of 97.14%. When K is between 5 and 19, the accuracy coefficients fluctuate, but they never surpass the accuracy coefficient achieved when K=3. The accuracy coefficients then decline as the K values increase, until at K=29 only 90% of the data are classified correctly. Based on this pattern, it can be expected that the accuracy value will be lower for higher K values. Thus, it can be concluded that the highest accuracy coefficient is obtained if K = 3 is used. The pattern described above is seen more clearly on the MCC chart. Therefore, in classifying households in Payobasung, we can choose K=3 as the optimal K value. The selection of a small K value is also intended to reduce the noise in the classification process.
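As a quick arithmetic check of the reported percentages, the snippet below converts counts of correctly classified households into accuracy coefficients for the sample of 175. The counts 169 and 170 are not reported in the text; they are inferred here as the only whole numbers consistent with 96.57% and 97.14%, so they should be read as an assumption.

```python
n_households = 175  # sample size reported in the study

# Inferred (not reported) numbers of correctly classified households
for k, n_correct in [(1, 169), (3, 170)]:
    accuracy_pct = round(100 * n_correct / n_households, 2)
    print(f"K={k}: {n_correct}/{n_households} correct = {accuracy_pct}%")
```

Under this reading, the gap between K=1 and K=3 corresponds to a single additional household classified correctly.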
Many countries have implemented food assistance programs for their citizens who are food insecure. According to various studies, these programs are closely related to insecurity, poverty, and health. Along with other assistance programs, these programs significantly promote food security and good health. These programs effectively reduce poverty by providing benefits to households for grocery purchases and allowing them to spend more on their other basic needs, such as education, health care, and electricity [24]. According to another research, the fundamental purpose of food assistance programs is to provide lower-income families with access to a healthy meal, which improves the nutritional well-being of the individuals they serve. Furthermore, studies have linked the Supplemental Nutrition Assistance Program in the USA to better health outcomes and lower health-care expenses [25]. Food assistance programs are implemented in various forms in Indonesia, including the Raskin, cash transfer, and Sembako programs.
The main challenge in implementing a food assistance program is accurately identifying the   [26] used 5 criteria, i.e., employment, income, dependencies, and housing condition, to select the recipient of the non-cash food assistance program. Similar criteria are used in [7]. As in our research, the latter study attempts to discover a procedure to predict a household's eligibility to participate in the program by adopting the KNN classifier. Unlike our current research, the optimal number of nearest neighbors (K) in that study was chosen from the values 15, 30, 45, 60, and 75, and the best accuracy values were found at K = 15 and K = 30. For K = 45, 60, and 75, the accuracy decreased significantly.
Based on this information, the pattern of accuracy in that study is expected to resemble the pattern obtained in our research. Unfortunately, the researchers did not run the KNN classifier for other K values in the range 1-30; therefore, the pattern of accuracy coefficients for K values in that interval was not revealed. Considering that K = 15 and K = 30 provide the same accuracy coefficient, the accuracy coefficients within the interval may fluctuate slightly around that value, which would allow the researchers to select a much smaller optimal K.
Finally, this research finding can be used for further work in deciding the eligibility of a household to receive the Sembako Program in Payobasung. The eligible recipient of the Sembako Program can be identified using KNN classifier with K = 3 based on the criteria: the number of household dependencies, household income, the occupation of household head and home-ownership status.

CONCLUSION
In this research, the KNN classifier is used to predict the eligibility of households in Payobasung to receive basic food assistance through the Sembako Program provided by the Indonesian government. Variables used as the basis of classifying are household dependency, household income, occupation of household head, and home ownership status. Since K=3 provided the highest accuracy, whether a household in Payobasung is eligible or not eligible to receive Sembako Program can be predicted from the majority class of its 3 nearest neighbors. The nearest neighbors are determined using the Gower distance which is calculated based on household dependency, household income, occupation of household head, and home-ownership status.