I hope you are doing well.

Please find the answers below:

1. What should the data look like (what cleansing is required)?

In the dataset, we have two columns: Spend.Category and Spend.Numeric.
The average of Spend.Numeric within each level of Spend.Category falls between the minimum and maximum of that level (e.g. the average spend for the category "$100 - $200" is 146.69889). Spend.Category is therefore only a grouped version of Spend.Numeric and gives us redundant information, so we need to drop the variable "Spend.Category" from the dataset in the data cleaning step.
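
The check described above can be sketched as follows. The small data frame is made up purely for illustration; only the column names Spend.Category and Spend.Numeric come from the dataset, and the name Retail_Data_Reduced for the cleaned data is an assumption.

```r
# Toy data for illustration only; the real dataset has more rows and columns.
retail_data <- data.frame(
  Spend.Category = c("$0 - $100", "$0 - $100", "$100 - $200", "$100 - $200"),
  Spend.Numeric  = c(40, 80, 120, 170),
  stringsAsFactors = FALSE
)

# Minimum, mean and maximum of Spend.Numeric within each Spend.Category level.
spend_summary <- aggregate(
  Spend.Numeric ~ Spend.Category,
  data = retail_data,
  FUN = function(x) c(min = min(x), mean = mean(x), max = max(x))
)
print(spend_summary)

# Drop the redundant grouped column, keeping the result as a data frame.
Retail_Data_Reduced <- retail_data[, names(retail_data) != "Spend.Category", drop = FALSE]
```

If every mean lies inside the range implied by its category label, the grouped column adds nothing beyond the numeric one.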

2. How to proceed with sampling?

The purpose of sampling/slicing the data into two different sets is as follows:
  • We will use one part of the sliced dataset (trainset) to "train" the model.
  • The other part (testset) will be used to make our predictions. It is considered good practice to use two different datasets for training and testing the model, so that the model is evaluated on data it has not seen.
  • set.seed() is used to make results reproducible when a function involves randomness. If set.seed() is not called, a function that involves randomness (such as sample()) will produce different results on each run.
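
A minimal sketch of the set.seed() behaviour described above (the variable names are illustrative only):

```r
# With the same seed, sample() returns the same draw every time.
set.seed(777)
first_draw <- sample(1:100, size = 5)

set.seed(777)
second_draw <- sample(1:100, size = 5)

identical(first_draw, second_draw)  # TRUE

# Without resetting the seed, the next draw will generally differ.
third_draw <- sample(1:100, size = 5)
```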

Let's say the cleaned data is in the variable Retail_Data_Reduced:
set.seed(777)
### Index holds the row numbers of a random 70% of the data
Index = sample(1:nrow(Retail_Data_Reduced), size = floor(0.7 * nrow(Retail_Data_Reduced)))

### Retail_Train_Data will contain 70% of the data
Retail_Train_Data = Retail_Data_Reduced[Index,]
View(Retail_Train_Data)

### Retail_Test_Data will contain the remaining 30% of the data
Retail_Test_Data = Retail_Data_Reduced[-Index,]
View(Retail_Test_Data)
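
As a quick sanity check on a split like this, one could confirm that the two sets cover every row and do not overlap. The sketch below uses a made-up stand-in for the data so that it runs on its own; the values are hypothetical.

```r
# Toy stand-in for the cleaned data (hypothetical values).
Retail_Data_Reduced <- data.frame(Spend.Numeric = seq(10, 1000, by = 10))

set.seed(777)
n_rows <- nrow(Retail_Data_Reduced)
Index <- sample(seq_len(n_rows), size = floor(0.7 * n_rows))

Retail_Train_Data <- Retail_Data_Reduced[Index, , drop = FALSE]
Retail_Test_Data  <- Retail_Data_Reduced[-Index, , drop = FALSE]

# Together the two sets should account for every row...
nrow(Retail_Train_Data) + nrow(Retail_Test_Data) == n_rows

# ...and no row should appear in both.
length(intersect(rownames(Retail_Train_Data), rownames(Retail_Test_Data))) == 0
```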

3. How to proceed with logistic regression?
I suggest you refer to the link below, which will help you in working on this project.
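
In general terms, a logistic regression in R is fitted with glm() and family = binomial. The sketch below is only an illustration on simulated data: the target column "Purchased" is hypothetical (the target variable of your project is not stated here), and the 0.5 cut-off is an assumption that you may want to tune.

```r
set.seed(777)

# Simulated data for illustration; "Purchased" is a hypothetical 0/1 target.
n <- 200
toy_data <- data.frame(Spend.Numeric = runif(n, min = 0, max = 300))
toy_data$Purchased <- rbinom(
  n, size = 1,
  prob = plogis((toy_data$Spend.Numeric - 150) / 50)
)

# 70/30 split, as in the sampling step.
Index <- sample(seq_len(n), size = floor(0.7 * n))
train <- toy_data[Index, ]
test  <- toy_data[-Index, ]

# Fit the logistic regression on the training set only.
model <- glm(Purchased ~ Spend.Numeric, data = train, family = binomial)
summary(model)

# Predicted probabilities on the test set, converted to classes at 0.5.
pred_prob  <- predict(model, newdata = test, type = "response")
pred_class <- ifelse(pred_prob > 0.5, 1, 0)

# Confusion matrix and accuracy on the held-out data.
table(Predicted = pred_class, Actual = test$Purchased)
mean(pred_class == test$Purchased)
```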
 
Hope this helps you.

Please feel free to get back to us if you need any further help; we will be glad to assist you.
 
If you are satisfied with the way I resolved your ticket, please give your valuable feedback by clicking on the feedback button below and adding your own comments.
 
Your feedback helps us improve our service and your experience with us.

Please note that if you are not happy with the response on this ticket, you can escalate it to escalations@edureka.in.
We assure you that we will get back to you within 24 hours.




Regards
Priyanka at Edureka
edureka! Solution Team
Website - www.edureka.co
Edureka claims 1st position at Deloitte's Technology Fast 50 India 2014
