Dataset: /Dataset/Data Details?dataId=56
Introduction to the dataset (see the dataset link for details):
The dataset contains 4 tables. Since this is an advertisement click prediction task, only the first three tables are used. They are described as follows:
Main Table: Original Sample
The clk column is used as the label for the later logistic regression model; it is renamed flag below.
Data cleaning:
Keyed on user id + time_stamp there are many duplicate records; drop the rows that repeat the same (user, time_stamp) pair and use that pair as a unique identifier.
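A minimal pandas sketch of this deduplication step; the column names (`user`, `time_stamp`, `clk`) and the toy rows are assumptions for illustration:

```python
import pandas as pd

# Toy stand-in for the raw sample table (column names assumed).
raw_sample = pd.DataFrame({
    "user": [1, 1, 2, 2],
    "time_stamp": [100, 100, 200, 300],
    "clk": [0, 0, 1, 0],
})

# Drop rows that repeat the same (user, time_stamp) pair, keeping the first,
# so the pair can serve as a unique identifier.
raw_sample = raw_sample.drop_duplicates(subset=["user", "time_stamp"])

# Rename the label column clk to flag, as described above.
raw_sample = raw_sample.rename(columns={"clk": "flag"})
print(len(raw_sample))  # 3 unique (user, time_stamp) rows remain
```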
View the current data set size:
View null values:
View click percentage:
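The three inspection steps above can be sketched as follows; the tiny frame is a stand-in for the real table:

```python
import pandas as pd

# Toy stand-in for the cleaned table (column names assumed).
df = pd.DataFrame({"flag": [0, 0, 1, 0], "price": [10.0, None, 5.0, 8.0]})

print(df.shape)                                  # current dataset size
print(df.isnull().sum())                         # null count per column
print(df["flag"].value_counts(normalize=True))   # click percentage
```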
Ad Feature Table
Data cleaning:
View duplicate values:
Merge with the main table to generate data1:
View merged null values:
View click percentage:
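A hedged sketch of merging the ad feature table onto the main table to form data1, then checking nulls and the click percentage; the join key `adgroup_id` and the toy values are assumptions:

```python
import pandas as pd

main = pd.DataFrame({"adgroup_id": [1, 2, 3], "flag": [0, 1, 0]})
ad_feature = pd.DataFrame({"adgroup_id": [1, 2], "price": [9.9, 5.0]})

# Left-join the ad features onto the main table to form data1;
# ads missing from ad_feature produce nulls that surface in the next check.
data1 = main.merge(ad_feature, on="adgroup_id", how="left")

print(data1.isnull().sum())   # nulls introduced by the join
print(data1["flag"].mean())   # click percentage after the merge
```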
User Profile Table
View duplicate values:
Merge with data1 to generate data2:
Information on the final data table:
View the click rate:
Delete the ID identifier fields that are not needed for analysis:
Click ratio of the final data table:
Category fields: male-to-female ratio / student ratio
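Dropping the identifier fields and inspecting a category ratio might look like this; the identifier column names (`user`, `adgroup_id`) and the gender coding are assumptions:

```python
import pandas as pd

# Toy stand-in for data2 (column names and codings assumed).
data2 = pd.DataFrame({
    "user": [1, 2, 3, 4],
    "adgroup_id": [10, 11, 12, 13],
    "gender": [1, 2, 2, 2],
    "flag": [0, 1, 0, 0],
})

# Drop pure-identifier columns: they carry no predictive signal.
data2 = data2.drop(columns=["user", "adgroup_id"])

# Ratio for a category field, e.g. male to female.
print(data2["gender"].value_counts(normalize=True))
```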
Time field:
The dataset only has a timestamp field; from it we extract the day of week and the corresponding time period to examine time trends.
View click trends:
Re-bucket the day of week for later feature extraction;
Similarly, group the hours for later feature processing.
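The extraction and grouping steps above can be sketched as follows; the Unix timestamps, the bucket boundaries, and the bucket labels are all illustrative assumptions:

```python
import pandas as pd

# Toy timestamps (seconds since the epoch; values assumed).
df = pd.DataFrame({"time_stamp": [1494032400, 1494118800, 1494205200]})

ts = pd.to_datetime(df["time_stamp"], unit="s")
df["weekday"] = ts.dt.dayofweek   # 0 = Monday ... 6 = Sunday
df["hour"] = ts.dt.hour

# Group hours into coarse time periods for later feature processing.
bins = [-1, 5, 11, 17, 23]
labels = ["night", "morning", "afternoon", "evening"]
df["hour_group"] = pd.cut(df["hour"], bins=bins, labels=labels)
print(df)
```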
Continuous field processing:
Price:
View descriptive statistics of advertising prices:
Distribution of advertising prices:
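A quick sketch of inspecting the price column; the values are made up, but ad prices typically show the same heavy right tail:

```python
import pandas as pd

# Toy price column (values illustrative).
price = pd.Series([1.0, 5.0, 9.9, 100.0, 3888.0])

print(price.describe())  # count, mean, std, min, quartiles, max
print(price.skew())      # strong right skew is typical for ad prices
```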
The raw time field can now be deleted:
View the currently missing data; the missing values need to be filled:
Looking at the missing ratio, pvalue_level has many missing values; these are filled with a sentinel number, here 9999.0.
Null filling: numeric columns are filled with the mean; categorical columns with the most frequent category.
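The three filling rules can be sketched as follows; apart from `pvalue_level` and the 9999.0 sentinel, the column names and values are assumptions:

```python
import pandas as pd

# Toy frame with missing values (column names other than pvalue_level assumed).
df = pd.DataFrame({
    "pvalue_level": [1.0, None, 2.0, None],
    "price": [10.0, None, 30.0, 20.0],
    "category": ["a", "a", None, "b"],
})

# Heavily-missing column: mark missingness with the sentinel 9999.0.
df["pvalue_level"] = df["pvalue_level"].fillna(9999.0)
# Numeric column: fill with the mean.
df["price"] = df["price"].fillna(df["price"].mean())
# Categorical column: fill with the most frequent category.
df["category"] = df["category"].fillna(df["category"].mode()[0])

print(df.isnull().sum().sum())  # 0 nulls remain
```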
View the distribution and descriptive statistics of the remaining continuous data:
Delete the original columns that were binned:
Standardize gender to 0/1;
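One way to standardize the gender column to 0/1; the column name and the original 1/2 coding are assumptions:

```python
import pandas as pd

# Toy gender column with an assumed 1/2 coding.
df = pd.DataFrame({"gender_code": [1, 2, 2, 1]})

# Map the 1/2 coding to a 0/1 indicator.
df["gender"] = (df["gender_code"] == 2).astype(int)
print(df["gender"].tolist())  # [0, 1, 1, 0]
```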
Current data preview:
age_level and wechat_group have too many categories, which would generate too many features during later encoding; bin them:
Back up the dataset under a new name:
Use get_dummies to one-hot encode the previously extracted features (as before, only three are shown).
After the categorical column stu is encoded, keep only one dummy feature:
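A sketch of the encoding step for the binary stu column; the category values are assumptions:

```python
import pandas as pd

# Toy stu column (category values assumed).
df = pd.DataFrame({"stu": ["student", "non_student", "student"]})

# One-hot encode; for a binary column a single dummy is enough,
# so drop the first level to avoid a redundant feature.
dummies = pd.get_dummies(df["stu"], prefix="stu", drop_first=True)
df = pd.concat([df.drop(columns=["stu"]), dummies], axis=1)
print(df.columns.tolist())
```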
Correlation coefficient method: calculate the correlation coefficient of each feature.
Check the correlation coefficient between each feature and the click label (flag); ascending=False sorts in descending order:
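A minimal sketch of ranking features by their correlation with flag; the feature columns and values are assumptions:

```python
import pandas as pd

# Toy feature frame (column names and values assumed).
df = pd.DataFrame({
    "price": [1.0, 2.0, 3.0, 4.0],
    "gender": [0, 1, 0, 1],
    "flag": [0, 0, 1, 1],
})

# Pearson correlation of every feature with the label, sorted descending.
corr = df.corr()["flag"].drop("flag").sort_values(ascending=False)
print(corr)
```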
Showing the top and bottom few: the correlation coefficients are not high, and the users' overall ad click-through rate is very low;
Ad price, resource position, gender, commodity category, and Friday all influence users' clicks;
Based on each feature's correlation coefficient with the flag label, these features are selected as the model input:
Build the training and test datasets;
Fit a logistic regression model and compute its accuracy;
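The final two steps can be sketched with scikit-learn; the synthetic feature matrix stands in for the selected features, and the split ratio and random seeds are assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic features standing in for the selected feature columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # label from a linear rule

# Hold out a test set, then fit logistic regression.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)

acc = model.score(X_test, y_test)  # classification accuracy on the test set
print(round(acc, 3))
```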