Click Prediction of Taobao Advertising Users (python)
To help improve the click-through rate of Taobao advertisements, this article analyzes Taobao users' ad click records together with the corresponding user information and advertisement information tables. The merged data is cleaned with Python, features are extracted and engineered, the factors that influence whether users click an ad are examined, and a simple logistic regression model is built to predict whether a user clicks an ad.

/Dataset/Data Details? dataId=56

Introduction to the dataset (see the dataset link for details):

There are 4 tables in the dataset. Because the task is ad click prediction, only the first three tables are used. They are described as follows:

Main table: raw_sample (original sample)

The clk field serves as the label for the logistic regression model built later, and it is renamed flag below.

Data cleaning:

Many records share the same user ID and time_stamp. Drop the rows duplicated on user ID + time_stamp and add a unique identifier:
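A minimal sketch of this deduplication step, assuming pandas and the raw_sample column names `user`, `time_stamp`, `adgroup_id`, and `clk`. The rows are made up for illustration, and the identifier column name `id` is hypothetical.

```python
import pandas as pd

# Tiny made-up sample with raw_sample-style columns (values are illustrative only).
raw = pd.DataFrame({
    "user": [1, 1, 2, 2, 3],
    "time_stamp": [1494032110, 1494032110, 1494032200, 1494032300, 1494032400],
    "adgroup_id": [10, 11, 12, 12, 13],
    "clk": [0, 1, 0, 0, 1],
})

# Keep the first row for every (user, time_stamp) pair.
dedup = raw.drop_duplicates(subset=["user", "time_stamp"], keep="first")

# Rebuild the index and add a unique row identifier (hypothetical column name "id").
dedup = dedup.reset_index(drop=True)
dedup.insert(0, "id", range(len(dedup)))

# Rename the label column to flag, as the article does later.
dedup = dedup.rename(columns={"clk": "flag"})
print(dedup)
```

Keeping the first of each duplicated pair is only one possible policy; whether the first or last record should win is not stated in the article.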

View the current data set size:

View null values:

View click percentage:
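The three checks above (size, null values, click percentage) could look roughly like this in pandas. The frame below is made up; only the column names follow the dataset.

```python
import pandas as pd

# Made-up rows; pid is the resource position, flag is the renamed clk label.
sample = pd.DataFrame({
    "user": [1, 2, 3, 4, 5],
    "pid": ["430548_1007", "430539_1007", None, "430548_1007", "430539_1007"],
    "flag": [0, 0, 1, 0, 0],
})

print(sample.shape)                      # (rows, columns)

null_counts = sample.isnull().sum()      # missing values per column
print(null_counts)

# Share of non-clicks (0) and clicks (1).
click_share = sample["flag"].value_counts(normalize=True)
print(click_share)
```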

ad_feature (advertisement information)

Data cleaning:

View duplicate values:

Merge with the main table to generate data1:

View merged null values:
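A sketch of the duplicate check, the merge, and the null check, assuming the two tables share the key `adgroup_id` and that a left join is used so no click record is lost. The join type is an assumption, and the data is made up.

```python
import pandas as pd

main = pd.DataFrame({
    "id": [0, 1, 2],
    "user": [1, 2, 3],
    "adgroup_id": [10, 11, 12],
    "flag": [0, 1, 0],
})
ad_feature = pd.DataFrame({
    "adgroup_id": [10, 11],
    "cate_id": [100, 101],
    "brand": [5.0, None],
    "price": [9.9, 120.0],
})

# An ad should appear only once in the ad table; count repeated keys.
dup_count = ad_feature.duplicated(subset=["adgroup_id"]).sum()

# Left join keeps every click record even when the ad has no feature row.
data1 = main.merge(ad_feature, on="adgroup_id", how="left")

# Columns coming from ad_feature are null wherever no ad matched.
merged_nulls = data1.isnull().sum()
print(dup_count)
print(merged_nulls)
```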

View click percentage:

User profile

View duplicate values:

Merge with data1 to generate data2:
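In the public dataset the user key is named `user` in the sample table and `userid` in the user table, so this merge needs `left_on`/`right_on`. A sketch under that assumption, with made-up rows:

```python
import pandas as pd

data1 = pd.DataFrame({
    "user": [1, 2, 3],
    "adgroup_id": [10, 11, 12],
    "flag": [0, 1, 0],
})
user_profile = pd.DataFrame({
    "userid": [1, 2],
    "final_gender_code": [1, 2],
    "age_level": [3, 5],
    "pvalue_level": [None, 2.0],
})

# The key columns have different names, so name each side explicitly.
data2 = data1.merge(user_profile, left_on="user", right_on="userid", how="left")

# The right-hand key duplicates "user" after the join, so drop it.
data2 = data2.drop(columns=["userid"])
print(data2)
```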

Final data sheet information:

View click rate:

Delete the ID fields that are not needed for the analysis:

Click ratio in the final data table:

Categorical fields: male/female ratio and student ratio

Time field:

The dataset has only a timestamp field, from which we extract the day of the week and the hour of day to look at trends over time:

View click trends:
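A sketch of the time-feature extraction, assuming `time_stamp` holds Unix seconds and that Beijing time (UTC+8) is the relevant local time. Both are assumptions, and the rows are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "time_stamp": [1494000000, 1494086400, 1494090000, 1494172800],
    "flag": [0, 1, 0, 1],
})

# Convert Unix seconds to a datetime and shift to UTC+8 with a fixed offset.
local_time = pd.to_datetime(df["time_stamp"], unit="s") + pd.Timedelta(hours=8)

df["weekday"] = local_time.dt.dayofweek   # 0 = Monday ... 6 = Sunday
df["hour"] = local_time.dt.hour           # 0 ... 23

# Click rate per day of week; the same groupby works for "hour".
click_rate_by_weekday = df.groupby("weekday")["flag"].mean()
print(click_rate_by_weekday)
```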

Relabel the day of the week for later feature extraction:

Similarly, group the hours into time periods for later feature processing:
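The article does not give its actual labels or cut points, so the day names and the four six-hour periods below are hypothetical. The sketch only shows the mechanics with `Series.map` and `pd.cut`.

```python
import pandas as pd

weekday = pd.Series([0, 4, 5, 6], name="weekday")
hours = pd.Series([0, 5, 6, 11, 12, 17, 18, 23], name="hour")

# Turn weekday numbers into names so the encoded columns are readable later.
day_names = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
weekday_name = weekday.map(dict(enumerate(day_names)))

# Hypothetical periods: [0,6) night, [6,12) morning, [12,18) afternoon, [18,24) evening.
bins = [0, 6, 12, 18, 24]
labels = ["night", "morning", "afternoon", "evening"]
hour_group = pd.cut(hours, bins=bins, labels=labels, right=False)

print(weekday_name.tolist())
print(hour_group.tolist())
```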

Handling continuous fields:

Price:

View descriptive statistics of advertising prices:

Distribution of advertising prices:
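A sketch of how the price statistics might be inspected and the price then split into ranges. Quartile binning with `pd.qcut` is one option and an assumption here; the article does not state how it partitioned price, and the values and labels are made up.

```python
import pandas as pd

price = pd.Series([5.0, 9.9, 19.0, 35.0, 68.0, 99.0, 199.0, 8888.0], name="price")

# count, mean, std, min, quartiles, max; a max far above the 75% quartile signals a long tail.
stats = price.describe()
print(stats)

# Quantile bins put roughly the same number of ads in each range,
# which is less sensitive to extreme prices than equal-width bins.
price_bin = pd.qcut(price, q=4, labels=["low", "mid_low", "mid_high", "high"])
print(price_bin.value_counts())
```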

The time field can now be deleted:

Check which fields currently have missing values, since these need to be filled:

Looking at the missing ratios, pvalue_level has many missing values. These are filled with a special sentinel value; 9999.0 is used here.

Filling the remaining nulls: numeric fields are filled with the mean, and categorical fields with the most common category.
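The filling rules above can be sketched as follows. The frame is made up, and the dtype test used to tell numeric from categorical columns is an assumption about how the fields are stored.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "pvalue_level": [np.nan, 2.0, np.nan, 1.0],
    "price": [10.0, np.nan, 30.0, 20.0],
    "pid": ["a", "a", None, "b"],
})

# Share of missing values per column.
missing_ratio = df.isnull().mean()
print(missing_ratio)

# Heavily missing field: mark "unknown" with the sentinel used in the article.
df["pvalue_level"] = df["pvalue_level"].fillna(9999.0)

# Remaining columns: mean for numeric data, most frequent value otherwise.
for col in df.columns:
    if df[col].isnull().any():
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].mean())
        else:
            df[col] = df[col].fillna(df[col].mode().iloc[0])

print(df)
```

The sentinel keeps "missing" as its own category instead of blending those users into an existing level, which only makes sense if pvalue_level is later treated as categorical.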

View the distribution and descriptive statistics of the remaining continuous data:

Delete the original columns that have been binned:

Recode gender as 0/1:
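A one-line sketch, assuming the raw field is `final_gender_code` with 1 = male and 2 = female. Which gender becomes 1 is arbitrary and is not stated in the article.

```python
import pandas as pd

gender = pd.Series([1, 2, 2, 1], name="final_gender_code")

# Assumed coding: 1 (male) -> 1, 2 (female) -> 0.
gender01 = (gender == 1).astype(int)
print(gender01.tolist())
```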

Current data preview:

The age_level and WeChat_group fields have too many categories, which would generate too many features when they are encoded later, so regroup them into fewer bins:
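One way to collapse a many-valued categorical field is an explicit lookup table. The three groups below are hypothetical, as the article does not list its own grouping, and the level values are illustrative.

```python
import pandas as pd

age_level = pd.Series([0, 1, 2, 3, 4, 5, 6, 3], name="age_level")

# Hypothetical grouping of seven levels into three coarser ones.
age_map = {0: "low", 1: "low", 2: "low", 3: "mid", 4: "mid", 5: "high", 6: "high"}
age_group = age_level.map(age_map)

print(age_group.value_counts())
```

Any level missing from the lookup table becomes NaN after `map`, so it is worth checking `age_group.isnull().sum()` on real data.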

Back up the dataset under a new name:

Use get_dummies to one-hot encode the previously extracted features (again, only three are shown).

After the categorical column stu is encoded, keep only one of its resulting features:
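A sketch of the encoding step with `pd.get_dummies`. The column `hour_group` and all values are made up; `stu` is taken to be a 0/1 student indicator, which is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "hour_group": ["night", "morning", "evening", "morning"],
    "stu": [0, 1, 0, 0],
    "flag": [0, 1, 0, 0],
})

# One new 0/1 column per category value, named <column>_<value>.
encoded = pd.get_dummies(df, columns=["hour_group", "stu"], dtype=int)

# stu_0 and stu_1 are exact complements, so one of them is redundant.
encoded = encoded.drop(columns=["stu_0"])
print(encoded.columns.tolist())
```

Dropping one of two complementary dummy columns avoids feeding the model two perfectly collinear inputs.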

Correlation coefficient method: calculate the correlation coefficient of each feature.

Check the correlation coefficient between each feature and the click label (flag); ascending=False sorts in descending order:
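A sketch of this step on a made-up, already-encoded frame. `DataFrame.corr()` computes Pearson correlation by default; the feature names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "flag":   [0, 0, 1, 0, 1, 0],
    "price":  [10.0, 20.0, 5.0, 30.0, 8.0, 25.0],
    "gender": [1, 0, 0, 1, 0, 1],
    "friday": [0, 0, 1, 0, 0, 1],
})

# Full correlation matrix, then the column for flag, largest first.
corr_with_flag = df.corr()["flag"].sort_values(ascending=False)
print(corr_with_flag)
```

The first entry is always flag itself with a coefficient of 1.0, so the features of interest start from the second row.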

Only the first and last few entries are shown. None of the correlation coefficients is high, and the click-through rate itself is very low.

Advertising price, resource position, gender, commodity category, and whether the day is a Friday appear to affect users' clicks.

Based on the correlation coefficient between each feature and flag, these features are selected as the model inputs:

Create the training and test datasets:

Build the logistic regression model and compute its accuracy:
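A minimal sketch of the split and the model with scikit-learn. The features and labels are synthetic and unrelated to real Taobao behaviour; the 80/20 split, the stratification, and the feature names are assumptions, not the article's settings.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-ins for the selected features.
X = pd.DataFrame({
    "price_bin": rng.integers(0, 4, size=n),
    "gender": rng.integers(0, 2, size=n),
    "friday": rng.integers(0, 2, size=n),
})
# Synthetic label with 5% positives, mimicking a low click rate.
y = pd.Series((np.arange(n) % 20 == 0).astype(int), name="flag")

# Stratify so that train and test keep the same share of clicks.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)   # share of correct predictions
baseline = 1 - y_test.mean()             # accuracy of always predicting "no click"
print(f"accuracy={accuracy:.3f}, baseline={baseline:.3f}")
```

Because clicks are rare, a model that never predicts a click already scores close to the baseline, so accuracy alone says little here. Comparing against that baseline, or using a ranking metric such as AUC, gives a fairer picture.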