Identify and understand your target variable.
Understand the type of the target variable: binary, categorical, or numeric.
Examine the distribution of the target variable:
For a binary variable (convert it to 0s and 1s first if it is stored as strings), it is simply the mean, i.e. the proportion of 1s.
For a categorical variable, it is the value counts.
For a numeric variable, it is a histogram or the pandas 'describe' table.
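The three cases above can be sketched in pandas as follows. This is a minimal illustration only: the DataFrame and its column names ('churned', 'plan', 'monthly_spend') are hypothetical and not part of the original instructions.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative assumptions.
df = pd.DataFrame({
    "churned": ["yes", "no", "no", "yes", "no"],        # binary target stored as strings
    "plan": ["basic", "pro", "basic", "basic", "pro"],  # categorical variable
    "monthly_spend": [10.0, 25.0, 12.0, 8.0, 30.0],     # numeric variable
})

# Binary: convert the strings to 0/1, then the mean is the proportion of 1s.
df["churned"] = df["churned"].map({"no": 0, "yes": 1})
positive_share = df["churned"].mean()

# Categorical: value counts (pass normalize=True to get shares instead).
plan_counts = df["plan"].value_counts(dropna=False)

# Numeric: the describe table (count, mean, std, min, quartiles, max).
spend_summary = df["monthly_spend"].describe()

print(positive_share)
print(plan_counts)
print(spend_summary)
```

For the numeric case, a histogram can be drawn with df["monthly_spend"].hist(), which requires matplotlib to be installed.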
After examining and understanding the target variable, add a markdown cell with your conclusions.
Choose the metric(s) for the target variable you will use in pivot tables.
For a binary variable, this metric is the mean.
For a categorical one, it could be the share of each category (you will have one column per category with the corresponding percentage).
For a numeric variable, it could be the mean and/or median and other quantiles, along with some spread metric like standard deviation or mean absolute deviation.
Calculate the percentage of missing values in each column and sort them in descending order.
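A minimal sketch of this missing-value report, assuming a small hypothetical DataFrame (the column names are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some gaps.
df = pd.DataFrame({
    "age": [25, np.nan, 40, np.nan],
    "city": ["Riga", "Oslo", None, "Oslo"],
    "income": [1000, 1200, 900, 1500],
})

# isna() marks missing cells; the column-wise mean of those booleans is the
# share of missing values, which is then scaled to a percentage and sorted.
missing_pct = (df.isna().mean() * 100).sort_values(ascending=False)

print(missing_pct)
```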
Missing values and outliers are not a problem to be fixed! They are facts.
During EDA you must not “fix” them, because you have to deal with your data and your problem as they are.
If you see missing values, just report them.
Don’t do outlier analysis as a stand-alone step; deal with outliers along the way. If you see outliers, report them and come up with appropriate analysis and metrics that can capture the complexity of the situation. In severe situations, it's okay to put outliers aside for convenience if the goal of your analysis allows it. Never delete outliers blindly.
Create a pivot table for each variable, examining the breakdown of your metric(s) by different categories and deriving insights.
Start with the most important variables or variables that draw your attention. If you have too many variables (e.g. 100) and have no idea what’s going on, you can calculate the phik coefficient between the target and all other variables. The phik coefficient highlights the strongest relationships automatically, and you can start by examining those first.
If you wish to use a numeric variable for making a split, first convert it into a categorical variable (create bins) using the 'cut' or 'qcut' function with 5 bins (bins=5 for 'cut', q=5 for 'qcut'). Use 'qcut' if the sizes of the groups resulting from 'cut' are very uneven.
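The difference between the two binning functions can be sketched on a hypothetical skewed column (the 'income' name and values are assumptions for illustration). Note that 'cut' takes the number of bins as bins, while 'qcut' takes it as q.

```python
import pandas as pd

# Hypothetical skewed data: one extreme value dominates the range.
df = pd.DataFrame({"income": [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]})

# Equal-width bins: the extreme value pushes almost every row into one bin.
df["income_cut"] = pd.cut(df["income"], bins=5)

# Equal-frequency bins: each bin holds roughly the same number of rows.
df["income_qcut"] = pd.qcut(df["income"], q=5)

cut_sizes = df["income_cut"].value_counts(sort=False)
qcut_sizes = df["income_qcut"].value_counts(sort=False)

print(cut_sizes)
print(qcut_sizes)
```

Comparing the two group-size tables is a quick way to decide which binning to keep for the pivot table.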
Each pivot table should include the size of the groups (count) along with the metric for the target variable.
The count should be the first column in each pivot table.
Use the groupby function to create pivot tables, with dropna=False so that rows where the grouping variable is missing form their own group and get their own metrics.
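A minimal sketch of such a pivot table for a binary target, with the group size as the first column. The DataFrame and the names 'plan', 'churned' and 'churn_rate' are hypothetical; dropna=False in groupby requires pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical toy data: the grouping variable has missing values.
df = pd.DataFrame({
    "plan": ["basic", "pro", "basic", None, "pro", None],
    "churned": [1, 0, 1, 0, 0, 1],
})

# dropna=False keeps the rows with a missing 'plan' as a separate group.
# 'size' counts all rows in each group, and it is listed first so that the
# count becomes the first column of the table.
pivot = (
    df.groupby("plan", dropna=False)["churned"]
      .agg(count="size", churn_rate="mean")
)

print(pivot)
```

Here 'size' is used instead of 'count' because 'count' would skip rows where the target itself is missing; which of the two fits better depends on the data.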
Each pivot table should be created in a separate cell.
If a pivot table has 5 or more rows, use a bar chart to visualize it.
Do not visualize small tables!
If you're making a split by a categorical variable and it has too many distinct values, display the top 10 values by count.
Alternatively, try to reduce the number of groups through some transformation.
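Both options for a high-cardinality variable can be sketched as follows. The data and the names 'city', 'churned' and 'city_grouped' are hypothetical, and lumping rare values into an "other" group is only one possible transformation, chosen here as an assumption.

```python
import pandas as pd

# Hypothetical toy data: 15 distinct cities, with city_0 being the most frequent.
df = pd.DataFrame({
    "city": [f"city_{i % 15}" for i in range(60)] + ["city_0"] * 5,
    "churned": [i % 2 for i in range(65)],
})

# Option 1: build the full pivot table, then show only the top 10 groups by count.
top10 = (
    df.groupby("city", dropna=False)["churned"]
      .agg(count="size", churn_rate="mean")
      .sort_values("count", ascending=False)
      .head(10)
)

# Option 2: reduce the number of groups by lumping everything outside the
# 10 most frequent values into a single "other" category.
frequent = df["city"].value_counts().head(10).index
df["city_grouped"] = df["city"].where(df["city"].isin(frequent), "other")

print(top10)
print(df["city_grouped"].value_counts())
```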
Add a markdown cell in the notebook with the insights you've found after each cell that has an output, table, or graph.
Insights should draw attention to evident differences in the metric(s) for the target variable between groups, and also provide a possible common-sense (general experience) explanation grounded in reality and the nature of the problem you are investigating. Provide a "why" along with reporting your findings.
Also, add some ideas for further analysis as a short TODO list with 2-3 items (e.g., add another specific variable in groupby to see if it is the real cause). Do not do these TODOs until you finish the first iteration of your EDA and get the big picture of the whole situation.
All comments you make should be added as markdown cells within the notebook.
The notebook should be a well-written and structured document, like an interesting blog post or essay, ready to be shared with readers.
You should also use markdown cells to structure your notebook.
Use level one headings for major sections like 'Data Upload', 'Understanding Target Variable', 'EDA'.
Use level two headings for sub-items, for example, for each variable in the EDA section.
This structure will help to create a table of contents automatically.
Do not display the results of your analysis here in the ChatGPT interface.
Save your results only in the notebook.
Here, in the ChatGPT interface, only report which step you are working on now and report again when that step is finished.