5/16/2023

Klib mieta newsletter

klib is a Python library for importing, cleaning, analyzing and preprocessing data. Explanations on key functionalities can be found on Medium / TowardsDataScience and in the examples section. Additionally, there are great introductions and overviews of the functionality on PythonBytes or on YouTube (Data Professor).

Over the past couple of months I've implemented a range of functions which I frequently use for virtually any data analysis and preprocessing task, irrespective of the dataset or ultimate goal. These functions require nothing but a Pandas DataFrame of any size and any datatypes and can be accessed through simple one-line calls to gain insight into your data, clean up your DataFrames and visualize relationships between features. It is up to you whether you stick to the sensible, yet sometimes conservative, default parameters or customize the experience by adjusting them according to your needs.

This package is not meant to provide an Auto-ML style API. Rather, it is a collection of functions which you can - and probably should - call every time you start working on a new project or dataset, not only for your own understanding of what you are dealing with, but also to produce plots you can show to supervisors, customers or anyone else looking for a higher-level representation and explanation of the data.

Use the package manager pip to install klib. Alternatively, the package can be installed with conda.

# klib.describe - functions for visualizing datasets
klib.cat_plot(df) # returns a visualization of the number and frequency of categorical features
klib.corr_mat(df) # returns a color-encoded correlation matrix
klib.corr_plot(df) # returns a color-encoded heatmap, ideal for correlations
klib.dist_plot(df) # returns a distribution plot for every numeric feature
klib.missingval_plot(df) # returns a figure containing information about missing values

# klib.clean - functions for cleaning datasets
klib.data_cleaning(df) # performs data cleaning (drop duplicates & empty rows/cols, adjust dtypes, etc.)
klib.clean_column_names(df) # cleans and standardizes column names, also called inside data_cleaning()
klib.convert_datatypes(df) # converts existing to more efficient dtypes, also called inside data_cleaning()
klib.drop_missing(df) # drops missing values, also called in data_cleaning()
klib.mv_col_handling(df) # drops features with high ratio of missing vals based on informational content
klib.pool_duplicate_subsets(df) # pools subset of cols based on duplicates with min. loss of information

Examples

Find all available examples, as well as applications of the functions in klib.clean() with detailed descriptions, in the examples section.

klib.missingval_plot(df) # default representation of missing values in a DataFrame, plenty of settings are available
klib.corr_plot(df, split='pos') # displaying only positive correlations, other settings include threshold, cmap
klib.corr_plot(df, split='neg') # displaying only negative correlations
klib.corr_plot(df, target='wine') # default representation of correlations with the feature column
klib.dist_plot(df) # default representation of a distribution plot, other settings include fill_range, histogram
klib.cat_plot(data, top=4, bottom=4) # representation of the 4 most & least common values in each categorical column

Pull requests and ideas, especially for further functions, are welcome. For major changes or feedback, please open an issue first to discuss what you would like to change.
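To make the cleaning steps described above more concrete (standardizing column names, dropping duplicates and empty rows/cols, adjusting dtypes), here is a minimal pandas-only sketch. It is not klib's implementation and does not reproduce its defaults. The helper names clean_names, basic_cleaning and missing_ratio, as well as the sample data, are hypothetical and chosen only for illustration.

```python
import re

import pandas as pd


def clean_names(columns):
    """Replace runs of non-alphanumeric characters with '_' and lower-case the result."""
    return [re.sub(r"[^0-9a-zA-Z]+", "_", str(c)).strip("_").lower() for c in columns]


def basic_cleaning(df):
    """Rough stand-in for the kind of steps a one-line cleaning call performs (hypothetical helper)."""
    out = df.copy()
    out.columns = clean_names(out.columns)
    out = out.dropna(axis=0, how="all")  # drop completely empty rows
    out = out.dropna(axis=1, how="all")  # drop completely empty columns
    out = out.drop_duplicates()          # drop duplicate rows
    out = out.convert_dtypes()           # let pandas pick nullable dtypes
    return out.reset_index(drop=True)


def missing_ratio(df):
    """Share of missing values per column, the kind of number a missing-value plot summarizes."""
    return df.isna().mean()


# Made-up sample data: one empty row, one empty column, one duplicate row.
raw = pd.DataFrame(
    {
        "Wine Type": ["red", "white", "red", None, "red"],
        "Alcohol (%)": [13.5, 12.0, 13.5, None, 14.1],
        "Empty Col": [None, None, None, None, None],
    }
)

cleaned = basic_cleaning(raw)
print(cleaned.columns.tolist())  # ['wine_type', 'alcohol']
print(cleaned.shape)             # (3, 2)
print(missing_ratio(raw))
```

The sketch copies the input instead of modifying it in place, so the raw DataFrame stays available for comparison before and after cleaning.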