DataFrame methods
_obj = pandas_obj
instance-attribute
__init__(pandas_obj)
assert_all_nulls(fail_message=' ㊙️ Assert all nulls failed ', pass_message=' ✔️ Assert all nulls passed ', subset=None, raise_exception=True, exception_to_raise=DataError, verbose=False)
Tests whether a DataFrame, or a subset of its columns, contains only nulls. Optionally raises an exception. Does not modify the DataFrame itself.
Example
(
iris
.check.assert_all_nulls(subset=["sepal_length"])
)
# Will raise an exception "㊙️ Assert all nulls failed"
See docs for .check.assert_data() for examples of how to customize assertions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `fail_message` | `str` | Message to display if the condition fails. | `' ㊙️ Assert all nulls failed '` |
| `pass_message` | `str` | Message to display if the condition passes. | `' ✔️ Assert all nulls passed '` |
| `subset` | `SubsetTypes` | Optional, which column or columns to check the condition against. | `None` |
| `raise_exception` | `bool` | Whether to raise an exception if the condition fails. | `True` |
| `exception_to_raise` | `Type[BaseException]` | The exception to raise if the condition fails and raise_exception is True. | `DataError` |
| `verbose` | `bool` | Whether to display the pass message if the condition passes. | `False` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | The original DataFrame, unchanged. |
assert_data(condition, fail_message=' ㊙️ Assertion failed ', pass_message=' ✔️ Assertion passed ', subset=None, raise_exception=True, exception_to_raise=DataError, message_shows_condition=True, verbose=False)
Tests whether a DataFrame meets a condition. Optionally raises an exception. Does not modify the DataFrame itself.
Example
# Validate that the Dataframe has at least 1 row
(
iris
.check.assert_data(lambda df: df.shape[0]>0)
# Or customize the message displayed when assert fails
.check.assert_data(lambda df: df.shape[0]>0, "Assertion failed, DataFrame has no rows!")
# Or show a warning instead of raising an exception
.check.assert_data(lambda df: df.shape[0]>0, "FYI DataFrame has no rows", raise_exception=False)
# Or show a message if it passes, and raise a specific exception (ValueError) if it fails.
.check.assert_data(
lambda df: df.shape[0]>0,
fail_message="FYI DataFrame has 0 rows",
pass_message="DataFrame has at least 1 row!",
exception_to_raise=ValueError,
verbose=True # To show pass_message when assertion passes
)
)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `condition` | `Callable` | Assertion criteria in the form of a lambda function. | required |
| `fail_message` | `str` | Message to display if the condition fails. | `' ㊙️ Assertion failed '` |
| `pass_message` | `str` | Message to display if the condition passes. | `' ✔️ Assertion passed '` |
| `subset` | `SubsetTypes` | Optional, which column or columns to check the condition against. Applied after fn. Subsetting can also be done within the condition function itself. | `None` |
| `raise_exception` | `bool` | Whether to raise an exception if the condition fails. | `True` |
| `exception_to_raise` | `Type[BaseException]` | The exception to raise if the condition fails and raise_exception is True. | `DataError` |
| `message_shows_condition` | `bool` | Whether the fail/pass message should also print the assertion criteria. | `True` |
| `verbose` | `bool` | Whether to display the pass message if the condition passes. | `False` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | The original DataFrame, unchanged. |
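The "test the condition, then return the DataFrame unchanged" behaviour described above can be approximated in plain pandas with `DataFrame.pipe`. The helper below, `assert_data_sketch`, is a hypothetical stand-in written for illustration only; it is not the library's implementation and leaves out `subset`, `pass_message`, and `verbose`.

```python
import pandas as pd


def assert_data_sketch(df, condition, fail_message="Assertion failed",
                       raise_exception=True, exception_to_raise=ValueError):
    """Hypothetical helper: evaluate condition(df), then hand df back untouched."""
    if not condition(df):
        if raise_exception:
            raise exception_to_raise(fail_message)
        print(fail_message)  # warn instead of raising
    return df  # the same object, so a method chain can continue


df = pd.DataFrame({"sepal_length": [5.1, 4.9, 4.7]})

# Passing check: the chain continues with the original DataFrame.
result = df.pipe(assert_data_sketch, lambda d: d.shape[0] > 0)

# Failing check: the chosen exception type is raised with the custom message.
try:
    df.pipe(assert_data_sketch, lambda d: d.shape[0] > 10, "Too few rows")
    raised = False
except ValueError:
    raised = True
```

Returning the input object is what lets such a check sit in the middle of a method chain without changing the chain's result.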
assert_datetime(fail_message=None, pass_message=' ✔️ Assert datetime passed ', subset=None, raise_exception=True, exception_to_raise=TypeError, verbose=False)
Tests whether a DataFrame or subset of columns is of datetime or timestamp type. Optionally raises an exception. Does not modify the DataFrame itself.
Example
(
df
.check.assert_datetime(subset="datetime_col")
)
See docs for .check.assert_data() for examples of how to customize assertions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `subset` | `SubsetTypes` | Optional, which column or columns to check the condition against. | `None` |
| `fail_message` | `Union[str, None]` | Message to display if the condition fails. If None, will report expected vs observed type. | `None` |
| `pass_message` | `str` | Message to display if the condition passes. | `' ✔️ Assert datetime passed '` |
| `raise_exception` | `bool` | Whether to raise an exception if the condition fails. | `True` |
| `exception_to_raise` | `Type[BaseException]` | The exception to raise if the condition fails and raise_exception is True. | `TypeError` |
| `verbose` | `bool` | Whether to display the pass message if the condition passes. | `False` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | The original DataFrame, unchanged. |
assert_float(fail_message=None, pass_message=' ✔️ Assert float passed ', subset=None, raise_exception=True, exception_to_raise=TypeError, verbose=False)
Tests whether a DataFrame or subset of columns is of float type. Optionally raises an exception. Does not modify the DataFrame itself.
Example
(
df
.check.assert_float(subset="float_col")
)
See docs for .check.assert_data() for examples of how to customize assertions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `fail_message` | `Union[str, None]` | Message to display if the condition fails. | `None` |
| `pass_message` | `str` | Message to display if the condition passes. | `' ✔️ Assert float passed '` |
| `subset` | `SubsetTypes` | Optional, which column or columns to check the condition against. | `None` |
| `raise_exception` | `bool` | Whether to raise an exception if the condition fails. | `True` |
| `exception_to_raise` | `Type[BaseException]` | The exception to raise if the condition fails and raise_exception is True. | `TypeError` |
| `verbose` | `bool` | Whether to display the pass message if the condition passes. | `False` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | The original DataFrame, unchanged. |
assert_greater_than(min, fail_message=' ㊙️ Assert minimum failed ', pass_message=' ✔️ Assert minimum passed ', or_equal_to=False, subset=None, raise_exception=True, exception_to_raise=DataError, verbose=False)
Tests whether all values in a DataFrame or subset of columns are > or >= a minimum threshold. Optionally raises an exception. Does not modify the DataFrame itself.
Example
(
iris
# Validate that sepal_length is always greater than 0.1
.check.assert_greater_than(0.1, subset="sepal_length")
# Validate that two columns are each always greater than or equal to 0.1
.check.assert_greater_than(0.1, subset=["sepal_length", "petal_length"], or_equal_to=True)
)
See docs for .check.assert_data() for examples of how to customize assertions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `min` | `Any` | The minimum value to compare the DataFrame to. Accepts any type that can be used in >, such as int, float, str, datetime. | required |
| `fail_message` | `str` | Message to display if the condition fails. | `' ㊙️ Assert minimum failed '` |
| `pass_message` | `str` | Message to display if the condition passes. | `' ✔️ Assert minimum passed '` |
| `or_equal_to` | `bool` | Whether to test for >= min (True) or > min (False). | `False` |
| `subset` | `SubsetTypes` | Optional, which column or columns to check the condition against. | `None` |
| `raise_exception` | `bool` | Whether to raise an exception if the condition fails. | `True` |
| `exception_to_raise` | `Type[BaseException]` | The exception to raise if the condition fails and raise_exception is True. | `DataError` |
| `verbose` | `bool` | Whether to display the pass message if the condition passes. | `False` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | The original DataFrame, unchanged. |
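The difference between the strict (>) and inclusive (>=) comparison selected by `or_equal_to` can be reproduced in plain pandas. The function `all_greater_than` below is a hypothetical helper written for this illustration; how the library evaluates the threshold internally is an assumption here.

```python
import pandas as pd


def all_greater_than(df, minimum, subset=None, or_equal_to=False):
    """Hypothetical helper: True if every selected value clears the threshold."""
    cols = [subset] if isinstance(subset, str) else subset
    data = df if cols is None else df[cols]
    passed = data >= minimum if or_equal_to else data > minimum
    return bool(passed.all().all())  # reduce over rows, then over columns


df = pd.DataFrame({"a": [0.1, 0.5], "b": [1.0, 2.0]})

strict = all_greater_than(df, 0.1, subset="a")                       # 0.1 > 0.1 fails
inclusive = all_greater_than(df, 0.1, subset="a", or_equal_to=True)  # 0.1 >= 0.1 passes
both = all_greater_than(df, 0.1, subset=["a", "b"], or_equal_to=True)
```

A value sitting exactly on the threshold is the only case where the two modes disagree, so that boundary is worth deciding on explicitly.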
assert_int(fail_message=None, pass_message=' ✔️ Assert integeer passed ', subset=None, raise_exception=True, exception_to_raise=TypeError, verbose=False)
Tests whether a DataFrame or subset of columns is of integer type. Optionally raises an exception. Does not modify the DataFrame itself.
Example
(
df
.check.assert_int(subset="int_col")
)
See docs for .check.assert_data() for examples of how to customize assertions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `fail_message` | `Union[str, None]` | Message to display if the condition fails. | `None` |
| `pass_message` | `str` | Message to display if the condition passes. | `' ✔️ Assert integeer passed '` |
| `subset` | `SubsetTypes` | Optional, which column or columns to check the condition against. | `None` |
| `raise_exception` | `bool` | Whether to raise an exception if the condition fails. | `True` |
| `exception_to_raise` | `Type[BaseException]` | The exception to raise if the condition fails and raise_exception is True. | `TypeError` |
| `verbose` | `bool` | Whether to display the pass message if the condition passes. | `False` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | The original DataFrame, unchanged. |
assert_less_than(max, fail_message=' ㊙️ Assert maximum failed ', pass_message=' ✔️ Assert maximum passed ', or_equal_to=False, subset=None, raise_exception=True, exception_to_raise=DataError, verbose=False)
Tests whether all values in a DataFrame or subset of columns are < or <= a maximum threshold. Optionally raises an exception. Does not modify the DataFrame itself.
Example
(
iris
# Validate that sepal_length is always < 1000
.check.assert_less_than(1000, subset="sepal_length")
# Validate that two columns are each always less than or equal to 1000
.check.assert_less_than(1000, subset=["sepal_length", "petal_length"], or_equal_to=True)
)
See docs for .check.assert_data() for examples of how to customize assertions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `max` | `Any` | The maximum value to compare the DataFrame to. Accepts any type that can be used in <, such as int, float, str, datetime. | required |
| `or_equal_to` | `bool` | Whether to test for <= max (True) or < max (False). | `False` |
| `fail_message` | `str` | Message to display if the condition fails. | `' ㊙️ Assert maximum failed '` |
| `pass_message` | `str` | Message to display if the condition passes. | `' ✔️ Assert maximum passed '` |
| `subset` | `SubsetTypes` | Optional, which column or columns to check the condition against. | `None` |
| `raise_exception` | `bool` | Whether to raise an exception if the condition fails. | `True` |
| `exception_to_raise` | `Type[BaseException]` | The exception to raise if the condition fails and raise_exception is True. | `DataError` |
| `verbose` | `bool` | Whether to display the pass message if the condition passes. | `False` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | The original DataFrame, unchanged. |
assert_negative(fail_message=' ㊙️ Assert negative failed ', pass_message=' ✔️ Assert negative passed ', subset=None, assert_no_nulls=True, raise_exception=True, exception_to_raise=DataError, verbose=False)
Tests whether a DataFrame or subset of columns has all negative values. Optionally raises an exception. Does not modify the DataFrame itself.
Example
(
df
.check.assert_negative(subset="column_name")
)
See docs for .check.assert_data() for examples of how to customize assertions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `fail_message` | `str` | Message to display if the condition fails. | `' ㊙️ Assert negative failed '` |
| `pass_message` | `str` | Message to display if the condition passes. | `' ✔️ Assert negative passed '` |
| `subset` | `SubsetTypes` | Optional, which column or columns to check the condition against. | `None` |
| `assert_no_nulls` | `bool` | Whether to also enforce that data has no nulls. | `True` |
| `raise_exception` | `bool` | Whether to raise an exception if the condition fails. | `True` |
| `exception_to_raise` | `Type[BaseException]` | The exception to raise if the condition fails and raise_exception is True. | `DataError` |
| `verbose` | `bool` | Whether to display the pass message if the condition passes. | `False` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | The original DataFrame, unchanged. |
assert_no_nulls(fail_message=' ㊙️ Assert no nulls failed ', pass_message=' ✔️ Assert no nulls passed ', subset=None, raise_exception=True, exception_to_raise=DataError, verbose=False)
Tests whether a DataFrame or subset of columns has no nulls. Optionally raises an exception. Does not modify the DataFrame itself.
Example
(
iris
.check.assert_no_nulls(subset=["sepal_length"])
)
See docs for .check.assert_data() for examples of how to customize assertions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `fail_message` | `str` | Message to display if the condition fails. | `' ㊙️ Assert no nulls failed '` |
| `pass_message` | `str` | Message to display if the condition passes. | `' ✔️ Assert no nulls passed '` |
| `subset` | `SubsetTypes` | Optional, which column or columns to check the condition against. | `None` |
| `raise_exception` | `bool` | Whether to raise an exception if the condition fails. | `True` |
| `exception_to_raise` | `Type[BaseException]` | The exception to raise if the condition fails and raise_exception is True. | `DataError` |
| `verbose` | `bool` | Whether to display the pass message if the condition passes. | `False` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | The original DataFrame, unchanged. |
assert_nrows(nrows, fail_message=' ㊙️ Assert nrows failed ', pass_message=' ✔️ Assert nrows passed ', raise_exception=True, exception_to_raise=DataError, verbose=False)
Tests whether a DataFrame has a given number of rows. Optionally raises an exception. Does not modify the DataFrame itself.
Example
(
iris
.check.assert_nrows(20)
)
See docs for .check.assert_data() for examples of how to customize assertions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `nrows` | `int` | The expected number of rows. | required |
| `fail_message` | `str` | Message to display if the condition fails. | `' ㊙️ Assert nrows failed '` |
| `pass_message` | `str` | Message to display if the condition passes. | `' ✔️ Assert nrows passed '` |
| `raise_exception` | `bool` | Whether to raise an exception if the condition fails. | `True` |
| `exception_to_raise` | `Type[BaseException]` | The exception to raise if the condition fails and raise_exception is True. | `DataError` |
| `verbose` | `bool` | Whether to display the pass message if the condition passes. | `False` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | The original DataFrame, unchanged. |
assert_positive(fail_message=' ㊙️ Assert positive failed ', pass_message=' ✔️ Assert positive passed ', subset=None, assert_no_nulls=True, raise_exception=True, exception_to_raise=DataError, verbose=False)
Tests whether a DataFrame or subset of columns has all positive values. Optionally raises an exception. Does not modify the DataFrame itself.
Example
(
iris
.check.assert_positive(subset=["sepal_length"])
)
See docs for .check.assert_data() for examples of how to customize assertions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `fail_message` | `str` | Message to display if the condition fails. | `' ㊙️ Assert positive failed '` |
| `pass_message` | `str` | Message to display if the condition passes. | `' ✔️ Assert positive passed '` |
| `subset` | `SubsetTypes` | Optional, which column or columns to check the condition against. | `None` |
| `assert_no_nulls` | `bool` | Whether to also enforce that data has no nulls. | `True` |
| `raise_exception` | `bool` | Whether to raise an exception if the condition fails. | `True` |
| `exception_to_raise` | `Type[BaseException]` | The exception to raise if the condition fails and raise_exception is True. | `DataError` |
| `verbose` | `bool` | Whether to display the pass message if the condition passes. | `False` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | The original DataFrame, unchanged. |
assert_same_nrows(other, fail_message=' ㊙️ Assert same_nrows failed ', pass_message=' ✔️ Assert same_nrows passed ', raise_exception=True, exception_to_raise=DataError, verbose=False)
Tests whether a DataFrame has the same number of rows as another DataFrame or Series. Optionally raises an exception. Does not modify the DataFrame itself.
Example
transactions_raw_df = load_transactions()
transactions_processed_df = process_transactions()
transactions_final_df = (
transactions_processed_df
.merge(how="left", right=products_df, on="product_id")
.check.assert_same_nrows(transactions_raw_df, "Unexpected change in row count of final DF vs raw DF. Check for duplicate `product_id` keys in product_df?")
)
See docs for .check.assert_data() for examples of how to customize assertions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `other` | `Union[DataFrame, Series]` | The DataFrame or Series expected to have the same number of rows. | required |
| `fail_message` | `str` | Message to display if the condition fails. | `' ㊙️ Assert same_nrows failed '` |
| `pass_message` | `str` | Message to display if the condition passes. | `' ✔️ Assert same_nrows passed '` |
| `raise_exception` | `bool` | Whether to raise an exception if the condition fails. | `True` |
| `exception_to_raise` | `Type[BaseException]` | The exception to raise if the condition fails and raise_exception is True. | `DataError` |
| `verbose` | `bool` | Whether to display the pass message if the condition passes. | `False` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | The original DataFrame, unchanged. |
Note
For typical validation of merges, such as 1:1 joins, it's easier to use the validate argument in Pandas merge().
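As a sketch of the alternative mentioned in the note, pandas `merge()` accepts a `validate` argument and raises `pandas.errors.MergeError` when the join keys break the declared relationship. The two small tables below are invented for illustration.

```python
import pandas as pd
from pandas.errors import MergeError

transactions = pd.DataFrame({"product_id": [1, 2, 2], "qty": [3, 1, 4]})
# Illustrative lookup table with a duplicated key (product_id 2 appears twice).
products = pd.DataFrame(
    {"product_id": [1, 2, 2], "name": ["pen", "ink", "ink refill"]}
)

# "many_to_one" requires the right-hand keys to be unique.
try:
    transactions.merge(products, how="left", on="product_id", validate="many_to_one")
    merge_ok = True
except MergeError:
    merge_ok = False

# After removing the duplicate key, the left join preserves the row count.
clean_products = products.drop_duplicates(subset="product_id")
merged = transactions.merge(
    clean_products, how="left", on="product_id", validate="many_to_one"
)
```

`validate` catches the duplicate-key case at the merge itself, while a row-count assertion also covers changes introduced by later steps in the chain.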
assert_str(fail_message=None, pass_message=' ✔️ Assert string passed ', subset=None, raise_exception=True, exception_to_raise=TypeError, verbose=False)
Tests whether a DataFrame or subset of columns is of string type. Optionally raises an exception. Does not modify the DataFrame itself.
Example
(
iris
.check.assert_str(subset=["species", "another_string_column"])
)
See docs for .check.assert_data() for examples of how to customize assertions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `fail_message` | `Union[str, None]` | Message to display if the condition fails. If None, will report expected vs observed type. | `None` |
| `pass_message` | `str` | Message to display if the condition passes. | `' ✔️ Assert string passed '` |
| `subset` | `SubsetTypes` | Optional, which column or columns to check the condition against. | `None` |
| `raise_exception` | `bool` | Whether to raise an exception if the condition fails. | `True` |
| `exception_to_raise` | `Type[BaseException]` | The exception to raise if the condition fails and raise_exception is True. | `TypeError` |
| `verbose` | `bool` | Whether to display the pass message if the condition passes. | `False` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | The original DataFrame, unchanged. |
assert_timedelta(fail_message=None, pass_message=' ✔️ Assert timedelta passed ', subset=None, raise_exception=True, exception_to_raise=TypeError, verbose=False)
Tests whether a DataFrame or subset of columns is of type timedelta. Optionally raises an exception. Does not modify the DataFrame itself.
Example
(
df
.check.assert_timedelta(subset=["timedelta_col"])
)
See docs for .check.assert_data() for examples of how to customize assertions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `fail_message` | `Union[str, None]` | Message to display if the condition fails. If None, will report expected vs observed type. | `None` |
| `pass_message` | `str` | Message to display if the condition passes. | `' ✔️ Assert timedelta passed '` |
| `subset` | `SubsetTypes` | Optional, which column or columns to check the condition against. | `None` |
| `raise_exception` | `bool` | Whether to raise an exception if the condition fails. | `True` |
| `exception_to_raise` | `Type[BaseException]` | The exception to raise if the condition fails and raise_exception is True. | `TypeError` |
| `verbose` | `bool` | Whether to display the pass message if the condition passes. | `False` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | The original DataFrame, unchanged. |
assert_type(dtype, fail_message=None, pass_message=' ✔️ Assert type passed ', subset=None, raise_exception=True, exception_to_raise=TypeError, verbose=False)
Tests whether a DataFrame or subset of columns has the expected type. Optionally raises an exception. Does not modify the DataFrame itself.
Example
# Validate that a column of mixed types has overall type `object`
(
iris
.check.assert_type(object, subset="column_with_mixed_types")
)
See docs for .check.assert_data() for examples of how to customize assertions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `dtype` | `Type[Any]` | The required variable type. | required |
| `fail_message` | `Union[str, None]` | Message to display if the condition fails. If None, will report expected vs observed type. | `None` |
| `pass_message` | `str` | Message to display if the condition passes. | `' ✔️ Assert type passed '` |
| `subset` | `SubsetTypes` | Optional, which column or columns to check the condition against. | `None` |
| `raise_exception` | `bool` | Whether to raise an exception if the condition fails. | `True` |
| `exception_to_raise` | `Type[BaseException]` | The exception to raise if the condition fails and raise_exception is True. | `TypeError` |
| `verbose` | `bool` | Whether to display the pass message if the condition passes. | `False` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | The original DataFrame, unchanged. |
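The example above relies on pandas storing a column of mixed Python types with the `object` dtype. The short sketch below shows that behaviour in plain pandas; the column names are invented, and how the library itself compares dtypes is not shown here.

```python
import pandas as pd

df = pd.DataFrame({"mixed": [1, "two", 3.0], "ints": [1, 2, 3]})

# A column holding ints, strings, and floats falls back to the generic object dtype.
mixed_is_object = df["mixed"].dtype == object

# A homogeneous integer column gets a numeric dtype instead.
ints_is_object = df["ints"].dtype == object
```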
assert_unique(fail_message=' ㊙️ Assert unique failed ', pass_message=' ✔️ Assert unique passed ', subset=None, raise_exception=True, exception_to_raise=DataError, verbose=False)
Validates that a subset of columns have no duplicate values, or validates that a DataFrame has no duplicate rows. Optionally raises an exception. Does not modify the DataFrame itself.
Example
(
df
# Validate that a column has no duplicate values
.check.assert_unique(subset="id_column")
# Validate that a DataFrame has no duplicate rows
.check.assert_unique()
)
See docs for .check.assert_data() for examples of how to customize assertions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `fail_message` | `str` | Message to display if the condition fails. | `' ㊙️ Assert unique failed '` |
| `pass_message` | `str` | Message to display if the condition passes. | `' ✔️ Assert unique passed '` |
| `subset` | `SubsetTypes` | Optional, which column or columns to check the condition against. | `None` |
| `raise_exception` | `bool` | Whether to raise an exception if the condition fails. | `True` |
| `exception_to_raise` | `Type[BaseException]` | The exception to raise if the condition fails and raise_exception is True. | `DataError` |
| `verbose` | `bool` | Whether to display the pass message if the condition passes. | `False` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | The original DataFrame, unchanged. |
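The two cases described above (unique values within a subset of columns versus unique whole rows) correspond to calling pandas `duplicated()` with and without `subset`. This is a plain pandas sketch of the distinction with invented data, not the library's own code.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 2], "value": ["x", "y", "z"]})

# Uniqueness of one column: id 2 appears twice, so this is False.
id_is_unique = not df.duplicated(subset=["id"]).any()

# Uniqueness of whole rows: no two rows match in every column, so this is True.
rows_are_unique = not df.duplicated().any()
```

A DataFrame can therefore pass the whole-row check while still failing the check on a key column.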
columns(msg='🏛️ Columns', fn=lambda df: df, subset=None)
Prints the column names of a DataFrame, without modifying the DataFrame itself.
Example
(
df
.check.columns()
)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `msg` | `Union[str, None]` | Optionally customize the text displayed before the result of the check. | `'🏛️ Columns'` |
| `fn` | `Callable` | An optional lambda function to apply to the DataFrame before printing columns. | `lambda df: df` |
| `subset` | `SubsetTypes` | Optional column name or names to select before printing their names. Applied after fn. | `None` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | The original DataFrame, unchanged. |
describe(fn=lambda df: df, subset=None, msg='📏 Distributions', **kwargs)
Displays descriptive statistics about a DataFrame without modifying the DataFrame itself.
See Pandas docs for describe() for additional usage information, including more options you can pass to this Pandas Checks method.
Example
(
df
.check.describe()
)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `fn` | `Callable` | An optional lambda function to apply to the DataFrame before running Pandas describe(). | `lambda df: df` |
| `subset` | `SubsetTypes` | Optional column name or names to select before running Pandas describe(). Applied after fn. | `None` |
| `msg` | `Union[str, None]` | Optionally customize the text displayed before the result of the check. | `'📏 Distributions'` |
| `**kwargs` | `Any` | Optional, additional arguments that are accepted by the Pandas describe() method. | `{}` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | The original DataFrame, unchanged. |
disable_checks(enable_asserts=True)
Turns off Pandas Checks globally, such as in production mode. Calls to .check functions will not be run. Does not modify the DataFrame itself.
Example
(
iris
.check.disable_checks()
.check.assert_data(lambda df: df.shape[0]>10) # This check will NOT be run
.check.enable_checks() # Subsequent calls to .check will be run
)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `enable_asserts` | `bool` | Optionally, whether to also enable or disable assert statements. | `True` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | The original DataFrame, unchanged. |
dtypes(fn=lambda df: df, subset=None, msg='🗂️ Data types')
Displays the data types of a DataFrame's columns without modifying the DataFrame itself.
Example
(
iris
.check.dtypes()
)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `fn` | `Callable` | An optional lambda function to apply to the DataFrame before running Pandas .dtypes. | `lambda df: df` |
| `subset` | `SubsetTypes` | Optional column name or names to select before running Pandas .dtypes. Applied after fn. | `None` |
| `msg` | `Union[str, None]` | Optionally customize the text displayed before the result of the check. | `'🗂️ Data types'` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | The original DataFrame, unchanged. |
enable_checks(enable_asserts=True)
Globally enables Pandas Checks. Subsequent calls to .check methods will be run. Does not modify the DataFrame itself.
Example
(
iris
["sepal_length"]
.check.disable_checks()
.check.assert_data(lambda s: s.shape[0]>10) # This check will NOT be run
.check.enable_checks() # Subsequent calls to .check will be run
)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `enable_asserts` | `bool` | Optionally, whether to globally enable or disable calls to .check.assert_data(). | `True` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | The original DataFrame, unchanged. |
function(fn=lambda df: df, subset=None, msg=None)
Applies an arbitrary function on a DataFrame and shows the result, without modifying the DataFrame itself.
Example
(
iris
.check.function(fn=lambda df: df.shape[0]>10, msg='Has at least 10 rows?')
)
# Will return either 'True' or 'False'
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `fn` | `Callable` | A lambda function to apply to the DataFrame. | `lambda df: df` |
| `subset` | `SubsetTypes` | Optional column name or names to select. Applied after fn. | `None` |
| `msg` | `Union[str, None]` | Optionally customize the text displayed before the result of the check. | `None` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | The original DataFrame, unchanged. |
get_mode(msg='🐼🩺 Pandas Checks mode')
Displays the current values of Pandas Checks global options enable_checks and enable_asserts. Does not modify the DataFrame itself.
Example
(
iris
.check.get_mode()
)
# The check will print:
# "🐼🩺 Pandas Checks mode: {'enable_checks': True, 'enable_asserts': True}"
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `msg` | `Union[str, None]` | Optionally customize the text displayed before the result of the check. | `'🐼🩺 Pandas Checks mode'` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | The original DataFrame, unchanged. |
head(n=5, fn=lambda df: df, subset=None, msg=None)
Displays the first n rows of a DataFrame, without modifying the DataFrame itself.
See Pandas docs for head() for additional usage information.
Example
(
iris
.check.head(10)
)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `n` | `int` | The number of rows to display. | `5` |
| `fn` | `Callable` | An optional lambda function to apply to the DataFrame before running Pandas head(). | `lambda df: df` |
| `subset` | `SubsetTypes` | Optional column name or names to select before running Pandas head(). Applied after fn. | `None` |
| `msg` | `Union[str, None]` | Optionally customize the text displayed before the result of the check. | `None` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | The original DataFrame, unchanged. |
hist(subset=None, fn=lambda df: df, msg=None, **kwargs)
Displays a histogram for the DataFrame, without modifying the DataFrame itself.
You can pass a single column (via kwargs) or a subset argument, which can display a grid of multiple histograms.
See Pandas docs for hist() for additional usage information, including more options you can pass to this Pandas Checks method.
Example
(
iris
# Show one histogram
.check.hist("sepal_length")
# Show two histograms
.check.hist(["petal_length", "petal_width"])
)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `subset` | `SubsetTypes` | Optional column name or names to select before running Pandas hist(). Applied after fn. | `None` |
| `fn` | `Callable` | An optional lambda function to apply to the DataFrame before running Pandas hist(). | `lambda df: df` |
| `msg` | `Union[str, None]` | Optionally customize the text displayed before the result of the check. | `None` |
| `**kwargs` | `Any` | Optional, additional arguments that are accepted by the Pandas hist() method. | `{}` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | The original DataFrame, unchanged. |
Note
Only renders in interactive mode (IPython/Jupyter), not in terminal.
info(fn=lambda df: df, subset=None, msg='ℹ️ Info', **kwargs)
Displays summary information about a DataFrame, without modifying the DataFrame itself.
See Pandas docs for info() for additional usage information, including more options you can pass to this Pandas Checks method.
Example
(
iris
.check.info()
)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `fn` | `Callable` | An optional lambda function to apply to the DataFrame before running Pandas info(). | `lambda df: df` |
| `subset` | `SubsetTypes` | Optional column name or names to select before running Pandas info(). Applied after fn. | `None` |
| `msg` | `Union[str, None]` | Optionally customize the text displayed before the result of the check. | `'ℹ️ Info'` |
| `**kwargs` | `Any` | Optional, additional arguments that are accepted by the Pandas info() method. | `{}` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | The original DataFrame, unchanged. |
memory_usage(fn=lambda df: df, subset=None, msg='💾 Memory usage', **kwargs)
Displays the memory footprint of a DataFrame, without modifying the DataFrame itself.
See Pandas docs for memory_usage() for additional usage information, including more options you can pass to this Pandas Checks method.
Example
(
iris
.check.memory_usage()
)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `fn` | `Callable` | An optional lambda function to apply to the DataFrame before running Pandas memory_usage(). | `lambda df: df` |
| `subset` | `SubsetTypes` | Optional column name or names to select before running Pandas memory_usage(). Applied after fn. | `None` |
| `msg` | `Union[str, None]` | Optionally customize the text displayed before the result of the check. | `'💾 Memory usage'` |
| `**kwargs` | `Any` | Optional, additional arguments that are accepted by the Pandas memory_usage() method. | `{}` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | The original DataFrame, unchanged. |
Note
Include argument deep=True to get further memory usage of object dtypes in the DataFrame. See Pandas docs for memory_usage() for more info.
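To illustrate the note, here is a plain pandas comparison of `memory_usage()` with and without `deep=True`. The column names and values are invented; the string column is given the `object` dtype explicitly so that the effect is visible.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "n": [1, 2, 3],
        "s": pd.Series(["a", "bb", "ccc"], dtype=object),
    }
)

shallow = df.memory_usage()        # object columns: counts only the references
deep = df.memory_usage(deep=True)  # object columns: also counts the Python strings

extra_bytes_for_strings = int(deep["s"] - shallow["s"])
```

Numeric columns report the same size either way; only object columns grow under `deep=True`.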
ncols(msg='🏛️ Columns', fn=lambda df: df, subset=None)
Displays the number of columns in a DataFrame, without modifying the DataFrame itself.
Example
(
iris
.check.ncols()
)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `msg` | `Union[str, None]` | Optionally customize the text displayed before the result of the check. | `'🏛️ Columns'` |
| `fn` | `Callable` | An optional lambda function to apply to the DataFrame before counting the number of columns. | `lambda df: df` |
| `subset` | `SubsetTypes` | Optional column name or names to select before counting the number of columns. Applied after fn. | `None` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | The original DataFrame, unchanged. |
ndups(subset=None, fn=lambda df: df, msg=None, **kwargs)
Displays the number of duplicated rows in a DataFrame, without modifying the DataFrame itself.
See Pandas docs for duplicated() for additional usage information, including more options you can pass to this Pandas Checks method.
Example
# Count the number of rows with duplicate pairs of values across two columns:
(
iris
.check.ndups(subset=["sepal_length", "sepal_width"])
)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
subset
|
SubsetTypes
|
Optional column name or names to select before counting duplicate rows. Applied after fn. |
None
|
fn
|
Callable
|
An optional lambda function to apply to the DataFrame before counting the number of duplicates. Example: |
lambda df: df
|
msg
|
Union[str, None]
|
Optionally customize the text displayed before the result of the check. |
None
|
**kwargs
|
Any
|
Optional, additional arguments that are accepted by Pandas duplicated() method. |
{}
|
Returns:
| Type | Description |
|---|---|
DataFrame
|
The original DataFrame, unchanged. |
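For intuition about what is being counted, here is a minimal sketch in plain Pandas. It assumes the check reports the number of rows flagged by duplicated(); the toy data is invented for illustration, and Pandas Checks may format or compute its output differently.

```python
import pandas as pd

# Hypothetical toy data with one repeated (sepal_length, sepal_width) pair
df = pd.DataFrame({
    "sepal_length": [5.1, 5.1, 4.9, 5.1],
    "sepal_width": [3.5, 3.5, 3.0, 3.8],
})

pair = ["sepal_length", "sepal_width"]

# By default duplicated() flags repeats of an earlier row, not the first occurrence
n_dups_default = int(df.duplicated(subset=pair).sum())

# keep=False flags every member of a duplicated group, including the first
n_dups_keep_false = int(df.duplicated(subset=pair, keep=False).sum())

print(n_dups_default, n_dups_keep_false)
```

Because **kwargs are documented as being accepted by Pandas duplicated(), an option such as keep=False could presumably be passed through .check.ndups() in the same way.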
nnulls(fn=lambda df: df, subset=None, by_column=True, msg='👻 Rows with NaNs')
Displays the number of rows with null values in a DataFrame, without modifying the DataFrame itself.
See Pandas docs for isna() for additional usage information.
Example
# Count the number of rows that have any nulls, one count per column
(
iris
.check.nnulls()
)
# Count the number of rows in the DataFrame that have a null in any column
(
iris
.check.nnulls(by_column=False)
)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
fn
|
Callable
|
An optional lambda function to apply to the DataFrame before counting the number of rows with a null. Example: |
lambda df: df
|
subset
|
SubsetTypes
|
Optional column name or names to select before counting nulls. |
None
|
by_column
|
bool
|
If True, count null values within each column separately. If False, count rows with a null value in any column. Applied after fn. |
True
|
msg
|
Union[str, None]
|
Optionally customize the text displayed before the result of the check. |
'👻 Rows with NaNs'
|
Returns:
| Type | Description |
|---|---|
DataFrame
|
The original DataFrame, unchanged. |
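The two by_column modes can be approximated in plain Pandas as follows. This is only a sketch of the distinction described above, using invented data; it is not taken from the Pandas Checks source.

```python
import pandas as pd

# Hypothetical toy data with nulls scattered across two columns
df = pd.DataFrame({
    "sepal_length": [5.1, None, 4.7, None],
    "sepal_width": [3.5, 3.0, None, None],
})

# Analogue of by_column=True: one null count per column
nulls_per_column = df.isna().sum()

# Analogue of by_column=False: rows that have a null in at least one column
rows_with_any_null = int(df.isna().any(axis=1).sum())

print(nulls_per_column)
print(rows_with_any_null)
```

Note that the per-column counts do not add up to the row count, because a row with nulls in both columns is counted once per column but only once as a row.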
nrows(msg='☰ Rows', fn=lambda df: df, subset=None)
Displays the number of rows in a DataFrame, without modifying the DataFrame itself.
Example
(
iris
.check.nrows()
)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
msg
|
Union[str, None]
|
Optionally customize the text displayed before the result of the check. |
'☰ Rows'
|
fn
|
Callable
|
An optional lambda function to apply to the DataFrame before counting the number of rows. Example: |
lambda df: df
|
subset
|
SubsetTypes
|
Optional column name or names to select before counting rows. Applied after fn. |
None
|
Returns:
| Type | Description |
|---|---|
DataFrame
|
The original DataFrame, unchanged. |
nunique(subset=None, column=None, across_columns=False, fn=lambda df: df, msg=None, **kwargs)
Displays the number of unique values in a Series or unique combinations of rows in a DataFrame, without modifying the DataFrame itself.
Note
- When across_columns=False, we use the standard Pandas nunique() methods. In those methods, dropna=True by default. You can change this by passing dropna=False.
- See Pandas docs for nunique() DataFrame or nunique() Series for additional usage information, including more options you can pass to this Pandas Checks method.
Example
(
iris
.check.nunique(column="sepal_width") # Unique values of sepal_width (standard Pandas Series.nunique())
.check.nunique(subset=["petal_width", "sepal_width"]) # Unique values in each column separately (standard Pandas DataFrame.nunique())
.check.nunique(subset=["petal_width", "sepal_width"], across_columns=True) # Unique combinations of values
)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
subset
|
SubsetTypes
|
Optional column name or names to select before counting uniques. If None, and column is None, will include all columns. Applied after fn. |
None
|
column
|
Union[str, None]
|
The optional name of a column to count uniques in. Applied after fn. Kept for backwards compatibility. |
None
|
across_columns
|
bool
|
When the DataFrame has multiple columns (after applying subset), whether to count the unique values in each column separately (False, the standard Pandas DataFrame nunique() behavior) or count the unique combinations of rows across those columns (True). |
False
|
fn
|
Callable
|
An optional lambda function to apply to the DataFrame before running Pandas nunique(). Example: |
lambda df: df
|
msg
|
Union[str, None]
|
Optionally customize the text displayed before the result of the check. |
None
|
**kwargs
|
Any
|
Optional, additional arguments that are accepted by Pandas nunique() method(s) |
{}
|
Returns:
| Type | Description |
|---|---|
DataFrame
|
The original DataFrame, unchanged. |
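The difference between per-column counts, the dropna option, and counting combinations across columns can be sketched with plain Pandas. The data is invented, and using drop_duplicates() to count combinations is an assumption made for illustration; Pandas Checks may compute across_columns=True differently.

```python
import pandas as pd

# Hypothetical toy data: one missing value and one repeated row
df = pd.DataFrame({
    "petal_width": [0.2, 0.2, 1.3, None, 0.2],
    "sepal_width": [3.5, 3.0, 3.5, 3.5, 3.5],
})

# Analogue of across_columns=False: unique values per column (NaN ignored by default)
per_column = df.nunique()

# Same, but counting NaN as its own value
per_column_with_nan = df.nunique(dropna=False)

# One possible analogue of across_columns=True: distinct row combinations
n_combinations = len(df.drop_duplicates())

print(per_column)
print(per_column_with_nan)
print(n_combinations)
```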
plot(subset=None, fn=lambda df: df, msg='', **kwargs)
Displays a plot of the DataFrame, without modifying the DataFrame itself.
See Pandas docs for plot() for additional usage information, including more options you can pass to this Pandas Checks method.
Example
(
iris
.check.plot(kind="scatter", x="sepal_width", y="sepal_length", title="Sepal width vs sepal length")
)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
subset
|
SubsetTypes
|
Optional column name or names to select before plotting. Applied after fn. |
None
|
fn
|
Callable
|
An optional lambda function to apply to the DataFrame before running Pandas plot(). Example: |
lambda df: df
|
msg
|
Union[str, None]
|
An optional title for the plot. |
''
|
**kwargs
|
Any
|
Optional, additional arguments that are accepted by Pandas plot() method. |
{}
|
Returns:
| Type | Description |
|---|---|
DataFrame
|
The original DataFrame, unchanged. |
Note
Plots are only displayed when code is run in IPython/Jupyter, not in terminal.
If you pass a 'title' kwarg, it becomes the plot title, overriding msg.
print(object=None, fn=lambda df: df, subset=None, msg=None, max_rows=10)
Displays text, another object, or (by default) the current DataFrame's head. Does not modify the DataFrame itself.
Example
# Print messages and milestones
(
iris
.check.print("Starting data cleaning...")
...
)
# Inspect a DataFrame, such as the interim result of data processing
(
iris
...
.check.print(fn=lambda df: df.query("sepal_width<0"), msg="Rows with negative sepal_width")
)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
object
|
Any
|
Object to print. Can be anything printable: str, int, list, another DataFrame, etc. If None, print the DataFrame's head (with max_rows rows). |
None
|
fn
|
Callable
|
An optional lambda function to apply to the DataFrame before printing |
lambda df: df
|
subset
|
SubsetTypes
|
Optional column name or names to limit which data are printed. Applied after fn. |
None
|
msg
|
Union[str, None]
|
Optionally customize the text displayed before the result of the check. |
None
|
max_rows
|
int
|
Maximum number of rows to print if object=None. |
10
|
Returns:
| Type | Description |
|---|---|
DataFrame
|
The original DataFrame, unchanged. |
print_time_elapsed(start_time, lead_in='Time elapsed', units='auto')
Displays the time elapsed since start_time.
Example
import pandas_checks as pdc
start_time = pdc.start_timer()
(
iris
... # Do some data processing
.check.print_time_elapsed(start_time, "Cleaning took")
... # Do more
.check.print_time_elapsed(start_time, "Processing total time", units="seconds") # Force units to stay in seconds
)
# Result: "Cleaning took: 17.298324584960938 seconds"
# "Processing total time: 71.0400543212890625 seconds"
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
start_time
|
float
|
The time at which the stopwatch started, as returned by the Pandas Checks start_timer() function. |
required |
lead_in
|
Union[str, None]
|
Optional text to print before the elapsed time. |
'Time elapsed'
|
units
|
str
|
The units in which to display the elapsed time. Allowed values: "auto", "milliseconds", "seconds", "minutes", "hours" or shorthands "ms", "s", "m", "h". |
'auto'
|
Raises:
| Type | Description |
|---|---|
ValueError
|
If |
Returns:
| Type | Description |
|---|---|
DataFrame
|
The original DataFrame, unchanged. |
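To make the units option more concrete, here is a minimal standard-library sketch of a stopwatch with unit conversion. The helper format_elapsed, the table _SECONDS_PER_UNIT, the rule used for "auto", and the use of time.time() as a stand-in for start_timer() are all assumptions for illustration, not the library's implementation.

```python
import time

# Hypothetical conversion table: seconds per unit, including the shorthands listed above
_SECONDS_PER_UNIT = {
    "milliseconds": 0.001, "ms": 0.001,
    "seconds": 1.0, "s": 1.0,
    "minutes": 60.0, "m": 60.0,
    "hours": 3600.0, "h": 3600.0,
}

def format_elapsed(elapsed_seconds, units="auto"):
    """Render a duration in the requested units (illustrative helper only)."""
    if units == "auto":
        # Assumed rule: pick the largest unit that keeps the value at 1 or above
        if elapsed_seconds >= 3600:
            units = "hours"
        elif elapsed_seconds >= 60:
            units = "minutes"
        elif elapsed_seconds >= 1:
            units = "seconds"
        else:
            units = "milliseconds"
    if units not in _SECONDS_PER_UNIT:
        raise ValueError(f"Unsupported units: {units!r}")
    return f"{elapsed_seconds / _SECONDS_PER_UNIT[units]} {units}"

start_time = time.time()            # stand-in for pdc.start_timer()
elapsed = time.time() - start_time  # seconds since the stopwatch started
print("Time elapsed:", format_elapsed(elapsed))
```

Forcing units="seconds" keeps long-running steps comparable with short ones, which is the use case shown in the example above.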
reset_format()
Globally restores all Pandas Checks formatting options to their default "factory" settings. Does not modify the DataFrame itself.
Example
(
iris
.check.set_format(precision=9, use_emojis=False)
# Print DF summary stats with precision 9 digits and no Pandas Checks emojis
.check.describe()
.check.reset_format() # Go back to default precision and emojis 🥳
)
Returns:
| Type | Description |
|---|---|
DataFrame
|
The original DataFrame, unchanged. |
set_format(**kwargs)
Configures selected formatting options for Pandas Checks. Does not modify the DataFrame itself.
Run pandas_checks.describe_options() to see a list of available options.
Example
(
iris
.check.set_format(precision=9, use_emojis=False)
# Print DF summary stats with precision 9 digits and no Pandas Checks emojis
.check.describe()
.check.reset_format() # Go back to default precision and emojis 🥳
)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
**kwargs
|
Any
|
Pairs of setting name and its new value. |
{}
|
Returns:
| Type | Description |
|---|---|
DataFrame
|
The original DataFrame, unchanged. |
set_mode(enable_checks, enable_asserts)
Configures the operation mode for Pandas Checks globally. Does not modify the DataFrame itself.
Example
# Disable checks except keep running assertions. Same as using `.check.disable_checks()`:
(
iris
.check.set_mode(enable_checks=False, enable_asserts=True)
.check.describe() # This check will not be run
.check.assert_data(lambda s: s.shape[0]>10) # This check will still be run
)
# Disable checks _and_ assertions
(
iris
.check.set_mode(enable_checks=False, enable_asserts=False)
)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
enable_checks
|
bool
|
Whether to run any Pandas Checks methods globally. Does not affect .check.assert_*(). |
required |
enable_asserts
|
bool
|
Whether to run calls to Pandas Checks .check.assert_*() statements globally. |
required |
Returns:
| Type | Description |
|---|---|
DataFrame
|
The original DataFrame, unchanged. |
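The interaction between the two flags can be pictured with a small standalone sketch. Everything here (the _mode dictionary, run_check, run_assert) is hypothetical and only models the documented behavior that enable_checks does not affect assertions; the real internals of Pandas Checks may look quite different.

```python
# Hypothetical module-level state shared by all checks
_mode = {"enable_checks": True, "enable_asserts": True}

def set_mode(enable_checks, enable_asserts):
    """Update the global flags (both arguments are required, as documented above)."""
    _mode["enable_checks"] = enable_checks
    _mode["enable_asserts"] = enable_asserts

def run_check(data, check_fn):
    """Run a display-style check only when checks are enabled; always return the data."""
    if _mode["enable_checks"]:
        check_fn(data)
    return data

def run_assert(data, condition):
    """Evaluate an assertion only when asserts are enabled; always return the data."""
    if _mode["enable_asserts"] and not condition(data):
        raise ValueError("Assertion failed")
    return data

set_mode(enable_checks=False, enable_asserts=True)
data = [1, 2, 3]
run_check(data, print)                  # skipped: nothing is printed
run_assert(data, lambda d: len(d) > 0)  # still evaluated
```

Returning the data unchanged from both helpers is what lets calls like these sit in the middle of a method chain.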
shape(msg='📐 Shape', fn=lambda df: df, subset=None)
Displays the DataFrame's dimensions, without modifying the DataFrame itself.
Example
(
iris
.check.shape()
.check.shape(msg="Shape of DataFrame subgroup with sepal_length<5", fn=lambda df: df.query("sepal_length<5"))
)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
msg
|
Union[str, None]
|
Optionally customize the text displayed before the result of the check. |
'📐 Shape'
|
fn
|
Callable
|
An optional lambda function to apply to the DataFrame before getting its shape. |
lambda df: df
|
subset
|
SubsetTypes
|
Optional column name or names to limit which columns are considered when printing the shape. Applied after fn. |
None
|
Returns:
| Type | Description |
|---|---|
DataFrame
|
The original DataFrame, unchanged. |
Note
See also .check.nrows() and .check.ncols()
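Since fn is applied first and subset afterwards, the reported shape can differ from the full DataFrame in both dimensions. The following plain-Pandas sketch, with invented data, mimics that order of operations; it is an approximation of the documented behavior, not the library's code.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 7.0],
    "sepal_width": [3.5, 3.0, 3.2, 3.2],
    "species": ["setosa", "setosa", "setosa", "versicolor"],
})

fn = lambda df: df.query("sepal_length < 5")  # step 1: filter rows
subset = ["sepal_length", "species"]          # step 2: select columns

full_shape = df.shape
checked_shape = fn(df)[subset].shape

print(full_shape)
print(checked_shape)
```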
tail(n=5, fn=lambda df: df, subset=None, msg=None)
Displays the last n rows of the DataFrame, without modifying the DataFrame itself.
See Pandas docs for tail() for additional usage information, including more options you can pass to this Pandas Checks method.
Example
(
iris
.check.tail(10)
)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
n
|
int
|
Number of rows to show. |
5
|
fn
|
Callable
|
An optional lambda function to apply to the DataFrame before running Pandas tail(). Example: |
lambda df: df
|
subset
|
SubsetTypes
|
Optional column name or names to select before displaying tail. Applied after fn. |
None
|
msg
|
Union[str, None]
|
Optionally customize the text displayed before the result of the check. |
None
|
Returns:
| Type | Description |
|---|---|
DataFrame
|
The original DataFrame, unchanged. |
unique(column, fn=lambda df: df, msg=None)
Displays the unique values in a column, without modifying the DataFrame itself.
See Pandas docs for unique() for additional usage information, including more options you can pass to this Pandas Checks method.
Example
(
iris
.check.unique("species")
)
# The check will print: "🌟 Unique values of species: ['setosa', 'versicolor', 'virginica']"
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
column
|
str
|
Column to check for unique values. |
required |
fn
|
Callable
|
An optional lambda function to apply to the DataFrame before calling Pandas unique(). Example: |
lambda df: df
|
msg
|
Union[str, None]
|
Optionally customize the text displayed before the result of the check. |
None
|
Returns:
| Type | Description |
|---|---|
DataFrame
|
The original DataFrame, unchanged. |
Note
fn is applied to the DataFrame before selecting column. If you want to select the column before modifying it, set column=None and start fn with a column selection, e.g. fn=lambda df: df["my_column"].stuff()
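The order described in this note (fn first, then column selection) can be mimicked in plain Pandas. The data below is invented, and the two-step expression is only an approximation of what the check does.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "species": ["setosa", "versicolor", "virginica", "versicolor"],
    "sepal_length": [5.1, 7.0, 6.3, 5.5],
})

fn = lambda df: df.query("sepal_length < 6")  # step 1: transform the whole DataFrame
column = "species"                            # step 2: select the column

unique_values = list(fn(df)[column].unique())
print(unique_values)
```

Because fn sees the whole DataFrame, it can filter on columns other than the one whose unique values are displayed.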
value_counts(subset=None, column=None, fn=lambda df: df, max_rows=10, msg=None, **kwargs)
Displays the value counts for a column, without modifying the DataFrame itself.
See Pandas docs for value_counts() for additional usage information, including more options you can pass to this Pandas Checks method.
Example
(
iris
.check.value_counts("sepal_length")
)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
subset
|
SubsetTypes
|
Optional column name or names to select before counting values. Applied after fn. |
None
|
column
|
Union[str, None]
|
Column to check for value counts. Applied after fn. Kept for backwards compatibility. |
None
|
max_rows
|
int
|
Maximum number of rows to show in the value counts. |
10
|
fn
|
Callable
|
An optional lambda function to apply to the DataFrame before running Pandas value_counts(). Example: |
lambda df: df
|
msg
|
Union[str, None]
|
Optionally customize the text displayed before the result of the check. |
None
|
**kwargs
|
Any
|
Optional, additional arguments that are accepted by Pandas value_counts() method. |
{}
|
Returns:
| Type | Description |
|---|---|
DataFrame
|
The original DataFrame, unchanged. |
Note
fn is applied to the DataFrame before selecting column. If you want to select the column before modifying it, set column=None and start fn with a column selection, e.g. fn=lambda df: df["my_column"].stuff()
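As a sketch of what the extra keyword arguments and max_rows might look like in practice, the plain-Pandas example below uses normalize=True (a real option of Pandas value_counts()) and head() as an assumed analogue of max_rows. The data is invented for illustration.

```python
import pandas as pd

# Hypothetical toy column
species = pd.Series(
    ["setosa", "setosa", "virginica", "versicolor", "setosa", "virginica"],
    name="species",
)

counts = species.value_counts()                # absolute counts, most frequent first
shares = species.value_counts(normalize=True)  # relative frequencies instead of counts
top_two = counts.head(2)                       # assumed analogue of max_rows=2

print(counts)
print(shares)
print(top_two)
```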
write(path, format=None, fn=lambda df: df, subset=None, verbose=False, **kwargs)
Exports DataFrame to file, without modifying the DataFrame itself.
The file format is inferred from the extension. Supported extensions:
- .csv
- .feather
- .parquet
- .pkl (Pickle)
- .tsv (tab-separated data file)
- .xlsx
This function uses the corresponding Pandas export function, such as to_csv() and to_feather(). See Pandas docs for those corresponding export functions for additional usage information, including more options you can pass to this Pandas Checks method.
Example
(
iris
# Process data
...
# Export the interim data for inspection
.check.write("iris_interim.xlsx")
# Continue processing
...
)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
str
|
Path to write the file to. |
required |
format
|
Union[str, None]
|
Optional file format to force for the export. If None, format is inferred from the file's extension in path. |
None
|
fn
|
Callable
|
An optional lambda function to apply to the DataFrame before exporting. Example: |
lambda df: df
|
subset
|
SubsetTypes
|
Optional column name or names to select before exporting data. Applied after fn. |
None
|
verbose
|
bool
|
Whether to print a message when the file is written. |
False
|
**kwargs
|
Any
|
Optional, additional keyword arguments to pass to the Pandas export function. |
{}
|
Returns:
| Type | Description |
|---|---|
DataFrame
|
The original DataFrame, unchanged. |
Note
Exporting to some formats such as Excel, Feather, and Parquet may require you to install additional packages.
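Extension-based format inference could be sketched as below, using only the standard library. The mapping _WRITERS, the helper infer_writer, and the choice to accept format with or without a leading dot are assumptions for illustration; the method names are real Pandas DataFrame exporters, but how Pandas Checks dispatches to them is not shown in this documentation.

```python
from pathlib import Path

# Assumed mapping from supported extension to the Pandas DataFrame export method
_WRITERS = {
    ".csv": "to_csv",
    ".feather": "to_feather",
    ".parquet": "to_parquet",
    ".pkl": "to_pickle",
    ".tsv": "to_csv",  # tab-separated: to_csv called with sep="\t"
    ".xlsx": "to_excel",
}

def infer_writer(path, format=None):
    """Return the name of the export method for a path (illustrative helper only)."""
    # An explicit format wins; otherwise fall back to the file's extension
    ext = format if format is not None else Path(path).suffix
    ext = ext.lower()
    if not ext.startswith("."):
        ext = "." + ext
    if ext not in _WRITERS:
        raise ValueError(f"Unsupported file format: {ext!r}")
    return _WRITERS[ext]

print(infer_writer("iris_interim.xlsx"))
print(infer_writer("iris_interim.txt", format="tsv"))
```

Passing format explicitly is useful when the file name carries an extension that does not match the desired output, as in the second call.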