DataFrame methods
_obj = pandas_obj
instance-attribute
__init__(pandas_obj)
assert_all_nulls(fail_message=' ㄨ Assert all nulls failed ', pass_message=' ✔️ Assert all nulls passed ', subset=None, raise_exception=True, exception_to_raise=DataError, verbose=False)
Tests whether a DataFrame or a subset of its columns has all nulls. Optionally raises an exception. Does not modify the DataFrame itself.
Example
(
iris
.check.assert_all_nulls(subset=["sepal_length"])
)
# Will raise an exception "ㄨ Assert all nulls failed"
See docs for .check.assert_data() for examples of how to customize assertions.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fail_message | str | Message to display if the condition fails. | ' ㄨ Assert all nulls failed ' |
pass_message | str | Message to display if the condition passes. | ' ✔️ Assert all nulls passed ' |
subset | Union[str, List, None] | Optional, which column or columns to check the condition against. | None |
raise_exception | bool | Whether to raise an exception if the condition fails. | True |
exception_to_raise | Type[BaseException] | The exception to raise if the condition fails and raise_exception is True. | DataError |
verbose | bool | Whether to display the pass message if the condition passes. | False |
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
assert_data(condition, fail_message=' ㄨ Assertion failed ', pass_message=' ✔️ Assertion passed ', subset=None, raise_exception=True, exception_to_raise=DataError, message_shows_condition=True, verbose=False)
Tests whether a DataFrame meets a condition. Optionally raises an exception. Does not modify the DataFrame itself.
Example
# Validate that the Dataframe has at least 2 rows
(
iris
.check.assert_data(lambda df: df.shape[0]>1)
# Or customize the message displayed when the assertion fails
.check.assert_data(lambda df: df.shape[0]>1, "Assertion failed, DataFrame has no rows!")
# Or show a warning instead of raising an exception
.check.assert_data(lambda df: df.shape[0]>1, "FYI DataFrame has no rows", raise_exception=False)
# Or show a message if it passes, and raise a specific exception (ValueError) if it fails.
.check.assert_data(
lambda df: df.shape[0]>1,
fail_message="FYI DataFrame has no rows",
pass_message="DataFrame has rows!",
exception_to_raise=ValueError,
verbose=True # To show pass_message when assertion passes
)
)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
condition | Callable | Assertion criteria in the form of a lambda function. | required |
fail_message | str | Message to display if the condition fails. | ' ㄨ Assertion failed ' |
pass_message | str | Message to display if the condition passes. | ' ✔️ Assertion passed ' |
subset | Union[str, List, None] | Optional, which column or columns to check the condition against. Applied after fn. Subsetting can also be done within the condition. | None |
raise_exception | bool | Whether to raise an exception if the condition fails. | True |
exception_to_raise | Type[BaseException] | The exception to raise if the condition fails and raise_exception is True. | DataError |
message_shows_condition | bool | Whether the fail/pass message should also print the assertion criteria. | True |
verbose | bool | Whether to display the pass message if the condition passes. | False |
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
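The pattern described for assert_data (test a condition, optionally raise, and hand back the same DataFrame so the chain continues) can be approximated in plain pandas with DataFrame.pipe. The sketch below is not the Pandas Checks implementation: check_condition is a hypothetical helper, and the small iris-like frame is made up for illustration.

```python
import pandas as pd


def check_condition(df, condition, fail_message="Assertion failed",
                    raise_exception=True, exception_to_raise=ValueError):
    """Hypothetical helper: test `condition(df)` and return `df` untouched."""
    if not condition(df):
        if raise_exception:
            raise exception_to_raise(fail_message)
        # Non-raising mode: report the failure and keep the chain going
        print(fail_message)
    return df


# Made-up data standing in for the iris dataset
iris = pd.DataFrame({"sepal_length": [5.1, 4.9, 4.7], "species": ["setosa"] * 3})

result = (
    iris
    # Passes silently: the frame has more than one row
    .pipe(check_condition, lambda df: df.shape[0] > 1)
    # Fails, but only prints a message because raise_exception=False
    .pipe(check_condition, lambda df: df.shape[0] > 100,
          fail_message="FYI DataFrame has 100 or fewer rows",
          raise_exception=False)
)
```

Because the helper returns the frame it received, result is the same object as iris, which is the property that lets a check sit in the middle of a method chain.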
assert_datetime(fail_message=None, pass_message=' ✔️ Assert datetime passed ', subset=None, raise_exception=True, exception_to_raise=TypeError, verbose=False)
Tests whether a DataFrame or a subset of its columns is of datetime or timestamp type. Optionally raises an exception. Does not modify the DataFrame itself.
Example
(
df
.check.assert_datetime(subset="datetime_col")
)
See docs for .check.assert_data() for examples of how to customize assertions.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
subset | Union[str, List, None] | Optional, which column or columns to check the condition against. | None |
fail_message | Union[str, None] | Message to display if the condition fails. If None, will report expected vs observed type. | None |
pass_message | str | Message to display if the condition passes. | ' ✔️ Assert datetime passed ' |
raise_exception | bool | Whether to raise an exception if the condition fails. | True |
exception_to_raise | Type[BaseException] | The exception to raise if the condition fails and raise_exception is True. | TypeError |
verbose | bool | Whether to display the pass message if the condition passes. | False |
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
assert_float(fail_message=None, pass_message=' ✔️ Assert float passed ', subset=None, raise_exception=True, exception_to_raise=TypeError, verbose=False)
Tests whether a DataFrame or a subset of its columns is of float type. Optionally raises an exception. Does not modify the DataFrame itself.
Example
(
df
.check.assert_float(subset="float_col")
)
See docs for .check.assert_data() for examples of how to customize assertions.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fail_message | Union[str, None] | Message to display if the condition fails. | None |
pass_message | str | Message to display if the condition passes. | ' ✔️ Assert float passed ' |
subset | Union[str, List, None] | Optional, which column or columns to check the condition against. | None |
raise_exception | bool | Whether to raise an exception if the condition fails. | True |
exception_to_raise | Type[BaseException] | The exception to raise if the condition fails and raise_exception is True. | TypeError |
verbose | bool | Whether to display the pass message if the condition passes. | False |
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
assert_greater_than(min, fail_message=' ㄨ Assert minimum failed ', pass_message=' ✔️ Assert minimum passed ', or_equal_to=False, subset=None, raise_exception=True, exception_to_raise=DataError, verbose=False)
Tests whether all values in a DataFrame or a subset of its columns are > or >= a minimum threshold. Optionally raises an exception. Does not modify the DataFrame itself.
Example
(
iris
# Validate that sepal_length is always greater than 0.1
.check.assert_greater_than(0.1, subset="sepal_length")
# Validate that two columns are each always greater than or equal to 0.1
.check.assert_greater_than(0.1, subset=["sepal_length", "petal_length"], or_equal_to=True)
)
See docs for .check.assert_data() for examples of how to customize assertions.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
min | Any | The minimum value to compare the DataFrame to. Accepts any type that can be used in >, such as int, float, str, datetime. | required |
fail_message | str | Message to display if the condition fails. | ' ㄨ Assert minimum failed ' |
pass_message | str | Message to display if the condition passes. | ' ✔️ Assert minimum passed ' |
or_equal_to | bool | Whether to test for >= min (True) or > min (False). | False |
subset | Union[str, List, None] | Optional, which column or columns to check the condition against. | None |
raise_exception | bool | Whether to raise an exception if the condition fails. | True |
exception_to_raise | Type[BaseException] | The exception to raise if the condition fails and raise_exception is True. | DataError |
verbose | bool | Whether to display the pass message if the condition passes. | False |
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
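The difference between the strict (>) and inclusive (>=) comparisons selected by or_equal_to can be seen with plain pandas. This is only a sketch of the comparison itself, not of how Pandas Checks evaluates it; the data is made up so that one value sits exactly on the threshold.

```python
import pandas as pd

# Made-up data: one petal_length value is exactly equal to the threshold
iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9],
    "petal_length": [1.4, 0.1],
})
cols = ["sepal_length", "petal_length"]
threshold = 0.1

# Strict comparison, comparable to or_equal_to=False
strictly_greater = bool((iris[cols] > threshold).all().all())

# Inclusive comparison, comparable to or_equal_to=True
greater_or_equal = bool((iris[cols] >= threshold).all().all())

print(strictly_greater, greater_or_equal)
```

With this data the strict test fails and the inclusive test passes, because the only value that is not above the threshold is equal to it.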
assert_int(fail_message=None, pass_message=' ✔️ Assert integeer passed ', subset=None, raise_exception=True, exception_to_raise=TypeError, verbose=False)
Tests whether a DataFrame or a subset of its columns is of integer type. Optionally raises an exception. Does not modify the DataFrame itself.
Example
(
df
.check.assert_int(subset="int_col")
)
See docs for .check.assert_data() for examples of how to customize assertions.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fail_message | Union[str, None] | Message to display if the condition fails. | None |
pass_message | str | Message to display if the condition passes. | ' ✔️ Assert integeer passed ' |
subset | Union[str, List, None] | Optional, which column or columns to check the condition against. | None |
raise_exception | bool | Whether to raise an exception if the condition fails. | True |
exception_to_raise | Type[BaseException] | The exception to raise if the condition fails and raise_exception is True. | TypeError |
verbose | bool | Whether to display the pass message if the condition passes. | False |
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
assert_less_than(max, fail_message=' ㄨ Assert maximum failed ', pass_message=' ✔️ Assert maximum passed ', or_equal_to=False, subset=None, raise_exception=True, exception_to_raise=DataError, verbose=False)
Tests whether all values in a DataFrame or a subset of its columns are < or <= a maximum threshold. Optionally raises an exception. Does not modify the DataFrame itself.
Example
(
iris
# Validate that sepal_length is always < 1000
.check.assert_less_than(1000, subset="sepal_length")
# Validate that two columns are each always less than or equal to 1000
.check.assert_less_than(1000, subset=["sepal_length", "petal_length"], or_equal_to=True)
)
See docs for .check.assert_data() for examples of how to customize assertions.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
max | Any | The maximum value to compare the DataFrame to. Accepts any type that can be used in <, such as int, float, str, datetime. | required |
or_equal_to | bool | Whether to test for <= max (True) or < max (False). | False |
fail_message | str | Message to display if the condition fails. | ' ㄨ Assert maximum failed ' |
pass_message | str | Message to display if the condition passes. | ' ✔️ Assert maximum passed ' |
subset | Union[str, List, None] | Optional, which column or columns to check the condition against. | None |
raise_exception | bool | Whether to raise an exception if the condition fails. | True |
exception_to_raise | Type[BaseException] | The exception to raise if the condition fails and raise_exception is True. | DataError |
verbose | bool | Whether to display the pass message if the condition passes. | False |
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
assert_negative(fail_message=' ㄨ Assert negative failed ', pass_message=' ✔️ Assert negative passed ', subset=None, assert_no_nulls=True, raise_exception=True, exception_to_raise=DataError, verbose=False)
Tests whether a DataFrame or a subset of its columns has all negative values. Optionally raises an exception. Does not modify the DataFrame itself.
Example
(
df
.check.assert_negative(subset="column_name")
)
See docs for .check.assert_data() for examples of how to customize assertions.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fail_message | str | Message to display if the condition fails. | ' ㄨ Assert negative failed ' |
pass_message | str | Message to display if the condition passes. | ' ✔️ Assert negative passed ' |
subset | Union[str, List, None] | Optional, which column or columns to check the condition against. | None |
assert_no_nulls | bool | Whether to also enforce that data has no nulls. | True |
raise_exception | bool | Whether to raise an exception if the condition fails. | True |
exception_to_raise | Type[BaseException] | The exception to raise if the condition fails and raise_exception is True. | DataError |
verbose | bool | Whether to display the pass message if the condition passes. | False |
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
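Why a sign check might come with a separate null option such as assert_no_nulls is easier to see in plain pandas, where any comparison against NaN evaluates to False. The sketch below only illustrates that pandas behavior with a made-up Series; it does not show how Pandas Checks handles nulls internally.

```python
import pandas as pd

# Made-up column with one missing value
values = pd.Series([-1.0, None, -3.0])

# NaN < 0 is False, so a naive check fails even though every
# non-null value is negative
all_negative_including_nulls = bool((values < 0).all())

# Dropping nulls first tests only the values that are present
all_negative_ignoring_nulls = bool((values.dropna() < 0).all())

# A separate test reports whether any nulls exist at all
has_nulls = bool(values.isna().any())

print(all_negative_including_nulls, all_negative_ignoring_nulls, has_nulls)
```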
assert_no_nulls(fail_message=' ㄨ Assert no nulls failed ', pass_message=' ✔️ Assert no nulls passed ', subset=None, raise_exception=True, exception_to_raise=DataError, verbose=False)
Tests whether a DataFrame or a subset of its columns has no nulls. Optionally raises an exception. Does not modify the DataFrame itself.
Example
(
iris
.check.assert_no_nulls(subset=["sepal_length"])
)
See docs for .check.assert_data() for examples of how to customize assertions.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fail_message | str | Message to display if the condition fails. | ' ㄨ Assert no nulls failed ' |
pass_message | str | Message to display if the condition passes. | ' ✔️ Assert no nulls passed ' |
subset | Union[str, List, None] | Optional, which column or columns to check the condition against. | None |
raise_exception | bool | Whether to raise an exception if the condition fails. | True |
exception_to_raise | Type[BaseException] | The exception to raise if the condition fails and raise_exception is True. | DataError |
verbose | bool | Whether to display the pass message if the condition passes. | False |
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
assert_nrows(nrows, fail_message=' ㄨ Assert nrows failed ', pass_message=' ✔️ Assert nrows passed ', raise_exception=True, exception_to_raise=DataError, verbose=False)
Tests whether a DataFrame has a given number of rows. Optionally raises an exception. Does not modify the DataFrame itself.
Example
(
iris
.check.assert_nrows(20)
)
See docs for .check.assert_data() for examples of how to customize assertions.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
nrows | int | The expected number of rows. | required |
fail_message | str | Message to display if the condition fails. | ' ㄨ Assert nrows failed ' |
pass_message | str | Message to display if the condition passes. | ' ✔️ Assert nrows passed ' |
raise_exception | bool | Whether to raise an exception if the condition fails. | True |
exception_to_raise | Type[BaseException] | The exception to raise if the condition fails and raise_exception is True. | DataError |
verbose | bool | Whether to display the pass message if the condition passes. | False |
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
assert_positive(fail_message=' ㄨ Assert positive failed ', pass_message=' ✔️ Assert positive passed ', subset=None, assert_no_nulls=True, raise_exception=True, exception_to_raise=DataError, verbose=False)
Tests whether a DataFrame or a subset of its columns has all positive values. Optionally raises an exception. Does not modify the DataFrame itself.
Example
(
iris
.check.assert_positive(subset=["sepal_length"])
)
See docs for .check.assert_data() for examples of how to customize assertions.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fail_message | str | Message to display if the condition fails. | ' ㄨ Assert positive failed ' |
pass_message | str | Message to display if the condition passes. | ' ✔️ Assert positive passed ' |
subset | Union[str, List, None] | Optional, which column or columns to check the condition against. | None |
assert_no_nulls | bool | Whether to also enforce that data has no nulls. | True |
raise_exception | bool | Whether to raise an exception if the condition fails. | True |
exception_to_raise | Type[BaseException] | The exception to raise if the condition fails and raise_exception is True. | DataError |
verbose | bool | Whether to display the pass message if the condition passes. | False |
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
assert_same_nrows(other, fail_message=' ㄨ Assert same_nrows failed ', pass_message=' ✔️ Assert same_nrows passed ', raise_exception=True, exception_to_raise=DataError, verbose=False)
Tests whether a DataFrame has the same number of rows as another DataFrame or Series. Optionally raises an exception. Does not modify the DataFrame itself.
Example
# Validate that an expected one-to-one join didn't add rows due to duplicate keys in the right table.
(
transactions_df
.merge(how="left", right=products_df, on="product_id")
.check.assert_same_nrows(transactions_df, "Left join changed row count! Check for duplicate `product_id` keys in products_df.")
)
See docs for .check.assert_data() for examples of how to customize assertions.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
other | Union[DataFrame, Series] | The DataFrame or Series that is expected to have the same number of rows as this DataFrame. | required |
fail_message | str | Message to display if the condition fails. | ' ㄨ Assert same_nrows failed ' |
pass_message | str | Message to display if the condition passes. | ' ✔️ Assert same_nrows passed ' |
raise_exception | bool | Whether to raise an exception if the condition fails. | True |
exception_to_raise | Type[BaseException] | The exception to raise if the condition fails and raise_exception is True. | DataError |
verbose | bool | Whether to display the pass message if the condition passes. | False |
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
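The failure mode that the assert_same_nrows example guards against, a left join that silently adds rows because the right table has duplicate keys, can be reproduced in plain pandas. The tables below are made up for illustration, and the row counts are compared with len rather than with Pandas Checks.

```python
import pandas as pd

# Made-up tables: product_id 10 appears twice in products_df
transactions_df = pd.DataFrame({
    "transaction_id": [1, 2, 3],
    "product_id": [10, 11, 10],
})
products_df = pd.DataFrame({
    "product_id": [10, 10, 11],
    "name": ["pen", "pen (duplicate)", "ink"],
})

# Each transaction for product 10 matches two product rows
merged = transactions_df.merge(how="left", right=products_df, on="product_id")
print(len(transactions_df), len(merged))

# Removing the duplicate key restores the one-to-one row count
deduped = products_df.drop_duplicates(subset="product_id")
merged_clean = transactions_df.merge(how="left", right=deduped, on="product_id")
print(len(merged_clean))
```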
assert_str(fail_message=None, pass_message=' ✔️ Assert string passed ', subset=None, raise_exception=True, exception_to_raise=TypeError, verbose=False)
Tests whether a DataFrame or a subset of its columns is of string type. Optionally raises an exception. Does not modify the DataFrame itself.
Example
(
iris
.check.assert_str(subset=["species", "another_string_column"])
)
See docs for .check.assert_data() for examples of how to customize assertions.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fail_message | Union[str, None] | Message to display if the condition fails. If None, will report expected vs observed type. | None |
pass_message | str | Message to display if the condition passes. | ' ✔️ Assert string passed ' |
subset | Union[str, List, None] | Optional, which column or columns to check the condition against. | None |
raise_exception | bool | Whether to raise an exception if the condition fails. | True |
exception_to_raise | Type[BaseException] | The exception to raise if the condition fails and raise_exception is True. | TypeError |
verbose | bool | Whether to display the pass message if the condition passes. | False |
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
assert_timedelta(fail_message=None, pass_message=' ✔️ Assert timedelta passed ', subset=None, raise_exception=True, exception_to_raise=TypeError, verbose=False)
Tests whether a DataFrame or a subset of its columns is of type timedelta. Optionally raises an exception. Does not modify the DataFrame itself.
Example
(
df
.check.assert_timedelta(subset=["timedelta_col"])
)
See docs for .check.assert_data() for examples of how to customize assertions.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fail_message | Union[str, None] | Message to display if the condition fails. If None, will report expected vs observed type. | None |
pass_message | str | Message to display if the condition passes. | ' ✔️ Assert timedelta passed ' |
subset | Union[str, List, None] | Optional, which column or columns to check the condition against. | None |
raise_exception | bool | Whether to raise an exception if the condition fails. | True |
exception_to_raise | Type[BaseException] | The exception to raise if the condition fails and raise_exception is True. | TypeError |
verbose | bool | Whether to display the pass message if the condition passes. | False |
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
assert_type(dtype, fail_message=None, pass_message=' ✔️ Assert type passed ', subset=None, raise_exception=True, exception_to_raise=TypeError, verbose=False)
Tests whether a DataFrame or a subset of its columns meets a type assumption. Optionally raises an exception. Does not modify the DataFrame itself.
Example
# Validate that a column of mixed types has overall type `object`
(
iris
.check.assert_type(object, subset="column_with_mixed_types")
)
See docs for .check.assert_data() for examples of how to customize assertions.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dtype | Type[Any] | The required variable type. | required |
fail_message | Union[str, None] | Message to display if the condition fails. If None, will report expected vs observed type. | None |
pass_message | str | Message to display if the condition passes. | ' ✔️ Assert type passed ' |
subset | Union[str, List, None] | Optional, which column or columns to check the condition against. | None |
raise_exception | bool | Whether to raise an exception if the condition fails. | True |
exception_to_raise | Type[BaseException] | The exception to raise if the condition fails and raise_exception is True. | TypeError |
verbose | bool | Whether to display the pass message if the condition passes. | False |
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
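The assert_type example relies on pandas storing a column of mixed Python types with the object dtype. The sketch below shows that behavior with plain pandas dtype inspection on made-up data; it does not show how Pandas Checks performs its own type comparison.

```python
import pandas as pd

# Made-up frame: one column mixes int, str, and float values
df = pd.DataFrame({
    "column_with_mixed_types": [1, "two", 3.0],
    "int_col": [1, 2, 3],
})

# A column holding several Python types falls back to the object dtype
mixed_is_object = df["column_with_mixed_types"].dtype == object

# A homogeneous integer column keeps a numeric dtype instead
int_is_object = df["int_col"].dtype == object
int_is_integer = pd.api.types.is_integer_dtype(df["int_col"])

print(df.dtypes)
```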
assert_unique(fail_message=' ㄨ Assert unique failed ', pass_message=' ✔️ Assert unique passed ', subset=None, raise_exception=True, exception_to_raise=DataError, verbose=False)
Validates that a subset of columns has no duplicate values, or validates that a DataFrame has no duplicate rows. Optionally raises an exception. Does not modify the DataFrame itself.
Example
(
df
# Validate that a column has no duplicate values
.check.assert_unique(subset="id_column")
# Validate that a DataFrame has no duplicate rows
.check.assert_unique()
)
See docs for .check.assert_data() for examples of how to customize assertions.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fail_message | str | Message to display if the condition fails. | ' ㄨ Assert unique failed ' |
pass_message | str | Message to display if the condition passes. | ' ✔️ Assert unique passed ' |
subset | Union[str, List, None] | Optional, which column or columns to check the condition against. | None |
raise_exception | bool | Whether to raise an exception if the condition fails. | True |
exception_to_raise | Type[BaseException] | The exception to raise if the condition fails and raise_exception is True. | DataError |
verbose | bool | Whether to display the pass message if the condition passes. | False |
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
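The two cases named for assert_unique, duplicate values within selected columns versus duplicate whole rows, correspond to calling pandas duplicated() with and without a subset. The sketch below uses plain pandas on made-up data and is not the Pandas Checks implementation.

```python
import pandas as pd

# Made-up frame: ids are unique, group labels repeat, no full row repeats
df = pd.DataFrame({
    "id_column": [1, 2, 3],
    "group": ["a", "a", "b"],
})

# Uniqueness within a single column
id_is_unique = not df.duplicated(subset="id_column").any()
group_is_unique = not df.duplicated(subset="group").any()

# Uniqueness of entire rows (no subset given)
rows_are_unique = not df.duplicated().any()

print(id_is_unique, group_is_unique, rows_are_unique)
```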
columns(fn=lambda df: df, subset=None, check_name='🏛️ Columns')
Prints the column names of a DataFrame, without modifying the DataFrame itself.
Example
(
df
.check.columns()
)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fn | Callable | An optional lambda function to apply to the DataFrame before printing columns. | lambda df: df |
subset | Union[str, List, None] | An optional list of column names or a string to select a subset of columns before printing their names. Applied after fn. | None |
check_name | Union[str, None] | An optional name for the check to preface the result with. | '🏛️ Columns' |
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
describe(fn=lambda df: df, subset=None, check_name='📏 Distributions', **kwargs)
Displays descriptive statistics about a DataFrame without modifying the DataFrame itself.
See Pandas docs for describe() for additional usage information, including more configuration options you can pass to this Pandas Checks method.
Example
(
df
.check.describe()
)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fn | Callable | An optional lambda function to apply to the DataFrame before running Pandas describe(). | lambda df: df |
subset | Union[str, List, None] | An optional list of column names or a string to select a subset of columns before running Pandas describe(). Applied after fn. | None |
check_name | Union[str, None] | An optional name for the check to preface the result with. | '📏 Distributions' |
**kwargs | Any | Optional, additional arguments that are accepted by Pandas describe() method. | {} |
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
disable_checks(enable_asserts=True)
Turns off Pandas Checks globally, such as in production mode. Calls to .check functions will not be run. Does not modify the DataFrame itself.
Example
(
iris
.check.disable_checks()
.check.assert_data(lambda df: df.shape[0]>10) # This check will NOT be run
.check.enable_checks() # Subsequent calls to .check will be run
)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
enable_asserts | bool | Optionally, whether to also enable or disable assert statements. | True |
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
dtypes(fn=lambda df: df, subset=None, check_name='🗂️ Data types')
Displays the data types of a DataFrame's columns without modifying the DataFrame itself.
Example
(
iris
.check.dtypes()
)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fn | Callable | An optional lambda function to apply to the DataFrame before running Pandas dtypes. | lambda df: df |
subset | Union[str, List, None] | An optional list of column names or a string to select a subset of columns before running Pandas .dtypes. Applied after fn. | None |
check_name | Union[str, None] | An optional name for the check to preface the result with. | '🗂️ Data types' |
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
enable_checks(enable_asserts=True)
Globally enables Pandas Checks. Subsequent calls to .check methods will be run. Does not modify the DataFrame itself.
Example
(
iris
["sepal_length"]
.check.disable_checks()
.check.assert_data(lambda s: s.shape[0]>10) # This check will NOT be run
.check.enable_checks() # Subsequent calls to .check will be run
)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
enable_asserts | bool | Optionally, whether to globally enable or disable calls to .check.assert_data(). | True |
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
function(fn=lambda df: df, subset=None, check_name=None)
Applies an arbitrary function on a DataFrame and shows the result, without modifying the DataFrame itself.
Example
(
iris
.check.function(fn=lambda df: df.shape[0]>10, check_name='Has at least 10 rows?')
)
# Will display either 'True' or 'False'
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fn | Callable | A lambda function to apply to the DataFrame. | lambda df: df |
subset | Union[str, List, None] | An optional list of column names or a string to select a subset of columns. Applied after fn. | None |
check_name | Union[str, None] | An optional name for the check to preface the result with. | None |
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
get_mode(check_name='🐼🩺 Pandas Checks mode')
Displays the current values of Pandas Checks global options enable_checks and enable_asserts. Does not modify the DataFrame itself.
Example
(
iris
.check.get_mode()
)
# The check will print:
# "🐼🩺 Pandas Checks mode: {'enable_checks': True, 'enable_asserts': True}"
Parameters:
Name | Type | Description | Default |
---|---|---|---|
check_name | Union[str, None] | An optional name for the check. Will be used as a preface to the printed result. | '🐼🩺 Pandas Checks mode' |
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
head(n=5, fn=lambda df: df, subset=None, check_name=None)
Displays the first n rows of a DataFrame, without modifying the DataFrame itself.
See Pandas docs for head() for additional usage information.
Example
(
iris
.check.head(10)
)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
n | int | The number of rows to display. | 5 |
fn | Callable | An optional lambda function to apply to the DataFrame before running Pandas head(). | lambda df: df |
subset | Union[str, List, None] | An optional list of column names or a string to select a subset of columns before running Pandas head(). Applied after fn. | None |
check_name | Union[str, None] | An optional name for the check, to be printed as preface to the result. | None |
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
hist(fn=lambda df: df, subset=[], check_name=None, **kwargs)
Displays a histogram for the DataFrame, without modifying the DataFrame itself.
See Pandas docs for hist() for additional usage information, including more configuration options you can pass to this Pandas Checks method.
Example
(
iris
.check.hist(subset=["sepal_length", "sepal_width"])
)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fn | Callable | An optional lambda function to apply to the DataFrame before running Pandas hist(). | lambda df: df |
subset | Union[str, List, None] | An optional list of column names or a string to select a subset of columns before running Pandas hist(). Applied after fn. | [] |
check_name | Union[str, None] | An optional name for the check, to be printed as preface to the result. | None |
**kwargs | Any | Optional, additional arguments that are accepted by Pandas hist() method. | {} |
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
Note
If more than one column is passed, displays a grid of histograms.
Only renders in interactive mode (IPython/Jupyter), not in terminal.
info(fn=lambda df: df, subset=None, check_name='ℹ️ Info', **kwargs)
Displays summary information about a DataFrame, without modifying the DataFrame itself.
See Pandas docs for info() for additional usage information, including more configuration options you can pass to this Pandas Checks method.
Example
(
iris
.check.info()
)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fn | Callable | An optional lambda function to apply to the DataFrame before running Pandas info(). | lambda df: df |
subset | Union[str, List, None] | An optional list of column names or a string to select a subset of columns before running Pandas info(). Applied after fn. | None |
check_name | Union[str, None] | An optional name for the check, to be printed as preface to the result. | 'ℹ️ Info' |
**kwargs | Any | Optional, additional arguments that are accepted by Pandas info() method. | {} |
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
memory_usage(fn=lambda df: df, subset=None, check_name='💾 Memory usage', **kwargs)
Displays the memory footprint of a DataFrame, without modifying the DataFrame itself.
See Pandas docs for memory_usage() for additional usage information, including more configuration options you can pass to this Pandas Checks method.
Example
(
iris
.check.memory_usage()
)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fn | Callable | An optional lambda function to apply to the DataFrame before running Pandas memory_usage(). | lambda df: df |
subset | Union[str, List, None] | An optional list of column names or a string to select a subset of columns before running Pandas memory_usage(). Applied after fn. | None |
check_name | Union[str, None] | An optional name for the check, to be printed as preface to the result. | '💾 Memory usage' |
**kwargs | Any | Optional, additional arguments that are accepted by Pandas memory_usage() method. | {} |
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
Note
Include argument deep=True to get further memory usage of object dtypes in the DataFrame. See Pandas docs for memory_usage() for more info.
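The effect of deep=True mentioned in the note can be observed directly with pandas memory_usage(). The sketch below uses a made-up frame and forces the string column to the object dtype, since that is the case the note refers to; exact byte counts depend on the platform and pandas version, so none are shown.

```python
import pandas as pd

# Made-up frame; dtype=object is set explicitly so the column stores
# references to Python strings rather than a dedicated string dtype
df = pd.DataFrame({
    "n": [1, 2, 3],
    "s": pd.Series(["a" * 50, "b" * 50, "c" * 50], dtype=object),
})

# Shallow: object columns are counted as one pointer per row
shallow = df.memory_usage()

# Deep: the size of each referenced Python string is included too
deep = df.memory_usage(deep=True)

print(shallow)
print(deep)
```

Only the object column changes between the two results; the integer column reports the same size either way.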
ncols(fn=lambda df: df, subset=None, check_name='🏛️ Columns')
Displays the number of columns in a DataFrame, without modifying the DataFrame itself.
Example
(
iris
.check.ncols()
)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fn | Callable | An optional lambda function to apply to the DataFrame before counting the number of columns. | lambda df: df |
subset | Union[str, List, None] | An optional list of column names or a string to select a subset of columns before counting the number of columns. Applied after fn. | None |
check_name | Union[str, None] | An optional name for the check, to be printed as preface to the result. | '🏛️ Columns' |
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
ndups(fn=lambda df: df, subset=None, check_name=None, **kwargs)
Displays the number of duplicated rows in a DataFrame, without modifying the DataFrame itself.
See Pandas docs for duplicated() for additional usage information, including more configuration options you can pass to this Pandas Checks method.
Example
# Count the number of rows with duplicate pairs of values across two columns:
(
iris
.check.ndups(subset=["sepal_length", "sepal_width"])
)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fn | Callable | An optional lambda function to apply to the DataFrame before counting the number of duplicates. Example: | lambda df: df |
subset | Union[str, List, None] | An optional list of column names or a string to select a subset of columns before counting duplicate rows. Applied after fn. | None |
check_name | Union[str, None] | An optional name for the check, to be printed as preface to the result. | None |
**kwargs | Any | Optional, additional arguments that are accepted by Pandas duplicated() method. | {} |
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
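As a rough illustration of the count this check reports, the sketch below uses plain Pandas duplicated() on a small made-up DataFrame. It is not the Pandas Checks implementation. The `keep` argument belongs to Pandas duplicated(), so it is the kind of option the **kwargs parameter is described as accepting.

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 share the same pair of values
df = pd.DataFrame({
    "sepal_length": [5.1, 5.1, 4.9, 5.1],
    "sepal_width": [3.5, 3.5, 3.0, 3.8],
})

pair = ["sepal_length", "sepal_width"]

# Default keep="first": the first occurrence of each pair is not counted
n_dups = int(df.duplicated(subset=pair).sum())

# keep=False counts every row that belongs to a duplicated group
n_dups_all = int(df.duplicated(subset=pair, keep=False).sum())

print(n_dups, n_dups_all)  # 1 2
```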
nnulls(fn=lambda df: df, subset=None, by_column=True, check_name='👻 Rows with NaNs')
Displays the number of rows with null values in a DataFrame, without modifying the DataFrame itself.
See Pandas docs for isna() for additional usage information.
Example
# Count the number of rows that have any nulls, one count per column
(
iris
.check.nnulls()
)
# Count the number of rows in the DataFrame that have a null in any column
(
iris
.check.nnulls(by_column=False)
)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fn | Callable | An optional lambda function to apply to the DataFrame before counting the number of rows with a null. Example: | lambda df: df |
subset | Union[str, List, None] | An optional list of column names or a string to select a subset of columns before counting nulls. | None |
by_column | bool | If True, count null values within each column separately. If False, count rows with a null value in any column. Applied after fn. | True |
check_name | Union[str, None] | An optional name for the check, to be printed as preface to the result. | '👻 Rows with NaNs' |
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
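The two counting modes of by_column can be reproduced with plain Pandas isna(). This is only a sketch of the arithmetic on made-up data, not the library's own code.

```python
import pandas as pd

# Hypothetical data with nulls spread across two columns
df = pd.DataFrame({
    "sepal_length": [5.1, None, 4.7, None],
    "sepal_width": [3.5, 3.0, None, None],
})

# Like by_column=True: one null count per column
nulls_per_column = {col: int(n) for col, n in df.isna().sum().items()}

# Like by_column=False: rows that have a null in any column
rows_with_any_null = int(df.isna().any(axis=1).sum())

print(nulls_per_column)    # {'sepal_length': 2, 'sepal_width': 2}
print(rows_with_any_null)  # 3
```

The per-column counts add up to 4, but only 3 rows are affected, because the last row is null in both columns.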
nrows(fn=lambda df: df, subset=None, check_name='☰ Rows')
Displays the number of rows in a DataFrame, without modifying the DataFrame itself.
Example
(
iris
.check.nrows()
)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fn | Callable | An optional lambda function to apply to the DataFrame before counting the number of rows. Example: | lambda df: df |
subset | Union[str, List, None] | An optional list of column names or a string name of one column to limit which columns are considered when counting rows. Applied after fn. | None |
check_name | Union[str, None] | An optional name for the check, to be printed as preface to the result. | '☰ Rows' |
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
nunique(column, fn=lambda df: df, check_name=None, **kwargs)
Displays the number of unique values in a single column, without modifying the DataFrame itself.
See Pandas docs for nunique() for additional usage information, including more configuration options you can pass to this Pandas Checks method.
Example
(
iris
.check.nunique(column="sepal_width")
)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
column | str | The name of a column to count uniques in. Applied after fn. | required |
fn | Callable | An optional lambda function to apply to the DataFrame before running Pandas nunique(). Example: | lambda df: df |
check_name | Union[str, None] | An optional name for the check, to be printed as preface to the result. | None |
**kwargs | Any | Optional, additional arguments that are accepted by Pandas nunique() method. | {} |
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
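One option that Pandas nunique() accepts is `dropna`, which controls whether a missing value counts as its own unique value. The sketch below shows the effect with plain Pandas on made-up data; passing it through this check's **kwargs is assumed to behave the same way.

```python
import pandas as pd

# Hypothetical column with one repeated value and one missing value
df = pd.DataFrame({"sepal_width": [3.5, 3.0, 3.5, None]})

# Default dropna=True: the missing value is ignored
n_unique = df["sepal_width"].nunique()

# dropna=False: the missing value counts as one more unique value
n_unique_with_nan = df["sepal_width"].nunique(dropna=False)

print(n_unique, n_unique_with_nan)  # 2 3
```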
plot(fn=lambda df: df, subset=None, check_name='', **kwargs)
Displays a plot of the DataFrame, without modifying the DataFrame itself.
See Pandas docs for plot() for additional usage information, including more configuration options you can pass to this Pandas Checks method.
Example
(
iris
.check.plot(kind="scatter", x="sepal_width", y="sepal_length", title="Sepal width vs sepal length")
)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fn | Callable | An optional lambda function to apply to the DataFrame before running Pandas plot(). Example: | lambda df: df |
subset | Union[str, List, None] | An optional list of column names or a string name of one column to limit which columns are plotted. Applied after fn. | None |
check_name | Union[str, None] | An optional title for the plot. | '' |
**kwargs | Any | Optional, additional arguments that are accepted by Pandas plot() method. | {} |
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
Note
Plots are only displayed when code is run in IPython/Jupyter, not in a terminal.
If you pass a 'title' kwarg, it becomes the plot title, overriding check_name.
print(object=None, fn=lambda df: df, subset=None, check_name=None, max_rows=10)
Displays text, another object, or (by default) the current DataFrame's head. Does not modify the DataFrame itself.
Example
# Print messages and milestones
(
iris
.check.print("Starting data cleaning...")
...
)
# Inspect a DataFrame, such as the interim result of data processing
(
iris
...
.check.print(fn=lambda df: df.query("sepal_width<0"), check_name="Rows with negative sepal_width")
)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
object | Any | Object to print. Can be anything printable: str, int, list, another DataFrame, etc. If None, print the DataFrame's head (with | None |
fn | Callable | An optional lambda function to apply to the DataFrame before printing | lambda df: df |
subset | Union[str, List, None] | An optional list of column names or a string name of one column to limit which columns are printed. Applied after fn. | None |
check_name | Union[str, None] | An optional name for the check, to be printed as preface to the result. | None |
max_rows | int | Maximum number of rows to print if object=None. | 10 |
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
print_time_elapsed(start_time, lead_in='Time elapsed', units='auto')
Displays the time elapsed since start_time.
Example
import pandas_checks as pdc
start_time = pdc.start_timer()
(
iris
... # Do some data processing
.check.print_time_elapsed(start_time, "Cleaning took")
... # Do more
.check.print_time_elapsed(start_time, "Processing total time", units="seconds") # Force units to stay in seconds
)
# Result: "Cleaning took: 17.298324584960938 seconds"
# "Processing total time: 71.0400543212890625 seconds"
Parameters:
Name | Type | Description | Default |
---|---|---|---|
start_time | float | The index time when the stopwatch started, which comes from the Pandas Checks start_timer() | required |
lead_in | Union[str, None] | Optional text to print before the elapsed time. | 'Time elapsed' |
units | str | The units in which to display the elapsed time. Allowed values: "auto", "milliseconds", "seconds", "minutes", "hours" or shorthands "ms", "s", "m", "h". | 'auto' |
Raises:
Type | Description |
---|---|
ValueError | If |
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
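The units parameter can be pictured with a small standalone helper. Everything below is a hypothetical sketch: `format_elapsed` is not part of Pandas Checks, `time.time()` stands in for start_timer(), and the rule used for "auto" (pick the largest unit in which the value is at least 1) is an assumption, since this page does not describe how "auto" chooses.

```python
import time

# Shorthands mapped to the long unit names listed in the parameter table
_LONG_NAMES = {"ms": "milliseconds", "s": "seconds", "m": "minutes", "h": "hours"}
_SECONDS_PER_UNIT = {
    "milliseconds": 0.001,
    "seconds": 1.0,
    "minutes": 60.0,
    "hours": 3600.0,
}


def format_elapsed(elapsed_seconds: float, units: str = "auto") -> str:
    """Hypothetical helper: render a duration in the requested units."""
    units = _LONG_NAMES.get(units, units)
    if units == "auto":
        # Assumption: choose the largest unit that keeps the value >= 1
        if elapsed_seconds >= 3600:
            units = "hours"
        elif elapsed_seconds >= 60:
            units = "minutes"
        elif elapsed_seconds >= 1:
            units = "seconds"
        else:
            units = "milliseconds"
    if units not in _SECONDS_PER_UNIT:
        # Rejecting unknown units is this sketch's own choice
        raise ValueError(f"Unsupported units: {units!r}")
    value = elapsed_seconds / _SECONDS_PER_UNIT[units]
    return f"{value:g} {units}"


start = time.time()  # stand-in for a start_timer() call
print("Time elapsed:", format_elapsed(time.time() - start))
print(format_elapsed(90))             # 1.5 minutes
print(format_elapsed(90, units="s"))  # 90 seconds
```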
reset_format()
Globally restores all Pandas Checks formatting options to their default "factory" settings. Does not modify the DataFrame itself.
Example
(
iris
.check.set_format(precision=9, use_emojis=False)
# Print DF summary stats with precision 9 digits and no Pandas Checks emojis
.check.describe()
.check.reset_format() # Go back to default precision and emojis 🥳
)
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
set_format(**kwargs)
Configures selected formatting options for Pandas Checks. Does not modify the DataFrame itself.
Run pandas_checks.describe_options() to see a list of available options.
Example
(
iris
.check.set_format(precision=9, use_emojis=False)
# Print DF summary stats with precision 9 digits and no Pandas Checks emojis
.check.describe()
.check.reset_format() # Go back to default precision and emojis 🥳
)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
**kwargs | Any | Pairs of setting name and its new value. | {} |
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
set_mode(enable_checks, enable_asserts)
Configures the operation mode for Pandas Checks globally. Does not modify the DataFrame itself.
Example
# Disable checks except keep running assertions. Same as using `.check.disable_checks()`:
(
iris
.check.set_mode(enable_checks=False, enable_asserts=True)
.check.describe() # This check will not be run
.check.assert_data(lambda df: df.shape[0]>10) # This assertion will still be run
)
# Disable checks _and_ assertions
(
iris
.check.set_mode(enable_checks=False, enable_asserts=False)
)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
enable_checks | bool | Whether to run any Pandas Checks methods globally. Does not affect .check.assert_*(). | required |
enable_asserts | bool | Whether to run calls to Pandas Checks .check.assert_*() statements globally. | required |
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
shape(fn=lambda df: df, subset=None, check_name='📐 Shape')
Displays the DataFrame's dimensions, without modifying the DataFrame itself.
Example
(
iris
.check.shape()
.check.shape(fn=lambda df: df.query("sepal_length<5"), check_name="Shape of DataFrame subgroup with sepal_length<5")
)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fn | Callable | An optional lambda function to apply to the DataFrame before running Pandas | lambda df: df |
subset | Union[str, List, None] | An optional list of column names or a string name of one column to limit which columns are considered when printing the shape. Applied after fn. | None |
check_name | Union[str, None] | An optional name for the check, to be printed as preface to the result. | '📐 Shape' |
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
Note
See also .check.nrows() and .check.ncols()
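The order "fn first, then subset" affects what shape is reported. A minimal sketch with plain Pandas on a made-up DataFrame (not the library's own code) makes the order explicit:

```python
import pandas as pd

# Hypothetical data standing in for the iris DataFrame
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 6.3],
    "sepal_width": [3.5, 3.0, 3.2, 3.3],
    "species": ["setosa", "setosa", "setosa", "virginica"],
})

fn = lambda df: df.query("sepal_length < 5")  # applied first: filters rows
subset = ["sepal_length", "species"]          # applied after fn: selects columns

shape = fn(df)[subset].shape
print(shape)  # (2, 2)
```

Because fn runs before the column selection, it can still filter on a column that subset later drops.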
tail(n=5, fn=lambda df: df, subset=None, check_name=None)
Displays the last n rows of the DataFrame, without modifying the DataFrame itself.
See Pandas docs for tail() for additional usage information.
Example
(
iris
.check.tail(10)
)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
n | int | Number of rows to show. | 5 |
fn | Callable | An optional lambda function to apply to the DataFrame before running Pandas tail(). Example: | lambda df: df |
subset | Union[str, List, None] | An optional list of column names or a string name of one column to limit which columns are displayed. Applied after fn. | None |
check_name | Union[str, None] | An optional name for the check, to be printed as preface to the result. | None |
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
unique(column, fn=lambda df: df, check_name=None)
Displays the unique values in a column, without modifying the DataFrame itself.
See Pandas docs for unique() for additional usage information.
Example
(
iris
.check.unique("species")
)
# The check will print: "🌟 Unique values of species: ['setosa', 'versicolor', 'virginica']"
Parameters:
Name | Type | Description | Default |
---|---|---|---|
column | str | Column to check for unique values. | required |
fn | Callable | An optional lambda function to apply to the DataFrame before calling Pandas unique(). Example: | lambda df: df |
check_name | Union[str, None] | An optional name for the check, to be printed as preface to the result. | None |
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
Note
fn is applied to the dataframe before selecting column. If you want to select the column before modifying it, set column=None and start fn with a column selection, i.e. fn=lambda df: df["my_column"].stuff()
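The two patterns in this note can be compared with plain Pandas. The sketch below uses made-up data and only mirrors the order of operations described above; it does not call Pandas Checks.

```python
import pandas as pd

# Hypothetical data with inconsistent capitalization in one column
df = pd.DataFrame({
    "species": ["setosa", "Setosa", "virginica"],
    "sepal_length": [5.1, 4.9, 6.3],
})

# Pattern 1: fn transforms the whole DataFrame, then the column is selected
fn = lambda df: df.query("sepal_length > 5")
uniques_filtered = fn(df)["species"].unique().tolist()

# Pattern 2: fn selects the column itself and then modifies it
fn_selects = lambda df: df["species"].str.lower()
uniques_lower = fn_selects(df).unique().tolist()

print(uniques_filtered)  # ['setosa', 'virginica']
print(uniques_lower)     # ['setosa', 'virginica']
```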
value_counts(column, fn=lambda df: df, max_rows=10, check_name=None, **kwargs)
Displays the value counts for a column, without modifying the DataFrame itself.
See Pandas docs for value_counts() for additional usage information, including more configuration options you can pass to this Pandas Checks method.
Example
(
iris
.check.value_counts("sepal_length")
)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
column | str | Column to check for value counts. | required |
max_rows | int | Maximum number of rows to show in the value counts. | 10 |
fn | Callable | An optional lambda function to apply to the DataFrame before running Pandas value_counts(). Example: | lambda df: df |
check_name | Union[str, None] | An optional name for the check, to be printed as preface to the result. | None |
**kwargs | Any | Optional, additional arguments that are accepted by Pandas value_counts() method. | {} |
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
Note
fn is applied to the dataframe before selecting column. If you want to select the column before modifying it, set column=None and start fn with a column selection, i.e. fn=lambda df: df["my_column"].stuff()
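As an example of an option that Pandas value_counts() accepts, `normalize=True` returns shares instead of raw counts. The sketch below uses plain Pandas on made-up data. Trimming the result with head() is only an assumption about how a max_rows limit could be applied.

```python
import pandas as pd

# Hypothetical column with three distinct values
df = pd.DataFrame({"sepal_length": [5.1, 5.1, 4.9, 5.1, 4.9, 6.3]})

max_rows = 2  # illustrative limit on how many rows to show

# normalize=True turns counts into fractions of the total
shares = df["sepal_length"].value_counts(normalize=True).head(max_rows)

print(shares)
```

The result is sorted by descending frequency, so the limit keeps the most common values.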
write(path, format=None, fn=lambda df: df, subset=None, verbose=False, **kwargs)
Exports DataFrame to file, without modifying the DataFrame itself.
The file format is inferred from the extension. Supports:
- .csv
- .feather
- .parquet
- .pkl # Pickle
- .tsv # Tab-separated data file
- .xlsx
This function uses the corresponding Pandas export function, such as to_csv() and to_feather(). See [Pandas docs for those export functions](https://pandas.pydata.org/docs/reference/io.html) for additional usage information, including more configuration options you can pass to this Pandas Checks method.
Example
(
iris
# Process data
...
# Export the interim data for inspection
.check.write("iris_interim.xlsx")
# Continue processing
...
)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | Path to write the file to. | required |
format | Union[str, None] | Optional file format to force for the export. If None, format is inferred from the file's extension in | None |
fn | Callable | An optional lambda function to apply to the DataFrame before exporting. Example: | lambda df: df |
subset | Union[str, List, None] | An optional list of column names or a string name of one column to limit which columns are exported. Applied after fn. | None |
verbose | bool | Whether to print a message when the file is written. | False |
**kwargs | Any | Optional, additional keyword arguments to pass to the Pandas export function (e.g. | {} |
Returns:
Type | Description |
---|---|
DataFrame | The original DataFrame, unchanged. |
Note
Exporting to some formats such as Excel, Feather, and Parquet may require you to install additional packages.
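Inferring the format from the file extension can be pictured as a lookup from suffix to Pandas export method. The helper below is a hypothetical sketch: `pick_exporter` and its mapping are not part of Pandas Checks. The method names are real Pandas DataFrame methods. Accepting the format with or without a leading dot is this sketch's own choice.

```python
from pathlib import Path
from typing import Optional

# Extension -> name of the Pandas DataFrame export method (illustrative mapping)
_EXPORTERS = {
    ".csv": "to_csv",
    ".feather": "to_feather",
    ".parquet": "to_parquet",
    ".pkl": "to_pickle",
    ".tsv": "to_csv",  # same method as .csv, written with a tab separator
    ".xlsx": "to_excel",
}


def pick_exporter(path: str, format: Optional[str] = None) -> str:
    """Hypothetical helper: choose an export method name for a path."""
    # A forced format takes priority over the path's own extension
    ext = format if format is not None else Path(path).suffix
    ext = ext.lower()
    if not ext.startswith("."):
        ext = "." + ext
    if ext not in _EXPORTERS:
        raise ValueError(f"Unsupported file format: {ext!r}")
    return _EXPORTERS[ext]


print(pick_exporter("iris_interim.xlsx"))           # to_excel
print(pick_exporter("iris_interim.txt", "csv"))     # to_csv
```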