manatee

This module contains a single class, Manatee, which inherits from pyspark.sql.dataframe.DataFrame. It provides some routine data-wrangling methods while still exposing the full functionality of the PySpark DataFrame. The class and its methods are described below.

class manatee.manatee.Manatee(df=None)[source]

Wrapper class around the PySpark DataFrame object, providing some usability features closer to the pandas.DataFrame object.

This class provides friendly methods for performing routine actions like dropping NAs, adding columns, casting columns to another variable type, or discovering unique values in a column. It aims to provide a more pandas-like experience to the PySpark DataFrame object.

Attributes

Methods

T

Transposes the dataframe’s index and columns in-memory.

This calls df.transpose(memory=True, inplace=False) to quickly transpose the RDD, assuming the whole dataset fits in memory. For a transpose that loads only one row into memory at a time, use Manatee.transpose() with memory=False.

concatenate(data, inplace=False)[source]

Concatenates a dataframe, an RDD, or a scalar column to the current dataframe.

Parameters:

data : scalar, RDD, PySpark DataFrame, or Manatee DataFrame.

If dataframe, this glues the two dataframes together columnwise. They must have the same length, and column names must be unique. If RDD, it must be able to generate a Manatee DataFrame using Manatee.from_rdd(). If scalar, data must be an int, a float, or a str. The scalar is broadcast so that every element in the new column has the value data.

inplace : bool.

If False, this method returns a new Manatee DataFrame. If True, the current DataFrame is mutated in-place, and this method returns nothing.
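
The column-wise semantics described above can be modelled in plain Python. The sketch below is not Manatee's implementation: it represents a dataframe as a list of dict rows, and the helper name concat_columns and its name parameter are hypothetical.

```python
def concat_columns(rows, data, name="new"):
    """Model of column-wise concatenation on a list of dict rows.

    `data` is either another list of dict rows (same length, disjoint
    column names) or a scalar (int, float, or str) that is broadcast
    into a single new column called `name` (a hypothetical parameter).
    """
    if isinstance(data, (int, float, str)):
        # Broadcast the scalar: every row gets the same value.
        return [{**row, name: data} for row in rows]
    if len(data) != len(rows):
        raise ValueError("dataframes must have the same length")
    overlap = set(rows[0]) & set(data[0]) if rows else set()
    if overlap:
        raise ValueError(f"column names must be unique: {sorted(overlap)}")
    # Glue the two row lists together side by side.
    return [{**left, **right} for left, right in zip(rows, data)]


left = [{"a": 1}, {"a": 2}]
right = [{"b": "x"}, {"b": "y"}]
glued = concat_columns(left, right)
weighted = concat_columns(left, 0.5, name="weight")
```

Both calls build new rows and leave left untouched, which mirrors the inplace=False behaviour.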

drop(columns, inplace=False)[source]

Returns a new dataframe that drops the specified column or columns.

This function extends PySpark’s DataFrame.drop() by allowing columns to be a list or tuple. In this case, all columns in the list or tuple are dropped. This function can also be run in-place.

Parameters:

columns : str, list, or tuple.

If str, drops the column named in columns. If list of str or tuple of str, drops all columns named in columns.

inplace : bool.

If False, this method returns a new Manatee DataFrame. If True, the current dataframe is mutated in-place, and this method returns nothing.
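
One way to accept a str, a list, or a tuple is to normalise the argument before filtering. The following is a plain-Python sketch of that idea on a list of dict rows; drop_columns is a hypothetical helper, not Manatee's code.

```python
def drop_columns(rows, columns):
    """Return new rows without the named column(s).

    `columns` may be a single str, or a list/tuple of str.
    """
    if isinstance(columns, str):
        columns = [columns]  # treat a lone name as a one-element list
    to_drop = set(columns)
    return [{k: v for k, v in row.items() if k not in to_drop} for row in rows]


rows = [
    {"id": 1, "name": "a", "tmp": None},
    {"id": 2, "name": "b", "tmp": None},
]
one_dropped = drop_columns(rows, "tmp")
two_dropped = drop_columns(rows, ("name", "tmp"))
```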

drop_first(n=1, inplace=False)[source]

Drop the first n rows of the dataframe.

Parameters:

n : int.

The number of rows to drop.

inplace : bool.

If False, this method returns a new Manatee DataFrame. If True, the current dataframe is mutated in-place, and this method returns nothing.

drop_last(n=1, inplace=False)[source]

Drop the last n rows of the dataframe.

Parameters:

n : int.

The number of rows to drop.

inplace : bool.

If False, this method returns a new Manatee DataFrame. If True, the current dataframe is mutated in-place, and this method returns nothing.
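
The row-dropping behaviour of drop_first() and drop_last() can be pictured with list slices. This is only a sketch on an ordinary Python list, assuming rows have a stable order; the function bodies are illustrative and are not taken from Manatee.

```python
def drop_first(rows, n=1):
    """Return the rows with the first n removed."""
    return rows[n:]


def drop_last(rows, n=1):
    """Return the rows with the last n removed."""
    # rows[:-n] would return an empty list for n == 0, so slice by length.
    return rows[:max(len(rows) - n, 0)]


rows = [1, 2, 3, 4]
without_head = drop_first(rows)
without_tail = drop_last(rows, 2)
```

Slicing by length rather than with a negative index keeps n=0 a no-op.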

dropna(how='any', na=None, subset=None, inplace=False)[source]

Drops rows containing NA or any of the values in na, such as na=["", "NULL"].

Parameters:

how : str.

If “any”, drops rows that contain at least one NA element. If “all”, drops rows only if all of their elements are NA.

na : list or None.

If None, only empty elements are considered NA. Otherwise, all elements in this list are also considered NA elements. For example, na=["", "NULL"] removes any rows containing empty elements, empty strings, or the string “NULL”.

subset : str, list, or None.

If None, the entire dataframe is considered when looking for NA values. Otherwise, only the columns whose names are given in this argument are considered.

inplace : bool.

If False, this method returns a new Manatee DataFrame. If True, the current dataframe is mutated in-place, and this method returns nothing.
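
The interaction between how, na, and subset is easier to see on a small example. The sketch below models the documented behaviour in plain Python on a list of dict rows, treating None as the "empty" element; drop_na_rows is a hypothetical helper and does not reflect how Manatee implements dropna().

```python
def drop_na_rows(rows, how="any", na=None, subset=None):
    """Drop rows whose checked columns are NA (None or a value in `na`)."""
    if how not in ("any", "all"):
        raise ValueError("how must be 'any' or 'all'")
    extra = list(na or [])
    if isinstance(subset, str):
        subset = [subset]

    def is_na(value):
        return value is None or value in extra

    kept = []
    for row in rows:
        cols = subset if subset is not None else list(row)
        flags = [is_na(row[c]) for c in cols]
        drop = any(flags) if how == "any" else all(flags)
        if not drop:
            kept.append(row)
    return kept


rows = [
    {"id": 1, "city": "Oslo"},
    {"id": 2, "city": ""},
    {"id": None, "city": "NULL"},
    {"id": None, "city": None},
]
default = drop_na_rows(rows)                      # only None counts as NA
strict = drop_na_rows(rows, na=["", "NULL"])      # "" and "NULL" count too
all_na = drop_na_rows(rows, how="all", na=["", "NULL"])
by_city = drop_na_rows(rows, subset="city")       # look at one column only
```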

classmethod from_rdd(rdd, name=None)[source]

Create a Manatee DataFrame from an RDD.

This method builds a Manatee DataFrame from the data in a resilient distributed dataset (RDD).

Parameters:

rdd : pyspark.rdd.RDD.

The RDD to be turned into a dataframe. Elements of the RDD can be single non-container items (like a float or str), tuples of such items, or pyspark.sql.types.Row objects.

name : str or list.

The desired name(s) of the column(s), if the RDD’s elements are not Row objects.
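
The three accepted element shapes can be illustrated without Spark. In this sketch a namedtuple stands in for a Row object, a plain list stands in for the RDD, and rows_from_elements is a hypothetical helper; how Manatee actually builds the dataframe is not shown in this documentation.

```python
from collections import namedtuple


def rows_from_elements(elements, name=None):
    """Turn scalars, tuples, or named records into dict rows."""
    names = [name] if isinstance(name, str) else name
    rows = []
    for element in elements:
        if hasattr(element, "_asdict"):
            # Named record (stand-in for a Row): it carries its own names.
            rows.append(dict(element._asdict()))
            continue
        # Wrap a lone scalar so scalars and tuples share one code path.
        values = element if isinstance(element, tuple) else (element,)
        if names is None or len(names) != len(values):
            raise ValueError("need one column name per value")
        rows.append(dict(zip(names, values)))
    return rows


Point = namedtuple("Point", ["x", "y"])  # hypothetical Row stand-in

from_scalars = rows_from_elements([1.5, 2.5], name="score")
from_tuples = rows_from_elements([(1, "a"), (2, "b")], name=["id", "tag"])
from_records = rows_from_elements([Point(1, 2)])
```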

map_concat(f)[source]

Performs a map() followed by a concatenate() in-place.

Concatenates the RDD that results from a map() operation as a new column in the dataframe.
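
As a rough picture of the map-then-concatenate pattern, the sketch below applies a function to every row of a list of dicts and attaches the results as a new column in place. The name argument is an assumption made for this illustration; the documented signature takes only f, and how Manatee names the new column is not stated here.

```python
def map_concat(rows, f, name="mapped"):
    """Apply f to each row, then attach the results as a new column in place."""
    mapped = list(map(f, rows))          # the map() step
    for row, value in zip(rows, mapped):
        row[name] = value                # the concatenate() step, in place
    # Like the in-place methods described above, this returns nothing.


rows = [{"a": 1}, {"a": 2}]
result = map_concat(rows, lambda row: row["a"] * 10, name="a_times_10")
```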

na_rate(na=None, subset=None)[source]

Returns the fraction of NA elements per column. Not yet implemented.
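
Since this method is not yet implemented, the following is only a plain-Python sketch of what "fraction of NA elements per column" could mean, using a list of dict rows and the same na and subset conventions as dropna(). The function body is an assumption for illustration, not Manatee code.

```python
def na_rate(rows, na=None, subset=None):
    """Return {column: fraction of rows whose value is NA} for a list of dicts."""
    if not rows:
        return {}
    extra = list(na or [])
    if isinstance(subset, str):
        subset = [subset]
    cols = subset if subset is not None else list(rows[0])
    total = len(rows)
    return {
        c: sum(1 for row in rows if row[c] is None or row[c] in extra) / total
        for c in cols
    }


rows = [
    {"a": None, "b": ""},
    {"a": 1, "b": "x"},
    {"a": 2, "b": None},
    {"a": 3, "b": "y"},
]
rates = na_rate(rows)
rates_with_empty = na_rate(rows, na=[""])
```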

quick_cast(column, dtype, inplace=False)[source]

Attempts to cast a column to a given dtype.

Unlike PySpark’s Column.cast() method, quick_cast doesn’t return a column separately; it casts a column inside the dataframe, keeping the column’s name but changing the dtype of its elements. If the cast fails, the call “fails gracefully”: the dataframe is unchanged and Spark prints a large amount of error output. How to catch these exceptions is still an open question.

Parameters:

column : str.

The name of the column whose elements are to be cast.

dtype : type.

The type of variable to cast to. Valid entries are the keys in the dictionary Manatee.typedict.

inplace : bool.

If False, this method returns a new Manatee DataFrame. If True, the current dataframe is mutated in-place, and this method returns nothing.
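
The "cast in place or leave unchanged" behaviour can be sketched in plain Python. Here dtype is simply a Python callable such as int or float, which is an assumption: the valid entries for Manatee are the keys of Manatee.typedict, which this page does not list. The helper below is illustrative only.

```python
def quick_cast(rows, column, dtype):
    """Cast one column of a list of dict rows, or return the rows unchanged."""
    try:
        # Convert every value first, so a failure leaves nothing half-cast.
        converted = [dtype(row[column]) for row in rows]
    except (TypeError, ValueError):
        return rows  # graceful failure: hand back the original rows
    return [{**row, column: value} for row, value in zip(rows, converted)]


good = [{"n": "1"}, {"n": "2"}]
bad = [{"n": "1"}, {"n": "x"}]
cast_ok = quick_cast(good, "n", int)
cast_failed = quick_cast(bad, "n", int)
```

Converting the whole column before building new rows is what keeps a failed cast from leaving the data partially modified.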

toPySparkDF()[source]

Returns a PySpark DataFrame object.

Returns:

df : pyspark.sql.dataframe.DataFrame.

to_null(to_replace, subset=None, inplace=False)[source]

Replaces a set of values with None.

This method replaces a set of values with None, which is helpful if your dataframe has a number of different values that should be null, such as empty strings, whitespace, or the string “NULL”.

Parameters:

to_replace : scalar or list.

The value, or list of values, to replace with None.

subset : str, list, or None.

If None, the entire dataframe is considered when looking for to_replace values. Otherwise, only the columns whose names are given in this argument are considered.

inplace : bool.

If False, this method returns a new Manatee DataFrame. If True, the current dataframe is mutated in-place, and this method returns nothing.
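
A small plain-Python model shows how to_replace and subset combine. It works on a list of dict rows and uses a hypothetical helper named after the method; it is a sketch of the documented behaviour, not the library's implementation.

```python
def to_null(rows, to_replace, subset=None):
    """Replace matching values with None, optionally only in some columns."""
    targets = to_replace if isinstance(to_replace, list) else [to_replace]
    if isinstance(subset, str):
        subset = [subset]
    result = []
    for row in rows:
        cols = set(row if subset is None else subset)
        result.append(
            {k: (None if k in cols and v in targets else v) for k, v in row.items()}
        )
    return result


rows = [{"a": "", "b": "NULL"}, {"a": "x", "b": " "}]
everywhere = to_null(rows, ["", " ", "NULL"])
only_a = to_null(rows, "", subset="a")
```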

transpose(memory=False, inplace=True)[source]

Performs a transpose of the underlying RDD.

This method is not yet fully implemented: only the in-memory path works at present, and it relies on pandas.DataFrame.transpose().

Parameters:

memory : bool.

If True, the transpose is done in-memory. If False, the transpose is performed in a distributed fashion, which still requires that at least one row of the dataframe fit into memory.

inplace : bool.

If False, this method returns a new Manatee DataFrame. If True, the current dataframe is mutated in-place, and this method returns nothing.
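
For intuition, an in-memory transpose of a small table can be written with zip. This plain-Python sketch stands in for the pandas-based path mentioned above and assumes every row has the same length; transpose_in_memory is a hypothetical name.

```python
def transpose_in_memory(table):
    """Swap rows and columns of a list of equal-length lists."""
    # zip(*table) pairs up the i-th element of every row, i.e. column i.
    return [list(column) for column in zip(*table)]


table = [
    [1, 2, 3],
    [4, 5, 6],
]
flipped = transpose_in_memory(table)
```

The whole table is held in memory at once, which is the constraint that the memory=True option describes.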

unique(columns=None)[source]

Find the unique values of a column in the dataframe.

This method returns the unique values in one or more columns, together with the number of rows holding each value. The result is a dictionary mapping column names to the unique values in each column. If no column is passed, every column in the dataframe is processed, which might be slow.

Parameters:

columns : str, list, or None.

The names of columns in which to find unique values. If columns is a string, it is the name of the column in which to find unique values. If columns is a list, its elements are assumed to be column names. If columns is None, unique values are found from across the entire dataframe, on a per-column basis.

Returns:

unique_values : dict.

A dictionary whose keys are column names and whose values are the unique values in that column, together with their counts.
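
The shape of the result can be sketched with collections.Counter. The exact structure Manatee uses for "unique values and their counts" is not specified on this page, so the nested {value: count} dictionary below is an assumption, as is the helper itself, which works on a list of dict rows.

```python
from collections import Counter


def unique_values(rows, columns=None):
    """Map each requested column to a {value: row count} dictionary."""
    if isinstance(columns, str):
        columns = [columns]
    if columns is None:
        # No column given: fall back to every column in the data.
        columns = list(rows[0]) if rows else []
    return {c: dict(Counter(row[c] for row in rows)) for c in columns}


rows = [
    {"color": "red", "size": 1},
    {"color": "blue", "size": 1},
    {"color": "red", "size": 2},
]
one_column = unique_values(rows, "color")
all_columns = unique_values(rows)
```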