[0, 5]: Pandas

(The introduction is taken from the Python Handbook for Data Science)

Pandas is a package built on top of NumPy, and provides an efficient implementation of a DataFrame. DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data. Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.

As we saw, NumPy’s ndarray data structure provides essential features for the type of clean, well-organized data typically seen in numerical computing tasks. While it serves this purpose very well, its limitations become clear when we need more flexibility (attaching labels to data, working with missing data, etc.), especially when dealing with less structured data, which is the most common thing in real data. Pandas, and in particular its Series and DataFrame objects, builds on the NumPy array structure and provides efficient access to these sorts of “data munging” tasks that occupy much of a scientist time.

In this chapter, we will focus on the mechanics of using Series, DataFrame, and related structures effectively.

The common alias for Pandas is pd.

import pandas as pd

The Pandas Series Object

A Pandas Series is an indexed one-dimensional array. It can be created from a list, an array, or numpy array as follows:

my_array = [1,3,6,10,15]
data = pd.Series(my_array)

data
0     1
1     3
2     6
3     10
4     15
dtype: int64

As we can see, there is an automatical index, which we can access by the attribute index. The values are accessed by the attribute values.

But a Series is more than just a NumPy array with an index. The index for example can be not just an integer, but any type.

data = pd.Series(my_array, index=['alpha','beta','gamma','delta','epsilon')

data
alpha    1
beta    3
gamma    6
delta    10
epsilon    15
dtype: int64

The way we access the values are similar to dictionaries:

data['delta']
10

The similarities to the standard dictionaries are quite clear. The main thing to remember though, is that Series are ordered. We can construct a Series from a Dictionary:

map_dict={'ENST':'RNA', 'ENSG':'gene', 'ENSP':'protein'}
data=pd.Series(map_dict)

data
ENSG    gene
ENSP    protein
ENST    RNA
dtype: object

The dictionary is intrinsically unordered, so when converting it to a Series, Pandas builds the index from a sorted list of the dictionary keys.
Beware that the last part, the dtype, is not part of the elements you access when cycling through the values of the Series, it is just another attribute.

Pandas Series support slicing just like other arrays:

data['ENSP':]
ENSP    protein
ENST    RNA
dtype: object

And now for the main course:

The Pandas DataFrame

Like the Series, a DataFrame can be thought as a generalization of a mix of both a NumPy array and a dictionary.

A DataFrame can be constructed from 2 or more Series with the same index:

map_dict={'ENST':'RNA', 'ENSG':'gene', 'ENSP':'protein'}
count_dict={'ENST': 3300, 'ENSG':18435, 'ENSP':12034}

df = pd.DataFrame({'mapping': map_dict, 'counts': count_dict})

df
      counts  mapping
ENSG   18435     gene
ENSP   12034  protein
ENST    3300      RNA

We can access the index labels with the DataFrame attribute index.
Additionally, the DataFrame has a columns attribute, which holds the labels for all columns:

df.index
Index(['ENSG','ENSP','ENST'], dtype='object')

df.columns
Index(['counts','mapping'], dtype='object')

A DataFrame can be thus thought as a generalized 2-dimensional array with generalized indexing for accessing data.

So how do we access data?
In the most intuitive way: say we need to access the column (or ‘feature’) ‘counts’, we can access it like a dictionary, or in a Pandas way:

df['counts']
ENSG    18435
ENSP    12034
ENST     3300
Name: counts, dtype: int64

df.counts
ENSG    18435
ENSP    12034
ENST     3300
Name: counts, dtype: int64

The only difference is that the dictionary way supports labels with spaces and special characters while the ‘object’, Pandas way does not (it is easily understandable that we can make a call for

df['my ingenuous label']

but not for

df.my ingenuous label

)

However, the most common way we are going to construct our DataFrames is not from dictionaries, but from numpy 2-d arrays, because most of the time biological data will come in Tab Separated Values format (.tsv), or something easily convertible to a 2-d array (or a list of lists, which is the same for Pandas).


"""this is a small sample of a list of performances of the two multiple aligners 
mafftx and raf. The values represent, in order: method, dataset,
average pairwise identity, structural pairwise score, 
structural conservation index, reference structural conservation index,
 and how many sequences were in the dataset

You don't need to understand all these numbers"""

biglist= \
[['mafftx','5_8S_rRNA','75','0.775230007077','0.7154','68', '10'],
 ['mafftx','5_8S_rRNA','76','0.801566951567','0.5849','67', '10'],
['raf', 'yybP-ykoY', '47', '0.25886864813', '0.8660', '68', '7'],
 ['raf', 'yybP-ykoY', '47', '0.273654916512', '0.9000', '74', '7']]

#The backslash after 'biglist= ' is needed to write on a newline, nothing important

cols = ['method','dataset','APSI','SPS','SCI','refSCI','k']
my_df = pd.DataFrame(biglist, columns=cols)

my_df
   method    dataset APSI             SPS     SCI refSCI   k
0  mafftx  5_8S_rRNA   75  0.775230007077  0.7154     68  10
1  mafftx  5_8S_rRNA   76  0.801566951567  0.5849     67  10
2     raf  yybP-ykoY   47   0.25886864813  0.8660     68   7
3     raf  yybP-ykoY   47  0.273654916512  0.9000     74   7

#be careful that the entries are actually all strings
#(we can see that from the initial list, but we'll see
# a better method in a while)

my_df[['APSI','SPS','SCI','refSCI','k']] = my_df[['APSI','SPS','SCI','refSCI','k']].apply(pd.to_numeric)
#this is needed to convert the strings to the best numeric type.

As you can see, we can define our column labels with the parameter columns, and the same can be done with the index (the parameter is of course: index)

Actually, we can define the index and columns even after the DataFrame is constructed:

df.index=['first_index','second_index','another_index', ...]
df.columns=['measure1','measure3','measure22', ...]

DataFrame attributes, methods and data selection

#assume a DataFrame called df

df.T #returns the transpose of the DataFrame
df.dtypes #it's plural
df.shape #like a NumPy array, returns the DataFrame's dimensions
df.iloc #integer-location indexing
df.loc #label-location indexing

Let’s talk about iloc and loc: the simple call to the attribute will return nothing but a pandas.core.indexing._iLocIndexer or pandas.core.indexing._LocIndexer, but that object can be accessed via slicing and it is used to select rows of the DataFrame. The iloc attribute requires an integer, a list of integers or a boolean array which will be interpreted as the positions to select. The loc attribute instead must be used with labels.

my_df
   method    dataset APSI             SPS     SCI refSCI   k
0  mafftx  5_8S_rRNA   75  0.775230007077  0.7154     68  10
1  mafftx  5_8S_rRNA   76  0.801566951567  0.5849     67  10
2     raf  yybP-ykoY   47   0.25886864813  0.8660     68   7
3     raf  yybP-ykoY   47  0.273654916512  0.9000     74   7

my_df.index=['first','second','third','fourth']

my_df
        method    dataset APSI             SPS     SCI refSCI   k
first   mafftx  5_8S_rRNA   75  0.775230007077  0.7154     68  10
second  mafftx  5_8S_rRNA   76  0.801566951567  0.5849     67  10
third      raf  yybP-ykoY   47   0.25886864813  0.8660     68   7
fourth     raf  yybP-ykoY   47  0.273654916512  0.9000     74   7


my_df.iloc[1:3]
   method    dataset APSI             SPS     SCI refSCI   k
second  mafftx  5_8S_rRNA   76  0.801566951567  0.5849     67  10
third     raf  yybP-ykoY   47   0.25886864813  0.8660     68   7
#1:3 is sliced (3 not included, just like in the range function)



my_df.loc['second':'fourth']
   method    dataset APSI             SPS     SCI refSCI   k
second  mafftx  5_8S_rRNA   76  0.801566951567  0.5849     67  10
third     raf  yybP-ykoY   47   0.25886864813  0.8660     68   7
fourth     raf  yybP-ykoY   47  0.273654916512  0.9000     74   7

#we can select a portion of the df straight away
my_df.loc['second':'fourth','SPS':'SCI']
             SPS     SCI
second  0.801567  0.5849
third   0.258869  0.8660

#we can combine them if you have mixed information
my_df.iloc[1:3].loc[:, 'SPS']
second    0.801567
third     0.258869
Name: SPS, dtype: float64

First trouble: while the iloc behaves as expected, the loc indexing uses a different workaround which includes the end point of the slice. Be sure not to forget this particular behaviours.

Second thing to learn: the notation

[indexes]

refers to rows. If we want to select columns the notation required is, as can be seen in the last example

...loc[row_indexes, column_indexes]

. Just as in standard python slices, the colon “:” can be used alone to refer to all the available range.

Moreover, we can select data by boolean indexing (just like with NumPy)

bool_idx=my_df.SCI > 0.7 
bool_idx
first      True
second    False
third      True
fourth     True
Name: SCI, dtype: bool

my_df[bool_idx]
        method    dataset  APSI             SPS     SCI refSCI   k
first   mafftx  5_8S_rRNA    75  0.775230007077  0.7154     68  10
third      raf  yybP-ykoY    47   0.25886864813  0.8660     68   7
fourth     raf  yybP-ykoY    47  0.273654916512  0.9000     74   7

DataFrame description

Two useful methods exist to inquire on DataFrames: info and describe.
They both sum up the features (columns) present in the DataFrame, but info talks about the data type, the memory usage, and the number of non-null elements, while describe does a bit of descriptive statistics.

my_df.info()

Index: 4 entries, first to fourth
Data columns (total 7 columns):
method     4 non-null object
dataset    4 non-null object
APSI       4 non-null int64
SPS        4 non-null float64
SCI        4 non-null float64
refSCI     4 non-null int64
k          4 non-null int64
dtypes: float64(2), int64(3), object(2)
memory usage: 416.0+ bytes

my_df.describe()
            APSI       SPS       SCI     refSCI          k
count   4.000000  4.000000  4.000000   4.000000   4.000000
mean   61.250000  0.527330  0.766575  69.250000   8.500000
std    16.459546  0.301708  0.145272   3.201562   1.732051
min    47.000000  0.258869  0.584900  67.000000   7.000000
25%    47.000000  0.269958  0.682775  67.750000   7.000000
50%    61.000000  0.524442  0.790700  68.000000   8.500000
75%    75.250000  0.781814  0.874500  69.500000  10.000000
max    76.000000  0.801567  0.900000  74.000000  10.000000

These kind of information can be useful when trying to infer missing values, presence of out-of-expected-range values (like -1 in positive features) and such. This will be better covered in the EDA chapter.