Panda: Working With Date In Pandas

From OnnoWiki
Revision as of 10:53, 7 August 2019 by Onnowpurbo (talk | contribs) (Created page with "Sumber: https://galaxydatatech.com/2017/12/23/working-date-pandas/ Pandas provides powerful tools for working with dates and time data sets. Pandas datetime functionality hea...")
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

Sumber: https://galaxydatatech.com/2017/12/23/working-date-pandas/

Pandas provides powerful tools for working with dates and time data sets. Pandas datetime functionality heavily depends on the NumPy datetime64 and timedelta64 dtypes. Working with date in pandas is very important especially in the analysis of time related data sets. Date and time often comes in different format where some format contains numbers, some contain a combination of numbers and text. Before the modeling of data the date must be in the standard format so as to give accurate predictions. In this post we will learn how to work with date in pandas. Working With Date In Pandas

Python provides flexible methods for working with dates. Below, we are going to see how to manipulate date-time data in Pandas. Creating Date Range

To create ranges of dates in Pandas we use date_range() function. The pandas.date_range(start=None, end=None, periods=None, freq=’D’, tz=None, normalize=False, name=None, closed=None, **kwargs) function returns a DateTimeIndex. Below are some of the parameters in the function.

start : This argument specifies the start datetime-like string to be passed. If not specified the default is None.
end : This argument specifies the end datetime-like string to be passed. If not specified the default is None.
periods : This argument specifies the number of periods to be generated.
tz : This argument specifies the string time zone name for returning localized DatetimeIndex.
name :This argument specifies the name of the resulting DatetimeIndex.
freq : This specifies a Date Offset, the default being ‘D’ for calendar daily.

For more information about other parameters in the date_range function you can visit the Pandas date_range documentation here.

import pandas as pd
date_rng=pd.date_range('2017-12-01', periods=5)
print(date_rng)

Output

DatetimeIndex(['2017-12-01', '2017-12-02', '2017-12-03', '2017-12-04',
               '2017-12-05'],
              dtype='datetime64[ns]', freq='D')

Creating a Series From date_range()

import pandas as pd
date_rng=pd.date_range('2017-12-01', periods=5,name='Date')
s=pd.Series(date_rng)
print(s)

Output

0   2017-12-01
1   2017-12-02
2   2017-12-03
3   2017-12-04
4   2017-12-05
Name: Date, dtype: datetime64[ns]

Changing The Frequency

The default frequency is ‘D’ for day of calendar. We can change it to other offsets like M for month, and Y for year. Below example changes the frequency to month.

import pandas as pd
date_rng=pd.date_range('2017-12-01', periods=5,name='Date',freq='M')
s=pd.Series(date_rng)
print(s)

Output

0   2017-12-31
1   2018-01-31
2   2018-02-28
3   2018-03-31
4   2018-04-30
Name: Date, dtype: datetime64[ns]

bdate_range()

This function returns the business date from Monday to Friday. It excludes Saturday and Sunday.

import pandas as pd
date_rng=pd.bdate_range('2017-12-01', periods=7,name='Date')
s=pd.Series(date_rng)
print(s)

Output

0   2017-12-01
1   2017-12-04
2   2017-12-05
3   2017-12-06
4   2017-12-07
5   2017-12-08
6   2017-12-11
Name: Date, dtype: datetime64[ns]

Converting a String to a Date

import pandas as pd
s=pd.Series(['2017/12/01','2017/12/02','2017/12/03','2017/12/04','2017/12/05'])
print("String \n ",s)
date=pd.to_datetime(s)
print("\n\nDate \n ",date)

Output

String 
  0    2017/12/01
1    2017/12/02
2    2017/12/03
3    2017/12/04
4    2017/12/05
dtype: object

Date 
  0   2017-12-01
1   2017-12-02
2   2017-12-03
3   2017-12-04
4   2017-12-05
dtype: datetime64[ns]

Changing The DateFormat

We can easily specify the datetime format using the format argument. Below example convert the date format from ‘Y-m-d’ to ‘Y-d-m’

import pandas as pd
s=pd.Series(['2017/12/01','2017/12/02','2017/12/03','2017/12/04','2017/12/05'])
print("String \n ",s)
date=pd.to_datetime(s, format='%Y/%d/%m')
print("\n\nDate \n ",date)

Output

String 
  0    2017/12/01
1    2017/12/02
2    2017/12/03
3    2017/12/04
4    2017/12/05
dtype: object

Date 
  0   2017-01-12
1   2017-02-12
2   2017-03-12
3   2017-04-12
4   2017-05-12
dtype: datetime64[ns]


Converting Epoch Time To Timestamps

Pandas supports the conversion of epoch time to Timestamp or DatetimeIndex. Below example converts an epoch time to a DatetimeIndex.

import pandas as pd
date_epoch=pd.to_datetime([1512086400000000000, 1512172800000000000, 1512259200000000000, 1512345600000000000, 1512432000000000000])
print(date_epoch)

Output

DatetimeIndex(['2017-12-01', '2017-12-02', '2017-12-03', '2017-12-04',
               '2017-12-05'],
              dtype='datetime64[ns]', freq=None)

Converting Timestamp to Epoch Time.

We can convert time from timestamp format to epoch time using the view function.

import pandas as pd
date_rng=pd.date_range('2017-12-01', periods=5,name='Date')
print("DatetimeIndex \n ",date_rng)
epoch=date_rng.view('int64')
print("\n\nEpoch time \n ",epoch)

Output

DatetimeIndex 
  DatetimeIndex(['2017-12-01', '2017-12-02', '2017-12-03', '2017-12-04',
               '2017-12-05'],
              dtype='datetime64[ns]', name='Date', freq='D')
Epoch time 
  [1512086400000000000 1512172800000000000 1512259200000000000
 1512345600000000000 1512432000000000000]

Extracting Month and Year From Date

Below example extracts the month and year from date and creates new columns for them.

import pandas as pd
date_rng=pd.date_range('2017-12-01', periods=5,freq='M')
df=pd.DataFrame(date_rng,columns=['Date'])
print("Original DataFrame\n",df)

#create new columns for month and year
df['month']=df['Date'].dt.month
df['year']=df['Date'].dt.year

print("\n\nNew DataFrame\n",df)

Output

Original DataFrame
         Date
0 2017-12-31
1 2018-01-31
2 2018-02-28
3 2018-03-31
4 2018-04-30

New DataFrame
         Date  month  year
0 2017-12-31     12  2017
1 2018-01-31      1  2018
2 2018-02-28      2  2018
3 2018-03-31      3  2018
4 2018-04-30      4  2018


Let’s use lambda and apply function to extract year from date

import pandas as pd
date_rng=pd.date_range('2017-12-01', periods=5,freq='M')
df=pd.DataFrame(date_rng,columns=['Date'])
print("Original DataFrame\n",df)
df['Date'] = df['Date'].apply(lambda x: x.strftime('%Y'))
print("\nYear\n",df['Date'])

Output

Original DataFrame
         Date
0 2017-12-31
1 2018-01-31
2 2018-02-28
3 2018-03-31
4 2018-04-30

Year
 0    2017
1    2018
2    2018
3    2018
4    2018
Name: Date, dtype: object

Now, let’s extract the month and the year using lambda and apply function

import pandas as pd
date_rng=pd.date_range('2017-12-01', periods=5,freq='M')
df=pd.DataFrame(date_rng,columns=['Date'])
print("Original DataFrame\n",df)
df['Date'] = df['Date'].apply(lambda x: x.strftime('%B-%Y'))

Output

Original DataFrame
         Date
0 2017-12-31
1 2018-01-31
2 2018-02-28
3 2018-03-31
4 2018-04-30 

Month and Year
0    December-2017
1     January-2018
2    February-2018
3       March-2018
4       April-2018
Name: Date, dtype: object

Subtracting Dates in Pandas

In the example below we subtract the two dates and create a new column on the DataFrame. By default the difference is in days.

import pandas as pd
date_1=pd.date_range('2017-12-01', periods=5,freq='M')
date_2=pd.date_range('2018-03-05', periods=5,freq='M')
df=pd.DataFrame([date_1,date_2],index=['A','B']).T
print("Original DataFrame\n",df)
#subtract A from B
df['difference']=df['B']-df['A']
print("\n\nNew DataFrame\n",df)

Output

Original DataFrame
            A          B
0 2017-12-31 2018-03-31
1 2018-01-31 2018-04-30
2 2018-02-28 2018-05-31
3 2018-03-31 2018-06-30
4 2018-04-30 2018-07-31

New DataFrame
            A          B  difference
0 2017-12-31 2018-03-31     90 days
1 2018-01-31 2018-04-30     89 days
2 2018-02-28 2018-05-31     92 days
3 2018-03-31 2018-06-30     91 days
4 2018-04-30 2018-07-31     92 days

Pandas Timedelta

Timedelta are differences in times, expressed in days, hours, minutes, or seconds. The timedelta can be either positive or negative. Creating Timedelta

Using String

We can use string to create a timedelta as shown below.

import pandas as pd
time_delta=pd.Timedelta('15 days 30 minutes 45 seconds')
print(time_delta)

Output

15 days 00:30:45

Using Integer

We can create a timedelta by passing an integer in the timedelta function as shown below.

import pandas as pd
time_delta=pd.Timedelta(15,unit='d')
print(time_delta)

Output

15 days 00:00:00

Converting Text to Timedelta

We can easily convert any valid timedelta text to a timedelta format by using to_timedelta() function. The example below demonstrates how to convert a valid timedelta text DataFrame to a timedelta column and appending it to the original DataFrame.

import pandas as pd
df=pd.DataFrame(['15 days 30 minutes 45 seconds','20 days 10 minutes 55 seconds','29 days 50 minutes 03 seconds','30 days 24 minutes 07 seconds','31 days 59 minutes 59 seconds'],columns=['Timedelta Text'])
print("Original DataFrame\n",df)
df['Timedelta']=pd.to_timedelta(df['Timedelta Text'])
print("\n\nTimedelta DataFrame\n",df)

Output

Original DataFrame
                   Timedelta Text
0  15 days 30 minutes 45 seconds
1  20 days 10 minutes 55 seconds
2  29 days 50 minutes 03 seconds
3  30 days 24 minutes 07 seconds
4  31 days 59 minutes 59 seconds
Timedelta DataFrame
                   Timedelta Text        Timedelta
0  15 days 30 minutes 45 seconds 15 days 00:30:45
1  20 days 10 minutes 55 seconds 20 days 00:10:55
2  29 days 50 minutes 03 seconds 29 days 00:50:03
3  30 days 24 minutes 07 seconds 30 days 00:24:07
4  31 days 59 minutes 59 seconds 31 days 00:59:59

Working With Missing Date In Pandas

Sometimes we are presented with data that has missing datetime values. Most machine learning and data analysis models don’t work well with missing values. Handling missing datetime values is very important when cleaning data. Missing values in datetime data type is represented as NAT. Handling NAT is similar to handling of the NaN.

Creating a DataFrame with Missing Datetime Values.

import pandas as pd
import numpy as np
date_rng=pd.date_range('2017-11-01', periods=10,freq='M')
df=pd.DataFrame(date_rng,columns=['Date'])
df.loc[2,3,7,8]=np.nan
print(df)

Output

        Date
0 2017-11-30
1 2017-12-31
2        NaT
3        NaT
4 2018-03-31
5 2018-04-30
6 2018-05-31
7        NaT
8        NaT
9 2018-08-31

We can perform all the missing values operations like dropna, and fillna on NaN as well as on NAT .

Conclusion

Time is an important variable in research and statistical inference. Working with datetime data in Pandas is an important task as most of the data are analyzed in relation to time. There are two approaches of working with dates in data analysis, one of the option is to use Python datetime functionality and the other option is to use the Pandas date functionality. We have covered the basics of working with date in Pandas in this post. More advanced date functionality will be covered in other posts. What’s Next

Now that you have basics of working with date in Pandas, in the next post we will learn about the Pandas Group By functionality for splitting, applying and combining data.



Referensi