Cherry Blossom Peak Bloom Predictions

Author

Student

Published

February 28, 2025

Abstract

This experiment attempts to forecast peak bloom dates by using a temperature-based time-series pattern matching technique. By using historical weather data and bloom observations, the method identifies analogous climate sequences from past years to project future bloom timings.

The framework used is Fast Dynamic Time Warping (FastDTW), which compares recent temperature patterns with historical patterns, determining the most probable position in the bloom cycle. The dataset consists of daily temperature records for many years, formatted to represent the cycles of dormancy and bloom. To improve predictive accuracy, we introduce derived features including cumulative degree days, rolling temperature sums, and sum of squares of temperature values, and more.

When testing the model, partial temperature sequences from historical bloom years are selected and compared in alignment to actual bloom dates. By iteratively sliding time windows over historical records, we can evaluate how closely predicted bloom dates match observed events through minimizng the Euclidean norm.

Despite the many limitations in this design, it may prove beneficial to track bloom dates in terms of cyles and patterns. This tool of sliding windows to find patterns in the data can allow us to forecast any time-series datasets.

Introduction

This project aims to predict Cherry Blossom peak bloom dates (PBD) for 2025 at 5 locations around the world. Historical data on Cherry Blossom PBD was used to model bloom cycles, where the beginning of the cycle represents one day after previous PBD, and the final day represents the current PBD. These cycles are compared to one another, mapping similar trends on functions of temperature. In particular, the days leading up to the bloom were closely analyzed, allowing us to theoretically input a slice of current temperature data in order to predict how far along the cycle we are, thus generating an estimate for the future PBD.

Setup

This project uses python and related packages. Please ensure python is installed, as well as all other dependencies listed below:

import pandas as pd 
import numpy as np 
import datetime as dt
import matplotlib.pyplot as plt
from fastdtw import fastdtw
import warnings
warnings.filterwarnings('ignore')
# Import Bloom Datasets
washingtondc = pd.read_csv('data/washingtondc.csv')
liestal = pd.read_csv('data/liestal.csv')
kyoto = pd.read_csv('data/kyoto.csv')
vancouver = pd.read_csv('data/vancouver.csv')
newyorkcity = pd.read_csv('data/nyc.csv')
# Create and engineer relevant features for Washington DC 
# Years not included (missing data): 2000 
wdc_weather_df = pd.read_csv('data/wdc_weather1990_2024.csv')
wdc_weather_df.reset_index(drop=True, inplace=True)
wdc_weather_df['Date'] = pd.to_datetime(wdc_weather_df['Date'])
wdc_weather_df.sort_values(by='Date',inplace=True)
wdc_weather_df.drop(columns=['index'], inplace=True)
wdc_weather_df
Temp_max Temp_avg Temp_min Dew_max Dew_avg Dew_min Hum_max Hum_avg Hum_min Wind_max Wind_avg Wind_min Pres_max Pres_avg Pres_min Precipitation Date
0 46 39.4 0 45 25.3 0 96 55.0 0 24 17.5 8 30.3 29.9 29.6 0.00 1990-01-01
1 46 36.3 29 28 23.0 20 86 60.5 39 12 8.0 6 30.5 30.4 30.3 0.00 1990-01-02
2 50 37.0 30 30 27.2 22 89 70.2 39 12 5.8 2 30.4 30.3 30.3 0.00 1990-01-03
3 55 32.8 0 51 27.3 0 97 74.4 0 17 10.4 3 30.3 30.0 29.9 0.00 1990-01-04
4 51 45.4 42 46 36.5 31 100 72.8 50 15 8.0 0 30.2 30.1 29.9 0.00 1990-01-05
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
12412 49 42.0 36 38 30.5 27 82 64.4 50 7 3.0 0 30.5 30.4 30.3 0.00 2024-12-27
12413 53 47.0 42 50 44.2 39 94 89.8 86 12 4.6 0 30.3 30.2 30.2 0.29 2024-12-28
12414 67 56.5 50 56 51.6 46 93 84.2 59 21 8.0 0 30.2 30.0 29.6 0.35 2024-12-29
12415 63 55.8 44 53 37.5 27 86 54.4 27 21 9.0 3 29.9 29.7 29.6 0.03 2024-12-30
12416 63 50.3 40 47 40.5 35 90 71.5 41 32 8.2 0 29.9 29.7 29.5 0.00 2024-12-31

12417 rows × 17 columns

wdc_bloomdates_df = pd.DataFrame(washingtondc)
wdc_bloomdates_df = wdc_bloomdates_df[wdc_bloomdates_df['year'] >= 1990]
wdc_bloomdates_df = wdc_bloomdates_df.drop(wdc_bloomdates_df[wdc_bloomdates_df['year'] == 2000].index)
wdc_bloomdates_df['bloom_date'] = pd.to_datetime(wdc_bloomdates_df['bloom_date'])
wdc_bloomdates_df.reset_index(drop=True, inplace=True)
wdc_bloomdates_df
location lat long alt year bloom_date bloom_doy
0 washingtondc 38.88535 -77.038628 0 1990 1990-03-15 74
1 washingtondc 38.88535 -77.038628 0 1991 1991-03-29 88
2 washingtondc 38.88535 -77.038628 0 1992 1992-04-07 98
3 washingtondc 38.88535 -77.038628 0 1993 1993-04-11 101
4 washingtondc 38.88535 -77.038628 0 1994 1994-04-05 95
5 washingtondc 38.88535 -77.038628 0 1995 1995-04-02 92
6 washingtondc 38.88535 -77.038628 0 1996 1996-04-04 95
7 washingtondc 38.88535 -77.038628 0 1997 1997-03-26 85
8 washingtondc 38.88535 -77.038628 0 1998 1998-03-27 86
9 washingtondc 38.88535 -77.038628 0 1999 1999-04-05 95
10 washingtondc 38.88535 -77.038628 0 2001 2001-04-06 96
11 washingtondc 38.88535 -77.038628 0 2002 2002-04-02 92
12 washingtondc 38.88535 -77.038628 0 2003 2003-04-02 92
13 washingtondc 38.88535 -77.038628 0 2004 2004-03-31 91
14 washingtondc 38.88535 -77.038628 0 2005 2005-04-09 99
15 washingtondc 38.88535 -77.038628 0 2006 2006-03-30 89
16 washingtondc 38.88535 -77.038628 0 2007 2007-04-01 91
17 washingtondc 38.88535 -77.038628 0 2008 2008-03-26 86
18 washingtondc 38.88535 -77.038628 0 2009 2009-04-01 91
19 washingtondc 38.88535 -77.038628 0 2010 2010-03-31 90
20 washingtondc 38.88535 -77.038628 0 2011 2011-03-29 88
21 washingtondc 38.88535 -77.038628 0 2012 2012-03-20 80
22 washingtondc 38.88535 -77.038628 0 2013 2013-04-09 99
23 washingtondc 38.88535 -77.038628 0 2014 2014-04-10 100
24 washingtondc 38.88535 -77.038628 0 2015 2015-04-10 100
25 washingtondc 38.88535 -77.038628 0 2016 2016-03-25 85
26 washingtondc 38.88535 -77.038628 0 2017 2017-03-25 84
27 washingtondc 38.88535 -77.038628 0 2018 2018-04-05 95
28 washingtondc 38.88535 -77.038628 0 2019 2019-04-01 91
29 washingtondc 38.88535 -77.038628 0 2020 2020-03-20 80
30 washingtondc 38.88535 -77.038628 0 2021 2021-03-28 87
31 washingtondc 38.88535 -77.038628 0 2022 2022-03-21 80
32 washingtondc 38.88535 -77.038628 0 2023 2023-03-23 82
33 washingtondc 38.88535 -77.038628 0 2024 2024-03-17 77

Now we have two dataframes:

wdc_weather_df, daily weather data wdc_bloomdates_df, with bloom dates for each year

Since we want to compare a bloom cycle, i.e from dormancy to bloom, we will count each year’s bloom as starting from the day after the previous bloom and ending with the current bloom.

temp_of_day_from_bloom_df:

  • rows: day of bloom (1 denotes first day after previous bloom)
  • column: cycle (year)

For example, if the column name is 1992_b, this means that this column the days leading up to the bloom in 1992. Thus the first row will take place in 1991, after the previous bloom 1991_b. The last value, prior to any NaN, is the bloom date.

Each column represents one cycle, from post-bloom dormancy to the next bloom.

We will also take the inverse of the dataframe, and call it temp_of_day_till_bloom_df:

  • rows: day till bloom (1 denotes one day prior to the current bloom, and 0 be the bloom itself.)
  • column: cycle (year)
bloom_dates = wdc_bloomdates_df['bloom_date']
temp_of_day_from_bloom_df = pd.DataFrame() 

for i in range(len(bloom_dates) - 1):
    data = wdc_weather_df[(wdc_weather_df['Date'] > bloom_dates[i]) & (wdc_weather_df['Date'] <= bloom_dates[i+1])]
    data.reset_index(drop=True, inplace=True)
    data["day_from_bloom"] = data.index + 1
    pivot_df = data.pivot_table(index='day_from_bloom', values='Temp_avg')
    pivot_df.columns=[str(bloom_dates[i+1].year)+'_b']
    temp_of_day_from_bloom_df = pd.concat([temp_of_day_from_bloom_df, pivot_df], axis=1)

temp_of_day_till_bloom_df = pd.DataFrame()
inv_temp_of_day_from_bloom_df = temp_of_day_from_bloom_df.iloc[::-1]

for col in inv_temp_of_day_from_bloom_df.columns:
    nan_count = inv_temp_of_day_from_bloom_df[col].isna().sum()

    # Remove NaN
    temp_of_day_till_bloom_df[col] = inv_temp_of_day_from_bloom_df[col].iloc[nan_count:].reset_index(drop=True)

temp_of_day_till_bloom_df = temp_of_day_till_bloom_df.head(330)
temp_of_day_till_bloom_df['days_till_bloom'] = temp_of_day_till_bloom_df.index
temp_of_day_till_bloom_df
1991_b 1992_b 1993_b 1994_b 1995_b 1996_b 1997_b 1998_b 1999_b 2001_b ... 2016_b 2017_b 2018_b 2019_b 2020_b 2021_b 2022_b 2023_b 2024_b days_till_bloom
0 36.7 57.1 53.3 55.8 42.7 56.6 52.5 65.7 51.1 58.1 ... 64.3 64.9 45.2 41.8 70.9 59.8 58.1 65.3 59.3 0
1 71.9 50.8 27.7 51.3 45.3 54.7 47.4 58.2 64.1 52.8 ... 64.6 47.2 55.5 55.3 56.3 61.5 55.6 56.0 56.5 1
2 59.5 45.8 43.0 59.1 47.7 45.5 42.3 43.1 58.2 51.7 ... 61.4 38.5 48.7 63.8 51.4 70.7 65.8 48.0 65.0 2
3 50.1 41.0 52.8 55.8 51.8 24.4 44.0 41.2 60.1 49.4 ... 49.3 47.2 45.8 59.3 51.1 58.6 54.2 41.1 61.5 3
4 51.8 38.4 47.8 51.8 48.3 52.3 54.8 42.0 58.3 48.4 ... 44.2 53.6 55.7 49.6 46.2 56.4 54.8 36.9 58.3 4
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
325 67.6 61.1 64.3 69.0 64.0 52.3 50.9 63.0 69.5 61.0 ... 72.7 56.2 66.0 73.9 66.9 52.0 64.2 65.1 60.5 325
326 63.0 50.4 60.4 62.7 61.3 35.4 55.9 60.4 58.0 62.0 ... 71.5 64.1 65.3 73.0 54.9 53.6 75.6 58.9 56.9 326
327 59.6 75.8 60.8 65.2 62.3 67.4 65.3 58.1 57.8 59.7 ... 69.3 62.1 53.5 68.9 60.0 66.3 73.9 57.1 53.4 327
328 44.3 77.3 46.3 67.7 60.5 65.4 63.4 66.3 53.5 65.6 ... 63.5 54.6 54.4 67.5 63.3 66.1 66.5 54.8 53.3 328
329 50.2 68.7 48.5 75.0 43.2 39.1 56.5 60.3 54.9 71.1 ... 57.6 53.7 55.9 66.8 65.4 61.6 57.0 49.3 57.4 329

330 rows × 34 columns

Now, let’s add some more complicated features onto the dataframe. We will look at functions of temperature, as it is theorized that bloom cycles occur after certain thresholds of temperature are met. The following functions were used as features:

All data is created and stored in the new dataframe, df.

Each feature name begins with the year of bloom, followed by the following suffix. For example, the 1991 bloom cycle will have the features:

  • 1991_b : daily average Temperature
  • 1991_b_sum_sq : sum of squared daily temperature, rolling over last 21 days
  • 1991_b_sum_log : sum of log daily temperature, rolling over last 21 days

… and so on. Below are definitions of all features

Feature 10
_sum_sq sum of squared daily temperature, rolling over last 21 days
_sum_log sum of log daily temperature, rolling over last 21 days
_sum_dd32 sum of number of days where temperature exceeds 32 F, rolling over last 21 days
_sum_dd50 sum of number of days where temperature exceeds 50 F, rolling over last 21 days
_sum_last14 sum of daily temperature, rolling over last 14 days
_norm_dd_32 sum of days where days above 32F are (+1) and days below are (-1), rolling over last 21 days
_norm_dd_50 sum of days where days above 50F are (+1) and days below are (-1), rolling over last 21 days
_rolling_freeze sum of number of days where temperature is below 32 F, rolling over last 21 days
_rolling_chill sum of number of days where temperature is between 32 F and 50 F, rolling over last 21 days
_rolling_temp_change temperature change between last 7 days
df = temp_of_day_till_bloom_df.copy()

# Functions of temperature for each bloom cycle
for col in df.columns:
    if col.endswith('_b'):
        df[f'{col}_sum_sq'] = df[col][::-1].pow(2).rolling(window=21, min_periods=1).sum()[::-1]
        df[f'{col}_sum_log'] = df[col][::-1].apply(lambda x: np.log(x + 1)).rolling(window=21, min_periods=1).sum()[::-1]
        df[f'{col}_sum_dd32'] = (df[col][::-1] > 32).rolling(window=21, min_periods=1).sum()[::-1]
        df[f'{col}_sum_dd50'] = (df[col][::-1] > 50).rolling(window=21, min_periods=1).sum()[::-1]
        df[f'{col}_sum_last14'] = df[col][::-1].rolling(window=14, min_periods=1).sum()[::-1]
        df[f'{col}_norm_dd_32'] = ((df[col] > 32).astype(int) * 1 + (df[col] <= 32).astype(int) * -1).rolling(window=21, min_periods=1).sum()[::-1]
        df[f'{col}_norm_dd_50'] = ((df[col] > 50).astype(int) * 1 + (df[col] <= 50).astype(int) * -1).rolling(window=21, min_periods=1).sum()[::-1]
        df[f'{col}_rolling_freeze'] = (df[col][::-1] < 32).rolling(window=21, min_periods=1).sum()[::-1]
        df[f'{col}_rolling_chill'] = ((df[col][::-1] >= 32) & (df[col][::-1] < 50)).rolling(window=21, min_periods=1).sum()[::-1]
        df[f'{col}_rolling_temp_change'] = df[col][::-1].diff(periods=7)[::-1]

cols = ['days_till_bloom'] + [c for c in df.columns if c != 'days_till_bloom']
df = df[cols]
df = df.head(80)
df
days_till_bloom 1991_b 1992_b 1993_b 1994_b 1995_b 1996_b 1997_b 1998_b 1999_b ... 2024_b_sum_sq 2024_b_sum_log 2024_b_sum_dd32 2024_b_sum_dd50 2024_b_sum_last14 2024_b_norm_dd_32 2024_b_norm_dd_50 2024_b_rolling_freeze 2024_b_rolling_chill 2024_b_rolling_temp_change
0 0 36.7 57.1 53.3 55.8 42.7 56.6 52.5 65.7 51.1 ... 60490.21 83.770057 21.0 15.0 770.8 1.0 1.0 0.0 6.0 13.3
1 1 71.9 50.8 27.7 51.3 45.3 54.7 47.4 58.2 64.1 ... 58387.48 83.323977 21.0 14.0 765.1 2.0 2.0 0.0 7.0 7.8
2 2 59.5 45.8 43.0 59.1 47.7 45.5 42.3 43.1 58.2 ... 57202.27 83.096476 21.0 13.0 756.0 3.0 3.0 0.0 8.0 12.9
3 3 50.1 41.0 52.8 55.8 51.8 24.4 44.0 41.2 60.1 ... 55139.52 82.767551 21.0 12.0 734.9 4.0 4.0 0.0 9.0 3.9
4 4 51.8 38.4 47.8 51.8 48.3 52.3 54.8 42.0 58.3 ... 53054.71 82.374805 21.0 11.0 714.6 5.0 5.0 0.0 10.0 5.4
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
75 75 36.6 28.1 30.6 7.9 42.0 23.9 38.0 43.8 44.3 ... 42792.07 80.265438 21.0 4.0 629.6 5.0 -19.0 0.0 17.0 -6.1
76 76 30.5 29.0 34.2 2.1 46.6 35.3 32.5 44.5 45.2 ... 42665.03 80.226121 21.0 4.0 627.4 7.0 -19.0 0.0 17.0 -6.4
77 77 13.3 36.8 45.2 18.3 42.4 24.9 33.2 56.7 48.7 ... 42521.03 80.182211 21.0 4.0 638.0 9.0 -19.0 0.0 17.0 -0.8
78 78 41.5 29.8 43.1 12.0 44.5 31.9 39.0 62.0 45.3 ... 43692.63 80.408414 21.0 5.0 643.8 11.0 -19.0 0.0 16.0 1.1
79 79 31.8 20.1 23.9 11.5 53.1 24.1 52.4 60.5 39.0 ... 43921.88 80.461788 21.0 5.0 645.5 13.0 -19.0 0.0 16.0 7.5

80 rows × 364 columns

Let’s see what our data looks like.

# Show raw data

plt.figure(figsize=(10, 10))

for year in temp_of_day_from_bloom_df.columns:
    bloom_cycle = df[year]
    plt.plot(bloom_cycle.index, bloom_cycle.values, label=str(year))

plt.xlabel('Number of Days Till Bloom')
plt.ylabel('Temperature in F')
plt.title('Temperature Over Bloom Cycles')
plt.legend(title="Year", bbox_to_anchor=(1, 1), loc='upper left')
plt.show()

window = 7

plt.figure(figsize=(10, 10))  

for year in temp_of_day_from_bloom_df.columns:
    bloom_cycle = df[year]
    smoothed = bloom_cycle.rolling(window=window).mean()  
    plt.plot(smoothed.index, smoothed.values, label=str(year))

plt.xlabel('Number of Days Till Bloom')
plt.ylabel('Temperature (Temp Avg), in F')
plt.title('Smoothed Temperature Change Over Bloom Cycles')
plt.legend(title="Year", bbox_to_anchor=(1.05, 1), loc='upper left')

plt.tight_layout()
plt.show()

Looking at continuous functions of temperature:

feature_suffixes_cont = ['_sum_sq', '_sum_log', '_sum_last14', '_rolling_temp_change']
for suffix in feature_suffixes_cont:
    cv = df[[col for col in df.columns if col.endswith(suffix)]]
    plt.figure(figsize=(10, 3)) 
    plt.xlabel('Number of Days Till Bloom')
    plt.xticks(np.arange(0, 115, 5))
    plt.title('Patterns of ' + suffix)
    plt.ylabel(suffix)
    for year in cv.columns:
        bloom_cycle = df[year]
        plt.plot(bloom_cycle.index,bloom_cycle.values)
    plt.show()

Now discrete functions of temperature:

feature_suffixes_disc = ['_sum_dd32', '_sum_dd50','_norm_dd_32', '_norm_dd_50','_rolling_freeze', '_rolling_chill',]
for suffix in feature_suffixes_disc:
    cv = df[[col for col in df.columns if col.endswith(suffix)]]
    plt.figure(figsize=(10, 3))
    plt.xlabel('Number of Days Till Bloom')
    plt.xticks(np.arange(0, 115, 5))
    plt.title('Patterns of ' + suffix)
    plt.ylabel(suffix)
    for year in cv.columns:
        bloom_cycle = df[year]
        plt.plot(bloom_cycle.index,bloom_cycle.values,
                 marker='o', linestyle='-', markersize=4, label=year
)
    plt.show()

Prediction

Now let’s define some methods that will help us experiment and predict on data.

# DEFINE METHODS FOR PREDICTING BLOOM FOR 2025

def extract_historical_feature_sets(df, feature_suffixes):
    """
    Returns a dict where keys are feature suffixes, and values are DF of those suffixes.
    """
    feature_sets = {}
    for suffix in feature_suffixes:
        historical_features = df[[col for col in df.columns if col.endswith(suffix)]]
        feature_sets[suffix] = historical_features
    return feature_sets

def predict_from_arrays(historical_feature_sets, test_sequences):
    """
    - historical_feature_sets: dict where keys are feature suffixes, and values are DF of those suffixes
    - test_sequences: dict where keys are feature suffixes, values are test numpy arrays.
        
    Returns:
        - results: dict with best-matching year, best days_till_bloom index, and distance for each feature type.
    """
    results = {}
    
    for suffix, test_seq in test_sequences.items():
        best_match = None
        best_distance = float('inf')
        best_days_till_bloom = None

        historical_data = historical_feature_sets[suffix]

        for year in historical_data.columns:
            hist_seq = historical_data[year].values.flatten()
                        
            # FastDTW over a sliding window
            for days_till_bloom in range(len(hist_seq) - len(test_seq) + 1):
                window = hist_seq[days_till_bloom: days_till_bloom + len(test_seq)]
                distance, _ = fastdtw(window, test_seq, dist=2)
                
                if distance < best_distance:
                    best_distance = distance
                    best_match = year
                    best_days_till_bloom = days_till_bloom
        
        results[suffix] = (best_match, best_days_till_bloom, best_distance)
        #print(f"Feature: {suffix}, Best Match: {best_match}, Days_till_bloom: {best_days_till_bloom}, Distance: {best_distance}")

    return results

# feature suffixes
feature_suffixes = ['_sum_sq', '_sum_log', '_sum_dd32', '_sum_dd50', '_sum_last14', 
    '_norm_dd_32', '_norm_dd_50','_rolling_freeze', '_rolling_chill', '_rolling_temp_change'
]

# historical feature sets
historical_feature_sets = extract_historical_feature_sets(df, feature_suffixes)

Here, the prediction function uses FastDTW to compare vectors of time series. It finds the lowest Euclidean norm (L2) between a given input and a historical data. This slice of temperature input is thought to be “sliding” across the time, checking for patterns and attempting to match it to a historical bloom cycle. After completing the computations, it returns the lowest L2 norm and what the implied days_till_bloom is. This can be used to predict where we currently are in the cycle, which intuitively allows us to estimate when it will bloom. This process was used due to the fact that future temperature data would not need to be modeled or predicted.

# BUILDING 2025 TEMPERATURE SET DATAFRAME FOR 5 LOCATIONS

wdc_weather2025 = pd.read_csv('data/wdc_weather2025.csv')
liestal_weather2025 = pd.read_csv('data/liestal_weather2025.csv')
kyoto_weather2025 = pd.read_csv('data/kyoto_weather2025.csv')
vancouver_weather2025 = pd.read_csv('data/vancouver_weather2025.csv')
nyc_weather2025 = pd.read_csv('data/nyc_weather2025.csv')

wdc_weather2025['Date'] = pd.to_datetime(wdc_weather2025['Date'])
liestal_weather2025['Date'] = pd.to_datetime(liestal_weather2025['Date'])
kyoto_weather2025['Date'] = pd.to_datetime(kyoto_weather2025['Date'])
vancouver_weather2025['Date'] = pd.to_datetime(vancouver_weather2025['Date'])
nyc_weather2025['Date'] = pd.to_datetime(nyc_weather2025['Date'])

wdc_weather2025.name = "Washington DC"
liestal_weather2025.name = "Liestal"
kyoto_weather2025.name = "Kyoto"
vancouver_weather2025.name = "Vancouver"
nyc_weather2025.name = "NYC"

all_locations = [wdc_weather2025,liestal_weather2025, kyoto_weather2025, vancouver_weather2025, nyc_weather2025]

This function computes the temperature prediction for all 5 locations:

data = []
for loc in all_locations:
    temp_2025_dc = loc.sort_values(by='Date', ascending=False)[['Temp_avg']]
    temp_2025_dc.columns=['2025_b']
    for col in temp_2025_dc.columns:
        if col.endswith('_b'):
            temp_2025_dc[f'{col}_sum_sq'] = temp_2025_dc[col][::-1].pow(2).rolling(window=21, min_periods=1).sum()[::-1]
            temp_2025_dc[f'{col}_sum_log'] = temp_2025_dc[col][::-1].apply(lambda x: np.log(x + 1)).rolling(window=21, min_periods=1).sum()[::-1]
            temp_2025_dc[f'{col}_sum_dd32'] = (temp_2025_dc[col][::-1] > 32).rolling(window=21, min_periods=1).sum()[::-1]
            temp_2025_dc[f'{col}_sum_dd50'] = (temp_2025_dc[col][::-1] > 50).rolling(window=21, min_periods=1).sum()[::-1]
            temp_2025_dc[f'{col}_sum_last14'] = temp_2025_dc[col][::-1].rolling(window=14, min_periods=1).sum()[::-1]
            temp_2025_dc[f'{col}_norm_dd_32'] = ((temp_2025_dc[col] > 32).astype(int) * 1 + (temp_2025_dc[col] <= 32).astype(int) * -1).rolling(window=21, min_periods=1).sum()[::-1]
            temp_2025_dc[f'{col}_norm_dd_50'] = ((temp_2025_dc[col] > 50).astype(int) * 1 + (temp_2025_dc[col] <= 50).astype(int) * -1).rolling(window=21, min_periods=1).sum()[::-1]
            temp_2025_dc[f'{col}_rolling_freeze'] = (temp_2025_dc[col][::-1] < 32).rolling(window=21, min_periods=1).sum()[::-1]
            temp_2025_dc[f'{col}_rolling_chill'] = ((temp_2025_dc[col][::-1] >= 32) & (temp_2025_dc[col][::-1] < 50)).rolling(window=21, min_periods=1).sum()[::-1]
            temp_2025_dc[f'{col}_rolling_temp_change'] = temp_2025_dc[col][::-1].diff(periods=7)[::-1]
    temp_2025_dc = temp_2025_dc.head(50)

    all_results = []
    intervals = [10,20,30,40]

    for interval in intervals:
        temp_2025_dc_df = temp_2025_dc.head(interval)

        test_sequences = {
            '_sum_sq': np.array(temp_2025_dc_df['2025_b_sum_sq']),
            '_sum_log': np.array(temp_2025_dc_df['2025_b_sum_log']),
            '_sum_dd32': np.array(temp_2025_dc_df['2025_b_sum_dd32']),
            '_sum_dd50': np.array(temp_2025_dc_df['2025_b_sum_dd50']),
            '_sum_last14': np.array(temp_2025_dc_df['2025_b_sum_last14']),
            '_norm_dd_32': np.array(temp_2025_dc_df['2025_b_norm_dd_32']),
            '_norm_dd_50': np.array(temp_2025_dc_df['2025_b_norm_dd_50']),
            '_rolling_freeze': np.array(temp_2025_dc_df['2025_b_rolling_freeze']),
            '_rolling_chill': np.array(temp_2025_dc_df['2025_b_rolling_chill']),
            '_rolling_temp_change': np.array(temp_2025_dc_df['2025_b_rolling_temp_change']),

        }

        all_results.append((interval, predict_from_arrays(historical_feature_sets, test_sequences)))

    #all_results
    print("Results for " + loc.name)

    for result in all_results:
        print(result)

    arrays = []
    for j in range(len(feature_suffixes)):
        arrays.append([])

    for i in range(4):
        for j in range(len(feature_suffixes)):
            arrays[j].append(all_results[i][1][feature_suffixes[j]][1])
    
    data.append((loc.name, arrays))
    
    print("----- ----- -----")
Results for Washington DC
(10, {'_sum_sq': ('1991_b_sum_sq', 24, 3852.680000000011), '_sum_log': ('2013_b_sum_log', 41, 1.0494271412236884), '_sum_dd32': ('1991_b_sum_dd32', 0, 0.0), '_sum_dd50': ('1992_b_sum_dd50', 31, 0.0), '_sum_last14': ('2005_b_sum_last14', 23, 54.799999999999955), '_norm_dd_32': ('1994_b_norm_dd_32', 0, 0.0), '_norm_dd_50': ('2018_b_norm_dd_50', 4, 0.0), '_rolling_freeze': ('1991_b_rolling_freeze', 0, 0.0), '_rolling_chill': ('1999_b_rolling_chill', 34, 0.0), '_rolling_temp_change': ('1997_b_rolling_temp_change', 59, 24.800000000000008)})
(20, {'_sum_sq': ('2013_b_sum_sq', 37, 13526.310000000023), '_sum_log': ('2005_b_sum_log', 33, 5.482217935625627), '_sum_dd32': ('1991_b_sum_dd32', 6, 0.0), '_sum_dd50': ('1998_b_sum_dd50', 17, 0.0), '_sum_last14': ('2024_b_sum_last14', 46, 127.79999999999995), '_norm_dd_32': ('1994_b_norm_dd_32', 0, 2.0), '_norm_dd_50': ('2018_b_norm_dd_50', 4, 2.0), '_rolling_freeze': ('1991_b_rolling_freeze', 6, 0.0), '_rolling_chill': ('2001_b_rolling_chill', 55, 4.0), '_rolling_temp_change': ('2011_b_rolling_temp_change', 38, 71.59999999999997)})
(30, {'_sum_sq': ('1991_b_sum_sq', 34, 18253.010000000024), '_sum_log': ('2009_b_sum_log', 22, 8.464229549028587), '_sum_dd32': ('1991_b_sum_dd32', 1, 7.0), '_sum_dd50': ('1991_b_sum_dd50', 36, 0.0), '_sum_last14': ('2009_b_sum_last14', 23, 191.9999999999996), '_norm_dd_32': ('1991_b_norm_dd_32', 1, 7.0), '_norm_dd_50': ('1993_b_norm_dd_50', 2, 5.0), '_rolling_freeze': ('1991_b_rolling_freeze', 1, 7.0), '_rolling_chill': ('1997_b_rolling_chill', 27, 4.0), '_rolling_temp_change': ('2005_b_rolling_temp_change', 33, 124.59999999999995)})
(40, {'_sum_sq': ('1991_b_sum_sq', 35, 32586.250000000022), '_sum_log': ('2005_b_sum_log', 22, 11.476110915306847), '_sum_dd32': ('1999_b_sum_dd32', 40, 14.0), '_sum_dd50': ('1991_b_sum_dd50', 36, 0.0), '_sum_last14': ('2005_b_sum_last14', 24, 321.99999999999966), '_norm_dd_32': ('2009_b_norm_dd_32', 0, 8.0), '_norm_dd_50': ('2001_b_norm_dd_50', 0, 6.0), '_rolling_freeze': ('1999_b_rolling_freeze', 40, 15.0), '_rolling_chill': ('1997_b_rolling_chill', 27, 9.0), '_rolling_temp_change': ('2019_b_rolling_temp_change', 20, 182.9)})
----- ----- -----
Results for Liestal
(10, {'_sum_sq': ('2002_b_sum_sq', 66, 2793.6500000000015), '_sum_log': ('2004_b_sum_log', 29, 0.779998276216844), '_sum_dd32': ('1997_b_sum_dd32', 27, 0.0), '_sum_dd50': ('1991_b_sum_dd50', 53, 0.0), '_sum_last14': ('2002_b_sum_last14', 69, 30.700000000000045), '_norm_dd_32': ('1992_b_norm_dd_32', 0, 0.0), '_norm_dd_50': ('1996_b_norm_dd_50', 10, 0.0), '_rolling_freeze': ('1997_b_rolling_freeze', 27, 0.0), '_rolling_chill': ('2013_b_rolling_chill', 31, 0.0), '_rolling_temp_change': ('2001_b_rolling_temp_change', 18, 26.899999999999988)})
(20, {'_sum_sq': ('1999_b_sum_sq', 18, 10711.30000000001), '_sum_log': ('2008_b_sum_log', 45, 2.978655652284772), '_sum_dd32': ('1997_b_sum_dd32', 27, 0.0), '_sum_dd50': ('1993_b_sum_dd50', 58, 0.0), '_sum_last14': ('1998_b_sum_last14', 43, 80.00000000000034), '_norm_dd_32': ('2002_b_norm_dd_32', 0, 2.0), '_norm_dd_50': ('1992_b_norm_dd_50', 4, 4.0), '_rolling_freeze': ('1997_b_rolling_freeze', 27, 0.0), '_rolling_chill': ('2013_b_rolling_chill', 29, 3.0), '_rolling_temp_change': ('2013_b_rolling_temp_change', 23, 55.79999999999997)})
(30, {'_sum_sq': ('2013_b_sum_sq', 28, 19976.200000000015), '_sum_log': ('2008_b_sum_log', 22, 7.369746634481103), '_sum_dd32': ('1997_b_sum_dd32', 24, 0.0), '_sum_dd50': ('1993_b_sum_dd50', 48, 0.0), '_sum_last14': ('1998_b_sum_last14', 36, 213.60000000000014), '_norm_dd_32': ('2002_b_norm_dd_32', 0, 2.0), '_norm_dd_50': ('1993_b_norm_dd_50', 4, 4.0), '_rolling_freeze': ('1997_b_rolling_freeze', 23, 0.0), '_rolling_chill': ('2013_b_rolling_chill', 26, 7.0), '_rolling_temp_change': ('2013_b_rolling_temp_change', 23, 99.09999999999997)})
(40, {'_sum_sq': ('2005_b_sum_sq', 18, 17272.089999999997), '_sum_log': ('2005_b_sum_log', 18, 10.720291321811246), '_sum_dd32': ('1997_b_sum_dd32', 17, 4.0), '_sum_dd50': ('1994_b_sum_dd50', 22, 0.0), '_sum_last14': ('2005_b_sum_last14', 22, 395.2999999999997), '_norm_dd_32': ('2002_b_norm_dd_32', 0, 8.0), '_norm_dd_50': ('1994_b_norm_dd_50', 14, 7.0), '_rolling_freeze': ('1997_b_rolling_freeze', 27, 3.0), '_rolling_chill': ('2013_b_rolling_chill', 16, 16.0), '_rolling_temp_change': ('2013_b_rolling_temp_change', 21, 147.8)})
----- ----- -----
Results for Kyoto
(10, {'_sum_sq': ('1997_b_sum_sq', 35, 3806.989999999998), '_sum_log': ('1997_b_sum_log', 35, 0.9963614374403846), '_sum_dd32': ('1997_b_sum_dd32', 0, 0.0), '_sum_dd50': ('1991_b_sum_dd50', 53, 0.0), '_sum_last14': ('2013_b_sum_last14', 31, 29.999999999999773), '_norm_dd_32': ('1992_b_norm_dd_32', 0, 0.0), '_norm_dd_50': ('1996_b_norm_dd_50', 10, 0.0), '_rolling_freeze': ('1997_b_rolling_freeze', 0, 0.0), '_rolling_chill': ('1998_b_rolling_chill', 38, 0.0), '_rolling_temp_change': ('2024_b_rolling_temp_change', 68, 21.599999999999994)})
(20, {'_sum_sq': ('1991_b_sum_sq', 32, 11540.369999999988), '_sum_log': ('2012_b_sum_log', 48, 2.9421979249935077), '_sum_dd32': ('1997_b_sum_dd32', 0, 0.0), '_sum_dd50': ('1991_b_sum_dd50', 53, 0.0), '_sum_last14': ('2013_b_sum_last14', 31, 54.19999999999993), '_norm_dd_32': ('1995_b_norm_dd_32', 0, 0.0), '_norm_dd_50': ('1992_b_norm_dd_50', 4, 4.0), '_rolling_freeze': ('1997_b_rolling_freeze', 0, 0.0), '_rolling_chill': ('1998_b_rolling_chill', 37, 1.0), '_rolling_temp_change': ('2013_b_rolling_temp_change', 28, 59.70000000000001)})
(30, {'_sum_sq': ('1999_b_sum_sq', 19, 20571.660000000007), '_sum_log': ('2012_b_sum_log', 47, 4.449839565681472), '_sum_dd32': ('1998_b_sum_dd32', 17, 0.0), '_sum_dd50': ('1992_b_sum_dd50', 45, 0.0), '_sum_last14': ('1998_b_sum_last14', 37, 136.39999999999986), '_norm_dd_32': ('1997_b_norm_dd_32', 0, 0.0), '_norm_dd_50': ('1993_b_norm_dd_50', 4, 4.0), '_rolling_freeze': ('1998_b_rolling_freeze', 17, 0.0), '_rolling_chill': ('1998_b_rolling_chill', 28, 11.0), '_rolling_temp_change': ('2001_b_rolling_temp_change', 1, 87.29999999999998)})
(40, {'_sum_sq': ('1999_b_sum_sq', 30, 51632.58000000003), '_sum_log': ('1999_b_sum_log', 30, 11.255820258773852), '_sum_dd32': ('1998_b_sum_dd32', 26, 0.0), '_sum_dd50': ('1993_b_sum_dd50', 12, 0.0), '_sum_last14': ('1999_b_sum_last14', 14, 332.0999999999991), '_norm_dd_32': ('1997_b_norm_dd_32', 0, 0.0), '_norm_dd_50': ('1993_b_norm_dd_50', 4, 4.0), '_rolling_freeze': ('1998_b_rolling_freeze', 17, 0.0), '_rolling_chill': ('1998_b_rolling_chill', 19, 29.0), '_rolling_temp_change': ('1998_b_rolling_temp_change', 18, 118.39999999999992)})
----- ----- -----
Results for Vancouver
(10, {'_sum_sq': ('1999_b_sum_sq', 69, 3539.059999999994), '_sum_log': ('2004_b_sum_log', 33, 1.8825529727395605), '_sum_dd32': ('1991_b_sum_dd32', 52, 0.0), '_sum_dd50': ('1991_b_sum_dd50', 53, 0.0), '_sum_last14': ('2016_b_sum_last14', 24, 41.80000000000041), '_norm_dd_32': ('1992_b_norm_dd_32', 0, 0.0), '_norm_dd_50': ('1996_b_norm_dd_50', 10, 0.0), '_rolling_freeze': ('1991_b_rolling_freeze', 52, 0.0), '_rolling_chill': ('1991_b_rolling_chill', 53, 0.0), '_rolling_temp_change': ('2010_b_rolling_temp_change', 13, 21.50000000000001)})
(20, {'_sum_sq': ('2003_b_sum_sq', 16, 10951.32), '_sum_log': ('1997_b_sum_log', 42, 3.912478765509377), '_sum_dd32': ('1991_b_sum_dd32', 47, 2.0), '_sum_dd50': ('1991_b_sum_dd50', 53, 0.0), '_sum_last14': ('1993_b_sum_last14', 0, 205.39999999999975), '_norm_dd_32': ('1994_b_norm_dd_32', 0, 10.0), '_norm_dd_50': ('1992_b_norm_dd_50', 4, 4.0), '_rolling_freeze': ('1991_b_rolling_freeze', 47, 2.0), '_rolling_chill': ('2013_b_rolling_chill', 52, 0.0), '_rolling_temp_change': ('2010_b_rolling_temp_change', 13, 46.600000000000016)})
(30, {'_sum_sq': ('2010_b_sum_sq', 22, 15917.249999999978), '_sum_log': ('2010_b_sum_log', 24, 6.122629691371159), '_sum_dd32': ('2016_b_sum_dd32', 44, 11.0), '_sum_dd50': ('1992_b_sum_dd50', 45, 0.0), '_sum_last14': ('2022_b_sum_last14', 26, 221.99999999999977), '_norm_dd_32': ('1993_b_norm_dd_32', 0, 14.0), '_norm_dd_50': ('1993_b_norm_dd_50', 4, 4.0), '_rolling_freeze': ('2016_b_rolling_freeze', 44, 11.0), '_rolling_chill': ('2013_b_rolling_chill', 49, 5.0), '_rolling_temp_change': ('2021_b_rolling_temp_change', 45, 80.49999999999994)})
(40, {'_sum_sq': ('2010_b_sum_sq', 22, 20165.759999999973), '_sum_log': ('2010_b_sum_log', 24, 10.418869760718167), '_sum_dd32': ('2022_b_sum_dd32', 38, 9.0), '_sum_dd50': ('1993_b_sum_dd50', 12, 0.0), '_sum_last14': ('2010_b_sum_last14', 20, 231.6000000000007), '_norm_dd_32': ('1994_b_norm_dd_32', 9, 37.0), '_norm_dd_50': ('1993_b_norm_dd_50', 4, 4.0), '_rolling_freeze': ('2022_b_rolling_freeze', 38, 8.0), '_rolling_chill': ('2023_b_rolling_chill', 10, 5.0), '_rolling_temp_change': ('2010_b_rolling_temp_change', 12, 112.39999999999998)})
----- ----- -----
Results for NYC
(10, {'_sum_sq': ('2011_b_sum_sq', 42, 4121.240000000002), '_sum_log': ('2016_b_sum_log', 49, 0.9733589713594881), '_sum_dd32': ('2019_b_sum_dd32', 53, 0.0), '_sum_dd50': ('1991_b_sum_dd50', 53, 0.0), '_sum_last14': ('2022_b_sum_last14', 39, 58.699999999999875), '_norm_dd_32': ('1994_b_norm_dd_32', 41, 5.0), '_norm_dd_50': ('1996_b_norm_dd_50', 10, 0.0), '_rolling_freeze': ('2019_b_rolling_freeze', 53, 0.0), '_rolling_chill': ('2012_b_rolling_chill', 41, 0.0), '_rolling_temp_change': ('2012_b_rolling_temp_change', 52, 30.899999999999995)})
(20, {'_sum_sq': ('1993_b_sum_sq', 8, 14104.629999999997), '_sum_log': ('1996_b_sum_log', 20, 7.187871515130922), '_sum_dd32': ('2014_b_sum_dd32', 36, 6.0), '_sum_dd50': ('1991_b_sum_dd50', 53, 0.0), '_sum_last14': ('1993_b_sum_last14', 11, 122.59999999999991), '_norm_dd_32': ('1993_b_norm_dd_32', 0, 20.0), '_norm_dd_50': ('1992_b_norm_dd_50', 4, 4.0), '_rolling_freeze': ('2014_b_rolling_freeze', 36, 6.0), '_rolling_chill': ('2004_b_rolling_chill', 23, 5.0), '_rolling_temp_change': ('2023_b_rolling_temp_change', 40, 62.999999999999986)})
(30, {'_sum_sq': ('1993_b_sum_sq', 7, 20540.860000000004), '_sum_log': ('2010_b_sum_log', 50, 14.727854823709848), '_sum_dd32': ('2005_b_sum_dd32', 28, 7.0), '_sum_dd50': ('1992_b_sum_dd50', 45, 0.0), '_sum_last14': ('1993_b_sum_last14', 9, 280.19999999999965), '_norm_dd_32': ('1993_b_norm_dd_32', 0, 44.0), '_norm_dd_50': ('1993_b_norm_dd_50', 4, 4.0), '_rolling_freeze': ('2014_b_rolling_freeze', 36, 7.0), '_rolling_chill': ('2004_b_rolling_chill', 16, 7.0), '_rolling_temp_change': ('2005_b_rolling_temp_change', 33, 105.79999999999998)})
(40, {'_sum_sq': ('1993_b_sum_sq', 8, 23326.34), '_sum_log': ('1993_b_sum_log', 5, 22.32314793928245), '_sum_dd32': ('2005_b_sum_dd32', 31, 9.0), '_sum_dd50': ('1993_b_sum_dd50', 12, 0.0), '_sum_last14': ('1993_b_sum_last14', 11, 408.5999999999997), '_norm_dd_32': ('1996_b_norm_dd_32', 0, 36.0), '_norm_dd_50': ('1993_b_norm_dd_50', 4, 4.0), '_rolling_freeze': ('2009_b_rolling_freeze', 26, 15.0), '_rolling_chill': ('2004_b_rolling_chill', 9, 14.0), '_rolling_temp_change': ('2005_b_rolling_temp_change', 33, 175.00000000000003)})
----- ----- -----

After computing the estimated days_till_bloom for differing interval lengths, we can compare the estimated days. There are many outliers in the prediction, and the predictions can be wildly different at times. The chosen dates were picked in a non-rigorous, intuitive approach. While this may not be an ideal method for repeatability, for this specific prediction it might be okay. That’s because many of the high variance points are very unlikely; for example, points like 0 or 50 mean that the model thinks the trees are blooming today or in 50 days. We can reason intuitively that neither of these cases are/will happen, based on historical blooms. Thus, the median values we are left with are used to create the prediction and prediction intervals.

Table for Washington DC

  • columns represent the size of the testing interval.
  • Values represent the predicted days_till_bloom
Feature 10 20 30 40
_sum_sq 24 37 34 32
_sum_log 41 33 22 21
_sum_dd32 0 6 1 40
_sum_dd50 31 17 36 18

…and more values.

In the end, the final predictions for bloom dates (day of year) are:

location prediction lower upper
washingtondc 88 84 91
liestal 85 83 90
kyoto 91 88 97
vancouver 88 84 94
newyorkcity 95 93 97

Translating from day of year to actual dates (in the year 2025):

location prediction lower upper
washingtondc MAR 29 MAR 25 APR 1
liestal MAR 26 MAR 24 MAR 31
kyoto APR 1 MAR 29 APR 7
vancouver MAR 29 MAR 25 APR 4
newyorkcity APR 5 APR 3 APR 7

Limitations

While the initial results seem like they have potential, it is important to mention the many limitations of this modeling.

  • Data set only trained on Washington DC temperatures; may not be comparable to other countries
  • Training on more years past 1990
  • Differing cycle lengths may impact prediction
  • Low historical training data for places like NYC
  • Different breeds of cherry blossoms might have different cycles and cycle patterns
  • Not accounting for special impactful events (storms, dry spells, etc)
  • Likely better results using other variables alongside temperature (emissions, sunlight, rainfall)
  • Predictions and predictions intervals not chosen using empirical and repeatable method