Title: | Regression Analysis and Forecasting Using Textual Data from a Time-Varying Dictionary |
---|---|
Description: | Provides functionalities based on the paper "Time Varying Dictionary and the Predictive Power of FED Minutes" (Lima, 2018) <doi:10.2139/ssrn.3312483>. It selects the most predictive terms, which we call the time-varying dictionary, using supervised machine learning techniques such as lasso and elastic net. |
Authors: | Luiz Renato Lima [aut], Lucas Godeiro [aut, cre] |
Maintainer: | Lucas Godeiro <[email protected]> |
License: | GPL-3 |
Version: | 0.1.2 |
Built: | 2024-10-21 04:48:09 UTC |
Source: | https://github.com/lucasgodeiro/textforecast |
get_collocations function
get_collocations(corpus_dates, path_name, ntrms, ngrams_number, min_freq, language)
corpus_dates |
a character vector indicating the subfolders where the texts are located. |
path_name |
the folders path where the subfolders with the dates are located. |
ntrms |
the maximum number of collocations to keep after tf-idf filtering. We rank the collocations by tf-idf in decreasing order and then select the ntrms collocations with the highest tf-idf. |
ngrams_number |
an integer indicating the size of the collocations. Defaults to 2 (bigrams); if set to 3, both bigrams and trigrams are computed. |
min_freq |
an integer indicating the minimum number of times a collocation must occur in the data in order to be returned. |
language |
the language of the texts. Defaults to English. |
a list containing a sparse matrix with the counts of all collocations and another with the collocation counts filtered by tf-idf according to ntrms.
st_year = 2017
end_year = 2018
path_name = system.file("news", package = "TextForecast")
qt = paste0(sort(rep(seq(from = st_year, to = end_year, by = 1), 12)),
            c("m1","m2","m3","m4","m5","m6","m7","m8","m9","m10","m11","m12"))
z_coll = get_collocations(corpus_dates = qt[1:23], path_name = path_name,
                          ntrms = 500, ngrams_number = 3, min_freq = 10)

path_name = system.file("news", package = "TextForecast")
days = c("2019-30-01", "2019-31-01")
z_coll = get_collocations(corpus_dates = days[1], path_name = path_name,
                          ntrms = 500, ngrams_number = 3, min_freq = 1)
get_terms function
get_terms(corpus_dates, ntrms_words, st, path.name, ntrms_collocation, ngrams_number, min_freq, language)
corpus_dates |
a character vector indicating the subfolders where the texts are located. |
ntrms_words |
the maximum number of words to keep after tf-idf filtering. We rank the words by tf-idf in decreasing order and then select the ntrms_words words with the highest tf-idf. |
st |
set to 0 to stem the words and to 1 otherwise. |
path.name |
the folders path where the subfolders with the dates are located. |
ntrms_collocation |
the maximum number of collocations to keep after tf-idf filtering. We rank the collocations by tf-idf in decreasing order and then select the ntrms_collocation collocations with the highest tf-idf. |
ngrams_number |
an integer indicating the size of the collocations. Defaults to 2 (bigrams); if set to 3, both bigrams and trigrams are computed. |
min_freq |
an integer indicating the minimum number of times a collocation must occur in the data in order to be returned. |
language |
the language of the texts. Defaults to English. |
a list containing a sparse matrix with the counts of all words and collocations and another with the word and collocation counts filtered by tf-idf according to the ntrms arguments.
st_year = 2017
end_year = 2018
path_name = system.file("news", package = "TextForecast")
qt = paste0(sort(rep(seq(from = st_year, to = end_year, by = 1), 12)),
            c("m1","m2","m3","m4","m5","m6","m7","m8","m9","m10","m11","m12"))
z_terms = get_terms(corpus_dates = qt[1:23], path.name = path_name,
                    ntrms_words = 500, ngrams_number = 3, st = 0,
                    ntrms_collocation = 500, min_freq = 10)

path_name = system.file("news", package = "TextForecast")
days = c("2019-30-01", "2019-31-01")
z_terms = get_terms(corpus_dates = days[1], path.name = path_name,
                    ntrms_words = 500, ngrams_number = 3, st = 0,
                    ntrms_collocation = 500, min_freq = 1)
get_words function
get_words(corpus_dates, ntrms, st, path_name, language)
corpus_dates |
a character vector indicating the subfolders where the texts are located. |
ntrms |
the maximum number of words to keep after tf-idf filtering. We rank the words by tf-idf in decreasing order and then select the ntrms words with the highest tf-idf. |
st |
set to 0 to stem the words and to 1 otherwise. |
path_name |
the folders path where the subfolders with the dates are located. |
language |
the language of the texts. |
a list containing a sparse matrix with the counts of all words and another with the word counts filtered by tf-idf according to ntrms.
st_year = 2017
end_year = 2018
path_name = system.file("news", package = "TextForecast")
qt = paste0(sort(rep(seq(from = st_year, to = end_year, by = 1), 12)),
            c("m1","m2","m3","m4","m5","m6","m7","m8","m9","m10","m11","m12"))
z_wrd = get_words(corpus_dates = qt[1:23], path_name = path_name, ntrms = 500, st = 0)

path_name = system.file("news", package = "TextForecast")
days = c("2019-31-01", "2019-31-01")
z_wrd = get_words(corpus_dates = days, path_name = path_name, ntrms = 500, st = 0)
hard thresholding
hard_thresholding(x, w, y, p_value, newx)
x |
the input matrix x. |
w |
an optional input matrix w of variables that are not subject to selection. |
y |
the response variable. |
p_value |
the threshold p-value. |
newx |
the matrix to which the selection will be applied. Useful for time series, when we need the observation at time t. |
the variables whose p-values are below the threshold.
data("stock_data")
data("optimal_factors")
y = as.matrix(stock_data[, 2])
y = as.vector(y)
w = as.matrix(stock_data[, 3])
pc = as.matrix(optimal_factors)
t = length(y)
news_factor <- hard_thresholding(w = w[1:(t - 1), ], x = pc[1:(t - 1), ],
                                 y = y[2:t], p_value = 0.01, newx = pc)
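The selection rule can be sketched in base R: regress y on each candidate column of x (together with the always-included w) and keep the columns whose slope p-value falls below the threshold. The helper hard_threshold_sketch and the simulated data below are illustrative only, not the package's exact implementation.

```r
# Illustrative hard thresholding: keep columns of x whose slope p-value
# (from a regression of y on that column plus w) is below p_value.
hard_threshold_sketch <- function(x, w, y, p_value) {
  keep <- vapply(seq_len(ncol(x)), function(j) {
    fit <- summary(lm(y ~ x[, j] + w))
    fit$coefficients[2, 4] < p_value   # p-value of the slope on x[, j]
  }, logical(1))
  x[, keep, drop = FALSE]
}

set.seed(1)
n <- 200
w <- rnorm(n)
x <- cbind(signal = rnorm(n), noise1 = rnorm(n), noise2 = rnorm(n))
y <- 2 * x[, "signal"] + 0.5 * w + rnorm(n)
sel <- hard_threshold_sketch(x, w, y, p_value = 0.01)
colnames(sel)   # the informative "signal" column survives the threshold
```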
A simple tibble
containing the term counts of financial news from the Wall Street Journal and the New York Times from 1992:01 through 2018:11.
news_data
A tibble with 1631 components.
The vector of dates.
The term counts.
optimal alphas function
optimal_alphas(x, w, y, grid_alphas, cont_folds, family)
x |
A matrix of variables to be selected by shrinkage methods. |
w |
A matrix or vector of variables that cannot be selected (no shrinkage). |
y |
response variable. |
grid_alphas |
a grid of alphas between 0 and 1. |
cont_folds |
Set TRUE for contiguous folds, used for time-dependent data. |
family |
The glmnet family. |
lambdas_opt: a vector with the optimal alpha and lambda.
set.seed(1)
data("stock_data")
data("news_data")
y = as.matrix(stock_data[1:200, 2])
w = as.matrix(stock_data[1:200, 3])
X = news_data[1:200, 2:ncol(news_data)]
x = as.matrix(X)
grid_alphas = seq(by = 0.25, to = 1, from = 0.5)
cont_folds = TRUE
t = length(y)
optimal_alphas = optimal_alphas(x[1:(t - 1), ], w[1:(t - 1), ], y[2:t],
                                grid_alphas, TRUE, "gaussian")
A simple vector containing the optimal factors selected by the optimal_number_factors function.
optimal_factors
A vector with 1 component.
The vector of factors.
optimal number of factors function
optimal_number_factors(x, kmax)
x |
a matrix x. |
kmax |
the maximum number of factors. |
a list with the optimal number of factors.
data("optimal_x")
optimal_factor <- optimal_number_factors(x = optimal_x, kmax = 8)
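One way such a selection rule can work (a rough sketch in the spirit of the Bai and Ng (2002) information criteria, not necessarily the package's exact criterion): reconstruct x with a k-factor principal-component approximation and trade off the residual variance against a penalty that grows with k. The helper optimal_k_sketch and the simulated data are illustrative.

```r
# Hypothetical sketch: pick the number of factors k minimizing an
# information criterion log(V(k)) + penalty(k), where V(k) is the mean
# squared residual of a k-factor principal-component approximation.
optimal_k_sketch <- function(x, kmax) {
  x <- scale(x)
  n <- nrow(x); p <- ncol(x)
  pc <- prcomp(x)
  ic <- vapply(seq_len(kmax), function(k) {
    fitted <- pc$x[, 1:k, drop = FALSE] %*% t(pc$rotation[, 1:k, drop = FALSE])
    v <- mean((x - fitted)^2)                        # residual variance V(k)
    log(v) + k * (n + p) / (n * p) * log(min(n, p))  # Bai-Ng style penalty
  }, numeric(1))
  which.min(ic)
}

set.seed(1)
f <- matrix(rnorm(100 * 2), 100, 2)                    # two true factors
lam <- matrix(rnorm(2 * 20), 2, 20)                    # factor loadings
x_sim <- f %*% lam + matrix(rnorm(100 * 20, sd = 0.5), 100, 20)
k_hat <- optimal_k_sketch(x_sim, kmax = 8)
```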
A simple matrix
containing the optimal words selected by Elastic Net from 1992:01 through 2018:11.
optimal_x
A matrix with the most predictive terms.
A matrix with 4 components.
A simple tibble
containing the S&P 500 return and the VIX volatility index from 1992:01 through 2018:11.
stock_data
A tibble with 3 components.
The vector of dates.
The S&P 500 returns.
The volatility index.
Text Forecast function
text_forecast(x, y, h, intercept)
x |
the input matrix x. |
y |
the response variable |
h |
the forecast horizon |
intercept |
set TRUE to include an intercept in the forecast equation. |
The h-step-ahead forecast.
set.seed(1)
data("stock_data")
data("news_data")
data("optimal_factors")
y = as.matrix(stock_data[, 2])
w = as.matrix(stock_data[, 3])
pc = optimal_factors
z = cbind(w, pc)
fcsts = text_forecast(z, y, 1, TRUE)
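The forecast equation amounts to a direct h-step-ahead OLS regression: the response at time t + h is regressed on the predictors at time t, and the fitted equation is applied to the most recent observation. A minimal base-R sketch (direct_forecast_sketch and its simulated data are illustrative, not the package's code):

```r
# Illustrative direct h-step-ahead OLS forecast.
direct_forecast_sketch <- function(x, y, h, intercept = TRUE) {
  x <- as.matrix(x)
  colnames(x) <- paste0("x", seq_len(ncol(x)))
  t <- length(y)
  # regress y at time t + h on the predictors at time t
  df <- data.frame(yr = y[(1 + h):t], x[1:(t - h), , drop = FALSE])
  fit <- if (intercept) lm(yr ~ ., data = df) else lm(yr ~ . - 1, data = df)
  # apply the fitted equation to the last available observation of x
  predict(fit, newdata = as.data.frame(x[t, , drop = FALSE]))
}

set.seed(1)
x_sim <- matrix(rnorm(120), ncol = 2)
y_sim <- 0.5 * x_sim[, 1] + rnorm(60, sd = 0.1)
fc <- direct_forecast_sketch(x_sim, y_sim, h = 1)
```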
text nowcast
text_nowcast(x, y, intercept)
x |
the input matrix x. It should have one more observation than y. |
y |
the response variable |
intercept |
set TRUE to include an intercept in the forecast equation. |
the nowcast (h = 0) of the variable y.
set.seed(1)
data("stock_data")
data("news_data")
data("optimal_factors")
y = as.matrix(stock_data[, 2])
w = as.matrix(stock_data[, 3])
pc = optimal_factors
z = cbind(w, pc)
t = length(y)
ncsts = text_nowcast(z, y[1:(t - 1)], TRUE)
tf-idf function
tf_idf(x)
x |
an input matrix x of term counts. |
a list with the tf-idf of the terms and the tf-idf of the terms sorted in descending order.
data("news_data")
X = as.matrix(news_data[, 2:ncol(news_data)])
tf_idf_terms = tf_idf(X)
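For reference, the standard computation can be sketched in base R: tf is a term's frequency within a document, and idf = log(N / df_j), where df_j is the number of documents containing term j. The helper tf_idf_sketch and the toy counts are illustrative; the package's exact weighting may differ.

```r
# Illustrative tf-idf on a document-term count matrix.
tf_idf_sketch <- function(counts) {
  tf <- counts / pmax(rowSums(counts), 1)    # term frequency per document
  df <- colSums(counts > 0)                  # document frequency per term
  idf <- log(nrow(counts) / pmax(df, 1))     # inverse document frequency
  tfidf <- sweep(tf, 2, idf, `*`)
  ranking <- sort(colSums(tfidf), decreasing = TRUE)  # terms in descending order
  list(tf_idf = tfidf, ranking = ranking)
}

counts <- rbind(doc1 = c(rare = 3, common = 5),
                doc2 = c(rare = 0, common = 4))
res <- tf_idf_sketch(counts)
names(res$ranking)[1]   # "rare": it appears in only one of the two documents
```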
Top Terms Function
top_terms(x, w, y, alpha, lambda, k, wordcloud, max.words, scale, rot.per, family)
x |
the input matrix of terms to be selected. |
w |
optional argument: the input matrix of structured data that is not subject to selection. |
y |
the response variable |
alpha |
the glmnet alpha |
lambda |
the glmnet lambda |
k |
the k top terms |
wordcloud |
set TRUE to plot the wordcloud |
max.words |
the maximum number of words in the wordcloud |
scale |
the wordcloud size. |
rot.per |
the proportion of words plotted at 90 degrees in the wordcloud. |
family |
glmnet family |
the top k terms and the corresponding wordcloud.
set.seed(1)
data("stock_data")
data("news_data")
y = as.matrix(stock_data[, 2])
w = as.matrix(stock_data[, 3])
X = news_data[, 2:ncol(news_data)]
x = as.matrix(X)
grid_alphas = seq(by = 0.05, to = 0.95, from = 0.05)
cont_folds = TRUE
t = length(y)
optimal_alphas = optimal_alphas(x[1:(t - 1), ], w[1:(t - 1), ],
                                y[2:t], grid_alphas, TRUE, "gaussian")
top_trms <- top_terms(x[1:(t - 1), ], w[1:(t - 1), ], y[2:t],
                      optimal_alphas[[1]], optimal_alphas[[2]], 10, TRUE,
                      10, c(2, 0.3), 0.15, "gaussian")
tv dictionary function
tv_dictionary(x, w, y, alpha, lambda, newx, family)
x |
A matrix of variables to be selected by shrinkage methods. |
w |
Optional argument: a matrix of variables that cannot be selected (no shrinkage). |
y |
the response variable. |
alpha |
the alpha required in glmnet. |
lambda |
the lambda required in glmnet. |
newx |
the matrix to which the selection will be applied. Useful for time series, when we need the observation at time t. |
family |
the glmnet family. |
X_star: a list with the coefficients and a sparse matrix with the most predictive terms.
set.seed(1)
data("stock_data")
data("news_data")
y = as.matrix(stock_data[1:200, 2])
w = as.matrix(stock_data[1:200, 3])
X = news_data[1:200, 2:ncol(news_data)]
x = as.matrix(X)
grid_alphas = seq(by = 0.5, to = 1, from = 0.5)
cont_folds = TRUE
t = length(y)
optimal_alphas = optimal_alphas(x[1:(t - 1), ], w[1:(t - 1), ],
                                y[2:t], grid_alphas, TRUE, "gaussian")
x_star = tv_dictionary(x = x[1:(t - 1), ], w = w[1:(t - 1), ], y = y[2:t],
                       alpha = optimal_alphas[1], lambda = optimal_alphas[2],
                       newx = x, family = "gaussian")
tv sentiment index function
tv_sentiment_index(x, w, y, alpha, lambda, newx, family, k)
x |
A matrix of variables to be selected by shrinkage methods. |
w |
Optional argument: a matrix of variables that cannot be selected (no shrinkage). |
y |
the response variable. |
alpha |
the alpha required in glmnet. |
lambda |
the lambda required in glmnet. |
newx |
the matrix to which the selection will be applied. Useful for time series, when we need the observation at time t. |
family |
the glmnet family. |
k |
the number of highest positive and negative coefficients to be used. |
The time-varying sentiment index. The index is based on the word/term counts and is computed as: tv_index = (pos - neg)/(pos + neg).
suppressWarnings(RNGversion("3.5.0"))
set.seed(1)
data("stock_data")
data("news_data")
y = as.matrix(stock_data[, 2])
w = as.matrix(stock_data[, 3])
X = news_data[, 2:ncol(news_data)]
x = as.matrix(X)
grid_alphas = 0.05
cont_folds = TRUE
t = length(y)
optimal_alphas = optimal_alphas(x[1:(t - 1), ], w[1:(t - 1), ],
                                y[2:t], grid_alphas, TRUE, "gaussian")
tv_index <- tv_sentiment_index(x[1:(t - 1), ], w[1:(t - 1), ], y[2:t],
                               optimal_alphas[[1]], optimal_alphas[[2]],
                               x, "gaussian", 2)
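The formula tv_index = (pos - neg)/(pos + neg) can be illustrated with a toy example: given per-period counts of the selected terms and the signs of their coefficients, pos and neg are the total counts of positively and negatively signed terms. The helper tv_index_sketch and all values below are hypothetical.

```r
# Illustrative computation of tv_index = (pos - neg) / (pos + neg).
tv_index_sketch <- function(counts, coefs) {
  pos <- rowSums(counts[, coefs > 0, drop = FALSE])  # counts of positive terms
  neg <- rowSums(counts[, coefs < 0, drop = FALSE])  # counts of negative terms
  (pos - neg) / (pos + neg)
}

counts <- rbind(t1 = c(gain = 4, loss = 1),
                t2 = c(gain = 1, loss = 3))
coefs <- c(gain = 0.8, loss = -0.5)   # hypothetical signed coefficients
idx <- tv_index_sketch(counts, coefs)
idx   # t1: (4 - 1)/(4 + 1) = 0.6; t2: (1 - 3)/(1 + 3) = -0.5
```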
TV sentiment index using all positive and negative coefficients.
tv_sentiment_index_all_coefs(x, w, y, alpha, lambda, newx, family, scaled, k_mov_avg, type_mov_avg)
x |
A matrix of variables to be selected by shrinkage methods. |
w |
Optional argument: a matrix of variables that cannot be selected (no shrinkage). |
y |
the response variable. |
alpha |
the alpha required in glmnet. |
lambda |
the lambda required in glmnet. |
newx |
the matrix to which the selection will be applied. Useful for time series, when we need the observation at time t. |
family |
the glmnet family. |
scaled |
Set TRUE to scale the index and FALSE otherwise. |
k_mov_avg |
The moving average order. |
type_mov_avg |
The type of moving average. See movavg. |
A list with the net, positive, and negative sentiment indexes. The net time-varying sentiment index is based on the word/term counts and is computed as tv_index = (pos - neg)/(pos + neg). The positive sentiment index is computed as tv_index_pos = pos/(pos + neg), and the negative as tv_index_neg = neg/(pos + neg).
suppressWarnings(RNGversion("3.5.0"))
set.seed(1)
data("stock_data")
data("news_data")
y = as.matrix(stock_data[, 2])
w = as.matrix(stock_data[, 3])
X = news_data[, 2:ncol(news_data)]
x = as.matrix(X)
grid_alphas = 0.05
cont_folds = TRUE
t = length(y)
optimal_alphas = optimal_alphas(x = x[1:(t - 1), ], y = y[2:t],
                                grid_alphas = grid_alphas, cont_folds = TRUE,
                                family = "gaussian")
tv_idx = tv_sentiment_index_all_coefs(x = x[1:(t - 1), ], y = y[2:t],
                                      alpha = optimal_alphas[1],
                                      lambda = optimal_alphas[2], newx = x,
                                      scaled = TRUE, k_mov_avg = 4,
                                      type_mov_avg = "s")