Skip to contents

Introduction

The BycatchEstimator tool runs a generic model-based bycatch estimation procedure,along with some common design-based bycatch estimation methods, after you first set up the data inputs and other specifications. The code can estimate both total bycatch, calculated by expanding a sample, such as an observer database, to total effort from logbooks or landings records, and an annual index of abundance, calculated only from the observer data.

The code runs best in RStudio. Before running the code for the first time, install the latest versions of R (R Core Team (2020)) and RStudio (RStudio Team (2020)). The following libraries are used: tidyverse, ggplot2, MASS, lme4, cplm, tweedie, DHARMa, tidyselect, MuMIn, gridExtra, pdftools, foreach, doParallel, reshape2, glmmTMB and quantreg (Wickham et al. (2019); Wickham (2016); Venables and Ripley (2002); Bates et al. (2015); Zhang (2013); Dunn and Smyth (2005); Hartig (2020); Henry and Wickham (2020); Barton (2020); Auguie (2017); Wickham and Pedersen (2019); Iannone, Cheng, and Schloerke (2020); Ooms (2020);Microsoft and Weston (2020);Corporation and Weston (2020); Wickham (2007); Brooks et al. (2017); Koenker (2021)). The output figures and tables are printed to a pdf file using R Markdown and the knitr library, which outputs a LaTex file; therefore, you must have a LaTex program installed, such as TinyTex (Xie (2019), Xie (2021), Xie (2015)). For a quick start guide with example data, see the GitHub page (https://github.com/ebabcock/BycatchEstimator). For more help with installation issues, see the Installation Guide https://ebabcock.github.io/BycatchEstimator/articles/InstallationGuide.html.

The R code estimates total bycatch from models as follows. First, mean catch per unit effort (CPUE) of observed sample units (e.g. trips or sets) is estimated from a linear model with predictor variables. The observation error models used are delta-lognormal (using either glm or glmmTMB), delta-gamma (from glm or glmmTMB), negative binomial (from either glm.nb in the MASS library or glmmTMB, nbinom1 and nbinom2) and Tweedie (from cpglm or glmmTMB). For comparison, simple normal and lognormal models (from lm or glmmTMB) are also available, although these methods are not expected to perform well with many zero bycatch observations. Binomial models from either glm or glmmTMB, are used in the delta-models, and can also be used independently.

Within each observation error model group, potential predictor variables are chosen based on the user’s choice of information criteria (AICc, AIC or BIC) (Barton (2020)). The user specifies a most complex and simplest model, and all intermediate models are considered. All the variables in the simplest and most complex model are interpretted as fixed effects. The user may also specify random effects if desired, and the random effects will be included in all models. This may be useful for including a trip effect in a set-by-set analysis, for example, or for including a Year:area interaction as a random effect when calculating indices. If random effects are included in the model, then glmmTMB is used for all model fitting. The user-specified simplest model will usually include year, and can also include, for example, stratification variables that are used in the observer program sampling design. The model with the lowest value of the information criterion, between the simplest and most commplex model, is chosen as best within each observation error group.

The best candidate models in each observation error group are then compared using 10-fold cross validation, if desired, to see which observation error model best predicts CPUE. Note that information criteria can not be used directly to compare, for example, delta-lognormal to negative binomial or Tweedie, because the observation error models have different y data. However, the models can be used to predict CPUE directly, and these predictions can be compared with cross validation. The best model according to cross validation is the one with the lowest root mean square error (RMSE) in the predicted CPUE and the mean error (ME) closest to zero, excluding from consideration models that do not fit well according to criteria described below. Note that this model selection using information criteria and cross validation is only intended as a guide. The user should also look at the information criteria across multiple models, as well as residuals and other diagnostics, and may want to choose a different model for bycatch estimation or abundance index calculation based on other criteria, such as the design of the observer sampling program.

For the best model in each observation error model group (according to the selected information criterion), the total bycatch is estimated by predicting the catch in all logbook trips (i.e., the whole fishery) from the fitted model and summing across trips. There is an option to only predict bycatch from unobserved effort (i.e. trips or sets not sampled by observers) and calculate total bycatch as the observed bycatch plus the predicted bycatch in unobserved effort. This only works if it is possible to match the observed trips to the logbook trips, and the amount of observed effort in a trip is always less than or equal to the amount of total effort. For fisheries with high observer coverage (e.g. 20% or more), predicting bycatch from only unobserved effort would be preferred, because treating the whole fishery as unobserved might overestimate the variance.

The catch in each trip is predicted directly by the negative binomial models. Tweedie and normal models predict CPUE, which is then multiplied by effort. Delta-lognormal and delta-gamma models have separate components for the probability of a positive CPUE and the CPUE, which must be multiplied together (with appropriate bias corrections) and multiplied by effort to get the total catch. Catch in each trip is summed across trips to get the total catch in each year. The variance of the prediction in each trip is calculated as the variance of the prediction interval, which is the variance of the estimated mean prediction plus the residual variance. Because the predicted catches in each logbook trip are dependent on linear model coefficients, which are the same across multiple trips, the trips are not independent; thus, the variances cannot be added without accounting for the covariance. The variance of the total catch in each year is thus calculated either using Monte Carlo simulation, or using a delta method. Users may also chose not to estimate variances for large logbook datasets where these methods will not work. In the case where only the unobserved bycatch is estimated, the observed bycatch is added to the predictions as a known constant with no variance.

For the Monte Carlo variance estimation method, we first draw random values of the linear model coefficients from a multivariate normal distribution with the mean and variance/covariance matrix estimated from the model. The predictions for each trip are then drawn for each draw of the parameters using the appropriate probability density function (e.g. Tweedie, negative binomial) with additional parameters (e.g. residual variance, negative binomial dispersion, Tweedie power and phi) estimated by the model. Trips are then summed for each year (adding the observed catch if necessary) within each draw, and the mean, standard error, and quantiles (e.g. 2.5% and 97.5% for a 95% confidence interval) are then calculated across the Monte Carlo draws. An approximation of the total variance of the predicted bycatch can also be made using a delta method. The delta method approximates the variance of a function of a variable as the derivative of the function squared times the variance of the original variable. Thus, the variance of the prediction intervals in the original data scale is calculated by pre and post multiplying the derivative of the inverse link function to the variance covariance matrix of the predicted values.

\[\Sigma_p = J \Sigma_l J' \]

were \(\Sigma_l\) is the variance covariance matrix for the predicted trips in the stratum on the scale of the log link, and J is the matrix of derivatives. See the function MakePredictionsDeltaVar in bycatchFunctions.R for the details for each model type. This code is partly based on the method developed by https://stackoverflow.com/questions/39337862/linear-model-with-lm-how-to-get-prediction-variance-of-sum-of-predicted-value. The delta method is not available for delta-gamma, delta-lognormal, or Tweedie via the cplm library. For those error models, the simulation method will be used even if the delta method is selected with VarCalc.

If the logbook data is aggregated across multiple trips (e.g. by strata) the effort is allocated equally to all the trips in a row of the logbook data table for the purpose of simulating catches or estimating variances using the delta method. This allocation procedure is not needed to estimate the mean total bycatch, but it is necessary to estimate the variances correctly. When using aggregated effort, it is not yet possible to include the observed catches as known.

The model also calculations design based total bycatch estimates if requested, using either a stratified ratio estimator Rao (2000) or the design based delta-lognormal estimator of Pennington (1983). These total bycatch estimates are made at the the stratification variables defined by the user in the vector called designVars. To deal with strata that have no observations or few observations, the user may request pooling (designPooling=TRUE), and specify the minimum number of sample units needed to avoid pooling. The specifics of the pooling scheme are defined by inputs described below. If pooling, the total bycatch is estimated for the observer and logbook data specified by the pooling scheme for each stratum (i.e. combination of variables) and then allocated to the stratum according to the fraction of the logbook effort in the pooled data that is in the stratum of interest. Unpooled estimates are used for strata with sufficient data. Variances are also allocated to the stratum of interest based on the fraction of total effort. See Brown (2001) for details of one potential pooling scheme.

The model can estimate bycatch and abundance indices for multiple species or dispositions (e.g. dead discard, live release) from the same fishery simultaneously if they are all being estimated from the same data sets.

If a user requests an annual abundance index, this is calculated from the same models as were selected for bycatch estimation, with an option to specify the simplest model to be considered. Any random effects will be included in the index.

Data specification

This provides guidance and summarizes arguments of the function bycatchSetup

The observer data should be aggregated to the appropriate sample unit, either trips or sets. Effort must be in the same units in both data sets (e.g. sets or hook-hours). The logbook data may be aggregated to sample units, or it may be aggregated further, as long as it includes data on all stratification or predictor variables. For example, the data can be aggregated by year, region and season if those are the stratification variables, or it can be aggregated by trip. If any environmental variables, such as depth, are included, the logbook data probably has to be entered at the set level. The observer data should have columns for year and the other predictor variables, the observed effort and the observed bycatch or catch per trip of each species to be estimated. The logbook data must also have year and the other predictor variables, and the total effort in the same units (e.g. sets or hook-hours) as the observer data. There should also be a column that reports how many sample units (i.e. trips) are included in each row in the logbook data if the data are aggregated. This is needed to predict catches by trip for the variance calculations. If there are any NA values in any of the variables, those rows will be deleted from the data set. To include the observed catches as known in the totals, it is necessary to include a column for unsampled effort in the logbook data (which can be zero in completely sampled trips) and a column in both the logbook and observer data that can be used to match observed trips to logbook records.

Next, give the names of the variables in the observer and (if estimating bycatch) logbook data files. You must also specify the common and scientific names of the species being analyzed, the units of the bycatch estimates (e.g. kg, numbers), and the type of catch (e.g. retained catch, dead discards, live releases). If analyzing more than one species or disposition type, some of these inputs must be vectors with the same length as the number of species and/or disposition types. If calculating catch as the sum of observed catch and predicted catch from unobserved effort (includeObsCatch=TRUE) then you must have columns for both total effort and unsampled effort in the logbook data, and a column to match observed trips to the logbook trips.

Give the formulas for the most complex and simplest model to be considered. If the simplest model requires stratification variables other than year, summaries of the predicted bycatch at the level of these stratification variables will be printed to .csv files, but will not be plotted automatically. Abundance indices will be calculated including all the variables requested in indexVars, to allow for different indices for different stratification variables if desired (e.g. different spatial areas). You may also specify which predictor variables should be interpreted as factor (categorical). The named variables will be converted to factor before analysis. Note that year may be either a number or a factor. If year is a number, then a model such as y ~ Year will estimate a linear trend across years. Polynomial regression may be a useful way to estimate more complex trends across years in data sets where not all years have enough data to estimate Year as a categorical fixed effect. This can be specified as, for example y ~ Year + I(Year^2) + I(Year^3). Any desired random effects can be entered as a character vector (e.g. randomEffects=“Year:area”). In the case that any delta models are used, there can be separate random effects for the binomial model (randomEffects) and abundance when present model (randomEffects2).

To use design-based estimators, specify them as a character vector. For example, designMethods =c(“Ratio”,“Delta”) would calculate both a stratified ratio estimator and a delta-lognormal estimator. The variables in designVar will be included in the stratified design. If pooling is requested, the strata will be pooled in the order of designVars. If pooling is requested, three additional vectors are needed to define how the pooling works: poolTypes, which says whether the variable should be pooled across adjacent levels (for Year only), all values, or a new variable (e.g. season rather than month); pooledVar which specifies the new variable for pooling for the last pooling method; and adjacentNum which specifies the number of adjacent years to include if the adjacent year method is used (e.g. 1 to include both the previous and following year). Each of these vectors must be the same length as designVars. You must also specify the minimum number of sample units in a stratum to require pooling.

The design-based estimators are calculated when bycatchSetup is run, so you may stop here if you only need the design based estimators. They are given in the columns called ratioMean, ratiaSE, deltaMean and deltaSE in the file labelled AnnualSummary and the file labelled StratumSummary. See details on model outputs below. See the function getPooling for details on how the pooling is set up for design-based estimation, and getDesignEstimates for the total bycatch calculations.

Model specification

This provides guidance and summarizes arguments of the function bycatchFit

First, specify the input data object generated by bycatchSetup. Then, specify which information criterion to use in narrowing down the predictor variables to use in each observation error model group. Model selection is done with dredge function in the MuMIn library(Barton (2020)). Note that the model outputs should be the same whether using glmmTMB or other functions. The negative binomial 2 in TMB is the same as the negative binomial in glm.nb. To get faster results, use TMB only for model fitting. The other functions are included for comparison.

Also, specify whether to do cross validation to choose between observation error models, and whether to use the dredge function to use information criteria to choose the best set of predictor variables for each fold in cross validation. If DredgeCrossValidation is FALSE, the same predictor variables will be used for each fold, as selected for the full data set. This saves time with large data sets. The variable ResidualsTest allows excluding from cross validation any model where the residuals have a P<0.01 for a Kolmogorov Smirnov test of whether the residuals are distributed as expected under the likelihood (testUniformity in DHARMa). This is useful for excluding poorly performing models. However, it may be too restrictive if none of the models perform well, and it only works for small datasets. Specify the confidence levels desired for the total bycatch calculations.

If the variable useParallel is true and your computer has multiple cores, the dredge function will be run in parallel. This greatly speeds up the calculations. If you have trouble getting this to work, set useParallel to FALSE.

Finally, if you have information on total bycatch in each year to validate your estimates, for example in a simulation study, fill out the arguments plotValidation, trueVals, trueCols. Otherwise, set these arguments to FALSE, NULL and NULL, respectively. To include validation data, trueVals should be set equal to a character string containing a filename (with complete path) for a file containing a column labelled “Year”, and columns with the total bycatch in each year, with names specified in trueCols. For multiple species, trueCols can be a vector giving the names of all the column for each species in order.

Reviewing the model setup

After running the function bycatchSetup, data summaries are output to directories named “output” followed by the specified run name. A pdf file with summaries for all species (DataSummary.pdf) is placed in the main output folder, and a csv file for each species is placed in an output file named for the species. Note that records with NA in the catch or effort variable are excluded from the analysis and not included in the estimated sample size. If there are any years with no data, or no positive observations, you may want to exclude those years from the analysis. The summary table also counts the number of outliers (defined as data points more than 8 standard deviations from the mean) because outliers cause problems with fitting in some models. For data-checking, this output file also includes columns with estimated bycatch and its variance using a simple unstratified ratio estimator by year (See BycatchFunctions.r for all the functions). You will get an error message if any of the variables in the specified models are not found in the data frame.

Main analysis

This section explains the use of the function bycatchFit to run the CPUE models, estimate total bycatch and/or abundance indices, and do cross validation if requested. You can ignore the warnings and information about the models as long as the code keeps running. Most of these are information about which variables are being tried and which models have converged, which will be summarized in the output files.

The model may be slow (many hours) if you have a large data set or are doing cross validation. The outputs all go into the directories labeled with the species names, and include a pdf of all results, and separate csv files for all the tables. When the models are fit, the code keeps track of whether the model converged correctly, or if not, where it went wrong. An output table called modelFail.csv summarizes the results, with a “-” for models that converged successfully, “data” for models that could not be fit due to insufficient data (no positive observations in some year prevents fitting the delta models), “fit” for models that failed to converge, “cv” for models that produced results with unreasonably high CVs (>10) in the annual catch predictions, and “resid” for models that failed the Kolmogorov Smirnoff test for having the correct distribution in the DHARMa library (Hartig (2020)), if residualTest is TRUE. Models that fail in any of these ways are discarded and not used in cross validation. If you get one of these errors for a model you want to use, you should check the data for missing combinations of predictor variables, extreme outliers, or years with too few positive observations.

Finding best approximating models

The first section of the code includes two loops that go through all the models in modelTry and find the best model using the specified information criterion. The first loop runs all the models that are applied to all data together (i.e. not delta models). The delta models are run separately in the second loop because they require a different data setup. There is a function called findBestModelFunc, which applies the dredge function from the MuMin library to find the best combination of the predictor variables according to the specified information criteria (MuMIn). The dredge function produces a table which includes all the information criteria for each model that was considered, as well as model weights calculated for the information criterion the user specified, which sum to one and indicate the degree of support for the model in the data. The best model will have the highest weight. But, in some cases other models with also have strong support, and should perhaps be considered, particularly if they are simpler. The current version of the code does not use MuMIn’s model averaging function, but this may be worth considering if several models have similar weights.

A binomial model will be tried if it was requested, or if either a delta-lognormal or delta-gamma model were requested, since delta models have a binomial component. The binomial models are fitted with a logit link using either the glm function or glmmTMB. The negative binomial is run using the glm.nb function from the MASS library (Venables and Ripley (2002)) or nbinom1 nbinom2 from the glmmTMB library(Brooks et al. (2017)). The glm.nb function is very similar to the nbinom2 method in glmTMB, but both are included for comparison. Both define the variance of the negative binomial as: \[\sigma^2=\mu+\mu^2/\theta\], where \(\theta\) is an estimated parameter. For nbinom1, the variance is defined as: \[\sigma^2= \mu(1+\alpha)\], where \(\alpha\) is an estimated parameter. This version of the negative binomial model, which is equivalent to a quasi-Poisson model, gives somewhat different results from the other negative binomial models.

The negative binomial predicts integer counts, so it is appropriate for predicting bycatch in numbers per trip for all the trips in the logbook data. To allow this model to also be used with catch or bycatch measured in weights, the code rounds the catches to integers before running this model. Check that this is appropriate for the units you are using. To predict CPUE it is necessary to include an offset in the model. We use a log link for all three negative binomial models, so that the model predicts: \[log(C_i)=b_0+ b_1x_1+offset(log(E_i))\] were \(C_i\) is the catch in trip \(i\) in the observer data, \(b_0+ b_1x_1\) is an example linear predictor with an intercept and a slope, and the offset is the log of the effort \(E_i\) in each trip. This is algebraically equivalent to modeling CPUE as a function of the same linear predictor (without the offset).

The Tweedie distribution is available using the cpglm function in the cplm library (Zhang (2013)) or the Tweedie family in glmmTMB (Brooks et al. (2017)) by including “Tweedie” or “TMBtweedie” in modelTry. The Tweedie is a generalized function that estimates a distribution similar to a gamma distribution, except that it allows extra probability mass at zero. It is thus appropriate for continuous data with extra zeros. It uses a log link, and, in addition to the linear predictor for the log(mean) it estimates an index parameter \(p\) and dispersion parameter \(\phi\) which together determine the shape of the distribution.

If all years have at least one positive observation, the code next runs the lognormal and gamma models if requested. For the delta lognormal model, the CPUE is log transformed, and the mean CPUE for positive observations is modeled with the lm function for positive data only. For the delta-gamma method, the log link is used to model the positive CPUE values, using the glm function. The lognormal and gamma method are similar, except that they are run on all the CPUE data, including zeros, after adding a constant of 0.1.

Model residuals and residual diagnostics

The next section loops through all the models again, and calculates the model residuals, residual diagnostics from the DHARMa library (Hartig (2020)), and, if requested, calculations of the total bycatch and an index of abundance.

For the best model (according to the information criterion) in each observation error model group, both ordinary residuals and DHARMa scaled residuals are plotted, and the DHARMa diagnostics are calculated. The DHARMa library uses simulation to generate scaled residuals based on the specified observation error model so that the results are more clearly interpretable than ordinary residuals for non-normal models. DHARMA draws random predicted values from the fitted model to generate an empirical predictive density for each data point and then calculates the fraction of the empirical density that is greater than the true data point. Values of 0.5 are expected, and values near 0 or 1 indicate a mismatch between the data and the model. Particularly for the binomial and negative binomial models, in which the ordinary residuals are not normally distributed, the DHARMa residuals are a better representation of whether the data are consistent with the assumed distribution. Both the regular residuals and the DHARMa residuals are appropriate for lognormal and gamma models, since they model continuous data which is expected to be approximately normal when transformed by the link function. The DHARMA residuals should be uniformly distributed, as indicated by the QQUniform plot and the Kolmogorov-Smirnov test of uniformity. If the DHARMA residuals show significant over-dispersion then the model is not appropriate. A summary of the the DHARMa residual diagnostics is added to a table called residualTab, and models that fail the Kolmogorov-Smirnov test for uniformity of the DHARMa scaled residuals are excluded from cross validation if residualTest is TRUE. Because the cpglm functions do not produce estimates of standard error, simulation is used to generate the DHARMa residuals if “Tweedie” is the model selected, using the rtweedie distribution from the tweedie library (Dunn and Smyth (2005)).

Bycatch and abundance index calculation

For each model group the best model, as selected by the information criteria, is used to predict the mean and variance of the total bycatch in each year with the exception of the binomial, for which the model predicts to the total number of positive trips. For the binomial model, the best model is used to predict probability of a positive observation in each logbook trip and these are summed to get the total number of positive trips in each year (and in each stratum if further stratification was requested). The number of positive trips is calculated because, for a very rare species that is never caught more than once in a trip, the number of positive trips would be a good estimate of total bycatch and many of the other models would fail to converge. For more common species, the estimates of total catch are more appropriate, so the results of the binomial model alone are not included in the cross validation for model comparison. An abundance index based on the binomial distribution is also calculated, if requested.

For all the other model types, the functions makePredictionSimVarBig, makePredictionDeltaVar and makePredictionNoVar (depending on the selected variance calculation method) calculate the total bycatch in each year, by first predicting the total catch in each trip in the logbook data and then summing over all trips in each stratum. For all the negative binomial models, the log(effort) from the logbook trips is used as an offset in the predictions, along with the values of all the predictor variables, so that the model can predict bycatch in each trip directly. For the Tweedie, normal, and gamma models the CPUE is predicted for each trip in the logbook data and must be multiplied by effort and summed across trips to get the annual summaries. For delta-gamma models, the predicted bycatch in each trip is the predicted probability of a positive observation from the binomial, times the predicted CPUE from the gamma model, times effort in the trip. For delta-lognormal models, the variance of the predicted CPUE is needed to bias-correct when converting the mean predicted log(CPUE) to mean predicted CPUE. The variance of the prediction interval for each trip is calculated as the variance of the estimated mean plus the residual variance, and this value is used in the bias correction. The total predicted CPUE is the predicted probability of a positive observation from the binomial times the predicted positive CPUE, and predicted catch is the predicted CPUE times effort. For all models, the predicted total catches are summed across trips to get the totals in each stratum. The variance are calculated using the Monte Carlo simulation method described above or using a delta method. The delta method variance is not available for the delta-lognormal, delta-gamma or Tweedie models using cplm, although it is available for Tweedie using glmmTMB.

If a user requests an annual abundance index, this is also calculated from the best model in each model group. The annual abundance index is calculated by setting all variables other than year, and any variables required to be included in the index (e.g. region or fleet) to a reference level, which is the mean for numerical variables or the most common value for categorical variables. The index is calculated by predicting the mean CPUE in each year, and its standard error is calculated as the standard error of the mean prediction. For delta-lognormal and delta-gamma models the standard error of the prediction is calculated from the means and standard errors of the binomial and positive catch models using the method of Lo, Jacobson, and Squire (1992).

Finally, this section combines all the bycatch estimates and index estimates into data frames, and prints out CSV files with the DHARMa residual diagnostics. These are P values for a Kolmogorov-Smirnoff test of whether the DHARMa residuals are uniformly distributed as expected, a test of over-dispersion, a test of zero-inflation (which is meaningless for the delta models, but helpful to see if the negative binomial and tweedie models adequately model the zeros) and a test of whether there are more outliers than expected.

Cross validation

The next part runs the cross validation if requested. Only observation error models that converged and produced reasonable results with the complete data set are used in cross validation. For example, if there were not enough positive observations in all years to estimate delta-lognormal and delta-gamma models, then they will not be included in the cross validation.

For cross validation, the observer data are randomly divided into 10 folds. Each fold is left out one at a time and the models are fit to the other 9 folds. The same procedure described above is used to find the best model within each observation error group using information criteria and the MuMIn library if DredgeCrossValidation is TRUE; otherwise, the same variables will be used in each fold as for the full data set to save time. The fitted model is used to predict the CPUE for the left out fold, and the root mean square error is calculated as:

\[RMSE =\frac{1}{n}\sum_{i=1}^{n}(\hat{y_i}-y_i)^2\] where n is the number of observed trips and y is the CPUE data in the left-out tenth of the observer data, and \(\hat{y}\) is the CPUE predicted from the model fitted to the other 9/10th of the observer data. The model with the lowest mean RMSE across the 10 folds is selected as the best model. Mean error (ME) is also calculated as an indicator of whether the model has any systematic bias.

\[ME =\frac{1}{n}\sum_{i=1}^{n}\hat{y_i}-y_i\]

Note that cross-validation is only done for one random draw of the data. To use cross-validation for model selection it would be appropriate to do more than one draw. This is only intended to diagnose large problems with model predictive ability.

Final results markdown and files

The last section of the code saves all the results to a file, and uses markdown to make output files that print all the graphics and tables.

The R markdown file prints a .pdf file with the figures and tables to the output directory for each species labeled with the species and disposition code (e.g. SimulatedSpeciesDeadDiscardResults.pdf). These results may be all that is needed. However, if you want to look more closely at a specific model result, whether or not it was selected by the information criteria and cross validation, all the outputs are printed to .csv files in the folders listed for each species. The total bycatch figures for each individual model show both the analytical estimate of the total bycatch, and the mean across the Monte Carlo draws. These should be very similar, but may be different in years with small sample sizes, due to sampling error.

The other output files are, where [model] refers to the name of the observation error model:

  1. DataSummary.csv. Contains the data summaries for each year, and the ratio and delta-lognormal design-based estimators, aggregated to years.

  2. StrataSummary.csv. The same as DataSummary, except aggregated to the variables in simpleModel. This is useful if you need to, for example, keep fleets or regions separate.

  3. catchPooling.csv If pooling was requested. Summarizes the amount of pooling in each stratum

  4. [model]AnnualSummary.csv. The estimated total bycatch, standard errors and confidence intervals for the model.

  5. [model]StatumSummary.csv. The estimated total bycatch, standard errors and confidence intervals for the model at the variables in simpleModel.

  6. [model]Index.csv. An annual abundance index.

  7. Residuals[model].pdf The same residuals figure that is in the combined output.

  8. residualDiagnostics from the DHARMa library.

Conclusions

Although this code automates much of the process of model selection, it is not recommended that the final model be used without looking closely at the outputs. The selected model must have good fits and should be reasonably consistent with data according to the DHARMa residuals. The results should appear reasonable and be around the same scale as the ratio estimator results. The model should not be overly complex. Also, look at the model selection table to see if other models are also supported by the data.

References

Auguie, Baptiste. 2017. GridExtra: Miscellaneous Functions for "Grid" Graphics. https://CRAN.R-project.org/package=gridExtra.

Barton, Kamil. 2020. MuMIn: Multi-Model Inference. https://CRAN.R-project.org/package=MuMIn.

Bates, Douglas, Martin Mächler, Ben Bolker, and Steve Walker. 2015. “Fitting Linear Mixed-Effects Models Using lme4.” Journal of Statistical Software 67 (1): 1–48. https://doi.org/10.18637/jss.v067.i01.

Brooks, Mollie E., Kasper Kristensen, Koen J. van Benthem, Arni Magnusson, Casper W. Berg, Anders Nielsen, Hans J. Skaug, Martin Maechler, and Benjamin M. Bolker. 2017. “glmmTMB Balances Speed and Flexibility Among Packages for Zero-Inflated Generalized Linear Mixed Modeling.” The R Journal 9 (2): 378–400. https://journal.r-project.org/archive/2017/RJ-2017-066/index.html.

Brown, C. A. 2001. “Revised Estimates of Bluefin Tuna Dead Discards by the U.s. Atlantic Pelagic Longline Fleet, 1992-1999.” Journal Article. ICCAT Collective Volume of Scientific Papers 52 (3): 1007–21.

Corporation, Microsoft, and Steve Weston. 2020. DoParallel: Foreach Parallel Adaptor for the ’Parallel’ Package. https://CRAN.R-project.org/package=doParallel.

Dunn, Peter K., and Gordon K. Smyth. 2005. “Series Evaluation of Tweedie Exponential Dispersion Models.” Statistics and Computing 15: 267–80.

Hartig, Florian. 2020. DHARMa: Residual Diagnostics for Hierarchical (Multi-Level / Mixed) Regression Models. https://CRAN.R-project.org/package=DHARMa.

Henry, Lionel, and Hadley Wickham. 2020. Tidyselect: Select from a Set of Strings. https://CRAN.R-project.org/package=tidyselect.

Iannone, Richard, Joe Cheng, and Barret Schloerke. 2020. Gt: Easily Create Presentation-Ready Display Tables. https://CRAN.R-project.org/package=gt.

Koenker, Roger. 2021. Quantreg: Quantile Regression. https://CRAN.R-project.org/package=quantreg.

Lo, N. C. H., L. D. Jacobson, and J. L. Squire. 1992. “Indexes of Relative Abundance from Fish Spotter Data Based on Delta-Lognormal Models.” Journal Article. Canadian Journal of Fisheries and Aquatic Sciences 49 (12): 2515–26. https://doi.org/Doi 10.1139/F92-278.

Microsoft, and Steve Weston. 2020. Foreach: Provides Foreach Looping Construct. https://CRAN.R-project.org/package=foreach.

Ooms, Jeroen. 2020. Pdftools: Text Extraction, Rendering and Converting of Pdf Documents. https://CRAN.R-project.org/package=pdftools.

Pennington, M. 1983. “Efficient Estimators of Abundance, for Fish and Plankton Surveys.” Journal Article. Biometrics 39 (1): 281–86. https://www.jstor.org/stable/2530830.

Rao, P. S. R. S. 2000. Sampling Methodologies with Applications. 1st Edition. Chapman; Hall/CRC.

R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

RStudio Team. 2020. RStudio: Integrated Development Environment for R. Boston, MA: RStudio, PBC. http://www.rstudio.com/.

Venables, W. N., and B. D. Ripley. 2002. Modern Applied Statistics with S. Fourth. New York: Springer. http://www.stats.ox.ac.uk/pub/MASS4.

Wickham, Hadley. 2007. “Reshaping Data with the reshape Package.” Journal of Statistical Software 21 (12): 1–20. http://www.jstatsoft.org/v21/i12/.

———. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.

Wickham, Hadley, and Thomas Lin Pedersen. 2019. Gtable: Arrange ’Grobs’ in Tables. https://CRAN.R-project.org/package=gtable.

Xie, Yihui. 2015. Dynamic Documents with R and Knitr. Second. Chapman; Hall/CRC.

———. 2019. “TinyTeX: A Lightweight, Cross-Platform, and Easy-to-Maintain Latex Distribution Based on Tex Live.” TUGboat, no. 1: 30–32. http://tug.org/TUGboat/Contents/contents40-1.html.

———. 2021. Tinytex: Helper Functions to Install and Maintain Tex Live, and Compile Latex Documents. https://github.com/yihui/tinytex.

Zhang, Yanwei. 2013. “Likelihood-Based and Bayesian Methods for Tweedie Compound Poisson Linear Mixed Models.”