Raw data, R code and analyses related to gene-to-metabolite analysis using Pearson's correlation coefficient from the research article 'Exploring L-tyrosine and L-DOPA biosynthesis in faba bean (Vicia faba L.)'

Sheehan, Hester; Geu-Flores, Fernando; Mancinotti, Davide; Escobar-Herrera, Leandro

doi:10.48436/kq7h6-tgf87

Published February 25, 2026 | Version v1

Dataset Open

1. TU Wien
2. University of Copenhagen
3. Aarhus University

Contributors

Researchers:

1. University of Copenhagen
2. Aarhus University

Dataset description:

These datasets and R code were used to implicate L-tyrosine oxidase candidates in faba bean (Vicia faba) through a gene-to-metabolite correlation analysis based on the Pearson correlation coefficient (PCC). The results of this analysis are described in the manuscript 'Exploring L-tyrosine and L-DOPA biosynthesis in faba bean (Vicia faba L.)'.

The main purpose of this dataset is to provide a record of the raw data and the analyses that underlie the results reported in the manuscript.

This dataset includes:

'R_code' directory:
- R code used to carry out the analysis (Code_PCC.R)
- LICENSE.txt
'Data' directory:
- Raw metabolite data and the processing of this data (Xia-et-al_Metabolite-data_PCC-analysis.xlsx)
  - See description of separate spreadsheet tabs in 'Context and methodology'
- Metabolite dataset used for correlation analysis (Metabolite_data.csv)
  - dataframe of metabolite intensities normalised to internal standard and dry weight; metabolite features x tissue samples
- Gene expression dataset used for correlation analysis (VC1_expressionmatrix.csv)
  - This expression data comes from Björnsdotter et al. (2021).
  - dataframe of gene expression (TPM); gene x tissue samples
- Results of the correlation analysis and selection of candidate genes (Xia-et-al_Results_PCC-analysis.xlsx)
  - See description of separate spreadsheet tabs in 'Context and methodology'
- LICENSE.txt

Context and methodology

Metabolite data generation and processing

We expanded the metabolomics dataset of Björnsdotter et al. (2021) with four extra tissues as well as a few additional samples of the original tissues. The new tissues/samples had been analyzed by LC-MS alongside the original ones (same method and run) but had not been subjected to data analysis due to the lack of corresponding RNAseq data. The full metabolomics dataset comprises the following tissues, all derived from field-grown V. faba plants: young leaves (4 samples), mature leaves (3 samples), stem (4 samples), flower (4 samples), roots (4 samples), whole seeds at early seed-filling stage (3 samples), pods at early seed-filling stage (4 samples), seed coats at mid-maturation stage (1 sample), embryo at mid-maturation stage (3 samples), pods at mid-maturation stage (2 samples), pods at drying stage (4 samples), and seed coats at drying stage (4 samples). All samples were run in technical duplicates.

We subjected the raw data to the same XCMS-based analysis pipeline described by Björnsdotter et al. (2021). Briefly, we used XCMS Online (v.3.7.1; Tautenhahn et al. 2012) to align chromatograms and to identify and quantify metabolite features (peak width between 5 and 20 s; signal-to-noise threshold of 6:1; tab 1 of Xia-et-al_Metabolite-data_PCC-analysis.xlsx). We then removed metabolic features from the mass calibrant (retention time < 0.5 min; tab 2 of Xia-et-al_Metabolite-data_PCC-analysis.xlsx) as well as those from blank samples (similar intensity between samples and blanks, p < 0.01 in Student’s t-test; tab 3 of Xia-et-al_Metabolite-data_PCC-analysis.xlsx). The intensity of the remaining metabolic features was then normalized to the dry weight of the samples and to the signal of the internal standard ([M+H]⁺ for caffeine; tab 4 of Xia-et-al_Metabolite-data_PCC-analysis.xlsx). Tab 5 of Xia-et-al_Metabolite-data_PCC-analysis.xlsx shows the final metabolomics dataset (corresponds to Metabolite_data.csv).
Please note that the metabolic feature identifiers (IDs) in our expanded dataset will not necessarily correspond to those of the corresponding feature in the previously published dataset of Björnsdotter et al. (2021).
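
The normalization step described above can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the authors' code (the deposited analysis is in Code_PCC.R and the spreadsheet tabs). The function name, the sample names, the intensity values, the dry-weight unit, and the exact form of the calculation (raw intensity divided by the caffeine signal and then by the dry weight) are all assumptions.

```python
def normalise_feature(raw_intensity, internal_standard_intensity, dry_weight_mg):
    """Scale one raw feature intensity by the internal-standard signal
    and by the sample dry weight (assumed form of the normalization)."""
    if internal_standard_intensity <= 0 or dry_weight_mg <= 0:
        raise ValueError("internal standard signal and dry weight must be positive")
    return raw_intensity / internal_standard_intensity / dry_weight_mg


# Hypothetical samples: caffeine [M+H]+ signal, dry weight, and raw feature intensities.
samples = {
    "young_leaf_1": {
        "caffeine": 2000.0,
        "dry_weight_mg": 10.0,
        "features": {"42_(+)": 50000.0, "143_(+)": 8000.0},
    },
    "root_1": {
        "caffeine": 1000.0,
        "dry_weight_mg": 5.0,
        "features": {"42_(+)": 5000.0, "143_(+)": 1000.0},
    },
}

# Normalised table: sample -> feature ID -> normalised intensity.
normalised = {
    name: {
        feature_id: normalise_feature(value, s["caffeine"], s["dry_weight_mg"])
        for feature_id, value in s["features"].items()
    }
    for name, s in samples.items()
}
```

Dividing by both quantities makes intensities comparable between samples that differ in extraction efficiency or in the amount of tissue analyzed.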

Correlation analysis

Prior to calculating correlation coefficients between genes and metabolites, the transcriptomics dataset from Björnsdotter et al. (2021; VC1_expressionmatrix.csv) was reduced by removing genes with low variability across samples (standard deviation, SD < 5). The cor function of R v.3.4.3 (R Core Team, 2023) was used to calculate the Pearson correlation coefficients for gene expression versus metabolite feature intensity (Metabolite_data.csv). The correlations were calculated on the basis of individual samples, except for whole seeds at the early seed-filling stage, where values across samples were averaged to give a single value for that tissue in each dataset.

To identify the metabolite features associated with L-DOPA, we analyzed a commercial standard using the same LC-MS method. By comparing m/z values, retention times, and peak shapes, we selected four metabolic features corresponding to L-DOPA [42_(+), 143_(+), 107_(+), 41_(+)] (tab 1 of Xia-et-al_Results_PCC-analysis.xlsx). Genes were then ranked by their average correlation coefficient across the four L-DOPA features (tab 2 of Xia-et-al_Results_PCC-analysis.xlsx). The top 250 genes were annotated, and ten of them were found to encode putative oxidase enzymes. This shortlist was further reduced to three candidates by examining the reactions that close homologues in other organisms have been shown or predicted to catalyze, and choosing the genes associated with the reactions most chemically similar to the conversion of L-tyrosine to L-DOPA.
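
The ranking step can be sketched as follows. This is an illustrative Python example rather than the authors' R code. The four feature IDs are those named in the text, whereas the gene names, the correlation values, and the helper function rank_genes are hypothetical.

```python
from statistics import mean

# The four L-DOPA feature IDs named in the text.
LDOPA_FEATURES = ["42_(+)", "143_(+)", "107_(+)", "41_(+)"]

# Hypothetical Pearson correlation coefficients: gene -> feature ID -> PCC.
pcc_table = {
    "gene_A": {"42_(+)": 0.95, "143_(+)": 0.90, "107_(+)": 0.92, "41_(+)": 0.91},
    "gene_B": {"42_(+)": 0.40, "143_(+)": 0.35, "107_(+)": 0.50, "41_(+)": 0.45},
    "gene_C": {"42_(+)": 0.97, "143_(+)": 0.96, "107_(+)": 0.94, "41_(+)": 0.93},
}


def rank_genes(table, features, top_n):
    """Return the top_n (gene, mean PCC) pairs, highest mean PCC first."""
    scores = {gene: mean(row[f] for f in features) for gene, row in table.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# The text describes annotating the top 250 genes; 2 is used here for the toy data.
top_genes = rank_genes(pcc_table, LDOPA_FEATURES, top_n=2)
```

Averaging over the four features favours genes that track all L-DOPA signals rather than a single feature.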

Technical details

To work with the .R file and the R datasets, it is necessary to use R: A Language and Environment for Statistical Computing (version 4.5.2; R Core Team, 2024).

References

Björnsdotter E, Nadzieja M, Chang W, et al., Geu-Flores F. 2021. VC1 catalyses a key step in the biosynthesis of vicine in faba bean. Nature Plants 7: 923–931. DOI: 10.1038/s41477-021-00950-w.

R Core Team. 2023. R: A Language and Environment for Statistical Computing. https://www.R-project.org/

Tautenhahn R, Patti GJ, Rinehart D, Siuzdak G. 2012. XCMS Online: a web-based platform to process untargeted metabolomic data. Analytical Chemistry 84: 5035–5039. DOI: 10.1021/ac300698c.

Additional funding

In addition to the funding sources listed below, the project was also supported by Guangzhou Elite (JY201722).

Files

Xia-et-al_2026.zip

Files (258.5 MiB)

Name	Size
Xia-et-al_2026.zip md5:485906769f1fb46f283723f448a89cc3	258.5 MiB

Additional details

Related works

Cites: Publication: 10.1038/s41477-021-00950-w (DOI); Software: https://www.R-project.org/ (URL); Publication: 10.1021/ac300698c (DOI)

Funding

FWF Austrian Science Fund
10.55776/ESP122
Novo Nordisk Foundation
NNF17OC0027744, NNF19OC0056580, and NNF22OC0075193
Danish National Research Foundation
2035-00056B, 2035-00038B
