Raw data, R code and analyses related to gene-to-metabolite analysis using Pearson's correlation coefficient from the research article 'Exploring L-tyrosine and L-DOPA biosynthesis in faba bean (Vicia faba L.)'
Description
Dataset description:
These datasets and R code were used to implicate L-tyrosine oxidase candidates in faba bean (Vicia faba) using a gene-to-metabolite correlation analysis that uses the Pearson correlation coefficient (PCC). The results derived from this analysis are described in the manuscript 'Exploring L-tyrosine and L-DOPA biosynthesis in faba bean (Vicia faba L.)'.
The main purpose of this dataset is to provide a record of the raw data and the analyses reported in the manuscript.
This dataset includes:
- 'R_code' directory:
  - R code used to carry out the analysis (Code_PCC.R)
  - LICENSE.txt
- 'Data' directory:
  - Raw metabolite data and the processing of this data (Xia-et-al_Metabolite-data_PCC-analysis.xlsx)
    - See description of separate spreadsheet tabs in 'Context and methodology'
  - Metabolite dataset used for correlation analysis (Metabolite_data.csv)
    - Data frame of metabolite intensities normalised to internal standard and dry weight; metabolite features x tissue samples
  - Gene expression dataset used for correlation analysis (VC1_expressionmatrix.csv)
    - This expression data comes from Björnsdotter et al. (2021).
    - Data frame of gene expression (TPM); genes x tissue samples
  - Results of the correlation analysis and selection of candidate genes (Xia-et-al_Results_PCC-analysis.xlsx)
    - See description of separate spreadsheet tabs in 'Context and methodology'
  - LICENSE.txt
Context and methodology
Metabolite data generation and processing
We expanded the metabolomics dataset of Björnsdotter et al. (2021) by adding four extra tissues as well as a few additional samples of the original tissues. The new tissues/samples had been analyzed alongside the original ones (same LC-MS method and run) but had not been subjected to data analysis due to the lack of corresponding RNAseq data. The full metabolomics dataset comprises the following tissues, all derived from field-grown V. faba plants: young leaves (4 samples), mature leaves (3 samples), stem (4 samples), flower (4 samples), roots (4 samples), whole seeds at early seed-filling stage (3 samples), pods at early seed-filling stage (4 samples), seed coats at mid-maturation stage (1 sample), embryo at mid-maturation stage (3 samples), pods at mid-maturation stage (2 samples), pods at drying stage (4 samples), and seed coats at drying stage (4 samples). All samples were run in technical duplicates. We subjected the raw data to the same XCMS-based analysis pipeline described by Björnsdotter et al. (2021). Briefly, we used XCMS Online (v.3.7.1; Tautenhahn et al. 2012) to align chromatograms as well as identify and quantify metabolite features (peak width between 5 and 20 s; signal-to-noise ratio < 6:1; tab 1 of Xia-et-al_Metabolite-data_PCC-analysis.xlsx). We then removed metabolic features from the mass calibrant (retention time < 0.5 min; tab 2 of Xia-et-al_Metabolite-data_PCC-analysis.xlsx) as well as those from blank samples (similar intensity between samples and blanks, p < 0.01 in Student’s t-test; tab 3 of Xia-et-al_Metabolite-data_PCC-analysis.xlsx). The intensity of the remaining metabolic features was then normalized to the dry weight of the samples and to the signal of the internal standard ([M+H]+ for caffeine; tab 4 of Xia-et-al_Metabolite-data_PCC-analysis.xlsx). Tab 5 of Xia-et-al_Metabolite-data_PCC-analysis.xlsx shows the final metabolomics dataset (corresponds to Metabolite_data.csv).
Please note that the metabolic feature identifiers (IDs) in our expanded dataset will not necessarily correspond to those of the corresponding feature in the previously published dataset of Björnsdotter et al. (2021).
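The normalization step described above can be illustrated with a minimal R sketch. This is not the authors' code (the processing itself is documented in the spreadsheet tabs); the matrix, sample names, internal-standard signals, and dry weights below are invented purely for illustration, and the sketch assumes that normalization means dividing each sample's intensities by both its internal-standard signal and its dry weight.

```r
# Toy intensity matrix: metabolite features (rows) x samples (columns).
# All values are hypothetical and chosen only to make the arithmetic easy.
intensity <- matrix(
  c(1000, 2000, 1500,
     400,  800,  600),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("feature_A", "feature_B"), c("s1", "s2", "s3"))
)

# Hypothetical per-sample internal-standard signal (e.g. caffeine [M+H]+)
internal_standard <- c(s1 = 100, s2 = 200, s3 = 150)

# Hypothetical per-sample dry weight (mg)
dry_weight_mg <- c(s1 = 10, s2 = 10, s3 = 5)

# Assumption: each sample (column) is divided by the product of its
# internal-standard signal and its dry weight.
normalised <- sweep(intensity, 2, internal_standard * dry_weight_mg, "/")

print(normalised)
```

Using `sweep()` over the column margin keeps the feature x sample layout intact, which matches the orientation described for Metabolite_data.csv.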
Correlation analysis
Prior to calculating correlation coefficients between genes and metabolites, the transcriptomics dataset from Björnsdotter et al. (2021; VC1_expressionmatrix.csv) was reduced by removing genes with low variance (SD < 5). The cor function of R v.3.4.3 (R Core Team, 2023) was used to calculate the Pearson correlation coefficients for gene expression versus metabolite feature intensity (Metabolite_data.csv). The correlations were calculated on the basis of individual samples, except for whole seeds at early seed-filling stage, where values across samples were averaged to give a single value per dataset.
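As a rough illustration of the filtering and correlation steps, the following R sketch uses small invented matrices rather than the deposited files. The object names (`expr`, `metab`, `pcc`) are hypothetical, and the sketch assumes that both matrices have features in rows and identically ordered samples in columns; the averaging of the early seed-filling-stage seed samples is not shown. The actual analysis is in Code_PCC.R.

```r
set.seed(1)
n_samples <- 8
samples <- paste0("sample_", seq_len(n_samples))

# Hypothetical expression matrix (TPM): genes x samples
expr <- rbind(
  gene_low  = rep(10, n_samples) + rnorm(n_samples, sd = 0.5),  # nearly constant
  gene_up   = seq(10, 150, length.out = n_samples),
  gene_down = seq(150, 10, length.out = n_samples)
)
colnames(expr) <- samples

# Hypothetical metabolite matrix: features x samples (same sample order)
metab <- rbind(
  feature_1 = seq(1, 8, length.out = n_samples),
  feature_2 = c(5, 1, 7, 3, 8, 2, 6, 4)
)
colnames(metab) <- samples

# Remove low-variance genes (SD < 5 across samples)
gene_sd <- apply(expr, 1, sd)
expr_filtered <- expr[gene_sd >= 5, , drop = FALSE]

# cor() correlates columns, so both matrices are transposed to
# samples x features; the result is a genes x metabolite-features matrix.
pcc <- cor(t(expr_filtered), t(metab), method = "pearson")

print(round(pcc, 3))
```

Each entry of `pcc` is the Pearson correlation between one gene's expression profile and one metabolite feature's intensity profile across the shared samples.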
To identify the metabolite features associated with L-DOPA, we analyzed a commercial standard using the same LC-MS method. By comparing m/z ratios, retention times, and peak shapes, we selected four metabolic features corresponding to L-DOPA [42_(+), 143_(+), 107_(+), 41_(+)] (tab 1 of Xia-et-al_Results_PCC-analysis.xlsx). Genes were then ranked by their average correlation coefficient across the four L-DOPA features (tab 2 of Xia-et-al_Results_PCC-analysis.xlsx). The top 250 genes were annotated, and ten of them were found to encode putative oxidase enzymes. This shortlist was further reduced to three candidates by examining the reactions that close homologues in other organisms have been shown or predicted to catalyze, and choosing the genes whose homologues are associated with the reactions most chemically similar to the conversion of L-tyrosine to L-DOPA.
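The ranking step can be sketched in R as follows. The correlation values and gene names are invented, and it is an assumption that the feature IDs appear as column labels in exactly this form; the cutoff is set to 2 here only because the toy matrix has four genes, whereas the analysis described above used the top 250.

```r
# Hypothetical PCC matrix: genes (rows) x metabolite features (columns)
pcc <- matrix(
  c( 0.90,  0.80,  0.85,  0.95,
     0.10,  0.20,  0.00,  0.30,
     0.70,  0.60,  0.65,  0.75,
    -0.50, -0.40, -0.45, -0.55),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene_", 1:4),
                  c("42_(+)", "143_(+)", "107_(+)", "41_(+)"))
)

# The four features assigned to L-DOPA (label format is an assumption)
ldopa_features <- c("42_(+)", "143_(+)", "107_(+)", "41_(+)")

# Average PCC per gene across the L-DOPA features
mean_pcc <- rowMeans(pcc[, ldopa_features, drop = FALSE])

# Rank genes from highest to lowest average correlation
ranked <- sort(mean_pcc, decreasing = TRUE)

# Keep the top-ranked genes (250 in the described analysis; 2 in this toy case)
top_n <- 2
top_genes <- names(head(ranked, top_n))

print(ranked)
print(top_genes)
```

Annotation of the top-ranked genes and the selection of oxidase candidates were manual steps and are not represented in this sketch.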
Technical details
To work with the .R file and the R datasets, it is necessary to use R: A Language and Environment for Statistical Computing (version 4.5.2; R Core Team, 2024).
References
Björnsdotter E, Nadzieja M, Chang W, et al., Geu-Flores F. 2021. VC1 catalyses a key step in the biosynthesis of vicine in faba bean. Nature Plants 7: 923–931. DOI: 10.1038/s41477-021-00950-w.
R Core Team. 2023. R: A Language and Environment for Statistical Computing. https://www.R-project.org/
Tautenhahn R, Patti GJ, Rinehart D, Siuzdak G. 2012. XCMS Online: a web-based platform to process untargeted metabolomic data. Analytical Chemistry 84: 5035–5039. DOI: 10.1021/ac300698c.
Additional funding
In addition to the funding sources listed below, the project was also supported by Guangzhou Elite (JY201722).
Files
Xia-et-al_2026.zip
Additional details
Related works
- Cites
  - Publication: 10.1038/s41477-021-00950-w (DOI)
  - Software: https://www.R-project.org/ (URL)
  - Publication: 10.1021/ac300698c (DOI)
Funding
- FWF Austrian Science Fund
  - 10.55776/ESP122
- Novo Nordisk Foundation
  - NNF17OC0027744, NNF19OC0056580, and NNF22OC0075193
- Danish National Research Foundation
  - 2035-00056B, 2035-00038B