For answers to common questions about interpreting genetic associations with educational fields, see our frequently asked questions document in Supplementary Notes or online at https://www.thehastingscenter.org/genomic-findings-on-social-and-behavioral-outcomes-faqs/ and https://github.com/rosacheesman/Fields_genetics/wiki/Frequently-Asked-Questions-(FAQ).
This study complies with all relevant ethical regulations. The Norwegian Mother, Father and Child Cohort Study (MoBa) was approved by the Regional Committees for Medical and Health Research Ethics (protocol no. 2017/2205) and operates under the Norwegian Health Registry Act, with data handling managed by the University of Oslo under agreements with Statistics Norway. FinnGen received approval from the Coordinating Ethics Committee of the Hospital District of Helsinki and Uusimaa (protocol no. HUS/990/2017), with participants providing informed consent under the Finnish Biobank Act and multiple institutional permits from Finnish health authorities. Lifelines was approved by the University Medical Center Groningen (UMCG) Medical Ethical Committee (2007/152). All participants provided informed consent and data were processed in secure facilities compliant with national data protection regulations. See Supplementary Note for full details of ethical approvals.
Our main analyses were based on data from Finland and Norway, which are both social democratic welfare states that fit the 'Scandinavian model' of education for all. Compared to other high-income countries, economic inequality is low and access to education is less restricted by economic barriers. For example, Norway and Finland have free tuition, affordable loans and generous public subsidies for students. However, despite the reversal of the gender gap in educational attainment, gender-typical segregation into fields of study persists. Correspondingly, Nordic labor markets are among the most gender segregated.
We also analyzed a Dutch sample. The Netherlands has been defined as a conservative welfare state. Relative to the social democratic welfare states, social stratification in education is greater, partly due to early educational tracking and tuition fees.
FinnGen (https://www.finngen.fi/en), launched in 2017, is a public-private research project, combining genome and digital healthcare data on about 500,000 Finns. The nationwide research project aims to provide new medically and therapeutically relevant insight into human diseases. FinnGen is a pre-competitive partnership of Finnish biobanks and their background organizations (universities and university hospitals) and international pharmaceutical industry partners and the Finnish Biobank Cooperative (FINBB). All FinnGen partners are listed at https://www.finngen.fi/en/partners. The project utilizes data from the nationwide longitudinal health register collected since 1969 from every resident in Finland. Analyses were conducted on individuals aged >25 years with complete data for genome-wide genotyping and complete educational records.
We studied adults who participated in MoBa, a prospective population-based pregnancy cohort study conducted by the Norwegian Institute of Public Health. Pregnant women were recruited from across Norway from 1999 to 2009. In 41% of the pregnancies, the women consented to initial participation. Of fathers invited to participate, 82.9% consented. The total cohort includes approximately 114,500 children, 95,200 mothers and 75,200 fathers. Analyses were conducted on MoBa parents aged >25 years with complete data for genome-wide genotyping and complete administrative records linked to MoBa through the Norwegian national ID number system (n = 125,016).
Lifelines is a multidisciplinary, prospective, population-based cohort study examining, in a unique three-generation design, the health and health-related behaviors of 167,729 people living in the north of the Netherlands. It employs a broad range of investigative procedures in assessing the biomedical, sociodemographic, behavioral, physical and psychological factors that contribute to the health and disease of the general population, with a special focus on multimorbidity and complex genetics. Participants were sampled from the northern population of the Netherlands and the final sample encompasses about 10% of the region's population. Between 2006 and 2013, randomly selected general practitioners invited all their listed patients aged 25-49 years to participate in the study. We restricted our sample to genotyped Lifelines respondents who were ≥25 years (n = 63,927). PGIs and the first ten PCs of the genetic data were linked to an administrative data file containing educational fields ('HOOGSTEOPLTAB 2022, v1'), housed by Statistics Netherlands. Due to missingness in educational fields, in particular for the older generations, the total final sample was n = 36,501.
FinnGen release 11 contains genotype data for 473,681 individuals after quality control (QC). A total of 387,601 individuals were genotyped with a FinnGen Thermo Fisher Axiom customized array v2. Data on 86,080 additional individuals were derived from legacy collections. Further information is available at https://finngen.gitbook.io/finngen-handbook/finngen-data-specifics/red-library-data-individual-level-data/genotype-data/affymetrix-chip-and-its-design.
Blood samples were obtained from both parents during pregnancy and from mothers and children (umbilical cord) at birth. Quality-controlled genotyping array data for the full 207,569 unique MoBa participants were recently generated. Phasing and imputation were performed with IMPUTE4.1.2_r300.3, using the publicly available Haplotype Reference Consortium release 1.1 panel as a reference. To identify a subpopulation of European-associated ancestry, PCA was performed with 1,000 Genomes phase 1 after LD pruning. During post-imputation QC, the following thresholds were used for SNP removal: imputation quality (INFO) score ≤0.8; minor allele frequency (MAF) <1%; call rate <95%.
Blood samples were collected from Lifelines participants at the first assessment visit. Genotypes were released as part of two separate cohorts. The CytoSNP cohort was measured on the Illumina CytoSNP-12v2 array, measuring ~300,000 SNPs. The UMCG Genetics Lifelines Initiative (UGLI) cohort was measured on the Infinium Global Screening Array MultiEthnic Disease version, measuring ~700,000 SNPs. Quality-controlled data for both cohorts were released. The QC reports for CytoSNP and UGLI are available at http://wiki.lifelines.nl/doku.php?id=gwas and http://wiki.lifelines.nl/lib/exe/fetch.php?media=qc_report_ugli_r1.pdf, respectively. Before PGI construction, and in each cohort, we dropped multiallelic SNPs, SNPs with MAF < 1%, SNPs with an INFO score <0.8 or SNPs that were not in Hardy-Weinberg equilibrium (P < 10). We also dropped individuals with homozygosity rates of ±3 s.d. values (removing 655 respondents). We further dropped 1,289 respondents from the CytoSNP cohort who were also available in the UGLI cohort. After all these QC steps were completed, we merged the CytoSNP and UGLI cohorts into a single data file, using only SNPs that both cohorts had in common after QC (~6.4 million SNPs in total).
In all three cohorts, we extracted register data on broad educational field codes representing the field of education of each person's highest qualification completed by the year 2018. We extracted field codes at all highest qualification levels (that is, not just at university level).
To harmonize the data and facilitate future replication studies in other cohorts, we converted broad field codes from national-level coding systems to broad field codes as defined by the ISCED 2013 (https://uis.unesco.org/sites/default/files/documents/international-standard-classification-of-education-fields-of-education-and-training-2013-detailed-field-descriptions-2015-en.pdf).
In FinnGen, we used linked administrative data from Statistics Finland to define individuals' educational qualifications. The Finnish educational field records are described at https://www2.stat.fi/fi/luokitukset/koulutusala/. In MoBa, we used linked administrative data from the Norwegian Standard Classification of Education (NUS2000). The administrative data were of high quality and did not suffer from attrition. More information on the NUS coding and the conversion to ISCED is available at http://www.ssb.no/en/utdanning/norwegian-standard-classification-of-education. In the present study, missing data occur only for individuals whose fields do not map exactly on to the ISCED system, for instance because they are interdisciplinary (for example, 16,000 genotyped individuals in MoBa with qualifications termed 'interdisciplinary programs' and 'qualifications involving health and welfare'). Note that Dutch administrative records on education are incomplete, such that we had educational field measures available only for 56% of the original Lifelines sample (n = 36,373).
We created a binary variable for each of the ISCED broad field specialization codes, scoring individuals as 1 if they chose the field and 0 otherwise. The 0 category included people studying generic programs (not a specialization as such). This includes a wide range of qualifications, for example, unspecialized high school diploma and professional development skills training.
We also created harmonized educational attainment variables in all datasets. We took educational level information from the exact variable containing the field code and converted it to the ISCED EduYears (years of completed education) categories, as per international GWAS meta-analyses.
To test whether genetic associations reflected direct effects rather than confounding from familial and geographical factors, we created covariates from Norwegian population and education registers. We generated dummy variables for birthplace municipalities (216 codes representing Norway's lowest administrative level) to control for shared local environments and for parents' educational fields to control for familial transmission of education-specific skills, resources and networks. The genetic-geographical data linkage and structure has been described previously. Although Norway's free tuition and dispersed population with geographical variation in educational opportunities provide a useful context for these analyses, our approach may not fully capture temporally unstable or more localized influences.
To validate our GWA analysis results, we tested associations between PGIs for educational field choices and those for actual educational field choices in an independent cohort. PGIs were constructed using the GWAS summary statistics for each respective field (not corrected for EA), using SBayesR within GCTB software v2.05beta_Linux. SBayesR uses Bayesian shrinkage and explicitly models LD to estimate SNP effect sizes in the presence of correlated markers, using LD scores of individuals with European-associated ancestry estimated by the UK Biobank. To test for the direct effect of the PGI, we added the mid-parental PGI as a control variable. The parental PGI was constructed using a combination of observed and imputed genotypes (from parents and siblings), constructed using snipar. Snipar uses sibling data or the data of available parents to impute genotypes for unobserved parents. The parental PGI could be imputed for individuals who had at least one sibling or at least one parent who was also genotyped in Lifelines (n = 17,705).
We performed GWA analyses (GWASs) for ten dichotomous broad educational field phenotypes in MoBa and FinnGen, using models developed for resource-efficient analysis of case-control phenotypes in biobank-scale datasets. In MoBa, the FastGWA-GLMM approach (a generalized linear mixed model (GLMM) method for large-scale (GWASs) in GCTA software v1.91.7beta (--fastGWA-mlm-binary)) was used. This is a logistic regression model with the added complexity of using a sparse matrix to account for the dense genetic relatedness in MoBa without removing relatives. FastGWA-GLMM also uses the saddle point approximation method to account for inflation in test statistics due to case-control imbalance. In FinnGen, the binary option in REGENIE software was used (v2.2.4). REGENIE is a machine learning method that implements whole-genome ridge regression and uses Firth's logistic regression test to account for case-control imbalance. The two methods are similarly effective in terms of false-positive rates and statistical power.
To enable meta-analysis of MoBa and FinnGen GWAS summary statistics, we performed QC and harmonization. We removed variants with low MAF < 1%, poor imputation quality (INFO < 0.8), multiallelic variants and variants with ambiguous alleles (for example, alleles other than A, C, G or T) and we resolved strand and sign flips. The datasets were harmonized based on chr:pos from genome build 37 as an SNP identifier. Sample-size-weighted meta-analyses of MoBa and FinnGen were then performed using METAL software. We used the SCHEME SAMPLESIZE setting to convert effect sizes to z-scores before meta-analysis.
To identify independent genome-wide significant associations in the meta-analytical results, we performed clumping using standard parameters in FUMA software: leadP = 5 × 10, gwasP = 0.05, r2 = 0.6, r2_2 = 0.1, refpanel = 1KG/Phase3.
We calculated cohort-specific effective sample sizes following ref. and then summed these to obtain the sum of effective sample sizes for each field-of-study GWAS.
In MoBa, we also used FastGWA-GLMM for the following additional analyses: GWASs of ten fields with controls for educational attainment; GWASs of ten fields controlling for geographical and parental variables; and GWASs of ten fields in men and women separately.
In all GWAS (except sex-stratified) analyses, we controlled for sex, age, PCs of genetic ancestry (20 in MoBa, 10 in FinnGen) and batch identifiers.
To identify genetic signals associated with educational fields net of educational level, we used GWASs by subtraction in Genomic SEM (package within R-4.3.2) using GWAS summary statistics for educational fields and educational attainment. This genomic Cholesky's decomposition approach was first applied to create a GWAS of noncognitive skills by 'subtracting' the genetic component of cognitive performance from the association of each SNP with educational attainment. Here we 'subtracted' the genetic component of educational attainment from the association of each SNP with a given educational field. For each of the ten fields, we fitted an SNP-level Cholesky's model regressing two observed variables (the field, and educational attainment) on two latent variables (a field-specific factor and an educational attainment factor). The covariances between the latent variables were fixed to 0 so that all variance is explained by the latent factors. See Supplementary Fig. 21 and accompanying Supplementary Notes for full model specification details and a model illustration.
The overall contribution of common SNPs to field choices was estimated from GWAS summary statistics by LD score regression, via Genomic SEM v0.0.3f in R-4.3.2. On average, SNPs with higher LD scores (more correlations with other SNPs) are more likely to be correlated with a true causal variant. As such, when GWA test statistics (χ²) are regressed on LD scores, the slope provides an estimate of the heritability that can be explained by common SNPs. Heritability represents the percentage of variance explained by common genetic variants.
We estimated SNP-based heritabilities of educational fields after controlling for EA in two ways: through GWASs by subtraction and phenotypic adjustment. We note that it is not recommended to base heritability estimates on the former approach, because environmental variance is undefined. Nevertheless, we reported SNP heritability estimates based on GWASs by subtraction because they are relevant to interpreting downstream PCs and genetic correlation analyses.
To study how much the genetic associations with broad field choices are mediated through unobserved social factors in the geographical area in which individuals were born, we repeated the GWAS analyses in MoBa only, controlling for municipality codes and parental field codes as dummy variables and testing for attenuated heritability with LD score regression. We did this in an iterative fashion, first controlling for the most distal factor (birthplace municipality), then adding parental fields. Following previous methods applied in UK Biobank, we used Genomic SEM to compare the SNP heritability estimates with and without controls while accounting for dependence between estimates. We used the p.adjust function in R with the method 'fdr' to control the false discovery rate (the expected proportion of false discoveries among the rejected hypotheses) and considered results with an adjusted P < 0.05 as significant.
In Dutch Lifelines, we tested PGI associations with educational fields using logistic regression, controlling for ten PCs, age polynomials, sex and their interactions. We quantified variance explained using incremental McFadden's pseudo-R². This is defined as , where is the likelihood of the model and the likelihood of the model with only a fitted intercept. To estimate direct genetic effects, we added controls for imputed parental PGIs and compared within-family versus population-level effect sizes, with CIs from 1,000 bootstrap replications and Bonferroni's correction (P < 0.005) to correct for testing 10 hypotheses. We also tested PGI associations with spouse's or partner's educational field (n = 28,581; logit regression), defining spouses or partners as the first co-parent identified through Dutch population linkage, although these associations likely reflect broader assortative mating patterns rather than direct genetic effects on partner choice.
We explored the structure of the field GWASs (meta-analytic results) by calculating genetic correlations using LD score regression within Genomic SEM v0.0.3f in R-4.3.2. For pairs of traits, the product of the GWA z-scores at each SNP can be regressed on the LD score, providing an estimate of the genetic correlation between the two traits.
We explored the dimensionality of the genetic associations with the ten educational fields by applying PCA (using eigen() in R-4.3.2) to the standardized matrix of genetic correlations between fields (after GWAS by subtraction to remove EA genetic variance). To obtain GWAS summary statistics for PC1 and PC2, we then performed PCA GWASs following ref. (more details at https://annafurtjes.github.io/genomicPCA/). This approach adapts a GWA meta-analysis function designed for meta-analysis across multiple traits. Instead of weighting by SNP heritability when averaging across SNP effects, the standardized loadings on the PC of interest provide the weights. The function allows for sample overlap in the GWAS summary statistics by adjusting for the LD score regression intercept. The resulting GWAS summary statistics for the two independent educational field components had effective sample sizes of 10,413 and 7,353, respectively (following ref. ).
We chose to use PCA rather than confirmatory factor analysis (CFA) for studying the dimensionality of educational fields for several reasons. First, PCA involves fewer assumptions than CFA. Although CFA models latent factors assumed to represent real traits measured by field choices, PCA simply reduces the dimensionality of the data by finding axes that explain maximum variation. In addition, CFA might require the somewhat ad-hoc addition of crossloadings based on modification indices to achieve a good fit, whereas PCA does not involve model-fit considerations. Second, the limited number of possible fields (ten indicators) makes PCA more suitable. CFA is ideally designed for studying latent factors that can be measured by an extensive, potentially infinite, number of indicators, which is not the case here.
We estimated genetic correlations between latent educational fields factors and 96 human phenotypes using LD score regression. We used publicly available GWAS summary statistics that were well powered and covered a comprehensive range of domains of human variation. See Supplementary Table 21 for the GWAS study reference list with sample sizes.
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.