The preprint has been published in another venue.
DOI of the published version: https://doi.org/10.3389/frai.2026.1781552
Preprint / Version 1

On the interface between Linguistics, Computer Science and Psychiatry: analyzing textual key-factors affecting BERT-based classification of schizophrenia in social media texts


  • João Victor Miranda e Silva, Pontifícia Universidade Católica do Rio de Janeiro, https://orcid.org/0000-0002-0525-5307
    • Conceptualization
    • Data Curation
    • Formal Analysis
    • Investigation
    • Methodology
    • Project Administration
    • Writing – Original Draft Preparation
  • Cilene Rodrigues, Pontifícia Universidade Católica do Rio de Janeiro
    • Conceptualization
    • Formal Analysis
    • Investigation
    • Methodology
    • Supervision
    • Writing – Review & Editing
  • Emilio Ashton Vital Brazil IMPA Tech
    • Formal Analysis
    • Methodology
    • Supervision

DOI:

https://doi.org/10.1590/SciELOPreprints.14646

Keywords:

schizophrenia, language, data filtering, natural language processing

Abstract

This paper investigates language impairments in schizophrenia (SZ) by integrating insights from language-centered investigations with computational approaches. Using BERT-base-cased, a transformer-based model, it explores how linguistic markers of SZ can be identified through Natural Language Processing (NLP) techniques, with emphasis on improving performance reliability via dataset refinement and on approaching the interpretability of deep learning outputs via statistical analyses of thematic content. We report the fine-tuning of a BERT model for text classification of 31,278 Reddit posts (15,639 SZ, 15,639 controls). The experiment evaluated the model's capacity to distinguish language produced by individuals with SZ. The model achieved moderate performance (Accuracy = 0.6969; AUC = 0.78) and remained stable across hyperparameter configurations, indicating that foundation models such as BERT fit data easily and that further performance gains are therefore more likely to come from dataset refinement than from additional hyperparameter optimization. Three key factors affected the model's performance: text length, topic of discussion, and vocabulary choice. Correctly classified posts tended to be significantly longer (p < 0.001, M = 37.30), to focus on specific abstract topics (e.g., religion), and to contain more words related to mental conditions. These factors have also been reported in manual analyses of the impact of SZ on language. These findings contribute to the accuracy of computational models aimed at linguistic classification tasks and underscore the value of carefully curated datasets, while demonstrating the viability of NLP methods for profiling SZ language.
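The length effect reported in the abstract (correctly classified posts being significantly longer) can be probed with a simple two-sample comparison. A minimal sketch using only the Python standard library and hypothetical token counts (the paper's actual analysis uses the full 31,278-post dataset and, presumably, a full statistics package):

```python
import math
import statistics

def welch_t(a, b):
    """Welch's two-sample t statistic (unequal variances)."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    se = math.sqrt(va / len(a) + vb / len(b))
    return (ma - mb) / se

# Hypothetical per-post token counts, for illustration only.
correct = [52, 41, 38, 60, 45, 39, 48, 55]     # correctly classified posts
incorrect = [20, 18, 25, 22, 30, 19, 24, 21]   # misclassified posts

print(f"mean length (correct):   {statistics.fmean(correct):.2f} tokens")
print(f"mean length (incorrect): {statistics.fmean(incorrect):.2f} tokens")
print(f"Welch t = {welch_t(correct, incorrect):.2f}")
```

A positive t statistic here indicates the correctly classified group is longer on average; a real analysis would also compute degrees of freedom and a p-value (e.g., with `scipy.stats.ttest_ind(..., equal_var=False)`).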


Posted

31/12/2025

How to Cite

On the interface between Linguistics, Computer Science and Psychiatry: analyzing textual key-factors affecting BERT-based classification of schizophrenia in social media texts. (2025). In SciELO Preprints. https://doi.org/10.1590/SciELOPreprints.14646

Series

Linguistics, literature and arts


Data statement