Studies in Second Language Learning and Teaching Investigating individual differences with qualitative research methods: Results of a meta-analysis of leading applied linguistics journals

The aim of the present article is to provide a systematic review of qualitative studies in the leading journals of our field focusing on their distributional properties in the various journals as well as topic choice and selected quality control issues. In order to achieve this aim, we carried out a systematic review of research articles published in leading journals in our field, namely, Applied Linguistics, Language Learning, Language Teaching Research, Studies in Second Language Acquisition and Modern Language Journal between 2016 and 2020. Our sample contains 93 articles in which researchers employed qualitative research methods or mixed methods including a qualitative component. Our main results indicate that there is great variation among journals in terms of the number of qualitative studies. As for topic considerations, some traditional individual difference variables seem to have a dominant role, with cognitive processes involved in language acquisition gaining some ground as well. Concerning quality control issues, there could be room for improvement with regard to reporting the quality control measures, including the tools employed in the studies. Based on our results, we can conclude that a more systematic understanding of acceptable processes in the field of applied linguistics could increase not only the number of qualitative articles published but also their topical importance.


Introduction
Applied linguistics research has long been centered on individual differences (IDs) among learners and why and how these differences influence the learning processes and outcomes. Comparing the two editions of Zoltán Dörnyei's comprehensive monograph on individual difference research (Dörnyei, 2005; Dörnyei & Ryan, 2015), it becomes apparent that the qualitative approach, which "uses text as empirical material (instead of numbers), starts from the notion of the social construction of realities under study, and is interested in the perspectives of participants, in everyday practices and everyday knowledge referring to the issue under study" (Flick, 2018, p. 2), is gaining ground in the investigation of these variables. This is because researchers are moving away from large-scale data collection to more situated and contextualized studies that focus on actual differences between learners instead of providing often vaguely generalizable results that might not be relevant in many contexts and learning environments. Despite this apparent interest in qualitative methods, there are only a few comprehensive reviews of qualitative studies (see Chong & Plonsky, 2021a, 2021b), while both content and quality-related meta-analyses of quantitative research appear more often in leading journals (e.g., Plonsky & Derrick, 2016). Hence, the aim of this study is to provide a systematic review of qualitative studies in the leading journals of our field, that is, Applied Linguistics, Language Learning, Language Teaching Research, Studies in Second Language Acquisition, and Modern Language Journal. The importance of our research lies in its potential to inform future studies concerning possible quality control measures to be applied in qualitative investigations. In this article, we focus on multiple issues starting with the descriptive analysis of the distribution of the papers in these journals and the topics they cover. Next, we analyze what type of data collection methods and techniques researchers used. Finally, we touch upon quality control issues in connection with three of the data collection tools.

ID variables and their investigation
The investigation of IDs is a diverse field including an ever-increasing number of variables that researchers think relevant for students and teachers in various contexts. Some of the classic constructs include language learning aptitude that explains the rate of acquisition of language learners (Carroll & Sapon, 1959), language learning motivation that measures the amount of effort students are willing to invest in language learning (Dörnyei & Ushioda, 2011) or learners' age and its influence on learning processes and outcomes (e.g., Pfenninger & Singleton, 2021). In addition, language learning styles and strategies have traditionally been seen as important in the learning process, the former indicating the preferred cognitive styles used by students when processing information and completing tasks (Reid, 1995), while the latter include those conscious techniques that students might apply when learning or solving a task (O'Malley & Chamot, 1990). A further cognitive variable is language learning beliefs, which concern those dispositions that students or teachers hold in connection with the target language, its learning, or themselves (Dörnyei, 2005; Mori, 1999). It has also been acknowledged that emotions play an important role in shaping learning. One influential emotion, anxiety, has widely been researched (Horwitz et al., 1986), while the inclusion of further negative (e.g., shame, Teimouri, 2017) and positive (e.g., enjoyment, Dewaele & MacIntyre, 2016) emotions is a more recent advancement in our field. It has also been acknowledged that successful language learning is not possible without students taking responsibility for their own learning and regulating this process (Kormos & Csizér, 2014), which needs to be coupled with their willingness to communicate (MacIntyre et al., 1998).
The most recent additions to the list of ID variables include experience-related notions, such as the flow experience, that is, immersing oneself in the learning and forgetting about the passing of time (Csíkszentmihályi et al., 2005), or engagement, which is a complex construct, including a range of psychological events, a certain quality of interaction, and positive emotions (Shernoff, 2013). Although ID variables are mostly investigated individually, Ryan's (2019) recent call for the need to investigate such factors in concert in order to understand their interplay and the way they shape one another in complex ways should definitely be considered.
It is an undeniable fact that the investigation of ID variables has traditionally employed quantitative questionnaire studies that provide cross-sectional analyses of individuals, whose answers could be generalized to a wider population. While questionnaires are versatile and economical instruments to collect data, there are a number of possible critical considerations that show the limitations of such studies, which led to the use of qualitative investigations within the field of ID research (Dörnyei & Ryan, 2015). However, given the diversity of these variables and the fact that researchers usually tend to concentrate on a limited number of them, research strategies might differ for different ID variables. Examples of addressing such limitations come from subfields in which qualitative techniques have been employed to counter-balance certain shortcomings within the field. For instance, in the L2 motivation field, Ushioda (2001) started to employ qualitative methods to map the longitudinal changes in L2 motivation and its relation to language learning autonomy. The number of qualitative studies has steadily increased in this field, which has resulted in a more balanced approach to L2 motivational issues (Dörnyei & Ryan, 2015). A similar tendency can be observed in the investigation of language learning strategies, as Woodrow (2005) called our attention to the fact that in order to investigate learning strategies in a situated manner, more qualitative studies are needed. In fact, the special issue of System published in 2014 already contains quite a few qualitative studies. Another such example is provided by mapping learners' willingness to communicate (WTC). First, quantitative studies were used to operationalize and investigate this concept, but these later gave way to innovative qualitative studies in order to research the personal differences in WTC and its subtle temporal changes (MacIntyre & Legatto, 2011).
Another important issue that needs to be considered when looking at the methodological issues pertaining to the investigation of ID variables is the extent to which a given variable is integrated into the field. The problem is the fact that although second language acquisition theories have long included a number of cognitive variables, such as attention, noticing or awareness, in explaining learning processes, the empirical investigation of such variables as ID concepts has not drawn much attention (Robinson, 1995, 2012; Schmidt, 1990, 2010). The process-like characteristics of these variables would call for longitudinal and exploratory qualitative studies, which are not unheard of in our field but their meaningful combination is still a task ahead of us. Another issue that needs to be mentioned here is the investigation of emotions. Despite the fact that anxiety and its role in different aspects of learning has been researched in much detail (Horwitz et al., 1986; MacIntyre & Gardner, 1994), the efforts to include additional negative and positive emotions are recent. Csizér, Albert and Piniel (2021; see also Albert et al., in press) investigated the role of various negative and positive emotions such as, for example, pride, hope, curiosity, shame, boredom, and apathy. One of their most important results concerns the fact that the supreme role of anxiety cannot be proved in all contexts, and other negative emotions play equally strong roles. In addition, in comparative studies of positive and negative emotions, the role of the former seems to be more defining.
The next issue pertaining to ID research that should be considered concerns the trait- and state-like characteristics of ID factors. In the traditional operationalization of ID variables, there seems to be agreement that these are trait-like characteristics which are individually consistent over time (Dörnyei, 2005). However, drawing on Long (2014), for example, one cannot neglect the fact that many ID variables behave in a state-like manner and are susceptible to change due to contextual influences. In one of our recent theoretical overview papers, it was shown that there are still only a few investigations that focus on state-like characteristics of ID variables. As we argued, this could be best achieved in complex qualitative and mixed methods studies, by including multiple sources of data within the same project.
Despite the fact that the aim of any large-scale investigation is to obtain results that would be generalizable to the population from which participants were selected, Lowie and Verspoor (2019) convincingly show us that students scoring similarly on ID scales can still be strikingly different when the actual learning processes are also considered. In this sense, naming the field individual differences is really a misnomer as researchers up until very recently were more interested in the ways learners are similar to one another (Dörnyei, 2005; Dörnyei & Ryan, 2015). Therefore, when investigating ID variables, specific learning processes should also be included in empirical research, and qualitative designs lend themselves well to this purpose.
These considerations all point to the conclusion that qualitative investigations should be fully integrated into the investigation of ID variables, but this cannot be done without considering what quality control issues should be taken into consideration. As multiple issues should be tackled when designing and executing such studies, the next section provides an overview of quality control in qualitative studies.

Quality control in qualitative research
Assuring that the quality of qualitative research meets given standards so that the findings of the study can be trusted is an important issue, discussed in most textbooks dealing with research methods (see e.g., Creswell, 2002, 2018; Delamont, 2012; Dörnyei, 2007; Fraenkel et al., 2012; Heigham & Croker, 2009; Patton, 2015; Tracy, 2019). Nevertheless, establishing precisely what the basic ideals are that good qualitative research should adhere to and how those can be met in terms of specific actions is much less clear. The conceptual framework and the means to ensure good quality research seem to be more straightforward in quantitative studies: the overarching terms of validity and reliability and their related concepts and subcategories comprise the theoretical framework while different sampling and statistical procedures provide the means for implementing those ideals (Dörnyei, 2007). Although, according to Mirhosseini (2020), students or beginner qualitative researchers who were mainly trained in the positivist tradition might believe that the ideals of objectivity, reliability and generalizability can be extended to and applied in qualitative research, he claims that this approach goes against the very nature of qualitative enquiry as "qualitative research is not unbiased, replicable, and generalizable" (p. 178). Hence, in what follows, we provide a succinct overview of the most important, overall quality control issues, but, as in this study we mainly concentrate on the most often used data collection tools, the section ends with a discussion of tool-related quality assurance.
The initial approach to quality control in qualitative studies involves researchers trying to adhere to essentially quantitative principles of quality. This approach is apparent in Guba's (1981) famous work, in which he reinterpreted and relabeled several cornerstones of positivist ideals of good quality science. In order to ensure the trustworthiness of qualitative research, he came up with analogs to scientific understandings of conventional notions of internal validity (labelled credibility), external validity (called transferability), reliability (named dependability), and objectivity (termed neutrality). Credibility pertaining to the truth value of research means that respondents' perceptions are faithfully represented in the research, while transferability, reflecting the applicability of qualitative research, calls for detailed description of the research context based on which the reader can judge possible similarities to their own. Dependability addresses the problem of consistency by calling for the trackability of any changes that occurred in the course of the research, and, finally, since true objectivity seems to be impossible even in physics (the dual nature of light is one example), the aspect of neutrality is ensured by confirmability of the data produced.
What seems to be the recent trend is that authors working within the qualitative paradigm attempt to break with those ideals of good research that are rooted in positivism and create their own criteria organically from the philosophical underpinnings of the qualitative research paradigm (cf. Creswell, 2018; Mirhosseini, 2020; Rallis & Rossman, 2009; Tracy, 2019). However, the existence of the many different qualitative research traditions such as phenomenology, narrative enquiry, ethnography or conversation analysis, to mention just a few, makes the creation of a unified framework similar to what exists in the positivist research tradition nearly impossible. As Lazaraton (2003) points out, "the nature of the research cannot be separated from the methods used to carry it out, which are implicated in the criteria used to judge it" (p. 9). One solution under these circumstances could be creating specific quality criteria unique to each research tradition. Another could be opting for the other extreme, creating very broad criteria that could synthesize different practices across theoretical traditions and paradigms. This latter approach is represented by Tracy's (2010, 2019) eight big tent framework for high quality qualitative research. The eight big tent framework (Tracy, 2010, 2019) includes broad criteria, such as worthy topic, referring to the significance of the subject investigated, and significant contribution, addressing similar issues in connection with the research itself. Rich rigor emphasizes the care and effort taken to carry out the study in an appropriate manner based on the theoretical constructs throughout the whole data collection and analysis process, while sincerity highlights the transparency of these issues and self-reflexivity about any values and biases. Similarly to the meaning proposed by Guba (1981), credibility aims to ensure that the reality represented in the research is plausible or appears to be true, whereas resonance refers to the effect that a study or its written account has on the audience. One of the final two tents houses ethical issues, which, despite the fact that they are missing from Guba's (1981) original framework, feature increasingly prominently in more recent quality guidelines (Howitt, 2016; Mirhosseini, 2020; Rallis & Rossman, 2009). The other calls for meaningful coherence, which is a rather broad concept, and it reflects on the whole research process stating that qualitative studies should: "(a) achieve their stated purpose; (b) accomplish what they espouse to be about; (c) use methods and representation practices that partner well with espoused theories and paradigms; and (d) attentively interconnect literature reviewed with research foci, methods, and findings" (Tracy, 2010, p. 848).
Although the labels used for describing good quality research in qualitative studies clearly proliferate, there seems to be more common ground and agreement with regard to the steps or procedures that can be used to ensure that these ideals are met. Triangulation, thick description, member checks, peer debriefing, inter-coder reliability, audit trail, and ethical issues feature very frequently in the majority of works discussing quality control issues in qualitative research (see e.g., Creswell, 2018; Dörnyei, 2007; Guba, 1981; Mirhosseini, 2020; Patton, 2015; Rallis & Rossman, 2009; Tracy, 2019). Therefore, we would like to briefly explain what each of these concepts covers.
Triangulation is a concept that originates from navigational and land surveying techniques; it can be used to "determine a single point in space with the convergence of measurements taken from two other distinct points" (Rothbauer, 2008, p. 892). It is frequently used in social sciences, and usually four different types of triangulation are differentiated: triangulation of the methods of data collection, triangulation of data sources, investigator triangulation, and theory triangulation (Denzin, 1989). Although the original purpose of triangulation was the verification of the research findings by relying on data originating from different sources, it is now increasingly understood as a strategy that allows researchers to strengthen their findings and enrich their interpretations via the exploration of different perspectives, as the philosophical stance associated with qualitative research emphasizes uniqueness, which seems incompatible with the idea of establishing one objective truth.
Thick description is a procedure that can be used to provide rich detail about the different aspects of the research, which is needed for a more profound understanding of the context of the study and draws attention to the context-embedded nature of qualitative research. Member checks, or member reflections, as Tracy (2019) chooses to refer to this procedure, are needed to ensure that the participants' viewpoints are represented in the study, which can be achieved by sharing the research data with the participants and asking for their feedback on it. Both peer debriefing and inter-coder reliability mean involving fellow researchers in the research process either for the purpose of receiving overall feedback or for specifically ensuring the reliability of the coding process. Establishing an audit trail requires that different steps of the research process can be traced back later on, as these were documented throughout the study. Finally, careful consideration of ethical issues is expected to be present throughout the research process, and references to all of these need to appear to some extent in the write-up.
At an even more specific level, carrying out empirical research is not possible without data; therefore, regardless of whether the study is quantitative or qualitative, data need to be collected. In some cases, researchers work with naturalistic data, some sort of artefacts that exist independently of researchers' endeavors; thus, they only need to be located and not created for the purpose of the research. Such data in our field are usually understood as some kind of text: books, diaries, blogs, lesson plans and so on. In the majority of cases, however, data are collected specifically for the purpose of the study; in these cases data collection tools are employed, which can be interviews, tasks, and tests, to name just a few. Observations occupy a middle ground here since the event observed might be external to the research (e.g., English lessons in a high school); nevertheless, the presence of a researcher or even just a recording device might alter the phenomenon under scrutiny. Although there are some data collection tools that are more clearly linked with the qualitative research tradition and are referred to as qualitative data collection tools, such as observations or interviews (see e.g., Dörnyei, 2007), the method of data analysis chosen is also determinant: for example, interviews can be analyzed quantitatively as well using content analysis (Fraenkel et al., 2012).
Quality control considerations, which are present from the moment researchers start to plan their investigation to the point of write-up, are also relevant with regard to the use and the reported use (these are clearly different issues) of data collection tools. Although papers devoted to broader conceptual matters related to quality control considerations rarely address issues of this specificity, discussions aimed at the use of specific data collection tools might give us hints as to their proper application. Likewise, papers sharing guidelines on write-up might offer certain recommendations as to what needs to be reported or supplied in the form of supplementary materials (see e.g., Howitt, 2016).
With regard to interviews (Dörnyei, 2007; Fraenkel et al., 2012; Howitt, 2016; Patton, 2015; Richards, 2009), it seems useful to indicate what type of interview was used. Moreover, by sharing some details about the interview guide's content, the reader can be presented with a more informed picture about the topics discussed: including the interview guide in the appendix or some supplementary material, if space permits, would be a preferable solution here. Although interviews are expected to be recorded, information about this fact should also be given besides indicating the language of the interview, as the common assumption of conducting the interviews in the respondents' L1 might not always be fulfilled (Welch & Piekkari, 2006). Indicating the length of recordings or shedding light on the size of the resulting text corpus may give insights to the reader about the breadth and depth with which the topic might have been covered. Finally, in cases where several interviews are conducted, piloting the instrument is also an expected quality control step, as this could provide information about potential problems with the instrument before its application.
As regards observations (Dörnyei, 2007; Fraenkel et al., 2012; Patton, 2015), recording the phenomenon observed may or may not make sense depending on circumstances. Moreover, the recording of events can render traditionally used data collection tools of observation like field notes, researcher diaries or observation grids unnecessary, as the event can be replayed countless times making fine-grained analyses possible. Of course, in the case of complex events involving many participants, like a language class, recording everything that happens in its entirety is close to impossible, so a combination of recordings and field notes might be a sensible choice. Data collection tools used in observations range from unstructured field notes to structured observation grids. In the case of the latter, a description of the main points to be observed or providing access to the instrument itself besides details about piloting are expected as quality control steps. Indicating the length of observation or the size of the resulting text corpus in some form is also helpful for the reader in judging the volume of the data.
Despite their growing popularity as data collection tools, tasks are less established instruments than interviews or observations, which is evident from the fact that while research methodology handbooks all contain chapters on interviews and observation, tasks are hardly ever mentioned although they probably fit the broad category of elicited learner language (Mirhosseini, 2020). Nevertheless, recording and piloting as well as providing a brief description of content or even sample tasks in the text or supplementary materials seem to be justifiable expectations in connection with them as well.

Research questions
Since the number of qualitative research articles seems to be on the rise when it comes to investigating IDs in language learning, providing a systematic review of such articles that have appeared in leading journals of our field is timely. We have intentionally selected top-tier journals as they present the leading voices and often cited articles, thus creating and maintaining quality-control-related expectations. A systematic review allows offering insights into the distribution of qualitative and quantitative articles across the journals and the main topics investigated in them. It can also shed light on the most popular data collection instruments and those quality control steps that were applied in connection with them to ensure that they meet the high standards set by these journals. In order to achieve these aims, we formulated the following research questions:
1. What are the distributional characteristics of research studies on individual differences reported in top-tier journals of applied linguistics in recent years? (RQ1)
2. What are the main topics of qualitative studies on individual differences reported in top-tier journals of applied linguistics in recent years? (RQ2)
3. What data collection tools are employed in qualitative studies on individual differences reported in top-tier journals of applied linguistics in recent years? (RQ3)
4. What is reported in terms of quality control in connection with three frequently used data collection tools (interviews, observations, and tasks) in qualitative studies on individual differences published in top-tier journals of applied linguistics in recent years? (RQ4)

Criteria for inclusion: The journals
As we had planned to investigate quality-control issues, we decided to focus on top-tier journals in our study to identify selective and high-quality publishing policies. The selection of journals to be included in our analysis was based on the following process. First, the scientific journal ranking (SJR) of the journals in the category of "language and linguistics" was downloaded from the Scimago Institution's homepage (https://www.scimagojr.com/journalrank.php?category=1203). Subsequently, for the first 20 journals on the list, two further indices were looked up. The 2019 impact factor of the journals was copied from the 2020 Edition of the Journal Citation Reports® (JCR) published by Clarivate Analytics (https://www.annualreviews.org/page/librarians/impact-factors), and they were rank ordered based on this index as well. Finally, the so-called SNIP (source normalized impact per publication) score of the journals was also established based on the CWTS website (https://www.journalindicators.com/indicators), and a third rank order was prepared for the journals based on this. The three rank orders were then added up, creating a final order for the journals (see Table 1). Before creating the final list of journals, those not dealing with topics related to second or foreign language learning or teaching were eliminated from the list, along with those journals where all three indices described above were not available.
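We then arrived at the following list of journals: Applied Linguistics, Language Learning, Language Teaching Research, Studies in Second Language Acquisition, and Modern Language Journal.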
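The rank-sum step described above can be illustrated with a minimal sketch. It is not the procedure as actually carried out: the journal names and index values are hypothetical, the helper names `rank_descending` and `combined_order` are ours, journals lacking any index are dropped before ranking (an assumption about the order of the steps), and ties are not handled specially.

```python
def rank_descending(scores):
    """Map each journal to its rank (1 = highest score) for one index."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {journal: position for position, journal in enumerate(ordered, start=1)}


def combined_order(*indices):
    """Sum the per-index ranks and sort journals by that sum (lower is better)."""
    # Assumption: keep only journals for which every index is available.
    common = set(indices[0]).intersection(*indices[1:])
    rank_sums = {journal: 0 for journal in common}
    for scores in indices:
        available = {journal: scores[journal] for journal in common}
        for journal, rank in rank_descending(available).items():
            rank_sums[journal] += rank
    # Sort by rank sum, then by name so the result is deterministic.
    return sorted(rank_sums.items(), key=lambda item: (item[1], item[0]))


# Hypothetical values, not the figures used in the study.
sjr = {"Journal A": 3.1, "Journal B": 2.4, "Journal C": 1.9, "Journal D": 1.2}
impact_factor = {"Journal A": 3.5, "Journal B": 4.0, "Journal C": 2.2}
snip = {"Journal A": 2.8, "Journal B": 2.5, "Journal C": 3.0, "Journal D": 1.0}

final_order = combined_order(sjr, impact_factor, snip)
print(final_order)
```

In this toy example, Journal D is excluded because no impact factor is available for it, mirroring the elimination of journals for which not all three indices could be found.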

Criteria for inclusion: The articles
In this phase of our study, we independently selected articles from the journals by first including those that investigated issues in individual differences research between the years 2016 and 2020. We decided to cover a five-year span in order to have a fairly large number of relatively recent studies to analyze. We approached individual difference variables in the broadest sense and defined them for the purpose of the selection as any variable aiming to measure differences among learners that might impact learning processes or outcomes. Once we agreed on the final number of articles (N = 371), the articles were categorized into five groups: quantitative, mono- or multi-method qualitative, mixed methods with qualitative parts and finally studies using other methods (e.g., meta- or theoretical analysis). We included articles from special issues but excluded short communications. In the analysis, we worked with studies containing mono- or multi-method qualitative methods and mixed methods with qualitative parts. Thus, our final sample consisted of 93 research articles.

Coding and analysis
When coding the information in the selected articles, we employed a cyclical coding process. First, basic information about the selected articles was recorded, such as the author(s)' name, the title of the article and the year of publication. With regard to the publication date, we used the volumes from the journals' websites and checked volumes published between 2016 and 2020, disregarding information about online-first publications. In this round, we also recorded the main ID topic of the articles along with the data collection instruments applied in them (RQ1, RQ2). This enabled us to further divide our 93 qualitative or partially qualitative articles into the following three categories: (1) mono-method qualitative studies, (2) multi-method qualitative studies, and (3) mixed methods studies containing a qualitative part (RQ3). In the next round of coding, the data collection tools used in the selected qualitative articles were recorded. Finally, we examined quality control steps in connection with two popular qualitative instruments, interviews and observations, and a fairly frequently used third instrument, language tasks (RQ4). The coding in this case was based on the following five common categories: (1) mentioning audio or video recording, (2) reference to length or corpus size, (3) description of the content of the instrument used, (4) whether the instrument (or a sample of it) was provided either in the text or in the appendix, and (5) mention of any attempt at piloting, that is, trying out the instrument. In the case of observations, we also recorded whether a tool (e.g., field notes or an observation grid) was used for data collection as well; in this case, the description of content and piloting obviously referred to this instrument. In the case of interviews, we noted if the language of the interview was mentioned, while in the case of tasks we recorded whether the task was a receptive or productive one. For productive tasks we indicated the modality (oral or written) and took note of the task type as well. In order to enhance the quality of our study, both of us coded all the articles, results were compared and differences discussed and agreed upon (for the coding schemes applied in various phases of the research see Appendix).
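To make the five common coding categories and the comparison of two coders' results more concrete, the following sketch may be helpful. It is an illustration only: the class `InstrumentCoding`, its field names and the function `disagreements` are hypothetical labels of ours, and the two codings shown are invented rather than taken from the study's data.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class InstrumentCoding:
    """One coder's yes/no judgments for one instrument in one article."""
    recording_mentioned: bool      # (1) audio or video recording mentioned
    length_or_corpus_size: bool    # (2) length or corpus size reported
    content_described: bool        # (3) content of the instrument described
    instrument_provided: bool      # (4) instrument or sample provided
    piloting_mentioned: bool       # (5) piloting mentioned


def disagreements(coder_a, coder_b):
    """Return the names of the categories on which two coders differ."""
    return [
        field.name
        for field in fields(InstrumentCoding)
        if getattr(coder_a, field.name) != getattr(coder_b, field.name)
    ]


# Hypothetical codings of a single interview-based article.
first = InstrumentCoding(True, True, True, False, False)
second = InstrumentCoding(True, False, True, False, False)

to_discuss = disagreements(first, second)
print(to_discuss)
```

Any category returned by `disagreements` would correspond to a difference that the coders discuss until agreement is reached.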

Results and discussion
In this part of the article, we will present and discuss our research results according to our research questions. We start with the distributional characteristics of the articles in the sample. Next, the topical and methodological analyses are presented. Finally, we deal with the issues pertaining to quality control in the articles. These results are somewhat in contrast with the encouraging tendency to rely more and more often on qualitative studies in one particular ID field, that is, L2 motivation (Dörnyei & Ryan, 2015). Moreover, mixed methods studies clearly dominate our sample, as 47 out of the 93 articles, that is, just over 50%, were mixed methods studies containing a quantitative component as well. The remaining 46 articles were equally divided between monomethod qualitative (23 articles) and multi-method qualitative (23 articles) studies, suggesting that around 25% of all the studies included in our database relied on a single data source.

Main topics in qualitative studies and mixed methods studies
It is clear from Table 3 that there are a number of topics that are probably more likely to be targeted from a qualitative perspective or by using at least partially qualitative methods. These topics are motivation, various cognitive processes influencing language acquisition, identity, language learning experiences, beliefs about language learning, and strategies. Since motivation and strategies are constructs that have been extensively researched with the help of quantitative methods in the past, in their case the use of qualitative methods represents a novel approach with hopefully new insights. By contrast, identity, language learning experiences and beliefs are fairly complex issues whose investigation has involved qualitative methods for quite some time. The relatively large number of studies devoted to analyzing different cognitive processes, such as attention, noticing, explicit and implicit learning, and awareness, involved in language acquisition might represent the broadening of the research agenda through a closer examination of cognitive processes, where qualitative measures are typically used to explain findings derived from traditional, quantitative ones.
As Table 3 only contains the main topics identified, it is also important to point out that only a minority of the articles (N = 16) focused on more than one topic. Some examples of studies addressing multiple IDs include identities and experiences (Anderson, 2019; Brown, 2016), motivation, emotions and beliefs (e.g., Csizér & Kontra, 2020; Poupore, 2018) as well as perceptions, beliefs and emotions (e.g., Jung & Révész, 2018; Kormos & Préfontaine, 2017). It seems that the common perception that qualitative studies target multiple IDs within the same research design cannot be supported with these results.

Data collection tools used in qualitative and mixed methods studies
The 93 articles in our database used 223 data collection instruments in total (see Table 4), which indicates that the majority of the studies reported in the articles used more than one data collection tool. When we coded the different data collection tools used in qualitative and partially qualitative mixed methods studies, we took all data collection tools appearing in the articles into consideration. The reason for this decision is that sometimes it is quite hard to establish whether a certain data collection instrument was only employed to collect data for the qualitative part of the study or not, as is the case when biographical data are collected with the help of questionnaires and are then also used when interpreting data collected for the qualitative phase. This is the background against which our results should be interpreted. In light of the above, the relatively high percentage of questionnaires (15%), language tests (5%) and measurement data (5%) is perhaps not surprising. As expected, the largest proportion of data collection tools comprised interviews (25%) which, together with other data collection tools aimed at eliciting spoken or written language in response to some question or prompt, made up almost half of all the data collection tools (44%) found in the articles. A smaller proportion of data were collected with the help of observations, accounting for 16% of the instruments, while naturalistic written data were only collected in 6% of the studies in question. Language tasks, which often lead to numerically quantifiable measures, made up 9% of all the data collection instruments identified, which raises the question as to why this type of instrument is not discussed more prominently in publications concerned with conducting research in applied linguistics.

Quality control measures reported in connection with interviews, observations and language tasks
In terms of the six quality control measures we investigated in relation to the qualitative interview guides used in the studies (1 - recording, 2 - length, 3 - content, 4 - sample instrument, 5 - piloting, and 6 - language), we can report a mixed picture. As for the actual content of the interviews, we used three codes depending on the amount of information included in the article, appendix or as supplementary information. Out of the 55 articles, 14 contained hardly any information on the content of the interview, while in 10 the instrument was included as an attachment (nine as an appendix, one in the supplementary information package) or detailed information was given in the article. The remaining articles contained some information about questions or prompts used in the study; thus, the reader could form an impressionistic view of the data collected. Concerning the piloting of the instrument, only four studies included some insight into this process either by stating that the instrument was tried out before use or by commenting on some elements of the pilot process. This finding is difficult to interpret because piloting instruments before use seems to be a generally advocated guideline (Dörnyei, 2007; Howitt, 2016; Richards, 2009). Another important issue is the language of the data collection. According to Welch and Piekkari (2006), this issue is rarely discussed explicitly in interview studies probably due to the fact that, based on common assumptions, interviews are conducted in the participants' L1. In the articles analyzed, we found a roughly equal number of studies that reported information on the language used (N = 27) and those that did not (N = 28). When the information was not given, we very often had the impression that the language of the data collection was in fact the L1 of the participants, which was probably the intention of the researchers.
However, for obvious reasons, we cannot be sure of this without the authors explicitly reporting this piece of information. In view of these considerations, when the focus of a piece of research is on some aspect of language learning, and participants are typically learners who can be considered bilingual to some extent, providing information about whether they were proficient enough in the language of the interview to be able to express their views clearly appears quite crucial. The last two categories coded concerned the length of the interview, which provides information about the volume of data collected, as well as whether the interviews were recorded. Based on our analysis, 30 articles contained satisfactory information about the lengths of the interviews and thus gave the reader an understanding of the volume of data collected, while 25 articles reported no such information. As for data recording, we did not find the information in 11 articles. Again, this does not mean that the data were not recorded but simply that the recording was not reported explicitly.
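The counts reported above for the 55 interview studies can be converted into reporting rates. The following is a minimal sketch of that arithmetic; the variable and function names are illustrative assumptions, and the count for recording (44) is derived as 55 minus the 11 articles in which the information was not found.

```python
# Number of articles using interviews, as reported in the text.
INTERVIEW_ARTICLES = 55

# Articles that explicitly reported each quality control measure.
# "recording" is derived: 55 articles minus the 11 without the information.
reported_counts = {
    "recording": INTERVIEW_ARTICLES - 11,
    "length": 30,
    "language": 27,
    "piloting": 4,
}


def reporting_rates(counts: dict[str, int], total: int) -> dict[str, float]:
    """Return the percentage of articles reporting each measure, to one decimal."""
    return {measure: round(100 * n / total, 1) for measure, n in counts.items()}


rates = reporting_rates(reported_counts, INTERVIEW_ARTICLES)
for measure, rate in rates.items():
    print(f"{measure}: {rate}%")
```

Expressed in this way, the contrast between the measures becomes easier to see: recording was reported in four out of five interview studies, whereas piloting was mentioned in fewer than one in ten.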
Data recording was a very popular option in the observation studies reviewed: 26 out of the 34 studies used either video or audio recording or a combination of the two. Field or observation notes, however, appeared to be less popular as they were only used in 13 articles, while structured observation instruments were employed even less frequently. They were only mentioned in four articles along with research journals used in two studies. In seven out of the eight articles where no audio or video recording was used, the application of some form of note taking was mentioned, leaving one study where no data recording was reported whatsoever. Quite understandably, providing information about the content of the data recording tool was only meaningful in the case of structured instruments; in three out of the four cases such information or the instrument itself was made available for the readers. Piloting would have only made sense in connection with these four instruments; however, no reference to piloting could be found in any of the articles. The fact that three out of the four instruments were based on either COLT (Spada & Fröhlich, 1995) or MOLT (Guilloteaux & Dörnyei, 2008) might be responsible for this finding, as the authors might have considered the piloting process unnecessary. The length of observation either per occasion or in total was recorded in 26 cases while no such information was provided in eight studies.
As regards tasks, it was possible to establish for all 20 articles whether they used productive/output tasks (N = 15), receptive/input tasks (N = 3) or a combination of the two (N = 2). Out of the 17 articles containing output tasks, three required written production, 13 oral production and one both. Out of the 14 cases when oral production tasks were used, either the audio or video recording of the data was clearly indicated in 11 cases, it could be inferred in two, and there was only one article where it was unclear whether oral performance was recorded or not. Task descriptions were provided in some detail in 18 out of the 20 articles; in those two where such information was not given, the authors reported using a variety of different tasks. In eight cases out of the 18, sample tasks or task instructions were provided either in the article itself or in some form of supplementary material. Reference to the piloting of the tasks, however, was only made in three cases. The length of individual tasks or of the whole corpus was provided in 11 cases. A label for the task type was provided in all 20 articles; nevertheless, the great variety of existing task types (cf. Ellis, 2018) makes such labels only moderately useful. The most popular task types were narratives, used in five cases, opinion gap tasks, used in three, and description, argumentation, and information gap tasks, each used in two cases.

Conclusion
Based on our systematic review of the 93 fully or partially qualitative research articles dealing with ID factors that appeared in top-tier journals of applied linguistics between 2016 and 2020, the following answers can be provided to our research questions. As regards the distribution of articles across these journals, certain tendencies can be identified. As far as the investigation of ID variables is concerned, Studies in Second Language Acquisition and Language Learning tend to publish numerous, mainly quantitative articles on this topic, while in Modern Language Journal and Language Teaching Research we found a larger proportion of studies taking at least a partially qualitative approach in this respect. With regard to the most popular topics, a few well-established ID variables such as motivation, beliefs and strategies seem to be the dominant themes in many articles, with research on emerging topics such as identity and learning experiences also gaining ground. We also identified a diverse group of articles dealing with information processing involved in language acquisition. These studies mainly adopted mixed methods designs, which might indicate a shift of focus in the field of research on IDs.
The most frequently used data collection tools reported in the articles turned out to be interviews, followed by observations, questionnaires and various language tasks. As the large number of data collection tools indicates, monomethod studies were relatively rare among the articles reviewed, while the heavy reliance on questionnaires can probably be ascribed to the overrepresentation of mixed methods studies in the corpus. Quality control issues were investigated in connection with the data collection instruments, and we found that in certain cases important quality control steps were either ignored or simply not reported. The most striking finding in this regard was the low number of articles reporting previous piloting of their data collection tools. As for the overall quality control issues presented in the literature review, we think that it is not only the quality of data collection instruments that should be considered but also overarching issues such as triangulation, thick description, member checks, peer debriefing, inter-coder reliability, audit trail, and ethical issues. Unfortunately, including all of these issues in this article would have been impossible.
Our systematic review has certain limitations primarily linked to the depth of analysis concerning quality control issues. Since we firmly believe that quality control should be present from the start of the design until the point of write-up, we feel that there are a number of other relevant issues that should have been checked in the papers; however, space limitations prevented us from carrying out a more in-depth review in this respect. Therefore, investigating a larger number of potential quality control issues in this corpus will be a task for future researchers. Another limitation is that we selected top-tier journals in order to include the highest quality research in this study. We understand that our sampling presents an inflated view of the quality of the studies in the field, but this was done on purpose to see what the highest-level publications have to offer in terms of quality control. Future studies, though, need to consider other journals, monographs as well as unpublished sources, such as PhD dissertations.

Coding schemes used in the three rounds of analysis
Coding scheme used in first round of data analysis