Highlights
- Interrater agreement for both the 2019 and 2005 Bosniak classifications was comparable, showing fair to moderate agreement among resident raters.
- The Bosniak 2019 version has increased the proportion of downgraded masses compared to the 2005 version.
- Comprehensive training may be needed to enhance reliability and accuracy.
Abstract
Background
The Bosniak classification for cystic renal masses has undergone refinements since its inception. The 2019 version provides more objective criteria to enhance interrater agreement but needs validation. This study compares the interrater agreement of the 2005 and 2019 Bosniak classifications for cystic renal masses.
Methods
Forty cystic renal masses identified on computed tomography scans were selected, distributed equally among the five classes of the 2005 Bosniak classification. Eight radiology residents participated in 2 consecutive rating sessions using the 2005 and 2019 versions, respectively, with a 1-month wash-out period in between. Interrater reliability was assessed using Fleiss’ κ, and changes in cyst classes between the versions were assessed using the Wilcoxon signed-rank test.
Results
Fleiss’ κ values for interrater reliability were 0.354 (0.286–0.431) for 2005 and 0.373 (0.292–0.487) for 2019, indicating fair to moderate agreement. A significant decrease in cyst grades was noted using the 2019 version (Z = 3.49, r = 0.55, P < 0.001) among all cysts assessed by residents and only in complex cysts assessed by consultants (Z = 1.907, r = 0.275, P = 0.048).
Conclusion
Interrater agreement was similar for both classifications, ranging from fair to moderate. The 2019 version increased the proportion of masses downgraded to lower classes. Comprehensive training may enhance reliability and accuracy.
1. Introduction
Cystic renal masses are commonly encountered in clinical practice and comprise a structurally and prognostically heterogeneous group of kidney lesions. Because these lesions range from benign simple renal cysts to malignant complex masses, risk stratification is an integral part of the approach to any renal cyst [ ]. Accordingly, the Bosniak classification for cystic renal masses emerged in the 1980s and remains – to date – an indispensable tool in the urologist’s armamentarium for risk-stratifying and managing renal cysts [ ].
The Bosniak classification utilizes contrast-enhanced computed tomography (CT) findings to assign each renal cyst to 1 of 5 categories (I, II, II-F, III, or IV), each of which corresponds to a certain malignancy likelihood that accordingly informs management [ , ]. Understandably, erroneous categorization can be immensely detrimental, as wrongly labelling class I, II or II-F cysts as class III or IV can unnecessarily expose a patient to surgery, while the converse, although likely less common, may deprive a patient of potentially life-saving surgery. Misclassification is of particular concern in the Bosniak classification given its relatively subjective nature, and indeed, interrater variability is among the drawbacks of this widely used classification system [ ]. Other concerns pertaining to the true malignancy risk for each of the Bosniak categories and the potential unnecessary resection of benign or malignant-yet-indolent tumors have also been raised [ ].
Ever since its inception, the Bosniak classification has been repeatedly refined to enhance specificity and mitigate interrater variability [ ]. A systematic review conducted in 2017 revealed that the 2005 classification version remains flawed by large interreader variability as evidenced by an absolute disagreement ranging from 6% to 75%, largely for masses designated as Bosniak II, II-F and III [ ]. The latest refinement to the Bosniak classification in 2019 incorporates quantitative definitions for the qualitative criteria in an attempt to address the existing shortcomings of the 2005 version, and additionally introduces a formal magnetic resonance imaging (MRI)-based classification alongside the traditional CT-based classification [ ]. However, adoption of the 2019 version of the Bosniak classification is currently limited in routine clinical practice as validation studies are awaited. Preliminary studies have demonstrated enhanced interobserver agreement in comparison with the original version [ , ], while others have demonstrated similar overall interrater agreement [ , ]. Hence, the need for validation persists and further studies on the matter are needed. Bosniak classification criteria for the 2005 and 2019 versions are shown in Table 1 .
Class | Bosniak classification version 2005 | Bosniak classification version 2019 (CT) |
---|---|---|
I | Hairline-thin wall; water attenuation; no septa, calcifications, or solid components; nonenhancing | Well-defined, thin (≤ 2 mm) smooth wall; homogeneous simple fluid (−9 to 20 HU); no septa or calcifications; the wall may enhance |
II | Two types: 1. Few thin septa with or without perceived (not measurable) enhancement; fine calcification or a short segment of slightly thickened calcification in the wall or septa 2. Homogeneously high-attenuating masses ≤ 3 cm that are sharply marginated and do not enhance | Six types, all well-defined with thin (≤ 2 mm) smooth walls: 1. Cystic masses with thin (≤ 2 mm) and few (1–3) septa; septa and wall may enhance; may have calcification of any type † 2. Homogeneous hyperattenuating (≥ 70 HU) masses at noncontrast CT 3. Homogeneous nonenhancing masses > 20 HU at renal mass protocol CT, may have calcification of any type † 4. Homogeneous masses −9 to 20 HU at noncontrast CT 5. Homogeneous masses 21–30 HU at portal venous phase CT 6. Homogeneous low-attenuation masses that are too small to characterize |
II-F | Two types: 1. Minimally thickened or more than a few thin septa with or without perceived (not measurable) enhancement that may have thick or nodular calcification 2. Intrarenal nonenhancing hyperattenuating renal masses > 3 cm | Cystic masses with a smooth minimally thickened (3 mm) enhancing wall, or smooth minimal thickening (3 mm) of one or more enhancing septa, or many (≥ 4) smooth thin (≤ 2 mm) enhancing septa |
III | Thickened or irregular walls or septa with measurable enhancement | One or more enhancing thick (≥ 4 mm width) or enhancing irregular (displaying ≤ 3-mm obtusely margined convex protrusion[s]) walls or septa |
IV | Soft-tissue components (ie, nodule[s]) with measurable enhancement | One or more enhancing nodule(s) (≥ 4-mm convex protrusion with obtuse margins, or a convex protrusion of any size that has acute margins) |
† Renal masses that at CT have abundant thick or nodular calcifications; are hyperattenuating, homogeneous, non-enhancing, and larger than 3 cm; or are heterogeneous (including but not limited to many [four or more] non-enhancing septa or 3-mm or larger non-enhancing septa or wall) might best be visualized at MRI prior to the assignment of a Bosniak class to determine if there are occult enhancing elements that might affect classification.
Given the impact of classification on choice of management, it is crucial that categorization be optimized in terms of accuracy and agreement among raters. In this study, we aim to assess interrater agreement of the Bosniak classification version 2019 and compare it with that of the 2005 version using CT scan images of cystic renal masses.
2. Methodology
The study design and protocol were approved by the Institutional Review Board at Jordan University Hospital (approval No. 11/2023) and the scientific committee at The University of Jordan School of Medicine. An overview of the methodology is provided in Fig. 1.

2.1. Image selection and preparation
A retrospective search of our institutional database was performed for radiology reports of CT scans conducted between January 1, 2018 and February 2, 2023 using the search terms “Bosniak,” “Renal Cell Carcinoma,” or “RCC”; it yielded 439 images. Images were screened for eligibility. Inclusion criteria for image selection were as follows: 1) the CT scan report described a cystic renal mass designated as any of the 5 Bosniak 2005 classes; 2) the image followed the standard renal mass protocol; and 3) the patient was 18 years of age or older at the time of imaging. Images were excluded if: 1) image rating could have been influenced by artifactual or technical errors; 2) the report was unclear as to which mass it was describing when multiple cystic renal masses were present; or 3) the renal mass was part of a syndromic disease. Computed tomography scans at our institution were performed using a Siemens SOMATOM 64-slice dual-source CT scanner (Siemens Healthineers, Germany). The CT renal mass protocol comprised a triphasic assessment consisting of 1) a noncontrast scan of the kidneys; 2) a corticomedullary phase acquired after a 30–40 second delay following contrast injection; and 3) a nephrographic phase acquired after an 80–100 second delay following contrast injection. An excretory phase is not routinely included in our institutional renal mass CT protocol.
After removal of duplicates and application of inclusion and exclusion criteria, 312 images containing 339 cysts remained and were distributed among the Bosniak 2005 classes (assigned in the CT report for each cyst) as follows: 201 class I cysts, 66 class II cysts, 14 class II-F cysts, 8 class III cysts, and 50 class IV cysts.
Guided by the minimum required sample size dictated by the power analysis (described in the “Power Analysis” section), a total sample size of 40 renal cysts was chosen for this study such that all classes of the Bosniak 2005 classification were equally represented. Accordingly, 8 images were randomly selected from each class of the 2005 Bosniak classification as follows: each of the 312 images from the initial search and screen was assigned an identification (ID) number, and the list of IDs was added to a spreadsheet (Google Sheets) where they were grouped according to the Bosniak 2005 class of the cyst in each image into 5 lists/columns (1 per Bosniak class). Images containing multiple cysts of different classes were listed multiple times (once for each class of cysts found in that image). The IDs within each of the 5 Bosniak classes were shuffled, and then an online true random number generator using atmospheric noise was used to generate random numerical orders for the images to be chosen from each Bosniak 2005 class [ ].
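The per-class selection described above can be illustrated with a minimal Python sketch. It is not the procedure the authors ran: they used a spreadsheet and an online atmospheric-noise random number generator, whereas this sketch substitutes Python's seeded pseudo-random generator so that the draw is reproducible. The ID strings, the `sample_per_class` helper, and the seed value are all hypothetical; only the pool sizes (201, 66, 14, 8, 50) and the target of 8 per class come from the text.

```python
import random


def sample_per_class(ids_by_class, k, seed=None):
    """Randomly pick k image IDs from each Bosniak 2005 class."""
    rng = random.Random(seed)  # assumption: PRNG stands in for the atmospheric-noise RNG
    selection = {}
    for bosniak_class, ids in ids_by_class.items():
        if len(ids) < k:
            raise ValueError(f"class {bosniak_class} has only {len(ids)} candidates")
        shuffled = list(ids)  # copy so the caller's list is left untouched
        rng.shuffle(shuffled)
        selection[bosniak_class] = shuffled[:k]
    return selection


# Hypothetical ID lists sized like the screened pool reported in the text.
pool_sizes = {"I": 201, "II": 66, "II-F": 14, "III": 8, "IV": 50}
ids_by_class = {
    c: [f"{c}-{i:03d}" for i in range(1, n + 1)] for c, n in pool_sizes.items()
}

selected = sample_per_class(ids_by_class, k=8, seed=2023)
total_selected = sum(len(v) for v in selected.values())
print(total_selected)  # 5 classes x 8 images = 40
```

Because class III contains exactly 8 candidates, every class III cyst is necessarily selected. The sketch does not handle images listed under more than one class, which the text describes as being listed once per class.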
Anonymization of the images, blinding from the reports, and randomization of image sequence were ensured prior to both the Bosniak 2005 and 2019 rating sessions. The location of each cyst was documented for the corresponding image and provided to the raters during the rating sessions. For images containing multiple cysts, the cyst of interest was marked beforehand, and the session supervisor guided the rater on the cyst to be classified to resolve ambiguity.
2.2. Primary classification of selected cysts
For the purpose of image screening, sorting, and selection for rating, initial determination of cyst class was based on the class assigned in the CT report. Per our institution’s reporting protocol, the cyst class documented in the CT report follows the 2005 Bosniak classification version and reflects consensus between 2 radiologists: the resident radiologist who initially assessed the image and issued the report, and the attending radiologist who reviewed the image and signed the report. The 40 images that were ultimately selected for rating in our study were again reviewed by a consultant radiologist according to the 2005 Bosniak classification. Discrepancies between the radiologist’s assessment and the CT report were noted only among a few cysts originally assigned as class II-F and III in the CT report. The discrepancies were ultimately resolved by discussion with another radiologist and resulted in no change in the classification assigned in the CT reports of the 40 selected images.
2.3. Rater selection
A purposive sampling method was adopted for rater recruitment. Among 28 residents actively working in the Department of Radiology at our institution, 8 residents were ultimately invited to participate as raters including: four 4th-year radiology (i.e., senior) residents, two 3rd-year radiology residents, and two 2nd-year radiology residents. All raters consented to take part in the study, with 7 being female (87.5%) and 1 being male (12.5%).
2.4. Rating sessions
Each rater independently attended 2 rating sessions separated by a 1-month washout period to minimize recall bias. Raters classified the cysts according to the 2005 Bosniak classification version in the first session and the 2019 version in the second session. Each session consisted of 1) a brief 1-on-1 training period in which raters were instructed on the Bosniak classification criteria for that session's version, with examples, and 2) a rating period. During the rating period, each rater was asked to classify each of the 40 selected cysts into 1 of 5 Bosniak classes. The same 40 images were used for both sessions. Residents were provided with instructive atlases and tables for the 2005 and 2019 classification versions for reference during the first and second rating periods, respectively.
In all sessions, rating took place in controlled settings, where all raters utilized the same software, and all raters were allotted a roughly equal amount of time (60–90 minutes) for rating. Each session was supervised by at least 1 of the authors to ensure controlled settings and provide technical support. Raters’ responses were recorded on an online form by the attendant supervisor. Raters were also requested to report their confidence in each of their recorded responses on a scale from 1 (not confident at all) to 10 (very confident). In order to reduce bias and ensure impartiality, raters were blinded from each other as well as from any radiological or histopathological reports related to the images being classified.
2.5. Power analysis
We used R (version 2022.12.0+353) to estimate the minimum number of cysts/images required for classification by raters in this study. We used functions provided by the R software package, kappaSize (version 1.2), for sample size estimation in reliability studies. CI5Cats, a confidence interval approach function, was applied. Required arguments included the anticipated κ, the upper and lower limits of the desired confidence interval, the relative frequency of each category/class, the number of raters, and the type I error rate. The anticipated κ was set at 0.4 in addition to a desired confidence interval of 0.2 to 0.6. The values correspond to minimal-to-weak reliability according to McHugh [ ]. An equal number of images were to be selected for each of the 5 classes and thus a relative frequency of 0.2 was set for each class. A desired type I error rate of 0.05 was chosen. For the sake of the CI5Cats function, the number of raters was assumed to be 6, the maximum limit of the function. Given the recruitment of > 6 raters for this study, sample sizes from the CI5Cats function are overestimated but were regarded as the minimum required size. The function yielded a total of 20 images as the estimated sample size required for our study. We selected a total of 40 images to be assessed by 8 raters for our study.
2.6. Statistical analysis
All statistical analyses were conducted using R (version 4.2.2). Interrater reliability was measured using Fleiss’ κ with the kappam.fleiss function from the “irr” package (version 0.84.1). To measure the uncertainty of Fleiss’ κ, we computed the standard error of the statistic from 1,000 ordinary nonparametric bootstrap replicates using the boot function of the boot package (version 1.3.28.1). Then, we computed the 95% bias-corrected and accelerated confidence interval using the boot.ci function of the boot package (version 1.3.28.1).
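To make the reliability computation concrete, the following is a minimal Python sketch of Fleiss' κ with a bootstrap confidence interval. It is not the authors' analysis, which used the R functions named above. Two simplifications are assumed: the interval is a plain percentile interval rather than the bias-corrected and accelerated interval, and replicates in which κ is undefined (all ratings in one category) are dropped. The function names and the toy ratings are hypothetical.

```python
import math
import random
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa; ratings is a list of subjects, each a list of labels (one per rater)."""
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("every subject needs the same number (>= 2) of ratings")
    category_totals = Counter()
    agreement_sum = 0.0
    for subject in ratings:
        counts = Counter(subject)
        category_totals.update(counts)
        # Proportion of agreeing rater pairs for this subject.
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )
    p_observed = agreement_sum / n_subjects
    total = n_subjects * n_raters
    p_expected = sum((c / total) ** 2 for c in category_totals.values())
    if p_expected == 1.0:
        return float("nan")  # undefined when only one category is ever used
    return (p_observed - p_expected) / (1.0 - p_expected)


def bootstrap_percentile_ci(ratings, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for kappa, resampling subjects with replacement."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        resample = [rng.choice(ratings) for _ in ratings]
        k = fleiss_kappa(resample)
        if not math.isnan(k):
            stats.append(k)
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[max(int((1 - alpha / 2) * len(stats)) - 1, 0)]
    return lower, upper


# Hypothetical toy data: 6 cysts, each rated by 4 raters.
toy_ratings = [
    ["I", "I", "I", "II"],
    ["II", "II", "II-F", "II"],
    ["II-F", "III", "II-F", "II-F"],
    ["III", "III", "IV", "III"],
    ["IV", "IV", "IV", "IV"],
    ["I", "II", "I", "I"],
]
kappa = fleiss_kappa(toy_ratings)
ci_low, ci_high = bootstrap_percentile_ci(toy_ratings)
print(kappa, ci_low, ci_high)
```

Resampling is done at the level of cysts (subjects) rather than individual ratings, so each replicate keeps the full set of rater responses for every sampled cyst.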
To evaluate changes in classification among residents and determine whether the 2019 version resulted in downgrading or upgrading of renal cysts compared to the 2005 version, the average Bosniak class for all 8 raters for each image was calculated by converting each class into a corresponding score between 0 and 4 (I = 0, II = 1, II-F = 2, III = 3, IV = 4). The significance of the differences between both sessions was assessed using the Wilcoxon signed-rank test, implemented through the wilcox.test function in R. The effect size (r) was calculated using the formula

$$r = \frac{|Z|}{\sqrt{n}}$$
where Z is the Z-score corresponding to the Wilcoxon test result and n is the sample size. Effect sizes were interpreted as follows: r = 0.1 indicated a small effect, r = 0.3 indicated a medium effect, and r = 0.5 indicated a large effect [ ].
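As a worked illustration of the score conversion and the effect size formula, here is a short Python sketch. The helper names and the example ratings for a single cyst are hypothetical; the score mapping and the values Z = 3.49 and n = 40 are taken from the text.

```python
import math

# Score mapping stated in the text: I = 0, II = 1, II-F = 2, III = 3, IV = 4.
SCORE = {"I": 0, "II": 1, "II-F": 2, "III": 3, "IV": 4}


def mean_score(classes):
    """Average numeric Bosniak score across raters for one cyst."""
    return sum(SCORE[c] for c in classes) / len(classes)


def effect_size_r(z, n):
    """Effect size r = |Z| / sqrt(n) for a Wilcoxon signed-rank test."""
    return abs(z) / math.sqrt(n)


# Hypothetical ratings of one cyst by 8 raters.
example = ["II", "II", "II-F", "II", "III", "II-F", "II", "II"]
print(mean_score(example))  # (1+1+2+1+3+2+1+1) / 8 = 1.5

# Values reported for the resident comparison: Z = 3.49, n = 40 cysts.
print(round(effect_size_r(3.49, 40), 2))  # 0.55
```

With Z = 3.49 and n = 40, the formula gives r ≈ 0.55, matching the effect size reported for the resident ratings and falling in the large-effect range under the thresholds above.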
To evaluate changes in cyst class after applying the Bosniak 2019 version by consultant radiologists, the 40 selected images were reclassified according to the Bosniak 2019 version through consensus among 3 radiologists, including 2 consultant radiologists who participated in reviewing the primary 2005 Bosniak classification of selected images, and a senior radiology resident who did not participate as a rater in this study. Their consensus Bosniak 2019 classification results were compared to the original consensus Bosniak 2005 classification results to evaluate cyst class changes using the Wilcoxon signed-rank test as previously described.
3. Results
3.1. Clinical and histopathologic characteristics of selected images
A total of 40 images belonging to 40 different patients were selected for evaluation in both sessions, with 8 images per Bosniak category. Among these, 34 images belonged to male patients (85%), while the remaining 6 were from female patients (15%). The median age of the study population was 58 years (range: 21–78 years). Nine out of 40 patients had undergone surgery. Among those, 6 patients underwent surgery at our hospital, 4 of whom underwent radical nephrectomy and 2 of whom underwent partial nephrectomy. The remaining 3 patients underwent surgery at outside hospitals, and their histopathology results were obtained and documented in our institutional database.
Histopathology reports were available for the 9 patients who had undergone surgery, and all confirmed the diagnosis of renal cell carcinoma (RCC). For the remaining patients, either biopsy or surgery was not indicated, or the patient was lost to follow-up. A summary of patient characteristics is provided in Table 2.
