Reliability and Validity

EXPLORING RELIABILITY IN ACADEMIC ASSESSMENT

Written by Colin Phelan and Julie Wren, Graduate Assistants, UNI Office of Academic Assessment (2005-06)

Reliability is the degree to which an assessment tool produces stable and consistent results.

Types of Reliability

Test-retest reliability is a measure of reliability obtained by administering the same test twice over a period of time to a group of individuals. The scores from Time 1 and Time 2 can then be correlated in order to evaluate the test for stability over time.

Example: A test designed to assess student learning in psychology could be given to a group of students twice, with the second administration perhaps coming a week after the first. The obtained correlation coefficient would indicate the stability of the scores.
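The correlation step in this example can be sketched in a few lines of Python. This is only an illustration: the scores below are invented, and the helper name pearson is an assumption rather than anything prescribed by the authors.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical psychology test scores for six students, one week apart.
time1 = [72, 85, 90, 64, 78, 88]
time2 = [70, 87, 92, 60, 80, 85]

test_retest_r = pearson(time1, time2)
print(f"Test-retest correlation: {test_retest_r:.2f}")
```

A coefficient close to 1 would suggest that students kept roughly the same relative standing across the two administrations.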

Parallel forms reliability is a measure of reliability obtained by administering different versions of an assessment tool (both versions must contain items that probe the same construct, skill, knowledge base, etc.) to the same group of individuals. The scores from the two versions can then be correlated in order to evaluate the consistency of results across alternate versions.

Example: If you wanted to evaluate the reliability of a critical thinking assessment, you might create a large set of items that all pertain to critical thinking and then randomly split the questions up into two sets, which would represent the parallel forms.
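The random split described in this example might look like the following sketch. The item identifiers, the pool size of 20, and the function name split_into_parallel_forms are all hypothetical choices made for illustration.

```python
import random

def split_into_parallel_forms(item_ids, seed=None):
    """Randomly divide a pool of items into two forms of (near) equal size."""
    rng = random.Random(seed)   # a seed makes the split reproducible
    shuffled = list(item_ids)   # copy so the original pool is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical pool of 20 critical thinking items.
item_pool = [f"CT{i:02d}" for i in range(1, 21)]
form_a, form_b = split_into_parallel_forms(item_pool, seed=42)

print("Form A:", sorted(form_a))
print("Form B:", sorted(form_b))
```

Both forms would then be given to the same group, and the two sets of total scores correlated as described above.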

Inter-rater reliability is a measure of reliability used to assess the degree to which different judges or raters agree in their assessment decisions. Inter-rater reliability is useful because human observers will not necessarily interpret answers the same way; raters may disagree as to how well certain responses or material demonstrate knowledge of the construct or skill being assessed.

Example: Inter-rater reliability might be employed when different judges are evaluating the degree to which art portfolios meet certain standards. Inter-rater reliability is especially useful when judgments can be considered relatively subjective. Thus, the use of this type of reliability would probably be more likely when evaluating artwork as opposed to math problems.
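The text does not name a particular statistic for rater agreement. As one possible illustration, the sketch below computes simple percent agreement and Cohen's kappa (a common chance-corrected index for two raters) on invented portfolio ratings. The rating labels, judge names, and function names are assumptions.

```python
from collections import Counter

def percent_agreement(r1, r2):
    """Share of cases on which the two raters gave the same rating."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohens_kappa(r1, r2):
    """Agreement between two raters, corrected for chance agreement.

    Undefined (division by zero) if expected chance agreement equals 1.
    """
    n = len(r1)
    p_observed = percent_agreement(r1, r2)
    counts1, counts2 = Counter(r1), Counter(r2)
    # Chance agreement: product of each rater's marginal proportions per category.
    p_expected = sum(counts1[k] * counts2[k] for k in counts1) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical ratings of eight art portfolios by two judges.
judge_a = ["meets", "meets", "exceeds", "below", "meets", "exceeds", "below", "meets"]
judge_b = ["meets", "exceeds", "exceeds", "below", "meets", "meets", "below", "meets"]

print(f"Percent agreement: {percent_agreement(judge_a, judge_b):.2f}")
print(f"Cohen's kappa:     {cohens_kappa(judge_a, judge_b):.2f}")
```

Kappa is lower than raw agreement here because some matches would be expected even if the judges rated at random.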

Internal consistency reliability is a measure of reliability used to evaluate the degree to which different test items that probe the same construct produce similar results.

Average inter-item correlation is a subtype of internal consistency reliability. It is obtained by taking all of the items on a test that probe the same construct (e.g., reading comprehension), determining the correlation coefficient for each pair of items, and finally taking the average of all of these correlation coefficients. This final step yields the average inter-item correlation.
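The three steps above (collect the items, correlate every pair, average the coefficients) can be sketched as follows. The item scores are invented, and the function names are illustrative assumptions.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def average_inter_item_correlation(item_scores):
    """Mean of the Pearson correlations over every pair of items."""
    pairs = list(combinations(item_scores, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)

# Hypothetical scores on three reading comprehension items.
# Each row is one item; each column is one respondent.
items = [
    [4, 3, 5, 2, 4],
    [5, 3, 4, 2, 5],
    [4, 2, 5, 1, 4],
]

avg_r = average_inter_item_correlation(items)
print(f"Average inter-item correlation: {avg_r:.2f}")
```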

Split-half reliability is another subtype of internal consistency reliability. To obtain it, first “split in half” all items of a test that are intended to probe the same area of knowledge (e.g., World War II), forming two “sets” of items. The entire test is then administered to a group of individuals, the total score for each “set” is computed, and the split-half reliability is obtained by determining the correlation between the two total “set” scores.
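A minimal sketch of this procedure is shown below. It assumes an odd/even split of the items (the text does not say how the halves should be chosen) and uses invented right/wrong responses; all names are illustrative.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical responses to a six-item World War II test.
# Each row is one respondent; each column is one item (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 0],
    [1, 0, 1, 1, 0, 1],
]

# Assumed split: items 1, 3, 5 form one set; items 2, 4, 6 form the other.
set_one_totals = [sum(row[0::2]) for row in responses]
set_two_totals = [sum(row[1::2]) for row in responses]

split_half_r = pearson(set_one_totals, set_two_totals)
print(f"Split-half correlation: {split_half_r:.2f}")
```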

Validity refers to how well a test measures what it purports to measure.

Why is it necessary?

While reliability is necessary, it alone is not sufficient. For a test to be useful, it must also be valid. For example, if your scale is off by 5 lbs, it reads your weight every day with an excess of 5 lbs. The scale is reliable because it consistently reports the same weight every day, but it is not valid because it adds 5 lbs to your true weight. It is not a valid measure of your weight.
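The scale example can be put into numbers. In the sketch below, the true weight and the daily readings are invented. A small spread across readings stands in for reliability, and the average error stands in for the lack of validity.

```python
from statistics import mean, pstdev

# Hypothetical values: a person's true weight and five daily scale readings (lbs).
true_weight = 150.0
readings = [155.0, 155.1, 154.9, 155.0, 155.0]

spread = pstdev(readings)            # small spread: the scale is consistent
bias = mean(readings) - true_weight  # systematic error: the scale is off

print(f"Spread of readings: {spread:.2f} lbs")
print(f"Average error:      {bias:.2f} lbs")
```

The readings barely vary from day to day, yet every one of them is about 5 lbs too high.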

Types of Validity

1. Face Validity ascertains that the measure appears to be assessing the intended construct under study. Stakeholders can easily assess face validity. Although this is not a very “scientific” type of validity, it may be an essential component in enlisting the motivation of stakeholders. If the stakeholders do not believe the measure is an accurate assessment of the ability, they may become disengaged from the task.

Example: If a measure of art appreciation is created, all of the items should be related to the different components and types of art. If the questions concern historical time periods, with no reference to any artistic movement, stakeholders may not be motivated to give their best effort or invest in this measure because they do not believe it is a true assessment of art appreciation.

2. Construct Validity is used to ensure that the measure actually measures what it is intended to measure (i.e., the construct), and not other variables. Using a panel of “experts” familiar with the construct is one way in which this type of validity can be assessed.
