Training data sets are given to systems initially to teach them to make correct responses. Test data sets are equivalent to the training sets but contain separate data and are used to verify the performance of the systems.
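The separation between training and test data can be sketched in a few lines. This is a minimal illustration only; the function name `split_records`, the 25% test fraction, and the integer stand-ins for patient records are all assumptions made for the example, not part of the original answer.

```python
import random

def split_records(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and split it into disjoint training and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is reproducible
    n_test = int(round(len(shuffled) * test_fraction))
    # The first n_test shuffled records are held out; the rest are used for training.
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(20))  # hypothetical stand-ins for 20 patient records
train, test = split_records(records)
```

The key property is that no record appears in both sets, so performance measured on `test` reflects data the system has not seen during training.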
Clinical laboratory databases consist of many discrete test results that have known reference ranges and critical values. Well-established patterns of these results are known to be related to important clinical conditions. Writing rules that detect and alert to these patterns is straightforward.
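A rule of this kind might look like the following sketch. The test names, the numeric limits in `LIMITS`, and the pattern checked by `check_pattern` are illustrative assumptions only; real reference ranges and critical values are laboratory-specific, and the pattern is not taken from the text.

```python
# Illustrative limits only; real reference ranges and critical values vary by laboratory.
LIMITS = {
    "potassium": {"low": 3.5, "high": 5.0, "critical_high": 6.0},   # mmol/L (assumed)
    "creatinine": {"low": 0.6, "high": 1.3, "critical_high": 5.0},  # mg/dL (assumed)
}

def flag(test, value):
    """Classify one discrete result against its reference range and critical value."""
    lim = LIMITS[test]
    if value >= lim["critical_high"]:
        return "critical"
    if value > lim["high"]:
        return "high"
    if value < lim["low"]:
        return "low"
    return "normal"

def check_pattern(results):
    """Hypothetical pattern rule: alert when potassium and creatinine are both above range."""
    flags = {test: flag(test, value) for test, value in results.items()}
    elevated = {"high", "critical"}
    if flags["potassium"] in elevated and flags["creatinine"] in elevated:
        return "ALERT: elevated potassium with elevated creatinine"
    return None
```

Because each result is discrete and each limit is known in advance, the rule reduces to a handful of comparisons, which is why such systems are comparatively easy to write and to inspect.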
A neural network is most appropriate, because there is no prior knowledge to allow selection of predictors, the relative weighting of predictors is unknown, a large data set of many discrete potential predictors is available, combinations of predictors may provide better discrimination than individual predictors, and the desired classification is binary (readmission likely or unlikely).
The Arden syntax is a standard language and format for representing the medical knowledge and algorithms required for making medical decisions. It is used in medical decision support systems.
Bayesian belief networks are inspectable, known probabilities are required, training data are not needed, and they can classify into multiple categories.
Neural networks are not inspectable, they do not need domain expertise or known probabilities, training data are required, and they are best for a binary classification ("yes" or "no").
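To make the "known probabilities" requirement of a Bayesian belief network concrete, here is a minimal sketch of the smallest possible network: one condition node with one test node depending on it. All three probabilities are hypothetical values chosen for the example, and the function name `posterior_given_positive` is likewise an assumption.

```python
def posterior_given_positive(prior, p_pos_given_cond, p_pos_given_no_cond):
    """Apply Bayes' theorem to a two-node network (condition -> test result)."""
    p_pos_and_cond = prior * p_pos_given_cond
    p_pos_and_no_cond = (1.0 - prior) * p_pos_given_no_cond
    return p_pos_and_cond / (p_pos_and_cond + p_pos_and_no_cond)

# Hypothetical inputs: 10% prior, 90% true-positive rate, 20% false-positive rate.
posterior = posterior_given_positive(0.10, 0.90, 0.20)
```

Every number that drives the result is visible and can be checked by a domain expert, and no training data are involved. With these assumed inputs the posterior works out to one third.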
In the weightings between the nodes or "neurons."
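The idea that what a network "knows" lives in its weights can be illustrated with a single artificial neuron trained by the classic perceptron rule. This is a deliberately tiny sketch, not a realistic clinical model; the logical AND task, the integer learning rate, and the 20-epoch limit are assumptions chosen to keep the example short.

```python
def predict(weights, bias, inputs):
    """Fire (return 1) when the weighted sum of the inputs exceeds zero."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0

def train_perceptron(samples, epochs=20, rate=1):
    """Adjust weights and bias after every misclassified training sample."""
    weights, bias = [0, 0], 0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias

# Training data for a binary ("yes"/"no") classification: logical AND of two inputs.
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(samples)
```

After training, the behavior of the neuron is determined entirely by `weights` and `bias`. Nothing in those numbers explains why a given classification was made, which is the sense in which such networks are not inspectable.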
7. Briefly state what an "entity-relationship" diagram is useful for.
An entity-relationship diagram is a way of illustrating the structure of a relational database in a simple format. It displays the primary "entities" (tables) in the database and the relationships that exist between the data elements in the tables. It is useful as a basis for discussion during database design and in describing existing databases.
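The structure that an entity-relationship diagram depicts can be written down directly as table definitions. The sketch below uses Python's built-in `sqlite3` module with a hypothetical three-entity laboratory schema (`patient`, `specimen`, `result`); the table names, column names, and sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (
    patient_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
-- One patient has many specimens.
CREATE TABLE specimen (
    specimen_id INTEGER PRIMARY KEY,
    patient_id INTEGER NOT NULL REFERENCES patient(patient_id),
    collected_on TEXT
);
-- One specimen has many results.
CREATE TABLE result (
    result_id INTEGER PRIMARY KEY,
    specimen_id INTEGER NOT NULL REFERENCES specimen(specimen_id),
    test_name TEXT NOT NULL,
    value REAL
);
""")
conn.execute("INSERT INTO patient VALUES (1, 'Patient A')")
conn.execute("INSERT INTO specimen VALUES (10, 1, '2020-01-01')")
conn.executemany(
    "INSERT INTO result VALUES (?, ?, ?, ?)",
    [(100, 10, "glucose", 95.0), (101, 10, "sodium", 140.0)],
)

# Follow the two relationships from patient through specimen to result.
rows = conn.execute("""
    SELECT p.name, r.test_name, r.value
    FROM patient p
    JOIN specimen s ON s.patient_id = p.patient_id
    JOIN result r ON r.specimen_id = s.specimen_id
    ORDER BY r.result_id
""").fetchall()
```

Each `REFERENCES` clause corresponds to one line drawn between two entities on the diagram, and each join in the query traverses one of those lines.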
In hypothesis testing, data mining is used to determine whether and under what conditions a proposed pattern exists in a large data set. In hypothesis generation, data mining is used to discover patterns in the data without prior knowledge of what kinds of patterns might exist.
A process measure is a piece of data that is closely related to an outcome, but is easier to measure or more available than the actual outcome data. Thus, it is convenient to use as a surrogate measure for the outcome. For example, the effect of a diabetes health education program can be evaluated by the number of eye examinations and the regular evaluation of glycosylated hemoglobin (i.e., good practices) rather than by assessing the actual long-term health of the diabetics.
Some of the most important and useful data in clinical data mining are derived from pathology services (anatomic pathology diagnoses and laboratory test results). In most places, pathologists manage the systems that contain these key data.
11. Define "association rules" and describe their use in exploratory data mining.
Association rules express the likelihood of co-occurrence of features or events in records in a database (e.g., if a patient has characteristics A and B, he or she has an 80% chance of having characteristic C). Data-mining software can automatically identify associations in large data sets. Although many associations are trivial, some indicate causative or "common cause" relationships. Changing associations over time may also provide useful information.
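The "80% chance" in the example above is what data-mining tools usually report as the confidence of a rule. The sketch below computes support and confidence for the rule "A and B implies C"; the ten records are fabricated purely so that the arithmetic reproduces the 80% figure, and the function name `rule_metrics` is an assumption.

```python
def rule_metrics(records, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(1 for r in records if antecedent <= r)
    n_both = sum(1 for r in records if (antecedent | consequent) <= r)
    support = n_both / len(records)  # fraction of all records matching the whole rule
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

# Hypothetical records: each set lists the characteristics present in one patient.
records = [
    {"A", "B", "C"}, {"A", "B", "C"}, {"A", "B", "C"}, {"A", "B", "C"}, {"A", "B"},
    {"A"}, {"B", "C"}, {"C"}, {"A", "C"}, {"B"},
]
support, confidence = rule_metrics(records, {"A", "B"}, {"C"})
```

Here five records contain both A and B, and four of those also contain C, giving a confidence of 0.8 and a support of 0.4. Recomputing the same metrics on records from different time periods is one simple way to watch for changing associations.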
The most common data sources include large local or regional administrative databases from hospitals, insurers, or government agencies. These databases contain very limited clinical information (usually ICD-9 codes), and thus it is difficult to meaningfully stratify patients by the severity of their illness, particular symptoms or test result characteristics, or the details of their therapy.