uni:8:kdd1:start
Differences
This shows you the differences between two versions of the page.
| Both sides previous revisionPrevious revisionNext revision | Previous revision | ||
| uni:8:kdd1:start [2015-04-21 09:40] – [Foo] skrupellos | uni:8:kdd1:start [2020-11-18 18:11] (current) – external edit 127.0.0.1 | ||
|---|---|---|---|
| Line 13: | Line 13: | ||
| * i) ii) Klassifikation & supervised | * i) ii) Klassifikation & supervised | ||
| * i) iii) Regression, unsupervised | * i) iii) Regression, unsupervised | ||
| + | |||
| + | ===== Übung 8 ===== | ||
| + | ==== Aufgabe 1 ==== | ||
| + | ^ Start ^ 2d. ^ 4d. ^ | ||
| + | | A | 1 | 6 | | ||
| + | | B | 1 | 5 | | 4/(6+5+4+5) | | ||
| + | | C | 1 | 5 | | ||
| + | | D | 1 | 4 | 2/(1+1) | 4/(6+5+5+5) | | ||
| + | | E | 4 | 5 | 2/(4+4) | 5/ | ||
| + | | F | 2 | 3 | | ||
| + | | G | 1 | 2 | | ||
| + | | H | 1 | 2 | | ||
| + | | I | 2 | 3 | | ||
| + | | J | 2 | 2 | | ||
| + | | K | 3 | 4 | | ||
| + | | L | 4 | 5 | | ||
| + | | M | 2 | 2 | | ||
| + | | N | 1 | 2 | | ||
| + | | O | 1 | 1 | | ||
| + | | P | 1 | 2 | | ||
| + | | Q | 1 | 2 | | ||
| + | | R | 1 | 1 | | ||
| + | | S | 1 | 2 | | ||
| + | | T | 2 | 2 | | ||
| + | |||
| + | v) | ||
| + | Aggregierte 4. Distanz für T: 2+2+1+2=7 (die nachsten Nachbarn sind O, Q, R, S) | ||
| + | |||
| + | i) k=2. E | ||
| + | $$LOF_2(E) = \frac{1}{2NN(E)} \cdot \sum_{o \in 2NN(E)} \frac{lrd_2(o)}{lrd_2(E)} = \frac{1}{2} \cdot \left( \frac{lrd_2(D)+lrd_2(F)}{lrd_2(E)}\right) = \frac{\frac{2}{2} + \frac{2}{3}}{2 \cdot \frac{2}{8}} = 3.333$$ | ||
| + | |||
| + | $$lrd_2(E) = \frac{\left| 2NN(E) \right| }{\sum_{o \in 2NN(E)} \text{reach-dist}_2(E, | ||
| + | |||
| + | $$rdist_2(E, | ||
| + | |||
| + | $$lrd_2(D) = \frac{2}{rdist_2(D, | ||
| + | |||
| + | i) k=4. E | ||
| + | $$ LOF_4(E) = \frac{1}{\left| 4NN(E) \right|}\sum_{o \in 4NN(E)} \frac{lrd_4(o)}{lrd_4(E)} = \frac{\frac{1}{5} \left(\frac{4}{20}+\frac{4}{20}+\frac{4}{21}+\frac{4}{10}+\frac{4}{10}}}{\frac{5}{23} = 1,279$$ | ||
| + | |||
| + | |||
| + | ===== Übung 10 ===== | ||
| + | ==== Aufgabe 1 ==== | ||
| + | * $K_i$: Klassifikator | ||
| + | * $C_i$: Anzahl richtig | ||
| + | |||
| + | ^ $K_i \rightarrow$ \\ $C_i$ ^ A ^ B ^ C ^ ^ | ||
| + | ^ A | 4 | 0 | 1 ^ 5 ^ | ||
| + | ^ B | 2 | 2 | 1 ^ 5 ^ | ||
| + | ^ C | 1 | 1 | 3 ^ 5 ^ | ||
| + | ^ ^ 7 ^ 3 ^ 5 ^ ^ | ||
| + | |||
| + | |||
| + | * Precission: $\frac{|TP|}{|TP| + |FP|}$ | ||
| + | * Recall: ?? | ||
| + | |||
| + | ^ ^ $|TP|$ ^ $|FP|$ ^ $|FN|$ ^ | ||
| + | | A | 4 | 3 | 1 | | ||
| + | | B | 2 | 1 | 3 | | ||
| + | | C | 3 | | | | ||
| + | |||
| + | * $|TP|$: Diagonale | ||
| + | * Zeile? | ||
| + | * Spalte? | ||
| + | |||
| + | ^ x ^ Precision(K, | ||
| + | | A | $\frac{4}{7}$ | $\frac{4}{5}$ | $\frac{2}{3}$ | | ||
| + | | B | $\frac{2}{3}$ | $\frac{2}{5}$ | $\frac{1}{2}$ | | ||
| + | | C | $\frac{3}{5}$ | $\frac{3}{5}$ | $\frac{3}{5}$ | | ||
| + | |||
| + | * Mittlere Precision: $\frac{1}{3}\left(\frac{4}{7} + \frac{2}{3} + \frac{3}{5}\right) = 0.6$ | ||
| + | ==== Aufgabe 2 ==== | ||
| + | === Leave-one-out === | ||
| + | Jeweils eins raus nehmen und es durch die dann vorherschende Mehrheit ersetzen | ||
| + | |||
| + | * < | ||
| + | * A < | ||
| + | * A A < | ||
| + | * A A A < | ||
| + | * A A A B < | ||
| + | * A A A B B < | ||
| + | |||
| + | => Fehlerrate 100% | ||
| + | |||
| + | === optimaler Klassifikator === | ||
| + | ???? | ||
| + | |||
| + | A A A B B B + ? | ||
| + | |||
| + | Fehlerrate 50% | ||
| + | |||
| + | === Bootstrap === | ||
| + | Zufälliges ziehen mit zurücklegen | ||
| + | |||
| + | $1, 2, \ldots, n$ | ||
| + | |||
| + | $P(x) = \frac{1}{n} \Rightleftarrow P(\neg x) = 1 - \frac{1}{n}$ | ||
| + | |||
| + | $P_n(\neg x) = \left(1 - \frac{1}{n}\right)^n = e^{-1} = 0.368$ | ||
| + | |||
| + | Fehlerrate = 0.632 * Fehlerrate_Test + 0.368 * Fehlerrate_Training | ||
| + | |||
| + | === 10 Fach Kreuzvalidierung === | ||
| + | |||
| + | <WRAP center round box 60%> | ||
| + | **Beispiel 3 fach Kreuzvalidierung** | ||
| + | |||
| + | Datensatz 3x unterschiedlich aufteilen: | ||
| + | |||
| + | | Training | Training | Test | | ||
| + | |||
| + | | Training | Test | Training | | ||
| + | |||
| + | | Test | Training | Training | | ||
| + | </ | ||
| + | |||
| + | === Aufgabe 3 === | ||
| + | A priori- und bedingte Wahrscheinlichktein: | ||
| + | * $P(Ski) = \frac{1}{2}$ | ||
| + | * $P(\neg Ski) = \frac{1}{2}$ | ||
| + | |||
| + | Klasse Wetter = W | ||
| + | * $P(W = Sonne \mid Ski) = \frac{1}{4}$ | ||
| + | * $P(W = Schnee \mid Ski) = \frac{2}{4}$ | ||
| + | * $P(W = Regen \mid Ski) = \frac{1}{4}$ | ||
| + | * $P(W = Sonne \mid \neg Ski) = \frac{1}{4}$ | ||
| + | * $P(W = Schnee \mid \neg Ski) = \frac{1}{4}$ | ||
| + | * $P(W = Regen \mid \neg Ski) = \frac{2}{4}$ | ||
| + | |||
| + | Klasse Schnee = S | ||
| + | * $P(S < 50 \mid Ski) = \frac{1}{4}$ | ||
| + | * $P(S >= 50 \mid Ski) = \frac{3}{4}$ | ||
| + | * $P(S < 50 \mid \neg Ski) = \frac{3}{4}$ | ||
| + | * $P(S >= 50 \mid \neg Ski) = \frac{1}{4}$ | ||
| + | |||
| + | |||
| + | === a) === | ||
| + | * $W = \text{Sonne}, | ||
| + | * $P(\text{Ski} \mid W=\text{Sonne} \wedge S >= 50) = \frac{P(W=\text{Sonne}, | ||
| + | * $P(\neg Ski \mid Sonne, >= 50) = \frac{\nicefrac{1}{4} \cdot \nicefrac{1}{4} \cdot \nicefrac{1}{2}}{P(Sonne, | ||
| + | * => $P(Ski | \ldots) = \frac{3}{4}$ | ||
| + | |||
| + | ^ ^ a priori ^ Wetter ^^^ Schnee ^^ | ||
| + | ^ ::: ^ ::: ^ Sonne ^ Schnee ^ Regen ^ >= 50 ^ < 50 ^ | ||
| + | | Ski | 1/2 | 1/4 | 2/4 | 2/4 | 1/4 | 3/4 | 1/4 | | ||
| + | | \neg Ski | 1/2 | 1/4 | 1/4 | 2/4 | 1/4 | 3/4 | | ||
| + | |||
| + | ==== Aufgabe 4 ==== | ||
| + | - Klassifikation (Die Klassen (Spam/Ham) stehen schon vorher fest) | ||
| + | - Clustering | ||
| + | - Clustering (Assoziationsregel / Wahrenkorbanalyse) | ||
| + | - Clustering | ||
| + | - Klassifikation | ||
| + | - Clustering | ||
| + | - Klassifikation | ||
| + | |||
| + | ===== Übung 11 ===== | ||
| + | ==== Aufgabe 1 ==== | ||
| + | Wird immer als Kreis Klassifiziert | ||
| + | |||
| + | ==== Aufgabe 2 ==== | ||
| + | XOR-Problem | ||
| + | |||
| + | ==== Aufgabe 3 ==== | ||
| + | $Entropie(T) = - \sum^k_{i=1}p_i \cdot log(p_i)$ | ||
| + | |||
| + | $Informationsgewinn(T, | ||
| + | |||
| + | * Erster Teil von Informationsgewinn: | ||
| + | * Zweiter Teil von Informationsgewinn: | ||
| + | |||
| + | Mittlere Entropie: Gewicht nach '' | ||
| + | |||
| + | === a) === | ||
| + | <WRAP center round important 60%> | ||
| + | Hier wird der 2-er Logarithmus verwendet. Es kann aber jeder verwendet werden, solange es überall verwendet wird. | ||
| + | </ | ||
| + | |||
| + | == 1 Split == | ||
| + | T ist noch die ganze DB | ||
| + | |||
| + | $\text{Entropie}(T) = -(p_{\text{niedrig}} \cdot log_2(p_\text{niedrig}) + p_\text{hoch} \cdot log_2(p_\text{hoch}) = -\frac{1}{2}\cdot(-1)-\frac{1}{2}\cdot(-1) = 1$ | ||
| + | |||
| + | ** Zeit seit Fahrprüfung** | ||
| + | ^ $T_i$ ^ Anzahl ^ $P_\text{niedrig}$ ^ $P_\text{hoch}$ ^ Entropie(T) ^ | ||
| + | | 1-2 | 3 | 1/3 | 2/3 | $-\frac{1}{3}\log\frac{1}{3}-\frac{2}{3}\log{2}{3} = 0.918$ | | ||
| + | | 2-7 | 3 | 2/3 | 1/3 | $-\frac{2}{3}\log\frac{2}{3}-\frac{1}{3}\log{1}{3} = 0.918$ | | ||
| + | | >7 | ||
| + | |||
| + | Informationsgewinn(T_i, | ||
| + | |||
| + | **Geschlecht** | ||
| + | ^ $T_i$ ^ Anzahl ^ $P_\text{niedrig}$ ^ $P_\text{hoch}$ ^ Entropie(T) ^ | ||
| + | | m | 5 | 2/5 | 3/5 | $-\frac{2}{5}\log\frac{2}{5}-\frac{3}{5}\log{3}{5} = 0.971$ | | ||
| + | | w | 3 | 2/3 | 1/3 | $-\frac{2}{3}\log\frac{2}{3}-\frac{1}{3}\log{1}{3} = 0.918$ | | ||
| + | |||
| + | Informationsgewinn(T_i, | ||
| + | |||
| + | **Wohnort** | ||
| + | ^ $T_i$ ^ Anzahl ^ $P_\text{niedrig}$ ^ $P_\text{hoch}$ ^ Entropie(T) ^ | ||
| + | | Stadt | 3 | 3/3 | 0/3 | $-1\log1-0\log0 = -1\cdot0-0=0$ | | ||
| + | | Land | 5 | 1/5 | 4/5 | $-\frac{1}{5}\log\frac{1}{5}-\frac{4}{5}\log{4}{5} = 0.722$ | | ||
| + | |||
| + | Informationsgewinn(T_i, | ||
| + | |||
| + | == 2. Split == | ||
| + | T={2, | ||
| + | |||
| + | $\text{Entropie}(T) = -(p_{\text{niedrig}} \cdot log_2(p_\text{niedrig}) + p_\text{hoch} \cdot log_2(p_\text{hoch}) = -\frac{1}{5}\cdot\log\frac{1}{5}-\frac{4}{5}\cdot(\frac{4}{5}) = 0.722$ | ||
| + | |||
| + | ** Zeit seit Fahrprüfung** | ||
| + | ^ $T_i$ ^ Anzahl ^ $P_\text{niedrig}$ ^ $P_\text{hoch}$ ^ Entropie(T) ^ | ||
| + | | 1-2 | 2 | | ||
| + | | 2-7 | 1 | | ||
| + | | >7 | ||
| + | |||
| + | Informationsgewinn(T_i, | ||
| + | |||
| + | **Geschlecht** | ||
| + | ^ $T_i$ ^ Anzahl ^ $P_\text{niedrig}$ ^ $P_\text{hoch}$ ^ Entropie(T) ^ | ||
| + | | m | 3 | | ||
| + | | w | 2 | 1/2 | 1/2 | $1$ | | ||
| + | |||
| + | Informationsgewinn(T_i, | ||
| + | |||
| + | == Beispiel Gini == | ||
| + | ^ T_i ^ Anzahl ^ P_niedrig ^ P_hoch ^ gini(T_i) ^ | ||
| + | | m | 5 | 2/5 | 3/5 | $1-((2/ | ||
| + | | w | 3 | 2/3 | 1/3 | $1-((2/ | ||
| + | |||
| + | gini geschlecht(T) = 5/8 * 0.48 + 3/8 * 0.44 | ||
| + | === b) === | ||
| + | ... | ||
| + | |||
| ===== Foo ===== | ===== Foo ===== | ||
uni/8/kdd1/start.1429602004.txt.gz · Last modified: (external edit)
